A few days ago, while sitting in the machine learning class, I was thinking about the following problem about designing a classifier:
AFAIK, in the traditional SVM the goal is to maximize the minimum gap (the margin). To cope with noise, we can add slack variables to the constrained optimization problem.
But why don't we just maximize the total gap (the sum of gaps) instead of maximizing the minimum? That way, outliers simply contribute negative gaps. The problem can be posed as a linear program, so we would have an efficient way of solving it.
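To make it concrete, here is a rough sketch of what I have in mind (just my own illustration, not taken from any paper; the box constraint on w and b is only there to keep the LP bounded):

```python
import numpy as np
from scipy.optimize import linprog

def sum_of_gaps_lp(X, y, box=1.0):
    """Maximize sum_i y_i * (w . x_i + b) subject to |w_j| <= box, |b| <= box.

    Illustrative LP for the 'sum of gaps' idea; without the box constraint
    the objective would be unbounded.
    """
    n, d = X.shape
    # Decision variables: [w_1 .. w_d, b].
    # linprog minimizes, so negate the objective sum_i y_i * (w . x_i + b).
    c = -np.concatenate([X.T @ y, [y.sum()]])
    bounds = [(-box, box)] * (d + 1)
    res = linprog(c, bounds=bounds, method="highs")
    w, b = res.x[:d], res.x[d]
    return w, b

# Toy usage: two Gaussian blobs with labels +/-1.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+2, 1, (20, 2)), rng.normal(-2, 1, (20, 2))])
y = np.concatenate([np.ones(20), -np.ones(20)])
w, b = sum_of_gaps_lp(X, y)
print("w =", w, "b =", b)
```

One thing I notice from this sketch: the optimum depends on the data only through Sum_i y_i x_i and Sum_i y_i, so individual points near the boundary get no special treatment.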
I just wonder if this approach has been used before, or whether it is equivalent to some other commonly used objective function.
no idea?!
For separable data you want to maximize the margin (the minimum gap), as opposed to the "sum of gaps", because of fat-shattering dimension arguments: a classifier with a large margin will have good generalization power. I'm not sure anything like that can be said for a classifier with a large "sum of gaps".
And for the non-separable case, I don't think the "maximize the margin" intuition holds directly. For non-separable data you minimize ||w||^2 + C * (sum of hinge losses).
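As a minimal sketch of that objective (plain subgradient descent on 0.5*||w||^2 + C * sum of hinge losses; nothing here is tuned or specific to any particular solver):

```python
import numpy as np

def soft_margin_svm(X, y, C=1.0, lr=0.01, epochs=500):
    """Subgradient descent on 0.5*||w||^2 + C * sum_i max(0, 1 - y_i*(w . x_i + b))."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        active = margins < 1                     # points with nonzero hinge loss
        grad_w = w - C * (y[active, None] * X[active]).sum(axis=0)
        grad_b = -C * y[active].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```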
Bennett gives nice intuition on how the SVM finds the boundary in the non-separable case. (http://www.idi.ntnu.no/emner/it3704/lectures/papers/Bennett_2000_Support.pdf)
In short: in the separable case you find the hyperplane that maximizes the margin between the two convex hulls (formed by the data points of each class); in the non-separable case each convex hull is formed after capping the influence of any single data point (so the hull shrinks), and you find the best separating boundary between these reduced convex hulls.
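Here is a small numerical sketch of that picture (my own code, only meant to illustrate the reduced-convex-hull view; mu is the cap on each point's weight, mu = 1 giving the ordinary hulls):

```python
import numpy as np
from scipy.optimize import minimize

def closest_reduced_hull_points(X_pos, X_neg, mu=1.0):
    """Closest points between the reduced convex hulls of the two classes.

    Each hull is {sum_i a_i x_i : sum_i a_i = 1, 0 <= a_i <= mu}.
    """
    n_pos, n_neg = len(X_pos), len(X_neg)

    def objective(z):
        u, v = z[:n_pos], z[n_pos:]
        diff = X_pos.T @ u - X_neg.T @ v
        return diff @ diff                      # squared distance between hull points

    cons = [
        {"type": "eq", "fun": lambda z: z[:n_pos].sum() - 1.0},
        {"type": "eq", "fun": lambda z: z[n_pos:].sum() - 1.0},
    ]
    z0 = np.concatenate([np.full(n_pos, 1.0 / n_pos), np.full(n_neg, 1.0 / n_neg)])
    res = minimize(objective, z0, bounds=[(0.0, mu)] * (n_pos + n_neg),
                   constraints=cons, method="SLSQP")
    u, v = res.x[:n_pos], res.x[n_pos:]
    p, q = X_pos.T @ u, X_neg.T @ v             # closest points in the two hulls
    w = p - q                                   # separating direction
    b = -(p + q) @ w / 2.0                      # hyperplane through the midpoint
    return w, b
```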
Thanks Yaroslav for your discussion.
Actually, I do not know what the interpretation of maximizing the sum of gaps is, if any.
Actually, there's something related: instead of minimizing ||w||^2 (which corresponds to maximizing the margin), one can minimize the number of support vectors, because few support vectors means good generalization. That problem is intractable, but we can approximate the number of SVs by Sum_i alpha_i; this leads to a linear programming problem and has been done before.
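Roughly, the LP looks like this (a sketch of one common way to write that 1-norm / LP-SVM idea; details vary across papers, and for a linear kernel you would take K = X @ X.T):

```python
import numpy as np
from scipy.optimize import linprog

def lp_svm(K, y, C=1.0):
    """1-norm SVM as a linear program (one common formulation).

    minimize  sum_i alpha_i + C * sum_i xi_i
    s.t.      y_i * (sum_j alpha_j y_j K_ij + b) >= 1 - xi_i,   alpha, xi >= 0.
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    # Variables: [alpha (n), xi (n), b (free)].
    c = np.concatenate([np.ones(n), C * np.ones(n), [0.0]])
    # Constraints rewritten as A_ub @ x <= b_ub:
    #   -y_i * sum_j alpha_j y_j K_ij - xi_i - y_i * b <= -1
    A_alpha = -(y[:, None] * K * y[None, :])
    A_ub = np.hstack([A_alpha, -np.eye(n), -y[:, None]])
    b_ub = -np.ones(n)
    bounds = [(0, None)] * (2 * n) + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    alpha, b = res.x[:n], res.x[-1]
    return alpha, b
```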