Adjusting λ — regularization & the objective
Learning is usually cast as minimizing an objective function = average loss (fit the data) + regularization (stay simple). For the soft-margin SVM that's hinge loss + (λ/2)‖θ‖². Since the margin is 1/‖θ‖, penalizing ‖θ‖ widens the margin — and λ sets the balance. Drag λ and watch the boundary trade fit for margin.
J(θ,θ₀) = (1/n) Σ max(0, 1 − y⁽ⁱ⁾(θ·x⁽ⁱ⁾+θ₀)) + (λ/2)‖θ‖²
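To make the formula concrete, here is a minimal sketch that evaluates J for a fixed θ and θ₀ on four made-up points. The function name `svm_objective`, the data, and the parameter values are illustrative assumptions, not taken from the page's own code.

```python
def svm_objective(theta, theta0, X, y, lam):
    """Return (average hinge loss, regularization term) for a linear classifier."""
    n = len(X)
    # Signed score theta . x + theta0 for every point.
    scores = [sum(t * xj for t, xj in zip(theta, x)) + theta0 for x in X]
    # Hinge loss: zero when y * score >= 1, linear otherwise.
    avg_loss = sum(max(0.0, 1.0 - yi * s) for yi, s in zip(y, scores)) / n
    # Penalty (lam / 2) * ||theta||^2; the offset theta0 is not penalized.
    reg = 0.5 * lam * sum(t * t for t in theta)
    return avg_loss, reg


# Hypothetical toy data: two points per class, labels in {-1, +1}.
X = [[2.0, 0.0], [0.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
y = [1.0, 1.0, -1.0, -1.0]
theta, theta0, lam = [1.0, -1.0], 0.5, 0.1

avg_loss, reg = svm_objective(theta, theta0, X, y, lam)
J = avg_loss + reg
print(f"avg loss = {avg_loss:.3f}, reg = {reg:.3f}, J = {J:.3f}")
```

Two of the four points sit at margin ≥ 1 and contribute nothing; the other two supply all of the average loss, while the penalty depends only on θ and λ.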
Sweeping λ — margin vs. average loss
Each point on these curves is the optimum for that λ. As λ grows (right), the margin widens but the average loss rises — the dashed line marks your current λ.
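A sweep like this can be approximated in a few lines. The sketch below is an assumption about how one might do it, not the page's implementation: it minimizes J with plain full-batch subgradient descent (step size η₀/√t, keeping the best iterate seen) on a hypothetical one-dimensional dataset, then records the margin 1/‖θ‖ and the average loss for each λ. The names `fit_svm` and `sweep`, the step schedule, and the data are all illustrative.

```python
import math


def fit_svm(X, y, lam, steps=2000, eta0=0.5):
    """Full-batch subgradient descent on J; returns the best iterate seen."""
    n, d = len(X), len(X[0])
    theta, theta0 = [0.0] * d, 0.0
    best = (float("inf"), theta[:], theta0)
    for t in range(1, steps + 1):
        # Evaluate the current iterate and remember it if it is the best so far.
        margins = [yi * (sum(a * b for a, b in zip(theta, xi)) + theta0)
                   for xi, yi in zip(X, y)]
        avg_loss = sum(max(0.0, 1.0 - m) for m in margins) / n
        J = avg_loss + 0.5 * lam * sum(a * a for a in theta)
        if J < best[0]:
            best = (J, theta[:], theta0)
        # Subgradient: the penalty contributes lam * theta; only points with
        # margin < 1 contribute to the hinge term.
        g = [lam * a for a in theta]
        g0 = 0.0
        for xi, yi, m in zip(X, y, margins):
            if m < 1.0:
                for j in range(d):
                    g[j] -= yi * xi[j] / n
                g0 -= yi / n
        eta = eta0 / math.sqrt(t)  # diminishing step size
        theta = [a - eta * gj for a, gj in zip(theta, g)]
        theta0 -= eta * g0
    return best[1], best[2]


def sweep(X, y, lams):
    """For each lambda, return (lambda, margin 1/||theta||, average hinge loss)."""
    rows = []
    for lam in lams:
        theta, theta0 = fit_svm(X, y, lam)
        norm = math.sqrt(sum(a * a for a in theta))
        margin = 1.0 / norm if norm > 0 else float("inf")
        avg_loss = sum(
            max(0.0, 1.0 - yi * (sum(a * b for a, b in zip(theta, xi)) + theta0))
            for xi, yi in zip(X, y)
        ) / len(X)
        rows.append((lam, margin, avg_loss))
    return rows


# Hypothetical 1-D toy data: negatives on the left, positives on the right.
X = [[-2.0], [-1.0], [1.0], [2.0]]
y = [-1.0, -1.0, 1.0, 1.0]

rows = sweep(X, y, [0.01, 0.5, 1.0, 4.0])
for lam, margin, avg_loss in rows:
    print(f"lambda={lam:<5} margin={margin:.3f} avg_loss={avg_loss:.3f}")
```

Subgradient descent only approximates each optimum, so the intermediate values may wobble slightly, but the ends of the sweep show the trade: the largest λ gives a clearly wider margin and a clearly higher average loss than the smallest.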
The objective function — purpose, origins & where else it shows up
Why an objective = loss + regularization?
A classifier that only minimizes training loss will happily contort itself to nail every point, and then generalize badly to new data (overfitting). Regularization adds a penalty for complexity so the model prefers a simpler explanation. The objective balances the two: objective = average loss + λ · (complexity penalty).
λ is the dial on that balance — the bias–variance tradeoff made into a single number. Large λ → simpler model (here, a wider margin) that tolerates more training mistakes; small λ → fit the data hard, at the risk of overfitting.
Origins
The idea is regularized risk minimization — minimize empirical loss plus a complexity penalty — formalized in Vapnik's structural risk minimization for SVMs. The squared-norm penalty itself is older still: Tikhonov regularization (a.k.a. ridge), introduced to stabilize ill-posed problems. The unifying theme is the bias–variance tradeoff: a little bias (the penalty) buys a lot less variance.
Here: the soft-margin SVM
Our loss is the hinge loss max(0, 1 − y(θ·x+θ₀)): zero once a point is correctly classified with margin ≥ 1, and growing linearly as it crosses into the margin or onto the wrong side (the violators, ringed on the plot). The regularizer is (λ/2)‖θ‖², and because the margin is 1/‖θ‖, shrinking ‖θ‖ is exactly widening the margin. So as λ→0 you approach the hard-margin SVM (fit at all costs), which exists only when the data are linearly separable; as λ grows you buy a wider, simpler margin by accepting some slack.
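A small worked example may help here. Assuming an illustrative θ = (3, 4) and θ₀ = 0 (so ‖θ‖ = 5), the sketch below computes the margin and the hinge loss for three hypothetical positive points: one safely beyond the margin, one inside it, and one on the wrong side of the boundary.

```python
import math


def hinge(y, score):
    """Hinge loss max(0, 1 - y * score)."""
    return max(0.0, 1.0 - y * score)


# Illustrative parameters, not taken from the page: ||theta|| = 5.
theta, theta0 = [3.0, 4.0], 0.0
margin = 1.0 / math.hypot(*theta)  # margin = 1 / ||theta||

points = [
    ([1.0, 1.0], 1.0),     # score  7.0: beyond the margin, no loss
    ([0.5, -0.25], 1.0),   # score  0.5: correct side but inside the margin
    ([-1.0, 0.5], 1.0),    # score -1.0: wrong side of the boundary
]

losses = []
for x, y in points:
    score = sum(t * xj for t, xj in zip(theta, x)) + theta0
    losses.append(hinge(y, score))

print("margin:", margin)
print("hinge losses:", losses)
```

The second and third points are the violators: both have y·score < 1, and the loss grows linearly the further a point sits past the margin line.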
The same template elsewhere
Swap the loss and the penalty and you get much of classical ML — the loss + λ·penalty shape is everywhere:
- Ridge regression — squared error + λ‖θ‖² (L2). Same penalty as here.
- Lasso — squared error + λ‖θ‖₁ (L1), which also drives weights to exactly zero (feature selection).
- Regularized logistic regression — log-loss + λ‖θ‖², the workhorse linear classifier.
- Neural networks — any loss + λ‖weights‖², known as weight decay; dropout and early stopping are regularizers in spirit too.
In every case λ answers the same question this page asks: how much should fitting the data bend to staying simple?
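The shared template in the list above can be sketched as one function that takes the loss and the penalty as arguments. This is only an illustration under stated assumptions: the helper names (`objective`, `squared_error`, `log_loss`, `l1`, `l2`) are hypothetical, the offset θ₀ is omitted for brevity, labels are taken to be in {-1, +1} for the classification losses, and the data are made up.

```python
import math


def squared_error(y, score):
    return (y - score) ** 2


def hinge(y, score):
    return max(0.0, 1.0 - y * score)


def log_loss(y, score):
    # Logistic loss for labels y in {-1, +1}.
    return math.log(1.0 + math.exp(-y * score))


def l2(theta):
    return sum(t * t for t in theta)       # squared L2 norm


def l1(theta):
    return sum(abs(t) for t in theta)      # L1 norm


def objective(loss, penalty, lam, theta, X, y):
    """Average loss over the data plus lam * penalty(theta)."""
    scores = [sum(t * xj for t, xj in zip(theta, x)) for x in X]
    avg_loss = sum(loss(yi, s) for yi, s in zip(y, scores)) / len(X)
    return avg_loss + lam * penalty(theta)


# Hypothetical toy data and parameters.
X = [[1.0, 0.0], [0.0, 2.0]]
y = [1.0, -1.0]
theta, lam = [0.5, -0.5], 0.1

ridge = objective(squared_error, l2, lam, theta, X, y)
lasso = objective(squared_error, l1, lam, theta, X, y)
logistic = objective(log_loss, l2, lam, theta, X, y)
# The SVM on this page uses (lam / 2) * ||theta||^2, hence the extra 0.5.
svm = objective(hinge, lambda th: 0.5 * l2(th), lam, theta, X, y)

print(f"ridge={ridge:.3f} lasso={lasso:.3f} logistic={logistic:.3f} svm={svm:.3f}")
```

Only the two plugged-in functions change between models; λ plays the same role in each call.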