Adjusting λ — regularization & the objective

Learning is usually cast as minimizing an objective function: average loss (fit the data) plus regularization (stay simple). For the soft-margin SVM that's hinge loss + (λ/2)‖θ‖². Since the margin is 1/‖θ‖, penalizing ‖θ‖ widens the margin, and λ sets the balance between the two terms. Drag λ and watch the boundary trade fit for margin.

J(θ,θ₀) = (1/n) Σ max(0, 1 − y⁽ⁱ⁾(θ·x⁽ⁱ⁾+θ₀)) + (λ/2)‖θ‖²
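As a minimal sketch of how the formula above could be evaluated, the following NumPy snippet computes the two terms and their sum. The function name `svm_objective` and the toy data are assumptions made for illustration; they are not taken from the page.

```python
import numpy as np

def svm_objective(theta, theta0, X, y, lam):
    """Return (average hinge loss, regularization term, J) for a linear classifier."""
    scores = X @ theta + theta0                  # theta·x + theta0 for every point
    hinge = np.maximum(0.0, 1.0 - y * scores)    # zero once y * score >= 1
    avg_loss = hinge.mean()                      # (1/n) * sum of hinge losses
    reg = 0.5 * lam * np.dot(theta, theta)       # (lam/2) * ||theta||^2
    return avg_loss, reg, avg_loss + reg

# Toy data, invented for this example: labels are +1 / -1.
X = np.array([[2.0, 2.0], [0.0, 0.0], [1.0, 1.0]])
y = np.array([1.0, -1.0, 1.0])
theta = np.array([1.0, 1.0])
theta0 = -2.0

avg_loss, reg, J = svm_objective(theta, theta0, X, y, lam=0.5)
print(f"avg loss = {avg_loss:.4f}, reg = {reg:.4f}, J = {J:.4f}")
```

Here the third point sits exactly on the boundary, so it is the only one contributing hinge loss; the other two are classified with margin at least 1.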

Plot legend: +1 class; −1 class; boundary θ·x+θ₀=0; margins ±1; violator (margin < 1)

Controls

Regularization slider: less reg (fit hard) to more reg (wide margin)
Edit points: click to add, drag to move
Data set
Objective readout: J = avg loss + (λ/2)‖θ‖²

Sweeping λ — margin vs. average loss

Each point on these curves is the optimum for that λ. As λ grows (right), the margin widens but the average loss rises — the dashed line marks your current λ.
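A sweep like this can be approximated offline. The sketch below is an assumption about how such curves might be produced, not the page's actual solver: it uses plain full-batch subgradient descent with a fixed step size (the hypothetical helper `fit_soft_margin`), on a small made-up data set, and reports the margin 1/‖θ‖ and the average hinge loss for a few values of λ.

```python
import numpy as np

def fit_soft_margin(X, y, lam, lr=0.05, steps=5000):
    """Approximately minimize J(theta, theta0) by full-batch subgradient descent."""
    n, d = X.shape
    theta, theta0 = np.zeros(d), 0.0
    for _ in range(steps):
        # Points with y * score < 1 have nonzero hinge loss (the violators).
        viol = y * (X @ theta + theta0) < 1.0
        g_theta = -(y[viol][:, None] * X[viol]).sum(axis=0) / n + lam * theta
        g_theta0 = -y[viol].sum() / n
        theta = theta - lr * g_theta
        theta0 = theta0 - lr * g_theta0
    return theta, theta0

# Made-up, linearly separable toy data.
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.0, 3.0],
              [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])

lams = [0.01, 0.1, 1.0, 10.0]
margins, losses = [], []
for lam in lams:
    theta, theta0 = fit_soft_margin(X, y, lam)
    margins.append(1.0 / np.linalg.norm(theta))
    losses.append(np.maximum(0.0, 1.0 - y * (X @ theta + theta0)).mean())
    print(f"lambda={lam:<5} margin={margins[-1]:.3f} avg loss={losses[-1]:.3f}")
```

Fixed-step subgradient descent only lands near the optimum, so the printed values are approximate; a dedicated solver would give cleaner curves.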

The objective function — purpose, origins & where else it shows up

Why an objective = loss + regularization?

A classifier that only minimizes training loss will happily contort itself to nail every point — and then generalize badly to new data (overfitting). Regularization adds a penalty for complexity so the model prefers a simpler explanation. The objective balances the two:

objective = average loss + λ · (regularizer)

λ is the dial on that balance — the bias–variance tradeoff made into a single number. Large λ → simpler model (here, a wider margin) that tolerates more training mistakes; small λ → fit the data hard, at the risk of overfitting.

Origins

The idea is regularized risk minimization — minimize empirical loss plus a complexity penalty — formalized in Vapnik's structural risk minimization for SVMs. The squared-norm penalty itself is older still: Tikhonov regularization (a.k.a. ridge), introduced to stabilize ill-posed problems. The unifying theme is the bias–variance tradeoff: a little bias (the penalty) buys a lot less variance.

Here: the soft-margin SVM

Our loss is the hinge loss max(0, 1 − y(θ·x+θ₀)): zero once a point is correctly classified with margin ≥ 1, and growing linearly as it crosses into the margin or the wrong side (the violators, ringed on the plot). The regularizer is (λ/2)‖θ‖², and because the margin is 1/‖θ‖, shrinking ‖θ‖ is exactly widening the margin. So as λ→0 you approach the hard-margin SVM (fit at all costs) when the classes are separable; as λ grows you buy a wider, simpler margin by accepting some slack.
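The 1/‖θ‖ figure follows from the standard point-to-hyperplane distance. A short sketch of that textbook argument, where x_m is a hypothetical point lying exactly on one of the margin lines:

```latex
% Distance from any point x to the boundary \theta \cdot x + \theta_0 = 0:
\[
  d(x) = \frac{|\theta \cdot x + \theta_0|}{\|\theta\|}
\]
% A point x_m on a margin line satisfies \theta \cdot x_m + \theta_0 = \pm 1, so
\[
  d(x_m) = \frac{1}{\|\theta\|}
\]
```

Halving ‖θ‖ therefore doubles the distance between the boundary and each margin line.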

The same template elsewhere

Swap the loss and the penalty and you get much of classical ML — the loss + λ·penalty shape is everywhere:

Ridge regression — squared error + λ‖θ‖² (L2). Same penalty as here.
Lasso — squared error + λ‖θ‖₁ (L1), which can drive some weights to exactly zero (feature selection).
Regularized logistic regression — log-loss + λ‖θ‖², the workhorse linear classifier.
Neural networks — any loss + λ‖weights‖², known as weight decay; dropout and early stopping are regularizers in spirit too.
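The shared shape of the linear models in this list can be sketched as one function with a pluggable loss and penalty. All names below (`objective`, `hinge`, `log_loss`, `l1`, `l2`) are illustrative assumptions rather than any library's API, and the log-loss here assumes labels in {+1, −1}.

```python
import numpy as np

# Per-example losses, each taking raw scores and targets.
def squared_error(pred, y):
    return (pred - y) ** 2

def hinge(pred, y):
    return np.maximum(0.0, 1.0 - y * pred)

def log_loss(pred, y):
    return np.log1p(np.exp(-y * pred))    # assumes labels are +1 / -1

# Penalties on the weight vector.
def l2(theta):
    return np.dot(theta, theta)           # ||theta||^2

def l1(theta):
    return np.abs(theta).sum()            # ||theta||_1

def objective(theta, theta0, X, y, loss, penalty, lam):
    """Average loss plus lam times the penalty, for a linear model."""
    pred = X @ theta + theta0
    return loss(pred, y).mean() + lam * penalty(theta)

# Tiny made-up example.
X = np.array([[1.0, 0.0], [0.0, 2.0]])
y = np.array([1.0, -1.0])
theta, theta0 = np.array([1.0, -1.0]), 0.0

ridge = objective(theta, theta0, X, y, squared_error, l2, lam=0.1)
lasso = objective(theta, theta0, X, y, squared_error, l1, lam=0.1)
logistic = objective(theta, theta0, X, y, log_loss, l2, lam=0.1)
# The SVM objective on this page uses (lam/2)*||theta||^2, so pass lam / 2.
svm = objective(theta, theta0, X, y, hinge, l2, lam=0.1 / 2)
print(ridge, lasso, logistic, svm)
```

Only the `loss` and `penalty` arguments change between the four calls; the minimization problem keeps the same form.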

In every case λ answers the same question this page asks: how much should fitting the data bend to staying simple?