How Does a Model Learn?

A complete walkthrough of machine learning training — using a model with just two parameters.


1. Build the intuition first

Imagine you're blindfolded, feeling your way around a volume knob, trying to find just the right level. You have no idea where the right position is, so you give it a random turn — too loud. You nudge it down a little. Still a bit much. Nudge it again. Slowly, you get closer and closer to that sweet spot.

Machine learning training works exactly like this: make a guess, see how far off you are, nudge things in the right direction, and repeat. Over time, the model's parameters converge toward the correct answer.

Let's walk through every step of this process with a minimal example.


2. Meet the model

Our model is a single formula with two parameters:

y = a × x + b

a and b are the parameters we want to "learn." We give them arbitrary initial values before training — the goal is to make them progressively more accurate. The full experiment setup:

Training data      x = 1, y_true = 4
Initial guess      a = 3, b = 4
Learning rate η    0.1
Loss function      squared error

The learning rate controls how much we adjust the knob each step. Too large and we overshoot; too small and we need many more steps. 0.1 is a common starting point.
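The overshoot effect is easy to demonstrate. Below is a minimal sketch (the `train` helper is illustrative, not part of the walkthrough) that runs gradient descent on this exact setup with a sensible learning rate and with one that is too large:

```python
def train(lr, steps=5):
    """Gradient descent on y = a*x + b against the single point (x=1, y=4)."""
    a, b = 3.0, 4.0            # initial guess from the setup above
    x, y_true = 1.0, 4.0       # training data
    for _ in range(steps):
        error = (a * x + b) - y_true
        a -= lr * 2 * error * x    # ∂Loss/∂a = 2 × error × x
        b -= lr * 2 * error        # ∂Loss/∂b = 2 × error
    return abs((a * x + b) - y_true)   # remaining error after training

print(train(0.1))   # shrinks steadily toward zero
print(train(0.6))   # each step overshoots: the error grows instead
```

With x = 1, each update multiplies the error by (1 − 4η), so any η above 0.5 makes the error grow instead of shrink.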

With the initial parameters: y_pred = 3 × 1 + 4 = 7, but the true answer is 4. We're off by 3. Training starts here.
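That first prediction, as a few lines of runnable Python:

```python
a, b = 3, 4           # initial guess
x, y_true = 1, 4      # the single training example
y_pred = a * x + b    # 3 × 1 + 4
error = y_pred - y_true
print(y_pred, error)  # 7 and 3: we start off by 3
```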


3. What happens at each step

Every training round, the model does four things:

01
Predict
Use the current a and b to compute y_pred.
y_pred = a × x + b
02
Compute error and Loss
The difference between prediction and truth is the error. Squaring it gives the Loss. Squaring penalizes large errors far more than small ones, making the training signal clearer.
error = y_pred - y_true
Loss = error²
03
Compute the gradient (assigning blame)
This step answers: if we increase a by a tiny amount, how does Loss change? The steeper the change, the more a is "responsible" for the current error.
Definition of gradient
gradient = [L(a + Δa) − L(a)] / Δa, taking the limit as Δa → 0

Plugging in numbers — given L(a) = (ax + b − y_true)², let e = error:

L(a + Δa) = (e + Δa·x)² = e² + 2e·Δa·x + (Δa·x)²
L(a + Δa) − L(a) = 2e·Δa·x + (Δa·x)²
Divide by Δa and let Δa → 0:
∂L/∂a = 2 × error × x  ∂L/∂b = 2 × error

The gradient is not a magic formula — it's just the result of expanding the Loss definition and cancelling terms step by step.
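One way to convince yourself of this is numerically: compare the finite-difference quotient from the definition against the formula 2 × error × x. A small sketch (variable names are illustrative):

```python
def loss(a, b, x=1.0, y_true=4.0):
    """Squared error for the model y = a*x + b."""
    return ((a * x + b) - y_true) ** 2

a, b, x, y_true = 3.0, 4.0, 1.0, 4.0
error = (a * x + b) - y_true

delta = 1e-6                       # a small Δa standing in for the limit
numeric = (loss(a + delta, b) - loss(a, b)) / delta
analytic = 2 * error * x           # the formula derived above

print(numeric, analytic)           # both come out ≈ 6
```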

04
Update parameters
Nudge a and b in the direction that reduces the Loss. The learning rate controls how big a nudge.
a = a - 0.1 × ∂L/∂a
b = b - 0.1 × ∂L/∂b

Update done. Start the next round. Repeat all four steps.
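Put together, the four steps are nothing more than a short loop. A sketch that runs six rounds on our setup, printing the parameters before each update:

```python
a, b = 3.0, 4.0            # initial guess
x, y_true = 1.0, 4.0       # training data
lr = 0.1                   # learning rate η

for step in range(6):
    y_pred = a * x + b                 # 1. predict
    error = y_pred - y_true            # 2. compute error and Loss
    loss = error ** 2
    grad_a = 2 * error * x             # 3. gradient
    grad_b = 2 * error
    print(f"step {step}: a={a:.4f} b={b:.4f} y_pred={y_pred:.4f} Loss={loss:.4f}")
    a -= lr * grad_a                   # 4. update
    b -= lr * grad_b
```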


4. Watching the numbers change

The table below records parameters, predictions, errors, and loss at each step. Focus on the error and Loss columns — watch them shrink step by step.

Step   a        b        y_pred   error    grad ∂a   grad ∂b   Loss
0      3.0000   4.0000   7.0000   3.0000   6.0000    6.0000    9.0000
1      2.4000   3.4000   5.8000   1.8000   3.6000    3.6000    3.2400
2      2.0400   3.0400   5.0800   1.0800   2.1600    2.1600    1.1664
3      1.8240   2.8240   4.6480   0.6480   1.2960    1.2960    0.4199
4      1.6944   2.6944   4.3888   0.3888   0.7776    0.7776    0.1512
5      1.6166   2.6166   4.2333   0.2333   0.4666    0.4666    0.0544

A visual sense of how Loss drops — 9.00 → 3.24 → 1.17 → 0.42 → 0.15 → 0.05 over steps 0 through 5. Each round multiplies the error by 0.6, so the Loss falls to 0.36× its previous value every step.
⚠️ We only have one data point (x=1, y=4), so any a and b satisfying a+b=4 would predict perfectly. In practice you need large, diverse datasets to push the parameters toward a solution that genuinely generalises.

5. Patterns and principles

What do you notice?

📉

Smaller error → smaller steps

The gradient shrinks as the error shrinks, so each update automatically becomes more cautious. The closer you get to the target, the lighter the touch.

⚖️

a and b are corrected in sync

In this example x=1, so both parameters receive identical gradients and move together. In real models, each parameter gets its own distinct gradient.
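A quick check of that claim — with x = 1 the two gradient components coincide, while any other x splits them apart (`grads` is an illustrative helper):

```python
def grads(a, b, x, y_true):
    """Return (∂Loss/∂a, ∂Loss/∂b) for the model y = a*x + b."""
    error = (a * x + b) - y_true
    return 2 * error * x, 2 * error

print(grads(3.0, 4.0, x=1.0, y_true=4.0))  # equal components: (6.0, 6.0)
print(grads(3.0, 4.0, x=2.0, y_true=4.0))  # ∂Loss/∂a is now 2× ∂Loss/∂b
```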

♾️

Training is iteration, not solving

The model never "calculates" the right answer — it approaches it incrementally. This is the core idea behind how deep learning handles complex problems.

📦

One data point is not enough

Both a=1, b=3 and a=0, b=4 satisfy a+b=4. Real training requires massive, diverse datasets to push the parameters toward a solution that actually generalises.
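The ambiguity is easy to see in code. Both settings below fit the lone training point perfectly; only a second data point (here a hypothetical extra example, x=2, y=5) tells them apart:

```python
def predict(a, b, x):
    return a * x + b

candidates = [(1, 3), (0, 4)]          # both satisfy a + b = 4

# Both fit the lone training point (x=1, y=4) exactly:
for a, b in candidates:
    print(predict(a, b, 1))            # 4 for every candidate

# A hypothetical second point (x=2, y=5) separates them:
for a, b in candidates:
    print(predict(a, b, 2))            # 5 for (1, 3) but 4 for (0, 4)
```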

Keep this skeleton in mind — every neural network, no matter how large, loops through the same four steps:

Step 1: Predict
Step 2: Compute Loss
Step 3: Gradient
Step 4: Update
↩ repeat until Loss is small enough

Training is not about computing the right answer directly. It's about reading the current error and pushing the parameters a little closer to better — one step at a time.

Two parameters or a hundred billion — the mechanism is identical.