Lesson 4 of many
About 20 minutes
Goal: Real tensors, real gradients, real training loop

PyTorch from Zero

You have seen vectors, matrices, and tensors as ideas. Now you will hold them in real code. PyTorch is the library that lets Python do tensor math on your GPU and, more importantly, records every operation on those tensors so it can compute the gradients that tell the model how to improve.

One sentence: PyTorch = tensors + automatic gradient computation. Everything else in deep learning is built on these two things.

1. Creating tensors

A PyTorch tensor is exactly what you learned in Lesson 3, an n-dimensional grid of numbers, but one that can run on your GPU and track gradients.

import torch

# Create tensors from Python lists (just like Lesson 2/3)
x = torch.tensor([1.0, 2.0, 3.0])
print(x)         # tensor([1., 2., 3.])
print(x.shape)   # torch.Size([3])   ← one dimension, 3 elements: a vector

# 2D tensor (matrix)
M = torch.tensor([[1.0, 2.0],
                  [3.0, 4.0]])
print(M.shape)   # torch.Size([2, 2])

Notice .shape: you will read it constantly in transformer code. Every tensor has a shape, and most bugs turn out to be shape mismatches.
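Here is a small sketch of the attributes worth printing on any tensor, and of what a shape mismatch looks like in practice. The variable names are illustrative. The device line assumes nothing about your machine: it falls back to the CPU when no CUDA GPU is available.

```python
import torch

M = torch.tensor([[1.0, 2.0],
                  [3.0, 4.0]])

# Three attributes worth printing when something goes wrong
print(M.shape)   # torch.Size([2, 2])
print(M.dtype)   # torch.float32
print(M.ndim)    # 2

# Use the GPU if one is available, otherwise stay on the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
M = M.to(device)

# A deliberate shape mismatch: a (2, 2) matrix times a length-3 vector
bad = torch.tensor([1.0, 2.0, 3.0], device=device)
try:
    M @ bad
    mismatch_caught = False
except RuntimeError as err:
    mismatch_caught = True
    print("shape error:", err)
```

When you hit an error like this in real code, printing .shape on both operands is usually the fastest way to find the culprit.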

2. Tensor operations — the same math as Lesson 3

The operations from Lesson 3 are real PyTorch code.

a = torch.tensor([1.0, 2.0, 3.0])
b = torch.tensor([4.0, 5.0, 6.0])

print(a + b)    # tensor([5., 7., 9.])   ← element-wise add
print(a * b)    # tensor([ 4., 10., 18.]) ← element-wise multiply

# Matrix multiply — the core of every transformer layer
A = torch.tensor([[2.0, 0.0],
                  [1.0, 3.0]])
v = torch.tensor([1.0, 2.0])

print(A @ v)    # tensor([2., 7.])  ← matrix × vector from Lesson 3

The @ operator is matrix multiply. You will see it everywhere in transformer code. Every attention head, every feed-forward layer, every projection — all @.
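The same operator also handles matrix × matrix and batched inputs. The sketch below uses made-up matrices to show the shape rule: (n, k) @ (k, m) gives (n, m), and any leading batch dimensions are carried along.

```python
import torch

A = torch.tensor([[2.0, 0.0],
                  [1.0, 3.0]])
B = torch.tensor([[1.0, 1.0],
                  [0.0, 2.0]])

# (2, 2) @ (2, 2) -> (2, 2)
C = A @ B
print(C)          # tensor([[2., 2.],
                  #         [1., 7.]])

# Batched: 4 matrices of shape (3, 2), each multiplied by the same (2, 5) matrix
X = torch.ones(4, 3, 2)
W = torch.ones(2, 5)
Y = X @ W
print(Y.shape)    # torch.Size([4, 3, 5])
```

The inner dimensions (here 2 and 2) must match. That is the shape rule behind most of the mismatch errors you will meet.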

3. The magic: automatic gradients

This is the part that makes deep learning possible. When you train an LLM, you need to know: for each weight in the model, how much does changing it reduce the loss? Computing this by hand for billions of weights is impossible. PyTorch does it automatically.

Tell PyTorch to track a tensor with requires_grad=True:

w = torch.tensor(2.0, requires_grad=True)  # a single "weight"
x = torch.tensor(3.0)                       # input

y = w * x           # forward: y = 2 × 3 = 6
loss = (y - 5)**2   # how wrong? (6 − 5)² = 1

loss.backward()     # ← PyTorch computes ALL gradients
print(w.grad)       # tensor(6.)
# d(loss)/dw = 2 × (y − 5) × x = 2 × 1 × 3 = 6
# The gradient is positive: nudging w up raises the loss, so decrease w.

PyTorch records the computation graph during the forward pass, then walks it backward, applying the chain rule at each step to compute gradients.
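You can check autograd's answer for this example yourself. The sketch below compares three values: the gradient from loss.backward(), the chain rule worked by hand, and a numerical estimate from nudging w slightly in each direction. The helper f and the step size eps are illustrative choices, not part of PyTorch.

```python
import torch

w = torch.tensor(2.0, requires_grad=True)
x = torch.tensor(3.0)

loss = (w * x - 5) ** 2
loss.backward()

# Chain rule by hand: d(loss)/dw = 2 * (w*x - 5) * x = 2 * 1 * 3 = 6
by_hand = 2 * (2.0 * 3.0 - 5) * 3.0

# Numerical check with plain Python floats (central difference)
def f(w_value):
    return (w_value * 3.0 - 5) ** 2

eps = 1e-6
numeric = (f(2.0 + eps) - f(2.0 - eps)) / (2 * eps)

print(w.grad.item(), by_hand, numeric)   # all three are (approximately) 6
```

The numerical estimate needs two evaluations of the loss per weight, which is hopeless for billions of weights. The backward pass gets every gradient in a single sweep.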

4. A real training loop

Now put it together. This is a complete, working PyTorch training loop, and LLM training runs follow the same skeleton.

import torch
import torch.nn as nn

# Model: one Linear layer (a learned matrix multiply)
model = nn.Linear(1, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Data: y = 2x  (we want the model to learn this)
X = torch.tensor([[1.0], [2.0], [3.0], [4.0]])
y = torch.tensor([[2.0], [4.0], [6.0], [8.0]])

for step in range(200):
    pred = model(X)                    # 1. forward pass
    loss = ((pred - y)**2).mean()      # 2. measure error

    optimizer.zero_grad()              # 3. clear old gradients
    loss.backward()                    # 4. compute new gradients
    optimizer.step()                   # 5. adjust weights

    if step % 50 == 0:
        print(f"Step {step}: loss = {loss.item():.4f}")

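# Illustrative output. Exact numbers depend on the random initial weights
# of nn.Linear, but the loss should fall steadily:
# Step   0: loss = 18.3241
# Step  50: loss =  0.0412
# Step 100: loss =  0.0093
# Step 150: loss =  0.0021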

Five lines, same every time: forward → loss → zero_grad → backward → step. The model teaches itself from data.
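Why clear the gradients in step 3? Because backward() adds to whatever gradient is already stored rather than overwriting it. A minimal sketch, reusing the single-weight example from section 3:

```python
import torch

w = torch.tensor(2.0, requires_grad=True)
x = torch.tensor(3.0)

# First backward pass: the gradient of (w*x - 5)^2 at w = 2 is 6
((w * x - 5) ** 2).backward()
first = w.grad.item()

# Second backward pass WITHOUT clearing: the new gradient is added on top
((w * x - 5) ** 2).backward()
accumulated = w.grad.item()

# Clear the stored gradient, then run backward again
w.grad.zero_()
((w * x - 5) ** 2).backward()
after_clear = w.grad.item()

print(first, accumulated, after_clear)   # 6.0 12.0 6.0
```

Skip the clearing step and every update uses the sum of all past gradients, which is rarely what you want.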

Connect to Lesson 2: Your Python loop from Lesson 2 was for step in range(10000): adjust weights. That pseudocode is exactly steps 3–5 above. You already understood training — now you have the real code.
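To make that connection concrete, here is a rough sketch of the same loop with the optimizer written out by hand. It assumes a simplified model with one weight and no bias, starting from w = 0 so the run is deterministic. The names w and lr are illustrative.

```python
import torch

X = torch.tensor([1.0, 2.0, 3.0, 4.0])
y = 2 * X                        # target relationship: y = 2x

w = torch.tensor(0.0, requires_grad=True)
lr = 0.01

for step in range(100):
    loss = ((w * X - y) ** 2).mean()   # forward pass and error
    loss.backward()                    # compute d(loss)/dw
    with torch.no_grad():
        w -= lr * w.grad               # the update plain SGD performs
    w.grad.zero_()                     # clear the gradient for the next step

print(w.item())   # very close to 2.0
```

The torch.no_grad() block tells PyTorch not to record the weight update itself as part of the computation graph.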

5. nn.Linear: the transformer's building block

nn.Linear is a matrix multiply with a learned weight matrix, plus a learned bias vector. Most of a transformer's parameters live in layers like this one.

# GPT-2 small feed-forward layer
ff = nn.Linear(768, 3072)
print(ff.weight.shape)   # torch.Size([3072, 768])
                          # ← a 3072×768 learned matrix

# Every forward pass:  output = input @ weight.T + bias
# That's it. Billions of parameters, same operation.
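The formula in that comment is easy to confirm. The sketch below feeds a random batch through the layer and compares the result with the matrix multiply written out by hand. The batch size of 4 and the seed are arbitrary choices.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
ff = nn.Linear(768, 3072)
inp = torch.randn(4, 768)            # a batch of 4 input vectors

out_layer = ff(inp)                          # what the layer computes
out_manual = inp @ ff.weight.T + ff.bias     # the same thing by hand

print(out_layer.shape)               # torch.Size([4, 3072])
same = torch.allclose(out_layer, out_manual, atol=1e-4)
print(same)

# Parameter count: 768 * 3072 weights + 3072 biases
n_params = sum(p.numel() for p in ff.parameters())
print(n_params)                      # 2362368
```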

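GPT-2 small has 12 transformer layers. Each one's feed-forward block has two of these nn.Linear layers (768 → 3072, then 3072 → 768), and its attention block uses further linear projections of the same kind. You now understand the model's fundamental operation.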
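As a rough sketch, the feed-forward part of one such layer could look like this. The class name FeedForward and the attribute names up, act and down are hypothetical. This is not GPT-2's actual source: it leaves out dropout and uses PyTorch's standard GELU as the activation between the two linear layers.

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Hypothetical feed-forward block with GPT-2 small's dimensions."""

    def __init__(self, d_model=768, d_hidden=3072):
        super().__init__()
        self.up = nn.Linear(d_model, d_hidden)     # 768 -> 3072
        self.act = nn.GELU()                       # nonlinearity between the two
        self.down = nn.Linear(d_hidden, d_model)   # 3072 -> 768

    def forward(self, x):
        return self.down(self.act(self.up(x)))

block = FeedForward()
tokens = torch.randn(2, 10, 768)     # (batch, sequence length, features)
out = block(tokens)
print(out.shape)                     # torch.Size([2, 10, 768])

n_params = sum(p.numel() for p in block.parameters())
print(n_params)                      # 4722432
```

nn.Linear acts on the last dimension only, so the same two matrices are applied to every token position in every sequence of the batch.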

Primary source — Andrej Karpathy builds GPT live

Let's build GPT from scratch — Andrej Karpathy (YouTube, 2hr)

Watch from 0:00–25:00. Karpathy writes torch.tensor, nn.Linear, and the training loop you just read — in a real working GPT. Every line will click now.


