You have seen vectors, matrices, and tensors as ideas. Now you will hold them in real code. PyTorch is the library that lets Python do tensor math on your GPU and, more importantly, records every operation so it can work out how each number should change to improve the result.
A PyTorch tensor is exactly what you learned in Lesson 3 — a list of numbers — but one that can run on your GPU and track gradients.
import torch
# Create tensors from Python lists (just like Lesson 2/3)
x = torch.tensor([1.0, 2.0, 3.0])
print(x) # tensor([1., 2., 3.])
print(x.shape) # torch.Size([3]) ← a vector with 3 elements (one axis of length 3)
# 2D tensor (matrix)
M = torch.tensor([[1.0, 2.0],
                  [3.0, 4.0]])
print(M.shape) # torch.Size([2, 2])
Notice .shape. You will read it constantly in transformer code. Every tensor has a shape, and most of the bugs you hit will be shape mismatches.
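To make shapes concrete, here is a minimal sketch that prints the shape of a few tensors and then triggers a mismatch on purpose. The tensors (`v`, `M`, `T`) and their sizes are made up for illustration.

```python
import torch

# Hypothetical tensors, chosen only to show how shapes are reported.
v = torch.zeros(3)        # 1 axis, 3 elements
M = torch.zeros(2, 3)     # 2 axes: 2 rows, 3 columns
T = torch.zeros(4, 2, 3)  # 3 axes, e.g. a batch of 4 matrices

for name, t in [("v", v), ("M", M), ("T", T)]:
    print(name, tuple(t.shape), "axes:", t.ndim, "elements:", t.numel())

# Adding tensors whose shapes are incompatible raises a RuntimeError.
try:
    torch.zeros(3) + torch.zeros(4)
    mismatch_caught = False
except RuntimeError as err:
    mismatch_caught = True
    print("shape error:", err)
```

The error message names the two sizes that failed to line up, which is usually the fastest clue to where the bug is.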
The operations from Lesson 3 are real PyTorch code.
a = torch.tensor([1.0, 2.0, 3.0])
b = torch.tensor([4.0, 5.0, 6.0])
print(a + b) # tensor([5., 7., 9.]) ← element-wise add
print(a * b) # tensor([ 4., 10., 18.]) ← element-wise multiply
# Matrix multiply — the core of every transformer layer
A = torch.tensor([[2.0, 0.0],
                  [1.0, 3.0]])
v = torch.tensor([1.0, 2.0])
print(A @ v) # tensor([2., 7.]) ← matrix × vector from Lesson 3
The @ operator is matrix multiply. You will see it everywhere in transformer code. Every attention head, every feed-forward layer, every projection — all @.
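The shape rule behind @ is worth seeing once: the inner sizes must match, and the outer sizes give the result. A small sketch, with hypothetical sizes (4 vectors of width 3, projected to width 2):

```python
import torch

# Hypothetical sizes: 4 token vectors of width 3, projected to width 2.
tokens = torch.ones(4, 3)
W = torch.ones(3, 2)

out = tokens @ W                 # (4, 3) @ (3, 2) -> (4, 2)
same = torch.matmul(tokens, W)   # the function form of @

print(out.shape)                 # torch.Size([4, 2])
print(torch.equal(out, same))    # True
print(out[0])                    # tensor([3., 3.]): each entry sums three 1 × 1 products
```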
This is the part that makes deep learning possible. When you train an LLM, you need to know: for each weight in the model, how much does changing it reduce the loss? Computing this by hand for billions of weights is impossible. PyTorch does it automatically.
Tell PyTorch to track a tensor with requires_grad=True:
w = torch.tensor(2.0, requires_grad=True)  # a single "weight"
x = torch.tensor(3.0)                      # input
y = w * x                                  # forward: y = 2 × 3 = 6
loss = (y - 5)**2                          # how wrong? (6 − 5)² = 1
loss.backward()                            # ← PyTorch computes ALL gradients
print(w.grad)                              # tensor(6.)
# d(loss)/dw = 2 · (w·x − 5) · x = 2 · 1 · 3 = 6
# Positive gradient: nudging w up raises the loss (about 6 per unit of w). So decrease w.
PyTorch records the computation graph during the forward pass, then walks it backward, applying the chain rule at each operation, to compute gradients.
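One way to convince yourself that autograd is doing ordinary calculus is to compare it with the chain rule worked by hand and with a finite-difference slope. This is a sketch with made-up numbers (w = 1.5, x = 2, target 4); `loss_fn` is a helper defined here, not a PyTorch function.

```python
import torch

def loss_fn(w, x, target):
    # Squared error of a single prediction w * x.
    return (w * x - target) ** 2

# Autograd (numbers are made up for illustration).
w = torch.tensor(1.5, requires_grad=True)
x = torch.tensor(2.0)
loss = loss_fn(w, x, 4.0)
loss.backward()
autograd_grad = w.grad.item()

# Chain rule by hand: d(loss)/dw = 2 * (w*x - target) * x
manual_grad = 2 * (1.5 * 2.0 - 4.0) * 2.0

# Finite difference: nudge w slightly in both directions and measure the slope.
eps = 1e-3
numeric_grad = (loss_fn(1.5 + eps, 2.0, 4.0) - loss_fn(1.5 - eps, 2.0, 4.0)) / (2 * eps)

print(autograd_grad, manual_grad, numeric_grad)
```

All three agree on a gradient of about -4. The negative sign means that increasing w lowers the loss here, so the update should push w up.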
Now put it together. This is a complete, working PyTorch training loop, and LLM training runs follow the same skeleton.
import torch
import torch.nn as nn
# Model: one Linear layer (a learned matrix multiply)
model = nn.Linear(1, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
# Data: y = 2x (we want the model to learn this)
X = torch.tensor([[1.0], [2.0], [3.0], [4.0]])
y = torch.tensor([[2.0], [4.0], [6.0], [8.0]])
for step in range(200):
    pred = model(X)                      # 1. forward pass
    loss = ((pred - y)**2).mean()        # 2. measure error
    optimizer.zero_grad()                # 3. clear old gradients
    loss.backward()                      # 4. compute new gradients
    optimizer.step()                     # 5. adjust weights
    if step % 50 == 0:
        print(f"Step {step}: loss = {loss.item():.4f}")
# Example output (your numbers will differ, because nn.Linear starts from random weights):
# Step 0: loss = 18.3241
# Step 50: loss = 0.0412
# Step 100: loss = 0.0093
# Step 150: loss = 0.0021
Five lines, same every time: forward → loss → zero_grad → backward → step. The model teaches itself from data.
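Why clear gradients at all? Because backward() adds to whatever gradient is already stored instead of overwriting it. A minimal sketch, using a made-up one-number example:

```python
import torch

w = torch.tensor(1.0, requires_grad=True)

# First backward pass: d(3w)/dw = 3
(3 * w).backward()
first = w.grad.item()

# Second backward pass without clearing: the new 3 is added to the old 3.
(3 * w).backward()
accumulated = w.grad.item()

# Reset the stored gradient by hand. optimizer.zero_grad() performs this
# reset for every parameter the optimizer manages.
w.grad = None
(3 * w).backward()
after_clear = w.grad.item()

print(first, accumulated, after_clear)  # 3.0 6.0 3.0
```

Skipping the reset in a training loop would make each step use the sum of all past gradients, not the current one.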
Recall the earlier pseudocode: for step in range(10000): adjust weights. The "adjust weights" part is exactly steps 3–5 above. You already understood training; now you have the real code.
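To see what "adjust weights" means concretely, the sketch below runs one step two ways: once with the optimizer, once by applying the plain SGD rule (parameter minus learning rate times gradient) by hand. The names `model_a` and `model_b` are made up for this comparison, and the rule shown assumes plain SGD with no momentum or weight decay, which is what torch.optim.SGD does with its default settings.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # fixed seed so the run is repeatable
lr = 0.01

model_a = nn.Linear(1, 1)
model_b = nn.Linear(1, 1)
model_b.load_state_dict(model_a.state_dict())  # identical starting weights

X = torch.tensor([[1.0], [2.0], [3.0], [4.0]])
y = torch.tensor([[2.0], [4.0], [6.0], [8.0]])

# Copy A: let the optimizer adjust the weights.
opt = torch.optim.SGD(model_a.parameters(), lr=lr)
loss_a = ((model_a(X) - y) ** 2).mean()
opt.zero_grad()
loss_a.backward()
opt.step()

# Copy B: apply the plain SGD rule by hand, p <- p - lr * grad.
loss_b = ((model_b(X) - y) ** 2).mean()
loss_b.backward()
with torch.no_grad():  # the update itself must not be tracked by autograd
    for p in model_b.parameters():
        p -= lr * p.grad

print(torch.allclose(model_a.weight, model_b.weight))  # True
print(torch.allclose(model_a.bias, model_b.bias))      # True
```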
nn.Linear: the transformer's building block
nn.Linear is a matrix multiply with a learned weight matrix, plus a learned bias. Most of a transformer's parameters live in layers built from this one thing.
# GPT-2 small feed-forward layer
ff = nn.Linear(768, 3072)
print(ff.weight.shape) # torch.Size([3072, 768])
# ← a 3072×768 learned matrix
# Every forward pass: output = input @ weight.T + bias
# That's it. Billions of parameters, same operation.
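GPT-2 small has 12 transformer layers. In each one, the feed-forward block uses two linear layers of this kind (768 → 3072, then 3072 → 768), and the attention block adds further linear projections. That is most of the model. You now understand its fundamental operation.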
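The claim that nn.Linear is just input @ weight.T + bias can be checked directly, and the same sketch counts the feed-forward parameters. The down-projection size (3072 → 768) is not stated above; it is assumed here from the standard GPT-2 small configuration. `ff_up`, `ff_down`, and `count_params` are names made up for this example.

```python
import torch
import torch.nn as nn

# 768 -> 3072 is from the text; 3072 -> 768 is the assumed matching layer.
ff_up = nn.Linear(768, 3072)
ff_down = nn.Linear(3072, 768)

# nn.Linear computes input @ weight.T + bias; check that on a random input.
x = torch.randn(5, 768)  # 5 hypothetical token vectors
manual = x @ ff_up.weight.T + ff_up.bias
print(torch.allclose(ff_up(x), manual, atol=1e-4))  # True

def count_params(layer):
    # Total number of learned values (weights and biases) in a layer.
    return sum(p.numel() for p in layer.parameters())

per_block = count_params(ff_up) + count_params(ff_down)
print(per_block)       # 4722432
print(per_block * 12)  # 56669184 across 12 layers (feed-forward weights only)
```

This count covers only the feed-forward layers; attention projections, embeddings, and layer norms add to the total.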
Watch from 0:00–25:00. Karpathy writes torch.tensor, nn.Linear, and the training loop you just read — in a real working GPT. Every line will click now.
What does optimizer.zero_grad() actually do? How does loss.backward() know which tensors to differentiate? Just ask Claude directly.