You do not need all of Python. You need the four things that appear in the code of every neural network: variables, lists, functions, and loops. That is the lesson.
A variable is a named box that holds a value.
learning_rate = 0.001
num_layers = 6
temperature = 0.7
print(learning_rate)   # 0.001
print(num_layers * 2)  # 12
These three variables appear in almost every LLM you will ever work with. learning_rate controls how fast the model learns. num_layers is how many transformer blocks stack on top of each other. temperature controls how random the output is.
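To make the effect of temperature concrete, here is a minimal sketch in plain Python. It assumes the model produces one raw score per candidate word and turns those scores into probabilities with a softmax, dividing by the temperature first. The scores and the helper name to_probabilities are made up for illustration; they do not come from any real model.

```python
import math

def to_probabilities(scores, temperature):
    # Divide each score by the temperature, then apply a softmax.
    scaled = [s / temperature for s in scores]
    biggest = max(scaled)  # subtracting the max keeps exp() from overflowing
    exps = [math.exp(s - biggest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.5]  # hypothetical raw scores for three candidate words

low = to_probabilities(scores, temperature=0.2)
high = to_probabilities(scores, temperature=2.0)

print([round(p, 3) for p in low])   # almost all probability on the first word
print([round(p, 3) for p in high])  # probability spread more evenly
```

With a low temperature the top-scoring word is picked almost every time. With a high temperature the other words get a real chance, so the output looks more random.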
A list holds many values in order. In LLMs, lists hold tokens — the chunks of text the model processes.
tokens = [15496, 11, 995, 0]  # "Hello, world!" as token IDs
print(tokens[0])    # 15496 (first token)
print(len(tokens))  # 4 (how many tokens)
When you type "Hello, world!" to an LLM, it does not see letters — it sees a list of numbers like this. Each number maps to a chunk of text in its vocabulary. This is called tokenisation.
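The mapping between numbers and chunks of text can be sketched with a dictionary. The four-entry vocabulary below is an assumption made for illustration: it pairs the IDs from the example with plausible chunks of "Hello, world!". The helper names encode and decode are hypothetical too. A real tokeniser also decides how to split the text into chunks, which this sketch skips.

```python
# Hypothetical four-entry vocabulary: token ID -> chunk of text.
# A real vocabulary has tens of thousands of entries.
id_to_text = {15496: "Hello", 11: ",", 995: " world", 0: "!"}
text_to_id = {text: token_id for token_id, text in id_to_text.items()}

def decode(token_ids):
    # Look up each ID and glue the chunks back together.
    return "".join(id_to_text[token_id] for token_id in token_ids)

def encode(chunks):
    # Map chunks that are already split to their IDs.
    return [text_to_id[chunk] for chunk in chunks]

tokens = [15496, 11, 995, 0]
print(decode(tokens))                         # Hello, world!
print(encode(["Hello", ",", " world", "!"]))  # [15496, 11, 995, 0]
```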
A function takes input, does something, returns output. You will write and read hundreds of these.
def add(a, b):
    return a + b

result = add(3, 4)
print(result)  # 7
In a transformer, every single operation — attention, feed-forward, layer normalisation — is a function. The model itself is a function: text in, text out.
def gpt(prompt):
    tokens = tokenise(prompt)
    output = transformer(tokens)
    return decode(output)

response = gpt("The sky is")
# "The sky is blue."
That is not real code yet — but it is exactly how a real GPT function works structurally. You will write this for real in Lesson 5.
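If you want to see that structure run today, here is a sketch with toy stand-ins for the three helpers. Every helper body below is made up for illustration and has nothing to do with how a real model works inside: the "transformer" simply picks the next word in a fixed four-word list. Only the shape of gpt() matches the real thing.

```python
# Toy stand-ins so the gpt() structure can actually run.
# None of this is a real model; the helper bodies are invented for illustration.
words = ["The", "sky", "is", "blue."]

def tokenise(prompt):
    # Turn each whitespace-separated word into its position in `words`.
    return [words.index(word) for word in prompt.split()]

def transformer(tokens):
    # Stand-in for the real model: always "predict" the ID after the last one.
    next_id = (tokens[-1] + 1) % len(words)
    return tokens + [next_id]

def decode(token_ids):
    # Map IDs back to words and join them with spaces.
    return " ".join(words[token_id] for token_id in token_ids)

def gpt(prompt):
    tokens = tokenise(prompt)
    output = transformer(tokens)
    return decode(output)

print(gpt("The sky is"))  # The sky is blue.
```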
A loop runs the same code many times. Training an LLM is fundamentally a loop: show the model text, measure how wrong it is, nudge it to be less wrong. Repeat millions of times.
for step in range(10000):
    loss = model.forward(batch)
    loss.backward()
    optimizer.step()
    if step % 1000 == 0:
        print(f"Step {step}: loss = {loss:.4f}")
Again — not real code yet. But this is the skeleton of every training loop you will write. loss measures how wrong the model is. backward() figures out what to adjust. optimizer.step() makes the adjustment. Loop until the model is good.
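The same skeleton can be run without any library if the "model" is shrunk to a single number. The sketch below is an assumption-laden toy, not how a real framework does it: the model is y = w * x, there is one training example, and the gradient is worked out by hand where backward() would normally compute it.

```python
# Fit y = w * x to one data point using plain gradient descent.
x, y = 2.0, 6.0       # one training example; the ideal w is 3.0
w = 0.0               # the single "parameter" of this toy model
learning_rate = 0.05

for step in range(100):
    prediction = w * x
    loss = (prediction - y) ** 2         # how wrong the model is
    gradient = 2 * (prediction - y) * x  # what backward() would compute
    w = w - learning_rate * gradient     # what optimizer.step() would do
    if step % 20 == 0:
        print(f"Step {step}: loss = {loss:.4f}")

print(round(w, 3))  # 3.0
```

The loss starts at 36 and shrinks on every pass, because each step moves w a little closer to 3.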
Here is a tiny working Python program that uses all four concepts. Read it — you will understand more than you expect.
# A tiny fake "LLM" that predicts the next word
vocabulary = ["hello", "world", "the", "sky", "is", "blue"]

def predict_next(word):
    # Hand-coded lookup table: current word -> next word
    pairs = {"the": "sky", "sky": "is", "is": "blue", "hello": "world"}
    return pairs.get(word, "?")  # "?" when the word has no known successor

sentence = ["the"]
for step in range(3):  # three predictions; a fourth would append "?"
    last_word = sentence[-1]
    next_word = predict_next(last_word)
    sentence.append(next_word)

print(sentence)
# ['the', 'sky', 'is', 'blue']
This is a toy "LLM": it predicts the next word from a hand-coded lookup table. A real LLM has the same job, predicting the next token, but instead of a lookup table it uses billions of parameters learned from data.
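"Learned from data" can itself be shown in a few lines. The sketch below builds the lookup table by reading example sentences instead of typing the pairs by hand. The two sentences and the variable names are assumptions chosen so that the result matches the table in the program above. A real LLM learns numerical parameters, not a table, so treat this only as an analogy.

```python
# Learn the lookup table from example text instead of writing it by hand.
sentences = ["the sky is blue", "hello world"]

pairs = {}
for sentence in sentences:
    words = sentence.split()
    # Walk through each word together with the word that follows it.
    for current_word, following_word in zip(words, words[1:]):
        pairs[current_word] = following_word

print(pairs)
# {'the': 'sky', 'sky': 'is', 'is': 'blue', 'hello': 'world'}
```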
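Create a new notebook, paste the code above, and press Shift+Enter. You just ran Python. Do this for every code block in this lesson, except the two sketches marked as not real code: those raise a NameError because tokenise, model, and the other helpers are not defined yet.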
Want to know what loss.backward() actually does? Just ask Claude directly.