Lesson 3 — Vectors & Matrices

Every token an LLM processes is represented as a vector, and almost all of the computation inside a transformer is matrix operations. You cannot build a GPT without these two shapes. This lesson makes them feel obvious.

Promise: No calculus. No proofs. No exams. Just what these things are and why transformers need them.

1. A vector is a list of numbers

What makes it a vector vs just a list? Context — you intend those numbers to mean something together, not separately.
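
As a rough illustration of "numbers that mean something together", here is a vector written as a plain Python list. The colour example and the name `orange` are assumptions chosen for this sketch, not something from the lesson: on their own, 255, 165, and 0 say little, but together as (red, green, blue) they describe one colour.

```python
# A vector is an ordered list of numbers that belong together.
# Illustrative example: a colour as (red, green, blue) intensities.
orange = [255, 165, 0]

# Operations act on the vector as a whole: scale every number
# by the same factor to get a darker shade of the same colour.
darker = [0.5 * x for x in orange]

print(darker)  # [127.5, 82.5, 0.0]
```

The order matters: swapping the first two numbers would describe a different colour, which is part of what separates a vector from a loose collection of values.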

2. Words are vectors — this is the key idea

When an LLM reads the word "king", it does not see the letters K-I-N-G. It converts that word into a vector — hundreds of numbers — that encodes meaning.

[ 0.87 -0.12 0.44 0.03 ... 508 more ... ]

"king" as a 512-number vector (embedding)

This is called an embedding. Every token (a word or a piece of a word) is turned into a vector before the transformer's layers ever see it, so a sentence arrives as a sequence of vectors.

Why? Because vectors can do something letters cannot: arithmetic on meaning.

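Start with the "king" vector, subtract the "man" vector, and add the "woman" vector: in trained word embeddings such as word2vec, the result lands closest to "queen". The match is approximate rather than exact, but it is real.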
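
The arithmetic can be sketched with toy numbers. Everything below is an assumption made for illustration: the five 3-number vectors are hand-picked so the effect is visible, and the helper `cosine_similarity` is written here rather than taken from a library. Real embeddings are learned from data and have hundreds of dimensions.

```python
import math

# Hypothetical 3-number embeddings, hand-picked for illustration only.
# Real embeddings are learned and have hundreds of dimensions.
embeddings = {
    "king":  [0.9,  0.8, 0.1],
    "queen": [0.9, -0.8, 0.1],
    "man":   [0.1,  0.8, 0.2],
    "woman": [0.1, -0.8, 0.2],
    "apple": [0.0,  0.0, 0.9],
}

def cosine_similarity(a, b):
    """Return how closely two vectors point the same way (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# king - man + woman, computed one position at a time.
result = [
    k - m + w
    for k, m, w in zip(embeddings["king"], embeddings["man"], embeddings["woman"])
]

# Find the word whose vector points most nearly the same way as the result.
closest = max(embeddings, key=lambda word: cosine_similarity(result, embeddings[word]))
print(closest)  # queen
```

"Closest" here means highest cosine similarity, a common way to compare embedding vectors. It is one reasonable choice for this sketch, not the only possible measure.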

3. Vectors have a size — and size matters

The number of numbers in a vector is called its dimension. GPT-2 small uses 768-dimensional token vectors. The largest GPT-3 model uses 12,288.

More dimensions = more capacity to encode subtle meaning. A 3-dimensional "king" vector could capture royalty, gender, and maybe age. A 768-dimensional vector captures hundreds of subtle relationships.
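
In code, the dimension is simply the length of the list. The sketch below uses made-up values and made-up labels (`royalty`, `gender`, `age`) for the 3-dimensional case described above. The dimensions of a learned embedding do not come with readable names like these.

```python
# Hypothetical 3-dimensional "king" vector. The labels are for illustration;
# learned embedding dimensions do not come with human-readable names.
king_3d = {"royalty": 0.9, "gender": 0.8, "age": 0.6}
vector = list(king_3d.values())

print(len(vector))  # 3

# A GPT-2 small sized vector has 768 slots
# (filled with zeros here as a placeholder for real values).
gpt2_small_vector = [0.0] * 768
print(len(gpt2_small_vector))  # 768
```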

4. A matrix is a grid of numbers
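
One way to picture the grid in code is as a list of rows, where every row has the same length. This is a minimal sketch with example values; the variable names are assumptions for illustration.

```python
# A matrix as a list of rows; every row has the same length.
matrix = [
    [2, 0, 1],
    [-1, 3, 0],
]

rows = len(matrix)     # how many rows the grid has
cols = len(matrix[0])  # how many numbers are in each row
print(rows, cols)  # 2 3

# Look up one entry: row index 1, column index 1 (counting from 0).
print(matrix[1][1])  # 3
```

A grid with 2 rows and 3 columns is called a 2×3 matrix: rows are always named first.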

5. Matrix × vector = the core of every transformer

Here is the one operation that runs millions of times inside a transformer: you multiply a matrix by a vector to get a new vector.

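[  2  0  1 ]   [ 1 ]   [ 5 ]
[ -1  3  0 ] × [ 2 ] = [ 5 ]
               [ 3 ]

2×3 matrix × input vector (3 numbers) = output vector (2 numbers)

Each output number comes from one row of the matrix: 2·1 + 0·2 + 1·3 = 5 and −1·1 + 3·2 + 0·3 = 5.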

The matrix is not data; it is a transformation. It takes an input vector (the word embedding) and produces an output vector (a new representation). Attention and feed-forward layers are both built from this operation, with a few simple non-linear steps in between. Most of the work in a transformer is this operation, repeated.

Mental model: The matrix is the model's "learned knowledge." The vector is the current word/token. Multiplying them applies what the model knows to what it is looking at right now.
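
The multiplication in this section can be sketched in a few lines of plain Python, using the 2×3 matrix with rows [2, 0, 1] and [-1, 3, 0] and the input vector [1, 2, 3]. The function name `matvec` is an assumption for this sketch. Real models use optimized libraries such as PyTorch rather than Python loops.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of numbers)."""
    # Each output number is one row of the matrix multiplied
    # position by position with the vector, then summed.
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

matrix = [
    [2, 0, 1],
    [-1, 3, 0],
]
vector = [1, 2, 3]

output = matvec(matrix, vector)
print(output)  # [5, 5]
```

A 3-number input went in and a 2-number output came out. The matrix's shape decides the size of the new representation.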

6. Tensors — just more dimensions

When you feed 8 sentences of 512 tokens each into a transformer, the input is a 3D tensor: shape [8, 512, 768] — 8 sequences, 512 tokens each, 768 dimensions per token.
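
The shape [8, 512, 768] can be checked with nested Python lists. This is a rough sketch: the helper `shape` and the names `batch_size`, `seq_len`, and `d_model` are assumptions for illustration, and the zeros stand in for real embedding values. Libraries such as PyTorch provide tensors with a built-in shape, so this helper would not be needed there.

```python
def shape(nested):
    """Return the shape of a regularly nested list, e.g. [8, 512, 768]."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]  # step one level deeper
    return dims

batch_size, seq_len, d_model = 8, 512, 768  # illustrative names

# Zero-filled placeholder standing in for real token embeddings.
batch = [[[0.0] * d_model for _ in range(seq_len)] for _ in range(batch_size)]

print(shape(batch))        # [8, 512, 768]
print(shape(batch[0]))     # [512, 768]  one sequence: a matrix
print(shape(batch[0][0]))  # [768]       one token: a vector
```

Peeling off one index at a time goes from tensor to matrix to vector, which is why a tensor is "just more dimensions" of the same idea.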

Primary source — watch this after the quiz

Let's build GPT from scratch — Andrej Karpathy (YouTube, 2hr)

Start at 0:00 — the first 20 minutes walk through exactly these shapes in real PyTorch code. Pause and run along. This is Lesson 4's prep.


