Lesson 3 of many · About 15 minutes · No calculus, just shapes

Vectors & Matrices

Every single token in an LLM is represented as a vector. Every single layer in a transformer is built on matrix operations. You cannot build a GPT without these two shapes. This lesson makes them feel obvious.

Promise: No calculus. No proofs. No exams. Just what these things are and why transformers need them.

1. A vector is a list of numbers

That is the whole definition. A vector is a list of numbers.

[ 0.2 -0.4 0.9 ]
a vector with 3 numbers

In Python, you already saw this in Lesson 2 — it is just a list:

v = [0.2, -0.4, 0.9]   # this is a vector

What makes it a vector vs just a list? Context — you intend those numbers to mean something together, not separately.
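
To make "mean something together" concrete, here is a minimal sketch that treats two Python lists as vectors and adds them position by position. The helper name add_vectors and the second vector w are made up for illustration; they do not come from Lesson 2.

```python
def add_vectors(a, b):
    """Add two vectors position by position (they must have the same length)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of entries")
    return [x + y for x, y in zip(a, b)]

v = [0.2, -0.4, 0.9]   # the vector from above
w = [1.0, 1.0, 1.0]    # a second, made-up vector

total = add_vectors(v, w)
print(len(v))    # how many numbers the vector holds
print(total)     # each entry of v shifted up by 1.0
```

Adding a 3-number vector to a 4-number vector raises an error here, which is the point: the numbers only make sense as a matched set.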

2. Words are vectors — this is the key idea

When an LLM reads the word "king", it does not see the letters K-I-N-G. It converts that word into a vector — hundreds of numbers — that encodes meaning.

[ 0.87 -0.12 0.44 0.03 ... 508 more ... ]
"king" as a 512-number vector (embedding)

This is called an embedding. Every token gets turned into a vector before the transformer layers ever see it, so a sentence becomes a sequence of vectors, one per token.
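
One way to picture the conversion is a lookup table: each token has a row, and the row is its vector. The sketch below is only an illustration under that assumption. The vocabulary, the names token_to_id, embedding_table and embed, and every number in the table are made up; a real model learns its table during training.

```python
# A made-up vocabulary: each token gets a row number in the table
token_to_id = {"the": 0, "king": 1, "queen": 2}

# A made-up embedding table: one 4-number vector per token.
# Real models learn these numbers and use hundreds of them per token.
embedding_table = [
    [0.10, 0.00, -0.30, 0.25],   # "the"
    [0.87, -0.12, 0.44, 0.03],   # "king"
    [0.80, -0.10, 0.50, 0.60],   # "queen"
]

def embed(tokens):
    """Turn a list of tokens into a list of vectors, one per token."""
    return [embedding_table[token_to_id[t]] for t in tokens]

vectors = embed(["the", "king"])
print(len(vectors))   # one vector per token
print(vectors[1])     # the row stored for "king"
```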

Why? Because vectors can do something letters cannot: arithmetic on meaning.

king − man + woman ≈ queen ← vector arithmetic finds this

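Subtract the "man" direction, add the "woman" direction, and the result lands closest to "queen". This is real: in classic word-embedding models such as word2vec it works for many word pairs, though only approximately.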
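
The arithmetic can be tried on a toy example. The sketch below uses hand-picked 3-number vectors whose entries loosely stand for royalty, maleness and femaleness. These values, the word list, and the helper name cosine are all assumptions made for illustration; real embeddings are learned and their individual numbers carry no such labels.

```python
import math

# Made-up 3-number "embeddings": [royalty, maleness, femaleness]
toy = {
    "king":  [0.9, 0.9, 0.1],
    "queen": [0.9, 0.1, 0.9],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}

def cosine(a, b):
    """Cosine similarity: close to 1.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# king - man + woman, computed entry by entry
target = [k - m + w for k, m, w in zip(toy["king"], toy["man"], toy["woman"])]

# Find the nearest word, skipping the three words used in the sum
candidates = {word: vec for word, vec in toy.items()
              if word not in ("king", "man", "woman")}
best = max(candidates, key=lambda word: cosine(target, candidates[word]))
print(best)  # queen
```

Skipping the three input words when searching for the nearest neighbour is a common convention in these analogy demos; without it, the nearest vector is often one of the inputs.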

3. Vectors have a size — and size matters

The number of numbers in a vector is called its dimension. GPT-2 small uses 768-dimensional token vectors. The largest GPT-3 model uses 12,288.

More dimensions = more capacity to encode subtle meaning. A 3-dimensional "king" vector could capture royalty, gender, and maybe age. A 768-dimensional vector captures hundreds of subtle relationships.

dim=3 (tiny) · dim=768 (GPT-2 small) · dim=12288 (GPT-3)
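
In code, the dimension is simply the length of the list. The sketch below builds placeholder vectors at the three sizes mentioned above. The numbers are random stand-ins, not real embeddings, and the helper name random_vector is made up for this example.

```python
import random

random.seed(0)  # fixed seed so the made-up numbers are repeatable

def random_vector(dim):
    """Build a placeholder vector with `dim` random numbers in [-1, 1]."""
    return [random.uniform(-1.0, 1.0) for _ in range(dim)]

tiny = random_vector(3)
gpt2_sized = random_vector(768)
gpt3_sized = random_vector(12288)

for name, vec in [("tiny", tiny), ("GPT-2 small", gpt2_sized), ("GPT-3", gpt3_sized)]:
    print(name, "dimension =", len(vec))
```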

4. A matrix is a grid of numbers

A matrix is just a table of numbers. Rows and columns.

[  2  0  1 ]
[ -1  3  0 ]
[  0  1  4 ]
a 3×3 matrix (3 rows, 3 columns)

In plain Python, a matrix is a list of lists (PyTorch stores the same grid in a tensor):

M = [[2,  0,  1],
     [-1, 3,  0],
     [0,  1,  4]]
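
Reading rows, columns and single entries out of that list of lists can be sketched as follows. The variable names are chosen for this example only.

```python
M = [[2,  0, 1],
     [-1, 3, 0],
     [0,  1, 4]]

n_rows = len(M)        # number of inner lists
n_cols = len(M[0])     # length of the first row
print(n_rows, n_cols)  # 3 3

second_row = M[1]                     # rows are counted from 0
entry = M[1][0]                       # row 1, column 0
first_column = [row[0] for row in M]  # take entry 0 from every row

print(second_row)    # [-1, 3, 0]
print(entry)         # -1
print(first_column)  # [2, -1, 0]
```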

5. Matrix × vector = the core of every transformer

Here is the one operation that runs millions of times inside a transformer: you multiply a matrix by a vector to get a new vector.

[  2  0  1 ]       [ 1 ]       [ 5 ]
[ -1  3  0 ]   ×   [ 2 ]   =   [ 5 ]
                   [ 3 ]
 2×3 matrix     input vector   output vector

The matrix is not data; it is a transformation. It takes an input vector (the word embedding) and produces an output vector (a new representation). Attention is built on this operation. Feed-forward layers are built on it too. Every layer in a transformer repeats it, with a few simple non-matrix steps in between.

Mental model: The matrix is the model's "learned knowledge." The vector is the current word/token. Multiplying them applies what the model knows to what it is looking at right now.
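
The multiplication itself is short enough to write by hand. The sketch below is a plain-Python version, assuming the matrix is stored as a list of rows. The names matvec, W, x and y are chosen for this example. In practice PyTorch does this for you, far faster.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of numbers)."""
    for row in matrix:
        if len(row) != len(vector):
            raise ValueError("each row must be as long as the vector")
    # Each output entry is one row multiplied entry by entry with the
    # input vector, then summed
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

W = [[2,  0, 1],
     [-1, 3, 0]]   # the 2x3 matrix from the example
x = [1, 2, 3]      # the input vector

y = matvec(W, x)
print(y)  # [5, 5]
```

Row by row: 2·1 + 0·2 + 1·3 = 5 and −1·1 + 3·2 + 0·3 = 5. Note that the 2×3 matrix turns a 3-number vector into a 2-number vector, so a matrix can change the dimension as well as the values.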

6. Tensors — just more dimensions

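PyTorch calls everything a tensor. A tensor is just the general word: a single number is a 0D tensor, a vector is a 1D tensor, a matrix is a 2D tensor, and anything with more axes is a higher-dimensional tensor.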

When you feed 8 sentences of 512 tokens each into a transformer, the input is a 3D tensor: shape [8, 512, 768] — 8 sequences, 512 tokens each, 768 dimensions per token.
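
Shapes can be explored with nested lists before touching PyTorch. The sketch below uses a made-up helper named shape that assumes a regular, non-empty nested list, and a tiny stand-in batch instead of the full-size one.

```python
def shape(nested):
    """Return the shape of a regular, non-empty nested list."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]   # step one level deeper
    return dims

print(shape(0.5))                    # []      a single number
print(shape([0.2, -0.4, 0.9]))       # [3]     a vector
print(shape([[2, 0, 1], [-1, 3, 0]]))  # [2, 3]  a matrix

# A tiny stand-in batch: 2 sequences, 4 tokens each, 3 numbers per token
batch = [[[0.0] * 3 for _ in range(4)] for _ in range(2)]
print(shape(batch))                  # [2, 4, 3]

# Count the numbers in the full-size example without building it
full_shape = [8, 512, 768]
count = 1
for size in full_shape:
    count *= size
print(count)  # 3145728
```

So a single batch of that size already holds about 3.1 million numbers.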

Primary source — watch this after the quiz
Let's build GPT from scratch — Andrej Karpathy (YouTube, 2hr)

Start at 0:00 — the first 20 minutes walk through exactly these shapes in real PyTorch code. Pause and run along. This is Lesson 4's prep.

Check your understanding
