Every token in an LLM is represented as a vector. Nearly every layer in a transformer is built around matrix operations. You cannot build a GPT without these two shapes. This lesson makes them feel obvious.
A vector is a list of numbers. That is the whole definition.
In Python, you already saw this in Lesson 2 — it is just a list:
v = [0.2, -0.4, 0.9] # this is a vector
What makes it a vector vs just a list? Context — you intend those numbers to mean something together, not separately.
When an LLM reads the word "king", it does not see the letters K-I-N-G. It converts that word into a vector — hundreds of numbers — that encodes meaning.
This is called an embedding. Text is split into tokens, and every token is turned into a vector before the transformer's layers ever see it.
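To make the idea concrete, here is a minimal sketch of an embedding lookup in plain Python. The table `embedding_table`, the helper `embed`, and the numbers in it are made up for illustration; a real model learns its values during training and uses hundreds of dimensions, not three.

```python
# Toy embedding table: each token maps to a small, hand-picked vector.
# These values are invented for illustration; real models learn them.
embedding_table = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "apple": [0.0, 0.2, 0.3],
}

def embed(tokens):
    """Replace each token with its vector from the table."""
    return [embedding_table[token] for token in tokens]

vectors = embed(["king", "queen"])
print(vectors[0])       # the vector that stands in for "king"
print(len(vectors[0]))  # how many numbers it holds: 3
```

The point of the sketch is only the shape of the operation: tokens go in, one vector per token comes out.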
Why? Because vectors can do something letters cannot: arithmetic on meaning.
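Take the vector for "king", subtract the "man" direction, add the "woman" direction, and the result lands closest to "queen". This is a real, well-known result from learned word embeddings such as word2vec, although it holds approximately rather than exactly.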
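The arithmetic can be sketched with toy numbers. The four vectors below are hand-made so that the first number loosely means "royalty", the second "male", and the third "female"; they are assumptions for illustration, not real embeddings. The helpers `add`, `subtract`, and `cosine_similarity` are also written here just for this example.

```python
import math

# Hand-made 3-dimensional vectors: [royalty, male, female] (illustrative only).
words = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
}

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def subtract(a, b):
    return [x - y for x, y in zip(a, b)]

def cosine_similarity(a, b):
    """1.0 means the vectors point the same way; 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# king - man + woman
result = add(subtract(words["king"], words["man"]), words["woman"])

# Find the word whose vector points in the most similar direction.
closest = max(words, key=lambda w: cosine_similarity(words[w], result))
print(closest)  # queen
```

With real embeddings the match is never exact, which is why the comparison uses "closest" rather than "equal".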
The number of numbers in a vector is called its dimension. GPT-2 small uses 768-dimensional token vectors. The largest GPT-3 model uses 12,288.
More dimensions = more capacity to encode subtle meaning. A 3-dimensional "king" vector could capture royalty, gender, and maybe age. A 768-dimensional vector captures hundreds of subtle relationships.
A matrix is just a table of numbers. Rows and columns.
In plain Python, a matrix is a list of lists (in PyTorch, it is a 2D tensor):
M = [[2, 0, 1],
[-1, 3, 0],
[0, 1, 4]]
Here is the one operation that runs millions of times inside a transformer: you multiply a matrix by a vector to get a new vector.
The matrix is not data. It is a transformation: it takes an input vector (such as a token embedding) and produces an output vector (a new representation). Attention and feed-forward layers are both built from this operation, with simple nonlinear functions applied in between. Most of the computation in a transformer is this operation, repeated.
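Here is a minimal sketch of that operation in plain Python, reusing the `M` and `v` from earlier in this lesson. The helper name `matvec` is made up for this example. Each number in the output is one row of the matrix multiplied element by element with the vector, then summed.

```python
M = [[2, 0, 1],
     [-1, 3, 0],
     [0, 1, 4]]
v = [0.2, -0.4, 0.9]

def matvec(matrix, vector):
    """Multiply a matrix by a vector: one dot product per row."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

out = matvec(M, v)
print(out)  # approximately [1.3, -1.4, 3.2]
```

A 3x3 matrix turns a 3-number vector into another 3-number vector. In PyTorch the same computation is usually written with the `@` operator on tensors, for example `torch.tensor(M, dtype=torch.float32) @ torch.tensor(v)`.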
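PyTorch calls everything a tensor. A tensor is just the general word for a block of numbers with any number of axes: a single number is a 0D tensor, a vector is a 1D tensor, a matrix is a 2D tensor, and anything with more axes is a higher-dimensional tensor.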
When you feed 8 sentences of 512 tokens each into a transformer, the input is a 3D tensor: shape [8, 512, 768] — 8 sequences, 512 tokens each, 768 dimensions per token.
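The shape idea can be sketched without PyTorch by using nested lists. The helpers `shape` and `zeros` below are written just for this example, and the tensor is kept tiny (shape [2, 4, 3] instead of [8, 512, 768]) so it is easy to print and inspect.

```python
def shape(tensor):
    """Return the size of each axis of a regularly nested list."""
    dims = []
    while isinstance(tensor, list):
        dims.append(len(tensor))
        tensor = tensor[0]
    return dims

def zeros(dims):
    """Build a nested list of zeros with the given shape."""
    if len(dims) == 1:
        return [0.0] * dims[0]
    return [zeros(dims[1:]) for _ in range(dims[0])]

batch = zeros([2, 4, 3])    # 2 sequences, 4 tokens each, 3 numbers per token
print(shape(batch))         # [2, 4, 3]
print(shape(batch[0]))      # [4, 3]: one sequence is a matrix
print(shape(batch[0][0]))   # [3]: one token is a vector

# The full-size input from the text holds this many numbers:
print(8 * 512 * 768)        # 3145728
```

Indexing once peels off one axis: a batch contains sequences, a sequence contains token vectors, and a token vector contains plain numbers.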
Start at 0:00 — the first 20 minutes walk through exactly these shapes in real PyTorch code. Pause and run along. This is Lesson 4's prep.