From Zero to Agents
A visual, hands-on course

Understand AI,
from nothing
to agents.

Fifteen days. No coding. Every idea arrives as a real-world picture first, then a thing you can play with. Built for a curious mind that's strong in maths and physics.

PHASE 1

Foundations

What does it even mean for a machine to learn?

  • Teach by example, not rules
  • Features & boundaries
  • Rolling downhill
PHASE 2

Neural Networks

Stacking simple decisions into something powerful.

  • The neuron
  • How networks learn
  • Why depth matters
PHASE 3

Transformers

How machines read, write, and pay attention.

  • Words as vectors
  • Attention
  • How an LLM writes
PHASE 4

Agents

A model that uses tools, remembers, and acts.

  • Prompting & tools
  • Memory
  • The agent loop
Reference

The words,
in plain English

Every term in the course, defined simply, with the everyday picture that goes with it.

How it all fits together
Worked example · a spam filter
Every email is an example. Its features might be: number of links, whether it says “free”, and whether the subject is ALL CAPS. Its label is what you once called it: spam or not. The classifier (a kind of model) gives each feature a weight, adds a bias, and makes a prediction. At first it's often wrong; that gap is the error. We find the gradient (which weights to blame), and step by the learning rate to nudge them: gradient descent. After many epochs the error is small and the model has carved a decision boundary between spam and not. Run it on a brand-new email → that's inference. (If it merely memorised your old emails → overfitting.)
The learning loop — the heart of it
1 · FORWARD: MAKE A GUESS
Data (examples · features · labels) → Model (classifier: weights · bias) → Prediction (its guess) → Error / loss (how wrong)
2 · LEARN: STEP DOWNHILL
gradient → gradient descent · learning rate → nudge the weights
  • That classifier box can be a simple KNN (compare to the nearest examples — your Day 1 demo) or a logistic regression (one straight boundary — Day 2).
  • The error forms a landscape with valleys: a shallow local minimum you can get stuck in, vs. the true global minimum.
  • One full sweep through all the data is an epoch; training repeats the loop for many of them.
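Here is a minimal sketch of that loop for the spam filter, in plain Python. The six emails, the starting weights of zero, the sigmoid squashing and the log-loss error are all assumptions made for illustration; the course does not say which ones its demos use.

```python
import math

# Made-up examples: (number of links, says "free"?, ALL-CAPS subject?) and a label (1 = spam).
EMAILS = [
    ((5.0, 1.0, 1.0), 1),
    ((3.0, 1.0, 0.0), 1),
    ((4.0, 0.0, 1.0), 1),
    ((0.0, 0.0, 0.0), 0),
    ((1.0, 0.0, 0.0), 0),
    ((0.0, 1.0, 0.0), 0),
]

def predict(weights, bias, features):
    """Forward step: weighted sum plus bias, squashed into a 0..1 spam score."""
    total = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-total))

def mean_error(weights, bias, data):
    """One number for how wrong the model is (average log loss)."""
    loss = 0.0
    for features, label in data:
        p = min(max(predict(weights, bias, features), 1e-12), 1.0 - 1e-12)
        loss -= label * math.log(p) + (1 - label) * math.log(1.0 - p)
    return loss / len(data)

def train(data, learning_rate=0.1, epochs=500):
    weights, bias = [0.0, 0.0, 0.0], 0.0
    for _ in range(epochs):                  # one epoch = one full sweep of the data
        for features, label in data:
            gap = predict(weights, bias, features) - label
            # For this loss, each weight's slope is gap * feature; the bias's slope is gap.
            weights = [w - learning_rate * gap * x for w, x in zip(weights, features)]
            bias -= learning_rate * gap
    return weights, bias

weights, bias = train(EMAILS)
spam_score = predict(weights, bias, (6.0, 1.0, 0.0))   # inference on an unseen email
```

Comparing `mean_error` before and after `train` shows the error shrinking, and `spam_score` is the model's guess for an email it never saw during training.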
…and it scales up to the whole course
PHASE 1
The learning loop

One model: data → prediction → error → gradient descent nudges the weights.

PHASE 2
Neural network

Stack many of those units in layers; backpropagation sends the error back through all of them.

PHASE 3
Transformer

Add embeddings + attention so it reads language and predicts the next token — an LLM.

PHASE 4
Agent

Wrap the LLM in a loop with tools and memory so it can act toward a goal.

Every term above (and the rest of the course) is defined below.

Foundations · Phase 1

Classifier

A model whose whole job is to sort things into categories — cat or dog, spam or not-spam, ripe or unripe.

Model

The pattern a machine has learned from examples. It's what the machine actually uses to make a guess.

Training (learning)

The process of showing the model examples and adjusting it again and again until it gets good.

Data / dataset

The collection of examples a model learns from. More good examples usually means a better model.

Feature

A measurable clue describing one example — a fruit's colour, a person's height, a pixel's darkness.

Label

The correct answer attached to a training example, telling the model "this one is a cat."

Weight

A number the model tunes that says how much a feature matters to the decision.

Bias (the number)

An extra adjustable number that nudges the result up or down. (Different from "bias" meaning unfairness.)

Decision boundary

The line or surface a classifier draws to separate one category from another.

Error / loss

A single number for how wrong the model is right now. The whole goal of learning is to make it small.

Gradient / slope

The slope of the error at the current setting. It points in the direction the error rises fastest, so learning steps the opposite way. It's exactly the derivative from calculus.

Gradient descent

The method of learning: feel the slope, take a small step downhill on the error, repeat thousands of times.

Learning rate

How big each downhill step is. Too small is painfully slow; too big overshoots and flies off.

Minimum (local vs global)

A low point in the error. A shallow dip you can get stuck in (local) vs. the true lowest point (global = best answer).

Prediction / inference

Using an already-trained model to make a guess on something new. (Training = learning; inference = using.)

k-Nearest Neighbours (KNN)

A dead-simple classifier: to label something new, look at the most similar examples you've already seen and copy their label.

Logistic regression

A simple model that learns one straight decision boundary between two classes.

Overfitting

When a model memorises the exact training examples instead of the general pattern — great on practice, poor on anything new.

Neural Networks · Phase 2

Neuron

One tiny decision: multiply each input by a weight, add them up, and pass the total through a "switch."

Neural network

Many neurons wired together in layers, so together they can learn patterns a single line never could.

Activation function

The "switch" inside a neuron that decides how strongly it fires for a given input.

Forward pass

Data flowing in through the network, layer by layer, until an answer pops out the end.

Backpropagation

Sending the error backwards through the network so every weight learns which way to adjust.

Layer

A row of neurons. Early layers spot simple things (edges); deeper ones spot meaning (faces).

Parameters

All the weights and biases a model has learned. Modern models have billions of them.

Epoch

One complete pass through the entire training dataset.

Language & Transformers · Phase 3

Token

A small chunk of text — roughly a word or part of a word — that the model reads and writes one at a time.

Vector

Just a list of numbers, which you can also picture as a point or arrow in space.

Embedding

Turning a word into a vector so that related words sit close together and maths can capture meaning.

Attention

A mechanism that lets each word look at the other words that matter to it, to understand context.

Transformer

The architecture — built mostly from attention — behind today's AI like ChatGPT and Claude.

LLM (large language model)

A huge transformer trained to predict the next token. From that one skill it can write, answer, and summarise.

Next-token prediction

How an LLM writes: guess the most fitting next word, add it, then guess again — on repeat.

Temperature

A dial for how adventurous vs. safe the model's word choices are. High = creative, low = predictable.

Context window

How much text the model can hold "in mind" at one time. Beyond it, earlier text is forgotten.

Hallucination

When a model states something false but confidently. It predicts plausible-sounding text, not guaranteed truth.

Agents · Phase 4

Prompt

The instructions you give a model in plain language. Clearer prompts give better results.

Tool / function calling

Letting a model use a calculator, a search, or a database so it can actually do things, not just talk.

Agent

A model placed in a loop so it can plan, act with tools, see the result, and keep going toward a goal.

Memory (agentic)

Short-term (the context window) plus long-term — a store the agent writes to and later searches by meaning.

RAG (retrieval-augmented generation)

Fetching the most relevant saved facts and handing them to the model, so it answers from real information.

Fine-tuning

Training an existing model a bit more on your own examples to make it a specialist.

Day 00 · Start here

What even is AI?

Before we teach a machine anything — what are we actually talking about? And why does today's AI feel so different?

The one-sentence answer

Artificial Intelligence is any technique that lets a machine do something we'd normally call "intelligent" — recognising a face, understanding a sentence, planning a route, beating you at chess.

That's a huge umbrella. A chess program from 1997 and ChatGPT are both AI — but they work in completely different ways. The first big idea to get straight is that "AI" is a family, not a single thing.

The family tree
ARTIFICIAL INTELLIGENCE: machines doing "intelligent" things
  MACHINE LEARNING: learns patterns from data instead of hand-written rules
    DEEP LEARNING: neural networks with many layers
      GENERATIVE AI: creates new text, images, code (LLMs live here)
  Also AI, but not ML: rule-based "expert systems"

Each box sits inside the one before it. All generative AI is deep learning; all deep learning is machine learning; all of it is AI — but plenty of older AI (like rule-based systems) isn't machine learning at all.

The big split — two eras of AI
Traditional AI · recognise & decide
  • What it does: Sorts, scores, predicts, or picks; it analyses existing things.
  • Typical job: "Is this spam?" · "Will this customer churn?" · "What's the best chess move?"
  • Output: A label, a number, or a choice from fixed options.
  • How: Hand-written rules, or simpler models like decision trees and classifiers.
  • Example: Your bank's fraud alert; Netflix's "you might like…"; a thermostat's logic.
Generative AI · create something new
  • What it does: Produces brand-new content that didn't exist before; it generates.
  • Typical job: "Write this email" · "Draw a cat astronaut" · "Explain attention to my son."
  • Output: Open-ended: paragraphs, images, code, audio.
  • How: Huge deep neural networks (transformers) trained on enormous data.
  • Example: ChatGPT, Claude, Midjourney: the tools that started the recent boom.

The simplest way to feel the difference: traditional AI chooses from what exists; generative AI makes something that didn't. A spam filter picks one of two boxes. An LLM writes a sentence no one has written before. This course starts with the traditional ideas (Days 1–3) because generative AI is built directly on top of them.

Core ideas in traditional AI
Rule-based / expert systems
The earliest AI: humans write thousands of explicit if-this-then-that rules. Powerful where rules are clear (tax software), brittle where they aren't ("is this a cat?").
Search & planning
Exploring many possible moves or routes to find a good one. How chess engines and GPS navigation "think ahead."
Classification
Sorting things into categories — spam/not-spam, ripe/unripe. The workhorse of traditional AI, and exactly where Day 1 begins.
Regression
Predicting a number rather than a category — a house price, tomorrow's temperature.
Clustering
Finding natural groups in data without being told the answers — e.g. grouping customers by behaviour.
Decision trees
A flowchart of yes/no questions the machine learns from data. Simple, fast, and easy to read.

A useful distinction you'll hear: supervised learning (you give labelled examples — classification & regression) vs. unsupervised learning (the machine finds structure itself — clustering). Day 1 is supervised.

Watch out for
"AI means ChatGPT."

ChatGPT is one recent kind (generative). AI has existed since the 1950s and runs quietly in maps, banks, and email today.

"Generative AI replaced traditional AI."

No — traditional AI still powers most real systems. Generative is an addition, built on the same foundations.

"AI = a thinking mind."

Today's AI has no understanding or intent. It's maths finding patterns — astonishingly useful, but not conscious.

Explain it back
In one sentence, what is AI?
What's the core difference between traditional and generative AI?
Is a spam filter generative AI? Why not?
Where does "machine learning" sit inside the AI family?

Next → Now that AI has a shape, we zoom into the idea that powers all the modern stuff: instead of writing rules, we let a machine learn from examples. That's Day 1.

[Image: ripe mango]
Day 01 · Foundations

Don't program it. Teach it.

The single biggest idea in AI: machines learn from examples, not from rules we write.

The hook

Try to write exact step-by-step rules to tell a cat from a dog — without using the words "cat" or "dog." Pointy ears? Some dogs have them. Bigger? Not a Chihuahua. Every rule breaks.

The analogy

Nobody hands a child a rulebook for ripe mangoes.

You just point — "ripe… not that one… ripe" — and after a dozen examples, something clicks. That clicked-into-place pattern is a model.

Machine learning works exactly like the child at the fruit stall: examples go in, a pattern comes out. The machine writes its own rulebook — one it usually can't even put into words.

The idea

The old way

A human writes the rules; the computer obeys. Great for taxes. Hopeless for "is this a cat?"

rules → answers

The new way

Show it examples; it finds the pattern itself. The pattern it keeps is the model.

examples → pattern
Try it — teach a machine
Workbench · KNN classifier · draw, teach, then let it guess

Teach a few of each (try smiley vs. frowny), then draw a new one and hit guess.

There are no rules inside that box. It shrinks your drawing to a 16×16 sketch and compares it to every example you taught — picking the label of the ones it looks most like. More examples → better guesses.
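The matching step can be sketched in a few lines. This is an illustration under stated assumptions, not the workbench's actual code: the sketches here are made-up 3×3 grids rather than 16×16, the similarity measure is a squared pixel difference, and `k=3` neighbours get a vote.

```python
from collections import Counter

def distance(a, b):
    """Squared pixel-by-pixel difference between two flattened sketches."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def knn_guess(examples, drawing, k=3):
    """Label a new drawing by majority vote among its k most similar examples."""
    nearest = sorted(examples, key=lambda ex: distance(ex[0], drawing))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Made-up 3x3 sketches flattened to 9 pixels (1 = ink, 0 = blank).
taught = [
    ((1, 0, 1, 0, 0, 0, 1, 1, 1), "smiley"),
    ((1, 0, 1, 0, 0, 0, 0, 1, 1), "smiley"),
    ((1, 0, 1, 0, 0, 0, 1, 1, 0), "smiley"),
    ((1, 0, 1, 1, 1, 1, 0, 0, 0), "frowny"),
    ((1, 0, 1, 0, 1, 1, 0, 0, 0), "frowny"),
    ((1, 0, 1, 1, 1, 0, 0, 0, 0), "frowny"),
]

new_drawing = (1, 0, 1, 0, 0, 0, 0, 1, 0)
guess = knn_guess(taught, new_drawing)   # the three closest examples are all smileys
```

Notice there is no training step at all: the "model" is simply the list of taught examples plus a way to measure similarity.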

Watch out for
"It understands faces."

It doesn't. It's matching shapes to remembered examples — nothing more.

"More data is always better."

Only good data. Biased or sloppy examples teach a biased, sloppy pattern.

Explain it back
What's the difference between programming and machine learning?
Why couldn't we just write rules to recognise faces?
The machine made a mistake — whose fault, and how do you fix it?

Tomorrow → Today it compared whole drawings. But what is it really comparing? Everything boils down to features — a few numbers — and learning becomes finding a line between them.

[Image: doctor with stethoscope]
Day 02 · Foundations

Two numbers,
one dividing line.

Data becomes features; a model is just the boundary that separates them.

The hook

Describe every classmate with two numbers — height and hours of sleep — and guess who plays basketball. Could you draw one line that mostly separates the players? That line is today's whole lesson.

The analogy

A doctor doesn't memorise every patient.

She learns which signals matter — temperature, blood pressure — and roughly where the line sits between "fine" and "hospital." That boundary is her model.

Two things to notice: she chose which features matter (not shoe size), and she stores a boundary, not the patients. A machine-learned model is exactly the same.

The idea

Features

The measurable clues. Each example becomes a dot in a space.

The boundary

Learning = finding the line that best splits the groups.

w₁ w₂ b

The model

For a straight line, the whole "intelligence" is just three numbers.

w₁·x + w₂·y + b = 0
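To make "three numbers" concrete, here is a small sketch of a model that is nothing but w₁, w₂ and b. The values below are invented for the basketball example (x = height in metres, y = hours of sleep); a real model would learn them from the dots.

```python
def score(w1, w2, b, x, y):
    """Positive on one side of the line w1*x + w2*y + b = 0, negative on the other."""
    return w1 * x + w2 * y + b

def classify(w1, w2, b, x, y):
    return "player" if score(w1, w2, b, x, y) > 0 else "non-player"

# Hypothetical learned numbers. No training dots are stored anywhere.
W1, W2, B = 10.0, 0.5, -21.0

tall_sleeper = classify(W1, W2, B, 1.95, 8.0)    # score = 19.5 + 4.0 - 21.0 = 2.5
short_sleeper = classify(W1, W2, B, 1.60, 7.0)   # score = 16.0 + 3.5 - 21.0 = -1.5
```

The sign of the score says which side of the line a new dot falls on, which is all the classifier needs.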
Try it — find the boundary
Workbench · logistic regression · drop dots, watch it learn the line
The model is just three numbers: w₁, w₂, and the bias.

Now hit "tricky one": no straight line can separate it. Hold that thought. Fixing it is exactly why Phase 2 builds neural networks.

Watch out for
"The model remembers the dots."

No — it keeps only the boundary. Delete the dots and it still classifies.

"A line can separate anything."

The tricky example just disproved that. Real data often isn't linearly separable.

Explain it back
What's a "feature"? Give two for telling apples from oranges.
Why is the model the boundary, not the data?
The tricky example failed — what does that tell us we'll need next?

Tomorrow → You watched the line "settle" — but how did it know which way to move? It's rolling downhill on a landscape of its own mistakes. Your calculus becomes the hero.

[Image: foggy mountain]
Day 03 · Foundations

Learning is
rolling downhill.

Measure the error, feel the slope, take a step. Repeat. That's the engine of all of it.

The hook

You're blindfolded, shooting arrows. After each shot I tell you only "too high" or "too low." Can you eventually hit the target? Of course — you nudge, and shrink the nudge as you close in.

The analogy

Pretend the AI has just one knob it can turn.

Every position of that knob makes the AI a little better or worse. "Rolling downhill" just means turning the knob toward fewer mistakes.

The "valley" in the demo below is not a real place — it's a picture of that one idea. Before you touch it, here's exactly how to read the picture:

How to read the picture

Left → right

The knob — the single number the AI is allowed to change (its "weight"). Left is one setting, right is another.

Up ↑   down ↓

How wrong the AI is at that knob setting. High up = lots of mistakes. Low down = few mistakes.

The dip ↓

The knob setting that makes the fewest mistakes. That low point is the answer the AI is hunting for.

The orange ball is the AI's current guess for the knob. Like a hiker in thick fog (above), it can't see the whole curve — it only feels the slope right under it, and steps in the downhill direction (toward fewer mistakes). The size of each step is the learning rate. Roll downhill enough times and the ball settles in the dip — the AI has found its best setting.

In short: downhill = less wrong. Watch the ball, the dashed slope line, and the live equation move together as you press Step.

Try it — roll downhill
Workbench · gradient descent · drop the ball, set the step, run
Learning rate (step size): 0.25. A sensible step: the ball should glide into a valley.
Readouts: step 0 · weight w = 2.60 · error · slope
Update rule: new w = w − rate × slope

Three experiments: a tiny rate (slow), a big rate past ~1.6 (it flies off — divergence is real!), and dropping the ball left vs. right to land in a deep valley vs. a shallow trap.
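The same three experiments can be tried in code. The error curve below is an invented one with a deep valley on the left and a shallow one on the right; it is not the workbench's curve, so the learning rate at which it flies off is different from the ~1.6 quoted above.

```python
def error(w):
    """A made-up error curve: deep valley near w = -1.30, shallow one near w = 1.13."""
    return w**4 - 3 * w**2 + w

def slope(w):
    """Derivative of the error curve."""
    return 4 * w * w * w - 6 * w + 1

def descend(w, rate, steps):
    for _ in range(steps):
        w = w - rate * slope(w)      # new w = w − rate × slope
    return w

right = descend(2.0, rate=0.05, steps=200)    # rolls into the shallow trap
left = descend(-2.0, rate=0.05, steps=200)    # rolls into the deep valley
runaway = descend(2.0, rate=0.5, steps=3)     # step too big: w shoots far away
```

Starting on the right lands in the shallow valley even though a lower one exists, and the large rate overshoots further with every step instead of settling.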

Watch out for
"It always finds the best answer."

Not guaranteed — local minima and bad step sizes can trap or destabilise it.

"Bigger rate = faster learning."

Only up to a point. Past it you get worse, not faster — overshoot.

Explain it back — Phase 1 capstone
Walk through one learning step using error, slope, and learning rate.
Why can too large a learning rate make things worse?
Local vs. global minimum — why does the starting point matter?
Tie it together: how does a pile of examples become a trained model?

Phase 2 → One straight line couldn't split the tricky data. The fix: stack many simple weighted decisions into a network, and train the whole thing by rolling downhill — in millions of dimensions. That unit is the neuron.

Day 04 · Neural Networks

The neuron:
a weighted vote.

The whole of deep learning is built from one tiny, simple unit. Here it is.

The analogy

Picture a panel of judges scoring a dish.

Each judge cares about different things — one weighs taste heavily, another barely cares. You add up their weighted opinions and decide. A neuron is exactly one such judge.

A neuron takes its inputs, multiplies each by a weight (how much it trusts that input), adds them up with a bias, and passes the total through a switch (the activation) that decides how strongly it fires.

The idea

Weights

How much each input matters. Big weight = "I trust this a lot."

Sum + bias

Add the weighted inputs, nudge with a bias.

w₁x₁ + w₂x₂ + b

Activation

A switch that turns the sum into an output. In this lesson it squashes the sum into a value between 0 and 1.
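Those three steps fit in one small function. This sketch assumes the switch is a sigmoid, one common choice that gives outputs between 0 and 1; the input values and weights are made up.

```python
import math

def neuron(inputs, weights, bias):
    """Multiply each input by its weight, add them up with the bias, apply the switch."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-total))     # sigmoid: squashes any sum into 0..1

on_the_line = neuron([0.0, 0.0], [1.0, 1.0], 0.0)    # sum is 0, so it fires at exactly 0.5
confident = neuron([1.0, 1.0], [2.0, 2.0], -1.0)     # sum is 3, so it fires well above 0.5
```

Every point where the weighted sum is exactly zero gives an output of 0.5, and together those points form the straight boundary line.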

Try it — tune a neuron
Workbench · one neuron · drag the weights, watch the boundary turn

Blue and red dots are two classes. The line is where the neuron "fires at 50%." Tune the weights so the line separates them.

Watch out for
"A neuron is like a brain cell."

Loosely inspired by one, but it's just multiply-add-and-switch. Pure arithmetic.

"One neuron is powerful."

Alone it's just Day 2's straight line again. The power comes from stacking many — next lesson.

Explain it back
What are the three steps inside a neuron?
What does a weight represent?
Why is one neuron no stronger than the Day 2 line?

Next → One neuron draws one line. Wire many together in layers and they can trace any curve — and learn it automatically by sending the error backwards.

Day 05 · Neural Networks

How a network
learns.

Stack neurons, push data through, then send the mistake backwards. Watch it learn a curve live.

The analogy

A relay team passing a baton — and passing back the blame.

Data runs forward through the layers to make a guess. Then the error runs backward, telling each runner how they contributed to the miss, so everyone adjusts.

Forward pass = make a prediction. Backpropagation = trace the error back through every weight so each one knows which way to nudge. It's the chain rule from calculus, applied layer by layer — then it's just gradient descent (Day 3) on all of them at once.

The idea

Forward pass

Inputs flow through the layers; an answer pops out.

Backprop

The error flows back, assigning blame to each weight.

Repeat

Do it over the data many times (epochs). The fit improves.
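Forward, backward, repeat can be written out for a tiny network in plain Python. This is a sketch under assumptions: one input, 8 hidden neurons with a tanh switch, one output, squared error, and a sine curve as the target. The workbench's own network and target may differ.

```python
import math
import random

def init_params(hidden_size=8, seed=0):
    rng = random.Random(seed)
    def rand_list():
        return [rng.uniform(-1.0, 1.0) for _ in range(hidden_size)]
    return [rand_list(), rand_list(), rand_list(), 0.0]   # w1, b1, w2, b2

def forward(params, x):
    """Forward pass: input -> hidden layer (tanh switch) -> one output."""
    w1, b1, w2, b2 = params
    hidden = [math.tanh(w * x + b) for w, b in zip(w1, b1)]
    output = sum(w * h for w, h in zip(w2, hidden)) + b2
    return hidden, output

def backward(params, x, target):
    """Backward pass: the chain rule gives every weight's slope of the squared error."""
    w1, b1, w2, b2 = params
    hidden, output = forward(params, x)
    d_out = 2.0 * (output - target)                  # slope of (output - target)^2
    grad_w2 = [d_out * h for h in hidden]
    d_hidden = [d_out * w * (1.0 - h * h) for w, h in zip(w2, hidden)]  # back through tanh
    grad_w1 = [d * x for d in d_hidden]
    return grad_w1, d_hidden, grad_w2, d_out         # same order as params

def train(params, data, rate=0.01, epochs=500):
    params = [list(params[0]), list(params[1]), list(params[2]), params[3]]
    for _ in range(epochs):                          # one epoch = one sweep of the data
        for x, target in data:
            g = backward(params, x, target)
            for i in range(3):
                params[i] = [p - rate * s for p, s in zip(params[i], g[i])]
            params[3] -= rate * g[3]
    return params

def mean_squared_error(params, data):
    return sum((forward(params, x)[1] - t) ** 2 for x, t in data) / len(data)

curve = [(i / 5.0, math.sin(i / 5.0)) for i in range(-10, 11)]   # target curve samples
start = init_params()
trained = train(start, curve)
```

Comparing `mean_squared_error(start, curve)` with `mean_squared_error(trained, curve)` shows the fit improving. The update inside `train` is the Day 3 rule applied to every weight at once.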

Try it — watch a network learn a curve
Workbench · tiny neural network · 8 hidden neurons learning to trace the curve

The grey curve is the target. The coloured line is the network's current guess — random at first, then bending to fit as the error (loss) drops.

Watch out for
"Backprop is magic."

It's the chain rule — a tidy way to compute every weight's slope at once.

"More neurons = always better."

Too many and it memorises noise instead of the pattern — overfitting.

Explain it back
What happens in the forward pass vs. the backward pass?
How does backprop connect to Day 3's "rolling downhill"?
What is an epoch?

Next → Our network has one hidden layer. Stack many layers — go "deep" — and something remarkable happens: the layers start inventing their own concepts, simple to complex.

Day 06 · Neural Networks

Going deep.

Why stacking layers lets a network build meaning — from edges, to shapes, to "that's a face."

The analogy

An assembly line, or building with Lego.

The first station sees only edges. The next combines edges into shapes. The next, shapes into parts — an eye, a wheel. The last says "that's a cat."

"Deep" just means many layers. Each layer composes the previous one's output into something more abstract. Nobody tells it what an edge or an eye is — it invents these useful features itself. That's representation learning.

Try it — see what the layers see
Workbench · layer feature viewer · draw, then watch the first layer detect edges

Draw a shape or letter above.

This is a real edge-detector — exactly the kind of filter a first layer learns. It lights up where your drawing has an edge in the chosen direction. Deeper layers combine these into shapes and objects.
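An edge filter is small enough to write by hand. The sketch below slides a 3×3 Sobel filter over a tiny made-up image. Sobel is a classic hand-designed filter used here as a stand-in; a trained first layer learns its own filters, which often look similar.

```python
VERTICAL_EDGE = [[-1, 0, 1],
                 [-2, 0, 2],
                 [-1, 0, 1]]      # responds where brightness changes left to right

HORIZONTAL_EDGE = [[-1, -2, -1],
                   [0, 0, 0],
                   [1, 2, 1]]     # responds where brightness changes top to bottom

def apply_filter(image, kernel):
    """Slide the 3x3 filter over the image; large absolute values mark an edge."""
    rows, cols = len(image), len(image[0])
    out = []
    for r in range(rows - 2):
        row = []
        for c in range(cols - 2):
            row.append(sum(kernel[i][j] * image[r + i][c + j]
                           for i in range(3) for j in range(3)))
        out.append(row)
    return out

# Made-up 5x6 image: dark left half, bright right half (one vertical edge).
image = [[0, 0, 0, 1, 1, 1] for _ in range(5)]

vertical_response = apply_filter(image, VERTICAL_EDGE)      # lights up along the edge
horizontal_response = apply_filter(image, HORIZONTAL_EDGE)  # stays at zero everywhere
```

The vertical filter lights up only in the columns that straddle the edge, while the horizontal filter sees nothing, which is what "an edge in the chosen direction" means.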

Watch out for
"We program it to find eyes."

We don't. We only set up the layers and the goal; the features emerge from training.

"Deeper is always better."

Depth adds power but also cost and trickier training. It's a balance.

Explain it back — Phase 2 capstone
What does each successive layer do to the previous layer's output?
Who decides what an "edge" or "eye" detector is?
Tie Phase 2 together: neuron → network → deep network.

Phase 3 → So far, inputs were numbers and dots. But how does a network read words? The trick is to turn every word into a list of numbers — a vector — arranged so meaning becomes geometry.

Day 07 · Transformers

Words become
points in space.

Turn every word into numbers, and meaning turns into geometry you can do maths on.

The analogy

A map — but of meaning instead of places.

On a real map, related cities sit close together. On a meaning map, "king" and "queen" are neighbours, and the arrow from "man → woman" is parallel to "king → queen".

Each word becomes a vector — a point in space. Direction and distance encode meaning. The astonishing part: this lets you do arithmetic on meaning. king − man + woman lands right on queen.
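A toy version of that arithmetic, assuming hand-placed 2D vectors. These coordinates are invented so the picture works cleanly; real embeddings are learned, have hundreds of dimensions, and usually land near the answer rather than exactly on it.

```python
import math

# Made-up 2D "meaning map": first number ~ royalty, second ~ gender.
WORDS = {
    "man": (1.0, 1.0),
    "woman": (1.0, 3.0),
    "king": (4.0, 1.0),
    "queen": (4.0, 3.0),
    "apple": (-3.0, 0.5),
}

def analogy(a, b, c):
    """Return the word nearest to a - b + c, ignoring the three input words."""
    target = tuple(x - y + z for x, y, z in zip(WORDS[a], WORDS[b], WORDS[c]))
    candidates = [w for w in WORDS if w not in (a, b, c)]
    return min(candidates, key=lambda w: math.dist(WORDS[w], target))

result = analogy("king", "man", "woman")   # (4, 1) - (1, 1) + (1, 3) = (4, 3)
```

It works because the arrow from "man" to "woman" is the same as the arrow from "king" to "queen", so adding it to "king" lands on "queen".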

Try it — explore the meaning map
Workbench · embedding explorer · tap words; try the analogy
Tap two related words to see the arrow between them — or run the analogy below.
Watch out for
"The map is really 2D."

Real embeddings have hundreds of dimensions. We flatten to 2D just to see it.

"Someone places the words."

No. The positions are learned from billions of sentences, based on which words tend to appear together.

Explain it back
What is a word embedding, in one sentence?
Why can "king − man + woman ≈ queen" possibly work?
What does distance on the map mean?

Next → If words live as points in space, how does a machine actually find the closest ones? That's the trick behind every "search by meaning" — and how AIs remember things in a vector database.

Day 07½ · Transformers

Searching
by meaning.

Two ideas — cosine similarity and the vector database — turn "find related stuff" into pure geometry.

The hook

Ask ChatGPT "what did we talk about last week?" and it doesn't remember in the human sense. When a chat assistant has long-term memory, it typically searches, by meaning, through a big pile of stored vectors. It's the same kind of trick a search engine like Google can use to find a page that doesn't even contain your exact words.

The analogy

A library where books aren't shelved by title — they're shelved by what they're about.

You walk in holding a book on "how puppies learn tricks". The librarian doesn't look at the words on the cover. She walks to the spot where that book belongs, and hands you every neighbour — a book on dog training, one on animal behaviour, one on rewards. None of them share your title. All of them share your meaning.

Idea 1 — An embedding is just a list of numbers

On Day 7 we drew words as dots on a 2D map. That was a lie of convenience. A real embedding is a list of hundreds of numbers — often 768 or 1536. Each number is a learned "feature direction" that no human ever named.

Workbench · what a real embedding looks like · each row is one word
king →  [ … 765 more numbers ]
queen →  [ … 765 more numbers ]
apple →  [ … 765 more numbers ]

No one wrote those numbers. The model learned them from billions of sentences — by nudging the values until words that appear in similar contexts ended up at similar coordinates. That's all "training an embedding" is.

Idea 2 — Cosine similarity: the angle between two arrows

If a word is a point, the line from origin to that point is an arrow. To ask "are these two words similar?" we ask: do their arrows point the same way?

Same direction → similarity = 1.0.
Perpendicular → similarity = 0.
Opposite directions → similarity = −1.

The formula looks scary but it's just the cosine of the angle between the two arrows:

cos(θ)  =  (a · b) / (|a| · |b|)

Why angle and not plain distance? Because for comparing meaning, the direction of an embedding vector matters far more than its length. Two arrows pointing the same way are "about the same thing", even if one is twice as long.
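The formula translates directly into code. The arrows below are made-up 2D examples chosen to hit the three landmark values.

```python
import math

def cosine_similarity(a, b):
    """cos(θ) = (a · b) / (|a| · |b|)"""
    dot = sum(x * y for x, y in zip(a, b))
    length_a = math.sqrt(sum(x * x for x in a))
    length_b = math.sqrt(sum(y * y for y in b))
    return dot / (length_a * length_b)

same_direction = cosine_similarity((1.0, 0.0), (2.0, 0.0))    # twice as long, same way
perpendicular = cosine_similarity((1.0, 0.0), (0.0, 3.0))
opposite = cosine_similarity((1.0, 0.0), (-4.0, 0.0))
```

Dividing by both lengths is what removes length from the comparison, so only the direction of each arrow affects the score.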

Workbench · feel the angle · drag either arrow
Drag the purple arrow or the teal arrow. Watch the angle and the cosine update.
Idea 3 — A vector database is just "sort by cosine"

Now glue the two ideas together. A vector database is a big table where each row is some piece of text (a sentence, a doc, a memory) and its embedding. When you ask it a question:

  1. Turn your question into an embedding (one arrow).
  2. Compute the cosine similarity against every stored embedding.
  3. Return the top few highest scores.

That's it. No keyword matching. No clever indexing magic at the concept level — just "who points the most in my direction?"
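The three steps above can be sketched as a brute-force search. The stored texts and their 3-number embeddings are invented for illustration, and step 1 is skipped: the query vector is supplied by hand, whereas a real system would get it from an embedding model.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def search(store, query_vector, top_k=3):
    """Score every stored row against the query, then return the top few."""
    scored = [(cosine(query_vector, vector), text) for text, vector in store.items()]
    scored.sort(reverse=True)
    return scored[:top_k]

# Made-up table: text -> embedding (real ones have hundreds of numbers).
STORE = {
    "dog training basics": (0.9, 0.8, 0.1),
    "animal behaviour": (0.8, 0.6, 0.2),
    "rewards and habits": (0.6, 0.9, 0.1),
    "tax return deadlines": (0.0, 0.1, 0.9),
}

query = (0.9, 0.7, 0.1)    # hypothetical embedding of "how puppies learn tricks"
results = search(STORE, query)
```

None of the top results share a word with the query text; they rank first only because their arrows point the same way.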

Workbench · a tiny vector DB · tap a query word

The DB knows these 12 words. Tap one to use it as your query — the rest get ranked by cosine similarity to it:

Pick a query word to see the top matches and the actual cosine numbers.
Watch out for
"Cosine is the same as distance."

It's not. Two arrows can differ hugely in length yet point the same way (cosine = 1). Cosine only cares about direction.

"A vector DB understands the words."

It doesn't read anything. It just does arithmetic on numbers an embedding model handed it.

"The numbers in an embedding mean something specific."

Individual dimensions usually mean nothing a human can name. Only the overall direction has meaning.

Explain it back
Why use the angle between arrows instead of the straight-line distance?
In one sentence, what does a vector database do when you give it a query?
If cosine(a, b) = 1, what does that say about words a and b?

Next → A word's embedding is fixed the moment it's looked up — but the same word means different things in different sentences. So before any searching, the model needs to reshape each word by its neighbours. That's context.

Day 08 · Transformers

Context changes
everything.

The same word means different things depending on its neighbours. Watch meaning shift.

The hook

Say "I never said she stole the money" out loud, stressing a different word each time. Seven stresses, seven meanings — same words. Context is everything.

The analogy

The word "bank" alone tells you nothing.

Add "river" and it's a riverbank. Add "money" and it's a vault. The surrounding words decide which meaning wins.
Try it — flip the meaning
Workbench · context disambiguator · add a context word, watch "bank" move

The word in focus: bank. Tap a context word to add it to the sentence:

Watch out for
"Each word has one fixed meaning."

In a transformer, a word's representation is reshaped by its context every time.

Explain it back
Give a word whose meaning flips with context.
Why isn't a fixed embedding (Day 7) enough on its own?

Next → If every word must look at the others to settle its meaning, we need a mechanism for that looking. That mechanism — attention — is the heart of the transformer.

Day 09 · Transformers · ★

Attention:
the spotlight.

The key idea behind today's language models. Each word asks: which other words should I look at?

The analogy

A spotlight in a dark theatre.

Reading "…because it was too heavy," your eyes dart back to find what "it" means. You shine a spotlight on the few words that matter right now. The machine learns where to aim it.

Each word sends out a question, every word offers a key, and the match scores decide how much attention to pay. High match → look here. It's dot products → softmax → a weighted blend.
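Dot products, softmax, weighted blend: here is that chain as a minimal sketch. The word vectors are hand-built and the same vector serves as both key and value. A real transformer learns separate question, key and value vectors and scales the scores, which this sketch leaves out.

```python
import math

def softmax(scores):
    peak = max(scores)                        # subtract the max for numerical safety
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]   # match
    weights = softmax(scores)                                           # soften
    blend = [sum(w * v[d] for w, v in zip(weights, values))             # blend
             for d in range(len(values[0]))]
    return weights, blend

# Made-up vectors for three words that "it" could look at.
words = ["suitcase", "table", "heavy"]
vectors = [(2.0, 0.0), (0.0, 2.0), (1.0, 0.0)]
it_query = (1.5, 0.2)                         # hypothetical question sent out by "it"

weights, blend = attend(it_query, vectors, vectors)
focus = words[weights.index(max(weights))]
```

Softmax is why the attention percentages always add up to 100%: it divides each score's exponential by the total.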

Try it — watch the spotlight
Workbench · attention visualizer · tap a word; predict where it looks first

Tap a word above. Before the lines appear — guess which word it reaches for.

Keeping it honest: a real transformer learns its own meaning-vectors. Here each word has a small hand-built vector so the mechanism is visible — but the engine (match → soften → blend) is the genuine article.

Watch out for
"Attention = understanding."

It's a weighted average steered by learned matches. Powerful, but still arithmetic.

Explain it back
What does attention let each word do?
What are the "question" and "key" doing?
Why do the percentages add up to 100%?

Next → Stack many attention layers, train on a vast amount of text from the internet, and you get a machine that predicts the next word brilliantly: a large language model.

Day 10 · Transformers

How an LLM
writes.

It's autocomplete that read the whole library: predict the next word, add it, repeat.

The analogy

A very well-read improv actor.

Given everything said so far, it picks the most fitting next word — then reads that back in and picks the next. One word at a time, on a loop.

A transformer stacks attention layers to read all the context, then predicts the next token. Temperature controls the choice: low plays it safe and predictable; high gets adventurous and creative.
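Temperature is just a division before the softmax. In this sketch the candidate words and their scores are made up; a real model produces a score for every token in its vocabulary.

```python
import math
import random

def next_word_probabilities(scores, temperature=1.0):
    """Softmax over scores divided by the temperature (which must be above 0)."""
    scaled = {word: s / temperature for word, s in scores.items()}
    peak = max(scaled.values())
    exps = {word: math.exp(s - peak) for word, s in scaled.items()}
    total = sum(exps.values())
    return {word: e / total for word, e in exps.items()}

def pick_next_word(scores, temperature, rng):
    probs = next_word_probabilities(scores, temperature)
    return rng.choices(list(probs), weights=list(probs.values()), k=1)[0]

scores = {"mat": 3.0, "sofa": 2.0, "moon": 0.5}    # made-up scores for the next word

cold = next_word_probabilities(scores, temperature=0.2)   # almost always "mat"
hot = next_word_probabilities(scores, temperature=2.0)    # the others get a real chance
word = pick_next_word(scores, 2.0, random.Random(0))
```

A low temperature stretches the gaps between scores so the favourite dominates, and a high temperature shrinks them so less likely words get picked more often.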

Try it — predict the next word
Workbench · next-token predictor · pick a start, set the temperature, generate
Watch out for
"It plans the whole sentence."

It only ever predicts the next word, then re-reads everything and predicts again.

"It looks up true facts."

It predicts plausible text. That's why it can sound confident yet be wrong — a hallucination.

Explain it back — Phase 3 capstone
How does an LLM generate a sentence?
What does temperature change?
Tie Phase 3 together: embeddings → context → attention → next-token.

Phase 4 → A model that writes is amazing — but it only talks. Next we give it tools, memory, and a loop, turning the talker into a doer: an agent.

Sandbox · End to End

Words → vectors →
attention → next word.

A single live wiring diagram of everything Phase 3 taught. Change the sentence — watch every stage update.

Pick a sentence
Stage 0 · input · or type your own using the vocab below
Stage 1 · Tokenize

Real LLMs split text into "tokens" — often word-pieces. Our toy model uses whole words. The order matters: each token gets a position.

Stage 2 · Embedding lookup (the vector DB)

Each token is looked up in a giant table — the embedding matrix. This is the vector DB: row = token, columns = the 8 numbers that define its meaning. (Real models use 768 or more.) Each strip below is one row of the DB.

Stage 2 · embedding rows (Layer 0) · colour = number value
Each number is a learned "feature direction". teal = positive, purple = negative, white = near zero.
Stage 3 · Cosine similarity matrix

Now the model asks every token: "which other tokens are you most similar to?" Cosine similarity between every pair. The hotter the cell, the more aligned the meaning.

Stage 3 · pairwise cosine · hover a cell for the number
Click a row's label to use that token as the attention query below.
Stage 4 · Attention (softmax of cosines)

Cosines become attention weights by passing through a softmax: the top scores get amplified, the rest get squashed. The query token then gathers a weighted mix of every other token's vector.

Stage 4 · attention from one token · pick a query token
Click any token's row label in the matrix above to make it the query.
Stage 5 · Multi-layer vector DB

A transformer doesn't run attention once — it stacks layers. Each layer takes the previous layer's vectors and re-mixes them through attention. The same word's vector changes at every layer, getting richer with context. That's why "bank" near "river" ends up very different from "bank" near "money".

Stage 5 · the same tokens, three layers deep · watch vectors morph downward
Layer 0 = pure dictionary lookup. Layer 1 & 2 = each token's vector is now a context-aware blend. Notice how identical words drift apart when they sit in different contexts.
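Here is a rough sketch of that drift, following the toy recipe of this sandbox (softmax of cosines, then a weighted blend) rather than a real transformer. The three 2D vectors are invented, and each token is allowed to attend to itself, which is an assumption.

```python
import math

EMBED = {"river": (2.0, 0.0), "money": (0.0, 2.0), "bank": (1.0, 1.0)}  # made-up vectors

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def softmax(scores):
    exps = [math.exp(s) for s in scores]      # cosines lie in [-1, 1], so exp is safe
    total = sum(exps)
    return [e / total for e in exps]

def attention_layer(vectors):
    """Every token becomes a softmax-of-cosines blend of all tokens (itself included)."""
    mixed = []
    for query in vectors:
        weights = softmax([cosine(query, key) for key in vectors])
        mixed.append(tuple(sum(w * v[d] for w, v in zip(weights, vectors))
                           for d in range(len(query))))
    return mixed

def bank_in_context(words, layers=2):
    vectors = [EMBED[w] for w in words]       # layer 0: plain dictionary lookup
    for _ in range(layers):
        vectors = attention_layer(vectors)
    return vectors[words.index("bank")]

near_river = bank_in_context(["river", "bank"])   # pulled toward the "river" direction
near_money = bank_in_context(["money", "bank"])   # pulled toward the "money" direction
```

Both sentences start from the same dictionary row for "bank", yet after two layers the two vectors point in different directions because each has mixed in its neighbour.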
Stage 6 · Generation (next-word)

To produce the next word, the model takes the last token's vector from the top layer and compares it against every word in the vocabulary — cosine again. Softmax over those scores gives a probability distribution. Sample from it: that's your next word.

Stage 6 · next-word probabilities · top of the distribution
What you just watched
Vector DB = lookup table.

Storing embeddings is literally rows of numbers, one per token, plus a similarity function.

Cosine ranks meaning. Softmax picks winners.

Cosine measures alignment; softmax turns raw scores into clean weights that sum to 1.

Layers stack the same trick.

One attention pass doesn't see enough. Stacking 12, 24, or even 96 of them is a big part of what makes modern LLMs powerful.