Learning Objectives
By completing this tutorial, you will:
- Understand how to build a transformer step-by-step: from tokenization -> embedding -> transformer block -> output logits.
- Implement a transformer with a single attention layer and no MLP layer.
- Visualize how attention works.
In this tutorial, you will implement the following:
1. Tokenizer
Transform text into numbers that the model can process.
2. Embeddings
- Token embeddings: Learn representations for each token.
- If we have time: Sinusoidal positional encoding, which tells the model about token order.
3. Single-Head Causal Self-Attention
- Query (W_Q), Key (W_K), Value (W_V) projections
- Attention scores with causal masking
- Softmax normalization
- Simple residual connections
4. Output Layer
Linear transformation to predict the next token.
Imports & helpers
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
np.random.seed(42)
1. The Tokenizer
Before a neural network can understand text, we need to convert words into numbers. This process, called tokenization, is the first step in training any language model, from BERT to GPT.
Below, we implement a super simple tokenizer.
def tokenize(text):
"""Split text into tokens, handling punctuation as separate tokens."""
# Basic lowercase
text = text.lower()
# Add spaces around punctuation
punct = '.,!?;:"()[]{}'
for p in punct:
text = text.replace(p, f' {p} ')
# Split on whitespace
return text.split()
def build_vocab(texts, min_freq=1):
from collections import Counter
tokens = []
for t in texts:
tokens.extend(tokenize(t))
counts = Counter(tokens)
# Keep tokens with frequency >= min_freq
vocab = ['<PAD>', '<UNK>'] + [tok for tok, c in counts.items() if c >= min_freq]
token_to_id = {tok: i for i, tok in enumerate(vocab)}
id_to_token = {i: tok for tok, i in token_to_id.items()}
return token_to_id, id_to_token, vocab
def encode(text, token_to_id):
return [token_to_id.get(tok, token_to_id['<UNK>']) for tok in tokenize(text)]
def decode(ids, id_to_token):
return ' '.join([id_to_token.get(i, '<UNK>') for i in ids])
Try it out :)
# Example usage
texts = ["Hello, world!", "Hello there.", "This is an example!"]
token_to_id, id_to_token, vocab = build_vocab(texts)
print(f"Vocabulary size: {len(vocab)}")
print("Tokens:", vocab[:20])
encoded = encode("Hello world!", token_to_id)
print("Encoded:", encoded)
print("Decoded:", decode(encoded, id_to_token))
Next, let's build a tiny astronomy corpus and encode it with our tokenizer.
texts = [
"O B A F G K M O B A F G K M O B A F G K M",
"The Sun is a G type star in the Milky Way.",
"Betelgeuse is an M type star.",
"Stars like the Sun are main-sequence G stars.",
"Sirius is a bright A type star.",
"Proxima Centauri is an M dwarf.",
"Vega is an A type star in Lyra.",
"Rigel is a B type supergiant.",
]
token_to_id, id_to_token, vocab = build_vocab(texts, min_freq=1)
print(f"Vocab size: {len(vocab)}")
print("Sample tokens:", vocab[:20])
def batch_encode(texts, token_to_id):
return [encode(t, token_to_id) for t in texts]
encoded_corpus = batch_encode(texts, token_to_id)
for i, (t, e) in enumerate(zip(texts, encoded_corpus)):
print(i, t)
print(" ->", e)
Let's compare our tokenizer with the one used by the GPT-2 model. Try tokenizing some funky words like "unbelievable" or "artificial". What do you notice?
# Install: pip install tiktoken
import tiktoken
def demo_gpt2_tokenizer(sentence):
# Use GPT-2's tokenizer
enc = tiktoken.get_encoding("gpt2")
tokens = enc.encode(sentence)
print("Token IDs:", tokens)
print("Decoded:", enc.decode(tokens))
demo_gpt2_tokenizer("Transformers are amazing!")
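To see why the output differs from our word-level tokenizer, it helps to decode each GPT-2 token ID back to its text piece. A small optional sketch (the example words are just suggestions):

# GPT-2 uses byte-pair encoding, so rare or long words are often split into
# several subword pieces, while common words usually map to a single token.
gpt2_enc = tiktoken.get_encoding("gpt2")
for word in ["unbelievable", "artificial", "star"]:
    pieces = [gpt2_enc.decode([t]) for t in gpt2_enc.encode(word)]
    print(f"{word!r} -> {pieces}")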
2. Embeddings
Remember our tokens, like [5, 87, 42]? These numbers don't mean anything to a neural network yet. We need to convert each token into a vector that can capture meaning.
The Problem with Order: Transformers process all tokens at once, but word order matters! "The cat ate the mouse" ≠ "The mouse ate the cat" 🐱🐭
A solution is to add position information to each embedding using sinusoidal positional encoding.
In simple terms:
- Even dimensions use sin, odd dimensions use cos
- Each dimension oscillates at a different frequency
- This creates a unique "fingerprint" for each position!
Final embedding = Token embedding + Positional encoding
def one_hot_encode(ids, vocab_size):
arr = np.zeros((len(ids), vocab_size), dtype=np.float32)
for i, idx in enumerate(ids):
arr[i, idx] = 1.0
return arr
def embed(ids, vocab_size, d_model, seed=42):
np.random.seed(seed)
W_emb = np.random.randn(vocab_size, d_model).astype(np.float32) * 0.01
return W_emb[ids], W_emb # return embeddings and the table
# Demo
ids = encode("The Sun is a G type star", token_to_id)
X_onehot = one_hot_encode(ids, len(vocab))
X_emb, W_emb = embed(ids, len(vocab), d_model=16)
print("One-hot shape:", X_onehot.shape)
print("Emb shape:", X_emb.shape)
Sinusoidal Positional Encoding
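For position pos and embedding dimensions 2i and 2i+1, the encoding implemented below is:

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$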
def sinusoidal_position_encoding(seq_len, d_model):
pos = np.arange(seq_len)[:, None] # (seq_len, 1)
i = np.arange(d_model)[None, :] # (1, d_model)
angles = pos / np.power(10000, (2 * (i // 2)) / d_model)
enc = np.zeros((seq_len, d_model), dtype=np.float32)
enc[:, 0::2] = np.sin(angles[:, 0::2]) # even indices
enc[:, 1::2] = np.cos(angles[:, 1::2]) # odd indices
return enc
# Quick visualization helper
def plot_positional_encoding(enc, title="Positional Encoding"):
plt.figure(figsize=(8, 3))
plt.imshow(enc.T, aspect='auto', origin='lower')
plt.colorbar()
plt.title(title)
plt.xlabel("Position")
plt.ylabel("Dimension")
plt.tight_layout()
enc = sinusoidal_position_encoding(seq_len=50, d_model=16)
plot_positional_encoding(enc)
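As stated above, the final input to the model is the token embedding plus the positional encoding. Here is a quick sketch combining the two for the earlier demo sentence (for illustration only; the attention demo below keeps using the raw token embeddings):

# Final embedding = token embedding + positional encoding.
pos_enc = sinusoidal_position_encoding(seq_len=X_emb.shape[0], d_model=16)
X_in = X_emb + pos_enc
print("Token embeddings:    ", X_emb.shape)
print("Positional encodings:", pos_enc.shape)
print("Combined input:      ", X_in.shape)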
3. Single-Head Causal Self-Attention
We implement a minimal, readable attention block:
- Linear projections for queries (Q), keys (K), and values (V)
- Scaled dot-product attention with causal masking
- Softmax normalization
- Residual connection
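Compactly, with M denoting the causal mask (zero on and below the diagonal, a large negative value above it), the block computes:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_{\text{model}}}} + M\right)V$$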
def causal_mask(seq_len):
return np.triu(np.ones((seq_len, seq_len), dtype=np.float32), k=1) * -1e9
def softmax(x, axis=-1):
x = x - np.max(x, axis=axis, keepdims=True)
e = np.exp(x)
return e / np.sum(e, axis=axis, keepdims=True)
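Before wiring these helpers into the attention class, it is worth looking at what the mask does on its own. For a length-4 sequence, every row of the softmaxed mask spreads its weight only over the current and earlier positions:

# The mask places -1e9 above the diagonal; after softmax those entries become ~0,
# so each position can only attend to itself and earlier positions.
print(causal_mask(4))
print(softmax(causal_mask(4), axis=-1))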
class SelfAttention:
def __init__(self, d_model, seed=42):
rng = np.random.default_rng(seed)
self.W_q = rng.normal(scale=0.02, size=(d_model, d_model)).astype(np.float32)
self.W_k = rng.normal(scale=0.02, size=(d_model, d_model)).astype(np.float32)
self.W_v = rng.normal(scale=0.02, size=(d_model, d_model)).astype(np.float32)
self.scale = np.sqrt(d_model).astype(np.float32)
def __call__(self, X):
"""
X: (T, d_model) where T is sequence length
Returns: (T, d_model)
"""
Q = X @ self.W_q
K = X @ self.W_k
V = X @ self.W_v
scores = (Q @ K.T) / self.scale # (T, T)
scores = scores + causal_mask(scores.shape[0]) # prevent attending to future tokens
attn = softmax(scores, axis=-1) # (T, T)
out = attn @ V # (T, d_model)
return out, attn
# Demo on a short sequence
ids = encode("the sun is a g type star", token_to_id)
X, _ = embed(ids, len(vocab), d_model=16)
attn_block = SelfAttention(d_model=16)
out, attn = attn_block(X)
print("Input shape:", X.shape)
print("Output shape:", out.shape)
print("Attention shape:", attn.shape)
def plot_attention(attn, tokens):
plt.figure(figsize=(5, 4))
plt.imshow(attn, origin='lower', aspect='auto')
plt.xticks(range(len(tokens)), tokens, rotation=90)
plt.yticks(range(len(tokens)), tokens)
plt.colorbar(label="Attention weight")
plt.tight_layout()
tokens = [id_to_token[i] for i in ids]
plot_attention(attn, tokens)
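Two quick checks on the attention matrix: each row is a softmax output, so it should sum to 1, and the causal mask should leave everything above the diagonal at (numerically) zero.

# Each row of attn is a probability distribution; future positions carry no weight.
print("Rows sum to 1:", np.allclose(attn.sum(axis=-1), 1.0))
print("Future masked:", np.allclose(np.triu(attn, k=1), 0.0))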
Add a simple residual connection
class ResidualSelfAttention(SelfAttention):
def __call__(self, X):
out, attn = super().__call__(X)
return X + out, attn
res_block = ResidualSelfAttention(d_model=16)
out_res, attn_res = res_block(X)
print("Residual output shape:", out_res.shape)
4. Output Layer -> Logits for Next-Token Prediction
A small linear layer maps from the model dimension back to the vocabulary size to produce logits over the next token.
class OutputHead:
def __init__(self, d_model, vocab_size, seed=42):
rng = np.random.default_rng(seed)
self.W = rng.normal(scale=0.02, size=(d_model, vocab_size)).astype(np.float32)
self.b = np.zeros((vocab_size,), dtype=np.float32)
def __call__(self, X):
# X: (T, d_model)
return X @ self.W + self.b
head = OutputHead(d_model=16, vocab_size=len(vocab))
logits = head(out_res) # (T, vocab_size)
print("Logits shape:", logits.shape)
print("Sample logits row (first token):", logits[0][:10])
5. Putting It Together: A Tiny Forward Pass
def tiny_forward(text, token_to_id, d_model=16, seed=42):
# Encode + embed
ids = encode(text, token_to_id)
X, _ = embed(ids, len(vocab), d_model=d_model)
# Attention block with residual
attn_block = ResidualSelfAttention(d_model=d_model, seed=seed)
H, attn = attn_block(X)
# Output head
head = OutputHead(d_model=d_model, vocab_size=len(vocab), seed=seed)
logits = head(H)
return ids, logits, attn
ids, logits, attn = tiny_forward("the sun is a g type star", token_to_id, d_model=16)
print("Forward shapes -> ids:", len(ids), "| logits:", logits.shape, "| attn:", attn.shape)
def predict_next_token(ids, logits, id_to_token, top_k=5):
# Use the last position
last_logits = logits[-1]
# Softmax for probabilities (for reporting)
probs = softmax(last_logits)
# Top-k indices
top_idx = np.argsort(-probs)[:top_k]
return [(id_to_token[i], float(probs[i])) for i in top_idx]
preds = predict_next_token(ids, logits, id_to_token, top_k=5)
for tok, p in preds:
print(f"{tok:>12s} {p:.4f}")
6. Training Sketch: Cross-Entropy
Below is a simple sketch of token-level cross-entropy over the next-token prediction. This is for pedagogy; it is not optimized.
def cross_entropy_loss(logits, targets):
# logits: (T, V), targets: (T,)
# Compute softmax and negative log-likelihood
probs = softmax(logits, axis=-1)
n = logits.shape[0]
eps = 1e-9
log_probs = -np.log(probs[np.arange(n), targets] + eps)
return float(np.mean(log_probs))
# Create a dummy target: shift by one (predict the next token)
targets = np.array(ids[1:] + [ids[-1]], dtype=np.int64)
loss = cross_entropy_loss(logits, targets)
print("CE loss:", loss)
7. Visualizing Attention on the Corpus
def visualize_sentence_attention(sentence):
ids = encode(sentence, token_to_id)
X, _ = embed(ids, len(vocab), d_model=16)
attn_block = ResidualSelfAttention(d_model=16)
_, attn = attn_block(X)
tokens = [id_to_token[i] for i in ids]
plot_attention(attn, tokens)
visualize_sentence_attention("vega is an a type star in lyra .")
Wrap-up
- You built a readable, NumPy-only transformer core: tokenization → embeddings → single-head causal self-attention → output logits.
- You saw how causal masking prevents peeking at the future.
- This scaffold can be extended with multi-head attention, layer normalization, an MLP, and training loops.
Next steps
- Add layer normalization and a simple 2-layer MLP block (see the sketch after this list).
- Stack multiple attention blocks; try multi-head attention.
- Train on a small corpus and monitor loss.
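If you want to try the first of these, here is a minimal NumPy sketch of a (non-learned) layer norm plus a 2-layer MLP block with a residual connection. The ReLU activation and the 4x hidden width are illustrative choices, not requirements of the architecture.

def layer_norm(X, eps=1e-5):
    # Normalize each token vector to zero mean and unit variance
    # (no learned gain/bias, to keep the sketch minimal).
    mu = X.mean(axis=-1, keepdims=True)
    var = X.var(axis=-1, keepdims=True)
    return (X - mu) / np.sqrt(var + eps)

class MLPBlock:
    """Position-wise 2-layer MLP with a residual connection."""
    def __init__(self, d_model, d_ff=None, seed=42):
        rng = np.random.default_rng(seed)
        d_ff = d_ff or 4 * d_model  # 4x width is a common (illustrative) choice
        self.W1 = rng.normal(scale=0.02, size=(d_model, d_ff)).astype(np.float32)
        self.b1 = np.zeros((d_ff,), dtype=np.float32)
        self.W2 = rng.normal(scale=0.02, size=(d_ff, d_model)).astype(np.float32)
        self.b2 = np.zeros((d_model,), dtype=np.float32)

    def __call__(self, X):
        H = np.maximum(0.0, layer_norm(X) @ self.W1 + self.b1)  # ReLU
        return X + H @ self.W2 + self.b2  # residual connection

# Usage: insert after the attention block in tiny_forward.
mlp = MLPBlock(d_model=16)
print("MLP block output shape:", mlp(out_res).shape)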
Further reading
- “Attention Is All You Need” (Vaswani et al., 2017)
- Annotated Transformer (Harvard NLP)