
🤖 Mini-Transformer with NumPy: A Hands-On Tutorial


Learning Objectives

By completing this tutorial, you will:

  • Understand how to build a transformer step by step: tokenization → embedding → transformer block → output logits.
  • Implement a transformer with a single attention layer and no MLP layer.
  • Visualize how attention works.

Through this tutorial, you will implement the following:

1. Tokenizer

Transform text into numbers that the model can process.

2. Embeddings

  • Token embeddings: Learn representations for each token.
  • If we have time: sinusoidal positional encoding, which tells the model about token order.

3. Single-Head Causal Self-Attention

  • Query (W_Q), Key (W_K), Value (W_V) projections
  • Attention scores with causal masking
  • Softmax normalization
  • Simple residual connections

4. Output Layer

Linear transformation to predict the next token.

Imports & helpers

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

np.random.seed(42)

1. The Tokenizer

Before a neural network can understand text, we need to convert words into numbers. This process, called tokenization, is the first step in training any language model, from BERT to GPT.

Below, we implement a super simple tokenizer.

def tokenize(text):
    """Split text into tokens, handling punctuation as separate tokens."""
    # Basic lowercase
    text = text.lower()
    # Add spaces around punctuation
    punct = '.,!?;:"()[]{}'
    for p in punct:
        text = text.replace(p, f' {p} ')
    # Split on whitespace
    return text.split()

def build_vocab(texts, min_freq=1):
    from collections import Counter
    tokens = []
    for t in texts:
        tokens.extend(tokenize(t))
    counts = Counter(tokens)
    # Keep tokens with frequency >= min_freq
    vocab = ['<PAD>', '<UNK>'] + [tok for tok, c in counts.items() if c >= min_freq]
    token_to_id = {tok: i for i, tok in enumerate(vocab)}
    id_to_token = {i: tok for tok, i in token_to_id.items()}
    return token_to_id, id_to_token, vocab

def encode(text, token_to_id):
    return [token_to_id.get(tok, token_to_id['<UNK>']) for tok in tokenize(text)]

def decode(ids, id_to_token):
    return ' '.join([id_to_token.get(i, '<UNK>') for i in ids])

Try it out :)

# Example usage
texts = ["Hello, world!", "Hello there.", "This is an example!"]
token_to_id, id_to_token, vocab = build_vocab(texts)
print(f"Vocabulary size: {len(vocab)}")
print("Tokens:", vocab[:20])

encoded = encode("Hello world!", token_to_id)
print("Encoded:", encoded)
print("Decoded:", decode(encoded, id_to_token))

Next, let us build a vocabulary from a tiny astronomy corpus and encode it.

texts = [
    "O B A F G K M O B A F G K M O B A F G K M",
    "The Sun is a G type star in the Milky Way.",
    "Betelgeuse is an M type star.",
    "Stars like the Sun are main-sequence G stars.",
    "Sirius is a bright A type star.",
    "Proxima Centauri is an M dwarf.",
    "Vega is an A type star in Lyra.",
    "Rigel is a B type supergiant.",
]
token_to_id, id_to_token, vocab = build_vocab(texts, min_freq=1)
print(f"Vocab size: {len(vocab)}")
print("Sample tokens:", vocab[:20])

def batch_encode(texts, token_to_id):
    return [encode(t, token_to_id) for t in texts]

encoded_corpus = batch_encode(texts, token_to_id)
for i, (t, e) in enumerate(zip(texts, encoded_corpus)):
    print(i, t)
    print(" ->", e)

Let us compare our tokenizer with the one used by the GPT-2 model. Try tokenizing some funky words like "unbelievable" or "artificial". What do you notice?

# Install: pip install tiktoken
import tiktoken

def demo_gpt2_tokenizer(sentence):
    # Use GPT-2's tokenizer
    enc = tiktoken.get_encoding("gpt2")

    tokens = enc.encode(sentence)
    print("Token IDs:", tokens)
    print("Decoded:", enc.decode(tokens))

demo_gpt2_tokenizer("Transformers are amazing!")

2. Embeddings

Remember our tokens [5, 87, 42]? These numbers don't mean anything to a neural network yet. We need to convert each token into a vector that can capture meaning.

The Problem with Order: Transformers process all the tokens in a sequence at once, but word order matters: "The cat ate the mouse" ≠ "The mouse ate the cat" 🐱🐭

A solution is to add position information to each embedding using sinusoidal positional encoding.

In simple terms:

  • Even dimensions use sin, odd dimensions use cos
  • Each dimension oscillates at a different frequency
  • This creates a unique "fingerprint" for each position!

Final embedding = Token embedding + Positional encoding (we combine the two right after the positional-encoding code below).

def one_hot_encode(ids, vocab_size):
    arr = np.zeros((len(ids), vocab_size), dtype=np.float32)
    for i, idx in enumerate(ids):
        arr[i, idx] = 1.0
    return arr

def embed(ids, vocab_size, d_model, seed=42):
    np.random.seed(seed)
    W_emb = np.random.randn(vocab_size, d_model).astype(np.float32) * 0.01
    return W_emb[ids], W_emb  # return embeddings and the table

# Demo
ids = encode("The Sun is a G type star", token_to_id)
X_onehot = one_hot_encode(ids, len(vocab))
X_emb, W_emb = embed(ids, len(vocab), d_model=16)

print("One-hot shape:", X_onehot.shape)
print("Emb shape:", X_emb.shape)

Sinusoidal Position Encoding

def sinusoidal_position_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]  # (seq_len, 1)
    i = np.arange(d_model)[None, :]    # (1, d_model)

    angles = pos / np.power(10000, (2 * (i // 2)) / d_model)
    enc = np.zeros((seq_len, d_model), dtype=np.float32)
    enc[:, 0::2] = np.sin(angles[:, 0::2])  # even indices
    enc[:, 1::2] = np.cos(angles[:, 1::2])  # odd indices
    return enc

# Quick visualization helper
def plot_positional_encoding(enc, title="Positional Encoding"):
    plt.figure(figsize=(8, 3))
    plt.imshow(enc.T, aspect='auto', origin='lower')
    plt.colorbar()
    plt.title(title)
    plt.xlabel("Position")
    plt.ylabel("Dimension")
    plt.tight_layout()

enc = sinusoidal_position_encoding(seq_len=50, d_model=16)
plot_positional_encoding(enc)
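To realize the formula above (final embedding = token embedding + positional encoding), a minimal sketch is to build an encoding of the right length and add it to the token embeddings. This reuses X_emb from the demo above:

# Add positional information to the token embeddings (element-wise sum)
pos_enc = sinusoidal_position_encoding(seq_len=X_emb.shape[0], d_model=X_emb.shape[1])
X_with_pos = X_emb + pos_enc
print("Embedding + position shape:", X_with_pos.shape)  # (seq_len, d_model)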

3. Single-Head Causal Self-Attention

We implement a minimal, readable attention block:

  • Linear projections for queries (Q), keys (K), and values (V)
  • Scaled dot-product attention with causal masking
  • Softmax normalization
  • Residual connection

def causal_mask(seq_len):
    return np.triu(np.ones((seq_len, seq_len), dtype=np.float32), k=1) * -1e9

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

class SelfAttention:
    def __init__(self, d_model, seed=42):
        rng = np.random.default_rng(seed)
        self.W_q = rng.normal(scale=0.02, size=(d_model, d_model)).astype(np.float32)
        self.W_k = rng.normal(scale=0.02, size=(d_model, d_model)).astype(np.float32)
        self.W_v = rng.normal(scale=0.02, size=(d_model, d_model)).astype(np.float32)
        self.scale = np.sqrt(d_model).astype(np.float32)

    def __call__(self, X):
        """
        X: (T, d_model) where T is sequence length
        Returns: (T, d_model)
        """
        Q = X @ self.W_q
        K = X @ self.W_k
        V = X @ self.W_v

        scores = (Q @ K.T) / self.scale  # (T, T)
        scores = scores + causal_mask(scores.shape[0])  # prevent attending to future tokens
        attn = softmax(scores, axis=-1)  # (T, T)
        out = attn @ V                   # (T, d_model)
        return out, attn

# Demo on a short sequence
ids = encode("the sun is a g type star", token_to_id)
X, _ = embed(ids, len(vocab), d_model=16)
attn_block = SelfAttention(d_model=16)

out, attn = attn_block(X)
print("Input shape:", X.shape)
print("Output shape:", out.shape)
print("Attention shape:", attn.shape)
def plot_attention(attn, tokens):
    plt.figure(figsize=(5, 4))
    plt.imshow(attn, origin='lower', aspect='auto')
    plt.xticks(range(len(tokens)), tokens, rotation=90)
    plt.yticks(range(len(tokens)), tokens)
    plt.colorbar(label="Attention weight")
    plt.tight_layout()

tokens = [id_to_token[i] for i in ids]
plot_attention(attn, tokens)
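Two quick sanity checks on the attn matrix computed above: every row should sum to 1 (it is a softmax over the keys), and everything above the diagonal should be (numerically) zero because of the causal mask.

# Rows are probability distributions over the visible positions
print("Rows sum to 1:", np.allclose(attn.sum(axis=-1), 1.0))
# The causal mask zeroes out attention to future tokens
print("Upper triangle is zero:", np.allclose(np.triu(attn, k=1), 0.0))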

Add a simple residual connection

class ResidualSelfAttention(SelfAttention):
    def __call__(self, X):
        out, attn = super().__call__(X)
        return X + out, attn

res_block = ResidualSelfAttention(d_model=16)
out_res, attn_res = res_block(X)
print("Residual output shape:", out_res.shape)

4. Output Layer → Logits for Next-Token Prediction

A small linear layer maps from the model dimension back to the vocabulary size to produce logits over the next token.

class OutputHead:
    def __init__(self, d_model, vocab_size, seed=42):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=0.02, size=(d_model, vocab_size)).astype(np.float32)
        self.b = np.zeros((vocab_size,), dtype=np.float32)

    def __call__(self, X):
        # X: (T, d_model)
        return X @ self.W + self.b

head = OutputHead(d_model=16, vocab_size=len(vocab))
logits = head(out_res)  # (T, vocab_size)
print("Logits shape:", logits.shape)
print("Sample logits row (first token):", logits[0][:10])

5. Putting It Together: A Tiny Forward Pass

def tiny_forward(text, token_to_id, d_model=16, seed=42):
    # Encode + embed
    ids = encode(text, token_to_id)
    X, _ = embed(ids, len(vocab), d_model=d_model)

    # Attention block with residual
    attn_block = ResidualSelfAttention(d_model=d_model, seed=seed)
    H, attn = attn_block(X)

    # Output head
    head = OutputHead(d_model=d_model, vocab_size=len(vocab), seed=seed)
    logits = head(H)
    return ids, logits, attn

ids, logits, attn = tiny_forward("the sun is a g type star", token_to_id, d_model=16)
print("Forward shapes -> ids:", len(ids), "| logits:", logits.shape, "| attn:", attn.shape)
def predict_next_token(ids, logits, id_to_token, top_k=5):
    # Use the last position
    last_logits = logits[-1]
    # Softmax for probabilities (for reporting)
    probs = softmax(last_logits)
    # Top-k indices
    top_idx = np.argsort(-probs)[:top_k]
    return [(id_to_token[i], float(probs[i])) for i in top_idx]

preds = predict_next_token(ids, logits, id_to_token, top_k=5)
for tok, p in preds:
    print(f"{tok:>12s}  {p:.4f}")

6. Training Sketch: Cross-Entropy

Below is a simple sketch of the token-level cross-entropy loss for next-token prediction. It is written for clarity, not speed.

def cross_entropy_loss(logits, targets):
    # logits: (T, V), targets: (T,)
    # Compute softmax and negative log-likelihood
    probs = softmax(logits, axis=-1)
    n = logits.shape[0]
    eps = 1e-9
    log_probs = -np.log(probs[np.arange(n), targets] + eps)
    return float(np.mean(log_probs))

# Dummy targets: shift the sequence by one so each position predicts the next token
# (the last target simply repeats the final id)
targets = np.array(ids[1:] + [ids[-1]], dtype=np.int64)
loss = cross_entropy_loss(logits, targets)
print("CE loss:", loss)

7. Visualizing Attention on the Corpus

def visualize_sentence_attention(sentence):
    ids = encode(sentence, token_to_id)
    X, _ = embed(ids, len(vocab), d_model=16)
    attn_block = ResidualSelfAttention(d_model=16)
    _, attn = attn_block(X)
    tokens = [id_to_token[i] for i in ids]
    plot_attention(attn, tokens)

visualize_sentence_attention("vega is an a type star in lyra .")

Wrap-up

  • You built a readable, NumPy-only transformer core: tokenization → embeddings → single-head causal self-attention → output logits.
  • You saw how causal masking prevents peeking at the future.
  • This scaffold can be extended with multi-head attention, layer normalization, an MLP, and training loops.

Next steps

  • Add layer normalization and a simple 2-layer MLP block (a starting sketch follows this list).
  • Stack multiple attention blocks; try multi-head attention.
  • Train on a small corpus and monitor loss.
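
For the first item, a minimal sketch of layer normalization (without the learnable scale and shift, for brevity) and a position-wise 2-layer MLP might look like this; the names and sizes here are illustrative, not prescribed:

def layer_norm(X, eps=1e-5):
    """Normalize each position to zero mean and unit variance across d_model."""
    mean = X.mean(axis=-1, keepdims=True)
    var = X.var(axis=-1, keepdims=True)
    return (X - mean) / np.sqrt(var + eps)

class MLP:
    """Position-wise feed-forward block: Linear -> ReLU -> Linear."""
    def __init__(self, d_model, d_hidden, seed=42):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(scale=0.02, size=(d_model, d_hidden)).astype(np.float32)
        self.b1 = np.zeros((d_hidden,), dtype=np.float32)
        self.W2 = rng.normal(scale=0.02, size=(d_hidden, d_model)).astype(np.float32)
        self.b2 = np.zeros((d_model,), dtype=np.float32)

    def __call__(self, X):
        h = np.maximum(0.0, X @ self.W1 + self.b1)  # ReLU
        return h @ self.W2 + self.b2

# A fuller block: attention + residual (out_res), then LayerNorm + MLP + residual
mlp = MLP(d_model=16, d_hidden=64)
H2 = out_res + mlp(layer_norm(out_res))
print("Block output shape:", H2.shape)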

Further reading

  • “Attention Is All You Need” (Vaswani et al., 2017)
  • Annotated Transformer (Harvard NLP)

Last Update: September 10, 2025

Author

Ioana Ciucă