Introduction
Before we can train a Transformer, we need to understand two fundamental mathematical functions that power every language model: Softmax and Cross-Entropy Loss. These are the bridge between raw neural network outputs and meaningful predictions.
Why This Matters: Without softmax, we can't convert model outputs to probabilities. Without cross-entropy loss, we can't tell the model how wrong its predictions are. Together, they make training possible.
| Function | Purpose | Input → Output |
|---|---|---|
| Softmax | Convert raw scores to probabilities | Logits → Probability Distribution |
| Cross-Entropy | Measure prediction quality | Predictions + Labels → Single Loss Number |
Let's understand each one deeply, with intuition, formulas, and hands-on examples.
What is Softmax?
Imagine you're building a language model that needs to predict the next word. After processing through all the neural network layers, you get a list of scores (called logits) - one for each word in your vocabulary.
The problem? These logits can be any real number: positive, negative, large, small. They don't tell you the actual probability of each word being correct.
The Challenge: Raw logits like [2.0, 1.0, 0.5, -1.0] don't tell us much. Is 2.0 good? Is -1.0 bad? We need a way to convert these to probabilities that sum to 1.
The Intuition Behind Softmax
Softmax solves this problem elegantly with two key ideas:
- Exponentiation (e^x): Makes all values positive and amplifies differences. Larger logits get exponentially larger values.
- Normalization: Divide each value by the sum of all values, ensuring everything adds up to 1.0 (a valid probability distribution).
| Property | What It Means | Why It's Useful |
|---|---|---|
| All values are positive | e^x is always > 0 | Probabilities can't be negative |
| Sum equals 1 | We divide by total | Valid probability distribution |
| Preserves order | Larger logit → larger probability | Respects model's preferences |
| Amplifies differences | Exponential scaling | Winner takes more share |
The Softmax Formula
The softmax function converts a vector of K real numbers (logits) into a probability distribution:

$$\text{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$

Let's break down what each part means:
- $z_i$ - The logit (raw score) for class i
- $e^{z_i}$ - Exponential of the logit, making it positive and amplified
- $\sum_{j=1}^{K} e^{z_j}$ - Sum of all exponentials (the normalizing factor)
- $\text{softmax}(z)_i$ - A probability between 0 and 1
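In code, the formula is only a few lines. Here is a minimal Python sketch (the function name is illustrative; subtracting the maximum logit before exponentiating is a standard numerical-stability trick, not part of the formula itself):

```python
import math

def softmax(logits):
    """Convert a list of raw logits into a probability distribution."""
    # Subtract the max logit before exponentiating. This avoids overflow
    # in math.exp() and leaves the result mathematically unchanged.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.0, 2.0, 3.0])
print([round(p, 3) for p in probs])  # [0.09, 0.245, 0.665]
print(abs(sum(probs) - 1.0) < 1e-9)  # True: a valid distribution
```

Note that all four properties from the table above hold: every output is positive, they sum to 1, the largest logit keeps the largest share, and the gaps between classes are amplified.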
Interactive Softmax Explorer
Use the interactive visualization below to see how softmax transforms logits into probabilities. Adjust the sliders to see how changing one logit affects all probabilities:
Try This: Set one logit much higher than the others and watch how it dominates the probability distribution. Then try making all logits equal and see how probabilities become uniform.
Step-by-Step Softmax Example
Let's work through a concrete example. Suppose our model outputs these logits for predicting the next word:

Scenario: The model is predicting the next word after "The cat sat on the"

Logits: [2.0, 1.0, 0.5, -1.0], where the highest score, 2.0, belongs to the word "mat"

Computing Softmax Probabilities

Step 1 - Exponentiate each logit: e^2.0 = 7.389, e^1.0 = 2.718, e^0.5 = 1.649, e^-1.0 = 0.368

💡 The exponential makes all values positive and amplifies the differences. Notice how 2.0 (the highest logit) becomes 7.389, much larger than the others.

Step 2 - Sum the exponentials: 7.389 + 2.718 + 1.649 + 0.368 = 12.124

💡 This sum will be our denominator, ensuring probabilities sum to 1.

Step 3 - Divide each exponential by the sum: [7.389/12.124, 2.718/12.124, 1.649/12.124, 0.368/12.124] = [0.609, 0.224, 0.136, 0.030]

💡 Now we have valid probabilities! Notice they sum to approximately 1.0 (0.609 + 0.224 + 0.136 + 0.030 = 0.999).

💡 The tiny difference from 1.0 is due to rounding. In practice, it's exactly 1.0.
Interpretation: The model predicts "mat" with 60.9% confidence as the next word after "The cat sat on the". Makes sense!
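The three steps above can be reproduced directly as a quick numeric check (using the logits [2.0, 1.0, 0.5, -1.0] from the example):

```python
import math

logits = [2.0, 1.0, 0.5, -1.0]

# Step 1: exponentiate each logit
exps = [math.exp(z) for z in logits]  # [7.389, 2.718, 1.649, 0.368]

# Step 2: sum the exponentials (the denominator)
total = sum(exps)                     # ≈ 12.124

# Step 3: divide each exponential by the sum
probs = [e / total for e in exps]
print([round(p, 3) for p in probs])   # [0.609, 0.224, 0.136, 0.03]
print(round(sum(probs), 6))           # 1.0 (exact up to float precision)
```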
What is Cross-Entropy Loss?
Now that we can get probabilities from our model, we need a way to tell it how wrong those probabilities are. This is where Cross-Entropy Loss comes in.
The Goal: We want a single number that says "your prediction was this bad." The higher the number, the worse the prediction. During training, we minimize this number.
The Intuition Behind Cross-Entropy
Cross-entropy measures the "surprise" or "unexpectedness" of the model's prediction relative to the true answer. Think of it this way:
| Scenario | Model Says | True Answer Is | Loss |
|---|---|---|---|
| Confident & Correct | 90% cat | cat | Low (0.1) |
| Uncertain | 25% each | cat | Medium (1.4) |
| Confident & Wrong | 90% dog | cat | Very High (2.3) |
The key insight is that cross-entropy heavily penalizes confident wrong predictions. This pushes the model to be both accurate and appropriately confident.
- Confident and correct? Low loss. The model is rewarded.
- Uncertain? Moderate loss. The model should be more decisive.
- Confident and wrong? Very high loss! The model is severely penalized.
The Cross-Entropy Formula
For classification problems (like next-word prediction), the cross-entropy loss is:

$$L = -\sum_{i=1}^{K} y_i \log(p_i)$$

Where:
- $y_i$ - The true label (one-hot encoded: 1 for the correct class, 0 otherwise)
- $p_i$ - The predicted probability for class i
- $\log$ - Natural logarithm (base e)

Since y is one-hot encoded (only one class has y = 1, the others are 0), this simplifies beautifully to:

$$L = -\log(p_{\text{correct}})$$
The Simple Rule: Cross-entropy loss is just the negative log of the probability assigned to the correct answer. Higher probability for the correct answer → lower loss.
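That simple rule translates to a one-line function in Python (a minimal sketch; the helper name is mine). The three calls mirror the three scenarios from the table above:

```python
import math

def cross_entropy(probs, correct_index):
    """Cross-entropy loss: the negative log of the probability
    the model assigned to the correct class."""
    return -math.log(probs[correct_index])

# Confident & correct: 90% on the right answer -> low loss
print(round(cross_entropy([0.90, 0.05, 0.05], 0), 3))        # 0.105
# Uncertain: 25% each over four classes -> medium loss
print(round(cross_entropy([0.25, 0.25, 0.25, 0.25], 0), 3))  # 1.386
# Confident & wrong: only 5% on the right answer -> high loss
print(round(cross_entropy([0.90, 0.05, 0.05], 1), 3))        # 2.996
```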
Interactive Cross-Entropy Explorer
Use this interactive visualization to see how cross-entropy loss changes based on the model's predictions:
The negative log function penalizes confident wrong predictions severely. If the model says 1% for the correct answer, loss = -log(0.01) = 4.6. But if it says 90%, loss = -log(0.9) = 0.1. This gradient pushes the model to be both correct and confident.
Try This: Compare "Confident & Correct" vs "Confident & Wrong" scenarios. Notice how the loss explodes when the model is confidently wrong!
Step-by-Step Cross-Entropy Example
Let's calculate the cross-entropy loss for our next-word prediction example:
Scenario: The correct next word is "mat" and our model predicted:

p = [0.609, 0.224, 0.136, 0.030], with 0.609 assigned to "mat"

True Label (One-Hot Encoded):

y = [1, 0, 0, 0], where the 1 marks "mat" as the correct class

Computing Cross-Entropy Loss

Step 1 - Expand the sum: L = -(1 × log(0.609) + 0 × log(0.224) + 0 × log(0.136) + 0 × log(0.030))

💡 Since y is one-hot, only the correct class contributes to the loss.

Step 2 - Take the negative log: L = -log(0.609) = 0.496

💡 The natural log of 0.609 is about -0.496. The negative sign flips it to positive.

💡 All terms with y = 0 contribute nothing. Only the correct class matters.
Interpretation: A loss of 0.496 is quite low! It means the model was reasonably confident (60.9%) in the correct answer. Perfect prediction (100% confidence) would give loss = 0.
Scenario: What if the model had been 90% confident in "dog" instead, leaving only 3% for the correct word "mat"?

Cross-Entropy for Wrong Prediction

Step 1 - Probability of the correct class: p_mat = 0.03

💡 Very low confidence in the right answer!

Step 2 - Take the negative log: L = -log(0.03) = 3.507

💡 The loss is much higher because the log of a small number is very negative!
Key Insight: The loss jumped from 0.496 to 3.507 - about 7 times higher! This steep penalty for confident wrong predictions is what makes cross-entropy so effective for training.
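Both scenarios can be checked in a couple of lines, using the probabilities from the examples above:

```python
import math

loss_correct = -math.log(0.609)  # 60.9% on the right word "mat"
loss_wrong = -math.log(0.03)     # only 3% left on "mat"

print(round(loss_correct, 3))               # 0.496
print(round(loss_wrong, 3))                 # 3.507
print(round(loss_wrong / loss_correct, 1))  # about 7x higher
```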
How They Work Together
In training, softmax and cross-entropy work as a team:
- Forward Pass: Model outputs logits → Softmax converts to probabilities
- Loss Calculation: Cross-entropy measures how far predictions are from truth
- Backpropagation: Gradients flow back to improve the model
Beautiful Gradient: The gradient of softmax + cross-entropy has a remarkably simple form: for the correct class, it's (p - 1), and for wrong classes, it's just p. This elegant simplicity makes training efficient!
| Class | Gradient | Effect |
|---|---|---|
| Correct class | p - 1 (negative) | Push probability up toward 1 |
| Wrong classes | p (positive) | Push probability down toward 0 |
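This simple gradient is easy to verify numerically. The sketch below (illustrative code, not a reference implementation) compares the analytic p - y gradient against a finite-difference estimate of the loss:

```python
import math

def softmax(logits):
    # Max-subtraction keeps math.exp() from overflowing.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def loss(logits, correct):
    """Cross-entropy of softmax(logits) against class `correct`."""
    return -math.log(softmax(logits)[correct])

logits, correct = [2.0, 1.0, 0.5, -1.0], 0
probs = softmax(logits)

# Analytic gradient of softmax + cross-entropy w.r.t. each logit: p_i - y_i
analytic = [p - (1.0 if i == correct else 0.0) for i, p in enumerate(probs)]

# Numerical gradient via central finite differences
eps = 1e-6
numeric = []
for i in range(len(logits)):
    up, down = logits[:], logits[:]
    up[i] += eps
    down[i] -= eps
    numeric.append((loss(up, correct) - loss(down, correct)) / (2 * eps))

for a, n in zip(analytic, numeric):
    print(round(a, 4), round(n, 4))  # the two columns agree
```

The gradient for the correct class comes out negative (pushing its probability up), and all others positive (pushing theirs down), exactly as in the table.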
Temperature Scaling
Sometimes we want to control how "sharp" or "flat" our probability distribution is. This is done by dividing the logits by a temperature parameter T before applying softmax:

$$\text{softmax}_T(z)_i = \frac{e^{z_i/T}}{\sum_{j=1}^{K} e^{z_j/T}}$$
| Temperature | Effect | Use Case |
|---|---|---|
| T < 1 | Sharper distribution, more confident | When you want decisive predictions |
| T = 1 | Standard softmax | Normal training and inference |
| T > 1 | Flatter distribution, more uncertain | Sampling for creative text generation |
Try adjusting the temperature in the interactive softmax visualizer above to see this effect!
Summary
Softmax: Converts raw logits to a valid probability distribution using exponentiation and normalization.
Cross-Entropy: Measures prediction quality as -log(p_correct). Heavily penalizes confident wrong predictions.
- Softmax Properties: All outputs positive, sum to 1, preserves ranking, amplifies differences
- Cross-Entropy Properties: 0 when perfect, increases as confidence in wrong answer grows
- Together: They create clean gradients (p - y) that efficiently train neural networks
- Temperature: Controls distribution sharpness - lower = more confident, higher = more uncertain
With this understanding of softmax and cross-entropy, you now have the mathematical foundation needed to understand how Transformers learn from data. In the next chapters, we'll see these functions in action as we build attention mechanisms and train our model.