In deep neural networks, the input to each layer is the output of the previous layer. When the parameters of a layer are updated during training, the distribution of its output changes as well. As a result, the next layer must continually adapt to this shifting input distribution, which can reduce training efficiency and stability. The main purpose of normalization is to stabilize the input distribution of each layer by transforming it to have zero mean and unit variance. By keeping the input distribution consistent throughout training, normalization improves both training efficiency and overall stability.

Given any input $x$ with shape $(B, T, C)$, where $B$ is the batch size, $T$ is the number of tokens, and $C$ is the embedding dimension per token, the output is generally computed as:
$$ \text{normalization}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta, $$
where $\epsilon$ is a small constant for numerical stability, $\odot$ denotes elementwise multiplication, and $\gamma$ and $\beta$ are learnable vector parameters of shape $(C,)$, broadcast along the last dimension. $\mu$ and $\sigma^2$ denote the mean and variance of the input. Different methods mainly differ in how these two statistics are computed.
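To make the shared structure concrete, here is a minimal NumPy sketch of this affine normalization step, assuming $\mu$ and $\sigma^2$ have already been computed and are broadcastable to the shape of $x$ (the function name and signature are illustrative, not a reference implementation):

```python
import numpy as np

def normalize(x, mu, var, gamma, beta, eps=1e-5):
    """Affine normalization for x of shape (B, T, C).

    mu, var: precomputed mean/variance, broadcastable to x.shape
    gamma, beta: learnable scale/shift of shape (C,)
    """
    x_hat = (x - mu) / np.sqrt(var + eps)   # zero mean, unit variance
    return gamma * x_hat + beta             # learnable rescale and shift
```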
**BatchNorm.** The statistics are computed across both the batch and token dimensions, yielding one mean and one variance per channel $k$:
$$ \mu_k = \frac{1}{BT} \sum_{i=1}^{B} \sum_{j=1}^{T} x_{i, j, k} $$
$$ \sigma_k^2 = \frac{1}{BT} \sum_{i=1}^{B} \sum_{j=1}^{T} \left(x_{i,j,k} - \mu_k\right)^2 $$
Note: BatchNorm behaves differently in training and inference.
During training, the statistics $\mu_k^{\text{batch}}$ and $\sigma_k^{2, \text{batch}}$ are computed from the current mini-batch, and the running statistics are updated with momentum $m$:
$$ \text{running\_mean}_k \leftarrow (1 - m)\,\text{running\_mean}_k + m\,\mu_k^{\text{batch}} $$
$$ \text{running\_var}_k \leftarrow (1 - m)\,\text{running\_var}_k + m\,\sigma_k^{2,\text{batch}} $$
During inference, the stored running mean and variance are used for normalization instead of batch statistics.
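The following NumPy sketch ties the two phases together; the `batch_norm` function, its `momentum` argument, and the in-place update of the running buffers are illustrative assumptions rather than a reference implementation:

```python
import numpy as np

def batch_norm(x, gamma, beta, running_mean, running_var,
               training=True, momentum=0.1, eps=1e-5):
    """BatchNorm for x of shape (B, T, C): one mean/variance per channel."""
    if training:
        mu = x.mean(axis=(0, 1))   # shape (C,), averaged over batch and tokens
        var = x.var(axis=(0, 1))   # biased variance, matching the formula above
        # Exponential moving average of the batch statistics (momentum m).
        running_mean[:] = (1 - momentum) * running_mean + momentum * mu
        running_var[:] = (1 - momentum) * running_var + momentum * var
    else:
        # Inference: rely on the stored running statistics.
        mu, var = running_mean, running_var
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta
```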
**LayerNorm.** The statistics are computed independently for each token in each sample. Unlike BatchNorm, LayerNorm does not rely on batch statistics, making it suitable for small-batch or variable-length sequence settings.
$$ \mu_{i,j} = \frac{1}{C} \sum_{k=1}^{C} x_{i,j,k} $$
$$ \sigma_{i,j}^2 = \frac{1}{C} \sum_{k=1}^{C} (x_{i,j,k} - \mu_{i,j})^2 $$
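A corresponding sketch for LayerNorm, again with an illustrative signature; note that no running statistics are needed, since each token normalizes itself:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """LayerNorm for x of shape (B, T, C): one mean/variance per (sample, token)."""
    mu = x.mean(axis=-1, keepdims=True)    # shape (B, T, 1), over the C dimension
    var = x.var(axis=-1, keepdims=True)    # shape (B, T, 1)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta
```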
**InstanceNorm.** The statistics are computed per feature map, i.e., independently for each channel of each sample. If we flatten each $H \times W$ feature map and treat it as an embedding (so the embedding dimension is $C = H \times W$), then the formulas for the mean and variance become identical to those used in LayerNorm:
$$ \mu_{i,j} = \frac{1}{C} \sum_{k=1}^{C} x_{i,j,k} $$
$$ \sigma_{i,j}^2 = \frac{1}{C} \sum_{k=1}^{C} (x_{i,j,k} - \mu_{i,j})^2 $$
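The reshaping argument can be checked directly: flattening each feature map and normalizing over the last axis is exactly the LayerNorm computation with embedding size $H \times W$. A minimal sketch, with the affine parameters omitted for brevity and the function name an assumption (here `C` denotes the number of channels, not the flattened embedding size):

```python
import numpy as np

def instance_norm(x, eps=1e-5):
    """InstanceNorm for images of shape (B, C, H, W):
    one mean/variance per (sample, channel)."""
    B, C, H, W = x.shape
    flat = x.reshape(B, C, H * W)            # each feature map -> length-H*W vector
    mu = flat.mean(axis=-1, keepdims=True)   # shape (B, C, 1)
    var = flat.var(axis=-1, keepdims=True)   # shape (B, C, 1)
    x_hat = (flat - mu) / np.sqrt(var + eps)
    return x_hat.reshape(B, C, H, W)
```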