Implementing Transformers from Scratch — Positional Encoding

Jul 18, 2023

Continuing from Implementing Glove from scratch — Word Embedding for Transformers, this article covers the next step in implementing Transformers from scratch — Positional Encoding.

Positional Encoding is the process of injecting relative positional information into the word embeddings that we developed in the last article. Basically, we want to describe the position of a word with respect to its peers.
This task was previously handled by Recurrent Neural Networks, but since we are eliminating recurrence, we need another way to include that information for NLP tasks. The order of words is obviously very important, so in the paper “Attention Is All You Need”, Vaswani et al. describe a way to encode it using sinusoidal functions. This information is represented as a vector with the same dimensions as the word embedding discussed in the last article.

How to calculate positional encoding?

Suppose your sequence contains “N” positions, and the embedding dimension for each position is “d”. You will then have a positional encoding matrix of size N*d. Each entry is defined by the word’s position in the sequence; for that purpose we need a relation that is periodic and allows the model to differentiate between words that have the same embedding but occur at different positions in the sequence.
Here’s how it is calculated:
The dimensions of the embedding vector are processed in pairs. For pair index i (starting from 0), a sine and a cosine are used with the same frequency f = 1 / (k^(2i / d)), where d is the dimension of the embedding. The sine fills dimension 2i and the cosine fills dimension 2i + 1, so the frequency shrinks geometrically as i grows.
Here, empirically the k is chosen to be 10000.
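To make the frequencies concrete, here is a quick sketch (the embedding size d = 8 is just an illustrative choice, not from the original article) printing the frequency for each dimension pair:

```python
import numpy as np

d, k = 8, 10000  # d = 8 is an arbitrary embedding size for illustration
for i in range(d // 2):
    f = 1 / np.power(k, 2 * i / d)
    print(f"pair {i}: frequency = {f}")
# pair 0: frequency = 1.0
# pair 1: frequency = 0.1
# pair 2: frequency = 0.01
# pair 3: frequency = 0.001
```

Each successive pair oscillates ten times more slowly here, which is how low pairs capture fine-grained position and high pairs capture coarse position.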

Suppose the position of the word for which you are calculating the positional vector is “j”; the encoding can then be computed as follows:

import numpy as np

def positionalEncoding(length, d, n=10000):
    # One row per position, one column per embedding dimension
    P = np.zeros((length, d))
    for j in range(length):            # j = position in the sequence
        for i in np.arange(int(d/2)):  # i = dimension-pair index
            denominator = np.power(n, 2*i/d)
            P[j, 2*i] = np.sin(j/denominator)    # even dimension: sine
            P[j, 2*i+1] = np.cos(j/denominator)  # odd dimension: cosine
    return P
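As a quick sanity check, we can compute the encoding for a tiny example directly from the formula (sequence length 4 and dimension 6 are arbitrary choices) and inspect the first row:

```python
import numpy as np

# Encoding for a small example, straight from the formula:
# sequence length 4, embedding dimension 6, n = 10000.
length, d, n = 4, 6, 10000
P = np.zeros((length, d))
for j in range(length):
    for i in range(d // 2):
        denominator = np.power(n, 2 * i / d)
        P[j, 2 * i] = np.sin(j / denominator)
        P[j, 2 * i + 1] = np.cos(j / denominator)

print(P.shape)  # (4, 6)
print(P[0])     # position 0: all sines are 0, all cosines are 1
```

At position 0 every sine term is sin(0) = 0 and every cosine term is cos(0) = 1, so the first row is [0, 1, 0, 1, 0, 1], which is a handy check that the even/odd interleaving is correct.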

Rationale behind using the Sinusoidal functions

The rationale behind using sinusoidal frequencies in positional encoding is to introduce a pattern of varying frequencies that encode positional information into the embeddings. The choice of sinusoidal functions with different frequencies is based on the intuition that these functions can represent different positional patterns in a smooth and continuous manner.

Here are a few reasons why sinusoidal frequencies are used:

  1. Periodicity: Sinusoidal functions exhibit periodicity, meaning they repeat their values over a certain interval. By using sinusoidal functions, the positional encoding can capture the notion of periodicity or recurrence in the sequence. This is useful because in many natural language processing tasks, there are often patterns or structures that repeat or occur at regular intervals within the input sequence.
  2. Differentiation: Sinusoidal functions with different frequencies help differentiate the positional encoding vectors for different positions in the sequence. The frequencies are designed such that the vectors for nearby positions have similar but distinct patterns, allowing the model to distinguish between positions that are closer together.
  3. Smoothness: Sinusoidal functions provide a smooth transition of values across positions. The continuous nature of sinusoidal functions ensures that nearby positions have similar positional encodings, facilitating the model’s ability to interpolate and generalize positional information.
  4. Generalization: The choice of sinusoidal functions with varying frequencies allows the model to generalize positional patterns beyond the observed training positions. By learning the relationships between the sinusoidal frequencies and the input sequence, the model can capture positional information for unseen positions during inference.
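Points 2 and 3 above can be observed numerically. The sketch below uses a vectorized version of the same sinusoidal scheme (sequence length 50 and dimension 64 are arbitrary choices) and compares cosine similarity between encodings of nearby versus distant positions:

```python
import numpy as np

def encoding(length, d, n=10000):
    # Vectorized form of the same scheme: sine on even dims, cosine on odd dims
    pos = np.arange(length)[:, None]          # column of positions
    i = np.arange(d // 2)[None, :]            # row of dimension-pair indices
    angles = pos / np.power(n, 2 * i / d)     # (length, d/2) angle matrix
    P = np.zeros((length, d))
    P[:, 0::2] = np.sin(angles)
    P[:, 1::2] = np.cos(angles)
    return P

def cos_sim(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

P = encoding(50, 64)
# A neighboring position is more similar than a distant one
print(cos_sim(P[10], P[11]), cos_sim(P[10], P[40]))
```

Note also that every row has the same norm, sqrt(d/2), since each sine/cosine pair contributes sin² + cos² = 1; the encodings differ only in direction, which is exactly the smooth, differentiable pattern described above.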

So, now we understand yet another step in implementing transformers! Stay tuned for the next article, and do follow for more.

Happy Learning!