BERT: A Brief Review

A Brief Review of Bidirectional Encoder Representations from Transformers

Assumed Audience: You are curious about BERT, and want to know what transformers and attention are in deep learning.


BERT was introduced in the paper "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" by researchers from Google: Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. The paper surveys the then state of the art in language models and proposes something new: pre-training bidirectionally. The authors compare BERT with OpenAI's GPT and with ELMo, and show how their architecture is designed for bidirectional pre-training of transformers for language modelling.

Recap: Attention and Transformers.

Transformers are a neural network architecture designed to handle sequential data such as natural language, for tasks such as translation and text summarization. Unlike recurrent neural networks (RNNs), transformers do not need to process sequential data in order; that is, they do not need to process the beginning of a sequence before the end. This makes them more efficient in terms of parallelization and training time.

Prior to transformers, the state-of-the-art sequential models relied on gated recurrent neural networks such as LSTMs and gated recurrent units (GRUs), equipped with attention mechanisms. Transformers built upon these attention mechanisms without using an RNN structure, showing that attention without sequential processing is powerful enough to match the performance of RNNs with attention. But what really is attention? Well, attention is all you need.

Attention Mechanisms:

Attention mimics cognitive attention: in simple terms, the ability to focus on one thing and ignore the others. In our daily lives we are constantly bombarded with sensory information, most of which is noise, and our minds find a way to extract the signal from that noise. For example, at a cocktail party we manage to listen to a single voice and pay attention to it, inadvertently drowning the rest of the noise into the background. Alongside this external attention there is a form of internal attention that allows us to hold on to a single thought, a single idea, at a time.

In Machine Learning, attention mechanisms equip the Neural Network with the ability to focus on a subset of its inputs or features.

Let \(x \in R^d\) be an input vector, \(z \in R^k\) a feature vector, \(a \in [0,1]^k\) an attention vector, \(g \in R^k\) an attention glimpse, and \(f_\phi(x)\) an attention network with parameters \(\phi\). Attention is typically implemented as

$$a = f_\phi(x), \qquad g = a \odot z$$
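The two equations above can be sketched directly. Below is a minimal pure-Python illustration where \(f_\phi\) is assumed to be a single sigmoid layer with parameters \(\phi = (W, b)\); the function names and the choice of sigmoid are my own, chosen so that each \(a_i\) lands in \([0,1]\):

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def attention_glimpse(x, z, W, b):
    """Compute a = f_phi(x) and the glimpse g = a ⊙ z.

    f_phi is a single sigmoid layer with parameters phi = (W, b),
    so every attention weight a_i lies in [0, 1].
    """
    # a = sigmoid(W x + b), one weight per feature dimension.
    a = [sigmoid(sum(wij * xj for wij, xj in zip(row, x)) + bi)
         for row, bi in zip(W, b)]
    # g = a ⊙ z: elementwise gating of the feature vector.
    g = [ai * zi for ai, zi in zip(a, z)]
    return a, g
```

A weight near 1 passes the corresponding feature of \(z\) through; a weight near 0 suppresses it.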


There are two types of attention mechanisms: soft attention, which computes a deterministic, differentiable weighting over all inputs, and hard attention, which stochastically samples a subset of the inputs and, because sampling is not differentiable, is typically trained with reinforcement-style estimators.
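In transformers, the soft variant takes the concrete form of scaled dot-product attention, \(\mathrm{Attention}(Q, K, V) = \mathrm{softmax}(QK^\top / \sqrt{d_k})\,V\). Here is a minimal pure-Python sketch using lists of vectors rather than the batched matrix code a real implementation would use; the function names are illustrative:

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.

    Q, K, V are lists of vectors (lists of floats); each query
    attends over every key/value pair at once — no sequential pass.
    """
    d_k = len(K[0])
    out = []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        weights = softmax(scores)
        # Output is a convex combination of the value vectors.
        out.append([sum(w * v[i] for w, v in zip(weights, V))
                    for i in range(len(V[0]))])
    return out
```

Because every query looks at every key simultaneously, the whole computation parallelizes across positions, which is exactly what frees transformers from the RNN's step-by-step processing.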

Back to Transformers

Transformers have been used for language model pre-training, which in turn has proven effective for improving many natural language processing tasks. As of the release of BERT, two main strategies existed for applying pre-trained language representations to downstream tasks: feature-based methods and fine-tuning.

  1. Feature-based methods, such as ELMo, use task-specific architectures that include the pre-trained representations as additional features. For example, word embeddings (word vectors) from methods such as GloVe or Word2Vec are fed into the task model as inputs.
  2. The fine-tuning approach, such as the Generative Pre-trained Transformer (OpenAI GPT), introduces minimal task-specific parameters and trains on the downstream task by updating all of the pre-trained parameters.
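The difference between the two strategies can be sketched schematically. The sketch below is hypothetical: the callables `pretrained_encode`, `task_model_train`, and `train_step` stand in for real models and optimizers, and the code only illustrates which parameters each strategy actually learns:

```python
def feature_based(pretrained_encode, task_model_train, corpus, labels):
    # Feature-based (ELMo-style): the pre-trained encoder is frozen and
    # its representations are handed to a separate task-specific model
    # as additional input features.
    features = [pretrained_encode(text) for text in corpus]
    return task_model_train(features, labels)  # only task parameters are learned

def fine_tuning(pretrained_params, task_head_params, train_step, corpus, labels):
    # Fine-tuning (GPT/BERT-style): a small task head is added on top and
    # *all* parameters, pre-trained ones included, are updated end to end.
    params = pretrained_params + task_head_params
    for text, label in zip(corpus, labels):
        params = train_step(params, text, label)
    return params
```

The key contrast: feature-based transfer leaves the pre-trained weights untouched, while fine-tuning lets gradients from the downstream task flow through the entire pre-trained network.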