Assumed Audience: You are curious about BERT, transformers, and attention in deep learning.
BERT was introduced in the paper "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" by researchers from Google: Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. The paper surveys the then state of the art in language models and argues for something new: bidirectional pre-training. The authors compare against OpenAI's GPT model and ELMo, and show how their architecture is designed to pre-train transformers bidirectionally for language modelling.
Transformers are a form of neural network architecture designed to handle sequential data such as natural language, for tasks such as translation and text summarization. Unlike Recurrent Neural Networks (RNNs), transformers do not need to process sequential data in order; that is, they do not need to process the beginning of a sequence before the end. This makes them more efficient in terms of parallelization and training time.
Prior to transformers, the state-of-the-art sequential models relied on gated recurrent neural networks such as Long Short-Term Memory networks (LSTMs) and Gated Recurrent Units (GRUs) with attention mechanisms. Transformers built upon these attention mechanisms without using an RNN structure, showing that attention alone, without sequential processing, is powerful enough to match the performance of RNNs with attention. But what really is attention? Well, Attention Is All You Need.
Attention Mechanisms:
Attention mimics cognitive attention: in simple terms, the ability to focus on one thing and ignore the others. In our daily lives we are constantly bombarded with sensory information, most of which is noise, and our minds figure out a way to extract the signal from it. At a cocktail party, for example, we find ourselves able to listen to a single voice and pay attention to it, inadvertently drowning all the other chatter into the background. Alongside this external attention, there is a form of internal attention that allows us to hold a single thought, a single idea, at a time.
In Machine Learning, attention mechanisms equip the Neural Network with the ability to focus on a subset of its inputs or features.
Let \(x \in R^k\) be an input vector, \(z \in R^k\) a feature vector, \(a \in [0,1]^k\) an attention vector, \(g \in R^k\) an attention glimpse, and \(f_\phi(x)\) an attention network with parameters \(\phi\). Attention is typically implemented as:
$$a = f_\phi(x), g = a \odot z$$
There are two types of attention mechanisms: soft attention, where the entries of \(a\) can take any value in \([0,1]\) and the whole network stays differentiable, and hard attention, where each entry is constrained to be exactly 0 or 1, which requires sampling and cannot be trained with plain backpropagation.
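The glimpse equation above can be sketched in a few lines of NumPy. This is purely illustrative: the attention network \(f_\phi\) is stood in for by a fixed score vector, and the sampling scheme for hard attention is a simplification.

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(scores - scores.max())
    return e / e.sum()

def soft_attention(z, scores):
    """Soft attention: a in [0,1]^k, differentiable; glimpse g = a ⊙ z."""
    a = softmax(scores)
    return a, a * z

def hard_attention(z, scores, rng=None):
    """Hard attention: a in {0,1}^k, obtained here by sampling (illustrative only)."""
    rng = rng or np.random.default_rng(0)
    a = (rng.random(scores.shape) < softmax(scores)).astype(float)
    return a, a * z
```

In a real model, `scores` would be produced by the learned network \(f_\phi(x)\) rather than supplied by hand.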
Transformers have been used for language model pre-training, which in turn has proven effective for improving many Natural Language Processing tasks. As of the release of BERT, two main strategies existed for applying pre-trained language models to downstream tasks: feature-based methods and fine-tuning.
Assumed Audience: You want to know how computers see
Originally posted on Medium
AlexNet is a Convolutional Neural Network that rose to prominence when it won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), an annual challenge that evaluates algorithms for object detection and image classification at large scale (think of it as the World Cup of image classification algorithms).
The ILSVRC evaluates the success of image classification solutions using two important metrics: the top-1 and the top-5 error, computed over a set of N test images, each mapped to a target class. The top-1 error is the percentage of images for which the classifier did not give the correct class the highest score, while the top-5 error is the percentage of images for which the classifier did not include the correct class among its top five guesses.
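As a quick sketch (not from the paper), both metrics can be computed with the same helper, assuming a matrix of class scores with one row per test image:

```python
import numpy as np

def top_k_error(scores, targets, k):
    """Fraction of images whose true class is NOT among the k highest-scoring guesses."""
    top_k = np.argsort(-scores, axis=1)[:, :k]      # indices of the k best guesses per image
    hit = (top_k == targets[:, None]).any(axis=1)   # is the true class among them?
    return 1.0 - hit.mean()
```

`top_k_error(scores, targets, 1)` gives the top-1 error and `top_k_error(scores, targets, 5)` the top-5 error.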
AlexNet achieved a top-5 error of around 16%, which was an extremely good result back in 2012. To put this in context, the next best result trailed far behind at 26.2%. When the dust settled, deep learning became cool again, and over the next few years multiple teams would build CNN architectures that beat human-level accuracy. The architecture used in the 2012 paper is popularly called AlexNet after its first author, Alex Krizhevsky. In this blog post, we will look at the details of the AlexNet architecture and try to re-implement it in Keras. Let's dive in!
The model was trained on ImageNet data, which contains about 1.2 million images across 1,000 categories.
Since ImageNet images come in variable resolutions, and the model presented in this paper requires fixed-size inputs, every image was scaled to 256x256 pixels like so: scale the (possibly rectangular) image so that its shorter side is 256 pixels, then take the central 256x256 patch as the input image.
![image](/img/alexnet-crop.png)
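The two-step rescale-and-crop can be expressed as a small framework-agnostic helper (a sketch, not the authors' code) that computes the resize dimensions and the centered crop box; the box can then be fed to, say, PIL's `Image.crop`:

```python
def scale_and_center_crop(width, height, side=256):
    """Scale so the shorter side equals `side`, then center-crop a side x side patch.
    Returns the resize dimensions and the (left, top, right, bottom) crop box."""
    scale = side / min(width, height)
    new_w, new_h = round(width * scale), round(height * scale)
    left, top = (new_w - side) // 2, (new_h - side) // 2
    return (new_w, new_h), (left, top, left + side, top + side)
```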
This architecture used two forms of data augmentation. The first consists of generating image translations and horizontal reflections, which increased the size of the dataset by a factor of 2048; without it, the network would have suffered from substantial overfitting. The second involved perturbing the R, G, B values of each input image by a scaled version of the principal components (in RGB space) across the whole training set. This scheme helped reduce the top-1 error rate by over 1%.
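The factor of 2048 comes from taking 224x224 crops at 32x32 different offsets within the 256x256 image and mirroring each one (32 x 32 x 2 = 2048). A quick sketch (grayscale for simplicity, and using the paper's count of 32 offsets per axis) confirms the arithmetic:

```python
import numpy as np

def translation_reflection_views(img, crop=224):
    """Yield every translated crop of a square image plus its horizontal mirror."""
    side = img.shape[0]
    for top in range(side - crop):          # 32 vertical offsets for 256 -> 224
        for left in range(side - crop):     # 32 horizontal offsets
            patch = img[top:top + crop, left:left + crop]
            yield patch
            yield patch[:, ::-1]            # horizontal reflection
```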
AlexNet is made up of eight trainable layers: five convolutional layers and three fully connected layers. All of the trainable layers are followed by a ReLU activation function except for the last fully connected layer, where a softmax function is used. The architecture also consists of non-trainable layers: three pooling layers, two normalization layers, and one dropout layer (used to reduce overfitting).
The authors chose to use the Rectified Linear Unit (ReLU) activation function, having observed that deep convolutional neural nets with ReLUs trained several times faster than their equivalents with tanh units. Pitted against the tanh activation with no other changes, they were able to train their model to a 25% error rate on the training set six times faster with the ReLU activation.
Training used Stochastic Gradient Descent with a learning rate of 0.01, momentum of 0.9, and weight decay of 0.0005. It was done on two GTX 580 GPUs, each with 3GB of memory, and the parallelism setup is quite interesting: the network is split into halves across the two GPUs, as can be seen in the model description figure.
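Here is a minimal single-stream Keras sketch of the architecture. It is an approximation, not the authors' code: the original splits each layer across two GPUs, and I've left out the local response normalization layers and weight decay for brevity. A 227x227 input makes the convolution arithmetic work out (the paper quotes 224x224).

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_alexnet(num_classes=1000):
    """Single-stream AlexNet sketch: 5 conv + 3 fully connected layers."""
    model = keras.Sequential([
        layers.Input(shape=(227, 227, 3)),
        layers.Conv2D(96, 11, strides=4, activation="relu"),   # -> 55x55x96
        layers.MaxPooling2D(3, strides=2),                     # -> 27x27x96
        layers.Conv2D(256, 5, padding="same", activation="relu"),
        layers.MaxPooling2D(3, strides=2),                     # -> 13x13x256
        layers.Conv2D(384, 3, padding="same", activation="relu"),
        layers.Conv2D(384, 3, padding="same", activation="relu"),
        layers.Conv2D(256, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(3, strides=2),                     # -> 6x6x256
        layers.Flatten(),                                      # -> 9216
        layers.Dense(4096, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(4096, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation="softmax"),
    ])
    # The paper's SGD settings (weight decay omitted here; it can be added
    # per-layer with a kernel_regularizer).
    model.compile(
        optimizer=keras.optimizers.SGD(learning_rate=0.01, momentum=0.9),
        loss="categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model
```

Calling `build_alexnet().summary()` shows the layer shapes annotated in the comments above.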
This work was the first of its kind to train deep convolutional networks on GPUs and achieve such impressive results on the ImageNet image classification dataset. I hope you got to learn something.
Later.