AlexNet  —  A brief Review

A Brief Review of AlexNet of 2012

Assumed Audience: You want to know how computers see

Originally posted on Medium

AlexNet is a Convolutional Neural Network that rose to prominence when it won the Imagenet Large Scale Visual Recognition Challenge (LSVRC), which is an annual challenge that evaluates algorithms for object detection and image classification at large scale (think of it as the World cup for image classification algorithms).

The ILSVRC evaluates the success of image classification solutions by using two important metrics, the top-5 and the top- 1 error. When given a set of N images, often called test images and mapped to a target class for each metric. The Top-1 Error is the percentage of the time the classifier did not give the correct class the highest score while the top-5 error is the percentage of the time that the classifier did not include the correct class among its top 5 guesses.

AlexNet received a top-5 error around 16% which was an extremely good result back in 2012. To put in context, the next best result trailed far behind (26.2%). When the dust settled deep learning became cool again and in the next few years, multiple teams would build CNN architectures that would beat human-level accuracy. The architecture used in the 2012 paper is popularly called AlexNet after the first author Alex Krizhevsky. In this blog post, we will have a look at the details of the Alexnet architecture and try to re-implement it in Keras. Let’s dive in!


The model was trained on imageNet data which contains about 1.2 million images of 1000 categories.

Image pre-processing

Since ImageNet images are variable resolution, and the model presented in this paper requires fixed-size images, they scaled every image to 256x256 pixels. The scaling like so: Scale a possibly rectangular image so that the shorter side is 256 pixels. Take the middle 256x256 patch as the input image.


Data Augmentation

This architecture used two forms of data augmentation. Translation and reflection. This consists of generating image translations and horizontal reflections. This scheme helped increase the size of the data by a scale of 2048 without which the network would have suffered from substantial overfitting. The other way they augmented the dataset involved perturbing the R,G,B values of each input image by a scaled version of the principal components (in RGB space) across the whole training set. This scheme helped reduce the top-1 error rate by over 1%.


AlexNet is made up of eight trainable layers, five convolution layers and three fully connected layers. All of the trainable layers are followed by a ReLu activation function except for the last fully connected layer where a softmax function is used. The architecture also consists of non-trainable layers: Three pooling layers, two normalization layers and one dropout layer (used to reduce overfitting).

[Image ]

Choice of Non-Linearity.

The authors chose to use the Rectified Linear Unit (ReLU) function. They saw that deep convolutional neural nets with ReLUs trained several times faster than their equivalents with tanh units. When pitted against the tanh activation with no other changes, they were able to train their model to 25% error rate on the training set 6X faster with the ReLU activation.


Stochastic Gradient Descent with learning rate 0.01, momentum 0.9 and weight decay 0.0005 is used. The training is done on two GPUs (GTX 580) for parallelism, and the setup is quite interesting. The GPUs used each have 3GB memory. The network is split into halves, as can be seen in the model description figure, across the two GPUs.


(code goes here)


This work was the first of its kind to have trained deep convolutional networks on GPUs to achieve impressive results on the ImageNet dataset for image classification. I hope you got to learn something.