Text recognition from images is an important task that has multiple real-world applications such as text localization   , transcription of text into digital format   , car plate reading     , automatic check reading  , classifying text from unlabeled/partially labeled documents  , recognizing road signs and house number   , etc. Traditionally hand designed features are used to for image classification      . However, these techniques require a huge amount of engineering effort, and often do not generalize to novel situations.
Recent techniques in deep learning have allowed efficient automatic learning of features that are superior to hand designed features. As a result, we are able to train classifier that is significantly more accurate compared to previous methods. In this paper, we investigate using deep learning to classify handwritten digits, and show that with a simple deep network, we can classify digits with near-perfect accuracy.
We test our methods on the MNIST dataset  . This dataset consists of 50,000 training digit images and 10,000 testing images and is an important benchmark for deep learning methods. Samples images from the dataset are shown in Figure 1. On this dataset, we achieve an accuracy of 99.3% on the test set.
We also investigate classifying multiple digits, where more than one digit is present in an image. An example of this task is shown in Figure 2. We design a novel method of applying classifier of a single digit to an image with multiple digits. Though the number of digits and their location is unknown a-priori, our method is able to accurately localize and classify all the digits in the image.
2. Digit Classification with Deep Networks
2.1. Supervised Learning
A supervised learning task consists of two components, the input x and label y. For example, the input can be images of handwritten digits, or image of natural objects, and the label is the corresponding digit class or object class. The goal is to learn the correct mapping f from input x to label y. To accomplish this a learner is provided with examples of the correct mapping where xi is an example input and yi is the corresponding label provided by human annotators. Ideally after learning, f should map each input in the dataset to the correct label, i.e. f(xi) = yi. The hope is that the learner can learn the correct mapping between x and y based on these examples, so that on unseen data, the learner f can also correctly classify.
Usually f is selected from a class of functions indexed by a parameter θ. For example, the class of functions can be quadratic functions
Figure 1. Samples from the MNIST dataset.
Figure 2. Classifying multiple digits.
in this example, are the parameters. We will denote the function selected by a parameter choice as .
To encourage the learner to select a that maps each to the correct we define a loss function such as
In general any loss function that takes a smaller value when f(xi) is closer to yi can be used. For classification tasks we use a special class of functions that outputs a probability distribution. That is for each , is the probability the input belongs to the j-th class. Then we can use the cross-entropy loss
where K is the number of classes, and if and equals to 0 otherwise.
To train the model, we use gradient descent on the loss function Lθ. This is described by the following process:
1) We start from a random parameter θ that can be arbitrarily chosen.
2) We compute the gradient of the loss function ∇θLθ. Computation of this gradient is discussed in the next section.
3) We update the parameters by θ: = θ − α∇θLθ. This changes θ in the direction that minimizes the loss Lθ. α is the learning rate that controls the step size. The larger the step size, the faster θ changes. However, step size that is too large may lead to instability or even divergence. Therefore, the learning rate α is an important hyperparameter that is selected based on the specific problem.
4) We repeat from Step 2 until θ stops changing.
The above algorithm reduces Lθ during each iteration. The hope is that when Lθ is minimized, fθ(xi) will be close to yi, that is, the function fθ we selected can correctly predict the label yi given xi on the training set.
However, even if fθ correctly predicts every example we provided, this does not mean that fθ will classify correctly on new data. For example, fθ may have only memorized the training dataset. Therefore we need additional examples (xj,yj), that the learner has not seen during training. The learner should only be able to classify these new examples correctly if it has learned the correct mapping between x and y. We can compute the testing accuracy by dividing the number of examples fθ correctly classifies by the total number of examples. This is the final measurement of performance that we use to evaluate our learner.
2.2. Deep Networks
In the previous section we left an open question: which class of functions to select from during training. This section introduces an important function class of deep networks    .
The key idea of deep learning is to compose very simple functions into a very complex function . Each function is a simple function with parameters θi. Then the parameters of f is simply the combined parameters of all the layers . Common functions used in deep learning include
1) Matrix multiplication g(x) = Ax + b where the parameters are matrix A and vector b.
2) Rectified Linearity (ReLU) 
This function does not contain a parameter.
3) Softmax function: the softmax function “squashes” a n-dimensional vector of arbitrary real values to a n-dimensional vector of real values in the range [0, 1] that add up to 1. The function applied to an n-dimensional input vector z is given by
Note that sigmoid naturally produces a distribution because the output sum to 1
4) Convolution     : the convolution function takes as input an array z of size M × N × K and output an array o of size M × N × C. The first two dimensions can be interpreted as width and height, while the third is the number of “channels”. This function takes the input z, and applies a 2D-convolution operation defined as
where S is called the size of the ﬁlter map, and each is an array of size S × S × K. All the combined is the set of parameters of the convolution function.
5) Pooling: Pooling is a process which reduces a M × N × K array into a smaller array, e.g. of size . Usually we keep the number of “channels” unchanged. For example, in the digit classification task, it can reduce the scale of a clearer picture into a more ambiguous one, making it easier to process in the later steps.
We shall denote the output of as . Then , etc. Finally, we have . Intuitively the network must map raw input images into highly abstract and meaningful labels, which is a highly complex mapping. The network accomplishes through a sequence of simple mappings composed together. Each function can be viewed as one “layer” of a network. This function processes the output of the previous functions hi−1 into higher level representations hi. The network therefore can be viewed as processing the input through a sequence of “layers” whose output become increasingly more high level and abstract, until we finally reach the output layer, which corresponds to the labels. This intuition is illustrated in Figure 3.
2.3. Computing Gradients
Now that we have defined our model class, to implement the algorithm in Section 2.1, we must be able to compute the gradient ∇θLθ. This is accomplished with the back-propagation algorithm    .
The back-propagation algorithm sequentially computes Intuitively, this tells us how each hidden layer must change to minimize loss Lθ. When all the are simple functions, we can compute ∇hi−1Lθ from ∇hiLθ analytically, and this can be computed automatically by software such as Tensorflow  . Given gradient ∇hiLθ over each layer hi, we can correspondingly compute the gradient ∇θiLθ over parameters θi analytically. This can also be automatically computed by Tensorflow.
Intuitively the computation flows “backward” through the next (hence the name back-propagation). We compute gradient in the following sequence
2.4. Detecting and Localizing Multiple Digits
In many real-world problems, such as car plate detection     or house number recognition  there are multiple digits in the same image, and their location is unknown to us. Therefore, not only do we want to classify existing digits, we would also like to locate where the digits are, and how many there are. We show that based on the deep classifier we trained before we can design an algorithm to accomplish this. What we need is that given an image patch we must identify both whether there is a digit in the image patch, and what digit it is, if there is one. If we can accomplish this, then we may simply apply this method to each patch of our input image, and we will be able to localize and classify all the digits in the image.
Previously we trained a classifier that takes as input an image, and outputs a probability distribution over all possible digits. We observe that when the input is an image that do not contain any digit, the output is a distribution
Figure 3. Illustration of a deep network.
with high entropy, that is, the network is not highly confident that any digit has been observed. On the other hand, when presented with an image that contains a digit, the output is a distribution with low entropy, and the network generally outputs the correct digit with very high confidence.
We can then take advantage of this property. We measure the difference between the highest probability score and the second highest probability score. If the image contains a digit, the top prediction should have high probability score compared to the second highest. If the image does not contain a digit, all the possible predictions should be assigned similar probability and there should not be a significant difference. We show that this approach works very well in practice and we are able to accurately discover digits in an image in the experiments.
3.1. Experiment Setting
We use 50,000 digit figures from the MNSIT training dataset to accomplish our training. Each example is a 28 by 28 single-color image. Our network architecture is as follows
1) A convolution layer with filter map of size 5 that takes as input the 28 × 28 × 1 image and outputs a feature map of shape 28 × 28 × 32
2) A pooling layer that reduces the size from 28 × 28 × 32 to 14 × 14 × 32
3) A ReLU layer
4) A convolution layer with filter map of size 5 and outputs a feature map of shape 14 × 14 × 64
5) A pooling layer that reduces the size from 14 × 14 × 64 to 7 × 7 × 64
6) A matrix multiplication layer that maps a vector of size 7 × 7 × 64 to 1024
7) A ReLU layer
8) A matrix multiplication layer that maps a vector of size 1024 to 10
9) A softmax layer
We train our network with gradient descent with a learning rate of for 20,000 iterations. We also use a new adaptive gradient descent algorithm known as Adam  which has been shown to perform better on a variety of tasks. Because shifting a digit does not change its class, during training we also randomly shift the digit by up to 6 pixels in each direction to augment the dataset. This makes the network more robust to shifting of the digit and improves testing accuracy.
For multi-digit classification, we first extract all 28 by 28 image patches with a stride of 2. Then we run our classification network on all the patches, we take the most confident digit prediction in a region as our digit class prediction.
3.2.1. Single Digit Classification
After training our network, we use another 10,000 test data to test the accuracy of our network. We achieved a testing accuracy of 0.993, which indicates that the network only makes a mistake in 7 out of every 1000 digits. We show the training curve in Figure 4. It can be observed that accuracy improves very quickly in the first 5000 iterations, then improves gradually until we reach approximately 99% accuracy on both the training set and testing set. No overfitting is observed.
3.2.2. Multiple Digit Classification
For the multi-digit classification, we show in Figure 5 the response of each digit detector at different locations of the sample input. The redder a region is, the more confident the classifier predicts that digit at that image patch. It can be seen that at correct digit locations, the detector shows consistently confident predictions throughout the region. This can be used to identify a region as containing a digit.
We also show in Figure 6 the confidence score that we computed. Redder color indicates presence of a digit. The location where the confidence score is high corresponds very well to where digits are present.
This paper applies deep networks to digit classification. Instead of hand designed features, we automatically learn them with a deep network and the back-propagation algorithm. We use a convolutional neural network with ReLU activations. In addition, we use pooling layers to remove unnecessary detail and learn higher level features.
We train our network with stochastic gradient descent. Training progresses quickly, we are able to achieve 90% accuracy with only 1000 iterations. After 100 k iterations, we achieve test performance of 99.3% on the MNIST dataset.
We also study multi-digit classification and propose a method to detect digits in an image with multiple digits. We utilize the fact that our classifier produces a probability distribution. We observe that when the input contains a digit, the classifier produces a distribution with low entropy and high confidence on
Figure 4. Training Curve. On the x-axis we plot the number of training iterations in log scale. On the y-axis we plot the classification accuracy on the test set. It can be observed that accuracy improves very quickly initially, reaching approximately 90% accuracy with only 1000 iterations. After that accuracy improves slowly. Eventually we reach an accuracy of 99.3% on the test set.
Figure 5. Detection of Multiple Images. Left and right are two examples where our model is able to localize the digits in an image with multiple digits at random positions. We apply our classifier to each patch of the image, and the output of each classification label. Redder color corresponds to higher confidence of the presence of that digit, and blue corresponds to low confidence.
Figure 6. Confidence score indicate the present of a digit. The score is higher (redder) where there is a digit and lower (bluer) when there is not.
the correct label. On the other hand, when the input does not contain a digit, the classifier produces an almost uniform distribution. We use this different to detect whether an image patch contains an image. We experiment on multiple digit detection and our method is able to successfully localize digits and classify them.
Future work should further improve accuracy and handle different size of digits in the multi-digit detection task.