Even the most common neural network architectures have a lot of parameters. ResNet50, a frequently used baseline model, has ~25 million. This means that during training, we perform a search in a 25 million dimensional parameter space.
To put this number in perspective, let’s look at a cube in this space. An n-dimensional cube has 2ⁿ vertices, so in 25 million dimensions, we are talking about 2^25,000,000 points. In a search grid, this would be just a single element. For comparison, the number of atoms in the observable universe is around 10⁸². It is safe to say that the magnitude of this problem is incomprehensible to us humans.
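If you want to sanity-check this, a two-line Python calculation gives the number of decimal digits of 2^25,000,000:

```python
import math

d = 25_000_000                             # number of parameters (dimensions)
print(math.floor(d * math.log10(2)) + 1)   # 7525750 decimal digits in 2**d
```

That is a number with roughly 7.5 million digits, while 10⁸² has only 83.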
Thus, reducing the number of parameters would have several benefits. A sparse network is not only smaller, but it is also faster to train and use. Where hardware is limited, such as on embedded devices or smartphones, speed and size can make or break a model. In addition, more complex models are more prone to overfitting, so restricting the search space also acts as a regularizer.
However, this is not a simple task, as reducing the model’s capacity can also lead to inaccuracy. Thus, there is a delicate balance between complexity and performance. This post will take a deeper look into the challenges and potential solutions.
Weight pruning
The simplest setting is to reduce the number of parameters after training. This would not help with the training itself, but it would reduce the computational requirements of inference.
The process of eliminating weights is called pruning. (I will use weights and parameters interchangeably from now on.) Its origins go back to the famous paper Optimal Brain Damage by Yann LeCun, John S. Denker and Sara A. Solla.
They proposed the following iterative pruning method:
- Train the model.
- Estimate the saliency of each weight, which they define as the change in the loss function when the weight is perturbed. The smaller the change, the more negligible the weight's effect on the training.
- Remove weights with the lowest saliency. (That is, set their value to zero and keep it there for the rest of the process.)
- Go back to Step 1 and retrain the pruned model.
Continuing the training with pruned weights is necessary. The authors observed that without it, the objective function (a.k.a. the loss) increases significantly when a large portion of the weights are removed.
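To make the notion of saliency concrete, here is a brute-force PyTorch sketch that scores each weight by the loss change when it is zeroed out, matching the definition above. It is only feasible for tiny models; the actual Optimal Brain Damage method replaces the exhaustive loop with a second-order approximation of the loss, and `loss_fn`, `inputs`, and `targets` are placeholders for your own setup.

```python
import torch

def weight_saliency(model, loss_fn, inputs, targets):
    """Score each weight by how much the loss changes when it is set to zero.

    Brute-force illustration only: one forward pass per weight.
    Call model.eval() first so layers like BatchNorm behave deterministically."""
    saliencies = {}
    with torch.no_grad():
        base_loss = loss_fn(model(inputs), targets).item()
        for name, param in model.named_parameters():
            flat = param.data.view(-1)
            scores = torch.zeros_like(flat)
            for i in range(flat.numel()):
                original = flat[i].item()
                flat[i] = 0.0                                   # "remove" the weight
                scores[i] = loss_fn(model(inputs), targets).item() - base_loss
                flat[i] = original                              # restore it
            saliencies[name] = scores.view_as(param)
    return saliencies
```

Weights whose removal barely changes the loss get a score near zero and are the first candidates for pruning.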
One particular challenge arises when the pruned network is retrained: training turned out to be much more difficult due to the decreased capacity. The solution to this problem arrived later, along with the so-called Lottery Ticket Hypothesis, which put the problem into an entirely new perspective.
The Lottery Ticket Hypothesis
The chances of winning the jackpot of a lottery are very small. For example, if you are playing Powerball, your odds of winning are precisely 1 in 292,201,338 per ticket. What are your chances if you purchase n tickets? If the probability of winning with a single ticket is p, then the probability of not winning is 1 − p. When we buy n tickets, the probability that none of them wins is (1 − p)ⁿ, so the probability that at least one of them wins is simply 1 − (1 − p)ⁿ. If n is large, this can be arbitrarily close to one. (In case you are not familiar with probability, check out my introductory article!)
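As a quick sanity check, here is the same formula in Python, plugged in with the Powerball odds above (and assuming the tickets are independent):

```python
p = 1 / 292_201_338                  # probability that a single ticket wins

def win_probability(n):
    """Probability that at least one of n (independent) tickets wins."""
    return 1 - (1 - p) ** n

print(win_probability(1))            # ~3.4e-9
print(win_probability(10_000_000))   # ~0.034
print(win_probability(10**9))        # ~0.97
```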
What does this have to do with neural networks? Before training, the weights of a model are initialized randomly. Is there a subnetwork of a randomly initialized network that “won the initialization lottery”? In their work, Jonathan Frankle and Michael Carbin state the Lottery Ticket Hypothesis:
A randomly-initialized, dense neural network contains a subnetwork that is initialized such that — when trained in isolation — it can match the test accuracy of the original network after training for at most the same number of iterations.
The most significant part of the claim is matching the accuracy of the full network with at most the same number of training steps. As we have seen before, this is the biggest challenge of pruning: training smaller networks can be more difficult due to the decreased modeling capacity.
How do we find such initialization lottery winners? Do they even exist? The authors found a way to identify such subnetworks in certain architectures consistently. Their method was the following.
- Randomly initialize a neural network.
- Train the network for n training steps.
- Remove k% of the weights with the lowest magnitude.
- Reset the remaining weights to the value they had during the random initialization.
- Go back to Step 2 and iterate the training and pruning.
There are two key steps here compared to previous methods. First, the weights are removed simply according to their magnitude. Second, the weights of the pruned network are not reinitialized but reset to their values from the original random initialization. This proved to be essential: randomly reinitializing the pruned network led to significantly worse performance when more than 80% of the weights were removed.
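Here is a minimal PyTorch sketch of this iterative magnitude pruning loop with weight rewinding. It is a simplification rather than the paper's exact procedure: `train_fn(model, masks)` stands in for your own training loop and is expected to keep masked weights at zero.

```python
import copy
import torch

def find_winning_ticket(model, train_fn, rounds=5, prune_fraction=0.2):
    """Iterative magnitude pruning with weight rewinding, in the spirit of
    Frankle & Carbin. A simplified sketch, not the paper's exact code."""
    initial_state = copy.deepcopy(model.state_dict())    # remember the original init
    masks = {name: torch.ones_like(p) for name, p in model.named_parameters()
             if p.dim() > 1}                              # prune weight matrices only

    for _ in range(rounds):
        train_fn(model, masks)                            # Step 2: train the network

        with torch.no_grad():
            for name, p in model.named_parameters():
                if name not in masks:
                    continue
                # Step 3: among surviving weights, drop the smallest-magnitude fraction.
                surviving = p.abs()[masks[name].bool()]
                threshold = torch.quantile(surviving, prune_fraction)
                masks[name] *= (p.abs() > threshold).float()

        # Step 4: rewind the surviving weights to their initial values.
        model.load_state_dict(initial_state)
        with torch.no_grad():
            for name, p in model.named_parameters():
                if name in masks:
                    p *= masks[name]

    return model, masks
```

Details such as per-layer versus global pruning rates and the exact rewinding point matter in practice; the sketch only captures the overall structure of the loop.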
In general, achieving a good sparsity-accuracy tradeoff is a difficult problem. It is a very active research field, and the state of the art keeps moving.
Although this method was a vast improvement, it did not fix one significant issue: the pruned subnetworks still required retraining. Surprisingly, this was solved by turning the entire problem upside down.
Weight Agnostic Neural Networks
Up until this point, we have started from a large neural network, trimming it down iteratively to compress it. However, there is an alternative approach with the opposite logic: instead of pruning, we could start with a minimal architecture and add to it incrementally.
This was the starting point of Adam Gaier and David Ha in their recent paper Weight Agnostic Neural Networks. In addition to obtaining a minimal network, they had a more ambitious goal: the resulting architectures should perform well with random weights. If you recall, the biggest weakness of weight pruning methods was that the resulting subnetworks were difficult to train. This way, we wouldn't need training.
In essence, their method is an evolutionary search in the space of possible architectures. Compared to previous methods, the search favors simplicity and the weight-agnostic property.
- Create a set of minimal neural network architectures that can be embedded into a single parent architecture. There are no weight values at this point, just connections. Basically, each network is equivalent to a specific pruning of the parent architecture.
- Test out the networks with sets of shared weights. Sharing the weights is essential since this is what forces the search to prefer weight-agnostic architectures.
- Rank the networks according to performance and complexity. Because the weights were shared between the networks, the performance also reflects the weight-agnostic capability.
- Create new networks by taking the top-ranking networks and randomly modifying them.
- Take the obtained networks and go back to Step 2.
The authors summarize the process in the following figure.
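To make the shared-weight evaluation and ranking steps more concrete, here is a toy Python sketch. Representing each candidate as a binary connection mask, the `evaluate(mask, weight)` function, and the single-bit mutation are illustrative assumptions rather than the authors' implementation, and the ranking below ignores the complexity term used in the paper.

```python
import random
import statistics

# A handful of shared weight values; every active connection uses the same value.
SHARED_WEIGHT_VALUES = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]

def weight_agnostic_score(mask, evaluate):
    """Average task performance over several shared weight values (Step 2)."""
    return statistics.mean(evaluate(mask, w) for w in SHARED_WEIGHT_VALUES)

def mutate(mask):
    """Create a new candidate by flipping a single connection on or off (Step 4)."""
    child = list(mask)
    child[random.randrange(len(child))] ^= 1
    return child

def evolve(population, evaluate, generations=50, keep=10):
    """Rank by shared-weight performance and breed the best (Steps 2-5)."""
    for _ in range(generations):
        population.sort(key=lambda m: weight_agnostic_score(m, evaluate), reverse=True)
        parents = population[:keep]
        offspring_per_parent = max(1, (len(population) - keep) // keep)
        children = [mutate(p) for p in parents for _ in range(offspring_per_parent)]
        population = parents + children
    return max(population, key=lambda m: weight_agnostic_score(m, evaluate))
```

In the actual search, mutations can also insert nodes, add connections, and change activation functions, and candidates are ranked on both performance and complexity rather than performance alone.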
The method works for several machine learning tasks, not just for handwritten digit recognition. They were able to engineer minimal weight-agnostic architectures for several continuous control problems, such as pole balancing and car racing.
So far, this method brings us the closest to finding minimal architectures that don’t require training. Yet, there is much to do. Finding accurate weight-agnostic networks is still unsolved for more complex tasks, like ImageNet classification.
Pruning in practice
Up until now, we have talked about only theoretical aspects of pruning. I want to finish by collecting some available tools that you can start using right away.
If you are a TensorFlow user, there are several built-in tools for model optimization, including pruning.
- Introducing the Model Optimization Toolkit for TensorFlow
- TensorFlow Model Optimization Toolkit — Pruning API
Analogously, PyTorch has similar functionalities. You can find their tutorial about weight pruning here, but I have also written a quick summary about this and other available tools like model quantization.
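For example, magnitude-based pruning of a single layer with PyTorch's built-in `torch.nn.utils.prune` module takes only a few lines:

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))

# Zero out the 40% of weights with the smallest L1 magnitude in the first layer.
prune.l1_unstructured(model[0], name="weight", amount=0.4)

# The layer now holds `weight_orig` and a binary `weight_mask`; `weight` is their product.
sparsity = float((model[0].weight == 0).sum()) / model[0].weight.numel()
print(f"sparsity of first layer: {sparsity:.2f}")

# Make the pruning permanent by removing the reparametrization.
prune.remove(model[0], "weight")
```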
Summary
Machine learning doesn’t stop at training accurate models. When deploying to production, speed and size matter a lot. To fit complex architectures into production constraints, one can reduce the models' size by pruning certain weights. (Along with many other methods, which we haven’t discussed.)
This is not a simple task since a loss in accuracy must be taken into account. In addition, smaller architectures require retraining, which can be more difficult due to the decreased modeling capacity. Here, we have reviewed two weight pruning methods and a complementary architecture search approach that progressively address these challenges.
However, this is just the tip of the iceberg. Weight pruning is an active research area, especially for more complex tasks, like large-scale image recognition. (Think ImageNet size.)
Pruning methods are available in modern deep learning frameworks such as TensorFlow and PyTorch, so you can start experimenting with them right away in your work, perhaps even attempting to push the boundaries of the state of the art. So, if you have found this subject interesting, go and put these awesome methods into practice :)
If you are interested in making neural networks faster by pruning or compressing their weights, check out my following articles!
- How to accelerate and compress neural networks with quantization
- Can a neural network train other networks?
- How to compress a neural network
References
[1] Optimal Brain Damage by Yann LeCun, John S. Denker and Sara A. Solla
[2] The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks by Jonathan Frankle and Michael Carbin
[3] Weight Agnostic Neural Networks by Adam Gaier and David Ha