Small Data = Transfer Learning?

Shawn Pachgade
7 min read · Nov 29, 2020

With the current state of deep learning, the name of the game is amassing large quantities of data to train a high-capacity model like BERT or ResNet. As machine learning practitioners, however, we don't always have a huge dataset to leverage. Sometimes we are stuck with only a few hundred or a few thousand data points. Without a large amount of data, these high-capacity state-of-the-art models are prone to overfitting and may never learn the primitives of a particular domain, for example detecting edges and contours in images. So how do we apply deep learning to a limited dataset?

The standard go-to answer in recent years is transfer learning: take someone else's model that was pretrained on a large dataset like ImageNet or Wikipedia, then fine-tune it on your own target task and dataset. While this solution has proven to work countless times, it is not a panacea. Some datasets come from a very specialized domain, where the primitive features of the pretrained model's source dataset differ wildly from those of the target dataset. A good example is medical imaging: X-rays of different parts of the body look nothing like the photographs a network sees when trained on ImageNet. In fact, fine-tuning on a target dataset that is too dissimilar can lead to the phenomenon of negative transfer, where the fine-tuned network actually performs worse than one trained from scratch because of the representation specificity of the pretrained model.

An example lung cancer image (left) and an example ImageNet picture (right). A model pretrained on images like the one on the right might have trouble classifying the image on the left.

So where does that leave us? If transfer learning does not work for some use cases, and training a standard classifier from scratch leads to underwhelming results, we need to rethink our assumptions about the fundamentals of neural network architectures.

Cosine Loss

One of the key assumptions to break is that cross-entropy is the one true classification loss function. Cross-entropy is widely celebrated as the favored loss for any multi-class classification task, but that doesn't mean it is the only option. A paper by Barz et al. reports interesting results from training a model from scratch using the cosine loss instead.

Cosine similarity and cosine loss, where f is the prediction vector and phi is the target vector (Source)

Some of the underlying differences between cosine loss and cross-entropy lead to benefits we can capitalize on. One distinguishing factor is that cosine loss operates on unit vectors, which means the class prediction vectors must be L2-normalized. Intuitively, this amounts to forgoing the magnitude of the vector and focusing only on its DIRECTION. It doesn't matter how confident you are along that direction; getting the direction right is enough to attain a good cosine loss. Discarding the magnitude entirely may feel a bit unsettling, but previous literature has shown that in high-dimensional spaces, direction empirically captures more information than magnitude. Thus, by omitting magnitude, we have made our optimization objective easier! We have also effectively regularized the model, since the magnitude is no longer available as something to overfit on.
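
To make the direction-only intuition concrete, here is a minimal PyTorch sketch of a cosine loss with one-hot target embeddings. The `cosine_loss` helper, its tensor shapes, and the toy batch are my own illustration, not code from the paper.

```python
import torch
import torch.nn.functional as F

def cosine_loss(logits, targets, num_classes):
    """Cosine loss with one-hot target embeddings.

    L2-normalizing both vectors discards magnitude and keeps
    only direction, as described above.
    """
    # One-hot embeddings for the target classes
    target_vecs = F.one_hot(targets, num_classes).float()

    # Unit vectors: direction only, magnitude discarded
    pred_unit = F.normalize(logits, p=2, dim=1)
    target_unit = F.normalize(target_vecs, p=2, dim=1)

    # Cosine similarity per example, then 1 - similarity as the loss
    cos_sim = (pred_unit * target_unit).sum(dim=1)
    return (1.0 - cos_sim).mean()

# Example: a batch of 4 predictions over 10 classes
logits = torch.randn(4, 10)
labels = torch.tensor([3, 1, 7, 0])
print(cosine_loss(logits, labels, num_classes=10))
```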

Results from Barz et al.

The above table shows the performance of different loss functions and methodologies on various datasets. Comparing “softmax + cross-entropy” to “cosine loss (one-hot embeddings)”, you see dramatic improvements in test performance when training neural networks from scratch. A mixture of the two, “cosine loss + cross-entropy (one-hot embeddings)”, attains further marginal improvements. This reinforces the idea that direction captures much of the information in these vectors, but also shows that prediction confidence is not useless as a signal in the loss function. In both cases, these approaches cannot beat the “fine-tuned” variants that achieve roughly 80%+ accuracy on all tasks, but if you find yourself suffering from negative transfer, training from scratch with the cosine loss (plus cross-entropy) is a worthwhile endeavor.
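
For completeness, the mixed objective can be sketched as the cosine term plus a weighted cross-entropy term. The `lam` weighting below is an illustrative assumption, the cross-entropy is applied directly to the logits as a simplification rather than following the paper's exact formulation, and the `cosine_loss` helper is reused from the previous sketch.

```python
import torch.nn.functional as F

def cosine_plus_xent_loss(logits, targets, num_classes, lam=0.1):
    """Illustrative combination of cosine loss and cross-entropy.

    lam weights the cross-entropy term; this is a simplified sketch
    of the combined objective, not the paper's exact setup.
    """
    cos_term = cosine_loss(logits, targets, num_classes)  # from the sketch above
    xent_term = F.cross_entropy(logits, targets)
    return cos_term + lam * xent_term
```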

Moving on, what else can we do with our small datasets? When fine-tuning a large pretrained model, we have to take care not to forget the features the network originally learned. This can be done by carefully adjusting the learning rate: not so large that the pretrained features are overwritten, but not so small that we fail to converge quickly. As you can see, this approach requires careful hyperparameter tuning and may not be the easiest way to get good initial results. Alternatively, we can train only the last layer of the network to avoid this issue, but previous literature has shown that freezing layers can make it difficult for the network to learn features specific to the dataset. As the number of frozen layers increases, the representation specificity (the underlying features learned) of the model becomes more and more fixed, leaving too little flexibility to satisfactorily learn the target task. Both options are sketched below.
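
As a rough illustration of these two options in PyTorch, assuming a torchvision ResNet as the pretrained backbone; the class count and learning-rate values are placeholders, not recommendations.

```python
import torch
import torch.nn as nn
from torchvision import models

num_target_classes = 10  # hypothetical small target dataset

# Option 1: fine-tune the whole network with a deliberately small
# learning rate so pretrained features are not overwritten too fast.
model = models.resnet50(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, num_target_classes)
optimizer_full = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)

# Option 2: freeze the backbone and train only the new final layer.
model = models.resnet50(pretrained=True)
for param in model.parameters():
    param.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, num_target_classes)
optimizer_head = torch.optim.SGD(model.fc.parameters(), lr=1e-2, momentum=0.9)
```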

Trade-off between freezing layers and performance (Source)

Adapter Modules

Enter: adapter modules. Adapter modules are layers added in between the original layers of the pretrained network. The original layers can then be frozen, so we are no longer responsible for those weights. In addition, the adapter weights can be initialized to a near-identity function, so that the adapter variant initially produces outputs nearly identical to the pretrained model and none of the originally learned features are lost. We then have free rein to train these adapter layers however we want, with each adapter constrained by the frozen layers surrounding it. Intuitively, this means training can only tune the features to be more dataset-specific rather than overhaul the learned representations wholesale.

Example of adapter modules being added to transformer layers (Source)

As you can see above, adapters can be applied to modern NLP models by inserting layers within the transformer block itself. The adapter module follows an autoencoder-style design: the input is projected down into a low-dimensional latent space (whose size is a hyperparameter) and then projected back up into the original input space.
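
As a concrete sketch, a bottleneck adapter of this kind might look as follows in PyTorch. This is my own simplified illustration of the design described above, not the original reference implementation, and the hidden and bottleneck sizes are arbitrary.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project,
    plus a residual connection so the block starts near identity."""

    def __init__(self, hidden_dim, bottleneck_dim=64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)
        self.act = nn.GELU()

        # Near-zero init on the up-projection keeps the adapter close
        # to an identity function at the start of training.
        nn.init.normal_(self.up.weight, std=1e-3)
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))

# Example: apply to a frozen transformer sub-layer's output
adapter = Adapter(hidden_dim=768, bottleneck_dim=64)
hidden_states = torch.randn(2, 16, 768)   # (batch, sequence, hidden)
out = adapter(hidden_states)              # same shape as the input
```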

Now that we understand more about how adapter modules work, we have to answer one important question. Is this approach actually any good empirically? Let’s review some results on various tasks from the GLUE benchmark:

Comparing fine-tuning to adapters (Source)

In the above chart, we compare traditional re-training of the top layers of a frozen network against training the supplementary adapter modules. The adapter approach beats the top-layer fine-tuning approach in most cases, no matter how many parameters are involved; the latter only comes close when hundreds of millions of parameters are re-trained. Clearly, fine-tuning only the top layers of the network is inadequate, a conclusion supported by the literature cited previously on the transferability of features and representation specificity.

But how do adapters compare to fully fine-tuning a pretrained model?

GLUE benchmark test scores comparison (Source)

Training networks with adapter modules attains accuracies very similar to the traditional whole-network fine-tuning approach, but with a HUGE saving in the number of parameters being trained. These results make this novel approach very competitive with current state-of-the-art methods, and while not specific to small datasets, it is very much in line with the theme of avoiding overfitting with deep learning methods. It also comes with the added benefit of producing compact, task-specific models on top of a shared base model.
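
To see where the savings come from, we can freeze a stand-in pretrained encoder and count only the adapter parameters left trainable. The encoder below is a placeholder `nn.TransformerEncoder` rather than a real pretrained model, and the `Adapter` class is reused from the earlier sketch.

```python
import torch.nn as nn

# Placeholder "pretrained" encoder, standing in for a real frozen backbone.
pretrained_model = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=12), num_layers=12)
for param in pretrained_model.parameters():
    param.requires_grad = False

# One adapter per layer; only these parameters will be trained.
adapters = nn.ModuleList(Adapter(hidden_dim=768) for _ in range(12))

trainable = sum(p.numel() for p in adapters.parameters())
total = trainable + sum(p.numel() for p in pretrained_model.parameters())
print(f"Training {trainable:,} of {total:,} parameters "
      f"({100 * trainable / total:.2f}%)")
```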

Even though none of these approaches sets a new state of the art, that does not make them any less worthwhile to explore. State-of-the-art metrics may not even be the right yardstick for comparing methods of training on small datasets. These creative approaches are examples of trying clever new architectures instead of iterating through the pre-allotted zoo of models with heavy hyperparameter tuning. Learning on small datasets is hard and requires solutions that can learn and generalize from only a handful of examples. By solving the general problem of few-shot learning, we could achieve great results on arbitrary domain-specific tasks very quickly in the future. We shouldn't need to carry over the representation of the entire ImageNet dataset every time we want to build a simple image classifier on a toy dataset.
