SqueezeBERT: What can computer vision teach NLP about efficient neural networks?, Iandola et al., 2020

Paper, Tags: #nlp, #architectures, #computer-vision

Grouped convolutions have yielded significant speedups for computer vision, but have not been adapted to NLP. We demonstrate how to replace several operations in self-attention layers with grouped convolutions, and we use this technique in a novel network architecture called SqueezeBERT, which runs 4.3x faster than BERT-base on the Pixel 3 smartphone while achieving competitive accuracy on the GLUE test set.

The current state-of-the-art mobile BERT model, MobileBERT, takes 0.6s to classify a text sequence on a Google Pixel 3. It introduces two concepts into the self-attention network:

  1. Bottleneck layers

In ResNet, the 3x3 convolutions are computationally expensive, so a 1x1 "bottleneck" convolution is employed to reduce the number of channels input to each 3x3 convolution layer. Similarly, MobileBERT adopts bottleneck layers that reduce the number of channels before each self-attention layer, which reduces the computational cost of the self-attention layers (see the sketch after this list).

  2. High-information residual flow connections

CV networks connect the high-channel-count layers with residuals, which enables higher information flow through the network.
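
To make both ideas concrete, here is a minimal PyTorch sketch (my own illustration, not code from either paper) of a ResNet-style bottleneck block: 1x1 convolutions reduce and then restore the channel count around the expensive 3x3 convolution, and the residual connection is added at the high-channel-count representation.

```python
import torch
import torch.nn as nn

class BottleneckBlock(nn.Module):
    """ResNet-style bottleneck block (illustrative sketch)."""
    def __init__(self, channels=256, bottleneck=64):
        super().__init__()
        self.reduce = nn.Conv2d(channels, bottleneck, kernel_size=1)   # 1x1 "bottleneck" reduce
        self.conv3x3 = nn.Conv2d(bottleneck, bottleneck, kernel_size=3, padding=1)
        self.expand = nn.Conv2d(bottleneck, channels, kernel_size=1)   # 1x1 restore channels
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.reduce(x))
        out = self.relu(self.conv3x3(out))
        out = self.expand(out)
        return self.relu(out + x)  # residual added at the high-channel-count layer

x = torch.randn(1, 256, 56, 56)
print(BottleneckBlock()(x).shape)  # torch.Size([1, 256, 56, 56])
```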


Other ideas from CV that weren't used in MobileBERT:

  • Convolutions
  • Grouped convolutions: a popular technique in modern mobile-optimized neural networks (see the sketch after this list)
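
A hedged sketch of how grouped convolutions cut cost (PyTorch, channel counts chosen only for illustration): with groups=g, each group of channels is convolved independently, so parameters and FLOPs drop by roughly a factor of g.

```python
import torch.nn as nn

dense   = nn.Conv1d(768, 768, kernel_size=1)            # standard 1x1 convolution
grouped = nn.Conv1d(768, 768, kernel_size=1, groups=4)  # grouped: 4 independent channel groups

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(dense), count(grouped))  # 590592 vs. 148224, roughly a 4x reduction
```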

In this work we describe how to apply convolutions and grouped convolutions in the design of a novel self-attention network for NLP. BERT spends most of its operations and time in the position-wise fully-connected feed-forward layers. We show that these PFC layers are equivalent to a 1D convolution with a kernel size of 1.
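
That equivalence is easy to check numerically. Below is a small PyTorch sketch (my own, with arbitrary dimensions): copying a Linear layer's weights into a Conv1d with kernel_size=1 produces identical outputs at every position.

```python
import torch
import torch.nn as nn

batch, seq_len, dim = 2, 8, 768
x = torch.randn(batch, seq_len, dim)

linear = nn.Linear(dim, dim)               # position-wise fully-connected (PFC) layer
conv = nn.Conv1d(dim, dim, kernel_size=1)  # 1D convolution with kernel size 1
conv.weight.data = linear.weight.data.unsqueeze(-1)  # (dim, dim) -> (dim, dim, 1)
conv.bias.data = linear.bias.data

y_linear = linear(x)                              # applied independently at each position
y_conv = conv(x.transpose(1, 2)).transpose(1, 2)  # Conv1d expects (batch, channels, seq)
print(torch.allclose(y_linear, y_conv, atol=1e-5))  # True
```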