arXiv:2108.05895 [cs.CV]

Mobile-Former: Bridging MobileNet and Transformer

Yinpeng Chen, Xiyang Dai, Dongdong Chen, Mengchen Liu, Xiaoyi Dong, Lu Yuan, Zicheng Liu

Published 2021-08-12 | Version 1

We present Mobile-Former, a parallel design that runs MobileNet and a Transformer side by side with a two-way bridge in between. This structure leverages the advantage of MobileNet at local processing and of the Transformer at global interaction, while the bridge enables bidirectional fusion of local and global features. Unlike recent work on vision transformers, the Transformer in Mobile-Former contains very few tokens (e.g., fewer than 6) that are randomly initialized, resulting in low computational cost. Combined with the proposed lightweight cross attention that models the bridge, Mobile-Former is not only computationally efficient but also has stronger representation power, outperforming MobileNetV3 in the low-FLOP regime from 25M to 500M FLOPs on ImageNet classification. For instance, it achieves 77.9% top-1 accuracy at 294M FLOPs, gaining 1.3% over MobileNetV3 while saving 17% of the computation. When transferred to object detection, Mobile-Former outperforms MobileNetV3 by 8.6 AP.
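The core idea of the two-way bridge can be sketched with plain attention arithmetic: a handful of global tokens attend to the local feature map (Mobile→Former), and the feature map then attends back to the updated tokens (Former→Mobile). The sketch below is a simplified NumPy illustration, not the paper's exact layer; the shapes (6 tokens, a 7×7 feature map, dimension 16) and the identity query/key projections are assumptions for clarity, whereas the actual model uses learned projections and multi-head attention.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values, dim):
    # scaled dot-product cross attention with identity projections
    # (a simplification of the paper's lightweight cross attention)
    scores = queries @ keys_values.T / np.sqrt(dim)   # (num_q, num_kv)
    return softmax(scores, axis=-1) @ keys_values     # (num_q, dim)

rng = np.random.default_rng(0)
dim = 16
tokens = rng.standard_normal((6, dim))      # Former branch: very few global tokens
features = rng.standard_normal((49, dim))   # Mobile branch: 7x7 feature map, flattened

# Mobile -> Former: tokens gather global context from local features
tokens = tokens + cross_attention(tokens, features, dim)
# Former -> Mobile: local features read the updated global tokens back
features = features + cross_attention(features, tokens, dim)
```

Because the token count stays tiny (here 6), both attention maps are thin (6×49 and 49×6), which is why the bridge adds so little compute compared with full self-attention over all 49 positions.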

Related articles:
- arXiv:2107.00451 [cs.CV] (Published 2021-07-01): VideoLightFormer: Lightweight Action Recognition using Transformers
- arXiv:2003.08077 [cs.CV] (Published 2020-03-18): Scene Text Recognition via Transformer
- arXiv:2206.07435 [cs.CV] (Published 2022-06-15): Forecasting of depth and ego-motion with transformers and self-supervision