arXiv:2108.05895 [cs.CV]

Mobile-Former: Bridging MobileNet and Transformer

Yinpeng Chen, Xiyang Dai, Dongdong Chen, Mengchen Liu, Xiaoyi Dong, Lu Yuan, Zicheng Liu

Published 2021-08-12 | Version 1

We present Mobile-Former, a parallel design that runs MobileNet and a Transformer side by side with a two-way bridge in between. This structure leverages the advantage of MobileNet at local processing and of the Transformer at global interaction, while the bridge enables bidirectional fusion of local and global features. Unlike recent work on vision transformers, the Transformer in Mobile-Former contains very few tokens (e.g., fewer than 6) that are randomly initialized, resulting in low computational cost. Combined with the proposed lightweight cross attention that models the bridge, Mobile-Former is not only computationally efficient but also has stronger representation power, outperforming MobileNetV3 in the low-FLOP regime from 25M to 500M FLOPs on ImageNet classification. For instance, it achieves 77.9% top-1 accuracy at 294M FLOPs, gaining 1.3% over MobileNetV3 while saving 17% of the computation. When transferred to object detection, Mobile-Former outperforms MobileNetV3 by 8.6 AP.
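The core idea of the two-way bridge can be sketched with plain attention arithmetic: a handful of global tokens attend to the local feature map (Mobile→Former), and the feature map then attends back to the updated tokens (Former→Mobile). The sketch below is a simplified NumPy illustration, not the paper's exact layer; the shapes (6 tokens, a 7×7 feature map, dimension 16) and the identity query/key projections are assumptions for clarity, whereas the actual model uses learned projections and multi-head attention.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values, dim):
    # scaled dot-product cross attention with identity projections
    # (a simplification of the paper's lightweight cross attention)
    scores = queries @ keys_values.T / np.sqrt(dim)   # (num_q, num_kv)
    return softmax(scores, axis=-1) @ keys_values     # (num_q, dim)

rng = np.random.default_rng(0)
dim = 16
tokens = rng.standard_normal((6, dim))      # Former branch: very few global tokens
features = rng.standard_normal((49, dim))   # Mobile branch: 7x7 feature map, flattened

# Mobile -> Former: tokens gather global context from local features
tokens = tokens + cross_attention(tokens, features, dim)
# Former -> Mobile: local features read the updated global tokens back
features = features + cross_attention(features, tokens, dim)
```

Because the token count stays tiny (here 6), both attention maps are thin (6×49 and 49×6), which is why the bridge adds so little compute compared with full self-attention over all 49 positions.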

Related articles:
- arXiv:2107.00451 [cs.CV] (Published 2021-07-01): VideoLightFormer: Lightweight Action Recognition using Transformers
- arXiv:2003.08077 [cs.CV] (Published 2020-03-18): Scene Text Recognition via Transformer
- arXiv:2206.07435 [cs.CV] (Published 2022-06-15): Forecasting of depth and ego-motion with transformers and self-supervision