The Transformer Model and its Cutting-edge Research
Organizer
Speaker
Haihua Xie
Time
Thursday, September 26, 2024 3:00 PM - 5:00 PM
Venue
A3-1-301
Online
Zoom 361 038 6975 (BIMSA)
Abstract
This talk offers a comprehensive overview of the Transformer architecture, which powers many contemporary large language models. It delves into the fundamental principles and computational processes behind the Transformer's prominence in AI, and discusses key weaknesses and potential improvements from a mathematical standpoint. It also highlights innovations such as Aaren, WideFFN, Infini-attention, ALiBi, RoFormer, and Reformer, showing how these advances aim to address current limitations and enhance model performance.
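For context on the core computation the talk reviews, the following is a minimal sketch of scaled dot-product attention as described in reference 1; the function name, array shapes, and toy inputs are illustrative assumptions, not code from any of the cited papers.

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Q, K: (seq_len, d_k); V: (seq_len, d_v). Names and shapes are illustrative.
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # scaled pairwise similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # attention-weighted sum of values

# Toy usage: 4 tokens with 8-dimensional queries, keys, and values.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8)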
References:
1) Vaswani, Ashish, et al. "Attention is all you need." Advances in Neural Information Processing Systems 30 (2017): 5998-6008.
2) Radford, Alec, et al. "Language models are unsupervised multitask learners." OpenAI blog 1.8 (2019): 9.
3) Brown, Tom B., et al. "Language models are few-shot learners." arXiv preprint arXiv:2005.14165 (2020).
4) Yin, Zi, and Yuanyuan Shen. "On the dimensionality of word embedding." Advances in neural information processing systems 31 (2018).
5) Su, Jianlin, et al. "Roformer: Enhanced transformer with rotary position embedding." Neurocomputing 568 (2024): 127063.
6) Press, Ofir, Noah A. Smith, and Mike Lewis. "Train short, test long: Attention with linear biases enables input length extrapolation." arXiv preprint arXiv:2108.12409 (2021).
7) Liu, Nelson F., et al. "Lost in the middle: How language models use long contexts." Transactions of the Association for Computational Linguistics 12 (2024): 157-173.
8) Munkhdalai, Tsendsuren, Manaal Faruqui, and Siddharth Gopal. "Leave no context behind: Efficient infinite context transformers with infini-attention." arXiv preprint arXiv:2404.07143 (2024).
9) Kitaev, Nikita, Łukasz Kaiser, and Anselm Levskaya. "Reformer: The efficient transformer." arXiv preprint arXiv:2001.04451 (2020).
10) Pires, Telmo Pessoa, et al. "One wide feedforward is all you need." arXiv preprint arXiv:2309.01826 (2023).
11) Feng, Leo, et al. "Attention as an RNN." arXiv preprint arXiv:2405.13956 (2024).
Speaker Intro
Dr. Haihua Xie received his Ph.D. in Computer Science from Iowa State University in 2015. Before joining BIMSA in October 2021, he worked in the State Key Laboratory of Digital Publishing Technology at Peking University from 2015 to 2021. His research interests include Natural Language Processing and Knowledge Service. He has published more than 20 papers and holds 7 invention patents. In 2018, Dr. Xie was selected for the 13th batch of overseas high-level talents in Beijing and was honored as a "Beijing Distinguished Expert".