The Transformer Model and Its Cutting-Edge Research
Organizer
Speaker
Time
September 26, 2024, 15:00–17:00
Venue
A3-1-301
Online
Zoom 361 038 6975 (BIMSA)
Abstract
This talk offers a comprehensive overview of the Transformer architecture, which powers many contemporary large language models. It examines the fundamental principles and computational processes behind the Transformer's prominence in AI, and discusses key weaknesses and potential improvements from a mathematical standpoint. It also highlights innovations such as Aaren, WideFFN, Infini-attention, ALiBi, Roformer, and Reformer, showing how these advances aim to address current limitations and enhance model performance.
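For reference, the central computation introduced in reference 1 below is scaled dot-product attention over query, key, and value matrices Q, K, and V with key dimension d_k; roughly speaking, variants such as ALiBi (reference 6) and Roformer (reference 5) differ mainly in how positional information enters this computation:

\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
\]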
References:
1) Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems 30 (2017): 5998-6008.
2) Radford, Alec, et al. "Language models are unsupervised multitask learners." OpenAI blog 1.8 (2019): 9.
3) Brown, Tom B., et al. "Language models are few-shot learners." arXiv preprint arXiv:2005.14165 (2020).
4) Yin, Zi, and Yuanyuan Shen. "On the dimensionality of word embedding." Advances in neural information processing systems 31 (2018).
5) Su, Jianlin, et al. "Roformer: Enhanced transformer with rotary position embedding." Neurocomputing 568 (2024): 127063.
6) Press, Ofir, Noah A. Smith, and Mike Lewis. "Train short, test long: Attention with linear biases enables input length extrapolation." arXiv preprint arXiv:2108.12409 (2021).
7) Liu, Nelson F., et al. "Lost in the middle: How language models use long contexts." Transactions of the Association for Computational Linguistics 12 (2024): 157-173.
8) Munkhdalai, Tsendsuren, Manaal Faruqui, and Siddharth Gopal. "Leave no context behind: Efficient infinite context transformers with infini-attention." arXiv preprint arXiv:2404.07143 (2024).
9) Kitaev, Nikita, Łukasz Kaiser, and Anselm Levskaya. "Reformer: The efficient transformer." arXiv preprint arXiv:2001.04451 (2020).
10) Pires, Telmo Pessoa, et al. "One wide feedforward is all you need." arXiv preprint arXiv:2309.01826 (2023).
11) Feng, Leo, et al. "Attention as an RNN." arXiv preprint arXiv:2405.13956 (2024).
Speaker Introduction
Haihua Xie received his Ph.D. in Computer Science from Iowa State University in 2015. He then served as a senior researcher and head of the knowledge services group at the State Key Laboratory of Digital Publishing Technology at Peking University, before joining BIMSA full-time in October 2021. His research interests include natural language processing and knowledge services. He has published more than 20 papers, holds 7 invention patents, was selected for the Beijing High-level Talent Program, and was named an Outstanding Expert of Beijing.