Beijing Institute of Mathematical Sciences and Applications (BIMSA)

Topology of large language models data representations
Organizers
孙明明, 王雅晴
Speaker
Serguei Barannikov
Time
December 5, 2024, 14:00 to 16:00
Venue
A3-1-301
Online
Zoom 230 432 7880 (BIMSA)
Abstract
The rapid advancement of large language models (LLMs) has made distinguishing between human-written and AI-generated text increasingly challenging. This talk examines the topological structures within LLM data representations, focusing on their application to artificial text detection. We explore two primary methodologies:
1) Intrinsic dimensionality estimation: In RoBERTa representations, human-written texts exhibit an average intrinsic dimensionality of around 9 for alphabet-based languages, whereas AI-generated texts display values approximately 1.5 units lower. This difference has enabled the development of robust detectors that generalize across various domains and generation models.
2) Topological data analysis (TDA) of attention maps: By extracting interpretable topological features from transformer attention maps, we capture structural nuances of texts. Similarly, TDA applied to speech attention maps and embeddings from models like HuBERT enhances classification performance in several tasks.
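To make the first methodology above (intrinsic dimensionality estimation) concrete, here is a minimal sketch of how an intrinsic dimension can be estimated from a cloud of embedding vectors. It uses the nearest-neighbor maximum-likelihood estimator of Levina and Bickel, which is swapped in here purely for illustration and should not be read as the estimator used in the cited work. The function name `mle_intrinsic_dimension` and the neighbor count `k` are assumptions of this sketch, and the input is assumed to contain no duplicate points.

```python
import numpy as np

def mle_intrinsic_dimension(points, k=10):
    """Levina-Bickel style MLE of intrinsic dimension (illustrative sketch).

    points: array of shape (n_points, ambient_dim), assumed to hold
            distinct vectors (e.g. token embeddings of one text).
    k:      number of nearest neighbors per point (hypothetical default).
    """
    points = np.asarray(points, dtype=float)
    # Pairwise Euclidean distances; the diagonal is exactly zero.
    diff = points[:, None, :] - points[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    # Sort each row and drop column 0, which is the point itself.
    knn = np.sort(dist, axis=1)[:, 1:k + 1]
    # log(T_k / T_j) for j = 1..k-1, where T_j is the j-th neighbor distance.
    log_ratio = np.log(knn[:, -1][:, None] / knn[:, :-1])
    # Per-point estimate: inverse of the mean log ratio.
    per_point = (k - 1) / np.sum(log_ratio, axis=1)
    # Average the local estimates over all points.
    return float(np.mean(per_point))
```

Under these assumptions, a detector in the spirit of the abstract would compute such a value per text and compare it against a threshold fitted on human-written samples; the threshold itself is not specified in this announcement.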
These topological approaches provide a mathematical methodology for studying the geometric and structural properties of LLM data representations and their role in detecting AI-generated texts. The talk is based on the following works, carried out in collaboration with my PhD students E. Tulchinsky and K. Kuznetsov, and other colleagues:
“Intrinsic Dimension Estimation for Robust Detection of AI-Generated Texts”, NeurIPS 2023;
“Topological Data Analysis for Speech Processing”, InterSpeech 2023;
“Artificial Text Detection via Examining the Topology of Attention Maps”, EMNLP 2021.
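As a rough illustration of what a topological feature of an attention map can look like, the sketch below computes the 0-dimensional persistence barcode of a single attention matrix viewed as a weighted graph on tokens. Everything here is an assumption made for illustration: the function name `h0_barcode_from_attention`, the choice of distance `1 - max(a_ij, a_ji)`, and the restriction to connected components. It is not a reconstruction of the feature set used in the works listed above.

```python
import numpy as np

def h0_barcode_from_attention(attn):
    """Death times of connected components for one attention matrix.

    attn: square array of attention weights in [0, 1] (one head, one text).
    Every token starts as its own component at filtration value 0; an edge
    (i, j) enters at distance 1 - max(attn[i, j], attn[j, i]).
    Returns the sorted list of values at which two components merge.
    """
    attn = np.asarray(attn, dtype=float)
    n = attn.shape[0]
    # Symmetrize so that strong attention means small distance.
    dist = 1.0 - np.maximum(attn, attn.T)

    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = sorted((dist[i, j], i, j) for i in range(n) for j in range(i + 1, n))
    deaths = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            # Two components merge: one H0 bar ends at this value.
            parent[ri] = rj
            deaths.append(float(w))
    return deaths
```

Summary statistics of such a barcode (for example the sum or mean of the bar lengths) could then serve as inputs to an ordinary classifier; which statistics work well in practice is a question for the cited papers rather than this sketch.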
About the Speaker
Prof. Serguei Barannikov earned his Ph.D. from UC Berkeley and has made contributions to algebraic topology, algebraic geometry, mathematical physics, and machine learning. Work he carried out before his Ph.D. introduced canonical forms of filtered complexes, now known as persistence barcodes, which have become fundamental in topological data analysis. More recently, he has applied topological methods to machine learning, particularly to the study of large language models, with results published at leading ML conferences such as NeurIPS, ICML, and ICLR, bridging pure mathematics and advanced AI research.