Beijing Institute of Mathematical Sciences and Applications

Seminar series: Advances in Artificial Intelligence
Topology of large language models data representations
Organizers
Ming Ming Sun, Ya Qing Wang
Speaker
Serguei Barannikov
Time
Thursday, December 5, 2024 2:00 PM - 4:00 PM
Venue
A3-1-301
Online
Zoom 230 432 7880 (BIMSA)
Abstract
The rapid advancement of large language models (LLMs) has made distinguishing between human- and AI-generated text increasingly challenging. This talk examines the topological structures within LLM data representations, focusing on their application to artificial text detection. We explore two primary methodologies:
1) Intrinsic dimensionality estimation. In RoBERTa representations, human-written texts in alphabet-based languages exhibit an average intrinsic dimensionality of around 9, whereas AI-generated texts display values approximately 1.5 units lower. This gap has allowed the development of robust detectors capable of generalizing across various domains and generation models.
2) Topological data analysis (TDA) of attention maps. By extracting interpretable topological features from the attention maps of transformer models, we capture structural nuances of texts. Similarly, TDA applied to speech attention maps and embeddings from models such as HuBERT improves classification performance on several tasks.
These topological approaches provide a mathematical methodology for studying the geometric and structural properties of LLM data representations and their role in detecting AI-generated texts. The talk is based on the following works, carried out in collaboration with my PhD students E. Tulchinsky and K. Kuznetsov, and other colleagues:
“Intrinsic Dimension Estimation for Robust Detection of AI-Generated Texts”, NeurIPS 2023;
“Topological Data Analysis for Speech Processing”, InterSpeech 2023;
“Artificial Text Detection via Examining the Topology of Attention Maps”, EMNLP 2021.
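To make the first methodology concrete, here is a minimal sketch of intrinsic dimensionality estimation in the TwoNN style (a maximum-likelihood estimate from the ratio of second- to first-nearest-neighbor distances). The abstract does not specify which estimator the detectors use, so this particular estimator and all names in the code are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def two_nn_dimension(points: np.ndarray) -> float:
    """TwoNN-style intrinsic-dimension estimate for a point cloud.

    For each point, take the ratio mu = r2 / r1 of the distances to its
    second and first nearest neighbors; the MLE of the intrinsic
    dimension is N / sum(log mu).
    """
    # Pairwise Euclidean distances, with self-distances masked out.
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(-1))
    np.fill_diagonal(dists, np.inf)
    sorted_d = np.sort(dists, axis=1)
    r1, r2 = sorted_d[:, 0], sorted_d[:, 1]
    mu = r2 / r1
    return len(points) / np.log(mu).sum()

# Sanity check: points on a 2-D plane embedded in 10-D ambient space
# should yield an estimate close to 2, far below the ambient dimension.
rng = np.random.default_rng(0)
basis = rng.normal(size=(2, 10))
flat = rng.normal(size=(500, 2)) @ basis
print(two_nn_dimension(flat))  # close to 2
```

Applied to sentence- or token-level embeddings, a per-text estimate of this kind is the sort of scalar feature the abstract describes as separating human-written from AI-generated text.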
Speaker Intro
Prof. Serguei Barannikov earned his Ph.D. from UC Berkeley and has made contributions to algebraic topology, algebraic geometry, mathematical physics, and machine learning. His work prior to his Ph.D. introduced canonical forms of filtered complexes, now known as persistence barcodes, which have become fundamental in topological data analysis. More recently, he has applied topological methods to machine learning, particularly to the study of large language models, with results published in leading ML conferences such as NeurIPS, ICML, and ICLR, effectively bridging pure mathematics and advanced AI research.
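The persistence barcodes mentioned above can be illustrated with the simplest case: in a 0-dimensional Vietoris-Rips filtration of a point cloud, every point starts its own bar at scale 0, and a bar dies when its connected component merges with another (at the corresponding minimum-spanning-tree edge length). This union-find sketch is purely illustrative and is not the speaker's construction:

```python
import numpy as np

def zero_dim_barcode(points: np.ndarray) -> list:
    """0-dimensional persistence barcode of a Vietoris-Rips filtration.

    Returns (birth, death) pairs: one bar per merge event, plus one
    infinite bar for the component that never dies.
    """
    n = len(points)
    d = np.sqrt(((points[:, None, :] - points[None, :, :]) ** 2).sum(-1))
    # Process edges in order of increasing length (Kruskal-style).
    edges = sorted((d[i, j], i, j) for i in range(n) for j in range(i + 1, n))
    parent = list(range(n))

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    bars = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            bars.append((0.0, w))  # a component dies at scale w
    bars.append((0.0, float("inf")))  # one component survives forever
    return bars

# Two well-separated pairs on a line: two bars die at the within-pair
# distance (~0.1), one at the between-pair gap (~4.9), one is infinite.
pts = np.array([[0.0], [0.1], [5.0], [5.1]])
print(zero_dim_barcode(pts))
```

Production TDA pipelines (and the higher-dimensional features used on attention maps) rely on dedicated libraries rather than code like this, but the merge-and-die bookkeeping is the same idea.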
CONTACT

No. 544, Hefangkou Village, Huaibei Town, Huairou District, Beijing 101408

北京市怀柔区 河防口村544号
北京雁栖湖应用数学研究院 101408

Tel. 010-60661855
Email. administration@bimsa.cn

Copyright © Beijing Institute of Mathematical Sciences and Applications

京ICP备2022029550号-1

京公网安备11011602001060