Statistical Genetics: Unveiling Unknowns in Life

Statistical genetics combines statistics and genetics to understand the genetic architecture of complex phenotypes and human health using powerful statistical tools. Its core is the development and application of statistical methods to analyze genetic and genomic data and to understand how genes influence phenotypic traits and diseases. The objective of this workshop is to introduce state-of-the-art concepts and methodologies that can better disentangle the complexities of genetics and unveil the unknowns in this field. The speakers are active researchers at the frontier of methodological development in statistical genetics.

Yongtao Guan (BIMSA)

Zuoheng Wang (Yale University)

Qi Wu (BIMSA)

Rongling Wu (BIMSA, YMSC)

Dengcheng Yang (BIMSA)

9th July, 2025

Weekday	Time	Venue	Online	ID	Password
Wednesday	10:00 - 19:00	A6-101	ZOOM 09	230 432 7880	BIMSA

Time\Date	Jul 9 Wed
10:00-10:20	Rongling Wu
10:20-11:10	Zuoheng Wang
11:10-12:00	Yongtao Guan
14:00-14:50	Qi Wu
14:50-15:40	Dengcheng Yang
15:40-16:30	Rongling Wu

*All times on this webpage are in Beijing Time (GMT+8).

9th July, 2025

10:00-10:20 Rongling Wu

Opening Remarks

10:20-11:10 Zuoheng Wang

Deep learning integrating clinical and genetic data for disease risk prediction in Biobank data

In the digital medicine era, electronic health records (EHRs) contain extensive patient data from diverse sources. Leveraging this readily available information is essential for personalized medicine and predictive healthcare. Although family health history is an important component to assess risk for common chronic diseases, research has so far adopted a limited view of family relations in healthcare research. We develop ALIGATEHR, which models inferred family relations in a graph attention network augmented with an attention-based medical ontology representation, thus accounting for the complex influence of genetics, shared environmental exposures, and disease dependencies. Furthermore, the integration of electronic health records and genetic data offers great potential to improve disease risk prediction by capturing both clinical and genetic risk factors. We further develop ALIGATEHR-Gen, a graph attention network that integrates multimodal patient data including diagnosis codes, demographics, and genetic information, along with external medical ontology knowledge. ALIGATEHR-Gen constructs unified patient representations by incorporating genetically inferred first-degree relationships and disease ontology embeddings. We evaluate the predictive performance of ALIGATEHR-Gen across 118 diseases in the UK Biobank and demonstrate that it outperforms state-of-the-art baseline models by an average of at least 6%. A case study on five primary fibrotic and closely related diseases reveals that ALIGATEHR-Gen effectively distinguishes patient subgroups based on clinical and genetic features. These findings illustrate the potential of ALIGATEHR-Gen to advance predictive and interpretable modeling in healthcare.

11:10-12:00 Yongtao Guan

Abundant Parent-of-origin Effect eQTL: The Framingham Heart Study

Parent-of-origin effect (POE) is a phenomenon whereby an allele’s effect on a phenotype depends both on its allelic identity and on the parent from whom the allele is inherited, as exemplified by polar overdominance at the ovine callipyge locus and the human obesity DLK1 locus. Systematic studies of POE of expression quantitative trait loci (eQTL) are lacking. In this study we use trios among participants in the Framingham Heart Study to examine to what extent POE exists for gene expression in whole blood, using whole genome sequencing and RNA sequencing. For each gene and the SNPs in cis, we performed eQTL analysis using genotype, paternal, maternal, and joint models, where the genotype model enforces identical effect sizes for paternal and maternal alleles, and the joint model allows them to have different effect sizes. We compared models using Bayes factors to identify paternal, maternal, and opposing eQTL, where paternal and maternal effects have opposite directions. The resultant variants are collectively called POE eQTL. The highlights of our study include: 1) More than 2,000 genes harbor POE eQTL, and the majority of POE eQTL are not in the vicinity of known imprinted genes; 2) Among 180 genes harboring opposing eQTL, 99 harbor exclusively opposing eQTL, and 58 of the 99 are phosphoprotein coding genes, reflecting significant enrichment; 3) Paternal eQTL are enriched with GWAS hits, and genes harboring paternal eQTL are enriched with drug targets. Our study demonstrates the abundance of POE in gene expression, illustrates the complexity of gene expression regulation, and provides a resource that is complementary to existing resources such as GTEx. We revisited two previous POE findings in light of our POE results. A SNP residing in KCNQ1 that is maternally associated with diabetes is a maternal eQTL of CDKN1C, not KCNQ1. A SNP residing in DLK1 that showed paternal polar overdominance for human obesity is a maternal eQTL of MEG3, offering an explanation for the baseline risk of homozygous samples through the association between MEG3 expression and obesity. Finally, we advise caution when conducting Mendelian randomization using gene expression as the exposure.
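The model comparison described in this abstract can be illustrated with a minimal sketch on simulated data. It is not the authors' pipeline: the abstract does not say how the Bayes factors are computed, so the sketch swaps in the BIC approximation to the log Bayes factor, and the function names `bic_ols` and `approx_log_bayes_factors`, the sample size, and the simulated effect size are all assumptions made for illustration.

```python
import numpy as np


def bic_ols(y, X):
    """BIC of a Gaussian linear model with intercept, fit by least squares."""
    n = len(y)
    X1 = np.hstack([np.ones((n, 1)), X])           # prepend intercept column
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    rss = float(np.sum((y - X1 @ beta) ** 2))
    n_params = X1.shape[1] + 1                     # coefficients + noise variance
    return n * np.log(rss / n) + n_params * np.log(n)


def approx_log_bayes_factors(y, pat, mat):
    """Approximate log Bayes factor of each model against the null model.

    Uses log BF ~ (BIC_null - BIC_model) / 2, a stand-in (assumption) for
    whatever Bayes factor computation the study actually used.
    """
    designs = {
        "null": np.empty((len(y), 0)),
        "genotype": (pat + mat)[:, None],          # one shared allelic effect
        "paternal": pat[:, None],                  # paternal allele only
        "maternal": mat[:, None],                  # maternal allele only
        "joint": np.column_stack([pat, mat]),      # separate effects
    }
    bic = {name: bic_ols(y, X) for name, X in designs.items()}
    return {name: (bic["null"] - value) / 2.0
            for name, value in bic.items() if name != "null"}


# Simulated example (hypothetical data): expression depends only on the
# paternally inherited allele.
rng = np.random.default_rng(0)
n = 500
pat = rng.integers(0, 2, n).astype(float)          # paternal allele, 0 or 1
mat = rng.integers(0, 2, n).astype(float)          # maternal allele, 0 or 1
expr = 0.8 * pat + rng.normal(0.0, 1.0, n)

log_bf = approx_log_bayes_factors(expr, pat, mat)
for name, value in sorted(log_bf.items(), key=lambda kv: -kv[1]):
    print(f"{name:9s} approx. log BF vs null = {value:.2f}")
```

In this simulation the paternal model should be supported far more strongly than the genotype or maternal models, which mirrors the idea of classifying an eQTL by the model with the most support.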

14:00-14:50 Qi Wu

Evaluating population genetic diversity via alignment-free approaches

Alignment-free sequence analysis methods have been widely applied in phylogenomic and metagenomic studies, primarily through the construction of distance matrices followed by distance-based tree inference. In this work, we extend the application of alignment-free approaches to population genetic analyses, specifically to estimating genetic diversity. Two alignment-free methods, k-mer frequency profiling and the natural vector method, were used to calculate pairwise sequence diversity (π) from yeast genomes and human mitochondrial datasets. The results show a strong correlation between diversity estimates derived from alignment-free methods and those from SNP-based approaches. Notably, because alignment-free methods incorporate comprehensive sequence variation information, they systematically yield higher diversity values than SNP-based calculations. However, direct comparison of absolute diversity values across methodologies requires careful interpretation in future work.
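As a rough illustration of the k-mer frequency profiling idea, the sketch below builds a k-mer frequency vector per sequence and averages the pairwise distances between vectors as a diversity-like summary. The abstract does not specify the distance measure or how it is scaled to π, so the Euclidean distance, the choice k = 3, and the function names `kmer_profile` and `mean_pairwise_kmer_distance` are assumptions made for illustration only.

```python
from collections import Counter
from itertools import combinations, product

import numpy as np


def kmer_profile(seq, k=3):
    """Relative frequencies of all 4**k DNA k-mers, in a fixed order.

    k-mers containing characters other than A, C, G, T are ignored.
    """
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    vec = np.array([counts.get(km, 0) for km in kmers], dtype=float)
    total = vec.sum()
    return vec / total if total > 0 else vec


def mean_pairwise_kmer_distance(seqs, k=3):
    """Mean Euclidean distance between k-mer profiles over all sequence pairs."""
    if len(seqs) < 2:
        raise ValueError("need at least two sequences")
    profiles = [kmer_profile(s, k) for s in seqs]
    dists = [np.linalg.norm(a - b) for a, b in combinations(profiles, 2)]
    return float(np.mean(dists))


# Toy sequences (hypothetical); no alignment is needed to compare them.
toy = ["ACGTACGTACGT", "ACGTACGAACGT", "ACGTTCGTACGA"]
print(mean_pairwise_kmer_distance(toy, k=3))
```

Because the profiles are computed independently per sequence, this kind of summary does not depend on an alignment or on SNP calling, which is the property the abstract exploits.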

14:50-15:40 Dengcheng Yang

A Statistical Genetics Framework for Dissecting the Genetic Architecture of Plant Regeneration

Here we introduce an integrated statistical genetics framework designed to uncover the genetic basis of a specific biological process. The framework combines genome-wide resequencing and time-series transcriptomic data, and includes several complementary components: 1) a qualitative trait GWAS, which explicitly decomposes additive and dominance effects to identify key genetic variants and their modes of action influencing whether the biological process is initiated; 2) functional mapping, used to capture dynamic genetic effects during post-initiation development; 3) time-series differential expression and enrichment analyses, providing molecular validation of genetic signals; 4) a gene regulatory network model based on transcriptomics, used to reveal potential epistatic regulatory interactions. We applied this framework to study Populus euphratica regeneration, a classical model of plant developmental plasticity, and systematically dissected the genetic architecture underlying its key regenerative stages.
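The additive and dominance decomposition for a qualitative (yes/no) trait can be sketched as follows. This is only one common way to set it up, not necessarily the speaker's model: the logistic link, the genotype coding (additive as -1, 0, 1 and dominance as a heterozygote indicator), the function names, and the simulated effect sizes are all assumptions made for illustration.

```python
import numpy as np


def code_additive_dominance(geno):
    """Code genotypes (0, 1, 2 copies of an allele) into two covariates.

    additive:  -1, 0, 1 for 0, 1, 2 copies
    dominance:  1 for heterozygotes, 0 for homozygotes
    """
    g = np.asarray(geno)
    return g - 1.0, (g == 1).astype(float)


def fit_logistic(X, y, n_iter=25):
    """Logistic regression by Newton-Raphson; X must include an intercept."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        weights = p * (1.0 - p)
        grad = X.T @ (y - p)
        hess = X.T @ (X * weights[:, None])
        beta = beta + np.linalg.solve(hess, grad)
    return beta


# Simulated SNP and binary trait (hypothetical data), e.g. whether
# the process was initiated (1) or not (0).
rng = np.random.default_rng(1)
n = 4000
geno = rng.binomial(2, 0.5, n)
x_a, x_d = code_additive_dominance(geno)
eta = -0.5 + 1.0 * x_a + 0.5 * x_d                 # true additive=1.0, dominance=0.5
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-eta))).astype(float)

X = np.column_stack([np.ones(n), x_a, x_d])
mu_hat, a_hat, d_hat = fit_logistic(X, y)
print(f"additive effect ~ {a_hat:.2f}, dominance effect ~ {d_hat:.2f}")
```

The ratio of the dominance estimate to the additive estimate gives a rough indication of the mode of action at the variant (for example, near zero suggests a mostly additive effect).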

15:40-16:30 Rongling Wu

Proposing a Topological Genetics Theory for Multi-Gene Editing

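The next thirty years will be the era of gene editing. By then, many complex diseases may be cured through targeted gene editing, and many complex traits may be improved through multi-gene editing. Realizing this promising vision requires a thorough and comprehensive understanding of quantitative genetics. In this talk, I will present new quantitative genetic theory and methods that can provide solid theoretical support for multi-gene editing. If R.A. Fisher founded the classical quantitative genetic theory that drove animal and plant breeding, then the topological genetics theory proposed in this talk, a century later, will powerfully advance the process by which gene editing changes life.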

Journal

Data Analytics and Topology is a peer-reviewed open access journal owned by the Beijing Institute of Mathematical Sciences and Applications, under the sponsorship of the International Press inaugurated by Prof. Shing-Tung Yau.

Data Analytics and Topology has been created to meet the needs of the growing mathematical data science community as its members create innovative concepts, theories, models, and tools that can best manipulate, interpret, and utilize big data. The aim of collecting big data is to uncover and extract the fundamental principles and rules underlying the scientific problems behind them. We anticipate that neither statistics nor even AI alone will suffice to achieve this goal in a sustainable way, given the heterogeneity, dynamics, interdependence, and high dimensionality of big data. Overcoming these challenges can be facilitated through the seamless integration of mathematics, particularly topology, with statistics.

The journal endeavors to publish papers exploring the intersection between mathematics and statistics, especially as it applies to large-scale data sets. It welcomes contributions presenting new theories, methods, and interpretations of data within the realms of topological statistics or statistical topology.

For more information, please refer to https://intlpress.com/journals/journalList?id=1879074441815207938.

We are currently seeking additional editors and associate editors for our journal. We need between 10 and 20 editors and approximately 20 associate editors. If you are interested in joining our team, please reach out to us at daatjournal@bimsa.cn.

We also invite you to submit your upcoming research papers or review articles to Data Analytics and Topology. We look forward to the opportunity to feature your valuable work in our publication.