Ge Gao, Ph.D.

Computational Genomics

Professor, Biomedical Pioneering Innovation Center, Peking University

E-mail: gaog@mail.cbi.pku.edu.cn

As biology increasingly turns into a data-rich science, the massive data generated by high-throughput technologies pose both opportunities and serious challenges. We focus primarily on developing novel computational technologies to analyze and integrate these “Big Data” effectively and efficiently, with applications to deciphering the function and evolution of gene regulatory systems.

Handle Biological “BIG DATA” Effectively and Efficiently. Powerful bioinformatics infrastructure is critical to store, manage, and analyze these data, and ultimately to extract novel knowledge from them. Supported by a grant from the Chinese Ministry of Science and Technology, we developed an online bioinformatics platform, Weblab, which helps users analyze biological data with 260+ integrated tools and share their data, results, and even whole workflows with each other. As the largest online bioinformatics platform in China, Weblab receives 54+ million hits annually from 5,000+ registered active users worldwide. In response to the rapid increase in the amount of sequencing data, we further developed ABrowse, a customizable genome visualization framework that enables more effective access to heterogeneous “-omics” data. As the first general-purpose framework with full support for genome-wide interactive browsing, open data access, and collaborative teamwork, ABrowse has 2,095 total downloads (as of Apr 2015), an average of 1.41 downloads per day.

Decipher the Function and Evolution of Gene Regulatory System. Building on this infrastructure, we studied the functionality and evolutionary dynamics of development-related regulatory systems in various model organisms. One of our long-term goals is to determine the regulatory roles of novel (i.e. evolutionarily young) regulators, as well as how they are “wired” into the existing regulatory network.

Transcription factors (TFs) are key elements in gene expression regulation circuits and play essential roles in plant development and stress response. Benefiting from continuous improvements in data quality and analysis methodology over the past decade, our Plant Transcription Factor Database, PlantTFDB, is becoming the most comprehensive data portal for plant transcription factors, with 10+ million hits annually from users worldwide. The current PlantTFDB 3.0 contains 129,288 TFs from 83 species, covering all major lineages of green plants. The wide coverage of PlantTFDB enables us to investigate the evolutionary dynamics of the plant transcriptional regulatory system globally. Unexpectedly, we found a statistically significant connection between the binding specificity and the wiring preference of novel TFs, suggesting that novel regulators can modify regulatory circuits by introducing holistic (but highly specialized) modules.

Long non-coding RNAs (lncRNAs, operationally defined as noncoding transcripts longer than 200 nt) have only recently been recognized as important regulators. Using a machine learning approach, we developed CPC, the first online tool to identify non-coding transcripts based solely on sequence features. CPC is now widely used by the noncoding community, with 42 million hits annually for the CPC online server. The rapid evolutionary rate and flexible target adaptation of noncoding RNAs make them an attractive source of evolutionary novelty. By integrating functional and evolutionary genomics data across multiple clades, we found that a selection-driven process, rather than a purely neutral mutation-driven mechanism, contributes to the origin and maintenance of intergenic noncoding RNAs in both the fruit fly and human genomes, highlighting the putative roles of novel long noncoding RNAs in early development and their possible connection with lineage-specific evolutionary novelty.


Wang Y, Liang N, Gao G. (2024) Quantifying the regulatory potential of genetic variants via a hybrid sequence-oriented model with SVEN. Nat. Commun., 15: 10917.

Chen ZY, Wei L, Gao G. (2024) Foundation models for bioinformatics. Quant. Biol., 12: 339-344.

Liu X, Zhang XY, Kang YJ, Huang F, Liu S, Guo YX, Li YN, Yin CC, Liu ML, Han QM, Wang QW, Ye H, Yao HH, Li C, Li JH, Pingcuo WZ, Zhang Y, Su Y, Gao G, Li ZG, Sun XL. (2024) An autoantibody profile identified by human genome-wide protein arrays in rheumatoid arthritis. MedComm, 5: e679.

Xia CR, Cao ZJ, Tu XM, Gao G. (2023) Spatial-linked alignment tool (SLAT) for aligning heterogenous slices. Nat. Commun., 14: 7236.

Wen ZY, Kang YJ, Ke L, Yang DC, Gao G. (2023) Genome-wide identification of gene loss events suggests loss relics as a potential source of functional lncRNAs in humans. Mol. Biol. Evol., 40: msad103.

Cao ZJ, Gao G. (2022) Multi-omics single-cell data integration and regulatory inference with graph-linked embedding. Nat. Biotechnol., 40: 1458-1466.

Tu XM, Cao ZJ, Xia CR, Mostafavi S, Gao G. (2022) Cross-Linked Unified Embedding for cross-modality representation learning. In 36th Conference on Neural Information Processing Systems (NeurIPS 2022).

Kang YJ, Li JY, Ke L, Jiang S, Yang DC, Hou M, Gao G. (2022) Quantitative model suggests both intrinsic and contextual features contribute to the transcript coding ability determination in cells. Brief. Bioinform., 23: bbab483.

Cao ZJ, Wei L, Lu S, Yang DC, Gao G. (2020) Searching large-scale scRNA-seq databases via unbiased cell embedding with Cell BLAST. Nat. Commun., 11: 3458.

Lichen Ren, Nan Liang, Yuan Lin, Shiqi Yang, Zhijie Cao, Jingyi Li, Yu Wang, Dechang Yang, Lin Wei, Ziyu Chen, Chenrui Xia, Shuting Han, Peiwen Ji, Cheng Li, Xinjie Wang, Yexi Liang, Tianyi Ma, Yuci Wang