Publications | Xinming Tu

2024

bioRxiv

A Supervised Contrastive Framework for Learning Disentangled Representations of Cell Perturbation Data

Xinming Tu, Jan-Christian Hutter, Zitong Jerry Wang, Takamasa Kudo, Aviv Regev, and Romain Lopez

bioRxiv, 2024

2023

HDSR

What Should Data Science Education Do with Large Language Models?

Xinming Tu, James Zou, Weijie J. Su†, and Linjun Zhang†

Harvard Data Science Review, 2023

2022

NeurIPS

Cross-Linked Unified Embedding for cross-modality representation learning

Xinming Tu*, Zhi-Jie Cao*, Chen-Rui Xia, Sara Mostafavi†, and Ge Gao†

NeurIPS (Oral), 2022

Multi-modal learning is essential for understanding information in the real world. Jointly learning from multi-modal data enables global integration of both shared and modality-specific information, but current strategies often fail when observations from certain modalities are incomplete or missing for part of the subjects. To learn comprehensive representations based on such modality-incomplete data, we present a semi-supervised neural network model called CLUE (Cross-Linked Unified Embedding). Extending from multi-modal VAEs, CLUE introduces the use of cross-encoders to construct latent representations from modality-incomplete observations. Representation learning for modality-incomplete observations is common in genomics. For example, human cells are tightly regulated across multiple related but distinct modalities such as DNA, RNA, and protein, jointly defining a cell’s function. We benchmark CLUE on multi-modal data from single cell measurements, illustrating CLUE’s superior performance in all assessed categories of the NeurIPS 2021 Multimodal Single-cell Data Integration Competition. While we focus on analysis of single cell genomic datasets, we note that the proposed cross-linked embedding strategy could be readily applied to other cross-modality representation learning problems.

2021

BiB

Identifying complex motifs in massive omics data with a variable-convolutional layer in deep neural network

Jing-Yi Li*, Shen Jin*, Xinming Tu, Yang Ding†, and Ge Gao†

Briefings in Bioinformatics, 2021

bbab233

Motif identification is among the most common and essential computational tasks in bioinformatics and genomics. Here we propose a novel convolutional layer for deep neural networks, named the variable convolutional (vConv) layer, for effective motif identification in high-throughput omics data by learning kernel length adaptively from the data. Empirical evaluations on DNA-protein binding and DNase footprinting cases demonstrate that vConv-based networks outperform their convolutional counterparts regardless of model complexity. Moreover, vConv can be readily integrated into multi-layer neural networks as an 'in-place replacement' for the canonical convolutional layer. All source code is freely available on GitHub for academic use.

2019

Bioinformatics

Expectation pooling: an effective and interpretable pooling method for predicting DNA–protein binding

Xiao Luo*, Xinming Tu*, Yang Ding, Ge Gao†, and Minghua Deng†

Bioinformatics, 2019

Convolutional neural networks (CNNs) have outperformed conventional methods in modeling the sequence specificity of DNA-protein binding. While previous studies have built a connection between CNNs and probabilistic models, simple CNN models cannot achieve sufficient accuracy on this problem. Recently, some neural network methods have increased performance using complex networks whose results cannot be directly interpreted. However, it is difficult to combine probabilistic models and CNNs effectively to improve DNA-protein binding predictions. In this article, we present a novel global pooling method, expectation pooling, for predicting DNA-protein binding. Our pooling method stems naturally from the expectation maximization algorithm, and its benefits can be interpreted both statistically and via deep learning theory. Through experiments, we demonstrate that our pooling method improves the prediction performance of DNA-protein binding. Our interpretable pooling method combines probabilistic ideas with global pooling by taking the expectations of inputs without increasing the number of parameters. We also analyze the hyperparameters in our method and propose optional structures to help fit different datasets. We explore how to effectively utilize these novel pooling methods and show that combining statistical methods with deep learning is highly beneficial, which is promising and meaningful for future studies in this field. All code is publicly available at https://github.com/gao-lab/ePooling. Supplementary data are available at Bioinformatics online.