A framework to understand regulatory genomics (to be finished)

Background

One of the central questions in genetics is how genotype relates to phenotype. Genotype refers to an individual’s genome, that is, their complete set of DNA sequences, while phenotype refers to observable traits such as physical characteristics and disease states. Understanding this relationship requires examining the intermediate processes that link the two. One of the most important is gene regulation, a multi-layered, context-dependent process that involves interactions among regulatory elements, transcription factor abundance, chromatin states, and various epigenetic modifications. Together, these intermediate processes and variables can be referred to as the epigenotype, which plays a crucial role in determining an individual’s observable traits.

Previous research offers several examples. Genome-wide association studies (GWAS) are a powerful tool for understanding the relationship between genotype and phenotype. These studies typically focus on the association between single-nucleotide polymorphisms (SNPs) and human traits such as disease states. Expression quantitative trait loci (eQTL) studies examine a related question: the association between genetic variants and the epigenotype, specifically gene expression. In recent years, a related type of molecular QTL, the chromatin accessibility quantitative trait locus (caQTL), has emerged, which examines the relationship between genetic variants and chromatin accessibility. These statistical genetics approaches use regression models based on predefined features, such as the alleles carried at each variant, to establish associations. A limitation of this approach is that the input features are determined by the training data, so the model can only assess variants that are actually present in that data.
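To make the regression idea above concrete, here is a minimal sketch of a single-variant association test. It assumes an additive model in which the genotype is coded as the number of alternate alleles (0, 1, or 2) and the trait is a gene's expression level. The function name `fit_additive_qtl` and the simulated data are my own illustration, not taken from any specific QTL tool, and a real analysis would also include covariates and significance testing.

```python
import numpy as np

def fit_additive_qtl(dosage, trait):
    """Least-squares fit of trait ~ intercept + beta * dosage."""
    dosage = np.asarray(dosage, dtype=float)
    trait = np.asarray(trait, dtype=float)
    # Design matrix: a column of ones (intercept) and the allele dosage.
    design = np.column_stack([np.ones_like(dosage), dosage])
    coef, *_ = np.linalg.lstsq(design, trait, rcond=None)
    return float(coef[0]), float(coef[1])

# Simulated data (an assumption for illustration, not real measurements).
rng = np.random.default_rng(0)
dosage = rng.integers(0, 3, size=200)  # 0, 1, or 2 copies of the alternate allele
expression = 1.0 + 0.5 * dosage + rng.normal(0.0, 0.1, size=200)

intercept, beta = fit_additive_qtl(dosage, expression)
print(f"estimated effect per alternate allele: {beta:.2f}")
```

Note that the only feature is a variant observed in the cohort. A mutation that nobody in the data carries has no column in the design matrix, which is exactly the limitation described above.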

What is regulatory genomics? In my view, regulatory genomics is a field that aims to understand the mechanisms of gene regulation at a higher resolution. It uses computational and experimental approaches to predict epigenotypes from DNA sequences, which allows a more flexible exploration of the genotype-phenotype relationship than traditional statistical genetics methods. The field often relies on genome-wide assays such as ChIP-seq, RNA-seq, and ATAC-seq, together with bioinformatics tools, to identify and analyze regulatory elements, including transcription factor binding sites, enhancers, and promoters, with the goal of understanding how these elements interact to regulate gene expression and ultimately affect phenotype.

Study	Genotypes	Epigenotypes	Phenotypes
GWAS	Variants	-	Traits (disease)
eQTL	Variants	Gene expression	-
caQTL	Variants	Chromatin accessibility	-
Regulatory genomics	DNA sequences	Genomic tracks	-

In this blog post, I will give an overview of the field of regulatory genomics. I will start with a general framework that outlines the key concepts of the field. I will then discuss two types of data that are commonly used in regulatory genomics, genome sequences and synthetic sequences, and highlight some of the recent papers that have advanced the field. To conclude, I will explore how the models developed in regulatory genomics can be applied to gain new insights into the mechanisms of gene regulation and the relationship between genotype and phenotype.

The general framework

The overarching objective of regulatory genomics is to predict functional activity from DNA sequence. Computational models take a DNA sequence, composed of the nucleotides adenine (A), cytosine (C), guanine (G), and thymine (T), as the input \(X\) and predict the activity of one or more genomic features \(Y\), such as gene expression, chromatin accessibility, or three-dimensional genome organization. Depending on the genomic track, the output can be a scalar value or a multidimensional vector.
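In practice, the input \(X\) is usually a numeric encoding of the sequence. A common choice is one-hot encoding, sketched below. The column order A, C, G, T and the handling of unknown bases (all-zero rows) are conventions I picked for this example; individual models may use different ones.

```python
import numpy as np

ALPHABET = "ACGT"  # column order is a convention chosen for this sketch

def one_hot_encode(seq):
    """Return an (L, 4) array; unknown bases (e.g. N) become all-zero rows."""
    index = {base: i for i, base in enumerate(ALPHABET)}
    x = np.zeros((len(seq), len(ALPHABET)), dtype=np.float32)
    for pos, base in enumerate(seq.upper()):
        if base in index:
            x[pos, index[base]] = 1.0
    return x

x = one_hot_encode("ACGTN")
print(x.shape)  # (5, 4)
```

A model is then a function from this \(L \times 4\) array to \(Y\): a single number for a scalar target such as one gene's expression, or an array with one value per genomic bin and per track for profile-style targets.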

Two kinds of data

A crucial aspect of developing models in regulatory genomics is selecting the appropriate data source. Two commonly used data sources in this field are genome sequences and synthetic sequences. Genome sequences are obtained through sequencing technology and provide a snapshot of the genomic landscape in a natural context. They are useful for studying the effects of genetic variations on gene regulation. Synthetic sequences, on the other hand, are generated using synthetic biology techniques and are used to study the effects of genetic variations in a specific region in a controlled environment, typically in a specific cell line. These data sources provide a unique opportunity to study the effects of genetic modifications on gene regulation.

Both types of data sources have their own advantages and limitations, which are discussed in more detail in the later sections of this post.

The genome sequences

The human genome contains roughly three billion base pairs per haploid copy (more than six billion in a diploid cell), which makes it a large and readily available source of input sequences for models. In addition, various experimental technologies can measure different properties of cells in different states. These include high-throughput sequencing assays such as RNA-seq to measure gene expression, ChIP-seq to measure protein-DNA binding, ATAC-seq to measure chromatin accessibility, and Hi-C to measure 3D genome structure.

Models in regulatory genomics take DNA sequences as input and predict the activity of different genomic tracks. They are trained on large amounts of data from these assays and can then predict track activity in different cell types.

Sequence as input

Various works focus on predicting specific genomic tracks, such as:

- Predict protein binding from sequences: DeepBind, DeepSEA, BPNet
- Predict gene expression from sequences: Basenji, ExPecto, Enformer
- Predict chromatin accessibility from sequences: Basset, DanQ, AI-TAC, AMBER
- Predict 3D structure of the genome from sequences: Akita, DeepC, Orca

Sequence and other modalities as input

In addition to using DNA sequences as the sole input, other genomic tracks can also be incorporated as inputs. For example:

- Predict protein binding from DNA sequences and chromatin accessibility: Leopard, maxATAC
- Predict gene expression from sequences and 3D genome structure: GraphReg
- Predict 3D genome structure from sequences and chromatin accessibility/protein binding: C.Origami

The main motivations behind incorporating other modalities as inputs are: (1) the ability to predict cell-type-specific activity directly, and (2) the ability to study multi-layer relationships between different genomic tracks.

Methods that rely solely on DNA sequence as input can only provide a general picture of gene regulation, because the input is the same for every cell type (they share the same genome). To account for cell-type-specific activity, these methods often use a multi-task setting, with one output per cell type. Incorporating other modalities as input, such as chromatin accessibility or protein-DNA binding, instead allows direct prediction of cell-type-specific activity. For example, if a model learns that a protein binds where a DNA sequence contains its motif and is accessible, that rule can generalize to other cell types given their accessibility data. Using several modalities as input also makes it possible to study multi-layer relationships between genomic tracks, providing a more comprehensive understanding of gene regulation.
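One simple way to picture "sequence plus another modality" is to append the extra track as an additional input channel next to the one-hot sequence. The sketch below does this for a per-base accessibility signal. This channel-stacking layout and the helper `build_input` are assumptions for illustration; the models listed above each have their own input design.

```python
import numpy as np

ALPHABET = "ACGT"

def one_hot_encode(seq):
    """Return an (L, 4) one-hot array; unknown bases become all-zero rows."""
    index = {base: i for i, base in enumerate(ALPHABET)}
    x = np.zeros((len(seq), len(ALPHABET)), dtype=np.float32)
    for pos, base in enumerate(seq.upper()):
        if base in index:
            x[pos, index[base]] = 1.0
    return x

def build_input(seq, accessibility):
    """Stack sequence (L, 4) and a per-base accessibility signal (L,) into (L, 5)."""
    x_seq = one_hot_encode(seq)
    signal = np.asarray(accessibility, dtype=np.float32).reshape(-1, 1)
    if signal.shape[0] != x_seq.shape[0]:
        raise ValueError("accessibility must have one value per base")
    return np.concatenate([x_seq, signal], axis=1)

# Same genome sequence, two hypothetical cell types with made-up accessibility.
seq = "TATAGC"
x_cell_a = build_input(seq, [0.9, 0.8, 0.9, 0.7, 0.1, 0.0])
x_cell_b = build_input(seq, [0.0, 0.0, 0.1, 0.0, 0.0, 0.0])
```

The first four columns are identical for both cell types, while the fifth differs. That difference is what lets a single-output model make cell-type-specific predictions without one task per cell type.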

Figure summary:

Pros and Cons of using genome sequences

Using genome sequences as a data source makes it possible to predict the activity of different genomic tracks across long sequences, even entire chromosomes. This allows the study of complex regulatory mechanisms involving promoters, enhancers, repressors, long-range loops, and changes in compartments. In addition, by using data from different cell types, it is possible to predict cell-type-specific genomic tracks.

However, using genome sequences also has disadvantages. One of the main limitations is sample size. For example, when predicting gene expression, the number of training examples is bounded by the number of genes; when predicting chromatin accessibility, it is bounded by the number of open chromatin regions. Another limitation is sequence diversity: the genome is the product of evolution, so it covers only a small, constrained part of the space of possible sequences.

Pros of using genome sequences as a data source in regulatory genomics:
- Long sequences: Allows for the prediction of activity across long sequences, even entire chromosomes.
- Complex mechanisms: Enables the study of complex mechanisms of the genome, including various types of regulatory processes.
- Diverse cell states: Provides the ability to predict cell-type-specific genomic tracks.
Cons of using genome sequences as a data source in regulatory genomics:
- Limited sample size: The number of samples is restricted by the number of genes or open chromatin regions.
- Limited sequence diversity: The genome is shaped by evolution, so it covers only a small part of the space of possible sequences.

The synthetic sequences

The use of synthetic sequences in regulatory genomics allows for a more versatile approach to investigating the functions of specific elements in the genome. Synthesized sequences are introduced at different positions in the genome, which makes it possible to study the mechanisms of regulatory regions such as promoters, 3’UTRs, 5’UTRs, and enhancers. Numerous studies have taken this approach, using techniques such as random sequence generation and site-directed mutagenesis.

For example, we can use synthesized sequences as promoters to study promoter mechanisms. We can also use random sequences as 3’UTRs or 5’UTRs to study how these regions work. Furthermore, we can use random sequences as both enhancers and promoters.

While synthesized sequences have advantages, such as a large sample size and the ability to synthesize any sequence, they also have limitations. Experiments are typically restricted to simple systems such as yeast or cell lines, and the sequences are usually short (typically no longer than 300 base pairs). This results in a more local and simple understanding of the mechanisms at play.
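The "large sample size" and "any sequence" points can be illustrated with a few lines of code. The sketch below draws a library of uniformly random sequences and counts how many distinct sequences of a given length exist. The library size, the length of 80 bases, and the uniform base frequencies are arbitrary choices for this example; real library designs often add fixed flanking sequences or constrain base composition.

```python
import random

def random_library(n_sequences, length, seed=0):
    """Draw n_sequences uniformly random DNA strings of the given length."""
    rng = random.Random(seed)  # seeded so the library is reproducible
    return [
        "".join(rng.choice("ACGT") for _ in range(length))
        for _ in range(n_sequences)
    ]

def sequence_space_size(length):
    """Number of distinct DNA sequences of this length (4 choices per base)."""
    return 4 ** length

library = random_library(n_sequences=1000, length=80)
print(len(library), "sequences drawn out of", sequence_space_size(80), "possible")
```

Even a library of millions of sequences samples a vanishing fraction of this space, but unlike the genome, the sample is not restricted to what evolution has produced.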

Pros and Cons of using random sequences

Random sequences can only be tested in simple systems (such as yeast), and the sequences are short (usually no longer than 300 base pairs). As a result, the mechanisms they reveal tend to be local and simple.

Pros:
- Large sample size: The ability to generate a large number of random sequences allows for a larger sample size in studies.
- Flexible sequences: There are no constraints on the sequences that can be synthesized, allowing for a wide range of possibilities.
- Experimental validation: The ability to synthesize specific sequences makes direct experimental validation of model predictions easy.
Cons:
- Simple system: Synthetic biology techniques are typically only feasible in simple systems such as yeast or cell lines.
- Short sequences: The length of synthesized sequences is typically limited to around 300 base pairs or less.
- Simple mechanism: Studies using synthetic biology techniques are limited to understanding local, simple mechanisms rather than complex, systemic interactions.

From model to new biology

Now we have the data and the model. The final goal is to gain new biological insights, so we need to dig into the model to find them. There are two ways to do this: deciphering and engineering.

Decipher the regulatory process from the model

The first kind of application is to decipher the regulatory process from the model.

After training, a model that predicts well has, in principle, learned aspects of the regulatory process. We can therefore use explainable AI methods (such as DeepLIFT and SHAP) to decipher what it has learned.

We can extract regulatory elements from the model (for example with TF-MoDISco) and also infer the relationships between those elements.
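To show what a per-base importance score looks like, here is a sketch using in silico mutagenesis, a simpler attribution technique than the methods named above: mutate every position to every base and record how the prediction changes. The `toy_model` below is a hypothetical stand-in for a trained network (it just counts the best match to the motif TATA), so the numbers illustrate the procedure, not real biology.

```python
import numpy as np

ALPHABET = "ACGT"
MOTIF = "TATA"  # hypothetical motif that the toy model responds to

def one_hot_encode(seq):
    """Return an (L, 4) one-hot array for a sequence of A, C, G, T."""
    x = np.zeros((len(seq), 4))
    for pos, base in enumerate(seq):
        x[pos, ALPHABET.index(base)] = 1.0
    return x

def toy_model(x):
    """Stand-in for a trained model: best motif match count over all windows."""
    motif = one_hot_encode(MOTIF)
    k = len(motif)
    scores = [np.sum(x[i:i + k] * motif) for i in range(len(x) - k + 1)]
    return float(max(scores))

def in_silico_mutagenesis(seq, model):
    """Return an (L, 4) array of score changes for every single-base substitution."""
    x_ref = one_hot_encode(seq)
    ref_score = model(x_ref)
    deltas = np.zeros((len(seq), 4))
    for pos in range(len(seq)):
        for j in range(4):
            x_mut = x_ref.copy()
            x_mut[pos] = 0.0
            x_mut[pos, j] = 1.0  # substitute base j at this position
            deltas[pos, j] = model(x_mut) - ref_score
    return deltas

deltas = in_silico_mutagenesis("GGTATAGG", toy_model)
# Importance of a position: the largest drop caused by mutating it.
importance = -deltas.min(axis=1)
```

Here the four motif positions get an importance of 1 and the flanking bases get 0, so the motif stands out from the background. Motif discovery tools work on importance scores of this kind computed across many sequences.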

Use the model as a simulator to optimize sequences and run virtual experiments

The second kind of application is to use the model as a simulator of the regulatory process. With such a simulator, we can optimize sequences and run engineering experiments virtually.
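Sequence optimization can be sketched as a search loop that repeatedly asks the model which mutation helps most. The example below uses greedy hill climbing over single-base substitutions. Both the search strategy and the `toy_model` (a hypothetical scorer that counts the best match to the motif TATA) are assumptions for illustration; published design methods use trained models and often more elaborate search or gradient-based approaches.

```python
MOTIF = "TATA"  # hypothetical motif that the toy model rewards

def toy_model(seq):
    """Stand-in for a trained model: best motif match count over all windows."""
    k = len(MOTIF)
    return max(
        (sum(a == b for a, b in zip(seq[i:i + k], MOTIF))
         for i in range(len(seq) - k + 1)),
        default=0,
    )

def greedy_optimize(seq, model, max_steps=20):
    """Apply the best single-base substitution until no substitution improves the score."""
    current, current_score = seq, model(seq)
    for _ in range(max_steps):
        best, best_score = current, current_score
        for pos in range(len(current)):
            for base in "ACGT":
                if base == current[pos]:
                    continue
                candidate = current[:pos] + base + current[pos + 1:]
                score = model(candidate)
                if score > best_score:
                    best, best_score = candidate, score
        if best_score == current_score:
            break  # local optimum: no single substitution helps
        current, current_score = best, best_score
    return current, current_score

optimized, score = greedy_optimize("GGGGGGGG", toy_model)
print(optimized, score)
```

Because every candidate is scored by the model rather than in the lab, thousands of designs can be screened before any sequence is synthesized. The designs are only as good as the model, so the best ones still need experimental validation.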

Once we have a trustworthy simulator, we can run experiments virtually, introducing mutations or perturbations to study the effects of variants. This can even extend to large structural variants (such as deletions, insertions, and duplications). We can also use the simulator for evolutionary studies, mimicking the evolution of the regulatory process.
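A virtual variant experiment boils down to scoring the reference and the altered sequence with the same model and comparing the two predictions. The sketch below does this for a single-nucleotide variant and for a deletion. The `toy_model` is again a hypothetical stand-in for a trained network (it counts the best match to the motif TATA), and the helper names are my own.

```python
MOTIF = "TATA"  # hypothetical motif that the toy model responds to

def toy_model(seq):
    """Stand-in for a trained model: best motif match count over all windows."""
    k = len(MOTIF)
    return max(
        (sum(a == b for a, b in zip(seq[i:i + k], MOTIF))
         for i in range(len(seq) - k + 1)),
        default=0,
    )

def apply_snv(seq, pos, alt):
    """Replace the base at 0-based position pos with alt."""
    return seq[:pos] + alt + seq[pos + 1:]

def apply_deletion(seq, start, end):
    """Delete the half-open interval [start, end)."""
    return seq[:start] + seq[end:]

def variant_effect(ref_seq, alt_seq, model):
    """Predicted change in activity caused by the variant."""
    return model(alt_seq) - model(ref_seq)

ref = "GGTATAGG"
snv_effect = variant_effect(ref, apply_snv(ref, 3, "C"), toy_model)
deletion_effect = variant_effect(ref, apply_deletion(ref, 2, 6), toy_model)
print(snv_effect, deletion_effect)  # -1 -4
```

Unlike the regression approaches from the background section, nothing here requires the variant to have been observed in a population. Any edit that can be written as a sequence can be scored.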

Summary

In this post, we have covered two kinds of data (genome sequences and synthetic sequences) and two ways to get new biology from a model: deciphering the regulatory process, and using the model as a simulator to optimize sequences and run virtual experiments.