Background

Genome-wide association studies (GWAS) have identified thousands of genetic variants that are associated with human diseases and traits1. The vast majority of these variants are located in non-coding regions, suggesting that gene regulation plays a significant role in the health of an individual2,3. Adverse changes to gene expression have been linked to an increased risk of disease2,4. It is therefore critical to identify the genes regulated by disease-associated non-coding variants to help develop therapeutic targets. However, GWAS rarely identify the causal variant itself: these studies are limited to analysing common variants, and linkage disequilibrium makes it challenging to determine whether an associated variant is causal5. Even if a variant were determined to be causal, its role, e.g. which gene it regulates, would still be unclear.

The shortcomings of GWAS can be alleviated by advances in deep learning models6,7,8,9,10,5. These models predict gene expression using only a sequence of DNA as input, helping to understand the impact of non-coding variants. The early work of DeepSEA demonstrated the feasibility of using deep convolutional neural networks (CNNs) to predict gene expression from a sequence of DNA6. DeepSEA outperformed the previous state-of-the-art method, a support vector machine, because deep CNNs can better aggregate the non-linear interactions across a DNA sequence that determine the gene expression level11.
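
To make the input representation concrete, the sketch below one-hot encodes a DNA sequence into the matrix form consumed by such CNNs. This is a minimal illustration, not DeepSEA's published code; the function name and the all-zero convention for ambiguous bases ('N') are our own choices, although the latter is a common one.

```python
# Minimal sketch: encoding DNA as a (length, 4) one-hot matrix, the standard
# input representation for sequence-based CNNs. Not DeepSEA's actual code.
import numpy as np

BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}

def one_hot_encode(sequence: str) -> np.ndarray:
    """Convert a DNA string into a (length, 4) float32 one-hot matrix."""
    encoding = np.zeros((len(sequence), 4), dtype=np.float32)
    for position, base in enumerate(sequence.upper()):
        index = BASE_INDEX.get(base)
        if index is not None:  # ambiguous bases ('N') stay all-zero
            encoding[position, index] = 1.0
    return encoding

print(one_hot_encode("ACGTN"))
```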

Further advancements in deep learning models were made by Basset7, Basenji8, ExPecto9, DeepFun10, and Enformer5. These models used data from an increasing number of cell and tissue types, a longer input sequence of DNA, and more advanced model architectures to achieve better results.

It is necessary for a model to be trained using a variety of cell and tissue types, as transcription regulation can differ between them12. In addition, disease-causing variants tend to alter gene regulation within specific cell and tissue types13. It is therefore critical that a model can generalise to an extensive set of cell and tissue types to maximise its utility. Fortunately, many organisations have annotated epigenetic and expression profiles in a wide range of cell types14,15,16. Recent deep learning approaches have leveraged these resources to create training, validation, and testing datasets. Some approaches have even included non-human data, such as from mice17. That work demonstrated that a model trained on both human and mouse data outperformed a model trained solely on human data, showing that the conserved nature of non-coding DNA between organisms can be used to improve a model’s accuracy18,19. An additional benefit is that non-human samples can come from cells and tissues that are difficult to collect from humans (e.g. brain tissue), making them very useful training examples17.

The length of the input DNA sequence used by deep learning models has increased significantly, from 1,000 base pairs (bp) for DeepSEA6 to almost 200,000 bp in Enformer5. Increasing the amount of DNA used by a model is essential, as some regulatory elements, such as enhancers, influence the expression of a gene from hundreds of thousands of base pairs away20,21. Some research estimates that 16% of enhancer-gene pairs could be more than 100,000 bp apart5. Enhancers are able to influence gene expression over long distances because of the three-dimensional folding of DNA: chromatin loops bring enhancers into close physical proximity with their target genes, allowing these interactions to take place22. For models to achieve even better results, they will require longer input DNA sequences to capture the remaining regulatory elements that are currently out of range.

The architecture of deep learning models has evolved in recent years. Advances have been made by using more convolutional layers to learn more abstract features9, densely connected dilated convolutions to share information across longer distances8, and transformers with self-attention to increase the receptive field of the model and more accurately capture distal information5.
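
The appeal of dilated convolutions can be seen with a little arithmetic: doubling the dilation rate at each layer grows the receptive field exponentially while the layer count grows only linearly. The sketch below computes this; the kernel size and dilation schedule are illustrative assumptions, not the exact configuration of Basenji or any other cited model.

```python
# Hedged sketch: receptive field of stacked 1D dilated convolutions.
# Kernel size and dilation schedule are illustrative, not Basenji's values.

def receptive_field(kernel_size: int, dilations: list[int]) -> int:
    """Receptive field (in input positions) of a stack of dilated 1D convolutions."""
    field = 1
    for dilation in dilations:
        field += (kernel_size - 1) * dilation
    return field

dilations = [2 ** i for i in range(11)]                     # 1, 2, 4, ..., 1024
print(receptive_field(kernel_size=3, dilations=dilations))  # 4095 positions
```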

At a high level, a Transformer model (or transformer layer) is composed of an encoder and/or a decoder unit, which allows it to process an input sequence in parallel. It is the successor of recurrent neural networks, which are limited to processing an input sequence sequentially and therefore take longer to train. Transformers also benefit from an attention mechanism, which helps them to learn from longer sequences by taking a weighted average of the input tokens. The average is weighted by the significance of each token with respect to the output, so the most relevant tokens receive the most weight (i.e. the most attention). An issue arises when the input sequence becomes too long. In models that use self-attention, each token in a sequence attends to all others, so n² attention weights must be computed for a sequence of n tokens. As a result, the memory usage of the model scales quadratically with sequence length. Enformer, the current state-of-the-art model, suffers from this limitation.
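
The sketch below implements dense single-head self-attention to make the quadratic cost concrete. It omits the learned query, key, and value projections of a real transformer layer and is not Enformer's implementation; the memory figure in the comments is a back-of-the-envelope estimate for a generic 100,000-token input.

```python
# Sketch of dense self-attention; illustrates the scaling, not Enformer's code.
import numpy as np

def self_attention(x: np.ndarray) -> np.ndarray:
    """Single-head self-attention over x with shape (n_tokens, d_model).

    A real layer first projects x into queries, keys, and values; here the
    input plays all three roles to keep the scaling behaviour visible.
    """
    n, d = x.shape
    scores = x @ x.T / np.sqrt(d)                   # (n, n): one score per token pair
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ x                              # weighted average of all tokens

# The (n, n) score matrix is the bottleneck: doubling the sequence length
# quadruples its memory. At n = 100,000 tokens, the float32 matrix alone
# would need 100,000**2 * 4 bytes, i.e. about 40 GB.
x = np.random.randn(8, 4)
print(self_attention(x).shape)  # (8, 4)
```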

To overcome this constraint, we propose a new model, Sparse-Enformer (S-Enformer), which replaces self-attention with the sparse attention of the BigBird model23. Using sparse attention reduces the growth of memory usage with sequence length from quadratic to linear. This allows a significantly longer DNA sequence to be used as input to the model, which should result in further accuracy improvements. We also demonstrate that S-Enformer can be trained and evaluated in exactly the same way as Enformer, which simplifies the comparison process.
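
As a rough illustration of why the cost becomes linear, the sketch below counts the (query, key) pairs evaluated under a BigBird-style pattern in which each token attends to a local window, a few global tokens, and a few random tokens. The window and token counts are illustrative constants, not the values used in S-Enformer.

```python
# Hedged sketch: attended pairs under BigBird-style sparse attention grow
# linearly with sequence length, versus n**2 for dense self-attention.
import numpy as np

rng = np.random.default_rng(0)

def sparse_attention_pairs(n_tokens: int, window: int = 3,
                           n_global: int = 2, n_random: int = 3) -> int:
    """Count attended (query, key) pairs under a BigBird-style pattern."""
    pairs = 0
    for query in range(n_tokens):
        # Local window around the query position.
        attended = set(range(max(0, query - window),
                             min(n_tokens, query + window + 1)))
        # Global tokens, attended by every query.
        attended.update(range(n_global))
        # A few randomly chosen tokens.
        attended.update(rng.choice(n_tokens, size=n_random, replace=False))
        pairs += len(attended)
    return pairs

for n in (1_000, 2_000, 4_000):
    print(n, sparse_attention_pairs(n))  # grows ~linearly in n
```

In BigBird itself these patterns are realised blockwise so that the sparse computation maps efficiently onto accelerators, rather than per token as in this sketch23.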

References
  1. Uffelmann, Emil, Qin Qin Huang, Nchangwi Syntia Munung, Jantina de Vries, Yukinori Okada, Alicia R. Martin, Hilary C. Martin, Tuuli Lappalainen, and Danielle Posthuma. 2021. ‘Genome-Wide Association Studies’. Nature Reviews Methods Primers 1 (1): 1–21. https://doi.org/10.1038/s43586-021-00056-9

  2. Edwards, Stacey L., Jonathan Beesley, Juliet D. French, and Alison M. Dunning. 2013. ‘Beyond GWASs: Illuminating the Dark Road from Association to Function’. The American Journal of Human Genetics 93 (5): 779–97. https://doi.org/10.1016/j.ajhg.2013.10.012

  3. Leslie, R., C. J. O’Donnell, and A. D. Johnson. 2014. ‘GRASP: Analysis of Genotype-Phenotype Results from 1390 Genome-Wide Association Studies and Corresponding Open Access Database’. Bioinformatics 30 (12): i185–94. https://doi.org/10.1093/bioinformatics/btu273

  4. Albert, Frank W., and Leonid Kruglyak. 2015. ‘The Role of Regulatory Variation in Complex Traits and Disease’. Nature Reviews Genetics 16 (4): 197–212. https://doi.org/10.1038/nrg3891

  5. Avsec, Žiga, Vikram Agarwal, Daniel Visentin, Joseph R. Ledsam, Agnieszka Grabska-Barwinska, Kyle R. Taylor, Yannis Assael, John Jumper, Pushmeet Kohli, and David R. Kelley. 2021. ‘Effective Gene Expression Prediction from Sequence by Integrating Long-Range Interactions’. Nature Methods 18 (10): 1196–1203. https://doi.org/10.1038/s41592-021-01252-x

  6. Zhou, Jian, and Olga G. Troyanskaya. 2015. ‘Predicting Effects of Noncoding Variants with Deep Learning–Based Sequence Model’. Nature Methods 12 (10): 931–34. https://doi.org/10.1038/nmeth.3547

  7. Kelley, David R., Jasper Snoek, and John L. Rinn. 2016. ‘Basset: Learning the Regulatory Code of the Accessible Genome with Deep Convolutional Neural Networks’. Genome Research 26 (7): 990–99. https://doi.org/10.1101/gr.200535.115

  8. Kelley, David R., Yakir A. Reshef, Maxwell Bileschi, David Belanger, Cory Y. McLean, and Jasper Snoek. 2018. ‘Sequential Regulatory Activity Prediction across Chromosomes with Convolutional Neural Networks’. Genome Research 28 (5): 739–50. https://doi.org/10.1101/gr.227819.117

  9. Zhou, Jian, Chandra L. Theesfeld, Kevin Yao, Kathleen M. Chen, Aaron K. Wong, and Olga G. Troyanskaya. 2018. ‘Deep Learning Sequence-Based Ab Initio Prediction of Variant Effects on Expression and Disease Risk’. Nature Genetics 50 (8): 1171–79. https://doi.org/10.1038/s41588-018-0160-6

  10. Pei, Guangsheng, Ruifeng Hu, Yulin Dai, Astrid Marilyn Manuel, Zhongming Zhao, and Peilin Jia. 2021. ‘Predicting Regulatory Variants Using a Dense Epigenomic Mapped CNN Model Elucidated the Molecular Basis of Trait-Tissue Associations’. Nucleic Acids Research 49 (1): 53–66. https://doi.org/10.1093/nar/gkaa1137

  11. Mamoshina, Polina, Armando Vieira, Evgeny Putin, and Alex Zhavoronkov. 2016. ‘Applications of Deep Learning in Biomedicine’. Molecular Pharmaceutics 13 (5): 1445–54. https://doi.org/10.1021/acs.molpharmaceut.5b00982

  12. Hobert, Oliver. 2008. ‘Gene Regulation by Transcription Factors and MicroRNAs’. Science 319 (5871): 1785–86. https://doi.org/10.1126/science.1151651

  13. Aguet, François, Andrew A. Brown, Stephane E. Castel, Joe R. Davis, Yuan He, Brian Jo, Pejman Mohammadi, et al. 2017. ‘Genetic Effects on Gene Expression across Human Tissues’. Nature 550 (7675): 204–13. https://doi.org/10.1038/nature24277

  14. The ENCODE Project Consortium. 2012. ‘An Integrated Encyclopedia of DNA Elements in the Human Genome’. Nature 489 (7414): 57–74. https://doi.org/10.1038/nature11247

  15. Forrest, Alistair R. R., et al. 2014. ‘A Promoter-Level Mammalian Expression Atlas’. Nature 507 (7493): 462–70. https://doi.org/10.1038/nature13182

  16. Roadmap Epigenomics Consortium, Anshul Kundaje, Wouter Meuleman, Jason Ernst, Misha Bilenky, Angela Yen, Alireza Heravi-Moussavi, Pouya Kheradpour, et al. 2015. ‘Integrative Analysis of 111 Reference Human Epigenomes’. Nature 518 (7539): 317–30. https://doi.org/10.1038/nature14248

  17. Kelley, David R. 2020. ‘Cross-Species Regulatory Sequence Activity Prediction’. PLOS Computational Biology 16 (7): e1008050. https://doi.org/10.1371/journal.pcbi.1008050

  18. Woolfe, Adam, Martin Goodson, Debbie K. Goode, Phil Snell, Gayle K. McEwen, Tanya Vavouri, Sarah F. Smith, et al. 2004. ‘Highly Conserved Non-Coding Sequences Are Associated with Vertebrate Development’. PLOS Biology 3 (1): e7. https://doi.org/10.1371/journal.pbio.0030007

  19. Pennacchio, Len A., Nadav Ahituv, Alan M. Moses, Shyam Prabhakar, Marcelo A. Nobrega, Malak Shoukry, Simon Minovitsky, et al. 2006. ‘In Vivo Enhancer Analysis of Human Conserved Non-Coding Sequences’. Nature 444 (7118): 499–502. https://doi.org/10.1038/nature05295

  20. Levine, Mike. 2010. ‘Transcriptional Enhancers in Animal Development and Evolution’. Current Biology 20 (17): R754–63. https://doi.org/10.1016/j.cub.2010.06.070

  21. Long, Hannah K., Sara L. Prescott, and Joanna Wysocka. 2016. ‘Ever-Changing Landscapes: Transcriptional Enhancers in Development and Evolution’. Cell 167 (5): 1170–87. https://doi.org/10.1016/j.cell.2016.09.018

  22. Krivega, Ivan, and Ann Dean. 2012. ‘Enhancer and Promoter Interactions — Long Distance Calls’. Current Opinion in Genetics & Development 22 (2): 79–85. https://doi.org/10.1016/j.gde.2011.11.001

  23. Zaheer, Manzil, Guru Guruganesh, Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, et al. 2021. ‘Big Bird: Transformers for Longer Sequences’. arXiv:2007.14062 [cs, stat], January. http://arxiv.org/abs/2007.14062