Background

Genome-wide association studies (GWAS) have identified thousands of genetic variants that are associated with human diseases and traits1. The vast majority of these variants are located in non-coding regions, suggesting that gene regulation plays a significant role in the health of an individual2,3. Adverse changes to gene expression have been linked to an increased risk of disease2,4. It is therefore critical to identify the genes regulated by disease-associated non-coding variants to help develop therapeutic targets. However, GWAS rarely identify the causal variant itself: these studies are limited to analysing common variants, and linkage disequilibrium makes it challenging to determine whether an associated variant is causal5. Even if a variant were determined to be causal, its role, e.g. which gene it regulates, would still be unclear.

The shortcomings of GWAS can be alleviated by advances in deep learning models6,7,8,9,10,5. These models predict gene expression using only a sequence of DNA as input, helping to understand the impact of non-coding variants. The early work of DeepSEA demonstrated the feasibility of using deep convolutional neural networks (CNNs) to predict gene expression from a sequence of DNA6. DeepSEA outperformed the previous state-of-the-art method, a support vector machine, because deep CNNs can better aggregate the non-linear interactions across a DNA sequence that determine the gene expression level11.
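
To make the input representation concrete, the sketch below one-hot encodes a DNA sequence into the matrix form consumed by such CNNs. This is a minimal illustration, not DeepSEA's published code; the function name and the all-zero convention for ambiguous bases ('N') are our own choices, although the latter is a common one.

```python
# Minimal sketch: encoding DNA as a (length, 4) one-hot matrix, the standard
# input representation for sequence-based CNNs. Not DeepSEA's actual code.
import numpy as np

BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}

def one_hot_encode(sequence: str) -> np.ndarray:
    """Convert a DNA string into a (length, 4) float32 one-hot matrix."""
    encoding = np.zeros((len(sequence), 4), dtype=np.float32)
    for position, base in enumerate(sequence.upper()):
        index = BASE_INDEX.get(base)
        if index is not None:  # ambiguous bases ('N') stay all-zero
            encoding[position, index] = 1.0
    return encoding

print(one_hot_encode("ACGTN"))
```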

Further advancements in deep learning models were made by Basset7, Basenji8, ExPecto9, DeepFun10, and Enformer5. These models used data from an increasing number of cell and tissue types, a longer input sequence of DNA, and more advanced model architectures to achieve better results.

It is necessary for a model to be trained using a variety of cell and tissue types, as transcription regulation can differ between them12. In addition, disease-causing variants tend to alter gene regulation within specific cell and tissue types13. It is therefore critical that a model can generalise to an extensive set of cell and tissue types to maximise its utility. Fortunately, many organisations have annotated epigenetic and expression profiles in a wide range of cell types14,15,16. Recent deep learning approaches have leveraged these resources to create training, validation, and testing datasets. Some approaches have even included non-human data, such as from mice17. That work demonstrated that a model trained on both human and mouse data outperformed a model trained solely on human data, showing that the conserved nature of non-coding DNA between organisms can be used to improve a model’s accuracy18,19. An additional benefit is that non-human samples can come from cells and tissues that are difficult to collect from humans (e.g. brain tissue), making them very useful training examples17.

The length of the input DNA sequence used by deep learning models has increased significantly, from 1,000 base pairs (bp) for DeepSEA6 to almost 200,000 bp in Enformer5. Increasing the amount of DNA used by a model is essential, as some regulatory elements, such as enhancers, influence the expression of a gene from hundreds of thousands of base pairs away20,21. Some research estimates that 16% of enhancer-gene pairs could be more than 100,000 bp apart5. Enhancers are able to influence gene expression over long distances because of the three-dimensional folding of DNA: chromatin loops bring enhancers into close physical proximity with their target genes, allowing these interactions to take place22. For models to achieve even better results, they will require longer input DNA sequences to capture the remaining regulatory elements that are currently out of range.

The architecture of deep learning models has evolved in recent years. Advances have been made by using more convolutional layers to learn more abstract features9, densely connected dilated convolutions to share information across longer distances8, and transformers with self-attention to increase the receptive field of the model and more accurately capture distal information5.
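
The appeal of dilated convolutions can be seen with a little arithmetic: doubling the dilation rate at each layer grows the receptive field exponentially while the layer count grows only linearly. The sketch below computes this; the kernel size and dilation schedule are illustrative assumptions, not the exact configuration of Basenji or any other cited model.

```python
# Hedged sketch: receptive field of stacked 1D dilated convolutions.
# Kernel size and dilation schedule are illustrative, not Basenji's values.

def receptive_field(kernel_size: int, dilations: list[int]) -> int:
    """Receptive field (in input positions) of a stack of dilated 1D convolutions."""
    field = 1
    for dilation in dilations:
        field += (kernel_size - 1) * dilation
    return field

dilations = [2 ** i for i in range(11)]                     # 1, 2, 4, ..., 1024
print(receptive_field(kernel_size=3, dilations=dilations))  # 4095 positions
```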

At a high level, a Transformer model (or transformer layer) is composed of an encoder and/or a decoder unit, which allows it to process an input sequence in parallel. It is the successor of recurrent neural networks, which are limited to processing an input sequence sequentially and therefore take longer to train. Transformers also benefit from an attention mechanism, which helps them to learn from longer sequences by taking a weighted average of the input tokens. The average is weighted by the significance of each token with respect to the output, so the most relevant tokens receive the most weight (i.e. the most attention). An issue arises when the input sequence becomes too long. In models that use self-attention, each token in a sequence attends to all others, so n² attention weights must be computed for a sequence of n tokens. As a result, the memory usage of the model scales quadratically with sequence length. Enformer, the current state-of-the-art model, suffers from this limitation.
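
The sketch below implements dense single-head self-attention to make the quadratic cost concrete. It omits the learned query, key, and value projections of a real transformer layer and is not Enformer's implementation; the memory figure in the comments is a back-of-the-envelope estimate for a generic 100,000-token input.

```python
# Sketch of dense self-attention; illustrates the scaling, not Enformer's code.
import numpy as np

def self_attention(x: np.ndarray) -> np.ndarray:
    """Single-head self-attention over x with shape (n_tokens, d_model).

    A real layer first projects x into queries, keys, and values; here the
    input plays all three roles to keep the scaling behaviour visible.
    """
    n, d = x.shape
    scores = x @ x.T / np.sqrt(d)                   # (n, n): one score per token pair
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ x                              # weighted average of all tokens

# The (n, n) score matrix is the bottleneck: doubling the sequence length
# quadruples its memory. At n = 100,000 tokens, the float32 matrix alone
# would need 100,000**2 * 4 bytes, i.e. about 40 GB.
x = np.random.randn(8, 4)
print(self_attention(x).shape)  # (8, 4)
```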

To overcome this constraint, we propose a new model, Sparse-Enformer (S-Enformer), which replaces self-attention with the sparse attention of the BigBird model23. Using sparse attention reduces the growth of memory usage with sequence length from quadratic to linear. This allows a significantly longer DNA sequence to be used as input to the model, which should result in further accuracy improvements. We also demonstrate that S-Enformer can be trained and evaluated in exactly the same way as Enformer, which simplifies the comparison process.
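
As a rough illustration of why the cost becomes linear, the sketch below counts the (query, key) pairs evaluated under a BigBird-style pattern in which each token attends to a local window, a few global tokens, and a few random tokens. The window and token counts are illustrative constants, not the values used in S-Enformer.

```python
# Hedged sketch: attended pairs under BigBird-style sparse attention grow
# linearly with sequence length, versus n**2 for dense self-attention.
import numpy as np

rng = np.random.default_rng(0)

def sparse_attention_pairs(n_tokens: int, window: int = 3,
                           n_global: int = 2, n_random: int = 3) -> int:
    """Count attended (query, key) pairs under a BigBird-style pattern."""
    pairs = 0
    for query in range(n_tokens):
        # Local window around the query position.
        attended = set(range(max(0, query - window),
                             min(n_tokens, query + window + 1)))
        # Global tokens, attended by every query.
        attended.update(range(n_global))
        # A few randomly chosen tokens.
        attended.update(rng.choice(n_tokens, size=n_random, replace=False))
        pairs += len(attended)
    return pairs

for n in (1_000, 2_000, 4_000):
    print(n, sparse_attention_pairs(n))  # grows ~linearly in n
```

In BigBird itself these patterns are realised blockwise so that the sparse computation maps efficiently onto accelerators, rather than per token as in this sketch23.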

References
  1. Uffelmann, Emil, Qin Qin Huang, Nchangwi Syntia Munung, Jantina de Vries, Yukinori Okada, Alicia R. Martin, Hilary C. Martin, Tuuli Lappalainen, and Danielle Posthuma. 2021. ‘Genome-Wide Association Studies’. Nature Reviews Methods Primers 1 (1): 1–21. https://doi.org/10.1038/s43586-021-00056-9

  2. Edwards, Stacey L., Jonathan Beesley, Juliet D. French, and Alison M. Dunning. 2013. ‘Beyond GWASs: Illuminating the Dark Road from Association to Function’. The American Journal of Human Genetics 93 (5): 779–97. https://doi.org/10.1016/j.ajhg.2013.10.012

  3. Leslie, R., C. J. O’Donnell, and A. D. Johnson. 2014. ‘GRASP: Analysis of Genotype-Phenotype Results from 1390 Genome-Wide Association Studies and Corresponding Open Access Database’. Bioinformatics 30 (12): i185–94. https://doi.org/10.1093/bioinformatics/btu273

  4. Albert, Frank W., and Leonid Kruglyak. 2015. ‘The Role of Regulatory Variation in Complex Traits and Disease’. Nature Reviews Genetics 16 (4): 197–212. https://doi.org/10.1038/nrg3891

  5. Avsec, Žiga, Vikram Agarwal, Daniel Visentin, Joseph R. Ledsam, Agnieszka Grabska-Barwinska, Kyle R. Taylor, Yannis Assael, John Jumper, Pushmeet Kohli, and David R. Kelley. 2021. ‘Effective Gene Expression Prediction from Sequence by Integrating Long-Range Interactions’. Nature Methods 18 (10): 1196–1203. https://doi.org/10.1038/s41592-021-01252-x

  6. Zhou, Jian, and Olga G. Troyanskaya. 2015. ‘Predicting Effects of Noncoding Variants with Deep Learning–Based Sequence Model’. Nature Methods 12 (10): 931–34. https://doi.org/10.1038/nmeth.3547

  7. Kelley, David R., Jasper Snoek, and John L. Rinn. 2016. ‘Basset: Learning the Regulatory Code of the Accessible Genome with Deep Convolutional Neural Networks’. Genome Research 26 (7): 990–99. https://doi.org/10.1101/gr.200535.115

  8. Kelley, David R., Yakir A. Reshef, Maxwell Bileschi, David Belanger, Cory Y. McLean, and Jasper Snoek. 2018. ‘Sequential Regulatory Activity Prediction across Chromosomes with Convolutional Neural Networks’. Genome Research 28 (5): 739–50. https://doi.org/10.1101/gr.227819.117

  9. Zhou, Jian, Chandra L. Theesfeld, Kevin Yao, Kathleen M. Chen, Aaron K. Wong, and Olga G. Troyanskaya. 2018. ‘Deep Learning Sequence-Based Ab Initio Prediction of Variant Effects on Expression and Disease Risk’. Nature Genetics 50 (8): 1171–79. https://doi.org/10.1038/s41588-018-0160-6

  10. Pei, Guangsheng, Ruifeng Hu, Yulin Dai, Astrid Marilyn Manuel, Zhongming Zhao, and Peilin Jia. 2021. ‘Predicting Regulatory Variants Using a Dense Epigenomic Mapped CNN Model Elucidated the Molecular Basis of Trait-Tissue Associations’. Nucleic Acids Research 49 (1): 53–66. https://doi.org/10.1093/nar/gkaa1137

  11. Mamoshina, Polina, Armando Vieira, Evgeny Putin, and Alex Zhavoronkov. 2016. ‘Applications of Deep Learning in Biomedicine’. Molecular Pharmaceutics 13 (5): 1445–54. https://doi.org/10.1021/acs.molpharmaceut.5b00982

  12. Hobert, Oliver. 2008. ‘Gene Regulation by Transcription Factors and MicroRNAs’. Science 319 (5871): 1785–86. https://doi.org/10.1126/science.1151651

  13. Aguet, François, Andrew A. Brown, Stephane E. Castel, Joe R. Davis, Yuan He, Brian Jo, Pejman Mohammadi, et al. 2017. ‘Genetic Effects on Gene Expression across Human Tissues’. Nature 550 (7675): 204–13. https://doi.org/10.1038/nature24277

  14. The ENCODE Project Consortium. 2012. ‘An Integrated Encyclopedia of DNA Elements in the Human Genome’. Nature 489 (7414): 57–74. https://doi.org/10.1038/nature11247

  15. Forrest, Alistair R. R., et al. 2014. ‘A Promoter-Level Mammalian Expression Atlas’. Nature 507 (7493): 462–70. https://doi.org/10.1038/nature13182

  16. Roadmap Epigenomics Consortium, Anshul Kundaje, Wouter Meuleman, Jason Ernst, Misha Bilenky, Angela Yen, Alireza Heravi-Moussavi, Pouya Kheradpour, et al. 2015. ‘Integrative Analysis of 111 Reference Human Epigenomes’. Nature 518 (7539): 317–30. https://doi.org/10.1038/nature14248

  17. Kelley, David R. 2020. ‘Cross-Species Regulatory Sequence Activity Prediction’. PLOS Computational Biology 16 (7): e1008050. https://doi.org/10.1371/journal.pcbi.1008050

  18. Woolfe, Adam, Martin Goodson, Debbie K. Goode, Phil Snell, Gayle K. McEwen, Tanya Vavouri, Sarah F. Smith, et al. 2004. ‘Highly Conserved Non-Coding Sequences Are Associated with Vertebrate Development’. PLOS Biology 3 (1): e7. https://doi.org/10.1371/journal.pbio.0030007

  19. Pennacchio, Len A., Nadav Ahituv, Alan M. Moses, Shyam Prabhakar, Marcelo A. Nobrega, Malak Shoukry, Simon Minovitsky, et al. 2006. ‘In Vivo Enhancer Analysis of Human Conserved Non-Coding Sequences’. Nature 444 (7118): 499–502. https://doi.org/10.1038/nature05295

  20. Levine, Mike. 2010. ‘Transcriptional Enhancers in Animal Development and Evolution’. Current Biology 20 (17): R754–63. https://doi.org/10.1016/j.cub.2010.06.070

  21. Long, Hannah K., Sara L. Prescott, and Joanna Wysocka. 2016. ‘Ever-Changing Landscapes: Transcriptional Enhancers in Development and Evolution’. Cell 167 (5): 1170–87. https://doi.org/10.1016/j.cell.2016.09.018

  22. Krivega, Ivan, and Ann Dean. 2012. ‘Enhancer and Promoter Interactions — Long Distance Calls’. Current Opinion in Genetics & Development 22 (2): 79–85. https://doi.org/10.1016/j.gde.2011.11.001

  23. Zaheer, Manzil, Guru Guruganesh, Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, et al. 2021. ‘Big Bird: Transformers for Longer Sequences’. arXiv:2007.14062 [cs, stat], January. http://arxiv.org/abs/2007.14062