The flavour of disorder: predicting intrinsically disordered regions in proteins by Deep Learning
Proteins are the basis of life, and over the last few decades we have learned a great deal about them through genome sequencing projects and other massive-scale experiments. However, important aspects of proteins such as their structure and function remain elusive, and the experimental techniques devised to reveal them have not scaled up as quickly as the techniques that elucidate their sequence or expression. We now know the sequence of well over one hundred million proteins, while structure or function is known for fewer than 0.1% of these. For decades the paradigm that proteins form rigid, stable structures went essentially unquestioned, but it is now clear that many proteins fold only partly into a regular native structure, are normally completely unfolded, or vary between folded and unfolded states (semi-unfolded). By some estimates, up to 20% of amino acids in known proteins are in a disordered state. Datasets are currently available comprising over 180,000 proteins for which disorder information is known in some form. The aim of this project is to predict disordered regions in proteins. The problem will be tackled with an array of Deep Learning techniques, which can learn the likely locations of disorder or semi-disorder from examples of proteins in which these locations are known experimentally. We may also examine these locations more closely to investigate disordered binding and semi-disorder variation. If successful, the results of the project may feed into the online Distill servers and improve the quality of their predictions. The Distill servers are a widely used tool, having served millions of queries originating from over 100 national and transnational internet domains around the world. Even a marginal improvement in their performance would benefit a large pool of scientists worldwide and help them further their research in biology, biotechnology, and drug design.