COHORT.2

2020 – 2024

Cagri Demir

Student

Project Title

Quantifying Uncertainty and Decision-Making under Uncertainty in ML

Project Description

Real-life applications of deep learning models require these algorithms to handle tasks they have not encountered before. Moreover, the data arriving from observations may drift continuously or shift frequently away from the data used to train the models. These challenges introduce uncertainty into the models' predictions. Uncertainty can be grouped under two broad headings: epistemic uncertainty (model uncertainty) and aleatoric uncertainty (noisy data and/or dataset shift). Together, these two types of uncertainty determine our confidence in a prediction, namely the predictive uncertainty.

The dynamic nature of behaviour and of the physical world means that intelligent beings must assess uncertainty and adapt to changing environments. Assessing uncertainty, rather than trying to eliminate it, is therefore a better approach to creating an intelligent agent. A major challenge for current deep learning models is to generalize reliably while also adapting quickly, safely, and efficiently to changes in the objective tasks. Current deep learning models can often output a confidence level for their predictions; however, the benefit of having such a confidence level, and how it should be used to tune, change, or adapt an ML model, is not well investigated.

The two major challenges of this project will be to develop algorithms that detect and quantify uncertainty for the task at hand, and to develop an adaptive framework that lets deep learning models react to changing levels of uncertainty. Tackling epistemic and aleatoric uncertainty will require changing or updating current models. The desired result of this project is a learning algorithm that adapts to a changing environment in an uncertainty-aware manner. The method would assess uncertainty when given a task and, if the new situation is unfamiliar, output a high uncertainty so that the model can adapt to the uncertain conditions. Adaptation could mean changing model parameters, adapting to the presence of new or changing features or to the absence of some features, adapting to new performance criteria for the model, or switching to a different machine learning architecture in order to maintain the required level of performance. Our research question is: can the performance of a deep learning model be increased by making it adaptive to the level of uncertainty of the task at hand? A data-driven methodology such as KDD will be used to develop and evaluate our research and to test our models on different datasets in changing and uncertain domains.
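
As an illustration of the kind of predictive-uncertainty estimate discussed above, the following is a minimal, hypothetical sketch using Monte Carlo dropout in PyTorch: dropout is left active at prediction time, several stochastic forward passes are averaged, and the spread across passes is read as an uncertainty signal that an adaptation step could react to. The model architecture, layer sizes, and the 50-pass budget are illustrative assumptions, not the project's actual method.

# Hypothetical sketch: Monte Carlo dropout as one way to estimate predictive
# uncertainty for a classifier; names and sizes are illustrative only.
import torch
import torch.nn as nn

class MCDropoutClassifier(nn.Module):
    def __init__(self, n_features, n_classes, p_drop=0.5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 64), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(64, n_classes),
        )

    def forward(self, x):
        return self.net(x)

def predictive_uncertainty(model, x, n_samples=50):
    """Run several stochastic forward passes with dropout left on and
    return the mean class probabilities and their per-class variance."""
    model.train()  # keep dropout active at prediction time
    with torch.no_grad():
        probs = torch.stack(
            [torch.softmax(model(x), dim=-1) for _ in range(n_samples)]
        )
    return probs.mean(dim=0), probs.var(dim=0)

model = MCDropoutClassifier(n_features=20, n_classes=3)
x = torch.randn(8, 20)                      # a batch of unseen inputs
mean_probs, var_probs = predictive_uncertainty(model, x)
# A high variance (or high predictive entropy) could trigger the adaptation step.
print(var_probs.max(dim=-1).values)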

Carlos Gómez Tapia

Student

Project Title

Graph Neural Networks

Project Description

Non-Euclidean data refers to a set of points that cannot be represented in a flat, two-dimensional space because they violate at least one of the axioms of Euclidean geometry. Graphs are a non-Euclidean data structure composed of nodes (objects) and edges (relations), since they violate the triangle inequality. By mapping real-world data onto a graph, it is possible to model complex systems such as physical systems and social networks, from which rich relational information can be extracted. Graph analysis is the process of extracting information from graphs, and its objectives include node classification, link prediction, and clustering. Examples of techniques employed in graph analysis include those for finding the minimum spanning tree or producing adjacency matrices. These techniques have been used in applications such as aircraft scheduling in aviation and task allocation in multiprocessor systems.

Recently, Deep Learning (DL) based methods such as Convolutional (CNNs) or Recurrent Neural Networks (RNNs) have been employed for graph analysis. However, here the structure of a graph must be made explicit to fully exploit the relations between objects. These methods operate on Euclidean data in one- or two-dimensional spaces and are thus not suited to processing graph inputs efficiently. To overcome this issue, a recent research field, Geometric Deep Learning (GDL), is devoted to building models that can be trained efficiently with non-Euclidean data. One of the techniques for this type of learning is the Graph Neural Network (GNN), and one of its peculiarities is its invariance to changes in the order in which non-Euclidean inputs are presented to the learning mechanism. Here, the edges between nodes are treated as dependencies rather than features, in contrast to traditional Euclidean-based learning approaches. This project will be devoted to better understanding the functioning and application of Graph Neural Networks, as well as formally comparing them against traditional approaches to deep learning.
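
To make the message-passing idea concrete, below is a small, hypothetical sketch of one graph-convolution layer (one common flavour of GNN) written in plain NumPy. The toy adjacency matrix, feature dimensions, and random weights are invented for illustration; a real model would learn the weights and stack several such layers.

# Hypothetical sketch of a single graph-convolution layer; the graph,
# feature sizes, and weights are made up for illustration.
import numpy as np

def gcn_layer(adjacency, features, weights):
    """One round of message passing: each node aggregates its neighbours'
    features (plus its own, via self-loops) and applies a shared linear map."""
    a_hat = adjacency + np.eye(adjacency.shape[0])        # add self-loops
    deg = a_hat.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    a_norm = d_inv_sqrt @ a_hat @ d_inv_sqrt              # symmetric normalisation
    return np.maximum(a_norm @ features @ weights, 0.0)   # ReLU

# Toy graph: 4 nodes, edges treated as dependencies rather than features.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 0],
              [0, 1, 0, 0]], dtype=float)
X = np.random.rand(4, 3)     # one 3-dimensional feature vector per node
W = np.random.rand(3, 2)     # shared weights, learnable in a real model
H = gcn_layer(A, X, W)       # new node embeddings, shape (4, 2)

Because the aggregation runs over a node's neighbourhood rather than over a fixed input ordering, relabelling the nodes permutes the rows of H but does not change the embeddings themselves, which is the order-invariance property mentioned above.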

Jenni Ajderian

Student

Project Title

NLP techniques for Fake News detection

Project Description

In this project we aim to investigate which NLP techniques are most effective at detecting fake news online. We will survey the current state of the art in NLP and claim verification techniques and identify areas for further investigation. Previous work on this topic includes knowledge graph-based techniques supported by string matching against large knowledge bases such as Wikipedia. Further research is required at every stage of the process. Fake news detection itself is not the full story: some researchers try to predict whether an article contains false information, while others look at verifying individual claims. We will look into the whole fact-checking and fake news process, including interviews with journalists and professional fact-checkers.

Several researchers have looked at how to construct large datasets for fake news identification with the right kind and amount of metadata. Further work has investigated which Machine Learning models should be used to process this data, from Naive Bayes classifiers with no stop-word removal or stemming stages to complex convolutional neural networks with stacks of modules. We will look at the interaction of different kinds of data with different kinds of models. Processing input data for relevant information is another step which requires further research, and which is tackled differently by different open challenges and datasets. The FEVER dataset release paper recommends splitting claim verification into document retrieval, sentence selection, and claim verification stages, so that a model assembles relevant information before trying to verify a claim. Conversely, the Fake News Challenge dataset was generated with the aim of checking whether or not a provided headline is related to the article body text. We aim to investigate whether these are the definitive steps needed for fact-checking, or whether some other conceptual step is needed along the way.
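
As a concrete example of the simplest kind of baseline mentioned above (a Naive Bayes classifier with no stop-word removal or stemming), here is a hypothetical scikit-learn sketch. The four in-line claims and their labels are invented purely for illustration and are not drawn from any of the datasets discussed.

# Hypothetical baseline sketch: bag-of-words Naive Bayes claim classification.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

claims = [
    "The city council approved the new budget on Tuesday.",
    "Scientists confirm the moon is made entirely of cheese.",
    "The vaccine was tested in a phase 3 clinical trial.",
    "Drinking bleach cures all known viral infections.",
]
labels = ["real", "fake", "real", "fake"]

# Deliberately no stop-word removal or stemming, as in the simplest baselines.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(claims, labels)

print(model.predict(["A new study confirms the moon is made of cheese."]))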

Kyle Hamilton

Student

Project Title

Provenance Chain Fact Validation in Neural Knowledge Graphs

Project Description

Recent years have brought a proliferation of information to the public. Social networks serve up billions of bite-sized chunks of “information” which we as humans process in the context of our world view and experience. But even with our wealth of “knowledge” about the world, it can be very difficult to infer the veracity or intent of the information presented. The potential for harm cannot be overstated: the effects of mis- and disinformation on society, whether in politics, public health, or climate change, are already evident.

The application of modern Machine Learning techniques, and in particular Deep Learning, is constantly evolving and improving. However, classifying information based solely on its linguistic content can only get us so far. We would like to explore the use of Knowledge Graphs (KGs) as additional context for identifying false information. In particular, we would like to explore provenance (to which graph structures ideally lend themselves) as indicative of the probability that an item is, or is not, “true” (a term that requires a much more in-depth definition, beyond the scope of this introduction). In addition, we are interested in the extent to which sources are biased, as a possible proxy for intent. We also believe that it is not enough to provide a model with high precision; the model must also be explainable. We think it is important to provide a provenance chain with credibility and bias indicators at each step. There is currently a lot of manual effort in this arena: FactCheck.org, PolitiFact, Snopes.com, and Hoax Slayer, to name a few. We would like our model to be at least as insightful as these efforts.

To build our model we will use existing datasets, which will need to be converted to a KG using NLP. This KG would be augmented with existing KGs such as DBpedia (leveraging the Semantic Web) or a proprietary solution such as Diffbot. To build an ontology for the fact validation model, we can use a framework like PROV-O. We can then combine the ontology and the knowledge graph to train a neural network that builds and checks the provenance chain. To validate our solution, we will compare it against baselines such as DeFacto (http://aksw.org/Projects/DeFacto.html), to see whether it improves results or provides streaming or real-time validation of facts, and against Microsoft’s early-detection model, which claims to beat the existing SOTA.
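
To illustrate how a single link in such a provenance chain might be represented, the following is a hypothetical sketch that records one derivation step as PROV-O triples with rdflib. The example claim, article, and agent URIs, and the credibilityScore property, are invented for illustration and are not part of any existing dataset or ontology.

# Hypothetical sketch: one provenance link expressed with the W3C PROV-O vocabulary.
from rdflib import Graph, Literal, Namespace, RDF
from rdflib.namespace import PROV

EX = Namespace("http://example.org/")
g = Graph()
g.bind("prov", PROV)
g.bind("ex", EX)

claim   = EX["claim/42"]
article = EX["article/example-2021-03-01"]
agent   = EX["agent/some-news-outlet"]

g.add((claim, RDF.type, PROV.Entity))
g.add((article, RDF.type, PROV.Entity))
g.add((agent, RDF.type, PROV.Agent))
g.add((claim, PROV.wasDerivedFrom, article))        # one link in the provenance chain
g.add((claim, PROV.wasAttributedTo, agent))
g.add((claim, EX.credibilityScore, Literal(0.42)))  # placeholder credibility/bias indicator

print(g.serialize(format="turtle"))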

Nasim Sobhani

Trevor Doherty

Student

Project Title

Developing a Methylation Risk Score for Telomere Shortening and investigating its association with age and stress-related disorders

Project Description

At the end of each chromosome in a human cell is a cap-like structure called a telomere. Just like the plastic tip on the end of a shoelace, the telomere keeps the DNA from fraying. As cells divide, telomeres get shorter; over time the DNA unravels, like a shoelace unravelling, and the cell dies. As telomeres shorten, our tissues show signs of ageing, so telomere length (TL) is a marker of ageing. Previously, scientists identified seven genetic determinants of TL, providing novel biological insights into TL and its relationship with disease. However, identifying genetic determinants of TL was only the first step in our journey to understand the role of TL in disease. More recently, scientists have identified a second layer of information (the epigenome) that sits on top of our DNA, acting like a molecular switch by fine-tuning how genes are regulated.

The primary aim of this study is to use machine learning methods to train a predictor of telomere shortening using epigenomic profiling data. It will provide a framework for identifying biological predictors of ageing and uncovering insights into telomere biology, and may lead to the identification of potential epigenomic biomarkers and/or therapeutic targets for ageing and stress-related phenotypes such as depression.

The primary objectives of this study are to:

1. Use machine learning methods (e.g. LASSO penalised regression models) to train a predictor of TL based on DNA methylation (a type of epigenetic modification) in a large epidemiological sample (n = 819).

2. Develop a methylation risk score (MRS) for telomere shortening, based on the CpG sites identified in the training set. This MRS will be validated in two independent replication blood cohorts (n = 192 and n = 178, respectively), collated in-house, that have both DNA methylation and TL measured.

3. Test whether the identified MRS for TL shortening is associated with age-related (e.g. Alzheimer’s Disease) and stress-related diseases (e.g. depression), using results from previously published DNA methylation-wide association studies.

4. Identify the causal relationship between DNA methylation changes and TL in humans using mediation analysis.

By the end of the project we will have a robust methodology utilising machine learning algorithms, which could be applied to other biological markers, such as pro-inflammatory cytokines, to examine their relationship with DNA methylation.
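
As a sketch of how Objectives 1 and 2 might look in code, below is a hypothetical scikit-learn example that fits a cross-validated LASSO model to simulated methylation beta values and derives a methylation risk score from the CpG sites with non-zero coefficients. The data are randomly generated; only the sample size (n = 819) echoes the study design, and the number of CpG sites and everything else are illustrative assumptions.

# Hypothetical sketch: LASSO-based TL predictor and methylation risk score (MRS).
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_samples, n_cpgs = 819, 1000                       # participants x CpG sites (illustrative)
X = rng.uniform(0, 1, size=(n_samples, n_cpgs))     # methylation beta values in [0, 1]
y = rng.normal(loc=7.0, scale=0.5, size=n_samples)  # stand-in for measured telomere length

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Cross-validated L1 penalty; the non-zero coefficients define the CpGs in the MRS.
lasso = LassoCV(cv=5).fit(X_train, y_train)
selected = np.flatnonzero(lasso.coef_)
mrs = X_test[:, selected] @ lasso.coef_[selected]   # weighted sum of selected CpGs = MRS
print(f"{selected.size} CpGs selected; held-out R^2 = {lasso.score(X_test, y_test):.3f}")

In the actual study the held-out evaluation would be replaced by the two independent replication cohorts, and the resulting MRS would then be tested for association with the age- and stress-related outcomes listed in Objective 3.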