Using Knowledge Graphs to Improve Data Quality in Machine Learning
Knowledge graphs have a wide variety of applications such as question-answering, digital assistants, structured reasoning and exploratory reasoning. One aspect of knowledge graphs is that they can be validated by a rule-based system, such as a knowledge base.
Existing data set may be semantically incoherent. Creating large scale knowledge graphs out of existing datasets would add knowledge representation to the semantically incoherent data which in turn would help improving accuracy in prediction tasks for machine learning models. Knowledge graphs by definition are relationship rich because they allow any-to-any relationships. By using the semantic graph integration approach, you’re guaranteeing the use of the most effective large scale, web-scale, and data integration method, one that’s symbiotic with machine learning. It allows to creation of better and most fully disambiguated training sets. This improves the quality of the data set for better results.
The proposed model transforms a data set to knowledge graphs and uses a knowledge base for evaluation of data quality as an additional layer in data prediction with machine learning models. These models would have the potential to simplify the task, use reasoning to enhance the dataset and assert quality issues in the underlying dataset. Especially, in the case of deep domains with very complex rules and complex interaction between rules, there is no substitute. Such scenarios are evident when there is a requirement to integrate disparate domains.
To summarize, the project combines the research fields of Semantic Web and Machine Learning. The goal is to design an ontology that validates datasets which are previously translated into knowledge graphs. However, this would involve work on description logic and propositional logic and building a knowledge base in Prolog. The workflow could go along the lines of 1. Translating a dataset into a knowledge graph which represents dataset characteristics and tells us about the quality of the dataset, 2. Feeding the dataset into the knowledge base (which is the main part of the project) to check against DL/PL rules. 3. Summarizing the results and making suggestions/automatic corrections to the dataset. A part of the work would be to investigate a SHAQL-style restriction language to define the rules.