Knowledge-enhanced text mining for short documents
With the vast amount of short text documents available online (tweets, forum messages, posts on social networks in general, etc.), short text mining has become an important part of natural language processing. However, unlike longer texts, short documents often lack contextual information, are frequently ungrammatical, and may contain abbreviations. As a result, traditional text mining approaches do not perform well on short documents.
The aim of this project is to investigate augmenting traditional text mining tasks with semantic information by utilising linguistic resources such as WordNet, ConceptNet, knowledge graphs, distributed representations, Wikipedia (for identifying related concepts and therefore additional features), or other suitable ontologies/repositories.
One approach to investigate is the semantic analysis of different parts of speech (nouns, adjectives, verbs) in local context (semantic similarity between pairs of words) and global context (lexical chains) with WordNet, together with the effect of augmenting some versus all parts of speech on performance across different linguistic resources. Similarity measures can be evaluated based on path length and on the contents of synonym or hyponym sets. The use of graph-based approaches, such as the PageRank algorithm, for context mapping between short texts and linguistic resources such as Wikipedia is another area we will look at.
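To make the idea of a path-based similarity measure concrete, the following is a minimal sketch, not the project's method. It assumes a tiny hand-made hypernym taxonomy (the `HYPERNYM` dictionary and both function names are hypothetical) standing in for WordNet, and scores two words as 1 / (1 + length of the shortest path through a shared ancestor).

```python
# Toy hypernym taxonomy (child -> parent). Hypothetical data standing in
# for a real lexical resource such as WordNet.
HYPERNYM = {
    "dog": "canine",
    "wolf": "canine",
    "canine": "mammal",
    "cat": "feline",
    "feline": "mammal",
    "mammal": "animal",
    "bird": "animal",
    "animal": "entity",
}


def ancestors(word):
    """Map the word and each of its ancestors to its distance from the word."""
    distances = {}
    node, depth = word, 0
    while node is not None:
        distances[node] = depth
        node = HYPERNYM.get(node)  # None once the root (or an unknown word) is reached
        depth += 1
    return distances


def path_similarity(a, b):
    """Return 1 / (1 + shortest path length), or 0.0 if no ancestor is shared."""
    up_a, up_b = ancestors(a), ancestors(b)
    common = set(up_a) & set(up_b)
    if not common:
        return 0.0
    shortest = min(up_a[node] + up_b[node] for node in common)
    return 1.0 / (1.0 + shortest)


if __name__ == "__main__":
    for pair in [("dog", "wolf"), ("dog", "cat"), ("dog", "bird")]:
        print(pair, round(path_similarity(*pair), 3))
```

In a short-text setting, scores of this kind could be computed between pairs of nouns, verbs or adjectives in a message and used as additional features; with a real resource the taxonomy would be a graph with multiple senses per word rather than a single-parent tree.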
Moreover, we will look at techniques for representing text features, such as co-occurrences or keywords, collocations, predicate-argument relations (verb-object, subject-verb), and heads of noun and word phrases, for augmenting the text mining tasks. Another suggested approach is building a distributional semantic model with lexical resources. Under this approach we will investigate the identification and use of words and contexts, weighting schemes, and space reduction techniques (LSI/LSA and PCA).
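As an illustration of the word/context and weighting steps of a distributional semantic model, here is a minimal sketch under stated assumptions: whitespace tokenisation, a symmetric context window, and positive pointwise mutual information (PPMI) as the weighting scheme. The function names and the choice of PPMI are illustrative assumptions, not choices fixed by the project; a space reduction step such as LSA or PCA would then be applied to the resulting weighted matrix and is not shown.

```python
import math
from collections import Counter


def cooccurrence_counts(docs, window=2):
    """Count (word, context) pairs that occur within a symmetric window."""
    pair_counts = Counter()
    for doc in docs:
        tokens = doc.lower().split()  # naive tokenisation, for illustration only
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pair_counts[(word, tokens[j])] += 1
    return pair_counts


def ppmi(pair_counts):
    """Weight each (word, context) pair by positive PMI; drop non-positive pairs."""
    total = sum(pair_counts.values())
    word_totals = Counter()
    context_totals = Counter()
    for (word, context), n in pair_counts.items():
        word_totals[word] += n
        context_totals[context] += n
    weights = {}
    for (word, context), n in pair_counts.items():
        pmi = math.log2(n * total / (word_totals[word] * context_totals[context]))
        if pmi > 0:
            weights[(word, context)] = pmi
    return weights


if __name__ == "__main__":
    posts = [
        "cheap flights to paris",
        "cheap hotels in paris",
        "football match tonight",
    ]
    for pair, weight in sorted(ppmi(cooccurrence_counts(posts)).items()):
        print(pair, round(weight, 3))
```

Because short documents yield very sparse counts, a sketch like this mainly shows where lexical resources could help: related concepts drawn from a resource could be added as extra contexts before weighting.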
There is a variety of possible application areas for the outcome of this research, for example fake news detection, business intelligence, and the enhancement of recommender systems (content-based filtering). Another, more specific application would be social media data analysis: our research approach could be applied to social media posts to analyse large volumes of unstructured data.