Simple and Efficient Approach for Knowledge Graph Mining from Text

Parmesh Ashwath
May 31, 2019

A knowledge graph (KG) is an efficient mechanism for storing and leveraging data; its structure allows both people and machines to better tap into the connections in their datasets. KGs have many applications, and the approach suggested here targets a question answering system that generates a short QA test on a given subject for high school students. The application space for KGs is vast. However, constructing a KG from unstructured text is a challenging problem by its very nature. Many approaches have been proposed on this subject, and yet there is still a lot of room for improvement. This article discusses a very simple system that takes raw text, constructs triples, and stores them in a way suitable for query processing.

The KG system has five key components: text preprocessing, coreference resolution, triple extraction, efficient triple storage, and query mapping (searching the KG). We will discuss each component in brief, along with how they are integrated into an end-to-end system. We will also talk about applying this concept to build a tool that generates a short one-word-answer test on a given subject, as well as a question answering system.

Text Preprocessing:

Initially, the raw text is processed to correct spelling mistakes and remove stop words. Lemmatization could also be applied, but we skipped it as it brought no performance improvement.
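For illustration, here is a minimal preprocessing sketch in Python using NLTK's stop-word list. The function name preprocess is mine, and spelling correction is assumed to be handled beforehand by a spell-checking library of your choice.

```python
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
STOP_WORDS = set(stopwords.words("english"))

def preprocess(text: str) -> str:
    """Drop stop words; spelling correction would run before this step
    (e.g. with a spell-checking library of your choice)."""
    return " ".join(w for w in text.split() if w.lower() not in STOP_WORDS)

print(preprocess("The cat is sitting on the mat"))  # -> "cat sitting mat"
```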

Coreference Resolution:

Coreference resolution is the task of finding all expressions that refer to the same entity in a text. It is an important step in creating a knowledge graph and in entity mapping: it reduces the number of nodes and helps produce meaningful nodes that are easy to query.

The StanfordCoreNLP module is used for this task. Given an input text, the output of this step is the same text with every coreference substituted with its referent.
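The article does not show the substitution logic itself, so the following is a simplified reconstruction, assuming a CoreNLP server running locally on port 9000 and the pycorenlp wrapper. It replaces each non-representative mention with the representative mention of its coreference chain.

```python
from pycorenlp import StanfordCoreNLP

nlp = StanfordCoreNLP("http://localhost:9000")  # assumes a running CoreNLP server

def resolve_coreferences(text: str) -> str:
    """Replace every non-representative mention with the representative
    mention of its coreference chain (simplified reconstruction)."""
    ann = nlp.annotate(text, properties={
        "annotators": "tokenize,ssplit,pos,lemma,ner,parse,coref",
        "outputFormat": "json",
    })
    # Collect token lists per sentence so mentions can be rewritten in place.
    sentences = [[tok["word"] for tok in s["tokens"]] for s in ann["sentences"]]
    for chain in ann["corefs"].values():
        rep = next(m for m in chain if m["isRepresentativeMention"])
        for m in chain:
            if m is rep:
                continue
            sent = sentences[m["sentNum"] - 1]       # sentNum is 1-based
            sent[m["startIndex"] - 1] = rep["text"]  # token indices are 1-based
            for i in range(m["startIndex"], m["endIndex"] - 1):
                sent[i] = ""                         # blank out the rest of the span
    return " ".join(w for s in sentences for w in s if w)

print(resolve_coreferences("John went home because he was tired."))
# -> "John went home because John was tired ."
```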

Triple Extraction:

This is the key component of the system, where we extract triples in the format [subject, predicate, object]. With coreference substitution and text preprocessing already done, entity mapping becomes easier.

Stanford’s Open Information Extraction (Open IE) is used for this task. Today, some manual intervention is still needed after this step to generate the rules for relation mapping and to clean up the triples. We are continuously working on improving this step, as it is the core of our system.
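A minimal sketch of this step, again assuming a local CoreNLP server and the pycorenlp wrapper; the rule-based relation mapping and triple cleanup mentioned above are not shown here.

```python
from pycorenlp import StanfordCoreNLP

nlp = StanfordCoreNLP("http://localhost:9000")  # assumes a running CoreNLP server

def extract_triples(text: str):
    """Return [subject, predicate, object] triples from Stanford Open IE."""
    ann = nlp.annotate(text, properties={
        "annotators": "tokenize,ssplit,pos,lemma,depparse,natlog,openie",
        "outputFormat": "json",
    })
    triples = []
    for sentence in ann["sentences"]:
        for t in sentence["openie"]:
            triples.append([t["subject"], t["relation"], t["object"]])
    return triples

print(extract_triples("Barack Obama was born in Hawaii."))
# e.g. [['Barack Obama', 'was born in', 'Hawaii']]
```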

The generated triples can then be persisted. The usual choice would be a graph database such as Neo4j or Dgraph, but for this simple system we have stored the triples in a NoSQL database (MongoDB), and for a KG generated from a single document we have not noticed any performance issues. Once the system matures, the plan is to move to a graph DB to leverage its capabilities.
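A sketch of the storage step using pymongo; the database and collection names (kg, triples) are illustrative, not from the article. It also builds the compound text index described in the next section.

```python
from pymongo import MongoClient, TEXT

client = MongoClient("mongodb://localhost:27017")  # assumed local MongoDB instance
triples = client["kg"]["triples"]                  # names are illustrative

def store_triples(extracted):
    """Store each [subject, predicate, object] triple as one document."""
    docs = [{"source": s, "relation": r, "target": t} for s, r, t in extracted]
    if docs:
        triples.insert_many(docs)

# One compound text index over all three fields, used for query mapping.
triples.create_index([("source", TEXT), ("relation", TEXT), ("target", TEXT)])
```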

Search/Querying the KG (Triples):

As mentioned, the triples are stored in MongoDB. Each triple is stored as one document with source, relation, and target as individual fields, and a text index is built over all three fields. One assumption made here is that a user query deals with only a single relation. We perform a text search on the index to get the top 5 documents matching the user query.
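A sketch of that search under the same pymongo setup as above; search_triples is a hypothetical name.

```python
from pymongo import MongoClient

triples = MongoClient("mongodb://localhost:27017")["kg"]["triples"]  # same collection as above

def search_triples(query: str, limit: int = 5):
    """Top `limit` triples for a user query, ranked by MongoDB text score."""
    return list(
        triples.find({"$text": {"$search": query}},
                     {"score": {"$meta": "textScore"}})
               .sort([("score", {"$meta": "textScore"})])
               .limit(limit)
    )
```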

The next task is to decide which component, the source or the target, should be returned to the user. This can easily be done by finding the text overlap. However, there are cases where multiple relations exist for the identified source or target. To resolve this, we again match the user query against the relations of all 5 documents, taking all of their synonyms into account. We used the NLTK WordNet package to build the synonym lists and select the best document as the response.
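A sketch of the synonym-based matching with the NLTK WordNet package; synonyms and best_match are hypothetical names, and since the article does not describe the exact scoring, simple term overlap is used here.

```python
import nltk
from nltk.corpus import wordnet

nltk.download("wordnet", quiet=True)

def synonyms(word: str) -> set:
    """WordNet lemma names for a word, plus the word itself."""
    names = {word.lower()}
    for syn in wordnet.synsets(word):
        for lemma in syn.lemmas():
            names.add(lemma.name().replace("_", " ").lower())
    return names

def best_match(query_relation: str, docs):
    """Pick the stored triple whose relation overlaps most with the query
    relation once both sides are expanded with synonyms."""
    query_terms = set().union(*(synonyms(w) for w in query_relation.split()))
    def overlap(doc):
        rel_terms = set().union(*(synonyms(w) for w in doc["relation"].split()))
        return len(query_terms & rel_terms)
    return max(docs, key=overlap)
```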

The application that generates questions is much simpler: we can pick a document at random and show the user either the source and predicate, asking them for the object, or the predicate and object, asking them for the source. The only requirement is to form a proper question from the source-predicate or predicate-object pair. Again, we used Stanford CoreNLP for question sentence generation.
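The article uses Stanford CoreNLP for question sentence generation; as a simpler stand-in, here is a template-based sketch that hides one component of a randomly chosen triple.

```python
import random

def make_question(triple):
    """Turn one stored triple into a fill-in-the-blank question, hiding
    either the source or the target at random (template-based stand-in;
    the article uses Stanford CoreNLP for proper question generation)."""
    source, relation, target = triple["source"], triple["relation"], triple["target"]
    if random.random() < 0.5:
        return f"{source} {relation} ____?", target  # ask for the object
    return f"____ {relation} {target}?", source      # ask for the subject

question, answer = make_question({"source": "Barack Obama",
                                  "relation": "was born in",
                                  "target": "Hawaii"})
print(question, "| answer:", answer)
```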

Next Steps:

  • We are still working on improving the triple extraction process, with the goal of completely eliminating manual intervention for generating rules and for removing or adding relevant triples.
  • Currently, the search process is primitive; we are considering different approaches to get better results and to handle multi-relation queries.

Knowledge graphs have been a compelling concept for me. I have always wanted to contribute to this field, and this is the start. This article explains the work I am doing, but there is still a lot of ground to cover before reaching the final solution. I would love to hear feedback and suggestions to make this work better. Please comment or message me with your inputs.
