Extending NLP

Pedro Oliveira

Jun 12, 2018, 3 minute read

Extending Stardog’s NLP pipeline is easy, and this short blog post will show you how.

One of the most powerful features of BITES, our unstructured data ingestion system, is the ability to easily create domain-specific NLP pipelines that process text and extract structured data from it. We call them Knowledge Extractors, and Stardog ships with several useful ones by default. For example, tika extracts metadata such as title, authors, and creation date from all kinds of documents; entities extracts named entity mentions; and linker and dictionary link those entities to nodes in a knowledge graph.

In this post we will show you how we created three new Knowledge Extractors, based on Stanford’s CoreNLP, which we just released as an open source project.

Entity Extraction and Linking

Our entities and linker extractors are based on OpenNLP. Although they work very well out of the box in most domains, the underlying models sometimes struggle to identify certain named entities. Stanford’s CoreNLP offers a powerful set of named entity recognition models, which are known to provide state-of-the-art results on several industry datasets.

Extending our entity recognition and linking modules to use CoreNLP was easy. Internally, those modules work as a pipeline, and the only coupling to OpenNLP was in the first step: parsing the text and translating the result into our internal Document representation. That representation is structured much like CoreNLP’s own, so the translation is straightforward. By replacing that first step with a CoreNLP-based one, we created two extractors: CoreNLPMentionRDFExtractor, which replicates the behavior of entities, and CoreNLPEntityLinkerRDFExtractor, which does the same for linker.
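The decoupling described above can be sketched in a few lines. This is only an illustration in Python, not Stardog's actual code: Token, Sentence, Document, to_document, and extract_mentions are hypothetical names, and the parser output is a hand-written list of dictionaries rather than real OpenNLP or CoreNLP annotations. The point is that only the adapter knows the parser's format, so swapping parsers leaves the rest of the pipeline untouched.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Token:
    text: str
    ner: str = "O"  # named-entity tag; "O" means "not part of an entity"


@dataclass
class Sentence:
    tokens: List[Token] = field(default_factory=list)


@dataclass
class Document:
    sentences: List[Sentence] = field(default_factory=list)


def to_document(parsed: List[List[Dict[str, str]]]) -> Document:
    """Adapter: the only step that knows the parser's output format."""
    return Document(
        [Sentence([Token(t["word"], t.get("ner", "O")) for t in sent]) for sent in parsed]
    )


def extract_mentions(doc: Document) -> List[str]:
    """Parser-agnostic step: groups consecutive tokens that share an entity tag."""
    mentions = []
    for sentence in doc.sentences:
        current, current_tag = [], "O"
        for token in sentence.tokens:
            if token.ner != "O" and token.ner == current_tag:
                current.append(token.text)
            else:
                if current:
                    mentions.append(" ".join(current))
                current = [token.text] if token.ner != "O" else []
                current_tag = token.ner
        if current:
            mentions.append(" ".join(current))
    return mentions


# Simulated parser output for one sentence (hand-written, not produced by CoreNLP).
parsed = [[
    {"word": "The"},
    {"word": "Orioles", "ner": "ORGANIZATION"},
    {"word": "are"},
    {"word": "based"},
    {"word": "in"},
    {"word": "Baltimore", "ner": "CITY"},
    {"word": "."},
]]
mentions = extract_mentions(to_document(parsed))
print(mentions)
```

Supporting a different parser in this sketch would mean writing another to_document-style adapter; extract_mentions would not change.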

Relation Extraction

One of the most interesting features of CoreNLP is the ability to extract not only named entities but also the relationships between them. For example, consider the sentence:

The Orioles are a professional baseball team based in Baltimore.

Here we can identify that Orioles and Baltimore are named entities, but we can also identify an implicit relationship between them: Baltimore is the city where the Orioles are headquartered.

We created an extractor, CoreNLPRelationRDFExtractor, that leverages this feature to automatically extract nodes and edges from text. For example, running the previous sentence through the extractor generates the following output:

entity:f06574 rdfs:label "Orioles"
entity:679a56 rdfs:label "Baltimore"
entity:f06574 relation:org:city_of_headquarters entity:679a56
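To make the shape of that output concrete, here is a rough sketch of how one extracted relation maps to three triples: a label for each entity plus an edge between them. It is an illustration only. The functions entity_iri and relation_to_triples are hypothetical, and the hash-based identifier scheme is an assumption, so the generated identifiers will not match the ones shown above.

```python
import hashlib
from typing import List, Tuple


def entity_iri(label: str) -> str:
    """Hypothetical ID scheme: first six hex digits of a hash of the label."""
    return "entity:" + hashlib.sha1(label.encode("utf-8")).hexdigest()[:6]


def relation_to_triples(subject: str, relation: str, obj: str) -> List[Tuple[str, str, str]]:
    """Turn one extracted relation into two label triples and one edge triple."""
    s, o = entity_iri(subject), entity_iri(obj)
    return [
        (s, "rdfs:label", '"%s"' % subject),
        (o, "rdfs:label", '"%s"' % obj),
        (s, "relation:" + relation, o),
    ]


triples = relation_to_triples("Orioles", "org:city_of_headquarters", "Baltimore")
for triple in triples:
    print(" ".join(triple))
```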

CoreNLP provides models to extract several different kinds of relationships, such as lives_in and works_for, and it’s also possible to train your own models to recognize relationships specific to data that you care about.

Usage

We released all three extractors as an open source project, bites-corenlp, available on GitHub. Using them with Stardog is easy:

Download the latest jar
Add that jar to Stardog’s classpath, either by copying it to the server/ext folder inside the Stardog installation or by pointing the environment variable STARDOG_EXT to the directory containing the jar
Restart the Stardog server
CoreNLPMentionRDFExtractor, CoreNLPEntityLinkerRDFExtractor, and CoreNLPRelationRDFExtractor will be available as RDF extractors, accessible through the CLI, API, and HTTP interfaces
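
The copy step in the list above could be scripted roughly as follows. This is a sketch under assumptions: install_jar is a hypothetical helper, the jar file name bites-corenlp.jar is made up for the demo, and the demo runs against a temporary directory instead of a real Stardog installation. It covers only the server/ext option; with STARDOG_EXT no copy is needed.

```python
import shutil
import tempfile
from pathlib import Path


def install_jar(jar: Path, stardog_install: Path) -> Path:
    """Copy an extractor jar into <stardog_install>/server/ext and return its new path."""
    ext_dir = stardog_install / "server" / "ext"
    ext_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy2(jar, ext_dir))


# Demo against a throwaway directory tree, so no real Stardog installation is touched.
with tempfile.TemporaryDirectory() as tmp:
    fake_install = Path(tmp) / "stardog"        # stands in for the install directory
    fake_jar = Path(tmp) / "bites-corenlp.jar"  # hypothetical jar file name
    fake_jar.write_bytes(b"placeholder")
    installed = install_jar(fake_jar, fake_install)
    installed_ok = installed.is_file()
    relative_path = installed.relative_to(fake_install).as_posix()

print(installed_ok, relative_path)
```

The server still has to be restarted afterwards before the new extractors become available.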

For example, to add a document to BITES and extract its entities using the CLI:

stardog doc put --rdf-extractors CoreNLPMentionRDFExtractor myDatabase document.pdf

Multiple extractors can be applied to the same document. For example, the following command will extract metadata and relationships from a document:

stardog doc put --rdf-extractors tika,CoreNLPRelationRDFExtractor myDatabase document.pdf
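
If you drive the CLI from a script, the comma-separated extractor list can be assembled programmatically. The sketch below only builds the argument list for the command shown above; doc_put_command is a hypothetical helper, and it deliberately covers only the case where extractors are named explicitly.

```python
from typing import List, Sequence


def doc_put_command(database: str, document: str, extractors: Sequence[str]) -> List[str]:
    """Build the argument list for `stardog doc put` with explicit RDF extractors."""
    if not extractors:
        # This helper does not model the CLI's behavior when the flag is omitted.
        raise ValueError("this helper expects at least one extractor name")
    return [
        "stardog", "doc", "put",
        "--rdf-extractors", ",".join(extractors),
        database, document,
    ]


command = doc_put_command(
    "myDatabase", "document.pdf", ["tika", "CoreNLPRelationRDFExtractor"]
)
print(" ".join(command))
# To actually run it (requires the stardog CLI on PATH and a running server):
# import subprocess; subprocess.run(command, check=True)
```

Passing the arguments as a list rather than a single string avoids shell quoting problems when document names contain spaces.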

Future Work

BITES has great potential to unlock the implicit knowledge within unstructured data, and we are actively working on creating new knowledge extractors and making the whole pipeline easier to use.