Extending NLP
Extending Stardog’s NLP pipeline is easy, and this short blog post will show you how.
One of the most powerful features of BITES, our unstructured data ingestion system, is the ability to easily create domain-specific NLP pipelines that process and extract structured data from text. We call them Knowledge Extractors and, by default, we ship Stardog with several useful ones. For example, tika extracts metadata from all kinds of documents, such as title, authors, and creation dates; entities extracts named entity mentions, while linker and dictionary further link those entities to nodes in a knowledge graph.
In this post we will show you how we created three new Knowledge Extractors, based on Stanford’s CoreNLP, which we just released as an open source project.
Our entities and linker extractors are based on OpenNLP. Although they work very well out of the box in most domains, the underlying models sometimes struggle to identify certain named entities. Stanford’s CoreNLP offers a very powerful set of named entity recognition models, which are known to provide state-of-the-art results on several industry datasets.
Expanding our entity recognition and linking modules to use CoreNLP was easy. Internally, those modules work as a pipeline, and the only coupling to OpenNLP was at the first step, i.e., parsing the text and translating it to our internal Document representation. This representation follows a structure very similar to CoreNLP’s. By translating CoreNLP’s output into that representation at the start of the pipeline, we created two extractors: CoreNLPMentionRDFExtractor, which replicates the behavior of entities, and CoreNLPEntityLinkerRDFExtractor, which does the same for linker.
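The idea that only the first pipeline step depends on the NLP library can be sketched in a few lines of Python. This is a minimal illustration, not the actual BITES code: the Token, Sentence, and Document classes, the toy_parser stand-in, and the extract_mentions function are all hypothetical names, and the real Document representation may carry more information.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Token:
    text: str
    ner: str = "O"  # "O" marks a token outside any named entity


@dataclass
class Sentence:
    tokens: List[Token] = field(default_factory=list)


@dataclass
class Document:
    sentences: List[Sentence] = field(default_factory=list)


# Hypothetical contract: a parser is any callable that turns raw text into a
# Document. This is the only step that would know about OpenNLP or CoreNLP.
Parser = Callable[[str], Document]


def toy_parser(text: str) -> Document:
    """Stand-in for a real NLP backend: tags tokens from a tiny lookup table."""
    known = {"Orioles": "ORGANIZATION", "Baltimore": "LOCATION"}
    tokens = [Token(word, known.get(word, "O")) for word in text.rstrip(".").split()]
    return Document([Sentence(tokens)])


def extract_mentions(parser: Parser, text: str) -> List[Tuple[str, str]]:
    """Backend-independent step: merge adjacent tokens that share an entity tag."""
    mentions = []
    for sentence in parser(text).sentences:
        current, tag = [], "O"
        # The trailing sentinel token flushes a mention that ends the sentence.
        for token in sentence.tokens + [Token("", "O")]:
            if token.ner != tag and current:
                mentions.append((" ".join(current), tag))
                current = []
            tag = token.ner
            if tag != "O":
                current.append(token.text)
    return mentions
```

Because extract_mentions only sees a Document, replacing toy_parser with a function backed by a different NLP library would not require changing anything downstream, which is the property the extractors rely on.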
One of the most interesting features of CoreNLP is the ability to extract not only named entities but also the relationships between them. For example, given the sentence
The Orioles are a professional baseball team based in Baltimore.
we can identify that Orioles and Baltimore are named entities, but we can also identify an implicit relationship between them: Baltimore is the city where the Orioles are headquartered.
We created an extractor, CoreNLPRelationRDFExtractor, that leverages this feature to automatically extract nodes and edges from text. For example, running the previous sentence through the extractor generates the following output:
entity:f06574 rdfs:label "Orioles"
entity:679a56 rdfs:label "Baltimore"
entity:f06574 relation:org:city_of_headquarters entity:679a56
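To make the shape of that output concrete, here is a minimal Python sketch that turns one extracted relation into the same three kinds of statements. The function names are hypothetical, and deriving identifiers from a SHA-1 hash of the label is purely an assumption for illustration; the identifiers the actual extractor generates may be computed differently.

```python
import hashlib
from typing import List


def entity_id(label: str) -> str:
    # Assumption: a short hash of the label serves as the node identifier.
    # The real extractor may use a different scheme.
    return "entity:" + hashlib.sha1(label.encode("utf-8")).hexdigest()[:6]


def relation_to_triples(subject: str, relation: str, obj: str) -> List[str]:
    """Emit two labelled nodes and one edge connecting them."""
    s, o = entity_id(subject), entity_id(obj)
    return [
        f'{s} rdfs:label "{subject}"',
        f'{o} rdfs:label "{obj}"',
        f"{s} relation:{relation} {o}",
    ]


for triple in relation_to_triples("Orioles", "org:city_of_headquarters", "Baltimore"):
    print(triple)
```

Each entity becomes a node with a label, and the relation becomes an edge between the two nodes, so the same entity mentioned in several sentences would map to the same node under this scheme.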
CoreNLP provides models to extract several different kinds of relationships, such as lives_in and works_for, and it’s also possible to train your own models to recognize relationships specific to data that you care about.
We released all three extractors as an open source project, bites-corenlp, available on GitHub. Using them with Stardog is easy:
Add the bites-corenlp jar to Stardog’s classpath, either by copying it to the server/ext folder inside Stardog or by pointing the environment variable STARDOG_EXT to the directory containing the jar.
CoreNLPMentionRDFExtractor, CoreNLPEntityLinkerRDFExtractor, and CoreNLPRelationRDFExtractor will then be available as RDF extractors, accessible through the CLI, API, and HTTP interfaces.
For example, using the CLI, if you want to add a document to BITES and extract its entities:
stardog doc put --rdf-extractors CoreNLPMentionRDFExtractor myDatabase document.pdf
Multiple extractors can be applied to the same document. For example, the following command will extract metadata and relationships from a document:
stardog doc put --rdf-extractors tika,CoreNLPRelationRDFExtractor myDatabase document.pdf
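If you want to ingest many documents with the same set of extractors, a small wrapper script can help. The Python sketch below only assembles the stardog doc put invocation shown above and runs it once per file. The helper names doc_put_command and ingest are hypothetical, and the script assumes the stardog binary is on your PATH.

```python
import subprocess
from typing import Iterable, List, Sequence


def doc_put_command(database: str, document: str, extractors: Sequence[str]) -> List[str]:
    """Build the argument list for one `stardog doc put` call."""
    if not extractors:
        raise ValueError("at least one extractor name is required")
    # Extractor names are passed as a single comma-separated value.
    return [
        "stardog", "doc", "put",
        "--rdf-extractors", ",".join(extractors),
        database, document,
    ]


def ingest(database: str, documents: Iterable[str], extractors: Sequence[str]) -> None:
    """Run the CLI once per document, stopping at the first failure."""
    for document in documents:
        subprocess.run(doc_put_command(database, document, extractors), check=True)
```

Calling ingest("myDatabase", ["a.pdf", "b.pdf"], ["tika", "CoreNLPRelationRDFExtractor"]) would issue the second command above for each file in turn.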
BITES has great potential to unlock the implicit knowledge within unstructured data, and we are actively working on creating new knowledge extractors and making the whole pipeline easier to use.