Two experts from Stardog and Stardog partner Lymba sat down to discuss trends in graph and NLP. Mithun Balakrishna, Director of Research & Engineering, and Rob Harris, VP of Solutions, answer some of the most important questions. Below you will find a transcript of some highlights from their conversation. Scroll all the way to the bottom of the page to watch the full interview!
Q: What should organizations expect when implementing graph?
Rob Harris, Stardog: When you look at the adoption of graph and NLP, one of the biggest struggles people have been going through is how to connect their data in order to leverage this type of technology. Creating yet another data store within organizations that have already spent millions of dollars to build up these large data warehouses and data lakes, and then saying “you could get lots of really great insights by leveraging graph, but you have to take everything out and stick it all into another repository,” is kind of a bitter pill for people to swallow. That’s why we really invested in building this data virtualization capability directly into the platform, so we can represent data from existing silos within the graph. People can write graph queries, they can look at graph embeddings, they can look at all the ways they can leverage data through the graph, but they don’t necessarily have to do all this ETL into the graph itself. And that’s been a huge help in adoption, especially as it relates to the ingestion side.
But what is also interesting is that we’ve seen a lot of evolution on the consumption side - how people are consuming data out of the graph. And one of the ways I’ve seen that really change recently has been the idea of using natural language to interact with the graph instead of a more structured query language, like SPARQL or SQL. People want their end users to be able to ask natural questions in English, or whatever language they’re used to, and then be able to get related pieces of information out.
Mithun Balakrishna, Lymba: That tells you that people are now looking at the other side of things, right? If the data, or the semantic data, were not in the graph, then nobody would ask how to query it. So if somebody says they want to query it, that means they did something to get semantic data - most are probably using some kind of conversion like NLP to take the unstructured data and convert it into semantic structured data - and now the question is, “I have all these pieces, what do I do? How do I query it?” Usually our pipeline starts with ontology development to create the shape of the knowledge that you want. You define the important concepts you’re looking for and the important relationships between those concepts, and then capture that in a formal structure, usually an OWL ontology. With that, just like you have this virtualization process where you can connect to legacy data without having to bring it in, we have a similar process for taking the ontology and using it to automatically train the NLP. It’s very hard to ask people to create an ontology and then tell them, “Now take the same piece of information and train the machine as well too,” right? There’s no reason why the ontology cannot be used directly to create or train the NLP model.
Q: What are some critical features of NLP software?
Rob Harris, Stardog: When you talk about logical reasoning or inferencing—the ability to create connections between entities that don’t naturally have keys or aren’t naturally connected—that’s been a critical feature for our customers as well, regardless of whether they’re leveraging data extracted from natural language or from structured sources. We’ve adhered very closely to the W3C standards as they relate to OWL, and we support all levels of OWL axioms that are specified, as well as SWRL, the Semantic Web Rule Language, in order to actually create and define these inference rules. You can structurally say, for instance, that this item is hierarchically a subclass, or sub-version, of this other item, and inherit from that, so that, for example, Texas is a state of the United States. Or you can define explicit rules, for instance: if I work at a certain hospital and I am doing a study on a particular disease condition, then I can be considered an expert on that disease condition. Being able to define these rules is how a lot of our customers get value out of the data beyond just integrating it all under a single flexible platform. Like you said, it’s about creating these linkages where you don’t need to follow your nose through 10 different connections, but can instead create the rules that allow you to jump right to the answer. And it makes sense that that’s a critical piece as it relates to natural language.
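As an illustration, the “expert” rule Rob describes could be sketched as a tiny forward-chaining inference step in plain Python. This is a toy, not Stardog’s OWL/SWRL reasoner, and the triple predicates (`worksAt`, `studies`, `expertOn`) are hypothetical names invented for the example:

```python
# Facts are (subject, predicate, object) triples; all names here are hypothetical.
facts = {
    ("alice", "worksAt", "MercyHospital"),
    ("alice", "studies", "DiseaseX"),
    ("MercyHospital", "type", "Hospital"),
}

def infer_experts(triples):
    """Toy rule: if a person works at a hospital and studies a condition,
    infer that they are an expert on that condition."""
    inferred = set()
    works = {(s, o) for s, p, o in triples if p == "worksAt"}
    hospitals = {s for s, p, o in triples if p == "type" and o == "Hospital"}
    for person, place in works:
        if place in hospitals:
            for s, p, o in triples:
                if s == person and p == "studies":
                    inferred.add((person, "expertOn", o))
    return inferred

print(infer_experts(facts))  # {('alice', 'expertOn', 'DiseaseX')}
```

The inferred triple is exactly the “jump right to the answer” linkage: a consumer can query `expertOn` directly instead of traversing the employment and study relationships by hand.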
When we see people look at inference rules, a lot of times they’re using this technology for some sort of discovery process. They’re putting it in the field for the people leveraging the technology, like risk analysts, clinical researchers, or people doing financial investments, to ask questions that they may not know the answer to. They’re trying to find a better way to mitigate risk and to understand the financial world, with all the connections between the various companies.
Q: What industries have had a rise in NLP?
Mithun Balakrishna, Lymba: Mainly in communities that have a lot of unstructured data, which is pretty much everybody now. But especially in the financial domain, related to compliance. What we see is that a regulator comes up with all these different regulations that organizations are supposed to comply with. The organizations themselves have created policies and procedures to comply with those regulations. A lot of our use cases are related to matching the two: for instance, a company would ask whether its policies and procedures comply with a regulation that a regulatory authority has come up with, and which parts of its documents comply with it. We need to make sure that the NLP can extract the schema, i.e. the important concepts, from the regulatory document as well as from the actual policy and procedure, and then match them against each other. That is the querying and report-creation part.
With this, several new NLP components come into the picture. One is the extraction of entities, which has been very key, along with the extraction of relationships between concepts. It’s also about topic modeling, i.e. what are these documents about? What’s the “aboutness” of a particular document? The main reason for that aboutness understanding is that people do not use the same terms all the time. So you need mechanisms for saying that all of these terms are in the same domain, or close to each other; that’s the aboutness of a particular topic. They’re related somehow.
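A minimal sketch of the “aboutness” idea: represent each document as a bag of terms and compare the bags with cosine similarity, so documents that share domain vocabulary score as related even when they don’t use identical phrasing. This is a deliberately simplified stand-in for real topic modeling, and the sample sentences are invented:

```python
from collections import Counter
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def bag_of_terms(text):
    """Crude tokenization: lowercase and split on whitespace."""
    return Counter(text.lower().split())

# Hypothetical compliance-flavored snippets.
policy = bag_of_terms("employees must report trading conflicts to compliance")
regulation = bag_of_terms("firms shall ensure compliance with conflict of interest trading rules")
unrelated = bag_of_terms("the cafeteria menu changes every friday")

print(cosine(policy, regulation) > cosine(policy, unrelated))  # True
```

Production systems would add stemming, synonyms, and learned topic vectors precisely because, as Mithun notes, people don’t use the same terms all the time; the overlap-based score here is only the starting intuition.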
Rob Harris, Stardog: One of the interesting things I’ve noticed is how all of this technology plays into the rise of ML and AI. So many of the organizations I talk to have built large data science orgs that they’re always hiring into. They’re trying to find new techniques to better predict their business, their sales, and the behavior of their customers, and to leverage that information in a way that will help the business. One of the things we’ve noticed on our side, as it relates to leveraging semantic technology to help with that, is really the initial data acquisition piece: being able to bring together data from lots of different sources and represent it in a consistent way so that it can be fed into an ML model. That’s where we’ve seen people leverage the graph, tying things together using data virtualization and using the semantic ontologies to rationalize the meaning of various objects and show their connectedness, in order to get more insight out of these models.
Q: How has the rise of AI, NLP, and ML impacted industries?
Mithun Balakrishna, Lymba: It’s helped a lot, and it has also harmed a lot in a way, because there is this preconceived notion that you can take all these deep learning models, apply them, and it just works. It does not happen that way, unfortunately. But that said, organizations have tried. Every organization now definitely has a data science team, and even an ML team, to do some kind of learning. Where they hit a roadblock is the data requirement. Everything requires labeled data, and creating labeled data is expensive. One of the key things we have been working on is bootstrapping that process. We understand that if you have a lot of labeled data, you cannot beat machine learning techniques at classification, extraction, question answering, or anything else you want to do. But again, labeled data is expensive.
So we have a bootstrapping process. We start with a small amount of data, just a little bit of semantic annotation that you need to do. We can train the machine well enough to get you 70-80% of the way there. And then, as users correct the system, those corrections help the machine learn more. That’s where the machine learning models come in. This has been the message we’ve been communicating quite a bit right now. Like I said, it’s helped a lot, because people are definitely open now to the idea that these applications are possible. These are not just wild science projects that somebody dreamed up; these are things that are actually conceivable now as working in practice.
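The bootstrapping loop Mithun describes can be sketched as a simple self-training cycle: train on a small hand-labeled seed, predict on unlabeled text, and fold confident predictions (or user corrections) back into the training set. This toy word-vote classifier and its data are invented for illustration; Lymba’s actual pipeline is far richer:

```python
from collections import Counter, defaultdict

def train(examples):
    """Count word/label co-occurrences from labeled (text, label) pairs."""
    counts = defaultdict(Counter)
    for text, label in examples:
        for word in text.lower().split():
            counts[word][label] += 1
    return counts

def predict(counts, text):
    """Score labels by summing word votes; return (label, confidence)."""
    votes = Counter()
    for word in text.lower().split():
        votes.update(counts.get(word, Counter()))
    if not votes:
        return None, 0.0
    label, top = votes.most_common(1)[0]
    return label, top / sum(votes.values())

# Small hand-labeled seed: the expensive part Mithun mentions.
seed = [("quarterly trading report filed", "finance"),
        ("patient diagnosed with rare condition", "clinical")]
model = train(seed)

# Self-training: fold confident predictions on unlabeled text back in.
unlabeled = ["trading desk report submitted late", "new patient condition observed"]
for text in unlabeled:
    label, confidence = predict(model, text)
    if confidence >= 0.6:        # confidence threshold (arbitrary here)
        seed.append((text, label))
        model = train(seed)
```

In practice the loop would route low-confidence items to a human annotator rather than discard them, which is where the user corrections come back in.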
Rob Harris, Stardog: I’ve been in the space for many years now in a variety of different roles, usually from an analytics perspective, and seeing the rise of ML techniques being brought to bear, along with the missed expectations, has been quite interesting. There’s a lot of expectation setting I’ve had to do over the years with our customers about how it’s not magic. There’s a lot of work that goes into building out these machine learning models, and a lot of work that goes into getting those predictions to a place where they make sense and logically fit into the world. They don’t just erupt out of nothing. There’s a lot of labeling of data that needs to happen, and rationalizing of data that needs to happen, for things like classification of information.
But that brings up an interesting topic: what are the things that seem kind of fantastical now that we see the world heading toward? Where do we see this space evolving? I’d say on our side there are really two things that are not there yet, but are coming soon. The first one is ease of use. This has been a space of experts for quite some time, where ontologists and linguists and people who deeply understand how these things are structured, created, and put together need to be involved to get the most value out of it. More and more, I’ve seen automation coming in to onboard new data sources and do some of that initial training. None of it’s there yet, but there are more and more innovations happening to accelerate that, as well as new concepts like data fabrics, which talk about taking your entire data integration and access plane and hiding it behind semantic technology, so you can use a single canonical model to interact with data regardless of where that data lives, and regardless of whether it’s structured or unstructured, to simplify the citizen analyst’s role or the application integrator’s role throughout the organization. Most of this is pretty early stage, but we definitely see the market continuing to evolve and innovate in this space.
Check out the full interview below!