Case Study

Boehringer Ingelheim uses Stardog to transform its data lake into a FAIR data foundation

A “data foundation” vision

Several years ago, Boehringer Ingelheim—the maker of some of the leading treatments for illnesses like type 2 diabetes, stroke, and COPD—began linking its biomarker data, such as targets, genes, and diseases, to improve how bioinformaticians work across R&D data. They tried several different tech stacks before realizing they needed a more systemic approach: a “technical foundation” that would link data from different parts of the company and make it available to everyone in the organization.

Boehringer had been making progress in some areas with value lists and master vocabularies, but in the more complex area of computational biology they needed a more mature solution, one that could show how terms relate to one another. Another important consideration was external data. Boehringer sources 30% of its active ingredients from external collaborations and has limited control over the quality of that data. They needed a flexible solution that could relate their internal experimental results to external and publicly available studies.

Their existing data lake was not up to this task. They needed a technology that could connect data regardless of its source or type—a solution that would create a data layer making data available to everyone in Boehringer and letting them explore it “Wikipedia-style”. This led them to knowledge graphs.

For us it was a natural choice to deviate from the pure data lake technologies to a more sophisticated model.

- Dirk Malthan, Head of IT Research Computational Biology and Translational Science, Boehringer Ingelheim

The Semantic Integration Project

Their vision begat the Semantic Integration Project, which would build a semantic layer atop Boehringer’s data lake, accelerating access to its contents and providing a consolidated, one-stop shop for 90% of their R&D data.

This necessitated a shift in perspective. With a data lake, you put the data in and then worry about the quality later when you get the data out, oftentimes much to the chagrin of the data consumer. Working with Knowledge Graphs, you do it the other way around. You first think about the data model and get your data in good shape before storing it.

To that end, Boehringer began by taking care of data quality up front. The team built an entity name service, which generates IDs for all entities. They also connected metadata from workflow systems to answer questions such as the following (a query sketch follows the list):

  • Who generated what sample?
  • Which study is currently running?
  • Which research project is currently being performed?
  • Where is the data stored?
  • Which device created the data?
  • In which freezer was the sample located?
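
As a rough illustration of what this linking enables, questions like those above reduce to simple graph queries once samples, studies, and devices share stable IDs. The SPARQL sketch below uses an invented ex: vocabulary and property names, not Boehringer’s actual schema:

    PREFIX ex: <http://example.org/rd/>

    SELECT ?sample ?scientist ?study ?device ?freezer
    WHERE {
      ?sample a ex:Sample ;
              ex:generatedBy ?scientist ;   # who generated the sample
              ex:partOfStudy ?study ;       # which study it belongs to
              ex:measuredOn  ?device ;      # which device created the data
              ex:storedIn    ?freezer .     # freezer where the sample is kept
      ?study ex:status "running" .          # restrict to currently running studies
    }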

Letting bioinformaticians focus on the science

With these common nomenclatures and IDs in upstream data acquisition systems, bioinformaticians could access the data with no cleaning required, and the data arrived already linked to the right entities.

Boehringer launched a linked data dictionary, which bioinformaticians could use to explore the data. The tool allows for three different types of access:

  • Search engine: Users can search for a particular disease, study, or gene, and then explore the results “Wikipedia-style”
  • Data-model browsing: Analysts can see directly in the data model how one piece of data relates to the rest of R&D data
  • Pre-defined queries: A lightweight query builder that lets analysts pull reports out of the Knowledge Graph—no SPARQL knowledge required (an example report query is sketched below)
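
To make the pre-defined query mode concrete, a report of this kind could look like the following SPARQL. This is a sketch only; the ex: vocabulary, property names, and labels are invented for illustration, not Boehringer’s actual schema:

    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX ex:   <http://example.org/rd/>

    # Report: genes associated with a disease, ranked by how many
    # studies investigate each gene.
    SELECT ?gene ?geneLabel (COUNT(?study) AS ?studies)
    WHERE {
      ?disease rdfs:label "type 2 diabetes" .
      ?gene a ex:Gene ;
            rdfs:label ?geneLabel ;
            ex:associatedWith ?disease .
      ?study ex:investigates ?gene .
    }
    GROUP BY ?gene ?geneLabel
    ORDER BY DESC(?studies)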

Bioinformaticians are responsible for finding useful signals within large sets of noisy data. More specifically, they use genomic data to find links between specific genotypes and diseases, and then screen drug data to identify therapeutic candidates. Before the Semantic Integration Project, bioinformaticians cleaned up data themselves, running Python scripts on CSV files from their individual workstations and storing the results in databases. Now they can fetch their data directly from the data lake using the linked data dictionary and pull it into R for analysis.

Now users can ask highly specific questions, such as the following (the last one is sketched as a query after the list):

  • Could a gene’s expression serve as a biomarker for whether a drug is producing the intended effect?
  • Are certain genetic conditions suitable for treatment with a given drug?
  • Which compounds produce a similar effect, or have been tested under similar conditions and treatments?
  • In which other chemical assays has a treatment or compound already been tested?
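
The last of these questions translates naturally into a graph query. A minimal sketch, again against an invented schema and with a hypothetical compound name:

    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX ex:   <http://example.org/rd/>

    # In which other chemical assays has compound X been tested?
    SELECT DISTINCT ?assay ?assayLabel
    WHERE {
      ?compound rdfs:label "compound-X" ;   # hypothetical compound name
                ex:testedIn ?assay .
      ?assay rdfs:label ?assayLabel .
    }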

Using virtualization to reduce management complexities and costs

A key reason Boehringer chose Stardog’s Enterprise Knowledge Graph platform is our graph-based virtualization capability, which connects data across silos without copying the data into Stardog. This allowed Boehringer to eliminate redundant data storage and reduce costs. They have large amounts of data already stored in relational databases, where statistics and aggregations are straightforward, but they also wanted to surface this data in a single, centralized place, available at data scientists’ fingertips.

By virtualizing, Boehringer works on live data and avoids ETL processes that are expensive in both storage and time. However, Stardog doesn’t force Boehringer to choose between virtualization and materialization; they can use whichever method best suits a particular scenario.
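
In Stardog, a relational source that has been mapped as a virtual graph can be queried together with materialized data in a single SPARQL query. The virtual graph name and schema below are assumptions for illustration, not Boehringer’s actual setup:

    PREFIX ex: <http://example.org/rd/>

    SELECT ?compound ?result
    WHERE {
      ?compound a ex:Compound .          # materialized in the Knowledge Graph
      GRAPH <virtual://assay_db> {       # evaluated live against the mapped RDBMS
        ?compound ex:assayResult ?result .
      }
    }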

A Knowledge Graph gives you the tooling you need to implement the four high-level principles of FAIR.

- Maksim Kolchin, Sr. Principal System Analyst, Boehringer Ingelheim

FAIR: Realizing the business value of data

FAIR is a set of best practices for scientific data management. FAIR stands for:

  • Findable - provisioning metadata so data can be easily discovered (a small example follows this list)
  • Accessible - embracing protocols that open up ways to access data
  • Interoperable - mapping data across sources so it can be used in workflows
  • Reusable - applying community standards so that data can be reproduced or combined in different settings
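
As a small, concrete taste of the “Findable” principle, standard vocabularies such as DCAT and Dublin Core can attach discovery metadata to a dataset. The vocabularies below are real; the dataset IRI and values are invented:

    PREFIX dcat:    <http://www.w3.org/ns/dcat#>
    PREFIX dcterms: <http://purl.org/dc/terms/>
    PREFIX ex:      <http://example.org/rd/>

    INSERT DATA {
      ex:study-0001 a dcat:Dataset ;
          dcterms:title   "Biomarker panel, study 0001" ;   # invented title
          dcterms:creator ex:lab-42 ;                       # invented creator
          dcterms:subject ex:type-2-diabetes ;              # link to domain terms
          dcat:distribution ex:study-0001-csv .             # how the data is accessed
    }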

For members of the Semantic Integration Project, achieving FAIR was a must. In the words of Head of IT Research & TMCP Digital Lab, Michael Pöttschacher: “Every company that is data-driven needs to go for FAIR data.” This is because, by making your data more usable, FAIR converts your data into a tangible business asset.

According to Boehringer, it is easier to understand what you’re talking about when you’re using a data model. This is because you can do things like draw entity relationship diagrams to discuss data management best practices with end users. In essence, FAIR-enabling semantic technologies help Boehringer take data discussions out of the pure IT realm and bring them into the business realm.

Boehringer’s R&D data is now accessible through standardized protocols and described in well-known terms that are familiar to the business, not just IT. Before, researchers needed to ask a colleague where data was located or consult a data catalog, which was only useful if they understood how the datasets were organized and how to integrate them to get the answers they needed. Now, Boehringer has one system where questions can be asked using a natural language interface.

The data models in Knowledge Graphs are a natural fit for data integration. And once the data is integrated, research does not need to be repeated; it is reusable. For example, regulatory analysts can now easily refer to answers to previous questions when regulators come back with additional comments on a filing.

Contact us to learn more about Stardog's solutions
