Conventional Data Integration is No Longer Sufficient

Mar 11, 2021, 6 minute read

The original premise of data integration was focused on the correct goal of unifying data, but its execution was fundamentally flawed in that it sought to consolidate all enterprise data into one physical place. After fifty years of data warehouses and other ETL-based solutions, it’s clear that their promise for fast, complete analysis has fallen short.

Data is distributed across hundreds of different places with incompatible schemas, maintained by disjointed departments. Today, the average company uses over 400 different data sources for analysis, and for many, their real needs span over 1,000 data sources. Attempting to conform all data to the same structure is not only time-intensive but also offers no flexibility for analysis. Unstructured data doesn't lend itself to a rigid tabular format and is often simply left out of analysis. Further, with the adoption of cloud storage, data is now spread across on-premises and cloud environments, and security protocols complicate physical consolidation.

Internal business practices have also hindered data integration projects. Many times, information is fragmented across internal departments without any standardized definitions or naming conventions. While this allows individual departments to perform their day-to-day tasks, it becomes extremely hard for an organization-wide initiative to reach across all departments to get a consistent, global view.

What are the challenges with data integration?

  • It takes too long: traditional data integration requires permanently transforming data to conform to a single structure and reconciling the different identifiers used for the same entity across data sources.
  • It creates more silos: copies upon copies of data lead to multiple versions, and it's unclear which one can be trusted. Multiple versions may also exist because individual departments use different naming conventions.
  • You can't make changes: relational integration tools cannot easily accommodate new data sources.
  • Unstructured data is an afterthought: you cannot integrate data sources with non-uniform or non-tabular structures.
  • Data quickly becomes out of date: data is loaded in batches rather than in real time, resulting in stale data that never reflects the current state of the business.
  • You don't trust the answers: results come out of a black box. There is no traceability from answers back to the original data sources and no record of any additional processing performed while moving the data. Without clear data lineage, metadata, and provenance, answers cannot be explained or trusted.

These challenges carry great costs. On average, data scientists spend 80% of their time cleaning and preparing data for analysis. And it's not just technical teams trying to access data; all employees are constantly searching for information throughout their day. An IDC survey found that organizations with 1,000 workers lose almost $6 million a year because employees spend 36% of each day searching for information. Half the time, they can't even find it.

Poor data management causes more than just increased operational costs. The investigation into the tragic 2003 Space Shuttle Columbia disaster found that poor data management contributed to the accident. The Columbia Accident Investigation Board report notes that "the Space Shuttle Program has a wealth of data tucked away in multiple databases without a convenient way to integrate and use the data for management, engineering, or safety decisions." Since then, NASA has revolutionized its data management practices and can now understand the impact of changes throughout a mission, from the earliest flight modeling stages to launch and beyond.

Modern data management requires flexibility

An inability to easily access your data hinders decision making. But it's not just about having access to the right information to make the initial decision. The ability to adapt and change direction quickly is core to any business. We operate in a world that is constantly changing: new regulations are introduced, new business opportunities are explored, unforeseen supply chain issues arise, and mergers and acquisitions are undertaken. Data management tools must enable flexibility.

Key characteristics of flexible data management tools:

  • Adaptable to new requirements. Data analysis should mirror the human process of discovery: as we learn, we evolve our hypotheses and make changes. Traditional data integration tools with relational data models force you to code all questions and inputs at the beginning of an analysis, leading to constant rework by data engineers. A more flexible data model organizes data based on the relationships between data points and does not require permanently transforming the data to create uniformity across all entities.
  • Brings meaning to data. In order to create business value, you need to be able to connect all the data that matters. Some of this data will be stored in tables, but also in PDFs, webpages, emails, and other semistructured and unstructured sources. Data management tools should be able to represent data that is natively stored in other structures and connect all relevant metadata and context.
  • Limitless access to data, regardless of location or structure. One of the biggest barriers to fast data analysis is the endless copying and transforming of data required by ETL-based solutions, which leads to slow, stale results. Flexible data tools must focus on connecting data, not collecting it. A modern approach includes data virtualization, which queries data in place, whether on-premises or in the cloud. No more copies of copies of data. In addition, you should be able to easily unify all data, regardless of type. Unstructured data can no longer be an afterthought stored separately, especially since 80% of data will be unstructured by 2025, according to IDC.
  • Modernizes existing investments. Some data must be stored apart from other data to comply with legal regulation or simply for legacy business reasons. Other data may be too essential to the business to bear the risk of consolidating, eliminating, or modernizing it. This is okay! Some data exists in silos for good reasons. But data management tools should leverage these silos and work alongside legacy data management investments.
  • Connects data at the compute, rather than the storage, layer. Modern data management tools are not static stores but queryable data layers. This querying needs to happen at the compute layer, above the actual storage layer, so as not to create yet another silo.
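The virtualization idea in the list above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not Stardog's API: two hypothetical sources (an in-memory SQL table standing in for a CRM database, and a list of JSON-style records standing in for a ticketing system) are left where they are and queried in place at request time, joined on a shared identifier, rather than copied into a warehouse first.

```python
import sqlite3

# Hypothetical source 1: customer records in a relational database.
crm = sqlite3.connect(":memory:")
crm.execute("CREATE TABLE customers (id TEXT, name TEXT)")
crm.execute("INSERT INTO customers VALUES ('c1', 'Acme Corp')")

# Hypothetical source 2: support tickets kept as JSON-style records.
tickets = [
    {"customer_id": "c1", "subject": "Login issue"},
    {"customer_id": "c2", "subject": "Billing question"},
]

def open_tickets_for(name):
    """Virtual layer: answer a question by querying each source in
    place at request time (no ETL copy), joining on the shared id."""
    row = crm.execute(
        "SELECT id FROM customers WHERE name = ?", (name,)
    ).fetchone()
    if row is None:
        return []
    return [t["subject"] for t in tickets if t["customer_id"] == row[0]]

print(open_tickets_for("Acme Corp"))  # ['Login issue']
```

Because the join happens at query time in the compute layer, an update to either source is visible in the very next answer, with no batch load in between; that is the freshness and traceability argument the bullets above are making.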

The next generation of data management: Data Fabric

A data fabric is the data management solution that meets the demands of the modern enterprise. Data fabrics weave together data from internal and external sources, creating a network of information that powers business applications, AI, and analytics.

Want to learn more about how a data fabric can resolve your data integration problems? Start by reading our Data Fabric whitepaper.
