Here are some questions to help you decide whether to virtualize or materialize your data.
1. What are the data sources and are they authoritative?
Understanding your data ecosystem and what sources are available is an important first step. You need to determine where the data is located within the enterprise and how it is structured (or not, in the case of unstructured documents). Regardless of the method used to access the data, you need to map the schemas of these data sources to the graph model and understand how the data sources relate to each other.
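To make the schema-mapping step concrete, here is a minimal sketch of turning one relational row into graph triples. The `BASE` namespace, the `row_to_triples` helper, and the sample `product` row are all hypothetical and chosen only for illustration; a real deployment would normally express this mapping in its platform's own mapping language rather than in application code.

```python
from typing import Any, Iterator

# Hypothetical namespace used only for this illustration.
BASE = "http://example.com/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def row_to_triples(
    table: str, key_column: str, row: dict[str, Any]
) -> Iterator[tuple[str, str, Any]]:
    """Yield (subject, predicate, object) triples for one relational row."""
    # The primary key becomes part of the subject identifier.
    subject = f"{BASE}{table}/{row[key_column]}"
    # The table name becomes the class of the node.
    yield (subject, RDF_TYPE, f"{BASE}{table.capitalize()}")
    # Every other non-null column becomes a property of the node.
    for column, value in row.items():
        if column == key_column or value is None:
            continue
        yield (subject, f"{BASE}{table}#{column}", value)


# Hypothetical sample row from a "product" table.
triples = list(
    row_to_triples(
        "product",
        "id",
        {"id": 42, "name": "Widget", "price": 9.99, "discontinued": None},
    )
)
```

The same mapping applies whether the triples are produced on demand (virtualization) or loaded ahead of time (materialization); only the moment at which it runs differs.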
2. How often is your data changing?
The rate at which a data source is updated can vary drastically based on use cases and requirements. Transactional systems change on an ongoing basis, sometimes at a very rapid rate on the order of seconds or milliseconds. Other systems update much less frequently, and their updates might be tied to an event such as a planned quarterly product catalog update. If the data changes infrequently, materializing it as a local graph and setting up a scheduled update procedure is enough to provide accurate and up-to-date information. For data sources that change continuously, virtualization provides the most up-to-date results.
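The reasoning above can be sketched as a simple rule of thumb. The function name `suggest_access_mode` and the comparison it makes are assumptions for illustration, not a rule from any product: it suggests materialization when the source changes no more often than the scheduled reload you can afford to run, and virtualization otherwise.

```python
from datetime import timedelta


def suggest_access_mode(change_interval: timedelta, refresh_interval: timedelta) -> str:
    """Illustrative heuristic only.

    change_interval:  typical time between changes in the source.
    refresh_interval: how often a scheduled reload of the local copy can run.
    """
    # If the source changes no faster than the reload schedule,
    # a materialized copy is rarely out of date.
    if change_interval >= refresh_interval:
        return "materialize"
    # Otherwise the local copy would lag behind, so query the source live.
    return "virtualize"


# A quarterly catalog with a nightly reload versus a transactional system.
catalog_mode = suggest_access_mode(timedelta(days=90), timedelta(days=1))
orders_mode = suggest_access_mode(timedelta(milliseconds=50), timedelta(days=1))
```

In practice this is only one input; the remaining questions in this list can override it.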
3. What are the end users’ expectations? How often is someone accessing the data?
It’s not only how fast the data is changing, but also how users or other applications are accessing the data. You’ll want to have a good understanding of the expectations and patterns of usage. Is this a business report that is generated once a week? Or is it a traditional web application where users are searching for results and expect millisecond turnaround times? Very low latency requirements might be harder to satisfy with virtual graphs, depending on the network latency between the knowledge graph and the data source as well as the additional workload placed on that data source. In these cases, Virtual Graph caching can be configured to provide low-latency results without significantly sacrificing data recency.
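The trade-off between latency and recency that caching makes can be illustrated with a generic time-to-live (TTL) cache. This is not Stardog's caching feature or its API; `TTLQueryCache`, its `fetch` callback, and the injectable `clock` are hypothetical names used only to show the idea that a result is reused until it is older than an agreed staleness limit.

```python
import time
from typing import Any, Callable


class TTLQueryCache:
    """Serve cached query results until they are older than ttl_seconds."""

    def __init__(
        self,
        fetch: Callable[[str], Any],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch      # runs the query against the remote source
        self._ttl = ttl_seconds  # maximum tolerated staleness
        self._clock = clock      # injectable so the cache can be tested
        self._entries: dict[str, tuple[float, Any]] = {}

    def query(self, text: str) -> Any:
        now = self._clock()
        entry = self._entries.get(text)
        # Fresh enough: answer locally with no remote round trip.
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]
        # Missing or stale: go to the source and remember the answer.
        result = self._fetch(text)
        self._entries[text] = (now, result)
        return result


# Usage with a fake source and a fake clock.
remote_calls: list[str] = []
fake_time = [0.0]


def fake_fetch(query_text: str) -> int:
    remote_calls.append(query_text)
    return len(remote_calls)


cache = TTLQueryCache(fake_fetch, ttl_seconds=60, clock=lambda: fake_time[0])
first = cache.query("q")    # goes to the source
second = cache.query("q")   # served from the cache
fake_time[0] = 61.0
third = cache.query("q")    # entry expired, goes to the source again
```

A shorter TTL keeps results closer to the live source at the cost of more remote queries; a longer TTL does the opposite.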
4. Is this data source accessed frequently?
There may be many different data sources connected to your knowledge graph, but not all of them will be accessed at the same rate by users. Most queries might touch a certain subset of the data sources, whereas the others might be accessed only occasionally. Query latency might not be a concern for less frequently accessed data sources, which makes virtualization feasible even if the network speed is not optimal.
5. What are the application data needs?
Some applications need to perform simple point lookups over a data source, which can be done very quickly without transferring a lot of data over the network. Other applications run more complex queries or need to retrieve large amounts of data, which might not be feasible at query time. Analytical and iterative algorithms, such as finding shortest paths in a highly connected graph, fall into that category. Materializing or caching the data in a local graph database would yield better performance in such cases.
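To see why iterative algorithms favor local data, consider a minimal breadth-first shortest-path sketch that counts how many neighbor lookups it performs. The `shortest_path` function, the `neighbors` callback, and the toy graph are hypothetical. With a materialized graph each lookup is a local read; against a virtualized source each one could become a separate remote query.

```python
from collections import deque
from typing import Callable, Iterable, Optional


def shortest_path(
    neighbors: Callable[[str], Iterable[str]], start: str, goal: str
) -> tuple[Optional[list[str]], int]:
    """Breadth-first search; returns (path or None, number of neighbor lookups)."""
    lookups = 0
    parents: dict[str, Optional[str]] = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk the parent links back to the start to rebuild the path.
            path = []
            current: Optional[str] = node
            while current is not None:
                path.append(current)
                current = parents[current]
            return path[::-1], lookups
        lookups += 1  # one data access per expanded node
        for nxt in neighbors(node):
            if nxt not in parents:
                parents[nxt] = node
                queue.append(nxt)
    return None, lookups


# Toy materialized graph held in memory as an adjacency list.
graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["e"], "e": []}
path, lookups = shortest_path(lambda n: graph.get(n, []), "a", "e")
```

Even on this five-node graph the search needs four lookups to find a three-hop path; on a highly connected graph the count grows quickly, which is where per-lookup network latency becomes the dominant cost.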
6. Are you connecting data behind a firewall or via a cloud instance? Do you need to enable a hybrid cloud infrastructure?
The data sources that are being pulled into your knowledge graph might be hosted in different parts of the organization behind firewalls or live in a cloud instance outside the organization. This might require additional setup to establish a secure and performant network connection between the virtualization layer and the data sources. You might use generic networking solutions or vendor-specific approaches like AWS Direct Connect, but if this kind of live data access cannot be established with your infrastructure, you would have to fall back to materialization.
7. Will your organization allow you to create live access to the data source?
There may be policies and procedures that limit access to certain data sources. Apart from security and authorization considerations, there might be strict quality-of-service (QoS) SLAs associated with a data source that prevent you from having direct access. You might also need to take cultural and organizational restrictions into account. What are your business processes for an application to gain access to a data source? Answering that question will help you determine whether virtualization is possible.
Learn more about how to use Stardog’s Virtual Graphs in our on-demand training webinar.