One Good Dog Deserves Another


Working with Datadog, we can improve and troubleshoot all aspects of Stardog performance.

Monitoring provides insight into current and historical performance. When operating any service, it is crucial to gather internal metrics on its resource consumption. This is certainly true for a knowledge graph. Having a real-time look at memory usage, CPU load, and network and disk activity allows the operator to reason about performance issues and hardware requirements. Further, it gives the Stardog development team an incredibly handy tool to debug race conditions and find performance bottlenecks.

We’ve recently contributed a plugin to the Datadog agent allowing data collection about a running Stardog instance. The resulting metrics are stored in Datadog and can be published to a monitoring dashboard.

Datadog is a powerful service for viewing metrics and events. It has many collectors for various services like AWS, Docker, and Cassandra. Now Stardog is one of them.

Here’s a typical dashboard for a small Stardog cluster:

In the dashboard above we can visualize a set of time series graphs showing the operation of a Stardog cluster. The graphs are synchronized, allowing us to correlate information and reason about possible events. For example, if we see a spike in CPU usage followed by a sharp reduction in memory usage, we might assume that a garbage collection event took place. As we start looking at interactions between the Stardog cluster and its ZooKeeper nodes, events get quite complicated. This type of dashboard provides a lot of information, and we can easily see changes in metrics.

Architecture

Datadog is a SaaS offering, so we log into a web console. From that console we can build dashboards. Each dashboard can be configured with a set of time series graphs, gauges, event timelines, alerts, and many other things. However, these graphs are only as useful as the data that they display. In order to gather data, we must put collectors in place. Some collectors, like the one for Amazon Web Services, are configured by giving Datadog authorization to poll their services. In contrast, Stardog requires an agent to be deployed on the host in order to gather data.

The default configuration of an agent will give you memory, CPU, IO, and other standard system metrics. While this information is generally very useful, we can also get Stardog-specific information using the extensions package.

The Datadog agent has a community-managed, open source extensions repository. Contributions are submitted as pull requests by vendors interested in integrating their products with Datadog. The Datadog team reviews and merges these contributions into their releases if they think they will be valuable.

Deployment

The best way to inject Stardog metrics into Datadog is by installing the Datadog agent on the same system as the Stardog server (or in the case of a cluster deployment on each node of the cluster). The agent will then poll each Stardog server for metric information via its REST API. By default the polling interval is 15 seconds. After the agent collects metrics, it forwards them up to Datadog SaaS collectors where they are made available to dashboards.
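Under the hood, each poll is just an authenticated HTTP GET against the server's admin API. A minimal sketch of building such a request is below; the `/admin/status` path and basic-auth handling are assumptions for illustration, and the plugin's actual code may differ:

```python
import base64
import urllib.request

def build_metrics_request(stardog_url, username, password):
    """Build an authenticated GET request for Stardog's metrics endpoint.

    The /admin/status path is an assumption for illustration; consult the
    Stardog admin docs for the exact metrics endpoint.
    """
    req = urllib.request.Request(stardog_url.rstrip("/") + "/admin/status")
    creds = "%s:%s" % (username, password)
    token = base64.b64encode(creds.encode("utf-8")).decode("ascii")
    req.add_header("Authorization", "Basic " + token)
    return req

req = build_metrics_request("http://localhost:5820", "admin", "admin")
print(req.full_url)  # http://localhost:5820/admin/status
```

The agent repeats a request like this on every polling interval and submits whatever numeric values come back.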

Installing The Datadog Agent

The Datadog agent can be installed in a variety of ways, including system package managers like yum and apt. It can also be installed with a custom script provided to you by Datadog when your account is created. Installation instructions can be found here.

Installing the Extensions Packages

Once the base package is installed, the extensions package needs to be installed alongside it. In the near future Datadog intends to provide system packages for installing the extra integrations, but at present the best way to do this is to pull the latest version from the GitHub repository and manually copy the needed files for each specific configuration. The following commands show how this is done for Stardog on a Linux system:

git clone https://github.com/stardog-union/integrations-extras.git
cp integrations-extras/stardog/check.py /etc/dd-agent/checks.d/stardog.py
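The check script just copied essentially walks the JSON returned by Stardog and turns nested keys into dotted, Datadog-style metric names before submitting each numeric value as a gauge. A rough sketch of that flattening step (illustrative only, not the plugin's actual code):

```python
def flatten_metrics(obj, prefix="stardog"):
    """Flatten nested JSON metrics into dotted, Datadog-style metric names."""
    flat = {}
    for key, value in obj.items():
        name = "%s.%s" % (prefix, key)
        if isinstance(value, dict):
            flat.update(flatten_metrics(value, name))
        elif isinstance(value, (int, float)):
            flat[name] = value  # only numeric leaves become gauges
    return flat

sample = {"dbms": {"memory": {"heap": {"used": 1024}}}}
print(flatten_metrics(sample))
# {'stardog.dbms.memory.heap.used': 1024}
```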

Once the collector script is in place it must be configured. This is done via the YAML file /etc/dd-agent/conf.d/stardog.yaml. A sample is below:

init_config:

instances:
    - stardog_url: http://localhost:5820
      username: admin
      password: admin

      tags:
      - backpressure
      - stardog-node-1
      - internal-testing
      - stardog

The first section under instances in the above example tells the Datadog agent where Stardog is and what credentials are needed to access it; admin access is required. The tags section can contain any list of strings; they are sent along with the metrics as metadata. We will use them later to filter results in a dashboard.
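In a cluster deployment, each node's agent typically gets its own stardog.yaml pointing at the local server. Alternatively, a single agent can poll several nodes by listing multiple entries under instances, which the Datadog agent runs independently. A sketch of such a configuration (hostnames are illustrative):

```yaml
init_config:

instances:
    - stardog_url: http://stardog-node-1:5820
      username: admin
      password: admin
      tags:
      - stardog-node-1
    - stardog_url: http://stardog-node-2:5820
      username: admin
      password: admin
      tags:
      - stardog-node-2
```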

The final step here is to start the Datadog agent:

/etc/init.d/datadog-agent start

The log for the Datadog agent can be found at /var/log/datadog/collector.log. Once it is running with the Stardog plugin, lines like the following should be displayed:

2018-02-09 17:49:06 UTC | INFO | dd.collector | daemon(daemon.py:234) | Starting
2018-02-09 17:49:06 UTC | INFO | dd.collector | config(config.py:1243) | initialized checks.d checks: ['stardog', 'disk', 'network', 'ntp']
2018-02-09 17:49:06 UTC | INFO | dd.collector | config(config.py:1244) | initialization failed checks.d checks: []
2018-02-09 17:49:10 UTC | INFO | dd.collector | checks.collector(collector.py:404) | Running check stardog

Configuring a Dashboard

Now that an agent is running and publishing metrics to Datadog, we can log in and create a dashboard to visualize our data. Log in to your Datadog account here. On the left-hand side, select Dashboards and then New Dashboard as shown below:

In the pop-up, give your new dashboard a name and select New TimeBoard. Next, drag a timeseries graph from the toolbar onto the dashboard. At this point you can select the metrics that you would like to visualize by filling in the value of the Get text-box.

The text-box labeled from lets the graph filter data by specific tags. This is where the tags from the stardog.yaml file discussed above come into play. In this way we can limit the graph to a specific Stardog node.
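A query built this way boils down to a string combining an aggregator, a metric name, and a tag filter. For example, a filter on one of the tags from the earlier configuration might look like the following (the metric name here is illustrative; pick a real one from Datadog's metric explorer):

```
avg:stardog.dbms.memory.heap.used{stardog-node-1}
```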

Conclusion

Stardog is a complex system and performance characteristics vary widely by workload. Interactions between components can bring about unforeseen situations which can be difficult to debug. Monitoring provides the foundational level of data required to reason about those interactions. This helps to increase availability and fix crucial issues at runtime. The metric visualization that Datadog provides via the Stardog plugin makes this task more attainable.
