Finding Fraud, Part Two

Paul Jackson

Aug 25, 2018, 8 minute read

Unify data at query time using Stardog to make fraud detection more dynamic.

This is part two of a series that explores how to use Stardog’s Enterprise Knowledge Graph platform to detect fraud in financial transactions. In this series we search for complex webs of relationships between people, companies, addresses and investments that suggest efforts to hide illicit activity.

In the first part of this series we explored the use of property paths and inference to sniff out fraud. In this post we consider how to do the same thing when some of the data resides in an external system and is accessed virtually.

All of the data for this post is available in the stardog-examples github repo. For this post we’ll modify the create_db.sh script so that some of the data is loaded into MySQL. Here are the modified lines:

# Uncomment these lines to import transactions into Stardog
#stardog-admin virtual import $DB transaction-mapping.ttl aml_dataset_transactions.csv

# Or, comment the prior line and uncomment these lines to load
# the transactions into MySQL
# Alternatively, you can use load_transactions.sql instead of
# load_transactions-mysql.sql
mysql -uadmin -padmin -hlocalhost stardog < load_transactions-mysql.sql
stardog-admin virtual add aml.properties transaction-mapping-sql.ttl

In practice, when working with data that lives in other data sources throughout the enterprise, a Stardog Knowledge Graph will often be a hybrid of replicated (that is, materialized) data and virtual mappings to external data. Several considerations can influence which approach to take, including:

How frequently the data changes (velocity)
The volume of data (especially in relationship to how often the data will be accessed by Stardog)
Security and other policy considerations
Whether the data will be used for localized, real-time (OLTP) queries or analytic queries (OLAP) that aggregate global data
The cost of adding additional querying capacity to the external system

A real Knowledge Graph system makes it practical to query data of any type or velocity, which includes data at rest or in motion and data that may or may not be replicated. Stardog’s Virtual Graph engine does exactly that.

Virtual Transaction Data

We’ll start by looking at a scenario where all the data is stored locally in Stardog except for the transactional data, which we’ll access virtually on demand. We’ll then consider alternatives that balance query performance, replication cost, and data freshness.

The schema for the dataset looks like this:

To adapt our queries from the first post to access the data remotely, we move the triple patterns containing an :originated, :beneficiary, :amount or :date predicate to inside of a graph clause. With this modification our final query from the last post becomes:

# Fraud score as Money * (Paths Count)^2
select ?org ?name1 ?ben ?name2 ?score {
    {
        # Count distinct paths
        select ?org ?name1 ?ben ?name2 ?s (count(*) as ?c) {
            {
                # Group by intermediaries
                select ?org ?name1 ?ben ?name2 ?t1 ?t2 ?s {
                    ?org :lastName ?lname1 ;
                        :firstName ?fname1 ;
                        :hasAffiliation* ?t1 .
                    ?t1 :hasBankAccount ?a1 .
                    ?t2 :hasBankAccount ?a2 ;
                        :hasAffiliation* ?ben .
                    ?ben :lastName ?lname2 ;
                        :firstName ?fname2 .
                    graph <virtual://aml> {
                        ?a1 :originated ?tx .
                        ?tx :beneficiary ?a2 .
                    }
                    {
                        # Find highest sum of Tx for all paths
                        # between ?org and ?ben
                        select ?org ?ben (sum(?m) as ?s) {
                            ?org a :Person ;
                                :hasAffiliation* ?t1 .
                            ?t1 :hasBankAccount ?a1 .
                            ?t2 :hasBankAccount ?a2 ;
                                :hasAffiliation* ?ben .
                            ?ben a :Person .
                            graph <virtual://aml> {
                                ?a1 :originated ?tx .
                                ?tx :beneficiary ?a2 ;
                                    :amount ?m .
                            }
                        }
                        group by ?org ?ben
                        order by desc(?s) ?org ?ben
                        limit 1000
                    }
                    bind(concat(?fname1, ' - ', ?lname1) as ?name1)
                    bind(concat(?fname2, ' - ', ?lname2) as ?name2)
                }
                group by ?org ?name1 ?ben ?name2 ?t1 ?t2 ?s
            }
        }
        group by ?org ?ben ?name1 ?name2 ?s
        order by desc(?s) ?org ?ben
    }
    # Calculate score as total money transacted times
    # the square of unique path count between actors
    bind(?s * ?c * ?c as ?score)
}
order by desc(?score) ?org ?ben

The key change is in the patterns that find transactions between pairs of originator and beneficiary.

In other words, this query fragment:

select ?org ?ben ?m {
    ?org a :Person ;
        :hasAffiliation* ?t1 .
    ?t1 :hasBankAccount ?a1 .
    ?a1 :originated ?tx .
    ?tx :beneficiary ?a2 ;
        :amount ?m .
    ?t2 :hasBankAccount ?a2 ;
        :hasAffiliation* ?ben .
    ?ben a :Person .
}

becomes this query fragment instead:

select ?org ?ben ?m {
    ?org a :Person ;
        :hasAffiliation* ?t1 .
    ?t1 :hasBankAccount ?a1 .
    ?t2 :hasBankAccount ?a2 ;
        :hasAffiliation* ?ben .
    ?ben a :Person .
    graph <virtual://aml> {
        ?a1 :originated ?tx .
        ?tx :beneficiary ?a2 ;
            :amount ?m .
    }
}

Looking at the query plan for the second query fragment we see this node that shows the SQL that Stardog formulated to access the transaction data:

VirtualGraphSql<virtual://aml> [#17084] {
      SELECT `tx_id`, `account1`, `account2`, `amount`
      FROM `transactions`
      WHERE `amount` IS NOT NULL AND `account2` IS NOT NULL
      AND `account1` IS NOT NULL
}

The SQL statement shows that all of the data is retrieved and aggregated locally. This works for our example dataset of 23k transactions but if the dataset grows large enough we will want to explore alternatives.

A Hybrid Approach

We initially chose to access all the transaction data virtually. The advantages of this approach include the following:

We’re able to develop the query quickly without having to first replicate data.
Query results reflect the latest state of the external data.
We can avoid the cost of replicating all the transaction data, especially when all we need is aggregated totals.

Now that we’re ready to operationalize the query we can explore a hybrid solution where we materialize only aggregated data. Periodic updates to the aggregated data can be scheduled for a convenient time. Further, we can retain the ability to query the latest transactions by combining the materialized historic data with virtualized access to transaction data limited to those records that occurred after the last import.

An import would look like this:

insert {
    ?a1 :originated [
        :beneficiary ?a2 ;
        :amount ?msum ;
        :date ?date
    ]
}
where {
    select * {
        graph <virtual://aml>
        {
            select ?a1 ?a2 (sum(?m) as ?msum) ('2017-08-31'^^xsd:date as ?date) {
                ?a1 :originated ?tx .
                ?tx :beneficiary ?a2 ;
                    :amount ?m ;
                    :date ?d .
                # Could save roll-up Tx for (say) each quarter,
                # or could merge into single Tx per account pair
                filter (?d < '2017-09-01'^^xsd:date)
            }
            group by ?a1 ?a2
        }
    }
}

This query aggregates all transactions that occurred before 2017-09-01 and stores the total amount as a single summary transaction with the date 2017-08-31.

To achieve hybrid querying we combine these materialized local results with virtualized data that’s constrained to only those transactions that occurred on or after 2017-09-01. A query that finds the sum of all transactions between all pairs of people looks like this:

select ?org ?ben (sum(?m) as ?s) {
    ?org a :Person ;
        :hasAffiliation ?t1 .
    ?t1 :hasBankAccount ?a1 .
    ?t2 :hasBankAccount ?a2 .
    ?ben a :Person ;
        :hasAffiliation ?t2 .
    {
        ?a1 :originated ?tx .
        ?tx :beneficiary ?a2 ;
            :amount ?m ;
            :date ?d
        filter (?d < '2017-09-01'^^xsd:date)
    }
    union
    {
        graph <virtual://aml> {
            ?a1 :originated ?tx .
            ?tx :beneficiary ?a2 ;
                :amount ?m ;
                :date ?d
            filter (?d >= '2017-09-01'^^xsd:date)
        }
    }
}
group by ?org ?ben
order by desc(?s)
limit 50

In this query we use the union operator to combine the results from the materialized and virtual sources.

The final query now looks like this after these changes:

select ?org ?name1 ?ben ?name2 ?score {
    {
        # Count distinct paths
        select ?org ?name1 ?ben ?name2 ?s (count(*) as ?c) {
            {
                # Group by intermediaries
                select ?org ?name1 ?ben ?name2 ?t1 ?t2 ?s {
                    ?org :lastName ?lname1 ;
                        :firstName ?fname1 ;
                        :hasAffiliation* ?t1 .
                    ?t1 :hasBankAccount ?a1 .
                    ?t2 :hasBankAccount ?a2 ;
                        :hasAffiliation* ?ben .
                    ?ben :lastName ?lname2 ;
                        :firstName ?fname2 .
                    {
                        ?a1 :originated ?tx .
                        ?tx :beneficiary ?a2 ;
                            :date ?d .
                        filter (?d < '2017-09-01'^^xsd:date)
                    }
                    union
                    {
                        graph <virtual://aml> {
                            ?a1 :originated ?tx .
                            ?tx :beneficiary ?a2 ;
                                :date ?d .
                            filter (?d >= '2017-09-01'^^xsd:date)
                        }
                    }
                    {
                        # Find highest sum of Tx for all paths between org and ben
                        select ?org ?ben (sum(?m) as ?s) {
                            ?org a :Person ;
                                :hasAffiliation* ?t1 .
                            ?t1 :hasBankAccount ?a1 .
                            ?t2 :hasBankAccount ?a2 ;
                                :hasAffiliation* ?ben .
                            ?ben a :Person .
                            {
                                ?a1 :originated ?tx .
                                ?tx :beneficiary ?a2 ;
                                    :amount ?m ;
                                    :date ?d
                                filter (?d < '2017-09-01'^^xsd:date)
                            }
                            union
                            {
                                graph <virtual://aml> {
                                    ?a1 :originated ?tx .
                                    ?tx :beneficiary ?a2 ;
                                        :amount ?m ;
                                        :date ?d
                                    filter (?d >= '2017-09-01'^^xsd:date)
                                }
                            }
                        }
                        group by ?org ?ben
                        order by desc(?s) ?org ?ben
                        limit 1000
                    }
                    bind(concat(?fname1, ' - ', ?lname1) as ?name1)
                    bind(concat(?fname2, ' - ', ?lname2) as ?name2)
                }
                group by ?org ?name1 ?ben ?name2 ?t1 ?t2 ?s
            }
        }
        group by ?org ?ben ?name1 ?name2 ?s
        order by desc(?s) ?org ?ben
    }
    # Calculate score as total money transacted times
    # the square of unique path count between actors
    bind(?s * ?c * ?c as ?score)
}
order by desc(?score) ?org ?ben

Being able to choose between a materialized, virtualized, or hybrid data strategy gives architects tremendous control and provides the flexibility to achieve both performance and up-to-the-minute recency. This illustrates the power of a hybrid approach that an Enterprise Knowledge Graph platform like Stardog can deliver.