Finding Fraud, Part Two
Get the latest in your inbox
Get the latest in your inbox
Unify data at query time using Stardog to make fraud detection more dynamic.
This is part two of a series that explores how to use Stardog’s Enterprise Knowledge Graph platform to detect fraud in financial transactions. In this series we search for complex webs of relationships between people, companies, addresses and investments that suggest efforts to hide illicit activity.
In the first part of this series we explored the use of property paths and inference to sniff out fraud. In this post we consider how to do the same thing when some of the data resides in an external system and is accessed virtually.
All of the data for this post is available in the
stardog-examples github repo.
For this post we’ll modify the create_db.sh
script so that some of the data is loaded into
MySQL. Here are the modified lines:
# Uncomment these lines to import transactions into Stardog
#stardog-admin virtual import $DB transaction-mapping.ttl aml_dataset_transactions.csv
# Or, comment the prior line and uncomment these lines to load
# the transactions into MySQL
# Alternatively, you can use load_transactions.sql instead of
# load_transactions-mysql.sql
mysql -uadmin -padmin -hlocalhost stardog < load_transactions-mysql.sql
stardog-admin virtual add aml.properties transaction-mapping-sql.ttl
In practice, when working with data that lives in other data sources throughout the enterprise, a Stardog Knowledge Graph will often be a hybrid of replicated (that is, materialized) data and virtual mappings to external data. Several considerations can influence which approach to take, including:
A real Knowledge Graph system makes it practical to query data of any type or velocity, which includes data at rest or in motion and data that may or may not be replicated. Stardog’s Virtual Graph engine does exactly that.
We’ll start by looking at a scenario where all the data is stored locally in Stardog except for the transactional data, which we’ll access virtually on demand. We’ll then consider alternatives that balance query performance, replication cost, and data freshness.
The schema for the dataset looks like this:
To adapt our queries from the first post to access the data remotely, we move the
triple patterns containing an :originated
, :beneficiary
, :amount
or :date
predicate
to inside of a graph
clause. With this modification our final query from the last
post becomes:
# Fraud score as Money * (Paths Count)^2
select ?org ?name1 ?ben ?name2 ?score {
{
# Count distinct paths
select ?org ?name1 ?ben ?name2 ?s (count(*) as ?c) {
{
# Group by intermediaries
select ?org ?name1 ?ben ?name2 ?t1 ?t2 ?s {
?org :lastName ?lname1 ;
:firstName ?fname1 ;
:hasAffiliation* ?t1 .
?t1 :hasBankAccount ?a1 .
?t2 :hasBankAccount ?a2 ;
:hasAffiliation* ?ben .
?ben :lastName ?lname2 ;
:firstName ?fname2 .
graph <virtual://aml> {
?a1 :originated ?tx .
?tx :beneficiary ?a2 .
}
{
# Find highest sum of Tx for all paths
# between ?org and ?ben
select ?org ?ben (sum(?m) as ?s) {
?org a :Person ;
:hasAffiliation* ?t1 .
?t1 :hasBankAccount ?a1 .
?t2 :hasBankAccount ?a2 ;
:hasAffiliation* ?ben .
?ben a :Person .
graph <virtual://aml> {
?a1 :originated ?tx .
?tx :beneficiary ?a2 ;
:amount ?m .
}
}
group by ?org ?ben
order by desc(?s) ?org ?ben
limit 1000
}
bind(concat(?fname1, ' - ', ?lname1) as ?name1)
bind(concat(?fname2, ' - ', ?lname2) as ?name2)
}
group by ?org ?name1 ?ben ?name2 ?t1 ?t2 ?s
}
}
group by ?org ?ben ?name1 ?name2 ?s
order by desc(?s) ?org ?ben
}
# Calculate score as total money transacted times
# the square of unique path count between actors
bind(?s * ?c * ?c as ?score)
}
order by desc(?score) ?org ?ben
The key change is in the patterns that find transactions between pairs of originator and beneficiary.
In other words, this query fragment:
select ?org ?ben ?m {
?org a :Person ;
:hasAffiliation* ?t1 .
?t1 :hasBankAccount ?a1 .
?a1 :originated ?tx .
?tx :beneficiary ?a2 ;
:amount ?m .
?t2 :hasBankAccount ?a2 ;
:hasAffiliation* ?ben .
?ben a :Person .
}
becomes this query fragment instead:
select ?org ?ben ?m {
?org a :Person ;
:hasAffiliation* ?t1 .
?t1 :hasBankAccount ?a1 .
?t2 :hasBankAccount ?a2 ;
:hasAffiliation* ?ben .
?ben a :Person .
graph <virtual://aml> {
?a1 :originated ?tx .
?tx :beneficiary ?a2 ;
:amount ?m .
}
}
Looking at the query plan for the second query fragment we see this node that shows the SQL that Stardog formulated to access the transaction data:
VirtualGraphSql<virtual://aml> [#17084] {
SELECT `tx_id`, `account1`, `account2`, `amount`
FROM `transactions`
WHERE `amount` IS NOT NULL AND `account2` IS NOT NULL
AND `account1` IS NOT NULL
}
The SQL statement shows that all of the data is retrieved and aggregated locally. This works for our example dataset of 23k transactions but if the dataset grows large enough we will want to explore alternatives.
We initially chose to access all the transaction data virtually. The advantages of this approach include the following:
Now that we’re ready to operationalize the query we can explore a hybrid solution where we materialize only aggregated data. Periodic updates to the aggregated data can be scheduled for a convenient time. Further, we can retain the ability to query the latest transactions by combining the materialized historic data with virtualized access to transaction data limited to those records that occurred after the last import.
An import would look like this:
insert {
?a1 :originated [
:beneficiary ?a2 ;
:amount ?msum ;
:date ?date
]
}
where {
select * {
graph <virtual://aml>
{
select ?a1 ?a2 (sum(?m) as ?msum) ('2017-08-31'^^xsd:date as ?date) {
?a1 :originated ?tx .
?tx :beneficiary ?a2 ;
:amount ?m ;
:date ?d .
# Could save roll-up Tx for (say) each quarter,
# or could merge into single Tx per account pair
filter (?d < '2017-09-01'^^xsd:date)
}
group by ?a1 ?a2
}
}
}
This query aggregates all transactions that occurred before 2017-09-01 and stores the total amount as a single summary transaction with the date 2017-08-31.
To achieve hybrid querying we combine these materialized local results with virtualized data that’s constrained to only those transactions that occurred on or after 2017-09-01. A query that finds the sum of all transactions between all pairs of people looks like this:
select ?org ?ben (sum(?m) as ?s) {
?org a :Person ;
:hasAffiliation ?t1 .
?t1 :hasBankAccount ?a1 .
?t2 :hasBankAccount ?a2 .
?ben a :Person ;
:hasAffiliation ?t2 .
{
?a1 :originated ?tx .
?tx :beneficiary ?a2 ;
:amount ?m ;
:date ?d
filter (?d < '2017-09-01'^^xsd:date)
}
union
{
graph <virtual://aml> {
?a1 :originated ?tx .
?tx :beneficiary ?a2 ;
:amount ?m ;
:date ?d
filter (?d >= '2017-09-01'^^xsd:date)
}
}
}
group by ?org ?ben
order by desc(?s)
limit 50
In this query we use the union
operator to combine the results from the materialized and
virtual sources.
The final query now looks like this after these changes:
select ?org ?name1 ?ben ?name2 ?score {
{
# Count distinct paths
select ?org ?name1 ?ben ?name2 ?s (count(*) as ?c) {
{
# Group by intermediaries
select ?org ?name1 ?ben ?name2 ?t1 ?t2 ?s {
?org :lastName ?lname1 ;
:firstName ?fname1 ;
:hasAffiliation* ?t1 .
?t1 :hasBankAccount ?a1 .
?t2 :hasBankAccount ?a2 ;
:hasAffiliation* ?ben .
?ben :lastName ?lname2 ;
:firstName ?fname2 .
{
?a1 :originated ?tx .
?tx :beneficiary ?a2 ;
:date ?d .
filter (?d < '2017-09-01'^^xsd:date)
}
union
{
graph <virtual://aml> {
?a1 :originated ?tx .
?tx :beneficiary ?a2 ;
:date ?d .
filter (?d >= '2017-09-01'^^xsd:date)
}
}
{
# Find highest sum of Tx for all paths between org and ben
select ?org ?ben (sum(?m) as ?s) {
?org a :Person ;
:hasAffiliation* ?t1 .
?t1 :hasBankAccount ?a1 .
?t2 :hasBankAccount ?a2 ;
:hasAffiliation* ?ben .
?ben a :Person .
{
?a1 :originated ?tx .
?tx :beneficiary ?a2 ;
:amount ?m ;
:date ?d
filter (?d < '2017-09-01'^^xsd:date)
}
union
{
graph <virtual://aml> {
?a1 :originated ?tx .
?tx :beneficiary ?a2 ;
:amount ?m ;
:date ?d
filter (?d >= '2017-09-01'^^xsd:date)
}
}
}
group by ?org ?ben
order by desc(?s) ?org ?ben
limit 1000
}
bind(concat(?fname1, ' - ', ?lname1) as ?name1)
bind(concat(?fname2, ' - ', ?lname2) as ?name2)
}
group by ?org ?name1 ?ben ?name2 ?t1 ?t2 ?s
}
}
group by ?org ?ben ?name1 ?name2 ?s
order by desc(?s) ?org ?ben
}
# Calculate score as total money transacted times
# the square of unique path count between actors
bind(?s * ?c * ?c as ?score)
}
order by desc(?score) ?org ?ben
Being able to choose between a materialized, virtualized, or hybrid data strategy gives architects tremendous control and provides the flexibility to achieve both performance and up-to-the-minute recency. This illustrates the power of a hybrid approach that an Enterprise Knowledge Graph platform like Stardog can deliver.
Learn more about the solutions Stardog delivers for Financial Services clients, from operational risk to risk management. Contact our team to learn more or get started building today.
How to Overcome a Major Enterprise Liability and Unleash Massive Potential
Download for free