Similarity Search
Learn how to find similar items in the Knowledge Graph with machine learning.
Whenever we showcase Stardog’s machine learning capabilities, customers ask for the ability to find similar items in the Knowledge Graph. This is useful for many tasks, such as generating recommendations or finding near duplicates.
So in Stardog 5.3.3 we introduced a new type of machine learning model that supports search and retrieval of similar items in an efficient and scalable way. This is a general problem: you’re searching in a graph of nodes that represent real-world objects and the main thing you want to consider is similarity between pairs of objects. The motivating reasons you’d be doing this are varied; maybe you’re building a recommendation system or looking at data lineage or debugging problems in some business process where a problem in one object may also occur in similar objects.
To get into the details without getting bogged down, let’s explore a specific example, using the movie dataset.
Similarity search follows the same syntax and pipeline as our other machine learning models. First, you need to create a model, which holds the set of items available for search. The spa:arguments property receives the features used for similarity calculation, while spa:predict contains the identifier of the item.
prefix : <http://schema.org/>
prefix spa: <tag:stardog:api:analytics:>
INSERT {
    graph spa:model {
        :simModel a spa:SimilarityModel ;
            spa:arguments (?genres ?directors ?writers ?producers ?metaCritic) ;
            spa:predict ?movie .
    }
}
WHERE {
    SELECT
        (spa:set(?genre) as ?genres)
        (spa:set(?director) as ?directors)
        (spa:set(?writer) as ?writers)
        (spa:set(?producer) as ?producers)
        ?metaCritic
        ?movie
    {
        ?movie :genre ?genre ;
            :director ?director ;
            :author ?writer ;
            :productionCompany ?producer ;
            :metaCritic ?metaCritic .
    }
    GROUP BY ?movie ?metaCritic
}
Here, we are creating a SimilarityModel named :simModel, which takes as input the genres, directors, writers, producers and MetaCritic score for all movies in the dataset.
Using this model it’s pretty easy to find similar movies. We select a movie and its properties and pass them as input to the model. The number of similar items to return is controlled by the spa:limit property given in spa:parameters.
prefix : <http://schema.org/>
prefix t: <http://www.imdb.com/title/>
prefix spa: <tag:stardog:api:analytics:>
SELECT ?similarMovieLabel ?confidence
WHERE {
    graph spa:model {
        :simModel spa:arguments (?genres ?directors ?writers ?producers ?metaCritic) ;
            spa:confidence ?confidence ;
            spa:parameters [ spa:limit 5 ] ;
            spa:predict ?similarMovie .
    }
    { ?similarMovie rdfs:label ?similarMovieLabel }
    {
        SELECT
            (spa:set(?genre) as ?genres)
            (spa:set(?director) as ?directors)
            (spa:set(?writer) as ?writers)
            (spa:set(?producer) as ?producers)
            ?metaCritic
            ?movie
        {
            ?movie :genre ?genre ;
                :director ?director ;
                :author ?writer ;
                :productionCompany ?producer ;
                :metaCritic ?metaCritic .
            VALUES ?movie { t:tt0118715 } # The Big Lebowski
        }
        GROUP BY ?movie ?metaCritic
    }
}
ORDER BY DESC(?confidence)
This query finds the five movies most similar to The Big Lebowski, along with their similarity scores, based on the features given through spa:arguments.
similarMovieLabel | confidence |
---|---|
The Big Lebowski | 0.9999999999999998 |
Fargo | 0.9996443676337468 |
Blood Simple | 0.9996332068990889 |
The Man Who Wasn’t There | 0.9996019945613324 |
Barton Fink | 0.9995802728226650 |
As expected, the most similar item is the movie itself, followed by other movies from the inimitable Coen Brothers.
Just like other models, similarity search features can have any datatype: numbers, strings, sets, etc. The best representation for those features is automatically taken into account by Stardog when it calculates a similarity score.
Items and their features are vectorized using feature hashing, the same technique used by our classification and regression models. These vectors are saved in a search index built with cluster pruning, an approximate search algorithm that groups items by similarity in order to speed up queries.
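To make the idea of feature hashing concrete, here is a minimal Python sketch. It is not Stardog's implementation: the function name hash_features, the 64-dimension default, the use of MD5 and the signed-bucket scheme are all assumptions chosen for illustration. It only shows how sets, strings and numbers can end up in one fixed-length vector.

```python
import hashlib


def hash_features(features, dim=64):
    """Map a dict of named features to a fixed-length vector (the hashing trick).

    Illustrative only: names, dimension and hash function are assumptions.
    """
    vec = [0.0] * dim
    for name, value in features.items():
        if isinstance(value, (set, frozenset, list, tuple)):
            # Each member of a set-valued feature becomes its own token, weight 1.
            tokens = [(f"{name}={member}", 1.0) for member in value]
        elif isinstance(value, (int, float)):
            # A numeric feature keeps its magnitude under a single token.
            tokens = [(name, float(value))]
        else:
            # Anything else is treated as one categorical token.
            tokens = [(f"{name}={value}", 1.0)]
        for token, weight in tokens:
            digest = hashlib.md5(token.encode("utf-8")).digest()
            index = int.from_bytes(digest[:4], "big") % dim  # bucket for this token
            sign = 1.0 if digest[4] % 2 == 0 else -1.0       # sign softens collisions
            vec[index] += sign * weight
    return vec


# Hypothetical feature values, made up for the example.
example = hash_features({
    "genres": {"Comedy", "Crime"},
    "directors": {"Joel Coen"},
    "metaCritic": 0.7,
})
```

Because the bucket index depends only on the token, two items that share a genre or a director put weight in the same position, which is what makes their vectors comparable later.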
The index is used to find the vectors with the largest cosine similarity, which is the score given by spa:confidence.
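The sketch below shows, in Python, how cluster pruning and cosine similarity can fit together. It is a textbook-style simplification under stated assumptions, not Stardog's index: the function names, the choice of roughly sqrt(N) random leaders and the single-cluster lookup are all hypothetical.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def build_index(vectors, seed=0):
    """Pick about sqrt(N) random leaders and attach every item to its closest leader."""
    rng = random.Random(seed)
    ids = sorted(vectors)
    leaders = rng.sample(ids, max(1, math.isqrt(len(ids))))
    clusters = {leader: [] for leader in leaders}
    for item in ids:
        best = max(leaders, key=lambda leader: cosine(vectors[item], vectors[leader]))
        clusters[best].append(item)
    return clusters


def search(vectors, clusters, query, limit=5):
    """Score only the followers of the leader most similar to the query."""
    leader = max(clusters, key=lambda l: cosine(query, vectors[l]))
    scored = [(item, cosine(query, vectors[item])) for item in clusters[leader]]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:limit]


# Toy two-dimensional vectors, made up for the example.
vectors = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0], "d": [0.1, 0.9]}
clusters = build_index(vectors)
results = search(vectors, clusters, [1.0, 0.0], limit=2)
```

Only one cluster is scanned per query, which is where the speed-up comes from. It is also why the search is approximate: a true neighbour attached to a different leader is missed, and that is the kind of speed versus recall trade-off tuning parameters typically control.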
The Stardog docs describe advanced parameters which can be used to increase query performance and recall.
We are exploring other ways of representing items as vectors, such as knowledge graph embeddings and predication-based semantic indexing, while improving the techniques underlying the search index itself. Stay tuned for updates.