Similarity Search

Pedro Oliveira

Aug 7, 2018, 4 minute read

Learn how to find similar items in the Knowledge Graph with machine learning.

Whenever we showcase Stardog’s machine learning capabilities, customers ask for the ability to find similar items in the Knowledge Graph. This is useful for many tasks, such as generating recommendations or finding near duplicates.

So in Stardog 5.3.3 we introduced a new type of machine learning model that supports efficient, scalable search and retrieval of similar items. The problem is a general one: you’re searching a graph whose nodes represent real-world objects, and what matters most is the similarity between pairs of objects. The reasons for doing this vary: you might be building a recommendation system, tracing data lineage, or debugging a business process in which a problem in one object may also occur in similar objects.

Similarity Search

To get into the details without getting bogged down, let’s explore a specific example, using the movie dataset.

Similarity search follows the same syntax and pipeline as our other machine learning models. First, you need to create a model, which holds the set of items available for search. The spa:arguments property receives the features used for similarity calculation, while spa:predict contains the identifier of the item.

prefix : <http://schema.org/>
prefix spa: <tag:stardog:api:analytics:>

INSERT {
    graph spa:model {
        :simModel a spa:SimilarityModel ;
                  spa:arguments (?genres ?directors ?writers ?producers ?metaCritic) ;
                  spa:predict ?movie .
    }
}
WHERE {
    SELECT
    (spa:set(?genre) as ?genres)
    (spa:set(?director) as ?directors)
    (spa:set(?writer) as ?writers)
    (spa:set(?producer) as ?producers)
    ?metaCritic
    ?movie
    {
        ?movie  :genre ?genre ;
                :director ?director ;
                :author ?writer ;
                :productionCompany ?producer ;
                :metaCritic ?metaCritic .
    }
    GROUP BY ?movie ?metaCritic
}

Here, we are creating a SimilarityModel named :simModel which takes as input the genres, directors, writers, producers and MetaCritic score for all movies in the dataset.

Using this model, it’s pretty easy to find similar movies. We select a movie along with its properties and pass them as input to the model. The number of similar items to return is controlled by the spa:limit property given in spa:parameters.

prefix : <http://schema.org/>
prefix t: <http://www.imdb.com/title/>
prefix spa: <tag:stardog:api:analytics:>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?similarMovieLabel ?confidence
WHERE {
    graph spa:model {
      :simModel spa:arguments (?genres ?directors ?writers ?producers ?metaCritic) ;
                spa:confidence ?confidence ;
                spa:parameters [ spa:limit 5 ] ;
                spa:predict ?similarMovie .
    }

    { ?similarMovie rdfs:label ?similarMovieLabel }

    {
        SELECT
        (spa:set(?genre) as ?genres)
        (spa:set(?director) as ?directors)
        (spa:set(?writer) as ?writers)
        (spa:set(?producer) as ?producers)
        ?metaCritic
        ?movie
        {
            ?movie  :genre ?genre ;
                    :director ?director ;
                    :author ?writer ;
                    :productionCompany ?producer ;
                    :metaCritic ?metaCritic .

            VALUES ?movie { t:tt0118715 } # The Big Lebowski
        }
        GROUP BY ?movie ?metaCritic
    }
}

ORDER BY DESC(?confidence)

This query returns the five movies most similar to The Big Lebowski, along with their similarity scores, based on the features given through spa:arguments.

similarMovieLabel	confidence
The Big Lebowski	0.9999999999999998
Fargo	0.9996443676337468
Blood Simple	0.9996332068990889
The Man Who Wasn’t There	0.9996019945613324
Barton Fink	0.9995802728226650

As expected, the most similar item is the movie itself, followed by other movies from the inimitable Coen Brothers.

As with our other models, the features used for similarity search can be of any datatype: numbers, strings, sets, etc. Stardog automatically picks the best representation for each feature when it calculates a similarity score.

Under the Hood

Items and their features are vectorized using feature hashing, the same technique used by our classification and regression models. These vectors are saved in a search index built with cluster pruning, an approximate search algorithm that groups items by similarity to speed up queries.
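To make the idea of feature hashing concrete, here is a minimal Python sketch. It is not Stardog’s implementation: the vector size DIM, the md5-based bucket helper and the hash_features function are all assumptions made for illustration. Each feature value is hashed to a slot in a fixed-size vector, so items with any number of genres or directors end up with vectors of the same length.

```python
import hashlib
from typing import Dict, List, Tuple

DIM = 64  # hypothetical vector size, chosen only for this sketch


def bucket(token: str, dim: int = DIM) -> Tuple[int, float]:
    """Map a token to a (slot, sign) pair using a stable hash."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    slot = int.from_bytes(digest[:4], "big") % dim
    sign = 1.0 if digest[4] % 2 == 0 else -1.0
    return slot, sign


def hash_features(features: Dict[str, object], dim: int = DIM) -> List[float]:
    """Turn named features (sets, strings, numbers) into one fixed-size vector."""
    vec = [0.0] * dim
    for name, value in features.items():
        if isinstance(value, (set, frozenset, list, tuple)):
            # Each member of a set becomes its own categorical token.
            for member in value:
                slot, sign = bucket(f"{name}={member}", dim)
                vec[slot] += sign
        elif isinstance(value, (int, float)):
            # A number keeps its magnitude in the slot chosen by its name.
            slot, sign = bucket(name, dim)
            vec[slot] += sign * float(value)
        else:
            # Any other value is treated as a single categorical token.
            slot, sign = bucket(f"{name}={value}", dim)
            vec[slot] += sign
    return vec
```

Calling hash_features({"genres": {"Comedy", "Crime"}, "metaCritic": 71}) yields a 64-element vector. Two different tokens can land in the same slot; the random sign is a common way to keep such collisions from systematically inflating similarity.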

The index is used to find the vectors with the largest cosine similarity to the query vector; that similarity is the score returned through spa:confidence.
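The following Python sketch shows how cosine similarity and a cluster pruning index can fit together. It is a simplified illustration under our own assumptions, not Stardog’s code: the ClusterPruningIndex class, its seed argument and the choice of roughly sqrt(n) randomly picked leaders are all hypothetical.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class ClusterPruningIndex:
    """Toy index: pick random leaders, attach every item to its nearest leader."""

    def __init__(self, items, seed=0):
        # items: dict mapping an item id to its vector
        rng = random.Random(seed)
        ids = sorted(items)
        n_leaders = max(1, round(math.sqrt(len(ids))))
        self.items = items
        self.leaders = rng.sample(ids, n_leaders)
        self.followers = {leader: [] for leader in self.leaders}
        for item_id in ids:
            nearest = max(self.leaders,
                          key=lambda l: cosine(items[item_id], items[l]))
            self.followers[nearest].append(item_id)

    def search(self, query, limit=5):
        # Compare the query with the leaders only, then scan a single cluster.
        leader = max(self.leaders, key=lambda l: cosine(query, self.items[l]))
        scored = [(cosine(query, self.items[i]), i) for i in self.followers[leader]]
        scored.sort(reverse=True)
        return [(item_id, score) for score, item_id in scored[:limit]]
```

Because only one cluster is scanned, a query touches far fewer vectors than a full scan would, at the cost of occasionally missing a close match that was assigned to a different leader. That is the trade-off between query performance and recall. Floating-point rounding in the cosine computation is also one plausible reason an item compared with itself can score slightly below 1, as in the results table above.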

The Stardog docs describe advanced parameters which can be used to increase query performance and recall.

Future Work

We are exploring other ways of representing items as vectors, such as knowledge graph embeddings and predication-based semantic indexing, while improving the techniques underlying the search index itself. Stay tuned for updates.