When you use the /match API to implement a screening process, each query will be answered with a list of result entities from the database. Each result entity will be assigned a score, reflecting the similarity of the entity to the data provided in the query.
Different users of the /match API have different objectives: some might want to use queries consisting of only a single name to match against a specific set of sanctions lists, while other users perform PEP checks as part of a know-your-customer process, in which they've already generated complex profiles featuring birth dates, country info, tax numbers, etc.
These different scenarios can benefit from fine-tuning the ways in which result scores in the API are computed and applied. Below we describe three different methods for doing so.
We sometimes wish that identifying people and companies across databases were a binary problem, a simple "yes" or "no". However, unclean and partial data, as well as many regulatory requirements to perform "fuzzy matching", make this impossible. That's why the score returned by /match is a floating point number in the range 0.0 (not similar at all) to 1.0 (perfect match). In order to decide which results actually require an alert or review, we must set a threshold value, above which a result will be tagged as a match.
The API enables you to modify this threshold as a query parameter. Choosing a higher threshold will reduce the number of false positives at the price of also increasing the number of false negatives: you'll see less noise, but also miss more possible alerts. Setting threshold to 1.0 will return only perfect matches, while setting it to 0.0 would match any two entities.
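To make the trade-off concrete, here is a minimal sketch that counts how many results would be tagged as a match at different threshold values. The scores are invented for illustration and are not real API output, and the comparison rule (`score >= threshold`) is an assumption based on the statement that a threshold of 1.0 still returns perfect matches.

```python
# Illustrative scores only; these are not real API output.
scores = [0.95, 0.82, 0.71, 0.64, 0.40]


def tag_matches(scores, threshold):
    """Return the scores tagged as matches (assumed rule: score >= threshold)."""
    return [s for s in scores if s >= threshold]


# Raising the threshold shrinks the set of results that need review.
for threshold in (0.0, 0.6, 0.85, 1.0):
    flagged = tag_matches(scores, threshold)
    print(f"threshold={threshold:.2f}: {len(flagged)} result(s) tagged as match")
```

With these sample scores, every result is flagged at 0.0 and none at 1.0, since no score in the list is a perfect match.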
With the default logic-v1 scoring system, the range of plausible threshold values is roughly 0.6 to 0.85.
By default, the API will also return some extra results that have not been tagged as match (i.e. with a score below threshold). This behaviour can be changed using the cutoff query parameter. Setting it to the same value as threshold will return only result entities that are matches.
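The interplay of cutoff and threshold can be sketched as a small filter. This is an illustration of the behaviour described above, not the server's actual code: the `filter_results` helper, the sample results, and the exact boundary comparisons are assumptions.

```python
def filter_results(results, threshold, cutoff):
    """Drop results scoring below cutoff; flag the rest against threshold."""
    kept = []
    for result in results:
        if result["score"] < cutoff:
            continue  # too dissimilar to be returned at all
        kept.append({**result, "match": result["score"] >= threshold})
    return kept


# Invented sample data for illustration.
results = [
    {"id": "a", "score": 0.91},
    {"id": "b", "score": 0.66},
    {"id": "c", "score": 0.30},
]

# A cutoff below the threshold keeps some non-matching results for context.
with_extras = filter_results(results, threshold=0.7, cutoff=0.5)

# A cutoff equal to the threshold keeps only results tagged as matches.
matches_only = filter_results(results, threshold=0.7, cutoff=0.7)
```

In the first call, result "b" is returned but not tagged as a match; in the second call it is dropped entirely.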
The manner in which the /match API computes the score value is designed to be transparent and somewhat configurable. The score is tallied by computing a set of features that compare the query and matching entity in different ways: one feature might compare the names of persons using a phonetic matching technique or a fuzzy string matching algorithm, another might compare their nationality, their date of birth, or their address. You can see a list of the supported features in the matcher documentation.
NB. Some features are only enabled for specific entity types. This is why scoring will work better if the query states the type of the entity (Person, Company, Vessel) precisely.
In order to aggregate the features into a single score, each of them is assigned a weight or counter-weight. For example, if the query and result have a perfect name match, a name matching feature (name_literal_match) would yield a score of 1.0. If both records also include a date of birth, however, a mismatch between these dates would be detected by another feature (dob_day_disjoint) and lead to a significant reduction of the score.
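The idea of weights and counter-weights can be illustrated with a deliberately simplified aggregation. This is not the actual logic-v1 formula: the clamped weighted sum and the weight values below are assumptions chosen only to show how a negative weight pulls the score down.

```python
def aggregate(features, weights):
    """Combine feature values into one score (simplified, hypothetical rule)."""
    total = sum(weights.get(name, 0.0) * value for name, value in features.items())
    # Keep the result inside the documented 0.0 to 1.0 range.
    return max(0.0, min(1.0, total))


# Illustrative weights; the real defaults are listed in the matcher documentation.
weights = {"name_literal_match": 1.0, "dob_day_disjoint": -0.2}

name_only = aggregate({"name_literal_match": 1.0}, weights)
name_with_dob_conflict = aggregate(
    {"name_literal_match": 1.0, "dob_day_disjoint": 1.0}, weights
)

print(round(name_only, 2), round(name_with_dob_conflict, 2))
```

Under these assumed weights, a perfect name match on its own scores higher than the same name match paired with conflicting birth dates.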
Custom weights can be included as part of the request body for the /match API like this:
{
  "weights": {
    "name_soundex_match": 0.8,
    "person_name_phonetic_match": 0.0,
    "country_mismatch": -1.0
  },
  "queries": {}
}
This configuration will enable the name_soundex_match feature (disabled by default) and disable person_name_phonetic_match. It will also reconfigure the weight of a country mismatch between two entities to be a deciding factor in scoring.
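A request body like the one above can be assembled programmatically. The sketch below only builds the JSON and does not send it. The `build_match_body` helper is hypothetical, and the shape of the example query (a `q1` key with `schema` and `properties`) is an assumption that should be checked against the API documentation.

```python
import json


def build_match_body(queries, weights=None):
    """Serialize queries and optional custom weights into a JSON request body."""
    body = {"queries": queries}
    if weights:
        body["weights"] = weights
    return json.dumps(body, indent=2)


# Weight overrides taken from the example above.
custom_weights = {
    "name_soundex_match": 0.8,
    "person_name_phonetic_match": 0.0,
    "country_mismatch": -1.0,
}

# Hypothetical query; the key name and property layout are assumptions.
queries = {
    "q1": {
        "schema": "Person",
        "properties": {"name": ["Jane Doe"], "birthDate": ["1970-01-01"]},
    }
}

payload = build_match_body(queries, custom_weights)
print(payload)
```

Omitting the `weights` argument leaves the key out of the body, so the server's default weights would apply.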
The above examples assume that you're using the API's default scoring system called logic-v1. Several other algorithms are also available and can be chosen using the algorithm= query parameter.
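As a rough sketch, the three query parameters discussed on this page (algorithm, threshold and cutoff) can be combined into a request URL as follows. The `match_url` helper is hypothetical, and the host name and the `/match/default` path are placeholders rather than a documented endpoint.

```python
from urllib.parse import urlencode


def match_url(base_url, algorithm="logic-v1", threshold=None, cutoff=None):
    """Append the scoring-related query parameters to a /match endpoint URL."""
    params = {"algorithm": algorithm}
    if threshold is not None:
        params["threshold"] = threshold
    if cutoff is not None:
        params["cutoff"] = cutoff
    return f"{base_url}?{urlencode(params)}"


# Placeholder host and path; substitute your own deployment or hosted endpoint.
url = match_url("https://yente.example.org/match/default", threshold=0.7, cutoff=0.7)
print(url)
```

Parameters left as `None` are not added to the URL, so the server-side defaults would be used for them.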
Given the open source nature of yente, on-premise deployments of the software could even consider implementing a custom algorithm, or modifying the code used to compute individual features for a specific use case.