Both the hosted OpenSanctions API and the yente open source application provide a selection of different mechanisms that can be used to score result matches.
When you send a match query to the API, it will be processed in two stages: first, a search index is used to locate possible candidate results. This process is meant to optimise for recall, i.e. find a broad selection of result candidates. In a second stage, these candidates are scored against the query that has been provided by the API consumer. The following URL query parameters are used to configure that process:
- The algorithm query parameter is used to select a scoring algorithm. The different algorithms are described below, but you can also retrieve metadata about them programmatically using the /algorithms endpoint. Set it to best to use the highest-quality algorithm available at any time. This will produce good results, but may mean that specific scores for matched entities change significantly over time.
- threshold is defined as the numeric score limit above which a result should be considered a match. This parameter may need to be adapted in conjunction with the algorithm to avoid producing too many false positive matches.
- cutoff describes the lower bound of the result score that should still be returned by the API. Lower this parameter to see more candidates that have been down-ranked by the scoring system.
- limit gives the maximum number of matches returned. The OpenSanctions dataset is de-duplicated, so there can usually only be one matching record for each query. Returning a large number of results therefore makes less sense than it would in a full-text search.
- fuzzy is a boolean flag that can be used to disable fuzzy matching in the candidate finding stage. Fuzzy candidate finding has proven to be largely ineffective compared to other techniques (e.g. the search for phonetic and normalised forms of the names), so we recommend disabling it.

Recommended default: ?algorithm=best&fuzzy=false
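As a rough illustration of how these parameters combine, the sketch below assembles a match URL from them. The base URL and the helper name build_match_url are assumptions made for this example (a self-hosted yente instance would use its own host); only the parameter names and the /match/default and /match/sanctions paths come from this page.

```python
from urllib.parse import urlencode

# Assumption: base URL of the hosted API. Replace with your own yente host.
BASE_URL = "https://api.opensanctions.org"


def build_match_url(dataset="default", algorithm="best", fuzzy=False,
                    threshold=None, cutoff=None, limit=None):
    """Hypothetical helper: build a /match URL from the scoring parameters.

    Parameters left as None are omitted, so the server-side defaults apply.
    """
    params = {"algorithm": algorithm, "fuzzy": "true" if fuzzy else "false"}
    optional = (("threshold", threshold), ("cutoff", cutoff), ("limit", limit))
    for name, value in optional:
        if value is not None:
            params[name] = value
    return f"{BASE_URL}/match/{dataset}?{urlencode(params)}"


# Recommended default: best algorithm, fuzzy candidate finding disabled.
print(build_match_url())
# https://api.opensanctions.org/match/default?algorithm=best&fuzzy=false

# Pinning an algorithm together with a threshold chosen for it.
print(build_match_url(dataset="sanctions", algorithm="regression-v2", threshold=0.5))
# https://api.opensanctions.org/match/sanctions?algorithm=regression-v2&fuzzy=false&threshold=0.5
```

Pinning a named algorithm instead of best trades result quality over time for stable scores, which may matter if the threshold has been tuned by hand.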
The API supports several scoring mechanisms ("algorithms") that can be used to compute and rank the results of a match query. Below is a narrative overview of the supported algorithms; please also refer to the technical documentation.
logic-v1 (currently also: best) implements a large number of deterministic rules to generate a match result suitable for screening systems. The rules include phonetic and fuzzy name matching, rules regarding the use of IMO, ISIN, LEI, OGRN, INN and other entity identifiers, and rules that reduce the quality of matches in which supporting information (such as countries, DOB, gender, and address) diverges between the query and the matching candidate. This model is calibrated to be used with the default threshold parameter value (0.7).
name-based and name-qualified are name-only scoring systems that combine the Jaro-Winkler and Soundex name comparison techniques to aggressively match entities by name. They attempt to loosely emulate the OFAC Sanctions Search web tool. This can be useful for regulatory purposes, or when you only know the names of the entities you need to screen. name-qualified provides a marginal improvement over name-based by computing the same score and then penalizing matches where the birth date or nationality is different for people, or where different registration numbers/tax identifiers are used for companies.
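To give a feel for the phonetic half of this approach, here is a minimal sketch of the classic American Soundex encoding in Python. It is not the yente implementation, and it leaves out the Jaro-Winkler comparison entirely; the function names soundex and phonetic_name_match are invented for this example.

```python
# Build the standard Soundex letter-to-digit table.
CODES = {}
for digit, letters in enumerate(("bfpv", "cgjkqsxz", "dt", "l", "mn", "r"), start=1):
    for letter in letters:
        CODES[letter] = str(digit)


def soundex(word):
    """Return the four-character American Soundex code for a single word."""
    letters = [c for c in word.lower() if c.isascii() and c.isalpha()]
    if not letters:
        return ""
    encoded = letters[0].upper()
    previous = CODES.get(letters[0], "")
    for char in letters[1:]:
        code = CODES.get(char, "")
        # Skip uncoded letters and repeats of the previous code.
        if code and code != previous:
            encoded += code
        # "h" and "w" do not separate letters that share a code; vowels do.
        if char not in "hw":
            previous = code
    return (encoded + "000")[:4]


def phonetic_name_match(left, right):
    """Hypothetical check: do both names yield the same Soundex codes, token by token?"""
    return [soundex(t) for t in left.split()] == [soundex(t) for t in right.split()]


print(soundex("Obama"), soundex("Ohbama"))  # O150 O150
print(phonetic_name_match("Barack Ohbama", "Barack Obama"))  # True
```

Because the misspelled and the correct surname collapse to the same code, a purely phonetic comparison treats them as equal, which is also why this family of matchers is described as aggressive.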
regression-v1 and regression-v2 are scoring systems that apply logistic regression to a wide set of features. They provide good results in particular if you can include multiple attributes to describe the entities you are screening for: dates of birth, nationalities, addresses, tax identifiers. Both models will produce high match scores only for multi-attribute matches, e.g. when a query shares the name and birth date or identification number of an entry in the database.

regression-v2 produces significantly lower score values than regression-v1. You may want to set the threshold parameter for matches to 0.5 when using it.

The logic-v1, name-based and name-qualified matchers support the fine tuning of feature weights for custom scoring. For example, an API client may want to give more weight to a phonetic matching algorithm, or fully disable one of the existing mechanisms. Feature weights range from 0.0 to 1.0 and can be applied to any of the documented features by including a weights section in the body of the /match API request:
{
  "weights": {
    "name_literal_match": 0.0,
    "name_soundex_match": 1.0
  },
  "queries": {
    "q1": {
      "schema": "Person",
      "properties": {"name": ["Barack Ohbama"]}
    }
  }
}
The logic-v1 matcher includes some features that are weighted at 0.0 by default. These are meant to be enabled using custom weights if desired by the API user. Features that have a 0.0 weight are not computed by default, which has a positive impact on system performance.
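A client might assemble such a request body programmatically and check the weights before sending it. The following is a minimal sketch under that assumption; build_match_body is a hypothetical helper, and the range check simply mirrors the 0.0 to 1.0 bounds stated above rather than any validation the API itself is known to perform.

```python
import json


def build_match_body(queries, weights=None):
    """Hypothetical helper: build a /match request body with optional custom weights."""
    weights = dict(weights or {})
    for feature, weight in weights.items():
        # Client-side sanity check based on the documented 0.0 to 1.0 range.
        if not 0.0 <= weight <= 1.0:
            raise ValueError(
                f"weight for {feature!r} must be between 0.0 and 1.0, got {weight}"
            )
    body = {"queries": queries}
    if weights:
        body["weights"] = weights
    return body


body = build_match_body(
    {"q1": {"schema": "Person", "properties": {"name": ["Barack Ohbama"]}}},
    weights={"name_literal_match": 0.0, "name_soundex_match": 1.0},
)
print(json.dumps(body, indent=2))
```

The resulting dictionary can be sent as the JSON body of a POST request to the /match endpoint with any HTTP client.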
Matching entities from multiple databases is a complex problem. The matchers included in yente provide solutions to this problem, but they have several known limitations. These limitations are most visible in scenarios where the query data provided by the API consumer is extremely limited (e.g. name-only matching).
Some known limitations:
- [...] (Lymited vs. Limited).
- [...] (Alexander and Sasha). Such dictionaries are not currently included in the OpenSanctions matching system.

Several vendors of advanced entity matching technology have integrated OpenSanctions data into their solutions. We're happy to put you in touch with those vendors.
To set up a matching solution with low error rates (both false positives and false negatives), it may be helpful to reflect on what input data you can provide to allow precise decision-making. Consider the following questions:
- [...] schema in your matching query to Person and Organization will increase precision.
- [...] (firstName, lastName properties)?
- [...] (birthDate)?
- [...] (registrationNumber) or tax identifiers (taxNumber)?
- [...] (country)?
- /match/default will search sanctions lists, the PEP database and a broad set of other risk-adjacent entities. For a simple sanctions screening system, consider using /match/sanctions instead: this will only produce matches sourced from sanctions lists.

OpenSanctions is free for non-commercial users. Businesses must acquire a data license to use the dataset.
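The input-data questions above map fairly directly onto the properties of a match query. As a minimal sketch, the hypothetical helper build_query below keeps only the properties you actually have values for; the property names are the ones mentioned on this page, while the helper itself and the sample values are assumptions made for illustration.

```python
def build_query(schema, **fields):
    """Hypothetical helper: build one match query from whatever data is available.

    Empty or missing values are dropped, and single values are wrapped in a
    list, following the shape of the request example shown earlier.
    """
    properties = {}
    for name, value in fields.items():
        if value is None or value == "" or value == []:
            continue
        properties[name] = value if isinstance(value, list) else [value]
    return {"schema": schema, "properties": properties}


query = build_query(
    "Person",
    firstName="Barack",
    lastName="Obama",
    birthDate="1961-08-04",
    country=None,  # unknown, so it is left out of the query
)
print(query)
```

The more of these properties a query carries, the more evidence the multi-attribute algorithms have to confirm or reject a candidate.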