Configuring the scoring system

Both the hosted OpenSanctions API and the yente open source application provide a selection of different mechanisms that can be used to score result matches.

Introduction

When you send a match query to the API, it will be processed in two stages: first, a search index is used to locate possible candidate results. This process is meant to optimise for recall, i.e. find a broad selection of result candidates. In a second stage, these candidates are scored against the query that has been provided by the API consumer. The following URL query parameters are used to configure that process:

algorithm is used to select a scoring algorithm. The different algorithms are described below, but you can also retrieve metadata about them programmatically using the /algorithms endpoint.
- Set this to best to use the highest-quality algorithm available at any time. This will produce good results, but may mean that specific scores for matched entities change significantly over time.
threshold is defined as the numeric score limit above which a result should be considered a match. This parameter may need to be adapted in conjunction with the algorithm to avoid producing too many false positive matches.
cutoff describes the lower bound of the result score that should still be returned by the API. Lower this parameter to see more candidates that have been down-ranked by the scoring system.
limit gives the maximum number of matches returned. The OpenSanctions dataset is de-duplicated, so there is usually at most one matching record for each query. Unlike in a full-text search, returning a large number of results therefore rarely makes sense.

Recommended default: ?algorithm=best
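As a minimal sketch of how these parameters fit together, the following Python snippet assembles a match URL. The base URL, the helper name build_match_url, and the cutoff and limit values are assumptions chosen for illustration; only the parameter names, the /match/default path and the 0.7 threshold come from this page.

```python
from urllib.parse import urlencode

# Assumption: base URL of the hosted API; a self-hosted yente instance will differ.
BASE_URL = "https://api.opensanctions.org"


def build_match_url(dataset="default", algorithm="best",
                    threshold=0.7, cutoff=0.5, limit=5):
    """Build a /match URL carrying the scoring-related query parameters.

    The cutoff and limit defaults here are illustrative, not documented defaults.
    """
    params = {
        "algorithm": algorithm,  # which scoring algorithm to use
        "threshold": threshold,  # score limit for treating a result as a match
        "cutoff": cutoff,        # lower bound for results that are returned at all
        "limit": limit,          # maximum number of results per query
    }
    return f"{BASE_URL}/match/{dataset}?{urlencode(params)}"


print(build_match_url())
# https://api.opensanctions.org/match/default?algorithm=best&threshold=0.7&cutoff=0.5&limit=5
```

Keeping cutoff below threshold means the response can still include down-ranked candidates for manual review, while only results at the threshold are treated as matches.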

Supported scoring mechanisms

The API supports several scoring mechanisms ("algorithms") that can be used to compute and rank the results of a match query. Below is a narrative overview of the supported algorithms; please also refer to the technical documentation.

logic-v1 (currently also: best) implements a large number of deterministic rules to generate a match result suitable for screening systems. The rules include phonetic and fuzzy name matching, rules regarding the use of IMO, ISIN, LEI, OGRN, INN and other entity identifiers, and rules that reduce the quality of matches in which supporting information (such as countries, DOB, gender, and address) diverges between the query and the matching candidate. This model is calibrated to be used with the default threshold parameter value (0.7).
name-based and name-qualified are name-only scoring systems that combine the Jaro-Winkler and Soundex name comparison techniques to aggressively match entities by name. The algorithms attempt to loosely emulate the OFAC Sanctions Search web tool. This can be useful for regulatory purposes, or when you only know the names of the entities you need to screen. name-qualified provides a marginal improvement over name-based by computing the same score and then penalizing matches where the birth date or nationality is different for people, or where different registration numbers/tax identifiers are used for companies.
regression-v1 and regression-v2 are scoring systems that apply logistic regression to a wide set of features. They provide good results in particular if you can include multiple attributes to describe the entities you are screening for: dates of birth, nationalities, addresses, tax identifiers. Both models will produce high match scores only for multi-attribute matches, e.g. when a query shares the name and birth date or identification number of an entry in the database.
- Please note: regression-v2 produces significantly lower score values than regression-v1. You may want to set the threshold parameter for matches to 0.5 when using it.
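Because the threshold needs to move with the algorithm, a client may want to keep both in one place. The sketch below is one way to do that on the client side. The result shape (dictionaries with id and score keys), the helper names, the fallback to 0.7 for algorithms without a stated suggestion, and the choice to treat a score equal to the threshold as a match are all assumptions made for illustration.

```python
# Default threshold the logic-v1 model is calibrated for, per the text above.
DEFAULT_THRESHOLD = 0.7

# Suggested overrides named in the text above; everything else uses the
# default here, which is an assumption rather than documented guidance.
SUGGESTED_THRESHOLDS = {"regression-v2": 0.5}


def threshold_for(algorithm):
    """Return the threshold to send alongside the given algorithm name."""
    return SUGGESTED_THRESHOLDS.get(algorithm, DEFAULT_THRESHOLD)


def split_results(results, algorithm):
    """Partition scored results into matches and lower-ranked candidates."""
    threshold = threshold_for(algorithm)
    matches = [r for r in results if r["score"] >= threshold]
    candidates = [r for r in results if r["score"] < threshold]
    return matches, candidates


# Hypothetical scored results, not real API output.
sample = [{"id": "a", "score": 0.82}, {"id": "b", "score": 0.6}]
matches, candidates = split_results(sample, "logic-v1")
print([r["id"] for r in matches], [r["id"] for r in candidates])
```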

Fine-tuning the score weights

The logic-v1, name-based and name-qualified matchers support the fine-tuning of feature weights for custom scoring. For example, an API client may want to give more weight to a phonetic matching algorithm, or fully disable one of the existing mechanisms. Feature weights range from 0.0 to 1.0 and can be applied to any of the documented features by including a weights section in the body of the /match API request:

{
    "weights": {
        "name_literal_match": 0.0,
        "name_soundex_match": 1.0
    },
    "queries": {
        "q1": {
            "schema": "Person",
            "properties": {"name": ["Barack Ohbama"]}
        }
    }
}

The logic-v1 matcher includes some features that are weighted at 0.0 by default. These are meant to be enabled using custom weights if desired by the API user. Features that have a 0.0 weight are not computed by default, which has a positive impact on system performance.
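A request body with custom weights could be assembled and range-checked on the client before it is sent. This is only a sketch: build_match_body is a hypothetical helper, and the client-side check simply mirrors the 0.0 to 1.0 range stated above rather than any validation the API itself is known to perform.

```python
import json


def build_match_body(queries, weights=None):
    """Assemble a /match request body, checking weights against the 0.0-1.0 range."""
    weights = weights or {}
    for feature, weight in weights.items():
        if not 0.0 <= weight <= 1.0:
            raise ValueError(
                f"weight for {feature!r} must be between 0.0 and 1.0, got {weight}"
            )
    body = {"queries": queries}
    if weights:
        # Only include the section when custom weights are actually set.
        body["weights"] = weights
    return body


body = build_match_body(
    queries={
        "q1": {"schema": "Person", "properties": {"name": ["Barack Ohbama"]}}
    },
    # Disable literal name matching and rely fully on the Soundex feature.
    weights={"name_literal_match": 0.0, "name_soundex_match": 1.0},
)
print(json.dumps(body, indent=4))
```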

Limitations of the matcher system

Matching entities from multiple databases is a complex problem. The matchers included in yente provide solutions to this problem, but they have several known limitations. These limitations are most visible in scenarios where the query data provided by the API consumer is extremely limited (e.g. name-only matching).

Some known limitations:

Name matching is less precise when used in conjunction with writing systems that are not Western-style alphabets. In particular, the fuzzy comparison between different writing systems will produce increased error rates. This affects writing systems including Arabic/Farsi, Burmese, the systems used in China, Japan, Korea and many Indian languages.
Phonetic matching (Soundex, Metaphone) does not support any non-alphabet writing systems.
The company name matching mechanism is particularly vulnerable to mis-spellings in the legal type parts of company names (e.g. Lymited vs. Limited).
Some name comparisons require dictionary alias approaches (e.g. matching Alexander and Sasha). Such dictionaries are not currently included in the OpenSanctions matching system.

Several vendors of advanced entity matching technology have integrated OpenSanctions data into their solutions. We're happy to put you in touch with those vendors.

Selecting your input data

In order to set up a matching solution with low error rates (both false positives and false negatives), it may be helpful to reflect on what input data you can provide in order to allow precise decision-making. Consider the following questions:

Do you know if a record in your screening set refers to a person or an organization? Setting the schema in your matching query to Person or Organization will increase precision.
Can you provide multiple name aliases? For persons, are you able to include the first and last name separately (in the firstName, lastName properties)?
The following can be useful qualifiers to include in your query in order to reduce false positives from name-only matches:
- Can you provide a birth date or year of birth for individuals (birthDate)?
- For companies, do you know any registration numbers (registrationNumber) or tax identifiers (taxNumber)?
- Do you know the nationality of a person, or the country in which a company was registered (country)?
Finally, consider reducing the scope of your query. Using /match/default will search sanctions lists, the PEP database and a broad set of other risk-adjacent entities. For a simple sanctions screening system, consider using /match/sanctions instead: this will only produce matches sourced from sanctions lists.
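The questions above translate into richer query objects. The sketch below shows one way to build them while leaving out fields you do not have. build_query is a hypothetical helper and all sample values are invented; the property names (firstName, lastName, birthDate, country, registrationNumber) and the list-valued property format are taken from this page.

```python
def build_query(schema, **properties):
    """Build one match query, wrapping values in lists and dropping empty ones."""
    props = {}
    for key, value in properties.items():
        if value is None:
            continue
        values = value if isinstance(value, (list, tuple)) else [value]
        values = [v for v in values if v]  # skip empty strings
        if values:
            props[key] = list(values)
    return {"schema": schema, "properties": props}


# Made-up example records, for illustration only.
person = build_query(
    "Person",
    firstName="Jane",
    lastName="Doe",
    birthDate="1980",  # a year of birth is used here as the qualifier
    country=None,      # unknown, so it is left out of the query
)
company = build_query(
    "Organization",
    name=["Example Trading Ltd", "Example Trading Limited"],  # multiple aliases
    registrationNumber="12345678",
)

# Body to send to /match/sanctions for a sanctions-only screening.
body = {"queries": {"p1": person, "c1": company}}
print(body)
```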