Supported scoring algorithms

The OpenSanctions API supports matching of entities using a simple query-by-example mechanism. For transparency, this page documents the algorithms used, and in particular the weighting of the features used to compute scores.

The API offers a selection of scoring algorithms described below. Both the training data and the code are fully public, inviting public scrutiny and proposals for improvement.
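As an illustration of the query-by-example idea, the sketch below builds (but does not send) a match request that selects one of the algorithms by name. The base URL, the `/match/<dataset>` path, the `algorithm` query parameter, and the `queries`/`schema`/`properties` payload layout are assumptions not stated on this page; check the API reference before relying on them. `build_match_request` is a hypothetical helper.

```python
import json
from urllib.parse import urlencode

# Assumed base URL; not given in this section.
API_BASE = "https://api.opensanctions.org"


def build_match_request(name, birth_date=None, algorithm="logic-v1", dataset="default"):
    """Return (url, body) for a hypothetical query-by-example match call."""
    # The example entity: a schema plus lists of property values.
    properties = {"name": [name]}
    if birth_date is not None:
        properties["birthDate"] = [birth_date]
    example = {"schema": "Person", "properties": properties}

    # The scoring algorithm is assumed to be chosen via a query parameter.
    url = f"{API_BASE}/match/{dataset}?{urlencode({'algorithm': algorithm})}"
    body = json.dumps({"queries": {"q1": example}})
    return url, body


url, body = build_match_request("Jane Doe", birth_date="1970-01-01")
print(url)
print(body)
```

The request body would be sent as JSON with an HTTP POST; authentication is omitted here.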

logic-v1

A rule-based matching system that generates a set of base scores via name-based and identifier-based matching, and then qualifies those scores using supporting or contradicting features of the two entities.

| Feature | Weight | Description |
|---|---|---|
| name_literal_match | 1.00 | Two entities have the same name, without normalization applied to the name. |
| person_name_jaro_winkler | 0.80 | Compare two persons' names using the Jaro-Winkler string similarity algorithm. |
| person_name_phonetic_match | 0.90 | Two persons have similar names, using a phonetic algorithm. |
| name_fingerprint_levenshtein | 0.90 | Two non-person entities have similar fingerprinted names. This includes simplifying entity type names (e.g. "Limited" -> "Ltd") and uses the Damerau-Levenshtein string distance algorithm. |
| name_metaphone_match | 0.00 | Two entities (person and non-person) have similar names, using the metaphone algorithm. |
| name_soundex_match | 0.00 | Two entities (person and non-person) have similar names, using the soundex algorithm. |
| address_entity_match | 0.98 | Two address entities relate to similar addresses. |
| crypto_wallet_address | 0.98 | Two cryptocurrency wallets have the same public key. |
| isin_security_match | 0.98 | Two securities have the same ISIN. |
| lei_code_match | 0.95 | Two entities have the same Legal Entity Identifier. |
| ogrn_code_match | 0.95 | Two entities have the same Russian company registration (OGRN) code. |
| vessel_imo_mmsi_match | 0.95 | Two vessels have the same IMO or MMSI identifier. |
| inn_code_match | 0.95 | Two entities have the same Russian tax identifier (INN). |
| bic_code_match | 0.95 | Two entities have the same SWIFT BIC. |
| identifier_match | 0.85 | Two entities have the same tax or registration identifier. |
| weak_alias_match | 0.80 | The query name is exactly the same as a result's weak alias. |
| country_mismatch | -0.20 | Both entities are linked to different countries. |
| last_name_mismatch | -0.20 | The two persons have different last names. |
| dob_year_disjoint | -0.15 | The birth date of the two entities is not the same. |
| dob_day_disjoint | -0.20 | The birth date of the two entities is not the same. |
| gender_mismatch | -0.20 | Both entities have a different gender associated with them. |
| orgid_disjoint | -0.20 | Two companies or organizations have different tax identifiers or registration numbers. |
| numbers_mismatch | -0.10 | Find numbers in names and addresses and penalise different numbers. |
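One way to read the logic-v1 weights is sketched below: the strongest positively weighted feature sets the base score, and the negatively weighted features then adjust it. The combination rule (maximum plus additive adjustments, clamped to the range 0 to 1) is an assumption made for illustration, as are the `logic_score` helper and the choice of a feature subset; the published code is the authority on the actual rule.

```python
# Subset of weights copied from the logic-v1 table above.
MATCH_WEIGHTS = {
    "name_literal_match": 1.00,
    "person_name_jaro_winkler": 0.80,
    "identifier_match": 0.85,
}
QUALIFIER_WEIGHTS = {
    "country_mismatch": -0.20,
    "dob_year_disjoint": -0.15,
    "gender_mismatch": -0.20,
}


def logic_score(features):
    """Combine feature values in [0, 1] into one score (hypothetical rule)."""
    # Base score: the best weighted positive feature (assumed rule).
    base = max(
        (MATCH_WEIGHTS[name] * value
         for name, value in features.items() if name in MATCH_WEIGHTS),
        default=0.0,
    )
    # Qualifiers: each contradicting feature shifts the score by its weight.
    adjustment = sum(
        QUALIFIER_WEIGHTS[name] * value
        for name, value in features.items() if name in QUALIFIER_WEIGHTS
    )
    return min(1.0, max(0.0, base + adjustment))


# A perfect Jaro-Winkler name match that is contradicted by the country.
print(round(logic_score({"person_name_jaro_winkler": 1.0, "country_mismatch": 1.0}), 2))
```

Under this reading, a strong name match with a conflicting country would fall from 0.80 to roughly 0.60.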

name-based

An algorithm that matches on entity name, using phonetic comparisons and edit distance to generate potential matches. This implementation is loosely based on the behaviour proposed by the US OFAC documentation (FAQ #249).

| Feature | Weight | Description |
|---|---|---|
| jaro_name_parts | 0.50 | Compare two sets of name parts using the Jaro-Winkler string similarity algorithm. |
| soundex_name_parts | 0.50 | Compare two sets of name parts using phonetic matching. |
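To make the phonetic half of this comparison concrete, here is a minimal sketch of a Soundex-based name-part comparison. The `soundex` function follows the standard American Soundex rules for ASCII letters only; the way parts are compared (shared codes divided by the larger set) is an assumption for illustration and not necessarily how the API computes `soundex_name_parts`.

```python
# Letter-to-digit table for American Soundex.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(word):
    """Return the four-character Soundex code of an ASCII word."""
    word = "".join(ch for ch in word.upper() if "A" <= ch <= "Z")
    if not word:
        return ""
    encoded = word[0]
    previous = CODES.get(word[0], "")
    for ch in word[1:]:
        digit = CODES.get(ch, "")
        # Skip vowels and repeats of the previous code.
        if digit and digit != previous:
            encoded += digit
        # H and W do not separate letters that share a code.
        if ch not in "HW":
            previous = digit
    return (encoded + "000")[:4]


def soundex_name_parts(query_name, result_name):
    """Share of phonetic codes the two names have in common (assumed metric)."""
    query_codes = {soundex(part) for part in query_name.split()} - {""}
    result_codes = {soundex(part) for part in result_name.split()} - {""}
    if not query_codes or not result_codes:
        return 0.0
    shared = query_codes & result_codes
    return len(shared) / max(len(query_codes), len(result_codes))


print(soundex("Robert"), soundex("Rupert"))
print(soundex_name_parts("Robert Smith", "Rupert Smyth"))
```

Comparing sets of parts rather than whole strings makes the result independent of the order in which name parts are written.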

name-qualified

Same as the name-based algorithm, but scores are reduced if a mismatch of birth dates or nationalities is found for persons, or if different tax/registration identifiers are given for organizations and companies.

| Feature | Weight | Description |
|---|---|---|
| jaro_name_parts | 0.50 | Compare two sets of name parts using the Jaro-Winkler string similarity algorithm. |
| soundex_name_parts | 0.50 | Compare two sets of name parts using phonetic matching. |
| country_mismatch | -0.10 | Both entities are linked to different countries. |
| dob_year_disjoint | -0.10 | The birth date of the two entities is not the same. |
| dob_day_disjoint | -0.15 | The birth date of the two entities is not the same. |
| gender_mismatch | -0.10 | Both entities have a different gender associated with them. |
| orgid_disjoint | -0.10 | Two companies or organizations have different tax identifiers or registration numbers. |
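The difference between name-based and name-qualified can be sketched by scoring the same feature values with and without the penalty weights. The sketch assumes the score is a plain weighted sum clamped to the range 0 to 1, which this page does not state; `weighted_score` and the sample feature values are hypothetical.

```python
# Weights copied from the name-based and name-qualified tables above.
NAME_WEIGHTS = {"jaro_name_parts": 0.50, "soundex_name_parts": 0.50}
PENALTY_WEIGHTS = {
    "country_mismatch": -0.10,
    "dob_year_disjoint": -0.10,
    "dob_day_disjoint": -0.15,
    "gender_mismatch": -0.10,
    "orgid_disjoint": -0.10,
}


def weighted_score(features, weights):
    """Clamped weighted sum; features missing from `weights` are ignored."""
    total = sum(weights.get(name, 0.0) * value for name, value in features.items())
    return min(1.0, max(0.0, total))


# Hypothetical feature values for a close name match with conflicting details.
features = {
    "jaro_name_parts": 0.9,
    "soundex_name_parts": 1.0,
    "country_mismatch": 1.0,
    "dob_year_disjoint": 1.0,
}

name_based = weighted_score(features, NAME_WEIGHTS)
name_qualified = weighted_score(features, {**NAME_WEIGHTS, **PENALTY_WEIGHTS})
print(round(name_based, 2), round(name_qualified, 2))
```

With these inputs the qualified score would come out lower than the name-only score by the sum of the two applicable penalties.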

regression-v1

A simple matching algorithm based on a regression model.

| Feature | Weight | Description |
|---|---|---|
| name_match | 0.71 | Check for exact name matches between the two entities. |
| name_token_overlap | 0.77 | Evaluate the proportion of identical words in each name. |
| name_numbers | -0.15 | Find if names contain numbers, score if the numbers are different. |
| name_levenshtein | 0.13 | Consider the edit distance (as a fraction of name length) between the two most similar names linked to both entities. |
| phone_match | 0.02 | Matching phone numbers between the two entities. |
| email_match | 0.01 | Matching email addresses between the two entities. |
| identifier_match | 0.10 | Matching identifiers (e.g. passports, national ID cards, registration or tax numbers) between the two entities. |
| dob_matches | 0.80 | The birth date of the two entities is the same. |
| dob_year_matches | 0.38 | The birth date of the two entities is the same. |
| dob_year_disjoint | -0.54 | The birth date of the two entities is not the same. |
| first_name_match | -0.07 | Matching first/given name between the two entities. |
| family_name_match | 0.03 | Matching family name between the two entities. |
| birth_place | -0.02 | Same place of birth. |
| gender_mismatch | -0.18 | Both entities have a different gender associated with them. |
| country_mismatch | -0.24 | Both entities are linked to different countries. |
| org_identifier_match | 0.35 | Matching identifiers (e.g. registration or tax numbers) between two organizations or companies. |
| address_match | 0.55 | Text similarity between addresses. |
| address_numbers | 0.07 | Find if names contain numbers, score if the numbers are different. |
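If the weights above are read as coefficients of a logistic regression, a score could be computed as sketched below. This page only says "regression model", so the logistic form, the zero intercept, the absence of any feature scaling, and the `regression_score` helper are all assumptions for illustration; only the coefficient values are taken from the table.

```python
import math

# Subset of coefficients copied from the regression-v1 table above.
COEFFICIENTS = {
    "name_match": 0.71,
    "name_token_overlap": 0.77,
    "dob_matches": 0.80,
    "dob_year_disjoint": -0.54,
    "country_mismatch": -0.24,
}
# Hypothetical intercept; the real model's intercept is not given on this page.
INTERCEPT = 0.0


def regression_score(features, coefficients=COEFFICIENTS, intercept=INTERCEPT):
    """Logistic function applied to a weighted sum of feature values."""
    z = intercept + sum(
        coefficients.get(name, 0.0) * value for name, value in features.items()
    )
    return 1.0 / (1.0 + math.exp(-z))


same_dob = regression_score({"name_match": 1.0, "dob_matches": 1.0})
other_dob = regression_score({"name_match": 1.0, "dob_year_disjoint": 1.0})
print(round(same_dob, 3), round(other_dob, 3))
```

Read this way, a weight shifts the log-odds of a match rather than the score itself, which is why individual weights do not need to stay within the range 0 to 1.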

regression-v2

A simple matching algorithm based on a regression model with phonetic comparison.

| Feature | Weight | Description |
|---|---|---|
| name_part_soundex | 0.67 | Check for overlap of phonetic forms of the names. |
| name_numbers | -0.28 | Find if names contain numbers, score if the numbers are different. |
| name_levenshtein | 1.31 | Levenshtein similarity between the two entities' names. |
| identifier_match | 0.10 | Matching identifiers (e.g. passports, national ID cards, registration or tax numbers) between the two entities. |
| dob_matches | 0.26 | The birth date of the two entities is the same. |
| dob_year_matches | 0.12 | The birth date of the two entities is the same. |
| dob_year_disjoint | -1.03 | The birth date of the two entities is not the same. |
| first_name_match | 0.05 | Matching first/given name between the two entities. |
| last_name_mismatch | -0.20 | The two persons have different last names. |
| birth_place | -0.01 | Same place of birth. |
| gender_mismatch | -0.16 | Both entities have a different gender associated with them. |
| country_mismatch | -0.25 | Both entities are linked to different countries. |
| org_identifier_match | 0.51 | Matching identifiers (e.g. registration or tax numbers) between two organizations or companies. |
| address_entity_match | 1.39 | Two address entities relate to similar addresses. |
| address_prop_match | 0.02 | Two entities have similar stated addresses. |
| address_numbers | 0.01 | Find if names contain numbers, score if the numbers are different. |
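The `name_levenshtein` feature carries the largest name-related weight in this table. A minimal sketch of a Levenshtein similarity is shown below, normalising the edit distance by the length of the longer name. The normalisation and the handling of empty input are assumptions for illustration; the API may preprocess names or normalise differently.

```python
def levenshtein(a, b):
    """Edit distance: insertions, deletions and substitutions cost 1 each."""
    if len(a) < len(b):
        a, b = b, a
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a, b):
    """Map edit distance to [0, 1]; 1.0 means the strings are identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0  # assumed: two empty names give no evidence of a match
    return 1.0 - levenshtein(a, b) / longest


print(levenshtein("kitten", "sitting"))
print(round(levenshtein_similarity("kitten", "sitting"), 3))
```

Dividing by the longer length keeps the value comparable across short and long names.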