Supported scoring algorithms

The OpenSanctions API matches entities using a simple query-by-example mechanism. For transparency, this page documents the algorithms used and, in particular, the weights of the features used to compute scores.

The API offers a selection of scoring algorithms described below. Both the training data and the code are fully public, inviting public scrutiny and proposals for improvement.
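To make the query-by-example idea concrete, the sketch below builds a request that describes an example entity and names one of the algorithms listed on this page. The base URL, the `/match/default` path, the `algorithm` query parameter and the `queries` payload shape are assumptions for illustration and should be checked against the API reference. The person is fictional.

```python
import json
from urllib.parse import urlencode

# Assumed base URL and path; verify against the API reference before use.
API_BASE = "https://api.opensanctions.org"


def build_match_request(algorithm: str = "logic-v1") -> tuple[str, str]:
    """Return (url, json_body) for a hypothetical matching request."""
    # The "example" entity: a schema plus the properties known about it.
    example_entity = {
        "schema": "Person",
        "properties": {
            "name": ["Jane Doe"],
            "birthDate": ["1970-01-01"],
            "nationality": ["gb"],
        },
    }
    # Assumption: the scoring algorithm is selected with a query parameter.
    url = f"{API_BASE}/match/default?" + urlencode({"algorithm": algorithm})
    body = json.dumps({"queries": {"q1": example_entity}})
    return url, body


url, body = build_match_request("name-based")
print(url)
print(body)
```

The block only constructs the request and sends nothing, so it can be run without credentials.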

logic-v1

A rule-based matching system that generates a base score via name- and identifier-based matching, and then qualifies that score using supporting or contradicting features of the two entities.

Feature	Weight	Description
`name_literal_match`	1.00	Two entities have the same name, without normalization applied to the name.
`person_name_jaro_winkler`	0.80	Compare two persons' names using the Jaro-Winkler string similarity algorithm.
`person_name_phonetic_match`	0.90	Two persons have similar names, using a phonetic algorithm.
`name_fingerprint_levenshtein`	0.90	Two non-person entities have similar fingerprinted names. This includes simplifying entity type names (e.g. "Limited" -> "Ltd") and uses the Damerau-Levenshtein string distance algorithm.
`name_metaphone_match`	0.00	Two entities (person and non-person) have similar names, using the metaphone algorithm.
`name_soundex_match`	0.00	Two entities (person and non-person) have similar names, using the soundex algorithm.
`address_entity_match`	0.98	Two address entities relate to similar addresses.
`crypto_wallet_address`	0.98	Two cryptocurrency wallets have the same public key.
`isin_security_match`	0.98	Two securities have the same ISIN.
`lei_code_match`	0.95	Two entities have the same Legal Entity Identifier.
`ogrn_code_match`	0.95	Two entities have the same Russian company registration (OGRN) code.
`vessel_imo_mmsi_match`	0.95	Two vessels have the same IMO or MMSI identifier.
`inn_code_match`	0.95	Two entities have the same Russian tax identifier (INN).
`bic_code_match`	0.95	Two entities have the same SWIFT BIC.
`identifier_match`	0.85	Two entities have the same tax or registration identifier.
`weak_alias_match`	0.80	The query name is exactly the same as a result's weak alias.
`country_mismatch`	-0.20	Both entities are linked to different countries.
`last_name_mismatch`	-0.20	The two persons have different last names.
`dob_year_disjoint`	-0.15	The birth years of the two entities are not the same.
`dob_day_disjoint`	-0.20	The full birth dates (to the day) of the two entities are not the same.
`gender_mismatch`	-0.20	Both entities have a different gender associated with them.
`orgid_disjoint`	-0.20	Two companies or organizations have different tax identifiers or registration numbers.
`numbers_mismatch`	-0.10	Find numbers in names and addresses and penalise different numbers.
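One way to read the logic-v1 table is as a two-step calculation: take the strongest weighted name or identifier match as the base score, then add the (negative) qualifier weights. The sketch below follows that reading. The combination rule (maximum of base features, sum of qualifiers, clamped to the range 0 to 1) is an assumption and not a statement of the actual implementation; only the weights come from the table, and only a subset is shown.

```python
# Weights copied from the logic-v1 table above (subset for brevity).
BASE_WEIGHTS = {
    "name_literal_match": 1.00,
    "person_name_jaro_winkler": 0.80,
    "person_name_phonetic_match": 0.90,
    "identifier_match": 0.85,
}
QUALIFIER_WEIGHTS = {
    "country_mismatch": -0.20,
    "dob_year_disjoint": -0.15,
    "gender_mismatch": -0.20,
}


def logic_score(features: dict[str, float]) -> float:
    """Combine feature values (each in 0..1) into one score."""
    # Assumption: the base score is the strongest weighted base feature.
    base = max(
        (BASE_WEIGHTS[name] * value
         for name, value in features.items() if name in BASE_WEIGHTS),
        default=0.0,
    )
    # Assumption: qualifiers shift the base score by weight * value.
    adjustment = sum(
        QUALIFIER_WEIGHTS[name] * value
        for name, value in features.items() if name in QUALIFIER_WEIGHTS
    )
    # Clamp so the result stays a valid score.
    return min(1.0, max(0.0, base + adjustment))


score = logic_score({"person_name_jaro_winkler": 1.0, "country_mismatch": 1.0})
print(round(score, 2))
```

Under these assumptions, a perfect Jaro-Winkler name match (0.80) with a country mismatch (-0.20) ends up at roughly 0.60.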

name-based

An algorithm that matches on entity name, using phonetic comparisons and edit distance to generate potential matches. This implementation is loosely based on the behaviour proposed by the US OFAC documentation (FAQ #249).

Feature	Weight	Description
`jaro_name_parts`	0.50	Compare two sets of name parts using the Jaro-Winkler string similarity algorithm.
`soundex_name_parts`	0.50	Compare two sets of name parts using phonetic matching.
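As an illustration of phonetic matching on name parts, the sketch below implements standard American Soundex and scores the fraction of query name parts whose code also appears in the candidate name. How the real `soundex_name_parts` feature aligns and scores parts is not described here, so that fraction rule is an assumption. Jaro-Winkler is not implemented; its value is passed in as a number, and the two features are combined with the 0.50 weights from the table.

```python
# Standard Soundex consonant groups.
_CODES = {}
for _letters, _digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")]:
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(word: str) -> str:
    """Return the 4-character Soundex code of a word ('' if no letters)."""
    word = "".join(c for c in word.lower() if c.isalpha())
    if not word:
        return ""
    out = word[0].upper()
    prev = _CODES.get(word[0], "")
    for ch in word[1:]:
        if ch in "hw":
            continue  # h and w do not separate repeated codes
        code = _CODES.get(ch, "")
        if code and code != prev:
            out += code
        prev = code  # vowels reset prev, so repeats after a vowel count
    return (out + "000")[:4]


def soundex_name_parts(query: str, candidate: str) -> float:
    """Assumed rule: share of query parts with a Soundex match in candidate."""
    query_codes = [c for c in (soundex(p) for p in query.split()) if c]
    candidate_codes = {soundex(p) for p in candidate.split()}
    if not query_codes:
        return 0.0
    return sum(c in candidate_codes for c in query_codes) / len(query_codes)


def name_based_score(jaro_value: float, soundex_value: float) -> float:
    """Weighted sum using the 0.50 / 0.50 weights from the table."""
    return 0.50 * jaro_value + 0.50 * soundex_value


phonetic = soundex_name_parts("Maria Smyth", "John Smith")
print(phonetic, name_based_score(0.7, phonetic))
```

Here "Smyth" and "Smith" share a Soundex code while "Maria" and "John" do not, so half of the query parts match.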

name-qualified

Same as the name-based algorithm, but scores are reduced if birth dates or nationalities do not match for persons, or if organizations and companies have different tax or registration identifiers.

Feature	Weight	Description
`jaro_name_parts`	0.50	Compare two sets of name parts using the Jaro-Winkler string similarity algorithm.
`soundex_name_parts`	0.50	Compare two sets of name parts using phonetic matching.
`country_mismatch`	-0.10	Both entities are linked to different countries.
`dob_year_disjoint`	-0.10	The birth years of the two entities are not the same.
`dob_day_disjoint`	-0.15	The full birth dates (to the day) of the two entities are not the same.
`gender_mismatch`	-0.10	Both entities have a different gender associated with them.
`orgid_disjoint`	-0.10	Two companies or organizations have different tax identifiers or registration numbers.
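The effect of the qualifiers can be sketched as a name score minus penalties. The weights below are taken from the name-qualified table; the rule for combining them (weighted sum of the two name features, plus the negative qualifier weights, floored at zero) is an assumption rather than a description of the actual implementation, and the feature values in the example are hypothetical.

```python
# Weights copied from the name-qualified table above.
NAME_WEIGHTS = {"jaro_name_parts": 0.50, "soundex_name_parts": 0.50}
QUALIFIER_WEIGHTS = {
    "country_mismatch": -0.10,
    "dob_year_disjoint": -0.10,
    "dob_day_disjoint": -0.15,
    "gender_mismatch": -0.10,
    "orgid_disjoint": -0.10,
}


def name_qualified_score(features: dict[str, float]) -> float:
    """Name score reduced by whichever mismatch features fired."""
    # Missing features count as 0.0, i.e. "not observed".
    name_score = sum(w * features.get(n, 0.0) for n, w in NAME_WEIGHTS.items())
    penalty = sum(w * features.get(n, 0.0) for n, w in QUALIFIER_WEIGHTS.items())
    # Assumption: the score never drops below zero.
    return max(0.0, name_score + penalty)


# Hypothetical feature values for a close name match in different countries.
example = {"jaro_name_parts": 0.9, "soundex_name_parts": 1.0, "country_mismatch": 1.0}
print(round(name_qualified_score(example), 2))
```

With these inputs the name features contribute 0.95 and the country mismatch removes 0.10, leaving about 0.85.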

regression-v1

A simple matching algorithm based on a regression model.

Feature	Weight	Description
`name_match`	0.69	Check for exact name matches between the two entities.
`name_token_overlap`	0.76	Evaluate the proportion of identical words in each name.
`name_numbers`	-0.16	Find if names contain numbers, score if the numbers are different.
`name_levenshtein`	0.14	Consider the edit distance (as a fraction of name length) between the two most similar names linked to both entities.
`phone_match`	0.02	Matching phone numbers between the two entities.
`email_match`	0.01	Matching email addresses between the two entities.
`identifier_match`	0.10	Matching identifiers (e.g. passports, national ID cards, registration or tax numbers) between the two entities.
`dob_matches`	0.81	The birth date of the two entities is the same.
`dob_year_matches`	0.38	The birth year of the two entities is the same.
`dob_year_disjoint`	-0.54	The birth years of the two entities are not the same.
`first_name_match`	-0.08	Matching first/given name between the two entities.
`family_name_match`	0.04	Matching family name between the two entities.
`birth_place`	-0.02	Same place of birth.
`gender_mismatch`	-0.18	Both entities have a different gender associated with them.
`country_mismatch`	-0.24	Both entities are linked to different countries.
`org_identifier_match`	0.34	Matching identifiers (e.g. registration or tax numbers) between two organizations or companies.
`address_match`	0.56	Text similarity between addresses.
`address_numbers`	0.06	Find if addresses contain numbers, score if the numbers are different.
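The regression-v1 weights can be read as per-feature contributions to a weighted sum. The sketch below computes that sum and ranks the contributions for a set of hypothetical feature values. How the model turns the sum into a final score (intercept, feature scaling, link function) is not stated on this page, so the code stops at the linear combination and makes no claim about the resulting match score.

```python
# Weights copied from the regression-v1 table above.
WEIGHTS = {
    "name_match": 0.69,
    "name_token_overlap": 0.76,
    "name_numbers": -0.16,
    "name_levenshtein": 0.14,
    "phone_match": 0.02,
    "email_match": 0.01,
    "identifier_match": 0.10,
    "dob_matches": 0.81,
    "dob_year_matches": 0.38,
    "dob_year_disjoint": -0.54,
    "first_name_match": -0.08,
    "family_name_match": 0.04,
    "birth_place": -0.02,
    "gender_mismatch": -0.18,
    "country_mismatch": -0.24,
    "org_identifier_match": 0.34,
    "address_match": 0.56,
    "address_numbers": 0.06,
}


def contributions(features: dict[str, float]) -> list[tuple[str, float]]:
    """Per-feature weight * value, largest absolute contribution first."""
    items = [(name, WEIGHTS[name] * value) for name, value in features.items()]
    return sorted(items, key=lambda item: abs(item[1]), reverse=True)


def weighted_sum(features: dict[str, float]) -> float:
    """Linear combination only; not the final score (see lead-in)."""
    return sum(value for _, value in contributions(features))


# Hypothetical feature values for one candidate pair.
example = {"name_match": 1.0, "dob_matches": 1.0, "country_mismatch": 1.0}
for name, value in contributions(example):
    print(f"{name:20s} {value:+.2f}")
print(round(weighted_sum(example), 2))
```

Ranking by absolute contribution makes it easy to see that birth date and name features dominate, while phone and email matches barely move the sum.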