Supported scoring algorithms

The API supports entity matching through a simple query-by-example mechanism. This page describes the available matching algorithms in detail, focusing on the heuristics (features) each one uses for comparison and on their relative weighting in the calculation of match scores.

The API offers a selection of scoring algorithms described below. Both the training data and the code are fully public, inviting public scrutiny and proposals for improvement.
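As a rough illustration of the query-by-example idea, the sketch below describes the entity being searched for as a partial record and names the algorithm that should score the candidates. The field names (`schema`, `properties`, `query`, `algorithm`) and the helper `build_query` are assumptions made for this example; this page does not specify the request format.

```python
import json

def build_query(schema, **properties):
    """Describe the entity being searched for as a partial example record.

    Hypothetical helper: the "schema"/"properties" layout is an assumption.
    """
    props = {}
    for key, value in properties.items():
        # Accept a single value or a list, and always store a list.
        values = value if isinstance(value, list) else [value]
        props[key] = [v for v in values if v]
    return {"schema": schema, "properties": props}

query = build_query(
    "Person",
    name="John Smith",
    birthDate="1975-04-12",
    country="gb",
)

# Hypothetical envelope: pairs the example entity with one of the
# algorithm names listed on this page.
request_body = {"algorithm": "logic-v2", "query": query}
print(json.dumps(request_body, indent=2))
```

The more properties the example record carries (birth date, country, identifiers), the more features the algorithms described below have to compare.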

logic-v2

A rule-based matching system that generates a set of basic scores via name- and identifier-based matching, and then qualifies the resulting score using supporting or contradicting features of the two entities. Version 2 uses a different set of features and consolidates name matching into a single feature, which uses a versatile and complex name matching algorithm.

| Feature | Weight | Description |
| --- | --- | --- |
| name_match | 1.00 | Match two entities by analyzing and comparing their names. |
| address_entity_match | 0.98 | Two address entities relate to similar addresses. |
| crypto_wallet_address | 0.98 | Two cryptocurrency wallets have the same public key. |
| isin_security_match | 0.98 | Two securities have the same ISIN. |
| lei_code_match | 0.95 | Two entities have the same Legal Entity Identifier. |
| ogrn_code_match | 0.95 | Two entities have the same Russian company registration (OGRN) code. |
| vessel_imo_mmsi_match | 0.95 | Two vessels have the same IMO or MMSI identifier. |
| inn_code_match | 0.95 | Two entities have the same Russian tax identifier (INN). |
| bic_code_match | 0.95 | Two entities have the same SWIFT BIC. |
| uei_code_match | 0.95 | Two entities have the same US Unique Entity ID (UEI). |
| npi_code_match | 0.95 | Two entities have the same US National Provider Identifier (NPI). |
| identifier_match | 0.85 | Two entities have the same tax or registration identifier. |
| weak_alias_match | 0.80 | The query name is exactly the same as a result's weak alias. |
| address_prop_match | 0.20 | Two entities have similar stated addresses. |
| country_mismatch | -0.20 | Both entities are linked to different countries. |
| dob_year_disjoint | -0.15 | The birth date of the two entities is not the same. |
| dob_day_disjoint | -0.25 | The birth date of the two entities is not the same. |
| gender_mismatch | -0.20 | Both entities have a different gender associated with them. |

| Configuration Variable | Default | Description |
| --- | --- | --- |
| nm_number_mismatch | 0.3 | Penalty for mismatching numbers in object or company names. |
| nm_extra_query_name | 0.8 | Weight for name parts in the query not matched to the result. |
| nm_extra_result_name | 0.2 | Weight for name parts in the result not matched to the query. |
| nm_family_name_weight | 1.3 | Extra weight multiplier for family name in person matches (John Smith vs. John Gruber is clearly distinct). |
| nm_fuzzy_cutoff_factor | 1 | Extra factor for when a fuzzy match is triggered in name matching. Below a certain threshold, a fuzzy match is considered as a non-match (score = 0.0). Adjusting this multiplier will raise this threshold, making a fuzzy match trigger more leniently. |
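The description above (a basic score that is then qualified by supporting or contradicting features) can be sketched roughly as follows. The weights are copied from the logic-v2 feature table; the split into "base" and "qualifier" groups, the use of the strongest base feature, and the clamping to the range 0 to 1 are assumptions made for this sketch, not behaviour confirmed by this page.

```python
# Weights copied from the logic-v2 feature table. Grouping them into
# "base" and "qualifier" features is an assumption made for this sketch.
BASE_WEIGHTS = {
    "name_match": 1.00,
    "address_entity_match": 0.98,
    "crypto_wallet_address": 0.98,
    "isin_security_match": 0.98,
    "lei_code_match": 0.95,
    "ogrn_code_match": 0.95,
    "vessel_imo_mmsi_match": 0.95,
    "inn_code_match": 0.95,
    "bic_code_match": 0.95,
    "uei_code_match": 0.95,
    "npi_code_match": 0.95,
    "identifier_match": 0.85,
    "weak_alias_match": 0.80,
}
QUALIFIER_WEIGHTS = {
    "address_prop_match": 0.20,
    "country_mismatch": -0.20,
    "dob_year_disjoint": -0.15,
    "dob_day_disjoint": -0.25,
    "gender_mismatch": -0.20,
}

def logic_score(features):
    """Combine feature values in [0, 1] into one score (hypothetical rule)."""
    # Assumed: the strongest weighted base feature sets the starting score.
    base = max(
        (BASE_WEIGHTS[name] * value
         for name, value in features.items() if name in BASE_WEIGHTS),
        default=0.0,
    )
    # Assumed: qualifiers shift that score up or down by weight * value.
    adjustment = sum(
        QUALIFIER_WEIGHTS[name] * value
        for name, value in features.items() if name in QUALIFIER_WEIGHTS
    )
    # Assumed: the final score is clamped to the range [0, 1].
    return min(1.0, max(0.0, base + adjustment))

example = {"name_match": 0.9, "country_mismatch": 1.0, "address_prop_match": 0.5}
print(round(logic_score(example), 2))  # 0.9 - 0.2 + 0.1
```

Under these assumptions, a strong name match with a country mismatch and a partially similar address ends up near 0.8: the mismatch pulls the score down and the address similarity partly offsets it.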

name-based

An algorithm that matches on entity name, using phonetic comparisons and edit distance to generate potential matches. This implementation is loosely based on the behaviour proposed by the US OFAC documentation (FAQ #249).

| Feature | Weight | Description |
| --- | --- | --- |
| jaro_name_parts | 0.50 | Compare two sets of name parts using the Jaro-Winkler string similarity algorithm. |
| soundex_name_parts | 0.50 | Compare two sets of name parts using phonetic matching. |
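To make the two features concrete, here is a minimal sketch that implements standard Soundex and Jaro-Winkler and blends them with the 0.50 weights from the table. How name parts are tokenised and paired (`parts`, best-match averaging, code-set overlap) is an assumption made for this example and may differ from the actual implementation.

```python
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in letters:
        CODES[letter] = digit

def soundex(word):
    """Standard four-character American Soundex code."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0]
    previous = CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = CODES.get(ch, "")
        if digit and digit != previous:
            code += digit
        if ch not in "HW":  # H and W do not separate repeated codes
            previous = digit
    return (code + "000")[:4]

def jaro(a, b):
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_hit, b_hit = [False] * len(a), [False] * len(b)
    matches = 0
    for i, ch in enumerate(a):
        for j in range(max(0, i - window), min(len(b), i + window + 1)):
            if not b_hit[j] and b[j] == ch:
                a_hit[i] = b_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    a_seq = [c for c, hit in zip(a, a_hit) if hit]
    b_seq = [c for c, hit in zip(b, b_hit) if hit]
    transpositions = sum(x != y for x, y in zip(a_seq, b_seq)) / 2
    return (matches / len(a) + matches / len(b)
            + (matches - transpositions) / matches) / 3

def jaro_winkler(a, b, scaling=0.1):
    score = jaro(a, b)
    prefix = 0
    for x, y in zip(a[:4], b[:4]):  # common prefix, capped at 4 characters
        if x != y:
            break
        prefix += 1
    return score + prefix * scaling * (1 - score)

def parts(name):
    # Assumed tokenisation: lowercase, split on whitespace.
    return name.lower().split()

def jaro_name_parts(query, result):
    q, r = parts(query), parts(result)
    if not q or not r:
        return 0.0
    # Assumed pairing: each query part is scored against its best result part.
    return sum(max(jaro_winkler(x, y) for y in r) for x in q) / len(q)

def soundex_name_parts(query, result):
    q = {soundex(p) for p in parts(query)}
    r = {soundex(p) for p in parts(result)}
    if not q or not r:
        return 0.0
    # Assumed overlap measure: shared codes over the larger code set.
    return len(q & r) / max(len(q), len(r))

def name_based_score(query, result):
    return (0.50 * jaro_name_parts(query, result)
            + 0.50 * soundex_name_parts(query, result))

print(name_based_score("Jon Smyth", "John Smith"))
```

Spelling variants such as "Jon Smyth" and "John Smith" share the same Soundex codes, so the phonetic half contributes fully while the Jaro-Winkler half reflects the small spelling differences.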

name-qualified

Same as the name-based algorithm, but scores are reduced if a mismatch of birth dates or nationalities is found for persons, or if different tax/registration identifiers are included for organizations and companies.

| Feature | Weight | Description |
| --- | --- | --- |
| jaro_name_parts | 0.50 | Compare two sets of name parts using the Jaro-Winkler string similarity algorithm. |
| soundex_name_parts | 0.50 | Compare two sets of name parts using phonetic matching. |
| country_mismatch | -0.10 | Both entities are linked to different countries. |
| dob_year_disjoint | -0.10 | The birth date of the two entities is not the same. |
| dob_day_disjoint | -0.15 | The birth date of the two entities is not the same. |
| gender_mismatch | -0.10 | Both entities have a different gender associated with them. |
| orgid_disjoint | -0.10 | Two companies or organizations have different tax identifiers or registration numbers. |
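A short worked example of how the penalties might reduce a name score, using the weights from the table. Treating the score as a plain weighted sum that is clamped to the range 0 to 1 is an assumption of this sketch; the feature values in `person` are invented for illustration.

```python
NAME_WEIGHTS = {"jaro_name_parts": 0.50, "soundex_name_parts": 0.50}
PENALTY_WEIGHTS = {
    "country_mismatch": -0.10,
    "dob_year_disjoint": -0.10,
    "dob_day_disjoint": -0.15,
    "gender_mismatch": -0.10,
    "orgid_disjoint": -0.10,
}

def name_qualified_score(features):
    """Weighted name score minus triggered penalties (assumed combination)."""
    name_score = sum(w * features.get(n, 0.0) for n, w in NAME_WEIGHTS.items())
    penalty = sum(w * features.get(n, 0.0) for n, w in PENALTY_WEIGHTS.items())
    return min(1.0, max(0.0, name_score + penalty))

# Invented feature values: a close name match, but the countries and the
# exact birth dates disagree.
person = {
    "jaro_name_parts": 0.94,
    "soundex_name_parts": 1.0,
    "country_mismatch": 1.0,
    "dob_day_disjoint": 1.0,
}
print(round(name_qualified_score(person), 2))  # 0.97 - 0.10 - 0.15
```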

logic-v1

A rule-based matching system that generates a set of basic scores via name- and identifier-based matching, and then qualifies the resulting score using supporting or contradicting features of the two entities.

| Feature | Weight | Description |
| --- | --- | --- |
| name_literal_match | 1.00 | Two entities have the same name, without normalization applied to the name. |
| person_name_jaro_winkler | 0.80 | Compare two persons' names using the Jaro-Winkler string similarity algorithm. |
| person_name_phonetic_match | 0.90 | Two persons have similar names, using a phonetic algorithm. |
| name_fingerprint_levenshtein | 0.90 | Two non-person entities have similar fingerprinted names. This includes simplifying entity type names (e.g. "Limited" -> "Ltd") and uses the Damerau-Levenshtein string distance algorithm. |
| name_metaphone_match | 0.00 | Two entities (person and non-person) have similar names, using the metaphone algorithm. |
| name_soundex_match | 0.00 | Two entities (person and non-person) have similar names, using the soundex algorithm. |
| address_entity_match | 0.98 | Two address entities relate to similar addresses. |
| crypto_wallet_address | 0.98 | Two cryptocurrency wallets have the same public key. |
| isin_security_match | 0.98 | Two securities have the same ISIN. |
| lei_code_match | 0.95 | Two entities have the same Legal Entity Identifier. |
| ogrn_code_match | 0.95 | Two entities have the same Russian company registration (OGRN) code. |
| vessel_imo_mmsi_match | 0.95 | Two vessels have the same IMO or MMSI identifier. |
| inn_code_match | 0.95 | Two entities have the same Russian tax identifier (INN). |
| bic_code_match | 0.95 | Two entities have the same SWIFT BIC. |
| identifier_match | 0.85 | Two entities have the same tax or registration identifier. |
| weak_alias_match | 0.80 | The query name is exactly the same as a result's weak alias. |
| country_mismatch | -0.20 | Both entities are linked to different countries. |
| last_name_mismatch | -0.20 | The two persons have different last names. |
| dob_year_disjoint | -0.15 | The birth date of the two entities is not the same. |
| dob_day_disjoint | -0.20 | The birth date of the two entities is not the same. |
| gender_mismatch | -0.20 | Both entities have a different gender associated with them. |
| orgid_disjoint | -0.20 | Two companies or organizations have different tax identifiers or registration numbers. |
| numbers_mismatch | -0.10 | Find numbers in names and addresses and penalise different numbers. |
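The name_fingerprint_levenshtein feature combines two ideas: normalising company names and measuring edit distance. The sketch below illustrates both under stated assumptions. The `LEGAL_FORMS` table is hypothetical (only "Limited" -> "Ltd" comes from this page), the distance function is the optimal string alignment variant of Damerau-Levenshtein, and turning the distance into a 0 to 1 similarity by dividing by the longer length is an assumption as well.

```python
import re

# Hypothetical replacement table. Only "limited" -> "ltd" is taken from
# the feature description; the other entries are illustrative.
LEGAL_FORMS = {
    "limited": "ltd",
    "incorporated": "inc",
    "corporation": "corp",
    "company": "co",
}

def fingerprint(name):
    """Lowercase, drop punctuation, and abbreviate entity type names."""
    tokens = re.findall(r"\w+", name.lower())
    return " ".join(LEGAL_FORMS.get(token, token) for token in tokens)

def osa_distance(a, b):
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution
            )
            # Swap of two adjacent characters counts as a single edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]

def fingerprint_similarity(left, right):
    a, b = fingerprint(left), fingerprint(right)
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    return 1.0 - osa_distance(a, b) / longest

print(fingerprint_similarity("ACME Trading Limited", "Acme Trading Ltd."))
print(fingerprint_similarity("Acme Trading Ltd", "Amce Trading Ltd"))
```

After fingerprinting, the first pair is identical, while the second pair differs by one transposed letter pair and so stays close to, but below, a perfect score.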

regression-v1

A simple matching algorithm based on a regression model.

| Feature | Weight | Description |
| --- | --- | --- |
| name_match | 0.70 | Check for exact name matches between the two entities. |
| name_token_overlap | 0.76 | Evaluate the proportion of identical words in each name. |
| name_numbers | -0.14 | Find if names contain numbers, score if the numbers are different. |
| name_levenshtein | 0.13 | Consider the edit distance (as a fraction of name length) between the two most similar names linked to both entities. |
| phone_match | 0.03 | Matching phone numbers between the two entities. |
| email_match | 0.01 | Matching email addresses between the two entities. |
| identifier_match | 0.10 | Matching identifiers (e.g. passports, national ID cards, registration or tax numbers) between the two entities. |
| dob_matches | 0.82 | The birth date of the two entities is the same. |
| dob_year_matches | 0.37 | The birth date of the two entities is the same. |
| dob_year_disjoint | -0.54 | The birth date of the two entities is not the same. |
| first_name_match | -0.06 | Matching first/given name between the two entities. |
| family_name_match | 0.04 | Matching family name between the two entities. |
| birth_place | -0.02 | Same place of birth. |
| gender_mismatch | -0.18 | Both entities have a different gender associated with them. |
| country_mismatch | -0.24 | Both entities are linked to different countries. |
| org_identifier_match | 0.35 | Matching identifiers (e.g. registration or tax numbers) between two organizations or companies. |
| address_match | 0.56 | Text similarity between addresses. |
| address_numbers | 0.07 | Find if names contain numbers, score if the numbers are different. |
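One way to read the table is as coefficients of a regression model applied to the feature values. The sketch below assumes a logistic form, with a hypothetical intercept of 0.0 and no feature scaling; this page only says "regression model", so the link function, the intercept, and any preprocessing are assumptions. Only a subset of the coefficients is copied from the table.

```python
import math

# A subset of the coefficients from the regression-v1 table.
COEFFICIENTS = {
    "name_match": 0.70,
    "name_token_overlap": 0.76,
    "name_levenshtein": 0.13,
    "dob_matches": 0.82,
    "dob_year_matches": 0.37,
    "dob_year_disjoint": -0.54,
    "gender_mismatch": -0.18,
    "country_mismatch": -0.24,
    "address_match": 0.56,
}
INTERCEPT = 0.0  # hypothetical: the page does not give an intercept

def regression_score(features, intercept=INTERCEPT):
    """Map a weighted sum of feature values to (0, 1) with a logistic function.

    Assumed model form. Unknown feature names raise KeyError.
    """
    linear = intercept + sum(
        COEFFICIENTS[name] * value for name, value in features.items()
    )
    return 1.0 / (1.0 + math.exp(-linear))

same_person = {"name_match": 1.0, "name_token_overlap": 1.0, "dob_matches": 1.0}
different_dob = {"name_match": 1.0, "name_token_overlap": 1.0, "dob_year_disjoint": 1.0}
print(regression_score(same_person))
print(regression_score(different_dob))
```

Whatever the exact model form, the signs in the table indicate direction: features such as dob_matches raise the score, while dob_year_disjoint and country_mismatch lower it.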