The API supports matching of entities using a simple query-by-example mechanism. This page describes the available matching algorithms in detail. In particular, the descriptions focus on the heuristics (called features) used for comparison and their relative weighting in the calculation of match scores.
The API offers a selection of scoring algorithms described below. Both the training data and the code are fully public, inviting public scrutiny and proposals for improvement.
A rule-based matching system that generates a set of basic scores via name and identifier-based matching, and then qualifies that score using supporting or contradicting features of the two entities. Version 2 uses a different set of features and consolidates name matching into a single feature, which uses a versatile and complex name matching algorithm.
Feature | Weight | Description |
---|---|---|
name_match | 1.00 | Match two entities by analyzing and comparing their names. |
address_entity_match | 0.98 | Two address entities relate to similar addresses. |
crypto_wallet_address | 0.98 | Two cryptocurrency wallets have the same public key. |
isin_security_match | 0.98 | Two securities have the same ISIN. |
lei_code_match | 0.95 | Two entities have the same Legal Entity Identifier. |
ogrn_code_match | 0.95 | Two entities have the same Russian company registration (OGRN) code. |
vessel_imo_mmsi_match | 0.95 | Two vessels have the same IMO or MMSI identifier. |
inn_code_match | 0.95 | Two entities have the same Russian tax identifier (INN). |
bic_code_match | 0.95 | Two entities have the same SWIFT BIC. |
uei_code_match | 0.95 | Two entities have the same US Unique Entity ID (UEI). |
npi_code_match | 0.95 | Two entities have the same US National Provider Identifier (NPI). |
identifier_match | 0.85 | Two entities have the same tax or registration identifier. |
weak_alias_match | 0.80 | The query name is exactly the same as a result's weak alias. |
address_prop_match | 0.20 | Two entities have similar stated addresses. |
country_mismatch | -0.20 | The two entities are linked to different countries. |
dob_year_disjoint | -0.15 | The two entities have no birth year in common. |
dob_day_disjoint | -0.25 | The two entities have no full birth date (to the day) in common. |
gender_mismatch | -0.20 | The two entities have different genders associated with them. |
Configuration Variable | Default | Description |
---|---|---|
nm_number_mismatch | 0.3 | Penalty for mismatching numbers in object or company names. |
nm_extra_query_name | 0.8 | Weight for name parts in the query not matched to the result. |
nm_extra_result_name | 0.2 | Weight for name parts in the result not matched to the query. |
nm_family_name_weight | 1.3 | Extra weight multiplier for family name in person matches (John Smith vs. John Gruber is clearly distinct). |
nm_fuzzy_cutoff_factor | 1 | Extra factor for when a fuzzy match is triggered in name matching. Below a certain threshold, a fuzzy match is considered as a non-match (score = 0.0). Adjusting this multiplier will raise this threshold, making a fuzzy match trigger more leniently. |
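
The asymmetry between `nm_extra_query_name` and `nm_extra_result_name` can be made concrete with a minimal sketch. Only the default values are taken from the table above. The function `weighted_name_score`, its arguments, and the weighted-average formula are hypothetical and are not the API's actual implementation.

```python
# Defaults copied from the configuration table; everything else is assumed.
DEFAULTS = {
    "nm_extra_query_name": 0.8,
    "nm_extra_result_name": 0.2,
    "nm_family_name_weight": 1.3,
}


def weighted_name_score(matched, extra_query=0, extra_result=0, config=None):
    """Hypothetical weighted average over name parts.

    matched: list of (similarity, is_family_name) pairs for aligned parts.
    extra_query / extra_result: counts of unmatched name parts on each side.
    """
    cfg = {**DEFAULTS, **(config or {})}
    total = 0.0
    weight_sum = 0.0
    for similarity, is_family_name in matched:
        weight = cfg["nm_family_name_weight"] if is_family_name else 1.0
        total += weight * similarity
        weight_sum += weight
    # Unmatched parts add weight but no similarity, pulling the score down.
    weight_sum += cfg["nm_extra_query_name"] * extra_query
    weight_sum += cfg["nm_extra_result_name"] * extra_result
    return total / weight_sum if weight_sum else 0.0


aligned = [(1.0, False), (1.0, True)]  # given name and family name both match
query_has_extra = weighted_name_score(aligned, extra_query=1)
result_has_extra = weighted_name_score(aligned, extra_result=1)
```

Under these assumptions, an unmatched name part in the query lowers the score more than an unmatched part in the result, which is the behaviour the two defaults suggest.
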
An algorithm that matches on entity name, using phonetic comparisons and edit distance to generate potential matches. This implementation is loosely based on the behaviour proposed by the US OFAC documentation (FAQ #249).
Feature | Weight | Description |
---|---|---|
jaro_name_parts | 0.50 | Compare two sets of name parts using the Jaro-Winkler string similarity algorithm. |
soundex_name_parts | 0.50 | Compare two sets of name parts using phonetic (Soundex) matching. |
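
The two features and their equal weights can be illustrated with a self-contained sketch. The Jaro-Winkler and Soundex routines below are textbook versions. The helper names, the whitespace tokenization, and the rule of averaging each query part's best match against the result parts are assumptions made for illustration, not the API's implementation.

```python
SOUNDEX_CODES = {
    **dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
    **dict.fromkeys("dt", "3"), "l": "4", **dict.fromkeys("mn", "5"), "r": "6",
}


def soundex(word):
    """Classic four-character American Soundex code ("" if no ASCII letters)."""
    letters = [c for c in word.lower() if c.isascii() and c.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    prev = SOUNDEX_CODES.get(letters[0], "")
    for c in letters[1:]:
        digit = SOUNDEX_CODES.get(c, "")
        if digit and digit != prev:
            code += digit
        if c not in "hw":  # h and w do not separate repeated codes
            prev = digit
    return (code + "000")[:4]


def jaro(a, b):
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_hit, b_hit = [False] * len(a), [False] * len(b)
    matches = 0
    for i, ch in enumerate(a):
        for j in range(max(0, i - window), min(len(b), i + window + 1)):
            if not b_hit[j] and b[j] == ch:
                a_hit[i] = b_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    a_seq = [c for c, hit in zip(a, a_hit) if hit]
    b_seq = [c for c, hit in zip(b, b_hit) if hit]
    transpositions = sum(x != y for x, y in zip(a_seq, b_seq)) / 2
    return (matches / len(a) + matches / len(b)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(a, b, scaling=0.1):
    base = jaro(a, b)
    prefix = 0
    for x, y in zip(a[:4], b[:4]):  # common prefix, capped at four characters
        if x != y:
            break
        prefix += 1
    return base + prefix * scaling * (1 - base)


def same_soundex(a, b):
    code = soundex(a)
    return 1.0 if code and code == soundex(b) else 0.0


def best_part_average(query_parts, result_parts, compare):
    # Assumed aggregation: each query part takes its best match in the result.
    if not query_parts or not result_parts:
        return 0.0
    best = [max(compare(q, r) for r in result_parts) for q in query_parts]
    return sum(best) / len(best)


def name_based_score(query, result):
    q, r = query.lower().split(), result.lower().split()
    return (0.50 * best_part_average(q, r, jaro_winkler)
            + 0.50 * best_part_average(q, r, same_soundex))
```

With this sketch, `name_based_score("Jon Smith", "John Smith")` stays high because "Jon" and "John" share a Soundex code and differ by a single character, while two unrelated names score low on both features.
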
Same as the name-based algorithm, but scores are reduced if a mismatch of birth dates or nationalities is found for persons, or if different tax/registration identifiers are given for organizations and companies.
Feature | Weight | Description |
---|---|---|
jaro_name_parts | 0.50 | Compare two sets of name parts using the Jaro-Winkler string similarity algorithm. |
soundex_name_parts | 0.50 | Compare two sets of name parts using phonetic (Soundex) matching. |
country_mismatch | -0.10 | The two entities are linked to different countries. |
dob_year_disjoint | -0.10 | The two entities have no birth year in common. |
dob_day_disjoint | -0.15 | The two entities have no full birth date (to the day) in common. |
gender_mismatch | -0.10 | The two entities have different genders associated with them. |
orgid_disjoint | -0.10 | Two companies or organizations have different tax identifiers or registration numbers. |
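
As a rough illustration of how the negative weights reduce a name score, the sketch below adds the weight of every triggered mismatch feature to a name score and clamps the result at zero. The penalty values come from the table above. The function `qualify`, the additive rule, and the clamping are assumptions.

```python
# Negative weights copied from the table above.
PENALTIES = {
    "country_mismatch": -0.10,
    "dob_year_disjoint": -0.10,
    "dob_day_disjoint": -0.15,
    "gender_mismatch": -0.10,
    "orgid_disjoint": -0.10,
}


def qualify(name_score, triggered):
    """Reduce a name score by the weight of each triggered mismatch (assumed rule)."""
    adjustment = sum(PENALTIES[name] for name in triggered)
    return max(0.0, name_score + adjustment)


# A strong name match for a person whose country and gender both conflict.
qualified = qualify(0.95, ["country_mismatch", "gender_mismatch"])
```

In this example the two mismatches lower the score from 0.95 to roughly 0.75.
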
A rule-based matching system that generates a set of basic scores via name and identifier-based matching, and then qualifies that score using supporting or contradicting features of the two entities.
Feature | Weight | Description |
---|---|---|
name_literal_match | 1.00 | Two entities have the same name, without normalization applied to the name. |
person_name_jaro_winkler | 0.80 | Compare two persons' names using the Jaro-Winkler string similarity algorithm. |
person_name_phonetic_match | 0.90 | Two persons have similar names, using a phonetic algorithm. |
name_fingerprint_levenshtein | 0.90 | Two non-person entities have similar fingerprinted names. This includes simplifying entity type names (e.g. "Limited" -> "Ltd") and uses the Damerau-Levenshtein string distance algorithm. |
name_metaphone_match | 0.00 | Two entities (person and non-person) have similar names, using the metaphone algorithm. |
name_soundex_match | 0.00 | Two entities (person and non-person) have similar names, using the soundex algorithm. |
address_entity_match | 0.98 | Two address entities relate to similar addresses. |
crypto_wallet_address | 0.98 | Two cryptocurrency wallets have the same public key. |
isin_security_match | 0.98 | Two securities have the same ISIN. |
lei_code_match | 0.95 | Two entities have the same Legal Entity Identifier. |
ogrn_code_match | 0.95 | Two entities have the same Russian company registration (OGRN) code. |
vessel_imo_mmsi_match | 0.95 | Two vessels have the same IMO or MMSI identifier. |
inn_code_match | 0.95 | Two entities have the same Russian tax identifier (INN). |
bic_code_match | 0.95 | Two entities have the same SWIFT BIC. |
identifier_match | 0.85 | Two entities have the same tax or registration identifier. |
weak_alias_match | 0.80 | The query name is exactly the same as a result's weak alias. |
country_mismatch | -0.20 | The two entities are linked to different countries. |
last_name_mismatch | -0.20 | The two persons have different last names. |
dob_year_disjoint | -0.15 | The two entities have no birth year in common. |
dob_day_disjoint | -0.20 | The two entities have no full birth date (to the day) in common. |
gender_mismatch | -0.20 | The two entities have different genders associated with them. |
orgid_disjoint | -0.20 | Two companies or organizations have different tax identifiers or registration numbers. |
numbers_mismatch | -0.10 | Find numbers in names and addresses and penalise different numbers. |
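
The "basic score, then qualify" idea can be sketched as follows. The weights are a subset of the table above. The combination rule (take the strongest weighted match feature, then add the weighted contradicting features and clamp to the 0 to 1 range) is an assumption for illustration, and `combine` is a hypothetical name.

```python
# Subset of the weights in the table above; the sign decides the role here.
WEIGHTS = {
    "name_literal_match": 1.00,
    "person_name_phonetic_match": 0.90,
    "person_name_jaro_winkler": 0.80,
    "identifier_match": 0.85,
    "country_mismatch": -0.20,
    "last_name_mismatch": -0.20,
    "dob_year_disjoint": -0.15,
    "numbers_mismatch": -0.10,
}


def combine(features):
    """features maps a feature name to its value in [0, 1]."""
    weighted = [WEIGHTS.get(name, 0.0) * value for name, value in features.items()]
    # Assumed rule: the best positive feature sets the basic score ...
    base = max((s for s in weighted if s > 0), default=0.0)
    # ... and every contradicting feature then lowers it.
    penalty = sum(s for s in weighted if s < 0)
    return min(1.0, max(0.0, base + penalty))


score = combine({
    "person_name_jaro_winkler": 1.0,
    "identifier_match": 1.0,
    "country_mismatch": 1.0,
})
```

Here the identifier match (0.85) outweighs the name feature (0.80) as the basic score, and the country mismatch brings the result down to roughly 0.65.
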
A simple matching algorithm based on a regression model.
Feature | Weight | Description |
---|---|---|
name_match | 0.70 | Check for exact name matches between the two entities. |
name_token_overlap | 0.76 | Evaluate the proportion of identical words in each name. |
name_numbers | -0.14 | Find if names contain numbers, score if the numbers are different. |
name_levenshtein | 0.13 | Consider the edit distance (as a fraction of name length) between the two most similar names linked to both entities. |
phone_match | 0.03 | Matching phone numbers between the two entities. |
email_match | 0.01 | Matching email addresses between the two entities. |
identifier_match | 0.10 | Matching identifiers (e.g. passports, national ID cards, registration or tax numbers) between the two entities. |
dob_matches | 0.82 | The full birth date of the two entities is the same. |
dob_year_matches | 0.37 | The birth year of the two entities is the same. |
dob_year_disjoint | -0.54 | The two entities have no birth year in common. |
first_name_match | -0.06 | Matching first/given name between the two entities. |
family_name_match | 0.04 | Matching family name between the two entities. |
birth_place | -0.02 | Same place of birth. |
gender_mismatch | -0.18 | The two entities have different genders associated with them. |
country_mismatch | -0.24 | The two entities are linked to different countries. |
org_identifier_match | 0.35 | Matching identifiers (e.g. registration or tax numbers) between two organizations or companies. |
address_match | 0.56 | Text similarity between addresses. |
address_numbers | 0.07 | Find if addresses contain numbers, score if the numbers are different. |
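
To show how regression weights of this kind are typically turned into a score, the sketch below treats them as coefficients of a logistic model. The page does not state the model type, the intercept, or any feature scaling, so the logistic form, the `intercept` parameter, and the function name `regression_score` are all assumptions. The coefficients are a subset of the table above.

```python
import math

# Subset of the weights in the table above, read as model coefficients.
COEFFICIENTS = {
    "name_match": 0.70,
    "name_token_overlap": 0.76,
    "dob_matches": 0.82,
    "dob_year_disjoint": -0.54,
    "country_mismatch": -0.24,
}


def regression_score(features, intercept=0.0):
    """Logistic score from feature values; intercept is a placeholder (assumed)."""
    z = intercept + sum(
        COEFFICIENTS.get(name, 0.0) * value for name, value in features.items()
    )
    return 1.0 / (1.0 + math.exp(-z))


agreeing = regression_score({"name_match": 1.0, "dob_matches": 1.0})
conflicting = regression_score({"name_match": 1.0, "dob_year_disjoint": 1.0})
```

Positive coefficients push the score towards 1 and negative ones towards 0, so the pair with a matching birth date scores higher than the pair with disjoint birth years.
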