tl;dr: New matching algorithm! logic-v2 reduces false positives, is fully deterministic, improves cross-language and cross-script matching, runs fast, and provides detailed explanations of its decision-making path.
What problem we set out to solve
Matching names in sanctions screening is a deceptively complex balancing act. (You can read our recent article about the technical problems that can arise from name matching in sanctions screening for more.) Analysts want to make sure they receive all relevant alerts, while keeping false positive noise to a minimum. Underlying that process is often incomplete and messy data, sourced from multiple, time-honored (read: “ancient and confused”) systems.
The key questions we set out to answer: How do we compare names in a way that tolerates small spelling variations (Brian vs. Bryan), respects local name conventions, maps abstract equivalencies such as the Russian OAO vs. Joint Stock Company and 14th vs. Fourteenth, and still avoids accidentally matching similar, but different names?
The road not taken
We’ve chosen to use a rule-driven approach so that results are explainable and consistent: the same input always yields the same output. While newer deep learning approaches (like the use of embeddings for name matching, explored in project Eridu) are phenomenal at surfacing a wide set of match candidates, the technology is less suited to help discern false positives – the industry's main pain point. “A herd of 384-dimensional vectors told me so” is also an awkward explanation to share with a regulator who demands to understand a screening system.
So if the system can’t have vibes, what can it have? We decided that the next best thing would be to provide the matcher with an expansive set of reference data – multi-lingual descriptions of many of the concepts in the domain of business names.
How logic-v2 works
The core idea is to understand what each part of a name represents before we compare anything. We annotate tokens with symbols taken from curated reference data: organization types like LLC, generic business words like Holding or Industries, human names (John, Smith), countries (Palestine), and so on. Each symbol collects multiple spellings and aliases across languages and scripts. This not only helps the engine understand that business and бизнес mean the same thing, but also allows us to de-prioritize this well-known term relative to more specific name parts (a human analyst would understand that Sparkles Business Corporation is really mostly Sparkles). Common filler terms in company names, such as Holding or Industries, are treated as soft stop words, so they contribute less weight to the overall score than the distinctive core of the name.
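To make the idea concrete, here is a minimal sketch of symbol tagging with token weights. The symbol table, tag names, and weight values are illustrative assumptions, not the production reference data or implementation:

```python
# A minimal sketch of symbol tagging – the table, tag names and weights
# below are illustrative assumptions, not the production reference data.
SYMBOLS = {
    "llc":         ("ORGTYPE:LLC", 0.1),       # organization type
    "corporation": ("ORGTYPE:CORP", 0.1),      # organization type
    "business":    ("GENERIC:BUSINESS", 0.3),  # generic business word
    "бизнес":      ("GENERIC:BUSINESS", 0.3),  # cross-script alias of "business"
    "holding":     ("GENERIC:HOLDING", 0.3),   # soft stop word
}

def tag_tokens(name: str) -> list[tuple[str, str | None, float]]:
    """Annotate each token with a symbol (if known) and a score weight."""
    tagged = []
    for token in name.lower().split():
        symbol, weight = SYMBOLS.get(token, (None, 1.0))  # unknowns keep full weight
        tagged.append((token, symbol, weight))
    return tagged

print(tag_tokens("Sparkles Business Corporation"))
# [('sparkles', None, 1.0), ('business', 'GENERIC:BUSINESS', 0.3),
#  ('corporation', 'ORGTYPE:CORP', 0.1)]
```

Because business and бизнес resolve to the same symbol, the two spellings compare as equal, while the distinctive token Sparkles keeps its full weight in the score.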
Once we’ve used symbols to explain and match most of a given name, we can be much stricter and more precise in how we apply fuzzy string matching techniques to its remainder. For example:
Comparing Sparkles Business Corporation vs. Sporks Business Corporation using basic Levenshtein scoring – the percentage of characters that need to be changed to turn one text into the other – produces an 89% match (alarm!). Tagging Business and Corporation as generic symbols, however, narrows the comparison to Sparkles vs. Sporks – which fails with a mere 50% match.
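To see the arithmetic, here is a small self-contained sketch using a simple normalized Levenshtein similarity. The exact percentages depend on the normalization chosen, so this toy version lands near, but not exactly on, the figures above:

```python
# Illustrating why whole-string fuzzy matching misleads – a simple
# normalized Levenshtein similarity, not the production scorer.
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def similarity(a: str, b: str) -> float:
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

# The shared generic tokens inflate the whole-string score ...
print(similarity("sparkles business corporation",
                 "sporks business corporation"))   # ~0.90
# ... while the distinctive cores are clearly different:
print(similarity("sparkles", "sporks"))            # ~0.62
```

The three edits that separate the names are diluted across 29 characters in the full strings, but dominate the 8-character core – which is exactly the effect symbol tagging exploits.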
Conversely, Stripe LLC vs. Stripe, Limited Liability Corporation is a hopeless challenge for plain text matching, but both normalize to Stripe [ORGTYPE:LLC], which produces a precise match.
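A sketch of that normalization step; the alias table and symbol names are illustrative, and a real implementation would match on token boundaries rather than raw substrings:

```python
import re

# Sketch of org-type normalization – the alias table is illustrative.
ORG_TYPES = {
    "llc": "ORGTYPE:LLC",
    "limited liability corporation": "ORGTYPE:LLC",
    "oao": "ORGTYPE:JSC",
    "joint stock company": "ORGTYPE:JSC",
}

def normalize(name: str) -> tuple[str, set[str]]:
    """Replace known org-type phrases with symbols; return core + symbols."""
    text = re.sub(r"[.,]", " ", name.lower())
    text = re.sub(r"\s+", " ", text).strip()
    symbols = set()
    # Longest aliases first, so multi-word phrases win over abbreviations.
    for alias, symbol in sorted(ORG_TYPES.items(), key=lambda kv: -len(kv[0])):
        if alias in text:
            symbols.add(symbol)
            text = text.replace(alias, "").strip()
    return text, symbols

print(normalize("Stripe LLC"))
print(normalize("Stripe, Limited Liability Corporation"))
# Both yield ('stripe', {'ORGTYPE:LLC'}) – an exact match on core and symbol.
```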
When it comes to individuals, name part awareness matters, too. First names only compare to first names, family names to family names, and middle initials can match with middle names. This reduces accidental near-matches between unrelated people.
We've also decided to apply transliteration more selectively than in logic-v1. For language pairs where romanization is reliable, it has proven very helpful. For other scripts, conversion tends to cause an unacceptable loss of precision, so we keep comparisons in the native script and rely on symbols to provide cross-script matches.
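In code terms, the policy might look like the following sketch; the script groupings and the toy romanization table are assumptions for illustration, not the shipped implementation:

```python
# Illustrative transliteration policy – the script set and the toy
# romanization table are assumptions, not the shipped implementation.
RELIABLE = {"Cyrillic"}  # scripts where romanization holds up well
CYR = str.maketrans("абвгдезиклмнопрстуф", "abvgdeziklmnoprstuf")

def romanize(text: str) -> str:
    return text.translate(CYR)  # toy transliteration for the demo

def comparable_form(name: str, script: str) -> str:
    if script in RELIABLE:
        return romanize(name)   # reliable: compare in Latin script
    return name                 # otherwise: stay native, rely on symbols

print(comparable_form("бизнес", "Cyrillic"))    # -> "biznes"
print(comparable_form("عبد الكريم", "Arabic"))  # stays in Arabic script
```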
What data powers the symbols
It’s worth pointing out that while a lot of the reference data used in the new matcher is hand-curated (and open source, see rigour data) we have once again found invaluable help in Wikidata, the structured data twin of Wikipedia. Using various filters, we have picked over 161,000 items from its taxonomy of given and family names, with a total of 1.16 million different spellings.
To illustrate the matching intelligence derived from this, it’s worth showing a single name symbol in full:
abd al-karim, abd al-karím, abd al-karīm, abd el-karím, abd-al-karim, abdalkarím, abdelkarím, abdu 'l-karím, abdul karim, abdulkarim, abdulkarím, abdülkerim, əbdülkərim, абд аль-карим, աբդլքարիմ, עבד אל-כרים, عبد الكريم, عبد الکریم, عبدالكريم, عبدالکریم, আবদুল করিম, アブドゥル・カリーム, 阿卜杜勒·卡里姆 => Q317304
We’re not in SOUNDEX land any more.
The same logic also applies to organization classes: OAO, Joint Stock Company, and Aktiengesellschaft are all normalized to a shared concept, which lets the matcher recognize them as potential synonyms regardless of language or abbreviation.
Beyond symbols: smarter name handling
One of the biggest changes in logic-v2 is how it treats name parts. In logic-v1, if you sent firstName and lastName separately, the system would simply concatenate them and run a full-string comparison. This often led to inflated similarity scores or missed matches. On top of symbolic matching, the new matcher also makes use of name part type annotations (a first name and a family name will not be matched against each other) and supports middle initials (Jesus H. Christ matches middleName:Helmut). This means the matcher is far more precise when you provide structured inputs. For example:
- Query: {"firstName": "John", "middleName": "A", "lastName": "Smith"}
- Result: John A. Smith ranks much more strongly than John B. Smith.
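Here is a sketch of how such name-part pairing rules might work; the logic below is illustrative, not the production matcher, which also applies fuzzy scoring within each part:

```python
# Illustrative name-part pairing rules – not the production matcher,
# which also applies fuzzy scoring within each name part.
def parts_compatible(query: dict, candidate: dict) -> bool:
    """First names pair with first names, family names with family
    names, and a middle initial may match a full middle name."""
    for field in ("firstName", "lastName"):
        if query.get(field, "").lower() != candidate.get(field, "").lower():
            return False
    qm = query.get("middleName", "").lower().rstrip(".")
    cm = candidate.get("middleName", "").lower().rstrip(".")
    if qm and cm:
        # The initial "A" matches "Albert", but never "B".
        return qm == cm or cm.startswith(qm) or qm.startswith(cm)
    return True  # a missing middle name does not block the match

q = {"firstName": "John", "middleName": "A", "lastName": "Smith"}
print(parts_compatible(q, {"firstName": "John", "middleName": "Albert",
                           "lastName": "Smith"}))  # True
print(parts_compatible(q, {"firstName": "John", "middleName": "B",
                           "lastName": "Smith"}))  # False
```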
And as described above, transliteration is now applied selectively, only for languages where romanization produces results of suitable quality for strong matches.
Explainability that travels well
Besides the new screening logic, we’ve also reworked the way we describe each match result. Until now, screening a person by name, date of birth, and nationality would return an overall similarity score, plus the individual weighted scores generated by the name matching, date matching, and country matching functions.
We’ve now expanded this further with small explanations, such as “Entities both linked to country X” or “First names match, last names are a fuzzy 0.67 match.” The goal of these narrative results is to provide further transparency on how decisions are made, in particular in a form that can serve as input to a generative AI in a case management system, which can then propose the steps an analyst needs to take to qualify a potential match.
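For illustration, an explained result might take a shape like the following; the field names here are hypothetical, not the documented response schema:

```python
# Hypothetical shape of an explained match result – field names are
# illustrative, not the documented yente response schema.
explained = {
    "score": 0.87,
    "features": {
        "name_similarity": 0.91,  # weighted name matching score
        "dob_match": 1.0,         # date of birth comparison
        "country_match": 1.0,     # nationality / country links
    },
    "explanations": [
        "Entities both linked to country X",
        "First names match, last names are a fuzzy 0.67 match.",
    ],
}
```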
Using logic-v2 in practice
logic-v2 ships with yente 5.0. You can enable the algorithm by adding algorithm=logic-v2 to the /match API. Please note that the scores returned by logic-v2 will naturally differ from those returned by logic-v1, and adopting the new system will produce a different set of match alerts.
You can also supply a small set of configuration parameters by adding a config block to the request body (see /algorithms on the API for a description of the available parameters). This means you can fine-tune thresholds and weights to match your specific context, for instance if you want stricter cut-offs for sanctions screening or a broader tolerance for KYC.
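As a sketch, a call against a local yente instance might look like this; the dataset scope, query shape, and the empty config block are illustrative, so check the /algorithms endpoint for the parameters your deployment actually supports:

```python
import requests

# Sketch of a /match request using logic-v2 – endpoint layout and query
# shape are illustrative; consult your yente deployment's API docs.
response = requests.post(
    "http://localhost:8000/match/default",  # "default" dataset scope assumed
    params={"algorithm": "logic-v2"},
    json={
        "queries": {
            "q1": {
                "schema": "Person",
                "properties": {
                    "firstName": ["John"],
                    "middleName": ["A"],
                    "lastName": ["Smith"],
                },
            }
        },
        # Fine-tune thresholds and weights here – see /algorithms for
        # the parameters that logic-v2 actually accepts.
        "config": {},
    },
    timeout=30,
)
response.raise_for_status()
for result in response.json()["responses"]["q1"]["results"]:
    print(result["score"], result.get("caption"))
```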
Using logic-v2, you should see fewer irrelevant hits on generic company names, cleaner cross-script person matches, and explanations that make it clear what to verify next. Because symbol tagging removes filler terms, fuzzy distance is applied only to the distinctive core of the name, which is where precision matters most.
Limitations and what’s next
We believe that logic-v2 will make for a significant improvement in the experience of API users: reduced false positives, better cross-alphabet matches, more explainability. It’s not revolutionary, but it is much better.
What has us particularly excited: the reliance on extensive cultural reference data makes logic-v2 a great target for continuous improvement. This is where we will come to rely on your kind cooperation: please share examples of false results – positive or negative – with our support team so that we can work together to fine-tune and evolve the system over time, making sure it meets the screening needs of a broad set of organizations.
As always, we welcome feedback and real-world cases; these are what allow us to keep making the matcher sharper for everyone.