The OpenSanctions API - both the easy-to-use hosted service and the self-hosted option - provides a simple way to submit a set of entity descriptions (e.g. a list of customers, business counterparties, or subjects of an investigation) and check for their presence on a sanctions or PEP list.
Until now, this matching API has used a simple statistical model to assign a match score to each result it returns. With the new release of `yente` 3.4, we've made that mechanism more flexible: clients can now select one of a set of supported algorithms to optimise the behaviour of the API for their use case.
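To make this concrete, here is a minimal sketch of selecting an algorithm from Python. The host name and query are placeholders, and the endpoint path, request envelope, and parameter name reflect a reading of the matching API - please check the API documentation for your deployment for the exact request format.

```python
import requests

# Placeholder URL for a self-hosted yente instance; adjust to your deployment.
YENTE_URL = "http://localhost:8000"

# One query describing the entity to be screened (illustrative values).
payload = {
    "queries": {
        "q1": {
            "schema": "Person",
            "properties": {
                "name": ["John Doe"],
                "birthDate": ["1975-04-21"],
            },
        }
    }
}

# The scoring algorithm is selected per request via a query parameter.
response = requests.post(
    f"{YENTE_URL}/match/default",
    params={"algorithm": "name-qualified"},  # one of the algorithms described below
    json=payload,
)
response.raise_for_status()

# Print the scored candidates returned for our query.
for result in response.json()["responses"]["q1"]["results"]:
    print(result["id"], result["caption"], result["score"])
```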
With the new release, we've added three new scoring systems to augment the existing model (now called `regression-v1`; it is used as the default if no other algorithm is specified):
- `regression-v2` is a new statistical model for matching people and companies. Unlike `regression-v1`, it uses pronunciation-based (phonetic/Soundex) comparison for entity names, and it has reduced the impact of birth dates as a decision criterion. The new model will generally produce much lower scores for results, so you may want to reduce your matching `threshold` parameter in the API to 0.5 or 0.6.
- `name-based` is a simple scoring mechanism based on name similarity only. It uses two criteria, the Jaro-Winkler string distance mechanism and the Soundex phonetic algorithm (a rough sketch of this kind of comparison follows after this list). This can be a useful tool to conduct matching on data where you only have entity names, and no other details such as birth dates, nationalities, etc.
- `name-qualified` uses the score from the `name-based` mechanism, but then considers other criteria, such as birth dates, nationalities, and tax and registration identifiers. If any of these mismatch between the query and the result, the score is lowered. This attempts to anticipate a simple review process that a human analyst might otherwise undertake when a result is found.
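To give a feel for the two criteria behind `name-based`, here is a small, purely illustrative sketch using the jellyfish library. It is not the actual implementation (the real weighting and normalisation differ), just the kind of comparison the mechanism combines:

```python
import jellyfish

def name_similarity(query_name: str, result_name: str) -> float:
    """Illustrative blend of string and phonetic matching for single name tokens."""
    # Jaro-Winkler: string similarity in [0, 1], tolerant of small spelling differences.
    string_score = jellyfish.jaro_winkler_similarity(query_name, result_name)
    # Soundex: 1.0 if both names reduce to the same phonetic code, else 0.0.
    phonetic_score = 1.0 if jellyfish.soundex(query_name) == jellyfish.soundex(result_name) else 0.0
    # Equal weighting here is an assumption for illustration only.
    return 0.5 * string_score + 0.5 * phonetic_score

print(name_similarity("Smith", "Smyth"))     # similar spelling, same Soundex code -> high
print(name_similarity("Smith", "Martinez"))  # unrelated name -> low
```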
You can read more about these mechanisms and inspect their detailed scoring criteria. But what's even more exciting: by making the matching logic of `yente` into a configurable component, we can now keep adding specialised scoring systems without breaking backward compatibility. And, because it's open source, you could, too.
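To sketch what such a contribution could look like - this is a hypothetical example, not the actual extension interface of yente or its matching library - a custom algorithm is essentially a function that turns a query/result pair into named feature scores and an overall score:

```python
from typing import Dict, List

# Hypothetical sketch of a pluggable scorer; the real interface may differ.
def exact_name_algorithm(
    query: Dict[str, List[str]], result: Dict[str, List[str]]
) -> Dict[str, float]:
    """Score 1.0 if any query name exactly matches a result name (case-insensitive)."""
    query_names = {name.lower() for name in query.get("name", [])}
    result_names = {name.lower() for name in result.get("name", [])}
    name_match = 1.0 if query_names & result_names else 0.0
    return {"name_match": name_match, "score": name_match}

print(exact_name_algorithm(
    {"name": ["ACME Holdings"]},
    {"name": ["Acme Holdings", "ACME Ltd."]},
))  # -> {'name_match': 1.0, 'score': 1.0}
```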
In the future, we can add algorithms that introduce a more human-like understanding of names, or use name frequencies to predict the likelihood of a certain name being unique. (Inserting a sentence here about the future application of OpenAI's GPT will be left as an exercise to the inclined reader.)
We're keen to hear your feedback on this change, and on what our next steps with customised scoring should be!