Picking query filters for a screening process

The database includes hundreds of data sources - government watchlists, research databases, contextual resources. Not all screening processes will need to query the whole dataset, and picking your scope carefully is key to reducing false positive alerts.

The database has various filtering mechanisms that expose technical and legal criteria which can be used to limit the scope of a query. Filtering occurs at two levels: the filtering of data sources, and the filtering of classes of entities described within those data sources. Keep in mind that a single entity can be sourced from multiple data sources.

  • Entity types (jargon: Schema) describe categories of logical entities: Person, Company, Vessel, CryptoWallet. If you are screening a list that includes both organizations and natural persons, use LegalEntity – it's an umbrella term for both. Schemata are explained in the data dictionary.
  • Collections are scopes that combine different data sources with similar meaning. The whole database is contained in the default collection, while sanctions is a subset of sources limited to government-issued sanctions lists. Inside of that, us_sanctions limits the scope to only US (federal) watchlists, and eu_sanctions combines EU and member state watchlists. Additional collections are listed here.
  • Specific data sources (e.g. us_ofac_sdn) can also be used as a filter. For example, you may wish to query all sanctions lists, except those published by China (cn_sanctions) and Russia (ru_mfa_sanctions).
  • Risk topics are a taxonomy of the risk factors that apply to specific entities. For example, a person can be tagged role.pep or sanction, while a company might be tagged sanction.linked or reg.warn.
    • Topics identify a category of risk, but not its origin. Sanction entities close that gap: they're linked to companies and people, and detail the name of the sanctioning authority, the reason, time span, and measures imposed.
  • Sanctions programs detail the individual policy instruments under which an entity was designated. For example, the US sanctions list includes companies and people sanctioned under 30+ programs, most of them linked to a specific geopolitical conflict (e.g. Ukraine) or topical focus (e.g. cyber warfare).
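
To make the entity-type filter concrete, here is a minimal sketch of a match request body for a list that mixes people and organizations. The body shape (a `queries` map whose entries carry `schema` and `properties`) is an assumption based on the yente matching API and should be checked against the API reference. The helper `build_query` and the example names are hypothetical.

```python
import json


def build_query(name, schema="LegalEntity"):
    """Build one match query (hypothetical helper).

    LegalEntity is used as the default because it covers both
    natural persons and organizations.
    """
    return {"schema": schema, "properties": {"name": [name]}}


# Hypothetical screening list with one company and one person.
names = ["Example Trading LLC", "Jane Example"]

# Assumed request body shape: {"queries": {<key>: <query>, ...}}
payload = {
    "queries": {f"q{i}": build_query(n) for i, n in enumerate(names, start=1)}
}

print(json.dumps(payload, indent=2))
```

If the list is known to contain only people, passing `schema="Person"` narrows the candidate set further.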

In Practice

Using the default collection endpoint (/match/default) is a good place to start. Pick the relevant topics (e.g. sanction and sanction.linked for sanctions screening; add role.pep, role.rca and reg.action for basic AML checks), and run some experiments.

Then, use the include_dataset and exclude_dataset parameters to either pick a custom set of sources, or exclude sources that don't have regulatory relevance and produce false positives. The include_dataset argument restricts a query to a select set of datasets: /match/default?include_dataset=us_ofac_sdn&include_dataset=us_ofac_cons. The exclude_dataset argument removes a specific dataset from a collection query: /match/default?exclude_dataset=iq_aml_list.
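
The repeated query parameters can be assembled programmatically. The sketch below only builds the URL and sends no request. The host in `BASE_URL`, the helper `build_match_url`, and the use of a repeatable `topics` query parameter are assumptions to verify against the API reference; `include_dataset` and `exclude_dataset` are taken from the text above.

```python
from urllib.parse import urlencode

# Assumption: hosted API base URL; an on-premise yente instance
# would use its own host instead.
BASE_URL = "https://api.opensanctions.org"


def build_match_url(collection, include=(), exclude=(), topics=()):
    """Build a /match URL with repeated filter parameters (hypothetical helper)."""
    params = []
    params += [("include_dataset", d) for d in include]
    params += [("exclude_dataset", d) for d in exclude]
    # Assumption: topics can be passed as a repeated query parameter.
    params += [("topics", t) for t in topics]
    url = f"{BASE_URL}/match/{collection}"
    query = urlencode(params)
    return f"{url}?{query}" if query else url


print(build_match_url("default", include=["us_ofac_sdn", "us_ofac_cons"]))
print(build_match_url("default", exclude=["iq_aml_list"]))
```

Passing a list of tuples to `urlencode` keeps each dataset as its own parameter, which matches the repeated-key form shown in the example URLs.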

Avoid using the peps collection. Instead, filter for the relevant topics (role.pep, role.rca, and poi) and consider implementing country filters.
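
One way to combine topic and country filters is on the client side, after results come back. The following is a minimal sketch under stated assumptions: each result is assumed to carry `properties.topics` and `properties.country` as lists, the helper `keep_result` and the sample entities are hypothetical, and a real policy may also need to consider properties such as nationality or citizenship.

```python
PEP_TOPICS = {"role.pep", "role.rca", "poi"}


def keep_result(entity, topics=PEP_TOPICS, countries=None):
    """Keep an entity if it has a relevant topic and, optionally, a relevant country."""
    props = entity.get("properties", {})
    if not topics.intersection(props.get("topics", [])):
        return False
    if countries is None:
        return True
    # Assumption: country codes are stored in the `country` property.
    return bool(set(countries).intersection(props.get("country", [])))


# Hypothetical results, shaped like the assumed API response entities.
results = [
    {"id": "ex-1", "properties": {"topics": ["role.pep"], "country": ["de"]}},
    {"id": "ex-2", "properties": {"topics": ["role.pep"], "country": ["br"]}},
    {"id": "ex-3", "properties": {"topics": ["sanction"], "country": ["de"]}},
]

kept = [r["id"] for r in results if keep_result(r, countries={"de", "fr"})]
print(kept)  # ['ex-1']
```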

On-premise: Using a manifest to create custom collections

When using the on-premise version of yente, you can also use the custom datasets function to define custom collections. To do this, add a manifest file like this:

catalogs:
  - url: "https://data.opensanctions.org/datasets/latest/index.json"
    # Limit the dataset scope of the entities which will be indexed into yente. Useful
    # values include `default`, `sanctions` or `peps`. This will speed up the update
    # process in which data is re-indexed.
    scope: sanctions
    resource_name: entities.ftm.json
datasets:
  - name: europe
    title: European datasets
    datasets:
      - eu_fsf
      - eu_travel_bans
      - eu_sanctions_map
      - be_fod_sanctions
      - fr_tresor_gels_avoir
      # - gb_hmt_sanctions

This will create a new dataset collection named europe, which can be used in query endpoints, e.g. /match/europe and /search/europe.

Caveats

  • Entity profiles returned from the API will always include attribute values from all data sources in the database. Filtering by data source only affects which entities can be returned; it does not remove properties (like aliases or dates of birth) found in other sources. You can achieve a different behavior by using statement data or by using a custom manifest in an on-premise installation of yente.
  • In rare cases, sanctions data sources list secondary entities which are not sanctioned. Hence, it’s possible for an entity to feature a sanctions dataset as a source, but not be tagged with the sanction topic.
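
The second caveat suggests checking the topic rather than the source dataset when deciding whether a match is actually sanctioned. A minimal sketch, assuming each entity carries a top-level `datasets` list and a `properties.topics` list; both helpers and the sample entity are hypothetical.

```python
def has_sanctions_source(entity, sanctions_datasets):
    """True if the entity appears in at least one of the given sanctions datasets."""
    return bool(set(entity.get("datasets", [])) & set(sanctions_datasets))


def is_sanctioned(entity):
    """True only if the entity itself is tagged with the sanction topic."""
    return "sanction" in entity.get("properties", {}).get("topics", [])


# Hypothetical secondary entity: listed in a sanctions source,
# but not itself designated.
secondary = {
    "id": "ex-secondary",
    "datasets": ["us_ofac_sdn"],
    "properties": {"topics": []},
}

print(has_sanctions_source(secondary, {"us_ofac_sdn"}))  # True
print(is_sanctioned(secondary))  # False
```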