Category: Bulk data
Next to the main OpenSanctions dataset, we also publish the Know-Your-Business (KYB) collection, which includes full data extracts of several company registries and reference datasets.
What unites them is that they are too large to include in the default collection, and that most of their entities carry no risk indication (e.g. a company register consists overwhelmingly of companies that are not sanctioned or subject to regulatory action).
We use the datasets in the KYB collection for an enrichment process: it identifies companies and people in the KYB datasets that are also mentioned in watchlists, and then adds further details and associated entities from the KYB sources to our default dataset.
This has several consequences:
- In the default dataset, each KYB source is represented by a dataset with the prefix ext_ that holds the entities brought in through this enrichment (example).
- In yente, the overlap in scope between default and kyb means that indexing both collections into the same instance will lead to a situation where certain entity IDs are written by both indexers. The result is a kind of race condition, which is undesirable. A workaround is to set namespace: true in the KYB dataset definitions in the manifest file: this appends a hash to the end of the KYB-sourced entity IDs, which avoids the collision.
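The ID collision and the effect of namespacing can be illustrated with a minimal Python sketch. It is a conceptual model only, not yente's implementation: the helper namespaced_id, the SHA-1 based suffix, the 12-character truncation, and the sample entity ID are all assumptions made for illustration. The actual suffix yente produces when namespace: true is set may be derived differently.

```python
import hashlib


def namespaced_id(entity_id: str, dataset_name: str) -> str:
    """Append a short, deterministic hash suffix to an entity ID.

    Hypothetical scheme for illustration; not yente's actual algorithm.
    """
    payload = f"{dataset_name}:{entity_id}".encode("utf-8")
    digest = hashlib.sha1(payload).hexdigest()
    return f"{entity_id}.{digest[:12]}"


# Hypothetical entity ID that appears in both collections.
shared_id = "example-company-123"

# Without namespacing, both indexers write to the same key,
# so whichever indexer runs last overwrites the other.
index_without_namespace = {}
index_without_namespace[shared_id] = {"indexed_from": "default"}
index_without_namespace[shared_id] = {"indexed_from": "kyb"}

# With namespacing applied to the KYB side, the two writes
# land on distinct keys and neither overwrites the other.
index_with_namespace = {}
index_with_namespace[shared_id] = {"indexed_from": "default"}
index_with_namespace[namespaced_id(shared_id, "kyb")] = {"indexed_from": "kyb"}

print(len(index_without_namespace), len(index_with_namespace))
```

One consequence visible in the sketch is that a namespaced KYB entity no longer shares its ID with its counterpart in default, so lookups by the original ID will only reach the copy indexed from default.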
We hope to expand the scope of the KYB dataset opportunistically as we discover data sources that cover regions and topics relevant to the core dataset. It's important to note that it is not our ambition to develop a comprehensive global companies database. Our friends at OpenCorporates are also using open data to build such a data product.
OpenSanctions is free for non-commercial users. Businesses must acquire a data license to use the dataset.