Frequently asked questions

#162: What are the differences between the main dataset and the KYB collection?

Category: Bulk data

Next to the main OpenSanctions dataset, we also publish the Know-Your-Business (KYB) collection, which includes full data extracts of several company registries and reference datasets.

What unifies them is that they are too large to include in the default collection, and that most of the entities they contain carry no risk indication (e.g. a company register consists overwhelmingly of companies that are not sanctioned or subject to regulatory action).

We use the datasets in the KYB collection for an enrichment process. It identifies companies and people in the KYB datasets that are also mentioned in watchlists, and then copies additional details and associated entities from the KYB sources into our default dataset.

Caveats

This has several consequences:

  • The datasets published in the KYB collection are not deduplicated. Different references to the same person or company in these datasets are not merged into a single entity.
  • The KYB sources are updated less often (usually once a week or once a month), because the size of the sources makes intra-day updates uneconomical.
  • There's an overlap of entities between our main dataset and the KYB datasets, namely the entities that have been "copied in" by the enrichment process. Inside the default dataset, each KYB source is represented by a dataset with the prefix ext_ that contains only the entities copied in from that source.
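The ext_ prefix makes it possible to pick out the KYB-derived part of the default dataset by dataset name. The following is a minimal sketch, assuming each exported entity carries a `datasets` list; the helper name, entity IDs and dataset names are made up for illustration and do not refer to real records:

```python
def kyb_derived_datasets(entity: dict) -> list[str]:
    """Return the ext_-prefixed dataset names an entity belongs to."""
    return [name for name in entity.get("datasets", []) if name.startswith("ext_")]


# Hypothetical records: IDs and dataset names are illustrative only.
entities = [
    {"id": "ent-1", "datasets": ["example_sanctions", "ext_example_registry"]},
    {"id": "ent-2", "datasets": ["example_sanctions"]},
]

# Entities that received details from at least one KYB source.
enriched_ids = [e["id"] for e in entities if kyb_derived_datasets(e)]
```

Here only the first record is reported as enriched, because it is the only one listed in an ext_ dataset.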

Using both collections in yente

The overlap in scope between default and kyb means that indexing both collections into the same instance of yente will lead to a situation where certain entity IDs are written by both indexers. The result is a sort of race condition, in which whichever indexer writes last determines the stored version of the entity. A workaround is to set namespace: true in the KYB dataset definitions in the manifest file: this appends a hash to the end of the KYB-sourced entity IDs, which avoids the collision.
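The collision and the effect of namespacing can be illustrated with a small, self-contained sketch. It uses a plain dictionary as a stand-in for the search index, and the `namespaced_id` helper and its hashing scheme are assumptions made for illustration, not the scheme yente actually uses:

```python
import hashlib


def namespaced_id(entity_id: str, namespace: str) -> str:
    """Append a short hash derived from the namespace and ID (illustrative scheme)."""
    digest = hashlib.sha1(f"{namespace}:{entity_id}".encode("utf-8")).hexdigest()
    return f"{entity_id}.{digest[:10]}"


# Hypothetical documents for the same entity ID, one from each collection.
default_doc = {"id": "ent-1", "source": "default", "name": "Example Ltd (enriched)"}
kyb_doc = {"id": "ent-1", "source": "kyb", "name": "Example Ltd"}

# Without namespacing, both indexers write the same key and the last write wins.
shared_index = {}
for doc in (default_doc, kyb_doc):
    shared_index[doc["id"]] = doc

# With namespacing, the KYB copy is stored under a distinct ID.
safe_index = {default_doc["id"]: default_doc}
safe_index[namespaced_id(kyb_doc["id"], "kyb")] = kyb_doc
```

In the first index the enriched default record is silently overwritten by the KYB copy. In the second, both records are kept because the KYB-sourced ID no longer matches the default one.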

Outlook

We hope to expand the scope of the KYB collection opportunistically as we discover data sources that cover regions and topics relevant to the core dataset. It's important to note that it is not our ambition to develop a comprehensive global companies database. Our friends at OpenCorporates are using open data to build such a data product.
