Identifiers and deduplication

Every entity carries an ID that can change as duplicates are merged across sources — here's how the system works and how to keep your references current.

OpenSanctions collects data from hundreds of sources. The same person or company often appears on multiple lists. We merge these duplicates into consolidated profiles, each with a stable canonical ID. The sections below explain how IDs work and what you need to do to stay in sync.

Note: This page covers the IDs OpenSanctions assigns to entities, not the passport numbers, tax IDs, or registration numbers found inside the data itself.

The short version

Every entity has an id field. For entities that have been deduplicated — merged with records from other sources — this is a canonical ID, either an NK- identifier (like NK-abcdef) or a Wikidata QID (like Q12345). Entities that appear in only one source and haven't been merged keep their original source ID (like ofac-81717 or eu-fsf-1234).

Deduplicated entities also carry a referents array: all the other IDs that map to this canonical ID. These include the source-specific IDs from each list the entity appears on, and any previous canonical IDs that were retired when clusters merged.

If your system stores references to OpenSanctions entities — screening alerts, case files, exclusion lists — you need to keep those references current. On each data update, look up your stored IDs against both id and referents. If a stored ID now appears only in referents, it was merged or re-keyed: update your reference to the new canonical ID.

Practical guide

Storing references to OpenSanctions entities

When your system creates a record that references an OpenSanctions entity — a screening alert, a case file, a false-positive exclusion — store the entity id at the time you create the record.

To keep your stored entity IDs current, reconcile them as part of your integration. Build a lookup from every value in every entity's referents array to that entity's canonical id. Then check your stored IDs against the lookup: a stored ID that appears as a referent rather than as a current id belongs to an entity that was merged or re-keyed, so update your reference to the new canonical ID.
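The reconciliation step can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: it assumes each entity has already been parsed into a dict with a top-level id string and an optional referents list, and the function names build_referent_lookup and reconcile are hypothetical.

```python
from typing import Iterable, Optional


def build_referent_lookup(entities: Iterable[dict]) -> dict[str, str]:
    """Map every referent ID to the canonical ID of the entity listing it."""
    lookup: dict[str, str] = {}
    for entity in entities:
        for referent in entity.get("referents", []):
            lookup[referent] = entity["id"]
    return lookup


def reconcile(stored_ids: Iterable[str], entities: Iterable[dict]) -> dict[str, Optional[str]]:
    """Return stored ID -> current ID, or None if the ID is no longer resolvable."""
    entities = list(entities)
    current_ids = {entity["id"] for entity in entities}
    lookup = build_referent_lookup(entities)
    resolved: dict[str, Optional[str]] = {}
    for stored in stored_ids:
        if stored in current_ids:
            resolved[stored] = stored          # still a current id, nothing to do
        elif stored in lookup:
            resolved[stored] = lookup[stored]  # merged or re-keyed: use the new canonical ID
        else:
            resolved[stored] = None            # not found at all: needs manual attention
    return resolved
```

A stored ID that maps to None appears neither as an id nor as a referent in the current data, so it is worth flagging for review rather than silently dropping.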

If you use the API rather than bulk data, the /entities/<id> endpoint handles this automatically: looking up a referent ID produces an HTTP redirect to the current canonical entity.
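For API users, the redirect behaviour can be used to resolve a stored ID. The sketch below rests on several assumptions that are not stated on this page and should be checked against the API documentation: the base URL, the ApiKey authorization header, and the idea that the final URL after the redirect ends in the canonical ID. The helper names are hypothetical.

```python
import urllib.parse
import urllib.request

API_BASE = "https://api.opensanctions.org"  # assumption: adjust to your deployment


def entity_url(entity_id: str, base: str = API_BASE) -> str:
    """Build the /entities/<id> URL for a stored ID."""
    return f"{base}/entities/{urllib.parse.quote(entity_id, safe='')}"


def id_from_url(url: str) -> str:
    """Extract the last path segment of an entity URL, ignoring any query string."""
    path = urllib.parse.urlsplit(url).path.rstrip("/")
    return urllib.parse.unquote(path.rsplit("/", 1)[-1])


def resolve_via_api(entity_id: str, api_key: str) -> str:
    """Follow the redirect for a referent ID and return the ID it lands on."""
    request = urllib.request.Request(
        entity_url(entity_id),
        headers={"Authorization": f"ApiKey {api_key}"},  # assumption: auth scheme
    )
    # urlopen follows HTTP redirects by default; geturl() is the final URL.
    with urllib.request.urlopen(request) as response:
        return id_from_url(response.geturl())
```

If the returned ID differs from the one you asked for, the stored reference was a referent and can be updated in place.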

Unused referents are removed after a grace period

When a source delists an entity, or when two clusters merge and one canonical ID is retired, the old ID stays in referents for 6 months. After that, we remove it from the published data. If you import data on a regular schedule (daily or weekly), that window gives you enough time to detect and update stale references.
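One practical consequence: if your last successful reconciliation is older than the grace period, some stored IDs may have dropped out of referents without your system ever seeing the forwarding entry. A small guard like the following can detect that situation. It is a sketch that assumes "6 months" means calendar months; the exact cut-off semantics are not specified here, and the function names are hypothetical.

```python
import calendar
from datetime import date

GRACE_MONTHS = 6  # grace period stated above


def add_months(day: date, months: int) -> date:
    """Add calendar months, clamping to the last day of the target month."""
    index = day.month - 1 + months
    year = day.year + index // 12
    month = index % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(day.day, last_day))


def reconciliation_overdue(last_reconciled: date, today: date) -> bool:
    """True if referents retired right after the last run may already be gone."""
    return today > add_months(last_reconciled, GRACE_MONTHS)
```

When the guard returns True, unresolved stored IDs cannot be assumed to be deleted entities; they may have been merged during the gap.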

Deltas and change detection

For the consolidated entities.ftm.json export, we publish delta files that show what changed between two snapshots. These are useful for incremental ETL pipelines that don't need to diff the full dataset on every run.

When entities are merged, the delta reflects this as a set of DEL operations (one for each entity that was absorbed) and a single ADD for the surviving canonical entity with its updated properties and referents.
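As an illustration, a consumer might apply a delta to a local store like this. The sketch assumes the delta is a line-delimited JSON file in which each line has an "op" field and an "entity" object carrying at least an id; check the delta file documentation for the actual layout. Handling of a "MOD" operation is also an assumption, since only ADD and DEL are described above.

```python
import json
from typing import Iterable


def apply_delta(store: dict[str, dict], delta_lines: Iterable[str]) -> dict[str, dict]:
    """Apply delta operations to a dict of entities keyed by entity ID."""
    for line in delta_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        op = record["op"]
        entity = record["entity"]
        if op == "DEL":
            # Absorbed or removed entity: drop it from the local store.
            store.pop(entity["id"], None)
        elif op in ("ADD", "MOD"):  # "MOD" is an assumption, see lead-in
            # New or updated entity: store the full record, including referents.
            store[entity["id"]] = entity
        else:
            raise ValueError(f"unknown delta operation: {op!r}")
    return store
```

After a merge delta has been applied, the surviving entity's referents can be used to rewrite any stored references to the deleted IDs.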

Source-scoped exports

Each dataset also publishes its own entities.ftm.json. Entities in source-scoped exports use canonical IDs (if the entity has been deduplicated), so the id field is consistent across all exports. The properties, however, come exclusively from that source — no cross-source data leaks in. The one exception is referents, which reflects the global deduplication state and can include IDs from other sources that map to the same canonical entity.

How the identifier system works

You don't need this section to integrate the data, but it helps if you want to understand edge cases.

Source IDs

When a crawler collects data from a source, it generates an entity ID scoped to that dataset. These IDs are prefixed with a dataset key: ofac- for US OFAC, eu-fsf- for the EU sanctions list, and so on. The rest of the ID is usually derived from a stable key provided by the publisher (a record number, a reference code).

Source IDs are stable as long as the publisher doesn't renumber its records, which rarely happens. When it does (for example, when a smaller national authority restructures its data format), we absorb the change internally so it doesn't propagate to canonical IDs.

Canonical IDs and deduplication

Newly imported entities are compared against existing records using a scoring algorithm. Candidates above a high confidence threshold merge automatically. Below that, candidates are reviewed by an LLM-assisted system that applies domain-specific rules, or by a human analyst for ambiguous cases. The merged profile gets a canonical ID:

  • NK-IDs (NK-xxxxxx) — randomly generated unique identifiers assigned during deduplication.
  • Q-IDs (Q12345) — Wikidata item identifiers, used when a matching Wikidata entry exists. We prefer these for well-known persons. Our long-term goal is to identify all persons by QIDs, but this is a gradual process.
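The three ID shapes described so far can be told apart with a simple check. This is a sketch based only on the prefixes and examples given on this page; the function name id_kind and the returned labels are hypothetical, and the pattern for Wikidata QIDs (a Q followed by a positive integer) reflects the Wikidata convention.

```python
import re

QID_PATTERN = re.compile(r"Q[1-9][0-9]*")


def id_kind(entity_id: str) -> str:
    """Classify an entity ID as 'nk', 'wikidata', or 'source'."""
    if entity_id.startswith("NK-"):
        return "nk"
    if QID_PATTERN.fullmatch(entity_id):
        return "wikidata"
    # Anything else is treated as a dataset-scoped source ID (e.g. ofac-81717).
    return "source"
```

An entity whose id is classified as "source" has not been merged with records from other sources at the time of the export.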

Internally, a component called the resolver maintains a graph of relationships between all entity identifiers. When two clusters merge, one canonical ID survives and the other becomes a referent.
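To make the merge behaviour concrete, here is a toy model built on a union-find structure. It is not the actual resolver: the class name ToyResolver and its methods are hypothetical, and it only illustrates how a retired canonical ID and its former referents all end up pointing at the surviving ID.

```python
class ToyResolver:
    """Toy union-find over entity IDs; the survivor of a merge is chosen by the caller."""

    def __init__(self) -> None:
        self.parent: dict[str, str] = {}

    def find(self, entity_id: str) -> str:
        """Return the current canonical ID for any known or new ID."""
        self.parent.setdefault(entity_id, entity_id)
        root = entity_id
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every visited ID straight at the root.
        while self.parent[entity_id] != root:
            next_id = self.parent[entity_id]
            self.parent[entity_id] = root
            entity_id = next_id
        return root

    def merge(self, survivor: str, retired: str) -> str:
        """Merge two clusters; the canonical ID of `survivor` is kept."""
        keep, drop = self.find(survivor), self.find(retired)
        if keep != drop:
            self.parent[drop] = keep
        return keep

    def referents(self, entity_id: str) -> list[str]:
        """All other IDs that resolve to the same canonical ID."""
        root = self.find(entity_id)
        return sorted(i for i in list(self.parent) if i != root and self.find(i) == root)
```

For example, after merging ofac-81717 into NK-a, eu-fsf-1234 into NK-b, and then NK-a into NK-b, every one of those IDs resolves to NK-b, and NK-a is listed among its referents.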

Clusters can also be split — for example, when a merged profile turns out to contain two distinct people. Splits are always manual. The existing canonical ID stays with the larger portion of the cluster, and the split-off portion gets a new ID. The majority of references to the original entity remain valid after a split.

Timing: when deduplication happens

Data collection and deduplication are separate processes. When a new entity appears on a sanctions list, we publish it in the next data export with its source ID, typically within hours. Deduplication against existing records happens separately: for major sanctions lists, cross-referencing runs on a daily schedule, but the full merge can take 12 to 72 hours after the entity first appears.

During that window, a newly published entity may exist as a standalone record with only its source ID, even if the same person or company is already known under a canonical ID from other sources. Once deduplication runs, the entity joins its cluster and the source ID becomes a referent of the canonical. If you're building a system that matches incoming entities to existing profiles, be aware that very recently added entities may not yet be merged.

How referents expire

The referents array serves as a forwarding table: it lets you find the current canonical ID for any historical reference. Not all referents last forever.

A referent stops being relevant for one of two reasons:

  • A source stops publishing an entity. If a country removes someone from its sanctions list, the source ID for that entry becomes stale. It no longer appears in any active data, but it stays in referents so downstream systems can still resolve old references.
  • An intermediate canonical is absorbed. When two clusters merge, the retired canonical ID (NK-a) becomes a referent of the surviving one (NK-b). Once downstream consumers have had time to update, the intermediate ID is no longer needed.

In both cases, we keep the stale referent in the published data for 6 months after it stops being active, then remove it.

Entity consolidation

A consolidated entity is not simply a union of everything each source says about a person or company. When we merge source data into a single profile, we apply heuristics that select, reconcile, and sometimes suppress property values. The result is a curated view that aims to be more useful than the raw sum of its parts.

Dates

If one source lists a date of birth as 1972 and another as 1972-06-21, only the more precise value appears on the consolidated entity, because the less precise one adds no information. For provenance dates (like createdAt or modifiedAt), we keep only the earliest or latest value as appropriate.
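The precision rule for ordinary dates can be approximated with a prefix check. This sketch assumes dates are ISO-style strings of varying precision (1972, 1972-06, 1972-06-21) and drops any value that is a strict prefix of another; the production heuristics may differ, and the function name is hypothetical.

```python
def most_precise_dates(values: list[str]) -> list[str]:
    """Drop dates that are just a less precise form of another date in the list."""
    unique = set(values)
    return sorted(
        value
        for value in unique
        # A value is redundant if some other value starts with it.
        if not any(other != value and other.startswith(value) for other in unique)
    )
```

Dates that disagree rather than refine each other (for example 1972-06 and 1972-07-01) are both kept, since neither is a prefix of the other.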

Names

If a name appears as a full alias in one source but as a weakAlias in another, we treat it as a weak alias everywhere. This prevents name fragments (like "Butcher" or "Varo") from being promoted to full names and generating false positives in screening. We also remove case-duplicate names and names that don't contain any letters. For core sanctions lists (OFAC, FCDO, and a few others), names are never removed regardless of these rules, because consumers expect those lists to be reproduced verbatim.
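A simplified version of these name rules might look like the following. It is a sketch under several assumptions: names are compared case-insensitively, the first spelling seen is the one kept among case duplicates, and the exemption for core sanctions lists is left out entirely. The function name is hypothetical.

```python
def consolidate_names(full_names: list[str], weak_aliases: list[str]) -> list[str]:
    """Filter full names using the weak-alias, case-duplicate, and no-letters rules."""
    weak_keys = {name.casefold() for name in weak_aliases}
    seen: set[str] = set()
    kept: list[str] = []
    for name in full_names:
        key = name.casefold()
        if key in weak_keys:
            continue  # listed as a weak alias somewhere: treat as weak everywhere
        if not any(ch.isalpha() for ch in name):
            continue  # no letters at all (e.g. digits or punctuation only)
        if key in seen:
            continue  # case-duplicate of a name already kept
        seen.add(key)
        kept.append(name)
    return kept
```

With full names "John Smith", "JOHN SMITH", "Butcher", and "12345", and "Butcher" also present as a weak alias, only "John Smith" remains as a full name.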

Relationships

When different sources record the same undirected relationship (like a family connection) in opposite directions, so that each entity appears on both sides, we normalize it to a single direction so it isn't double-counted.

This means the properties on a consolidated entity don't always match a simple concatenation of all source data. If you need the complete, unfiltered picture, the statements data (described below) gives you every assertion from every source.
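As an illustration of the relationship rule in this section, an undirected edge can be normalized by putting its two entity IDs in a fixed order and dropping repeats. Choosing the direction by sorting the IDs is an assumption made for this sketch, not a description of how the direction is actually picked.

```python
def normalize_undirected(edges: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Collapse (a, b) and (b, a) into one edge, keeping first-seen order."""
    seen: set[tuple[str, str]] = set()
    result: list[tuple[str, str]] = []
    for left, right in edges:
        # Order the pair so both directions map to the same key.
        pair = (left, right) if left <= right else (right, left)
        if pair not in seen:
            seen.add(pair)
            result.append(pair)
    return result
```

Directed relationships (such as ownership) should not go through this step, because reversing them changes their meaning.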

Statements and field-level provenance

For use cases that need per-field, per-source provenance, we publish a statements dataset alongside the entity data. Each statement records a single property assertion about an entity, tagged with the dataset it came from.

Each row in the statements data carries two key IDs:

  • canonical_id — the current canonical ID of the entity. This always reflects the latest deduplication state.
  • entity_id — the source-specific entity ID that originally produced the statement.

This lets you reconstruct exactly which source contributed which facts and build domain-specific projections (names, addresses, identifiers) with full lineage back to the originating list. The output normalization heuristics described above apply to entity exports but not to raw statements, so statements can include values that don't appear in the entity properties.
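A per-property provenance view can be built by grouping statements on canonical_id. The sketch below assumes each statement is available as a dict with canonical_id and entity_id (as described above) plus prop, value, and dataset keys; those last three column names are assumptions to verify against the statements file you use, and the function name is hypothetical.

```python
from collections import defaultdict
from typing import Iterable


def provenance_by_property(rows: Iterable[dict], canonical_id: str) -> dict:
    """Return {property: {value: [(dataset, source entity ID), ...]}} for one entity."""
    grouped: dict = defaultdict(lambda: defaultdict(set))
    for row in rows:
        if row["canonical_id"] != canonical_id:
            continue
        # Record which dataset and which source record asserted this value.
        grouped[row["prop"]][row["value"]].add((row["dataset"], row["entity_id"]))
    # Convert to plain dicts with sorted lists for stable output.
    return {
        prop: {value: sorted(sources) for value, sources in values.items()}
        for prop, values in grouped.items()
    }
```

Because raw statements are not filtered by the consolidation heuristics, this view can contain values (such as a year-only birth date) that are absent from the consolidated entity.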
