[IMPROVE] Data quality: CUSIP identifier collisions across different companies #118

@afurgal


Hi - human here writing this. Thanks for the excellent open source data work - it's much appreciated. I came across a possible DQ issue and thought it might be helpful to flag. Disclaimer - the report below could be making some assumptions about your data model that are wrong. It's also (obviously) quite heavily AI-assisted. Hope it helps.

-Adam

[END HUMAN]

[START AI SLOP]

Thanks & Context

First, thank you for maintaining FinanceDatabase - it's an excellent resource! We're using the Equities dataset to look up GICS sector classifications for bond issuers in our fixed income analytics platform, and it's been tremendously helpful.

During our integration work, we noticed some data quality issues with CUSIP identifiers that we wanted to flag.

Issue

The same CUSIP identifier is sometimes assigned to completely different companies. This makes CUSIP-based lookups unreliable.
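On the consuming side, one way to guard against this is to refuse to answer when a CUSIP is ambiguous. The following is a minimal sketch, not part of financedatabase: it assumes records are plain dicts with `cusip` and `sector` keys (for example from `df.to_dict("records")`), and the helper names `build_sector_index` and `lookup_sector` are hypothetical.

```python
from collections import defaultdict


def build_sector_index(records):
    """Map each CUSIP to the set of distinct sectors seen for it."""
    index = defaultdict(set)
    for row in records:
        cusip, sector = row.get("cusip"), row.get("sector")
        # Only string values count; this also skips None and NaN placeholders.
        if isinstance(cusip, str) and isinstance(sector, str):
            index[cusip].add(sector)
    return dict(index)


def lookup_sector(index, cusip):
    """Return the sector only when the CUSIP maps to exactly one, else None."""
    sectors = index.get(cusip, set())
    return next(iter(sectors)) if len(sectors) == 1 else None


# The first two rows come from the Examples table in this report.
# The third row is a made-up placeholder to show the unambiguous case.
sample = [
    {"cusip": "00090Q103", "name": "Abundance International", "sector": "Materials"},
    {"cusip": "00090Q103", "name": "ADT Inc.", "sector": "Industrials"},
    {"cusip": "XXXXXXXX0", "name": "Placeholder Co", "sector": "Utilities"},
]
index = build_sector_index(sample)
print(lookup_sector(index, "00090Q103"))  # ambiguous CUSIP
print(lookup_sector(index, "XXXXXXXX0"))  # unambiguous placeholder
```

Returning `None` for ambiguous identifiers makes the caller fall back to another key (such as FIGI or ISIN) instead of silently picking the wrong sector.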

Statistics

import financedatabase as fd

equities = fd.Equities()
df = equities.select()

cusip_df = df[df['cusip'].notna()]
print(f"Total records with CUSIP: {len(cusip_df):,}")        # 13,990
print(f"Unique CUSIPs: {cusip_df['cusip'].nunique():,}")     # 2,459

cusip_to_names = cusip_df.groupby('cusip')['name'].nunique()
multi_name = cusip_to_names[cusip_to_names > 1]
print(f"CUSIPs with >1 company: {len(multi_name):,}")        # 1,660 (67.5%)

cusip_to_sectors = cusip_df.groupby('cusip')['sector'].nunique()
multi_sector = cusip_to_sectors[cusip_to_sectors > 1]
print(f"CUSIPs with >1 sector: {len(multi_sector):,}")       # 597 (24.3%)

Examples

These CUSIPs map to clearly different companies in different sectors:

| CUSIP | Company 1 | Sector 1 | Company 2 | Sector 2 |
|---|---|---|---|---|
| 00089H106 | ACS, Actividades de Construccion | Industrials | Oakley Capital Investments | Financials |
| 00090Q103 | Abundance International | Materials | ADT Inc. | Industrials |
| 00181T107 | A-Mark Precious Metals | Financials | Amir Marketing and Investments | Materials |
| 00182C103 | ANI Pharmaceuticals | Health Care | BSF Enterprise Plc | Financials |
| 00211Y506 | ARCA biopharma | Health Care | Albioma | Utilities |

Comparison

For context, the FIGI identifiers in the same dataset have an effectively zero collision rate (1 out of 25,103 unique FIGIs, roughly 0.004%), suggesting the problem may be specific to how the CUSIP data was sourced or aggregated.
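To compare identifier columns on equal footing, the same "more than one distinct name per identifier" count can be applied to each of them. This is a rough sketch assuming the FIGI figure was computed the same way as the CUSIP one; `collision_rate` is a hypothetical helper, and the FIGI values below are placeholders rather than real identifiers (the CUSIPs and names come from the Examples table).

```python
from collections import defaultdict


def collision_rate(records, id_key, name_key="name"):
    """Return (colliding_ids, unique_ids, rate) for one identifier column."""
    names_by_id = defaultdict(set)
    for row in records:
        identifier, name = row.get(id_key), row.get(name_key)
        # Ignore rows where either value is missing (None or NaN).
        if isinstance(identifier, str) and isinstance(name, str):
            names_by_id[identifier].add(name)
    unique = len(names_by_id)
    colliding = sum(1 for names in names_by_id.values() if len(names) > 1)
    rate = colliding / unique if unique else 0.0
    return colliding, unique, rate


rows = [
    {"cusip": "00090Q103", "figi": "FIGI-PLACEHOLDER-1", "name": "Abundance International"},
    {"cusip": "00090Q103", "figi": "FIGI-PLACEHOLDER-2", "name": "ADT Inc."},
    {"cusip": "00211Y506", "figi": "FIGI-PLACEHOLDER-3", "name": "ARCA biopharma"},
    {"cusip": "00211Y506", "figi": "FIGI-PLACEHOLDER-4", "name": "Albioma"},
]
for key in ("cusip", "figi"):
    colliding, unique, rate = collision_rate(rows, key)
    print(f"{key}: {colliding} of {unique} identifiers collide ({rate:.1%})")
```

With a real DataFrame, `df.to_dict("records")` would supply the rows. Note that this counts distinct name strings, so spelling variants of one company's name would also register as a collision.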

Environment

  • financedatabase: latest (tested Jan 2026)
  • Python: 3.11

Thanks again for the great work on this project!
