fix: update CompanyScraper selectors for new LinkedIn DOM #271

Open
choru-k wants to merge 1 commit into joeyism:master from choru-k:fix/company-scraper-dom-update

choru-k commented Feb 1, 2026

Summary

LinkedIn changed the DOM structure of its company pages, causing CompanyScraper to return null for every field except linkedin_url and name. This PR updates the selectors to match the new structure.

Changes:

  • Navigate to the /about page for more detailed company info
  • Update CSS selectors for about section (section.org-about-module__margin-bottom p.break-words)
  • Refactor overview extraction into helper methods:
    • _parse_dl_definition_list() - Parses dl.overflow-hidden dt/dd structure
    • _parse_top_card_info() - Parses .org-top-card-summary-info-list__info-item
    • _find_website_link() - Extracts company website URL
  • Add multi-language support for field detection (English, Korean, Japanese)
  • Infer field types from value patterns rather than relying solely on localized labels
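
The last two bullets can be illustrated with a minimal sketch. It is not the PR's code: the function name `infer_field`, the regular expressions, and the `LABEL_TO_FIELD` table are all hypothetical, and the Korean and Japanese label strings are illustrative guesses rather than strings copied from LinkedIn. The idea is that values such as URLs and employee ranges look the same in every locale, so they are checked first, and the localized label is only consulted when the value is ambiguous.

```python
import re
from typing import Optional

# Hypothetical label table (assumption, not taken from the PR).
# The non-English labels are illustrative guesses.
LABEL_TO_FIELD = {
    "industry": "industry",
    "company size": "company_size",
    "headquarters": "headquarters",
    "website": "website",
    "업계": "industry",
    "회사 규모": "company_size",
    "본사": "headquarters",
    "웹사이트": "website",
    "業種": "industry",
    "会社規模": "company_size",
    "本社": "headquarters",
    "ウェブサイト": "website",
}

# A value that is a single URL-like token is treated as the website.
_URL_RE = re.compile(r"^(https?://|www\.)\S+$", re.IGNORECASE)
# A value starting with "N-M" or "N+" (commas allowed) is treated as a size.
_SIZE_RE = re.compile(r"^\d[\d,]*\s*(?:\+|[-–]\s*\d[\d,]*)")


def infer_field(label: str, value: str) -> Optional[str]:
    """Return a field name for a (label, value) pair, or None if unknown."""
    value = value.strip()
    # Value patterns are locale-independent, so they take priority.
    if _URL_RE.match(value):
        return "website"
    if _SIZE_RE.match(value):
        return "company_size"
    # Otherwise fall back to the localized label, if it is a known one.
    return LABEL_TO_FIELD.get(label.strip().lower())
```

With this ordering, a value like '10,001+ employees' is classified as company_size even when its label is in a language missing from the table, while free-text values such as an industry name still depend on the label lookup.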

Test Results

Tested with multiple companies:

# ✅ get_company_profile('anthropicresearch')
{
  'name': 'Anthropic',
  'about_us': 'Anthropic is an AI safety company...',
  'website': 'https://www.anthropic.com',
  'industry': 'Research Services',
  'company_size': '501-1,000 employees',
  ...
}

# ✅ get_company_profile('microsoft')
{
  'name': 'Microsoft',
  'about_us': 'Every company has a mission...',
  'website': 'https://news.microsoft.com',
  'industry': 'Software Development',
  'company_size': '10,001+ employees',
  'headquarters': 'Redmond, Washington',
  ...
}

Backwards Compatibility

The changes maintain backwards compatibility:

  • Still tries the old dt/dd selectors as a fallback
  • Still parses .org-top-card-summary-info-list__info-item if the new structure isn't found
  • Returns the same Company object structure
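
The fallback behaviour described above might look roughly like the following sketch. It is an assumption-laden illustration, not the PR's implementation: the scraper works against a live browser page, whereas this version parses static HTML strings with the standard library, and the names `DlParser`, `parse_dl`, `parse_top_card`, and `first_non_empty` are hypothetical. The regex-based top-card reader also assumes each info item contains plain text with no child tags.

```python
import re
from html.parser import HTMLParser
from typing import Callable, Iterable, List, Optional, Tuple

Pairs = List[Tuple[str, str]]  # (label, value); label is "" when unknown


class DlParser(HTMLParser):
    """Collect (dt, dd) text pairs from definition-list markup."""

    def __init__(self) -> None:
        super().__init__()
        self.pairs: Pairs = []
        self._open: Optional[str] = None  # "dt" or "dd" while inside one
        self._label = ""
        self._buf: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("dt", "dd"):
            self._open, self._buf = tag, []

    def handle_data(self, data):
        if self._open:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag != self._open:
            return  # closing tag of a nested element, or outside dt/dd
        text = " ".join("".join(self._buf).split())
        if tag == "dt":
            self._label = text
        elif text:
            self.pairs.append((self._label, text))
        self._open = None


def parse_dl(html: str) -> Pairs:
    parser = DlParser()
    parser.feed(html)
    return parser.pairs


# Flat-markup assumption: each info item holds plain text, no child tags.
_ITEM_RE = re.compile(
    r'class="[^"]*org-top-card-summary-info-list__info-item[^"]*"[^>]*>([^<]+)<'
)


def parse_top_card(html: str) -> Pairs:
    # Top-card items carry no label, so the label slot is left empty.
    return [("", " ".join(text.split())) for text in _ITEM_RE.findall(html)]


def first_non_empty(html: str, strategies: Iterable[Callable[[str], Pairs]]) -> Pairs:
    """Run parsers in order and return the first non-empty result."""
    for strategy in strategies:
        pairs = strategy(html)
        if pairs:
            return pairs
    return []
```

Calling `first_non_empty(html, [parse_dl, parse_top_card])` tries the definition-list structure first and only falls through to the top-card items when no dt/dd pairs are found. Because each parser returns the same pair shape, the caller can build the same result object regardless of which layout matched.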

Fixes #238

- Navigate to /about page for detailed company info
- Update CSS selectors for about section and overview details
- Add multi-language support for field detection (English, Korean, Japanese)
- Refactor overview extraction into helper methods
- Add _parse_dl_definition_list() for dl.overflow-hidden dt/dd structure
- Add _parse_top_card_info() for .org-top-card-summary-info-list
- Add _find_website_link() for website URL extraction

Fixes joeyism#238

Development

Successfully merging this pull request may close these issues.

company scraper keeps refresh the webpage and can not find any elements