# Civic Support Scrapers

Scrape Norwegian NGOs’ local chapter pages into a single, normalized dataset — consistently, politely, and with great observability.

Each scraper targets one organization (e.g., Røde Kors, Kirkens Bymisjon, …) and emits a uniform record schema so downstream tooling can treat all orgs the same.
- Uniform schema across organizations
- Polite by default: rate limits, retries with backoff, geocoding throttle + disk cache
- Strong selectors with semantic fallbacks
- Beautiful terminal UX: progress bar, counters, sections, and leveled logs
- Resumability: partial dataset written on Ctrl+C
- Per-record freshness: `data_updated` field stamped on every record
## Contents

- Requirements
- Install
- Repository Structure
- Quick Start
- CLI / Env Options
- Output & Schema
- How It Works
- Logging & Progress
- Geocoding
- Performance & Etiquette
- Add a New Organization
- Quality Checklist
- Troubleshooting
- Contributing
- License
## Requirements

- Node.js ≥ 20.6 (supports JSON import attributes). Tested on Node 20–22.
- Internet access
If you must use Node 18, you can replace JSON import attributes with a tiny helper; otherwise stick to Node 20+.
## Install

From the repo root:

```bash
npm install
```

This pulls in: axios, cheerio, p-limit, html-minifier-terser, cli-progress, chalk, boxen, ora, pretty-ms.
## Repository Structure

```text
.
├─ scripts/
│  └─ scrape.js            # CLI runner, picks org scraper by slug
├─ scrapers/
│  ├─ rodekors/
│  │  └─ rode-kors.js      # Røde Kors scraper (reference implementation)
│  └─ <org-slug>/
│     └─ <org>.js          # Your next scraper
├─ lib/
│  ├─ http.js              # axios GET with retry/backoff
│  ├─ geocode.js           # Nominatim + serialized rate limiting + cache
│  ├─ log.js               # terminal UX (sections, progress, levels)
│  └─ utils.js             # shared DOM/text helpers
├─ data/
│  └─ <org>-local.json     # final dataset per org (written on completion)
└─ .cache/
   └─ geocode-cache.json   # shared geocoding cache
```
## Quick Start

Use the CLI runner (`scripts/scrape.js`) and pass the organization slug.

Polite defaults (Røde Kors):

```bash
LOG_LEVEL=info CONCURRENCY=3 SLEEP_MS=600 node scripts/scrape.js rodekors
```

With geocoding (only for missing coordinates):

```bash
LOG_LEVEL=info GEOCODE=1 CONCURRENCY=3 SLEEP_MS=600 node scripts/scrape.js rodekors
```

Narrow scope while developing:

```bash
ONLY_COUNTY=agder node scripts/scrape.js rodekors
ONLY_COUNTY=agder ONLY_CITY=kvinesdal node scripts/scrape.js rodekors
```

Stamp a specific `data_updated` date (per record):

```bash
DATA_UPDATED=2023-06-01 node scripts/scrape.js rodekors
```

## CLI / Env Options

| Variable | Default | Description |
|---|---|---|
| `ORG` | — | Organization slug (or pass as CLI arg, e.g., `node scripts/scrape.js rodekors`). |
| `CONCURRENCY` | `5` | Max concurrent HTTP fetches (keep modest: 2–5). |
| `SLEEP_MS` | `300` | Delay between tasks for politeness (ms). |
| `ONLY_COUNTY` | — | Limit discovery to a single county slug (e.g., `agder`). |
| `ONLY_CITY` | — | Limit to a city slug within the chosen county. |
| `GEOCODE` | off | `1`/`true`/`yes` to geocode only if coordinates are missing. |
| `GEO_RATE_MS` | `1100` | Min delay (ms) between geocoding calls (Nominatim policy friendly). |
| `MAX_GEOCODES` | `10000` | Hard cap on geocoding calls per run. |
| `LOG_LEVEL` | `info` | `silent`, `error`, `warn`, `info`, `verbose`, `debug`. |
| `DATA_UPDATED` | today | Per-record `data_updated` (ISO `YYYY-MM-DD`). Defaults to today (UTC). |
The HTTP `User-Agent` is `CivicSupportScrapers/<version from package.json> (+hey@codefornorway.org)`.
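The retry/backoff behavior lives in `lib/http.js`; the sketch below shows the idea under stated assumptions: Node's built-in `fetch` (Node ≥ 18) stands in for the axios client, and `politeGet`, `shouldRetry`, and `backoffMs` are illustrative names, not the repo's actual API.

```javascript
// Illustrative sketch of retry-with-backoff (not the actual lib/http.js code).
const UA = "CivicSupportScrapers/1.0 (+hey@codefornorway.org)";

// Retry only on network failure (no status), throttling, or server errors.
function shouldRetry(status) {
  return status === undefined || status === 429 || status >= 500;
}

// Exponential backoff: 500ms, 1s, 2s, 4s, ...
function backoffMs(attempt, base = 500) {
  return base * 2 ** attempt;
}

async function politeGet(url, { retries = 3 } = {}) {
  for (let attempt = 0; ; attempt++) {
    let status;
    try {
      const res = await fetch(url, { headers: { "User-Agent": UA } });
      if (res.ok) return res;
      status = res.status;
    } catch {
      status = undefined; // DNS/socket error, no HTTP status
    }
    if (!shouldRetry(status) || attempt >= retries) {
      throw new Error(`GET ${url} failed (status ${status}) after ${attempt + 1} attempt(s)`);
    }
    await new Promise((r) => setTimeout(r, backoffMs(attempt)));
  }
}
```

Client errors such as 404 fail fast; only throttling and server-side failures are retried.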
## Output & Schema

The final dataset is written to `data/<org>-local.json` (an array of objects). Each record:

```json
{
  "name": "Arendal Røde Kors",
  "description": "Arendal Røde Kors er en frivillig, medlemsstyrt organisasjon ...",
  "image": "https://www.rodekors.no/globalassets/.../treffpunkt-hove-for-sosiale-medier.jpg",
  "address": "Hans Thornes vei 26, 4846 ARENDAL",
  "email": "leder@arendal-rk.no",
  "source": "https://www.rodekors.no/lokalforeninger/agder/arendal/",
  "coordinates": [58.4887604, 8.7585903],
  "notes": "<p>Styreleder: ...</p>",
  "organization": "Røde Kors",
  "city": "Arendal",
  "data_updated": "2023-06-01"
}
```

Field notes:

- `coordinates` is `[lat, lon]`. The value comes from the page when present; geocoding runs only if it is missing.
- `notes` is a concise HTML snippet. For Røde Kors, it is the content from `<h2/3>Velkommen` up to `.expander-list-header` (exclusive); if that header is not present, `notes = null`.
- Records without an `address` are skipped (not written).
- A partial file is written on Ctrl+C to `data/<org>-local.partial.json`.
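A record can be spot-checked against this schema with a few lines of plain JavaScript. This is an illustrative sketch, not code shipped in the repo; only the field names come from the schema above.

```javascript
// Illustrative spot check for one record (field names mirror the schema above;
// the function itself is a hypothetical helper, not part of the repo).
function validateRecord(rec) {
  const errors = [];
  for (const key of ["name", "address", "source", "organization", "city", "data_updated"]) {
    if (typeof rec[key] !== "string" || rec[key].length === 0) errors.push(`missing ${key}`);
  }
  if (!/^\d{4}-\d{2}-\d{2}$/.test(rec.data_updated || "")) {
    errors.push("data_updated is not ISO YYYY-MM-DD");
  }
  if (rec.coordinates != null) {
    const [lat, lon] = Array.isArray(rec.coordinates) ? rec.coordinates : [];
    if (typeof lat !== "number" || typeof lon !== "number" || Math.abs(lat) > 90 || Math.abs(lon) > 180) {
      errors.push("coordinates must be [lat, lon] numbers");
    }
  }
  return errors; // empty array = record looks sane
}
```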
## How It Works

1. **Discover**
   - Start at the org index (e.g., `/lokalforeninger/`).
   - Collect county links at depth 1 (`/org/{county}/`).
   - For each county, collect city links at depth 2 (`/org/{county}/{city}/`).
   - Filter out non-city pages with an organization-specific stop list (`om`, `kontakt`, …).
2. **Extract**
   - `name`: `<h1>`.
   - `description`: prefer a structured intro (e.g., `.lead p`), then the meta `description`, then the nearest paragraph after `<h1>`.
   - `image`: `<meta property="og:image">`, falling back to the first `<img>` (absolute URL).
   - `address`: prefer semantic `<dt>Adresse</dt><dd>…</dd>`, falling back to map data or a regex.
   - `email`: the first valid email near the top, or anywhere on the page.
   - `notes`: a targeted slice based on headings (see the schema notes).
   - `coordinates`: from the map widget (`data-marker`) or inline text; otherwise geocoded if enabled.
3. **Write**
   - Stream results into an array and save to `data/<org>-local.json`.
   - On SIGINT, write a partial file with whatever has been collected so far.
## Logging & Progress

The scraper prints structured sections and a live progress bar:

```text
Extract records
────────────────────────
Scraping | ███████░░░ 128/319 | 40% | ETA: 5m 12s | page:94 geo:29 skip:6 err:1
```

Counters:

- `page`: coordinates read from the page (map/regex)
- `geo`: coordinates obtained via geocoding
- `skip`: records skipped due to a missing `address`
- `err`: extraction errors

Use `LOG_LEVEL=verbose` for more detail (HTTP/geocode hits); `debug` adds per-query traces.
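The counter bookkeeping behind that status line can be sketched without any dependency. The real runner renders it with cli-progress (including the bar and ETA, omitted here); `makeCounters`, `bump`, and `statusLine` are illustrative names.

```javascript
// Dependency-free sketch of the progress counters (bar and ETA omitted).
function makeCounters() {
  return { page: 0, geo: 0, skip: 0, err: 0 };
}

// Increment one counter; reject unknown keys to catch typos early.
function bump(counters, key) {
  if (!(key in counters)) throw new Error(`unknown counter: ${key}`);
  counters[key] += 1;
  return counters;
}

// Format the textual part of the progress line shown above.
function statusLine(done, total, counters) {
  const pct = total ? Math.round((done / total) * 100) : 0;
  return `Scraping | ${done}/${total} | ${pct}% | page:${counters.page} geo:${counters.geo} skip:${counters.skip} err:${counters.err}`;
}
```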
## Geocoding

- Engine: Nominatim (OpenStreetMap)
- When: only if the page lacks coordinates
- Throttle: serialized queue, `GEO_RATE_MS` delay between calls (default 1100 ms)
- Cache: `.cache/geocode-cache.json` (shared across all orgs; speeds up subsequent runs)
- Query strategy: progressively broader (address → address+county → `POSTCODE CITY` → `POSTCODE` → `CITY, COUNTY` → `CITY`)
## Performance & Etiquette

- Keep `CONCURRENCY` low (2–3) and `SLEEP_MS` ≥ 600 for production runs.
- Respect retry/backoff signals and avoid hammering origins.
- Geocoding is intentionally slow; please don't lower the throttle for large runs.
- Consider off-peak schedules and contacting site owners for ongoing crawls.
## Add a New Organization

Use `scrapers/rodekors/rode-kors.js` as a reference:

1. **Create**: `scrapers/<org-slug>/<org>.js`
2. **Constants**: set `BASE`, `START`, `ORG`
3. **Stop list**: `STOP_SLUGS` for non-city subpaths (lower-case)
4. **Discovery**:
   - `getCountyLinks()` → URLs like `/.../{county}/` (depth 1)
   - `getCityLinks()` → URLs like `/.../{county}/{city}/` (depth 2)
5. **Extraction**: implement robust selectors + fallbacks per site
6. **Register** the scraper in the `scripts/scrape.js` registry
7. **Test** with `ONLY_COUNTY`/`ONLY_CITY` and low concurrency
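A minimal skeleton for the constants, stop list, and depth-filtered discovery might look like this. Every value below is a placeholder for a hypothetical organization, not a real site, and `isAtDepth` is an illustrative helper rather than the reference implementation.

```javascript
// Skeleton for scrapers/<org-slug>/<org>.js (all values are placeholders).
const BASE = "https://www.example-org.no";
const START = `${BASE}/lokallag/`;
const ORG = "Example Org";

// Non-city subpaths to ignore during discovery (lower-case slugs).
const STOP_SLUGS = new Set(["om", "kontakt", "aktuelt"]);

// Depth check: keep /<start>/{county}/ (depth 1) or /<start>/{county}/{city}/
// (depth 2) paths, and drop anything containing a stop slug.
function isAtDepth(url, depth) {
  const rel = url.replace(START, "").replace(/\/+$/, "");
  const parts = rel.split("/").filter(Boolean);
  return parts.length === depth && !parts.some((p) => STOP_SLUGS.has(p.toLowerCase()));
}
```

`getCountyLinks()`/`getCityLinks()` would then fetch a page, collect `<a href>` values, and filter them through `isAtDepth(url, 1)` and `isAtDepth(url, 2)` respectively.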
## Quality Checklist

- County pages (depth 1) only; city pages (depth 2) only
- Non-city pages excluded by the stop list
- Image URLs normalized to absolute
- `address` preferred via semantic `dt`/`dd`; regex/map otherwise
- `notes` scoped specifically to the intended slice
- Records without an `address` are skipped
- `coordinates` copied from the page; geocoding happens only when missing
- Progress bar reflects `page`, `geo`, `skip`, `err`
- Output passes a quick JSON schema spot check on samples
## Troubleshooting

**Too noisy / can't see progress**

- Use `LOG_LEVEL=info` (default). `verbose`/`debug` add detail; `warn` quiets things down.

**Geocoding is slow**

- That's by design (rate-limited and cached). Subsequent runs reuse `.cache/geocode-cache.json`.

**HTTP 429/5xx**

- Reduce `CONCURRENCY` and increase `SLEEP_MS`. The HTTP client already retries with backoff.

**Unexpected pages scraped**

- Add more slugs to `STOP_SLUGS` and/or refine the depth checks.

**Wrong `data_updated` date**

- Set an explicit `DATA_UPDATED=YYYY-MM-DD` in your run command.
## Contributing

PRs welcome! Especially:
- New organization scrapers
- Selector improvements for edge cases
- Exporters (CSV/Parquet)
- Tests for selectors and schema
## License

MIT
Happy scraping!