
# wiki-geoparquet

English Wikipedia main-namespace articles with Earth coordinates, as GeoParquet + PMTiles.

Coordinates, inlink counts, article length, Wikidata QIDs, image links, and descriptions. Updated monthly from Wikipedia SQL dumps.

demo: shane98c.github.io/wiki-geoparquet

## Data

All files are on R2 with CORS enabled — query directly from the browser or any Parquet-aware tool:

| File | Description |
| --- | --- |
| `wikipedia_geotagged.parquet` | GeoParquet, Hilbert-sorted with bbox covering |
| `wikipedia_geotagged.pmtiles` | Vector tiles, auto-zoom with overzoom, drops features by article length |
| `wikipedia_search.parquet` | Lightweight search index (lowercased label, coords, inlink count), sorted by label for prefix-range row-group pruning |

Pinned versions are also available on GitHub Releases.

## Schema

| Column | Type | Description |
| --- | --- | --- |
| `geometry` | WKB Point | WGS84 coordinates |
| `page_id` | int32 | Wikipedia page ID |
| `qid` | string | Wikidata QID (e.g. `Q90`); use for joins with Wikidata |
| `label` | string | Article title |
| `description` | string | Short description from `wikibase-shortdesc` |
| `gt_type` | string | Wikipedia geo classification (city, mountain, landmark, etc.) |
| `gt_primary` | bool | Whether coordinates are the article's primary geo_tag |
| `page_len` | int32 | Article length in bytes |
| `inlink_count` | int32 | Number of namespace-0 pagelinks pointing here |
| `wikipedia_url` | string | Full article URL |
| `image_url` | string | Wikimedia Commons image URL |
| `bbox` | struct | Covering bbox for spatial predicate pushdown |

The PMTiles carry a subset of these properties (no geometry, qid, wikipedia_url, image_url, or bbox). Reconstruct URLs client-side:

- Article: `https://en.wikipedia.org/?curid={page_id}`
- Image: `https://commons.wikimedia.org/wiki/Special:FilePath/{image}`
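A tiny client-side helper following the templates above might look like this (a sketch; the property names `page_id` and `image` are assumed from the tile schema described here):

```javascript
// Rebuild article and image URLs from PMTiles feature properties.
// Assumes `page_id` (int) and `image` (bare Commons filename), per the
// URL templates above.
function articleUrl(props) {
  return `https://en.wikipedia.org/?curid=${props.page_id}`;
}

function imageUrl(props) {
  // Special:FilePath redirects to the original file; encode the filename
  // so spaces and non-ASCII characters survive the round trip.
  return props.image
    ? `https://commons.wikimedia.org/wiki/Special:FilePath/${encodeURIComponent(props.image)}`
    : null;
}
```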

## Quick start

### Query with DuckDB

```sql
INSTALL spatial; LOAD spatial;

-- Find the most notable geotagged articles
SELECT label, inlink_count, page_len, gt_type, ST_AsText(geometry)
FROM 'https://pub-016504dd3a4d419a9c17a8939840935e.r2.dev/v1/wikipedia_geotagged.parquet'
ORDER BY inlink_count DESC
LIMIT 20;

-- Spatial query: articles within 50 km of Paris
SELECT label, inlink_count, gt_type
FROM 'https://pub-016504dd3a4d419a9c17a8939840935e.r2.dev/v1/wikipedia_geotagged.parquet'
WHERE ST_Distance_Sphere(geometry, ST_Point(2.3522, 48.8566)) < 50000
ORDER BY inlink_count DESC
LIMIT 20;
```

### Browser search with duckdb-wasm

The wikipedia_search.parquet index has columns label (lowercased, used for both sorting and lookup), lon, lat, and inlink_count for ranking. It's sorted by label so a lowercased prefix range lets DuckDB skip row groups and fetch ~2 MB per query instead of the full ~25 MB file.

```js
// In duckdb-wasm, INSTALL httpfs (or SET builtin_httpfs = false) — without
// this the built-in HTTP handler downloads the whole file on every query.
// See https://github.com/duckdb/duckdb-wasm/issues/2153.
await conn.query("SET builtin_httpfs = false;");

const q = userInput.toLowerCase().replace(/'/g, "''");
await conn.query(`
  SELECT label, lon, lat
  FROM 'https://.../wikipedia_search.parquet'
  WHERE label >= '${q}' AND label < '${q}~'
  ORDER BY inlink_count DESC
  LIMIT 10
`);
```
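The prefix bound in the query above can be factored into a small helper (a sketch; `prefixRange` is not part of this repo):

```javascript
// Build the half-open [lo, hi) label range for a prefix search, plus SQL
// single-quote escaping. '~' (0x7E) sorts after all lowercase ASCII letters,
// so every label starting with the prefix falls inside the range.
function prefixRange(userInput) {
  const q = userInput.toLowerCase().replace(/'/g, "''");
  return { lo: q, hi: q + "~" };
}
```

Note the bound is a heuristic for lowercase ASCII labels: a label containing a character that sorts above `'~'` would escape the range.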

### Use pmtiles in MapLibre

```js
import { Protocol } from "pmtiles";

let protocol = new Protocol();
maplibregl.addProtocol("pmtiles", protocol.tile);

const map = new maplibregl.Map({
  style: {
    sources: {
      wikipedia: {
        type: "vector",
        url: "pmtiles://https://pub-016504dd3a4d419a9c17a8939840935e.r2.dev/v1/wikipedia_geotagged.pmtiles",
      },
    },
    layers: [
      {
        id: "articles",
        source: "wikipedia",
        "source-layer": "wikipedia",
        type: "circle",
        paint: {
          "circle-radius": [
            "interpolate",
            ["linear"],
            ["sqrt", ["get", "inlink_count"]],
            0,
            2.5,
            280,
            20,
          ],
          "circle-color": "#4264fb",
        },
      },
    ],
  },
});
```

## How it works

  1. Streams 5 Wikipedia SQL dump files (~12 GB) and extracts geotagged pages with Earth coordinates (preferring primary tags, falling back to non-primary), filtering to main-namespace non-redirect articles. Catalog articles (List of …, Listed buildings …, Timeline of …) are dropped when their coordinate is non-primary; that is how "Timeline of Pittsburgh" (a city history, kept) is distinguished from "Timeline of the Syrian civil war" (an event at an incidental location, dropped)
  2. Joins with page metadata, Wikidata properties, and pagelinks-based inlink counts
  3. Writes Hilbert-sorted GeoParquet with bbox covering via DuckDB spatial
  4. Pipes DuckDB to tippecanoe for PMTiles with attribute-based feature dropping

See the Makefile for build steps.

## Data source

All data from English Wikipedia SQL dumps, released under CC BY-SA 4.0.

## License

Code: MIT

Data outputs: CC BY-SA 4.0 (derived from Wikipedia/Wikidata)