ctbk python library

CLI for generating ctbk.dev datasets (derived from Citi Bike public data in s3://tripdata).

Data flow

flowchart LR;
z["TripdataZips\ns3://tripdata"]
c["TripdataCsvs\ns3://ctbk/csvs"]
n["NormalizedMonths\ns3://ctbk/normalized/YYYYMM.parquet"]
agg_sc["AggregatedMonths(YYYYMM, 's', 'c')\ns3://ctbk/aggregated/s_c_YYYYMM.parquet"]
agg_sec["AggregatedMonths(YYYYMM, 'se', 'c')\ns3://ctbk/aggregated/se_c_YYYYMM.parquet"]
agg_ymrgtb["AggregatedMonths(YYYYMM, 'ymrgtb', 'cd')\ns3://ctbk/aggregated/ymrgtb_cd_YYYYMM.parquet"]
smh_in["StationMetaHists(YYYYMM, 'in')\ns3://ctbk/stations/meta_hists/in_YYYYMM.parquet"]
smh_il["StationMetaHists(YYYYMM, 'il')\ns3://ctbk/stations/meta_hists/il_YYYYMM.parquet"]
sm["StationModes\ns3://ctbk/aggregated/YYYYMM/stations.json"]
spj["StationPairJsons\ns3://ctbk/aggregated/YYYYMM/se_c.json"]

z --> c --> n
n --> agg_sc
n --> agg_sec
n --> agg_ymrgtb
n --> smh_in
n --> smh_il
smh_in --> sm
smh_il --> sm
agg_sc --> sm
sm --> spj
agg_sec --> spj

TripdataZips (a.k.a. zips): public Citi Bike .csv.zip files

  • Released as NYC and JC .csv.zip files at s3://tripdata
  • See https://tripdata.s3.amazonaws.com/index.html

TripdataCsvs (a.k.a. csvs): unzipped and gzipped CSVs

  • Writes <root>/ctbk/csvs/YYYYMM.csv
  • See also: s3://ctbk/csvs

NormalizedMonths (a.k.a. norms): normalize csvs

  • Merge regions (NYC, JC) for the same month, harmonize columns, drop duplicate data, etc.
  • Writes <root>/ctbk/normalized/YYYYMM.parquet
  • See also: s3://ctbk/normalized
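
The normalization step can be sketched roughly as follows; the column names and renames here are purely illustrative (the real schema mapping lives in the library), but the shape of the operation — rename per-region columns onto a shared schema, concatenate, de-duplicate — matches the description above:

```python
import pandas as pd

# Hypothetical column harmonization: map each region's raw CSV column
# names onto one shared schema (these names are illustrative only).
RENAMES = {
    "starttime": "Start Time",
    "started_at": "Start Time",
    "stoptime": "Stop Time",
    "ended_at": "Stop Time",
}

def normalize_month(region_dfs):
    """Merge per-region DataFrames for one month into a single frame."""
    frames = [df.rename(columns=RENAMES) for df in region_dfs]
    merged = pd.concat(frames, ignore_index=True)
    # Drop rows duplicated across regional files.
    return merged.drop_duplicates().reset_index(drop=True)

# Toy NYC and JC frames using different raw column names:
nyc = pd.DataFrame({"starttime": ["2021-01-01 00:00"], "stoptime": ["2021-01-01 00:10"]})
jc = pd.DataFrame({"started_at": ["2021-01-01 01:00"], "ended_at": ["2021-01-01 01:20"]})
norm = normalize_month([nyc, jc])
```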

AggregatedMonths (a.k.a. aggs): compute histograms over each month's rides:

  • Group by any of several "aggregation keys" ({year, month, day, hour, user type, bike type, start and end station, …})
  • Produce any "sum keys" ({ride counts, duration in seconds})
  • Writes <root>/ctbk/aggregated/KEYS_YYYYMM.parquet
  • See also: s3://ctbk/aggregated/*.parquet
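
Conceptually, each aggregation is a groupby over the chosen keys with one column per sum key. A minimal pandas sketch (toy data and column names assumed, not the library's actual schema):

```python
import pandas as pd

# Toy normalized rides (real parquets carry many more columns).
rides = pd.DataFrame({
    "Start Station ID": ["A", "A", "B"],
    "End Station ID":   ["B", "B", "A"],
    "Duration":         [300, 600, 450],  # seconds
})

# Group by start ("s") and end ("e") station, producing both sum keys:
# ride count ("c") and total duration in seconds ("d").
agg = (
    rides
    .groupby(["Start Station ID", "End Station ID"])
    .agg(Count=("Duration", "size"), Duration=("Duration", "sum"))
    .reset_index()
)
```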

StationMetaHists (a.k.a. smhs): compute station {id,name,lat/lng} histograms:

  • Similar to aggs, but counts station {id,name,lat/lng} tuples that appear as each ride's start and end stations (whereas agg's rows are 1:1 with rides)
  • "agg_keys" can include id (i), name (n), and lat/lng (l); there are no "sum_keys" (only counting is supported)
  • Writes <root>/ctbk/stations/meta_hists/KEYS_YYYYMM.parquet
  • See also: s3://ctbk/stations/meta_hists
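
The "each ride contributes its start and end station" counting can be sketched like this (column names are assumptions for illustration):

```python
import pandas as pd

rides = pd.DataFrame({
    "Start Station ID":   ["A", "A"],
    "Start Station Name": ["1 Ave & 1 St", "1 Ave & 1 St"],
    "End Station ID":     ["B", "A"],
    "End Station Name":   ["2 Ave & 2 St", "First Ave & 1 St"],
})

# Stack start and end tuples into one frame, so each ride contributes
# two rows, then count occurrences of each (id, name) pair.
starts = rides[["Start Station ID", "Start Station Name"]].set_axis(["id", "name"], axis=1)
ends = rides[["End Station ID", "End Station Name"]].set_axis(["id", "name"], axis=1)
hist = (
    pd.concat([starts, ends], ignore_index=True)
    .value_counts()          # counts unique (id, name) rows, descending
    .rename("count")
    .reset_index()
)
```

Note how the histogram has one row per distinct (id, name) tuple, not one row per ride — the distinction drawn above versus aggs.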

StationModes (a.k.a. sms): canonical {id,name,lat/lng} info for each station:

  • Computed from StationMetaHists:
    • name is chosen as the "mode" (most commonly listed name for that station ID)
    • lat/lng is taken to be the mean of the lat/lngs reported for each ride's start and end station
  • Writes <root>/ctbk/aggregated/YYYYMM/stations.json
  • See also: s3://ctbk/aggregated/YYYYMM/stations.json
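
A sketch of the mode/mean computation, assuming a toy meta-hist frame (the count-weighted mean here is one plausible reading of "mean of the lat/lngs reported for each ride"):

```python
import pandas as pd

# Toy (id, name, lat, lng, count) rows, as a StationMetaHist might yield.
hist = pd.DataFrame({
    "id":    ["A", "A", "A"],
    "name":  ["1 Ave & 1 St", "1 Ave & 1 St", "First Ave & 1 St"],
    "lat":   [40.70, 40.71, 40.72],
    "lng":   [-74.00, -74.01, -74.02],
    "count": [5, 3, 2],
})

def station_mode(g):
    # Most commonly listed name for this station ID …
    name = g.groupby("name")["count"].sum().idxmax()
    # … and the ride-count-weighted mean of reported lat/lngs.
    lat = (g["lat"] * g["count"]).sum() / g["count"].sum()
    lng = (g["lng"] * g["count"]).sum() / g["count"].sum()
    return pd.Series({"name": name, "lat": lat, "lng": lng})

modes = hist.groupby("id").apply(station_mode)
```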

StationPairJsons (a.k.a. spjs): counts of rides between each pair of stations:

  • JSON formatted as { <start idx>: { <end idx>: <count> } }
  • idxs are based on order of appearance in StationModes / stations.json above (which is also sorted by station ID)
  • Values are read from AggregatedMonths(<ym>, 'se', 'c'):
    • group by station start ("s") and end ("e"),
    • sum ride counts ("c")
  • Writes <root>/ctbk/aggregated/YYYYMM/se_c.json
  • See also: s3://ctbk/aggregated/YYYYMM/se_c.json
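
Nesting the (start, end, count) aggregation rows into that JSON shape is straightforward; a minimal sketch with toy data (the rows below stand in for AggregatedMonths output after mapping station IDs to stations.json indices):

```python
import json

# Toy (start idx, end idx, ride count) rows.
rows = [(0, 1, 12), (0, 2, 3), (1, 0, 7)]

# Nest into { <start idx>: { <end idx>: <count> } }.
pairs = {}
for s, e, c in rows:
    pairs.setdefault(s, {})[e] = c

# json.dumps serializes the int keys as strings.
blob = json.dumps(pairs)
```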

Installation

Clone this repo and install this library:

git clone https://github.com/hudcostreets/ctbk.dev
cd ctbk.dev
pip install -e ctbk

The ctbk executable will then be available, exposing a subcommand for each of the stages above:

CLI

ctbk
Usage: ctbk [OPTIONS] COMMAND [ARGS]...

  CLI for generating ctbk.dev datasets (derived from Citi Bike public data in `s3://`).
  ## Data flow
  ### `TripdataZips` (a.k.a. `zip`s): Public Citi Bike `.csv.zip` files
  - Released as NYC and JC `.csv.zip` files at s3://tripdata
  - See https://tripdata.s3.amazonaws.com/index.html
  ### `TripdataCsvs` (a.k.a. `csv`s): unzipped and gzipped CSVs
  - Writes `<root>/ctbk/csvs/YYYYMM.csv`
  - See also: https://ctbk.s3.amazonaws.com/index.html#/csvs
  ### `NormalizedMonths` (a.k.a. `norm`s): normalize `csv`s
  - Merge regions (NYC, JC) for the same month, harmonize columns, drop duplicate data, etc.
  - Writes `<root>/ctbk/normalized/YYYYMM.parquet`
  - See also: https://ctbk.s3.amazonaws.com/index.html#/normalized
  ### `AggregatedMonths` (a.k.a. `agg`s): compute histograms over each month's rides:
  - Group by any of several "aggregation keys" ({year, month, day, hour, user type, bike
    type, start and end station, …})
  - Produce any "sum keys" ({ride counts, duration in seconds})
  - Writes `<root>/ctbk/aggregated/KEYS_YYYYMM.parquet`
  - See also: https://ctbk.s3.amazonaws.com/index.html#/aggregated?p=8
  ### `StationMetaHists` (a.k.a. `smh`s): compute station {id,name,lat/lng} histograms:
  - Similar to `agg`s, but counts station {id,name,lat/lng} tuples that appear as each
    ride's start and end stations (whereas `agg`'s rows are 1:1 with rides)
  - "agg_keys" can include id (i), name (n), and lat/lng (l); there are no "sum_keys"
    (only counting is supported)
  - Writes `<root>/ctbk/stations/meta_hists/YYYYMM/KEYS.parquet`
  - See also: https://ctbk.s3.amazonaws.com/index.html#/stations/meta_hists
  ### `StationModes` (a.k.a. `sm`s): canonical {id,name,lat/lng} info for each station:
  - Computed from `StationMetaHist`s:
    - `name` is chosen as the "mode" (most commonly listed name for that station ID)
    - `lat/lng` is taken to be the mean of the lat/lngs reported for each ride's start
      and end station
  - Writes `<root>/ctbk/aggregated/YYYYMM/stations.json`
  - See also: https://ctbk.s3.amazonaws.com/index.html#/aggregated
  ### `StationPairJsons` (a.k.a. `spj`s): counts of rides between each pair of stations:
  - JSON formatted as `{ <start idx>: { <end idx>: <count> } }`
  - `idx`s are based on order of appearance in `StationModes` / `stations.json` above
    (which is also sorted by station ID)
  - Values are read from `AggregatedMonths(YYYYMM, 'se', 'c')`:
    - group by station start ("s") and end ("e"),
    - sum ride counts ("c")
  - Writes `<root>/ctbk/aggregated/YYYYMM/se_c.json`
  - See also: https://ctbk.s3.amazonaws.com/index.html#/aggregated

Options:
  --help            Show this message and exit.

Commands:
  zip                 Read .csv.zip files from s3://tripdata
  csv                 Extract CSVs from "tripdata" .zip files.
  normalized          Normalize "tripdata" CSVs (combine regions for each...
  partition           Separate pre-2024 parquets (`normalized/v0`) by...
  consolidate         Consolidate `normalized/YM/YM_YM.parquet` files...
  aggregated          Aggregate normalized ride entries by various...
  ymrgtb-cd           Read aggregated...
  station-meta-hist   Aggregate station name, lat/lng info from ride...
  station-modes-json  Compute canonical station names, lat/lngs from...
  station-pairs-json  Write station-pair ride_counts keyed by...
  yms                 Print one or more YM (year-month) ranges, e.g.:
ctbk zip --help
Usage: ctbk zip [OPTIONS] COMMAND [ARGS]...

  Read .csv.zip files from s3://tripdata

Options:
  --help  Show this message and exit.

Commands:
  urls  Print URLs for selected datasets
ctbk csv --help
Usage: ctbk csv [OPTIONS] COMMAND [ARGS]...

  Extract CSVs from "tripdata" .zip files. Writes to <root>/ctbk/csvs.

Options:
  --help  Show this message and exit.

Commands:
  urls    Print URLs for selected datasets
  create  Create selected datasets
  sort    Sort one or more `.csv{,.gz}`'s in-place, remove empty lines
ctbk normalized --help
Usage: ctbk normalized [OPTIONS] COMMAND [ARGS]...

  Normalize "tripdata" CSVs (combine regions for each month, harmonize column
  names, etc. Populates directory `<root>/ctbk/normalized/YYYYMM/` with files
  of the form `YYYYMM_YYYYMM.parquet`, for each pair of (start,end) months
  found in a given month's CSVs.

Options:
  --help  Show this message and exit.

Commands:
  urls    Print URLs for selected datasets
  create  Create selected datasets
ctbk partition --help
Usage: ctbk partition [OPTIONS] [YM_RANGES_STR]

  Separate pre-2024 parquets (`normalized/v0`) by {src,start,end} months.

Options:
  --help  Show this message and exit.
ctbk consolidate --help
Usage: ctbk consolidate [OPTIONS] [YM_RANGES_STR]

  Consolidate `normalized/YM/YM_YM.parquet` files into a single
  `normalized/YM.parquet`, containing all rides ending in the given month.

Options:
  -c, --col TEXT  Columns to backfill; default: ['Birth Year', 'Gender', 'Bike
                  ID']
  -n, --dry-run   Print stats about fields that would be backfilled, but don't
                  perform any writes
  --help          Show this message and exit.
ctbk aggregated --help
Usage: ctbk aggregated [OPTIONS] COMMAND [ARGS]...

  Aggregate normalized ride entries by various columns, summing ride counts or
  durations. Writes to <root>/ctbk/aggregated/KEYS_YYYYMM.parquet.

Options:
  --help  Show this message and exit.

Commands:
  urls    Print URLs for selected datasets
  create  Create selected datasets
ctbk station-meta-hist --help
Usage: ctbk station-meta-hist [OPTIONS] COMMAND [ARGS]...

  Aggregate station name, lat/lng info from ride start and end fields. Writes
  to <root>/ctbk/stations/meta_hists/KEYS_YYYYMM.parquet.

Options:
  --help  Show this message and exit.

Commands:
  urls    Print URLs for selected datasets
  create  Create selected datasets
ctbk station-modes-json --help
Usage: ctbk station-modes-json [OPTIONS] COMMAND [ARGS]...

  Compute canonical station names, lat/lngs from StationMetaHists. Writes to
  <root>/ctbk/aggregated/YYYYMM/stations.json.

Options:
  --help  Show this message and exit.

Commands:
  urls    Print URLs for selected datasets
  create  Create selected datasets
ctbk station-pairs-json --help
Usage: ctbk station-pairs-json [OPTIONS] COMMAND [ARGS]...

  Write station-pair ride_counts keyed by StationModes' JSON indices. Writes
  to <root>/ctbk/aggregated/YYYYMM/se_c.json.

Options:
  --help  Show this message and exit.

Commands:
  urls    Print URLs for selected datasets
  create  Create selected datasets

Subcommands: urls, create

Each of the ctbk commands above supports two further subcommands:

  • urls: print the URLs that would be read from or written to
  • create: compute and save the relevant data to those URLs (optionally no-op'ing if already present, overwriting, or failing if not present)

Examples

urls: print URLs

Print URLs for 3 months of normalized data in the local s3/ folder:

ctbk normalized -d 202206-202209 urls
# s3/ctbk/normalized/202206.parquet
# s3/ctbk/normalized/202207.parquet
# s3/ctbk/normalized/202208.parquet

create: create+save data

Compute one month of normalized ride data:

ctbk normalized -d202101 create

This reads upstream CSVs from the local s3/ctbk/csvs/ directory and writes normalized parquet files to s3/ctbk/normalized/.

Note: stderr messages about Rideable Type not being found are due to older months predating the addition of that column in February 2021.

Current create options include:

  • -e, --engine: Parquet engine selection
  • -t, --name-type INTEGER: CSV name-type preference
  • -G, --no-git: Skip git/DVC workflow integration

Generate all the data used by ctbk.dev in the local s3/ctbk directory:

ctbk station-pairs-json create

  • station-pairs-json (abbreviated as spj) is the final derived data product in the diagram above
  • Creating station-pair JSONs requires creating all predecessor datasets in the pipeline
  • Data is stored in the local s3/ctbk/ directory structure
  • Initial TripdataZips are downloaded from the public s3://tripdata bucket

⚠️ takes O(hours), streams ≈7GB of .csv.zips from s3://tripdata, writes ≈12GiB under s3/ctbk/ locally.

Abbreviated command names

Abbreviations for each subcommand are supported, e.g. n for normalized:

ctbk n -d2022- urls

GitHub Actions

ci.yml breaks each derived dataset into a separate job, for example:

ctbk dev gha dag

It also includes a final call to generate JSON used by the main plot at ctbk.dev:

ctbk ymrgtb-cd

Any changes are pushed to the www branch, which triggers the www.yml GHA.

The code for the site is under ../www.