ctbk python library

CLI for generating ctbk.dev datasets (derived from Citi Bike public data in s3://tripdata).

Data flow

flowchart LR;
z["TripdataZips\ns3://tripdata"]
c["TripdataCsvs\ns3://ctbk/csvs"]
n["NormalizedMonths\ns3://ctbk/normalized/YYYYMM.parquet"]
agg_sc["AggregatedMonths(YYYYMM, 's', 'c')\ns3://ctbk/aggregated/s_c_YYYYMM.parquet"]
agg_sec["AggregatedMonths(YYYYMM, 'se', 'c')\ns3://ctbk/aggregated/se_c_YYYYMM.parquet"]
agg_ymrgtb["AggregatedMonths(YYYYMM, 'ymrgtb', 'cd')\ns3://ctbk/aggregated/ymrgtb_cd_YYYYMM.parquet"]
smh_in["StationMetaHists(YYYYMM, 'in')\ns3://ctbk/stations/meta_hists/in_YYYYMM.parquet"]
smh_il["StationMetaHists(YYYYMM, 'il')\ns3://ctbk/stations/meta_hists/il_YYYYMM.parquet"]
sm["StationModes\ns3://ctbk/aggregated/YYYYMM/stations.json"]
spj["StationPairJsons\ns3://ctbk/aggregated/YYYYMM/se_c.json"]

z --> c --> n
n --> agg_sc
n --> agg_sec
n --> agg_ymrgtb
n --> smh_in
n --> smh_il
smh_in --> sm
smh_il --> sm
agg_sc --> sm
sm --> spj
agg_sec --> spj

TripdataZips (a.k.a. zips): public Citi Bike .csv.zip files

  • Released as NYC and JC .csv.zip files at s3://tripdata
  • See https://tripdata.s3.amazonaws.com/index.html

TripdataCsvs (a.k.a. csvs): unzipped and gzipped CSVs

  • Writes <root>/ctbk/csvs/YYYYMM.csv
  • See also: s3://ctbk/csvs

NormalizedMonths (a.k.a. norms): normalize csvs

  • Merge regions (NYC, JC) for the same month, harmonize columns, drop duplicate data, etc.
  • Writes <root>/ctbk/normalized/YYYYMM.parquet
  • See also: s3://ctbk/normalized
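
The normalization step can be sketched roughly as follows; the column names and renames here are purely illustrative (the real schema mapping lives in the library), but the shape of the operation — rename per-region columns onto a shared schema, concatenate, de-duplicate — matches the description above:

```python
import pandas as pd

# Hypothetical column harmonization: map each region's raw CSV column
# names onto one shared schema (these names are illustrative only).
RENAMES = {
    "starttime": "Start Time",
    "started_at": "Start Time",
    "stoptime": "Stop Time",
    "ended_at": "Stop Time",
}

def normalize_month(region_dfs):
    """Merge per-region DataFrames for one month into a single frame."""
    frames = [df.rename(columns=RENAMES) for df in region_dfs]
    merged = pd.concat(frames, ignore_index=True)
    # Drop rows duplicated across regional files.
    return merged.drop_duplicates().reset_index(drop=True)

# Toy NYC and JC frames using different raw column names:
nyc = pd.DataFrame({"starttime": ["2021-01-01 00:00"], "stoptime": ["2021-01-01 00:10"]})
jc = pd.DataFrame({"started_at": ["2021-01-01 01:00"], "ended_at": ["2021-01-01 01:20"]})
norm = normalize_month([nyc, jc])
```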

AggregatedMonths (a.k.a. aggs): compute histograms over each month's rides:

  • Group by any of several "aggregation keys" ({year, month, day, hour, user type, bike type, start and end station, …})
  • Produce any "sum keys" ({ride counts, duration in seconds})
  • Writes <root>/ctbk/aggregated/KEYS_YYYYMM.parquet
  • See also: s3://ctbk/aggregated/*.parquet
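
Conceptually, each aggregation is a groupby over the chosen keys with one column per sum key. A minimal pandas sketch (toy data and column names assumed, not the library's actual schema):

```python
import pandas as pd

# Toy normalized rides (real parquets carry many more columns).
rides = pd.DataFrame({
    "Start Station ID": ["A", "A", "B"],
    "End Station ID":   ["B", "B", "A"],
    "Duration":         [300, 600, 450],  # seconds
})

# Group by start ("s") and end ("e") station, producing both sum keys:
# ride count ("c") and total duration in seconds ("d").
agg = (
    rides
    .groupby(["Start Station ID", "End Station ID"])
    .agg(Count=("Duration", "size"), Duration=("Duration", "sum"))
    .reset_index()
)
```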

StationMetaHists (a.k.a. smhs): compute station {id,name,lat/lng} histograms:

  • Similar to aggs, but counts station {id,name,lat/lng} tuples that appear as each ride's start and end stations (whereas agg's rows are 1:1 with rides)
  • "agg_keys" can include id (i), name (n), and lat/lng (l); there are no "sum_keys" (only counting is supported)
  • Writes <root>/ctbk/stations/meta_hists/KEYS_YYYYMM.parquet
  • See also: s3://ctbk/stations/meta_hists
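
The "each ride contributes its start and end station" counting can be sketched like this (column names are assumptions for illustration):

```python
import pandas as pd

rides = pd.DataFrame({
    "Start Station ID":   ["A", "A"],
    "Start Station Name": ["1 Ave & 1 St", "1 Ave & 1 St"],
    "End Station ID":     ["B", "A"],
    "End Station Name":   ["2 Ave & 2 St", "First Ave & 1 St"],
})

# Stack start and end tuples into one frame, so each ride contributes
# two rows, then count occurrences of each (id, name) pair.
starts = rides[["Start Station ID", "Start Station Name"]].set_axis(["id", "name"], axis=1)
ends = rides[["End Station ID", "End Station Name"]].set_axis(["id", "name"], axis=1)
hist = (
    pd.concat([starts, ends], ignore_index=True)
    .value_counts()          # counts unique (id, name) rows, descending
    .rename("count")
    .reset_index()
)
```

Note how the histogram has one row per distinct (id, name) tuple, not one row per ride — the distinction drawn above versus aggs.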

StationModes (a.k.a. sms): canonical {id,name,lat/lng} info for each station:

  • Computed from StationMetaHists:
    • name is chosen as the "mode" (most commonly listed name for that station ID)
    • lat/lng is taken to be the mean of the lat/lngs reported for each ride's start and end station
  • Writes <root>/ctbk/aggregated/YYYYMM/stations.json
  • See also: s3://ctbk/aggregated/YYYYMM/stations.json
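
A sketch of the mode/mean computation, assuming a toy meta-hist frame (the count-weighted mean here is one plausible reading of "mean of the lat/lngs reported for each ride"):

```python
import pandas as pd

# Toy (id, name, lat, lng, count) rows, as a StationMetaHist might yield.
hist = pd.DataFrame({
    "id":    ["A", "A", "A"],
    "name":  ["1 Ave & 1 St", "1 Ave & 1 St", "First Ave & 1 St"],
    "lat":   [40.70, 40.71, 40.72],
    "lng":   [-74.00, -74.01, -74.02],
    "count": [5, 3, 2],
})

def station_mode(g):
    # Most commonly listed name for this station ID …
    name = g.groupby("name")["count"].sum().idxmax()
    # … and the ride-count-weighted mean of reported lat/lngs.
    lat = (g["lat"] * g["count"]).sum() / g["count"].sum()
    lng = (g["lng"] * g["count"]).sum() / g["count"].sum()
    return pd.Series({"name": name, "lat": lat, "lng": lng})

modes = hist.groupby("id").apply(station_mode)
```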

StationPairJsons (a.k.a. spjs): counts of rides between each pair of stations:

  • JSON formatted as { <start idx>: { <end idx>: <count> } }
  • idxs are based on order of appearance in StationModes / stations.json above (which is also sorted by station ID)
  • Values are read from AggregatedMonths(<ym>, 'se', 'c'):
    • group by station start ("s") and end ("e"),
    • sum ride counts ("c")
  • Writes <root>/ctbk/aggregated/YYYYMM/se_c.json
  • See also: s3://ctbk/aggregated/YYYYMM/se_c.json
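
Nesting the (start, end, count) aggregation rows into that JSON shape is straightforward; a minimal sketch with toy data (the rows below stand in for AggregatedMonths output after mapping station IDs to stations.json indices):

```python
import json

# Toy (start idx, end idx, ride count) rows.
rows = [(0, 1, 12), (0, 2, 3), (1, 0, 7)]

# Nest into { <start idx>: { <end idx>: <count> } }.
pairs = {}
for s, e, c in rows:
    pairs.setdefault(s, {})[e] = c

# json.dumps serializes the int keys as strings.
blob = json.dumps(pairs)
```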

Installation

Clone this repo and install this library:

git clone https://github.com/hudcostreets/ctbk.dev
cd ctbk.dev
pip install -e ctbk

The ctbk executable will then be available, exposing a subcommand for each of the stages above:

CLI

ctbk
Usage: ctbk [OPTIONS] COMMAND [ARGS]...

  CLI for generating ctbk.dev datasets (derived from Citi Bike public data in `s3://`).
  ## Data flow
  ### `TripdataZips` (a.k.a. `zip`s): Public Citi Bike `.csv.zip` files
  - Released as NYC and JC `.csv.zip` files at s3://tripdata
  - See https://tripdata.s3.amazonaws.com/index.html
  ### `TripdataCsvs` (a.k.a. `csv`s): unzipped and gzipped CSVs
  - Writes `<root>/ctbk/csvs/YYYYMM.csv`
  - See also: https://ctbk.s3.amazonaws.com/index.html#/csvs
  ### `NormalizedMonths` (a.k.a. `norm`s): normalize `csv`s
  - Merge regions (NYC, JC) for the same month, harmonize columns, drop duplicate data, etc.
  - Writes `<root>/ctbk/normalized/YYYYMM.parquet`
  - See also: https://ctbk.s3.amazonaws.com/index.html#/normalized
  ### `AggregatedMonths` (a.k.a. `agg`s): compute histograms over each month's rides:
  - Group by any of several "aggregation keys" ({year, month, day, hour, user type, bike
    type, start and end station, …})
  - Produce any "sum keys" ({ride counts, duration in seconds})
  - Writes `<root>/ctbk/aggregated/KEYS_YYYYMM.parquet`
  - See also: https://ctbk.s3.amazonaws.com/index.html#/aggregated?p=8
  ### `StationMetaHists` (a.k.a. `smh`s): compute station {id,name,lat/lng} histograms:
  - Similar to `agg`s, but counts station {id,name,lat/lng} tuples that appear as each
    ride's start and end stations (whereas `agg`'s rows are 1:1 with rides)
  - "agg_keys" can include id (i), name (n), and lat/lng (l); there are no "sum_keys"
    (only counting is supported)
  - Writes `<root>/ctbk/stations/meta_hists/YYYYMM/KEYS.parquet`
  - See also: https://ctbk.s3.amazonaws.com/index.html#/stations/meta_hists
  ### `StationModes` (a.k.a. `sm`s): canonical {id,name,lat/lng} info for each station:
  - Computed from `StationMetaHist`s:
    - `name` is chosen as the "mode" (most commonly listed name for that station ID)
    - `lat/lng` is taken to be the mean of the lat/lngs reported for each ride's start
      and end station
  - Writes `<root>/ctbk/aggregated/YYYYMM/stations.json`
  - See also: https://ctbk.s3.amazonaws.com/index.html#/aggregated
  ### `StationPairJsons` (a.k.a. `spj`s): counts of rides between each pair of stations:
  - JSON formatted as `{ <start idx>: { <end idx>: <count> } }`
  - `idx`s are based on order of appearance in `StationModes` / `stations.json` above
    (which is also sorted by station ID)
  - Values are read from `AggregatedMonths(YYYYMM, 'se', 'c')`:
    - group by station start ("s") and end ("e"),
    - sum ride counts ("c")
  - Writes `<root>/ctbk/aggregated/YYYYMM/se_c.json`
  - See also: https://ctbk.s3.amazonaws.com/index.html#/aggregated

Options:
  --help            Show this message and exit.

Commands:
  zip                 Read .csv.zip files from s3://tripdata
  csv                 Extract CSVs from "tripdata" .zip files.
  normalized          Normalize "tripdata" CSVs (combine regions for each...
  partition           Separate pre-2024 parquets (`normalized/v0`) by...
  consolidate         Consolidate `normalized/YM/YM_YM.parquet` files...
  aggregated          Aggregate normalized ride entries by various...
  ymrgtb-cd           Read aggregated...
  station-meta-hist   Aggregate station name, lat/lng info from ride...
  station-modes-json  Compute canonical station names, lat/lngs from...
  station-pairs-json  Write station-pair ride_counts keyed by...
  yms                 Print one or more YM (year-month) ranges, e.g.:
ctbk zip --help
Usage: ctbk zip [OPTIONS] COMMAND [ARGS]...

  Read .csv.zip files from s3://tripdata

Options:
  --help  Show this message and exit.

Commands:
  urls  Print URLs for selected datasets
ctbk csv --help
Usage: ctbk csv [OPTIONS] COMMAND [ARGS]...

  Extract CSVs from "tripdata" .zip files. Writes to <root>/ctbk/csvs.

Options:
  --help  Show this message and exit.

Commands:
  urls    Print URLs for selected datasets
  create  Create selected datasets
  sort    Sort one or more `.csv{,.gz}`'s in-place, remove empty lines
ctbk normalized --help
Usage: ctbk normalized [OPTIONS] COMMAND [ARGS]...

  Normalize "tripdata" CSVs (combine regions for each month, harmonize column
  names, etc. Populates directory `<root>/ctbk/normalized/YYYYMM/` with files
  of the form `YYYYMM_YYYYMM.parquet`, for each pair of (start,end) months
  found in a given month's CSVs.

Options:
  --help  Show this message and exit.

Commands:
  urls    Print URLs for selected datasets
  create  Create selected datasets
ctbk partition --help
Usage: ctbk partition [OPTIONS] [YM_RANGES_STR]

  Separate pre-2024 parquets (`normalized/v0`) by {src,start,end} months.

Options:
  --help  Show this message and exit.
ctbk consolidate --help
Usage: ctbk consolidate [OPTIONS] [YM_RANGES_STR]

  Consolidate `normalized/YM/YM_YM.parquet` files into a single
  `normalized/YM.parquet`, containing all rides ending in the given month.

Options:
  -c, --col TEXT  Columns to backfill; default: ['Birth Year', 'Gender', 'Bike
                  ID']
  -n, --dry-run   Print stats about fields that would be backfilled, but don't
                  perform any writes
  --help          Show this message and exit.
ctbk aggregated --help
Usage: ctbk aggregated [OPTIONS] COMMAND [ARGS]...

  Aggregate normalized ride entries by various columns, summing ride counts or
  durations. Writes to <root>/ctbk/aggregated/KEYS_YYYYMM.parquet.

Options:
  --help  Show this message and exit.

Commands:
  urls    Print URLs for selected datasets
  create  Create selected datasets
ctbk station-meta-hist --help
Usage: ctbk station-meta-hist [OPTIONS] COMMAND [ARGS]...

  Aggregate station name, lat/lng info from ride start and end fields. Writes
  to <root>/ctbk/stations/meta_hists/KEYS_YYYYMM.parquet.

Options:
  --help  Show this message and exit.

Commands:
  urls    Print URLs for selected datasets
  create  Create selected datasets
ctbk station-modes-json --help
Usage: ctbk station-modes-json [OPTIONS] COMMAND [ARGS]...

  Compute canonical station names, lat/lngs from StationMetaHists. Writes to
  <root>/ctbk/aggregated/YYYYMM/stations.json.

Options:
  --help  Show this message and exit.

Commands:
  urls    Print URLs for selected datasets
  create  Create selected datasets
ctbk station-pairs-json --help
Usage: ctbk station-pairs-json [OPTIONS] COMMAND [ARGS]...

  Write station-pair ride_counts keyed by StationModes' JSON indices. Writes
  to <root>/ctbk/aggregated/YYYYMM/se_c.json.

Options:
  --help  Show this message and exit.

Commands:
  urls    Print URLs for selected datasets
  create  Create selected datasets

Subcommands: urls, create

Each of the ctbk commands above supports two further subcommands:

  • urls: print the URLs that would be read from or written to
  • create: compute and save the relevant data to those URLs (optionally no-op'ing if already present, overwriting, or failing if not present)

Examples

urls: print URLs

Print URLs for 3 months of normalized data in the local s3/ folder:

ctbk normalized -d 202206-202209 urls
# s3/ctbk/normalized/202206.parquet
# s3/ctbk/normalized/202207.parquet
# s3/ctbk/normalized/202208.parquet

create: create+save data

Compute one month of normalized ride data:

ctbk normalized -d202101 create

This reads upstream CSVs from the local s3/ctbk/csvs/ directory and writes normalized parquet files to s3/ctbk/normalized/.

Note: stderr messages about Rideable Type not being found are due to older months predating the addition of that column in February 2021.

Current create options include:

  • -e, --engine: Parquet engine selection
  • -t, --name-type INTEGER: CSV name-type preference
  • -G, --no-git: Skip git/DVC workflow integration

Generate all the data used by ctbk.dev in the local s3/ctbk directory:

ctbk station-pairs-json create

  • station-pairs-json (abbreviated as spj) is the final derived data product in the diagram above
  • Creating station-pair JSONs requires creating all predecessor datasets in the pipeline
  • Data is stored in the local s3/ctbk/ directory structure
  • Initial TripdataZips are downloaded from the public s3://tripdata bucket

⚠️ takes O(hours), streams ≈7GB of .csv.zips from s3://tripdata, writes ≈12GiB under s3/ctbk/ locally.

Abbreviated command names

Abbreviations for each subcommand are supported, e.g. n for normalized:

ctbk n -d2022- urls

GitHub Actions

ci.yml breaks each derived dataset into a separate job, for example:

ctbk dev gha dag

It also includes a final call to generate JSON used by the main plot at ctbk.dev:

ctbk ymrgtb-cd

Any changes are pushed to the www branch, which triggers the www.yml GHA.

The code for the site is under ../www.