This guide walks you through setting up a local analytics engineering environment using DuckDB and dbt.
Note
This guide will explain how to do the setup manually. If you want an additional challenge, try to run this setup using Docker Compose or a Python virtual environment.
Important: All dbt commands must be run from inside the taxi_rides_ny/ directory. The setup steps below will guide you through:
- Installing the necessary tools
- Configuring your connection to DuckDB
- Loading the NYC taxi data
- Verifying everything works
DuckDB is a fast, in-process SQL database that works great for local analytics workloads. To install DuckDB, follow the instructions on the official site for your operating system.
Tip
You can install DuckDB in two ways: install the CLI, or install the client API for your favorite programming language (for Python, `pip install duckdb`). I personally prefer installing the CLI, but either way is fine.
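If you go the Python route, a quick query confirms the install worked (a minimal sketch; the in-memory connection below creates no files on disk):

```python
import duckdb

# In-memory connection; nothing is written to disk
con = duckdb.connect()
print(duckdb.__version__)                              # installed DuckDB version
print(con.execute("SELECT 42 AS answer").fetchone())   # -> (42,)
con.close()
```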
Next, install dbt with the DuckDB adapter:

```bash
pip install dbt-duckdb
```

This installs:
- dbt-core: The core dbt framework
- dbt-duckdb: The DuckDB adapter for dbt
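If you want to double-check that both packages landed in the environment you're using, a quick standard-library check works (a minimal sketch; the package names are as published on PyPI):

```python
import importlib.metadata

# Print the installed version of each package;
# raises PackageNotFoundError if one is missing
for pkg in ("dbt-core", "dbt-duckdb"):
    print(pkg, importlib.metadata.version(pkg))
```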
Since this repository already contains a dbt project (taxi_rides_ny/), you don't need to run dbt init. Instead, you need to configure your dbt profile to connect to DuckDB.
The dbt profile tells dbt how to connect to your database. Create or update the file ~/.dbt/profiles.yml with the following content:
```yaml
taxi_rides_ny:
  target: dev
  outputs:
    # DuckDB Development profile
    dev:
      type: duckdb
      path: taxi_rides_ny.duckdb
      schema: dev
      threads: 1
      extensions:
        - parquet
      settings:
        memory_limit: '2GB'
        preserve_insertion_order: false

    # DuckDB Production profile
    prod:
      type: duckdb
      path: taxi_rides_ny.duckdb
      schema: prod
      threads: 1
      extensions:
        - parquet
      settings:
        memory_limit: '2GB'
        preserve_insertion_order: false

# Troubleshooting:
# - If you have less than 4GB RAM, try setting memory_limit to '1GB'
# - If you have 16GB+ RAM, you can increase it to '4GB' for faster builds
# - Expected build time: 5-10 minutes on most systems
```
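These settings aren't dbt-specific: the profile just applies them to every connection dbt opens. If you want to experiment with them interactively, you can set the same options on a plain DuckDB connection (a sketch using an in-memory database, mirroring the profile values above):

```python
import duckdb

con = duckdb.connect()  # in-memory; works the same on a file-backed database
# The same options the dev/prod profiles apply
con.execute("SET memory_limit = '2GB'")
con.execute("SET preserve_insertion_order = false")
print(con.execute("SELECT current_setting('memory_limit')").fetchone())
con.close()
```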
Now that your dbt profile is configured, let's load the taxi data into DuckDB. Navigate to the dbt project directory and run the ingestion script:
```python
import duckdb
import requests
from pathlib import Path

BASE_URL = "https://github.com/DataTalksClub/nyc-tlc-data/releases/download"


def download_and_convert_files(taxi_type):
    data_dir = Path("data") / taxi_type
    data_dir.mkdir(exist_ok=True, parents=True)

    for year in [2019, 2020]:
        for month in range(1, 13):
            parquet_filename = f"{taxi_type}_tripdata_{year}-{month:02d}.parquet"
            parquet_filepath = data_dir / parquet_filename

            if parquet_filepath.exists():
                print(f"Skipping {parquet_filename} (already exists)")
                continue

            # Download CSV.gz file
            csv_gz_filename = f"{taxi_type}_tripdata_{year}-{month:02d}.csv.gz"
            csv_gz_filepath = data_dir / csv_gz_filename
            response = requests.get(f"{BASE_URL}/{taxi_type}/{csv_gz_filename}", stream=True)
            response.raise_for_status()
            with open(csv_gz_filepath, 'wb') as f:
                for chunk in response.iter_content(chunk_size=8192):
                    f.write(chunk)

            print(f"Converting {csv_gz_filename} to Parquet...")
            con = duckdb.connect()
            con.execute(f"""
                COPY (SELECT * FROM read_csv_auto('{csv_gz_filepath}'))
                TO '{parquet_filepath}' (FORMAT PARQUET)
            """)
            con.close()

            # Remove the CSV.gz file to save space
            csv_gz_filepath.unlink()
            print(f"Completed {parquet_filename}")


def update_gitignore():
    gitignore_path = Path(".gitignore")
    # Read existing content or start with an empty string
    content = gitignore_path.read_text() if gitignore_path.exists() else ""
    # Add data/ if not already present
    if 'data/' not in content:
        with open(gitignore_path, 'a') as f:
            f.write('\n# Data directory\ndata/\n' if content else '# Data directory\ndata/\n')


if __name__ == "__main__":
    # Update .gitignore to exclude the data directory
    update_gitignore()

    for taxi_type in ["yellow", "green"]:
        download_and_convert_files(taxi_type)

    con = duckdb.connect("taxi_rides_ny.duckdb")
    con.execute("CREATE SCHEMA IF NOT EXISTS prod")
    for taxi_type in ["yellow", "green"]:
        # union_by_name aligns columns by name across the monthly files,
        # so minor schema differences between months don't break the load
        con.execute(f"""
            CREATE OR REPLACE TABLE prod.{taxi_type}_tripdata AS
            SELECT * FROM read_parquet('data/{taxi_type}/*.parquet', union_by_name=true)
        """)
    con.close()
```

This script downloads yellow and green taxi data from 2019-2020, creates the prod schema, and loads the raw data into DuckDB. The download may take several minutes depending on your internet connection.
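Once the script finishes, you can sanity-check the load before moving on to dbt (a minimal sketch; the table names match what the script above creates):

```python
import duckdb

# Open read-only so the check can't modify anything
con = duckdb.connect("taxi_rides_ny.duckdb", read_only=True)
for taxi_type in ["yellow", "green"]:
    rows = con.execute(f"SELECT COUNT(*) FROM prod.{taxi_type}_tripdata").fetchone()[0]
    print(f"prod.{taxi_type}_tripdata: {rows:,} rows")
con.close()
```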
Verify dbt can connect to your DuckDB database:
```bash
dbt debug
```

If everything is configured correctly, the output should end with "All checks passed!".

If you're using Visual Studio Code, install the dbt Power User extension to enhance your dbt development experience.
dbt Power User is a VS Code extension that provides:
- SQL syntax highlighting and formatting for dbt models
- Inline column-level lineage visualization
- Auto-completion for dbt models, sources, and macros
- Interactive documentation preview
- Model compilation and execution directly from the editor
dbt Labs released an official VS Code extension, the dbt Extension, powered by the new dbt Fusion engine. However, this extension requires dbt Fusion and does not support dbt Core.
Since we're using dbt Core with DuckDB for local development, we need the community-maintained dbt Power User extension by AltimateAI instead. This extension:
- Works seamlessly with dbt Core (not just dbt Cloud)
- Supports all dbt adapters, including DuckDB
- Is actively maintained and open source
- Provides a rich feature set for local development
To install it:

1. Open VS Code
2. Go to Extensions (Ctrl+Shift+X / Cmd+Shift+X)
3. Search for "dbt Power User"
4. Install dbt Power User by AltimateAI (not the dbt Labs version)
Alternatively, install it from the VS Code Marketplace.
Note
At this point, your local dbt environment is fully configured and ready to use. The next steps (running models, tests, and building documentation) will be covered in the tutorial videos.