Let's back to school with an example and demonstrate the use of tests.
First, usual bootstrapping of database for lea, it goes by creating a .env:
echo "
LEA_USERNAME=max
LEA_WAREHOUSE=duckdb
LEA_DUCKDB_PATH=school.db
" > .envThis example uses DuckDB as the datawarehouse.
You can run the scripts:
lea runLea will create schema in DuckDB, create audit tables based on scripts definition. Once audit tables are generated, lea run tests against audit tables.
Let's review some tests together.
Vizualize students in this school :
python -c import duckdb; print(duckdb.connect('school_max.db').execute('SELECT student_id, first_name, last_name, university FROM staging.students').df()) student_id first_name last_name university
0 1 Lauren Levine Stanford University
1 2 Daniel Lopez Massachusetts Institute of Technology
2 3 Melanie Foster University of California Berkeley
3 4 Gabriel Cooke Harvard University
4 5 Anne Porter Harvard University
5 6 Amy Lee Princeton University
6 7 Rebecca Chavez Princeton University
You can see some students, let's review their grades !
python -c "import duckdb; print(duckdb.connect('school_max.db').execute('SELECT student_id, student_name, class_name, semester, average_grade FROM core.yearly_results USING SAMPLE 5').df())" student_id student_name class_name semester average_grade
0 6 Amy Lee Mathematics Semester 1 59.0
1 5 Anne Porter Literature Semester 2 100.0
2 5 Anne Porter Physics Semester 2 46.0
3 1 Lauren Levine Biology Semester 1 28.5
4 1 Lauren Levine Literature Semester 2 52.5
Awesome ! Pretty good students, let's review some tests made.
WITH raw_students AS (
SELECT * FROM './seeds/raw_students.csv'
)
SELECT
-- #UNIQUE
-- #NO_NULLS
id AS student_id,
first_name,
-- #UNIQUE_BY(first_name)
last_name,
-- #SET{'Stanford University', 'University of California Berkeley', 'Princeton University', 'Harvard University', 'Massachusetts Institute of Technology'}
university,
FROM raw_students;During the Write-Audit-Publish pattern, thoses checks will ensure Data Quality making assertions tests.
Here for instance, the staging model during Audit step will ensure that :
student_idvalues are not null and uniquelast_nameare unique by first_nameuniversityvalues are in the exposed list
Let's break a test on purpose for demonstration :
Under seeds/raw_students, let's add a new student :
echo "8,Andy,Bernard,Cornell University,23" >> seeds/raw_students.csvLet's run again scripts :
lea runCornell University is not allowed here :
β Early ending because an error occurred
π΄ Ending session
STOPPED school_max.core.yearly_results___audit
SUCCESS school_max.core.yearly_results___audit, contains 112 rows
ERRORED school_max.tests.staging__students__university___set___audit
university
0 Cornell University
β Finished, took less than a second π
Remove last line added to restore source:
sed -i '' '$d' seeds/raw_students.csvAs audit prevented from corrupting intermediate tables, your tables are still healthy.
However, as our audit tables are messy and not sync with source, let's rerun them:
lea run --restartIt will flush the audit table, as if it was a fresh start.
You might think now, "hey, how does lea know which audit table should be run again or not ?". That's an excellent question !
Have you noticed that lea automatically skip table that are not relevant to process during audits ?
Let's see together with a closer look by practicing with an example !
First you can vizualize award winner of scholarships:
Each top performing student get a 1000$ grant and 2nd gets a 500$ grant.
You can see the winners of the year in Economics:
python -c "import duckdb; print(duckdb.connect('school_max.db').execute('SELECT student_name, domain, scholarship_amount FROM analytics.scholarship_award WHERE domain = \'Economics\'').df())" student_name domain scholarship_amount
0 Daniel Lopez Economics 1000
1 Gabriel Cooke Economics 500
You can review the total amount of money spent :
lea run --select analytics.finance.expenses
python -c "import duckdb; print(duckdb.connect('school_max.db').execute('SELECT total_expenses FROM analytics.finance__expenses').df())" total_expenses
0 12000.0
Let's modify a script and demonstrate that lea will run again only scripts that have been modified.
Good news, the academy got 2x more budget this year ! You can deliver a scholarship award for top performing student each semester.
To apply this evolution, uncomment all lines under analytics.scholarship_award with --uncomment here and comment the --comment here ones
sed -i '' '/--comment here/s/^/--/' scripts/analytics/scholarship_award.sql
sed -i '' '/--uncomment here/s/-- //' scripts/analytics/scholarship_award.sqlThen run again the finance script.
lea run --select analytics.finance.expensesOh no, the budget test is failing ! Modify the value under scripts/tests/budget.sql :
sed -i '' '/--comment here/s/^/--/' scripts/tests/budget.sql
sed -i '' '/--uncomment here/s/-- //' scripts/tests/budget.sqlNow let's run again the scripts :
lea runEverything pass π
Look closely : **audit tables haven't been materialized again for school*max.core.yearly_results***audit
as they were already existing and the script modification date was anterior to materialization date !
But it would has been executed, if the script was modified prior last table materialization.
You can check the table materialization date with :
python -c "import duckdb; print(duckdb.connect('school_max.db').execute('SELECT MAX(_materialized_timestamp) AS last_materialized FROM analytics.scholarship_award').df())"
last_materialized
0 2025-03-14 00:31:28.114
Now the school has extra budget, you can view the new scholarship award winners !
There is twice more winners now, 2 at each semester :
python -c "import duckdb; print(duckdb.connect('school_max.db').execute('SELECT student_name, domain, semester, scholarship_amount FROM analytics.scholarship_award WHERE domain = \'Economics\'').df())" student_name domain semester scholarship_amount
0 Lauren Levine Economics Semester 2 1000
1 Gabriel Cooke Economics Semester 2 500
2 Daniel Lopez Economics Semester 1 1000
3 Gabriel Cooke Economics Semester 1 500
As you can see now, the expenses have doubled :
lea run --select analytics.finance.expenses
python -c "import duckdb; print(duckdb.connect('school_max.db').execute('SELECT total_expenses FROM analytics.finance__expenses').df())" total_expenses
0 24000.0