- `etl` - Main project that contains the session enrichment logic
- `etl-local` - Auxiliary project to run jobs with Spark Standalone Mode
- `data-generator` - Project to generate input logs for `SessionLogsEnrichmentJob`
- `me.vitaly.etl.jobs.SessionLogsEnrichmentJob` - the main job, which takes raw logs and previously handled session logs as input and saves new session logs in directories partitioned by year, month, and day.
- `me.vitaly.etl.runners.SessionLogsEnrichmentJobRunner` - the runner of `SessionLogsEnrichmentJob`, which:
  - Validates that the input logs have not already been processed.
  - Calculates the files to process based on configs and input parameters.
  - Marks files as processed after the job is finished.
- `me.vitaly.etl.jobs.SessionLogsEnrichmentJobTest` - parameterized unit tests that check common and edge cases.
- `etl/src/main/resources/application.conf` - file with configurations
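
The three responsibilities of the runner listed above can be sketched in plain Kotlin, without Spark. This is only an illustration under stated assumptions: `ProcessedFilesRegistry`, `runOnce`, and the `part-*.json` file names are hypothetical and do not come from the project. The sketch simply skips inputs that were already processed, whereas the real runner may fail validation instead.

```kotlin
// Hypothetical sketch: none of these names come from the project itself.

// Remembers which input files have already been handled.
class ProcessedFilesRegistry {
    private val processed = mutableSetOf<String>()

    fun isProcessed(path: String): Boolean = path in processed

    fun markProcessed(paths: Collection<String>) {
        processed.addAll(paths)
    }
}

// Runs `job` on the not-yet-processed candidates and returns the files it handled.
fun runOnce(
    candidates: List<String>,
    registry: ProcessedFilesRegistry,
    job: (List<String>) -> Unit
): List<String> {
    // Validate and select: drop inputs that were processed earlier, keep the rest.
    val toProcess = candidates.filterNot { registry.isProcessed(it) }
    if (toProcess.isEmpty()) return emptyList()

    // Run the job; if it throws, nothing below is marked as processed.
    job(toProcess)

    // Mark files as processed only after the job has finished.
    registry.markProcessed(toProcess)
    return toProcess
}

fun main() {
    val registry = ProcessedFilesRegistry()
    // Assumed file names, for illustration only.
    val files = listOf(
        "data/raw/year=2021/month=04/day=14/part-0.json",
        "data/raw/year=2021/month=04/day=14/part-1.json"
    )
    val first = runOnce(files, registry) { batch -> println("processing ${batch.size} files") }
    val second = runOnce(files, registry) { batch -> println("processing ${batch.size} files") }
    println(first.size)   // prints 2
    println(second.size)  // prints 0, because both files are already marked
}
```

Marking files only after the job returns means a failed run leaves its inputs eligible for a retry.
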
- Run `me.vitaly.etl.generator.DataGeneratorKt.main` from the `data-generator` project to generate logs in `data/raw/year=2021/month=04/day=14`. Note that this is a naive generator without any session logic.
- Run `me.vitaly.etl.local.SessionLogsLocalRunnerKt.main` from the `etl-local` project. It runs `SessionLogsEnrichmentJob` on Spark Standalone using the config `etl/src/main/resources/application.conf` and the data from the `data` folder. The results should be saved to `data/processed/year=2021/month=04/day=13`.
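
The input and output directories above follow a `year=/month=/day=` partition layout. As a minimal sketch of how such a path could be derived from a date, assuming a hypothetical helper `partitionPath` that is not part of the project:

```kotlin
import java.time.LocalDate

// Hypothetical helper: builds a partition directory such as
// data/raw/year=2021/month=04/day=14 from a base directory and a date.
fun partitionPath(baseDir: String, date: LocalDate): String {
    // Month and day are zero-padded to two digits, as in the paths above.
    val month = date.monthValue.toString().padStart(2, '0')
    val day = date.dayOfMonth.toString().padStart(2, '0')
    return "$baseDir/year=${date.year}/month=$month/day=$day"
}

fun main() {
    println(partitionPath("data/raw", LocalDate.of(2021, 4, 14)))
    // prints data/raw/year=2021/month=04/day=14
    println(partitionPath("data/processed", LocalDate.of(2021, 4, 13)))
    // prints data/processed/year=2021/month=04/day=13
}
```
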