Commit ec404d8

📝 add lakeground repository scaffold article
1 parent fb3fafe commit ec404d8

2 files changed: +120 -3 lines changed

‎content/Projects/Lakeground/00. Concept & Motivation.md‎

Lines changed: 7 additions & 3 deletions
@@ -1,11 +1,15 @@
 ---
 title: Concept & Motivation
+date: 2025-02-01
+draft: false
 tags:
 - projects
 - lakeground
 - data-engineering
 - python
 ---
+---
+
 I’m thrilled to kick off a new project that’s been brewing in my mind for a while. Meet **Lakeground**: a fusion of a Data Lake and a _playground_, where the goal is to experiment, learn, and build a fully open-source, end-to-end Data Engineering stack. This project is all about exploring creative ways to solve data challenges using free tools, while keeping everything modular and fun to work with.
 
 The idea behind **Lakeground** is to build something that feels like a sandbox for Data Engineering enthusiasts. Imagine a system where you can piece together components, break them apart, and experiment freely, all while building something functional. Each part of the stack will be its own standalone tool, designed to work independently but also integrate seamlessly with the others to form a complete pipeline.
@@ -28,12 +32,12 @@ Because Data Engineering should be more than just building pipelines for work
 
 This is just the beginning. Over the coming weeks (and maybe months), I’ll be sharing updates on how **Lakeground** is shaping up. You can expect detailed write-ups on design decisions, hands-on exploration of open-source tools, and maybe even a few missteps along the way.
 
-I’ll also be setting up a _monorepo_ on GitHub, where each component of **Lakeground** will live as a separate directory (a _submodule_). These components will not only function independently but also come together to form a complete stack. This modularity will make it easier to experiment with specific tools or workflows without needing to set up the entire pipeline every time.
+I’ll be setting up a central repository on GitHub with a _monorepo_-inspired structure, but each component of **Lakeground** will live as a separate directory (a _submodule_, meaning each directory is its own repository). These components will function independently while still being part of a cohesive stack. This modular approach makes it easier to experiment with specific tools or workflows without needing to set up the entire pipeline every time.
 
-> [!faq]- monorepo
+> [!faq] monorepo
 > A monorepo (short for _monolithic repository_) is a single Git repository that houses the code for multiple projects or components. Instead of separating each part of a system into its own repo, everything lives together, making it easier to share code, manage dependencies, and ensure consistency. ^monorepo
 
-> [!faq]- submodule
+> [!faq] submodule
 > A Git submodule is a repository embedded within another repository. It allows you to manage multiple, independent projects while keeping them connected.
 
 ---
Lines changed: 113 additions & 0 deletions
@@ -0,0 +1,113 @@
---
title: Repository Scaffold
date: 2025-02-08
draft: false
tags:
- projects
- lakeground
- data-engineering
- git
---
---

The foundation of [[00. Concept & Motivation|Lakeground]] is modularity without sacrificing cohesion. To achieve this, I’m structuring the project around a **monorepo-ish** design, where all components live under a single repository. At the same time, I want each major part to stay self-contained. That’s where `git` **submodules** come in: they let each component function as an independent repository while still being part of the larger whole. It might sound a bit unconventional (or just plain confusing) at first, but by the end of this article, I promise it’ll make a lot more sense.

---

### A Monorepo-ish Design?

A **monorepo** is a single repository that houses multiple related projects. This setup makes it easier to share code, ensure consistency, and manage dependencies between different parts of the stack. However, it also comes with challenges: keeping clear module boundaries, avoiding unnecessary coupling, and handling repository size over time.

On the other hand, a **multi-repo** approach splits projects into separate repositories, where each module or service is developed, versioned, and deployed independently. This structure offers flexibility, cleaner ownership, and a reduced risk of unintended dependencies. The downside? Coordinating changes across multiple repositories can be a pain, and maintaining visibility into the overall project state isn’t always straightforward.

So, technically, my approach isn’t a **monorepo**. While everything is linked within a central repository, each module exists as its own independent repository with its own versioning and lifecycle, which makes this a **multi-repo** design at its core. My goal is to get the best of both worlds: the visibility and cohesion of a **monorepo** while keeping the flexibility of a **multi-repo** approach.

That’s why I’m structuring the project around a **core-repo**: a _single entry point_ that tracks all modules without centralizing development. Each module is housed in its own repository and functions as an independent **component**, but everything is still connected within the core structure. This setup allows me to maintain modularity while keeping a high-level view of the entire system. It’s not exactly conventional, but that’s what makes it interesting.

> [!note] monorepos vs. multi-repos
> To be honest, I don’t believe in silver bullets, just in the right tool for the job. There’s a whole debate about **monorepos** vs. **multi-repos**, and my take is simple: _it depends_. The best approach varies based on the solution being built, the company structure, and even the team’s maturity level.
>
> In my case, everything is experimental. I want to see how this setup influences the coupling between different parts of the data stack. Maybe it works perfectly, maybe I’ll regret it in a few weeks, but that’s the fun of it.
>
> If you’re curious about **monorepos** and want to dive deeper, [this site](https://monorepo.tools/) is an amazing resource. For a breakdown of the differences between the **monorepo** and **multi-repo** approaches, including the pros and cons of each, I highly recommend [this article](https://www.thoughtworks.com/insights/blog/agile-engineering-practices/monorepo-vs-multirepo).

---

### Why Submodules?

The first time I saw a GitHub repo where a folder was actually a link to another repository, I was fascinated. It felt awesome: a single central repository, with multiple projects kept independent but connected. With `git` **submodules**, each component of [[00. Concept & Motivation|Lakeground]] remains its own separate repository, meaning I can version and manage them independently while still linking them back to the main structure.

**Submodules** allow me to develop a component in isolation and then integrate it into the broader project by linking it from the main repository. This approach keeps things clean and avoids the typical downsides of a massive monolithic repository where everything is tangled together.

#### Working with Submodules

To add a **submodule**:

```sh
# REPO_URL is the submodule's repository, PATH the directory it will live in
git submodule add REPO_URL PATH
```

For example, to add a module’s repository under the `component` directory, the command looks like this:

```sh
# Links the lakeground-component repository at ./component
git submodule add https://github.com/alanmmolina/lakeground-component.git component
```

Cloning a repository with **submodules** requires an extra step. A regular `git clone` leaves the **submodule** directories empty, so we pass a flag that also initializes and updates them:

```sh
# Clone the main repository and all of its submodules in one go
git clone --recurse-submodules REPO_URL
```

Or, if we forgot to do that when cloning, we can initialize them later:

```sh
# Run from inside an already cloned main repository
git submodule update --init --recursive
```
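
To see the difference between the two cloning paths without touching GitHub, here is a minimal sketch that builds throwaway repositories in a temporary directory. The names `component-src`, `core`, and `plain-clone` are hypothetical stand-ins, not the real Lakeground repositories, and the `protocol.file.allow` setting is only there because the demo clones from a local path.

```sh
set -eu

# Placeholder identity so the throwaway commits work on any machine.
export GIT_AUTHOR_NAME="demo" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="demo" GIT_COMMITTER_EMAIL="demo@example.com"

work=$(mktemp -d)
cd "$work"

# Hypothetical component repository with a single commit.
git init -q component-src
echo "hello" > component-src/README.md
git -C component-src add README.md
git -C component-src commit -q -m "initial commit"

# Hypothetical core repository that links the component as a submodule.
# protocol.file.allow is only needed because this demo clones from a local path.
git init -q core
git -C core -c protocol.file.allow=always submodule --quiet add "$work/component-src" component
git -C core commit -q -m "add component submodule"

# A plain clone creates the `component` directory but leaves it empty.
git clone -q core plain-clone
status_before=$(git -C plain-clone submodule status)
echo "$status_before"   # the leading "-" marks an uninitialized submodule

# Initializing afterwards checks out the commit pinned by the core repo.
git -C plain-clone -c protocol.file.allow=always submodule --quiet update --init --recursive
ls plain-clone/component
```

Running `git submodule status` before and after the last step is a quick way to check whether a fresh clone still needs initializing.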

Each **submodule** is an independent repository, and the main repository only records which commit of it to use, so new commits inside a **submodule** won’t automatically show up in the main repository. To update a **submodule** to its latest version, we navigate inside it and pull the latest changes:

```sh
# Assumes the submodule's default branch is called main
cd component
git pull origin main
cd -
```

Then, back in the main repo, we commit the updated reference:

```sh
# Stage the new submodule commit pointer and record it
git add component
git commit -m "update component"
```
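
The reason for that second commit is that the main repository stores each **submodule** as a pointer to one specific commit. The sketch below, again using hypothetical throwaway repositories (`component-src` and `core`) and assuming a `git` version that supports `git init -b`, shows the pointer staying put until we pull and commit.

```sh
set -eu

# Placeholder identity so the throwaway commits work on any machine.
export GIT_AUTHOR_NAME="demo" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="demo" GIT_COMMITTER_EMAIL="demo@example.com"

work=$(mktemp -d)
cd "$work"

# Hypothetical component repository with one commit on `main`.
git init -q -b main component-src
echo "v1" > component-src/VERSION
git -C component-src add VERSION
git -C component-src commit -q -m "v1"

# Hypothetical core repository that links it under `component`.
# protocol.file.allow is only needed because this demo clones from a local path.
git init -q -b main core
git -C core -c protocol.file.allow=always submodule --quiet add "$work/component-src" component
git -C core commit -q -m "add component"

# The core repo records a gitlink: mode 160000 plus the pinned commit hash.
git -C core ls-tree HEAD component
pinned_before=$(git -C core rev-parse HEAD:component)

# A new commit lands in the component's own repository...
echo "v2" > component-src/VERSION
git -C component-src commit -q -am "v2"

# ...but the core repo keeps the old pointer until we pull and commit.
git -C core/component pull -q origin main
git -C core add component
git -C core commit -q -m "update component"
pinned_after=$(git -C core rev-parse HEAD:component)

echo "before: $pinned_before"
echo "after:  $pinned_after"
```

As an alternative to the `cd` and `git pull` steps, `git submodule update --remote component` fetches the submodule's upstream and checks out its latest commit from the main repository; the `git add` and `git commit` are still needed afterwards.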

When working with **submodules**, `git` needs a way to track their locations and source repositories. That’s where the `.gitmodules` file comes in. This file lives at the root of the main repository, is committed along with everything else, and keeps a record of every **submodule** we’ve added.

A typical `.gitmodules` file looks like this:

```ini
[submodule "component"]
    path = component
    url = https://github.com/alanmmolina/lakeground-component.git
```

Each section corresponds to a **submodule** and contains its name, the path where it lives inside the main repository, and the URL of the external repository it is fetched from. This file ensures that whenever someone clones the repository, `git` knows where each **submodule** comes from.
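
Because `.gitmodules` uses the same syntax as other `git` config files, it can be queried with `git config --file`. The sketch below recreates the example file in a scratch directory and reads it back; the variable names `sub_url` and `sub_path` are just illustrative.

```sh
set -eu

work=$(mktemp -d)
cd "$work"

# Recreate the example .gitmodules in a scratch directory.
cat > .gitmodules <<'EOF'
[submodule "component"]
    path = component
    url = https://github.com/alanmmolina/lakeground-component.git
EOF

# Look up a single submodule's settings by name.
sub_url=$(git config --file .gitmodules --get submodule.component.url)
sub_path=$(git config --file .gitmodules --get submodule.component.path)
echo "$sub_path -> $sub_url"
# prints: component -> https://github.com/alanmmolina/lakeground-component.git

# List the path of every submodule recorded in the file.
git config --file .gitmodules --get-regexp '^submodule\..*\.path$'
```

This is handy in scripts that need to loop over all components without hardcoding their paths.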

> [!tip] Managing `git` the easy way
> As much as I appreciate the power of the command line, I have to admit that I’m a bit lazy. I prefer working with `git` through **VS Code**, where I can visualize changes, manage branches, and switch between **submodules** effortlessly. The `git` panel in VS Code makes it easy to see which files have changed, stage updates, and resolve conflicts in a much more intuitive way than dealing with raw commands.
>
> For **submodules**, VS Code’s interface allows me to open them as separate repositories, making it simple to commit changes to a specific module without affecting the main repo. This workflow keeps everything neat and helps me stay focused on the code rather than on `git` mechanics.
>
> If you want to learn more about it, you can find plenty of information [here](https://code.visualstudio.com/docs/sourcecontrol/overview).

---

The goal of this setup is to keep the project structured, flexible, and scalable. Whether I’m tweaking an ingestion pipeline or refining data storage, I can do so in isolation without disrupting the entire system. It’s still early, and I’m sure there will be plenty of lessons along the way, but I’m excited to see how it all unfolds.

If you’re working with **monorepos**, **multi-repos**, or **submodules**, or if you have a different approach to structuring multi-component projects, I’d love to hear your thoughts!

---

With the **core-repo** and **components** design in place, the next challenge is managing Python environments efficiently across all these components. That’s where [uv](https://docs.astral.sh/uv/) comes in. It offers a _workspace_ feature that aligns well with this modular structure, allowing me to maintain separate dependencies for each **submodule** while keeping everything under one roof.

In the next post, I’ll dive into **uv** and how it fits into the [[00. Concept & Motivation|Lakeground]] ecosystem.