| 1 | +--- |
| 2 | +title: Repository Scaffold |
| 3 | +date: 2025-02-08 |
| 4 | +draft: false |
| 5 | +tags: |
| 6 | + - projects |
| 7 | + - lakeground |
| 8 | + - data-engineering |
| 9 | + - git |
| 10 | +--- |
| 11 | +--- |
| 12 | + |
| 13 | +The foundation of [[00. Concept & Motivation|Lakeground]] is all about modularity without sacrificing cohesion. To achieve this, I’m structuring the project around a **monorepo-ish** design, where all components live under a single repository. At the same time, I want each major part to stay self-contained. That’s where `git` **submodules** come in—they let each component function as an independent repository while still being part of the larger whole. It might sound a bit unconventional (or just plain confusing) at first, but by the end of this article, I promise it’ll make a lot more sense. |
| 14 | + |
| 15 | +--- |
| 16 | + |
| 17 | +### A Monorepo-ish Design? |
| 18 | + |
| 19 | +A **monorepo** is a single repository that houses multiple related projects. This setup makes it easier to share code, ensure consistency, and manage dependencies between different parts of the stack. However, it also comes with challenges—keeping clear module boundaries, avoiding unnecessary coupling, and handling repository size over time. |
| 20 | + |
| 21 | +On the other hand, a **multi-repo** approach splits projects into separate repositories, where each module or service is developed, versioned, and deployed independently. This structure offers flexibility, cleaner ownership, and a reduced risk of unintended dependencies. The downside? Coordinating changes across multiple repositories can be a pain, and maintaining visibility into the overall project state isn’t always straightforward. |
| 22 | + |
| 23 | +So, technically, my approach isn’t a **monorepo**. While everything is linked within a central repository, each module exists as its own independent repository with its own versioning and lifecycle—making this a **multi-repo** design at its core. My goal is to get the best of both worlds: the visibility and cohesion of a **monorepo** while keeping the flexibility of a **multi-repo** approach. |
| 24 | + |
| 25 | +That’s why I’m structuring the project around a **core-repo**—a _single entry point_ that tracks all modules without centralizing development. Each module is housed in its own repository and functions as an independent **component**, but everything is still connected within the core structure. This setup allows me to maintain modularity while keeping a high-level view of the entire system. It’s not exactly conventional, but that’s what makes it interesting. |
| 26 | + |
| 27 | +> [!note] monorepos vs. multi-repos |
| 28 | +> To be honest, I don’t believe in silver bullets—just in the right tool for the job. There’s a whole debate about **monorepos** vs. **multi-repos**, and my take is simple: _it depends_. The best approach varies based on the solution being built, the company structure, and even the team’s maturity level. |
| 29 | +> |
| 30 | +> In my case, everything is experimental. I want to see how this setup influences the coupling between different parts of the data stack. Maybe it works perfectly, maybe I’ll regret it in a few weeks—but that’s the fun of it. |
| 31 | +> |
| 32 | +> If you’re curious about **monorepos** and want to dive deeper, [this site](https://monorepo.tools/) is an amazing resource. For a great breakdown of the differences and trade-offs between the **monorepo** and **multi-repo** approaches, I highly recommend [this article](https://www.thoughtworks.com/insights/blog/agile-engineering-practices/monorepo-vs-multirepo). |
| 33 | + |
| 34 | +--- |
| 35 | + |
| 36 | +### Why Submodules? |
| 37 | + |
| 38 | +The first time I saw a GitHub repo where a folder was actually a link to another repository, I was fascinated. It felt awesome—having a single central repository while keeping multiple projects independent but connected. With `git` **submodules**, each component of [[00. Concept & Motivation|Lakeground]] remains its own separate repository, meaning I can version and manage them independently while still linking them back to the main structure. |
| 39 | + |
| 40 | +**Submodules** allow me to develop a component in isolation and then seamlessly integrate it into the broader project. This approach keeps things clean and avoids the typical downsides of a massive monolithic repository where everything is tangled together. |
| 41 | + |
| 42 | +#### Working with Submodules |
| 43 | + |
| 44 | +To add a **submodule**: |
| 45 | + |
| 46 | +```sh |
| 47 | +git submodule add REPO_URL PATH |
| 48 | +``` |
| 49 | + |
| 50 | +For example, if we want to add a repository for the ingestion module under `component`, the command looks like this: |
| 51 | + |
| 52 | +```sh |
| 53 | +git submodule add https://github.com/alanmmolina/lakeground-component.git component |
| 54 | +``` |
| 55 | + |
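| 56 | +Cloning a repository with **submodules** requires an extra step, because a regular `git clone` leaves the **submodule** folders empty. To clone the main repository and initialize and update its **submodules** in one go, we pass `--recurse-submodules`: |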
| 57 | + |
| 58 | +```sh |
| 59 | +git clone --recurse-submodules REPO_URL |
| 60 | +``` |
| 61 | + |
| 62 | +Or, if we forgot to do that when cloning, we can initialize them later: |
| 63 | + |
| 64 | +```sh |
| 65 | +git submodule update --init --recursive |
| 66 | +``` |
| 67 | + |
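To see what the two cloning paths actually do, here is a minimal sketch that can be run in a scratch directory. Everything in it is a stand-in for illustration: the `g` helper, the throwaway identity, and the local `component-src` and `core` repositories are assumptions, not part of Lakeground. The `protocol.file.allow=always` setting is only there because recent `git` versions refuse local-path submodules by default.

```sh
set -e
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical helper: git with a throwaway identity and local-path
# submodules allowed (only needed for this offline demo).
g() {
  git -c user.name="demo" -c user.email="demo@example.com" \
      -c protocol.file.allow=always "$@"
}

# Stand-in for a component repository with a single commit.
g init -q -b main component-src
echo "hello" > component-src/README.md
g -C component-src add README.md
g -C component-src commit -q -m "first commit"

# Stand-in for the core repo, linking the component as a submodule.
g init -q -b main core
g -C core submodule --quiet add "$workdir/component-src" component
g -C core commit -q -m "add component"

# A plain clone leaves the submodule uninitialized.
g clone -q core plain-clone
before=$(g -C plain-clone submodule status)

# Initializing afterwards checks out the pinned commit.
g -C plain-clone submodule --quiet update --init --recursive
after=$(g -C plain-clone submodule status)

echo "before: $before"
echo "after:  $after"
```

In the `git submodule status` output, a leading `-` marks a submodule that has not been initialized yet, while a leading space means the checkout matches the commit recorded in the main repository.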
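| 68 | +Each **submodule** is treated as an independent repository, and the main repository only records the specific commit each one is pinned to, so new commits inside a **submodule** won’t automatically show up in the main repository. To update a **submodule** to its latest version, we navigate inside it and pull the latest changes (after `git submodule update`, the **submodule** usually sits on a detached `HEAD`, so running `git checkout main` first keeps the pull on the branch): |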
| 69 | + |
| 70 | +```sh |
| 71 | +cd component |
| 72 | +git pull origin main |
| 73 | +cd - |
| 74 | +``` |
| 75 | + |
| 76 | +Then, back in the main repo, we commit the updated reference: |
| 77 | + |
| 78 | +```sh |
| 79 | +git add component |
| 80 | +git commit -m "update component" |
| 81 | +``` |
| 82 | + |
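As an alternative to the `cd` and `git pull` steps, `git submodule update --remote` can fetch and check out the newest commit of the tracked branch without leaving the main repo. The sketch below shows this under a few assumptions: the `g` helper, the throwaway identity, and the local `component-src` and `core` repositories are hypothetical stand-ins, and the branch is recorded explicitly with `git submodule add -b main` so the demo does not depend on version-specific defaults.

```sh
set -e
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical helper: git with a throwaway identity and local-path
# submodules allowed (only needed for this offline demo).
g() {
  git -c user.name="demo" -c user.email="demo@example.com" \
      -c protocol.file.allow=always "$@"
}

# 1. Stand-in for the component repository, with one commit on main.
g init -q -b main component-src
echo "v1" > component-src/README.md
g -C component-src add README.md
g -C component-src commit -q -m "first commit"

# 2. Stand-in for the core repo; -b main records the branch to follow.
g init -q -b main core
g -C core submodule --quiet add -b main "$workdir/component-src" component
g -C core commit -q -m "add component"

# 3. The component moves ahead on its own.
echo "v2" > component-src/README.md
g -C component-src commit -q -am "second commit"

# 4. Pull the newest component commit into the core repo and pin it.
g -C core submodule --quiet update --remote component
g -C core add component
g -C core commit -q -m "update component"

pinned=$(g -C core rev-parse HEAD:component)
latest=$(g -C component-src rev-parse HEAD)
echo "core pins:        $pinned"
echo "component latest: $latest"
```

Either way, the main repo still needs its own commit: the **submodule** reference is just a pinned commit hash, and it only moves when we stage and commit it.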
| 83 | +When working with **submodules**, `git` needs a way to track their locations and source repositories. That’s where the `.gitmodules` file comes in. This file lives at the root of the main repository and keeps a record of every **submodule** we’ve added. |
| 84 | + |
| 85 | +A typical `.gitmodules` file looks like this: |
| 86 | + |
| 87 | +```ini |
| 88 | +[submodule "component"] |
| 89 | +  path = component |
| 90 | +  url = https://github.com/alanmmolina/lakeground-component.git |
| 91 | +``` |
| 92 | + |
| 93 | +Each section corresponds to a **submodule** and contains its name, the path where it lives inside the main repository, and the external repository URL from which it is fetched. This file ensures that whenever someone clones the repository, they know where each **submodule** comes from. |
| 94 | + |
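Since `.gitmodules` uses the same syntax as `git`'s own config files, it can be read with `git config -f` instead of being parsed by hand. The following is a small sketch that recreates the example file in a scratch directory and looks up its values; the variable names are just illustrative.

```sh
# Work in a throwaway directory so nothing real is touched.
workdir=$(mktemp -d)
cd "$workdir"

# Recreate the .gitmodules file from the example above.
cat > .gitmodules <<'EOF'
[submodule "component"]
  path = component
  url = https://github.com/alanmmolina/lakeground-component.git
EOF

# Keys follow the pattern submodule.<name>.<field>.
submodule_path=$(git config -f .gitmodules --get submodule.component.path)
submodule_url=$(git config -f .gitmodules --get submodule.component.url)

echo "$submodule_path -> $submodule_url"
```

This comes in handy in small scripts that need to know where each component lives without hardcoding paths.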
| 95 | +> [!tip] Managing `git` the easy way |
| 96 | +> As much as I appreciate the power of the command line, I have to admit that I’m a bit lazy. I prefer working with `git` through **VS Code**, where I can visualize changes, manage branches, and switch between **submodules** effortlessly. The `git` panel in VS Code makes it easy to see which files have changed, stage updates, and resolve conflicts in a much more intuitive way than dealing with raw commands. |
| 97 | +> |
| 98 | +> For **submodules**, VS Code’s interface allows me to open them as separate repositories, making it simple to commit changes to a specific module without affecting the main repo. This workflow keeps everything neat and helps me stay focused on the code rather than on `git` mechanics. |
| 99 | +> |
| 100 | +> If you want to learn more about it, you can find plenty of information [here](https://code.visualstudio.com/docs/sourcecontrol/overview). |
| 101 | + |
| 102 | +--- |
| 103 | + |
| 104 | +The goal of this setup is to keep the project structured, flexible, and scalable. Whether I’m tweaking an ingestion pipeline or refining data storage, I can do so in isolation without disrupting the entire system. It’s still early, and I’m sure there will be plenty of lessons along the way—but I’m excited to see how it all unfolds. |
| 105 | + |
| 106 | +If you're working with **monorepos**, **multi-repos**, or **submodules**, or if you have a different approach to structuring multi-component projects, I’d love to hear your thoughts! |
| 107 | + |
| 108 | +--- |
| 109 | + |
| 110 | +With the **core-repo** and **components** design in place, the next challenge is managing Python environments efficiently across all these components. That’s where [uv](https://docs.astral.sh/uv/) comes in. It offers a _workspace_ feature that aligns perfectly with this modular structure, allowing me to maintain separate dependencies for each submodule while keeping everything under one roof. |
| 111 | + |
| 112 | +In the next post, I’ll dive into **uv** and how it fits into the [[00. Concept & Motivation|Lakeground]] ecosystem. |
| 113 | + |