32 commits
dda5e24
relocate docker files
Bill-hbrhbr Oct 8, 2025
ac9e110
Task version of managing docker images
Bill-hbrhbr Oct 9, 2025
be23651
yaml lint fix
Bill-hbrhbr Oct 9, 2025
49f6e5b
restructure
Bill-hbrhbr Oct 9, 2025
fc148bc
lint fix and others
Bill-hbrhbr Oct 9, 2025
7fa26c0
Variable refactoring
Bill-hbrhbr Oct 9, 2025
e9faabe
Update README
Bill-hbrhbr Oct 9, 2025
a28735d
address review comment
Bill-hbrhbr Oct 10, 2025
52064cf
Apply suggestions from code review
Bill-hbrhbr Oct 20, 2025
629264e
Split docker build into two scripts. External for arg checking and pr…
Bill-hbrhbr Oct 20, 2025
bd218fb
third person tense
Bill-hbrhbr Oct 20, 2025
e1405af
Address 2nd-pass review comments
Bill-hbrhbr Oct 21, 2025
0196c8e
Address review comments
Bill-hbrhbr Oct 21, 2025
d86c14a
Move helper funciton
Bill-hbrhbr Oct 22, 2025
a976ae3
Leverage docker buildx build
Bill-hbrhbr Oct 23, 2025
119e62d
Revert "Leverage docker buildx build"
Bill-hbrhbr Oct 24, 2025
c0aa216
Fix docker inspect checksum system
Bill-hbrhbr Oct 24, 2025
7080d4a
Trim last tag time from image digest for convinience of up-to-date im…
Bill-hbrhbr Oct 24, 2025
553a971
address coderabbitai comments
Bill-hbrhbr Oct 24, 2025
4a7c224
Various bug fixes. Directly checksum the output json instead of putti…
Bill-hbrhbr Oct 25, 2025
f23386e
Use a more general sources section
Bill-hbrhbr Oct 25, 2025
008b8cc
Add task docstrings
Bill-hbrhbr Oct 25, 2025
803daa7
Replace --trim with --id-only
Bill-hbrhbr Oct 26, 2025
384982c
Move dockerfiles back to their folders
Bill-hbrhbr Oct 30, 2025
e037c10
Move docker image includes to respective folders as well
Bill-hbrhbr Oct 30, 2025
ed18773
Rename dockerfiles to plain dockerfile names
Bill-hbrhbr Oct 30, 2025
5d805f6
remove checksum mechanism and rename engine to service
Bill-hbrhbr Nov 1, 2025
a72a9f1
Update Readme
Bill-hbrhbr Nov 1, 2025
5e4dab0
Fix docstring return/raise
Bill-hbrhbr Nov 5, 2025
9a3ddf3
Add docstrings for docker image tasks and minor renamings
Bill-hbrhbr Nov 5, 2025
f324788
Polisth readme
Bill-hbrhbr Nov 5, 2025
f0e5170
Update src/log_archival_bench/scripts/docker_images/__init__.py
Bill-hbrhbr Nov 5, 2025
29 changes: 18 additions & 11 deletions README.md
@@ -7,24 +7,31 @@ Initialize and update submodules:
 git submodule update --init --recursive
 ```

-Run the following code to setup the virtual environment, add the python files in src to python's
-import path, then run the venv
+## Download Datasets

-```
-python3 -m venv venv
+You can download all the datasets we use in the benchmark using the [download\_all.py](/scripts/download_all.py) script we provide.

-echo "$(pwd)" > $(find venv/lib -maxdepth 1 -mindepth 1 -type d)/site-packages/project_root.pth
+The [download\_all.py](/scripts/download_all.py) script will download all datasets into the correct directories **with** the specified names, concentrate multi-file datasets together into a single file, and generate any modified version of the dataset needed for tools like Presto \+ CLP.

-. venv/bin/activate
+## Docker Containers

-pip3 install -r requirements.txt
-```
+Benchmark services run inside Docker containers to provide reproducible, isolated environments for
+test service engines with straightforward setup and teardown.

-## Download Datasets
+While we use existing published images whenever possible, the `log-archival-bench` repository also
+builds and maintains its own service-specific images for benchmark testing.

-You can download all the datasets we use in the benchmark using the [download\_all.py](/scripts/download_all.py) script we provide.
+To build all benchmark service Docker images in parallel:

-The [download\_all.py](/scripts/download_all.py) script will download all datasets into the correct directories **with** the specified names, concentrate multi-file datasets together into a single file, and generate any modified version of the dataset needed for tools like Presto \+ CLP.
+```shell
+task docker-images:build
+```
+
+To build an image for a specific service:
+
+```shell
+uv run src/log_archival_bench/scripts/docker_images/build.py --service-name <service_name>
+```

 ## Run Everything

12 changes: 0 additions & 12 deletions assets/overhead_test/Dockerfile

This file was deleted.

12 changes: 0 additions & 12 deletions assets/template/Dockerfile

This file was deleted.

2 changes: 2 additions & 0 deletions pyproject.toml
@@ -24,6 +24,8 @@ dev = [
 ]

 [tool.mypy]
+explicit_package_bases = true
+mypy_path = ["src"]
Comment on lines +27 to +28
Contributor Author


These are needed for mypy linting to work in its current form.

 strict = true

 # Additional output
File renamed without changes.
1 change: 1 addition & 0 deletions src/log_archival_bench/scripts/docker_images/__init__.py
@@ -0,0 +1 @@
"""Scripts related to Docker images and containers."""
52 changes: 52 additions & 0 deletions src/log_archival_bench/scripts/docker_images/build.py
@@ -0,0 +1,52 @@
#!/usr/bin/env python3
"""Builds a Docker image for the specified benchmark service."""

import argparse
import subprocess
import sys
from pathlib import Path

from log_archival_bench.scripts.docker_images.utils import get_image_name, validate_service_name
from log_archival_bench.utils.project_config import CONFIG_DIR, PACKAGE_ROOT


def main(argv: list[str]) -> int:
    """
    Builds a Docker image for the specified benchmark service.

    :param argv:
    :return: 0 on success, non-zero error code on failure.
    """
    args_parser = argparse.ArgumentParser()
    args_parser.add_argument(
        "--service-name",
        required=True,
        help="The benchmark service that the built Docker image will provide.",
    )

    parsed_args = args_parser.parse_args(argv[1:])
    service_name = parsed_args.service_name

    validate_service_name(service_name)

    docker_file_path = Path(CONFIG_DIR) / "docker-images" / service_name / "Dockerfile"
    if not docker_file_path.is_file():
        err_msg = f"Dockerfile for `{service_name}` does not exist in {CONFIG_DIR}/docker-images."
        raise RuntimeError(err_msg)

    # fmt: off
    build_cmds = [
        "docker",
        "build",
        "--tag", get_image_name(service_name),
        "--file", str(docker_file_path),
        str(PACKAGE_ROOT),
    ]
    # fmt: on
    subprocess.run(build_cmds, check=True)

    return 0


if __name__ == "__main__":
    sys.exit(main(sys.argv))
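
As a minimal sketch of how the command assembly in `main` could be checked without invoking Docker, the argument list can be built by a pure function. `assemble_build_command` and its parameters are hypothetical names that do not appear in this PR; only the directory layout and the `docker build` flags are taken from `build.py` above.

```python
from pathlib import Path


def assemble_build_command(
    service_name: str, image_name: str, config_dir: Path, package_root: Path
) -> list[str]:
    """Returns the `docker build` argv for a service (hypothetical helper, not in the PR)."""
    docker_file_path = config_dir / "docker-images" / service_name / "Dockerfile"
    return [
        "docker",
        "build",
        "--tag", image_name,
        "--file", str(docker_file_path),
        str(package_root),  # Build context, as in the PR's `build.py`.
    ]


if __name__ == "__main__":
    # Illustrative paths only; the real values come from `project_config.py`.
    package_root = Path("/repo/src/log_archival_bench")
    cmd = assemble_build_command(
        "clp",
        "log-archival-bench-clp-ubuntu-jammy:dev-alice",
        package_root / "config",
        package_root,
    )
    print(" ".join(cmd))
```

Keeping the list construction separate from `subprocess.run` would let a unit test assert on the flags and paths while the actual build stays an integration concern.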
26 changes: 26 additions & 0 deletions src/log_archival_bench/scripts/docker_images/utils.py
@@ -0,0 +1,26 @@
"""Shared helpers for Docker image scripts."""

import os


def get_image_name(service_name: str) -> str:
    """
    :param service_name:
    :return: The name assigned to the Docker image that contains the service.
    """
    user = os.getenv("USER", "clp-user")
    return f"log-archival-bench-{service_name}-ubuntu-jammy:dev-{user}"


def validate_service_name(service_name: str) -> None:
    """
    :param service_name: The name of the benchmark service.
    :raise: ValueError if the service is invalid.
    """
    # NOTE: Keep in sync with `G_BENCHMARK_DOCKER_SERVICES` in taskfiles/docker-images/main.yaml
    valid_services = ["clickhouse", "clp", "elasticsearch", "sparksql", "zstandard"]
    if service_name not in valid_services:
        err_msg = (
            f"Invalid service name `{service_name}`. Valid services: {', '.join(valid_services)}"
        )
        raise ValueError(err_msg)
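
To illustrate the naming scheme, the sketch below mirrors `get_image_name` but takes the environment as an argument so the `USER` fallback can be exercised deterministically. The `env` parameter is an assumption added here for testability; the PR's version reads `os.getenv` directly.

```python
import os
from collections.abc import Mapping


def get_image_name(service_name: str, env: Mapping[str, str] = os.environ) -> str:
    """Mirrors the PR's naming scheme; `env` is a hypothetical parameter for testing."""
    user = env.get("USER", "clp-user")
    return f"log-archival-bench-{service_name}-ubuntu-jammy:dev-{user}"


if __name__ == "__main__":
    print(get_image_name("clp", {"USER": "alice"}))
    # Falls back to "clp-user" when USER is unset, which can happen in some containers.
    print(get_image_name("zstandard", {}))
```

Because the user name ends up in the tag, two developers on the same Docker host would get distinct images under this scheme.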
1 change: 1 addition & 0 deletions src/log_archival_bench/utils/__init__.py
@@ -0,0 +1 @@
"""Scripts providing general python utilities for the project."""
11 changes: 11 additions & 0 deletions src/log_archival_bench/utils/project_config.py
@@ -0,0 +1,11 @@
"""Project configurations."""

from pathlib import Path

import log_archival_bench

# Constants
PACKAGE_ROOT = Path(log_archival_bench.__file__).parent

BUILD_DIR = PACKAGE_ROOT / "build"
CONFIG_DIR = PACKAGE_ROOT / "config"
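
The `Path(log_archival_bench.__file__).parent` idiom relies on the package being a regular package with an `__init__.py`. A rough sketch of the same lookup with an explicit guard follows; `package_root_of` is a hypothetical helper, and the standard-library `json` package stands in for `log_archival_bench` so the snippet runs on its own.

```python
import json
from pathlib import Path
from types import ModuleType


def package_root_of(module: ModuleType) -> Path:
    """Returns the directory containing a regular package's `__init__.py`."""
    module_file = getattr(module, "__file__", None)
    if module_file is None:
        # Without a backing file there is no single directory to treat as the root.
        err_msg = f"`{module.__name__}` has no `__file__`."
        raise RuntimeError(err_msg)
    return Path(module_file).parent


if __name__ == "__main__":
    root = package_root_of(json)  # `json` stands in for `log_archival_bench`.
    print(root / "config")
```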
2 changes: 2 additions & 0 deletions taskfile.yaml
@@ -5,9 +5,11 @@ shopt: ["globstar"]

 includes:
   lint: "taskfiles/lint/main.yaml"
+  docker-images: "taskfiles/docker-images/main.yaml"

 vars:
   G_OUTPUT_DIR: "{{.ROOT_DIR}}/build"
+  G_PROJECT_SRC_DIR: "{{.ROOT_DIR}}/src/log_archival_bench"

 tasks:
   clean:
41 changes: 41 additions & 0 deletions taskfiles/docker-images/main.yaml
@@ -0,0 +1,41 @@
version: "3"

includes:
  utils:
    internal: true
    taskfile: "../../tools/yscope-dev-utils/exports/taskfiles/utils/utils.yaml"

vars:
  G_BENCHMARK_DOCKER_SERVICES:
    - "clickhouse"
    - "clp"
    - "elasticsearch"
    # Note: Presto-related service images currently fail to build and are pending fixes.
    #- "presto_clp"
    #- "presto_parquet"
    - "sparksql"
    - "zstandard"
  G_DOCKER_IMAGE_SCRIPT_DIR: "{{.G_PROJECT_SRC_DIR}}/scripts/docker_images"

tasks:
  build:
    # Build Docker images for all containerized benchmark services in parallel.
    run: "once"
    deps:
      - for:
          var: "G_BENCHMARK_DOCKER_SERVICES"
        task: "build-single-benchmark-service-image"
        vars:
          SERVICE_NAME: "{{.ITEM}}"

  build-single-benchmark-service-image:
    # Builds a Docker image for the specified benchmark service. Runs only once per unique service.
    #
    # @param {string} SERVICE_NAME The benchmark service that the built Docker image will provide.
    internal: true
    label: "{{.TASK}}:{{.SERVICE_NAME}}"
    requires:
      vars: ["SERVICE_NAME"]
    run: "when_changed"
    cmds:
      - "uv run '{{.G_DOCKER_IMAGE_SCRIPT_DIR}}/build.py' --service-name {{.SERVICE_NAME}}"
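
For readers less familiar with Task's `deps` plus `for` pattern, the fan-out in the `build` task is roughly analogous to the Python sketch below. This is an analogy, not a description of how Task is implemented: `build_all` and the injected `build_one` callable are hypothetical, and the service list is copied from `G_BENCHMARK_DOCKER_SERVICES`.

```python
from collections.abc import Callable
from concurrent.futures import ThreadPoolExecutor

SERVICES = ["clickhouse", "clp", "elasticsearch", "sparksql", "zstandard"]


def build_all(services: list[str], build_one: Callable[[str], int]) -> dict[str, int]:
    """Runs `build_one` for each unique service concurrently and collects exit codes."""
    unique_services = list(dict.fromkeys(services))  # Dedupe while preserving order.
    with ThreadPoolExecutor() as executor:
        exit_codes = list(executor.map(build_one, unique_services))
    return dict(zip(unique_services, exit_codes))


if __name__ == "__main__":
    # A stand-in builder; a real one might call `subprocess.run` on `build.py`.
    results = build_all(SERVICES, lambda service: 0)
    print(results)
```

The deduplication step plays the role that `run: "when_changed"` plays in the Taskfile, where a task runs once per unique set of variables.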