Skip to content

Commit 06df877

Browse files
author
Ubuntu
committed
chore(testing): organize S3 transfer benchmark artifacts
1 parent c20f928 commit 06df877

13 files changed

+1459
-1
lines changed

.gitignore

Lines changed: 5 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -72,6 +72,10 @@ evaluation/benchmarks/**/pash_graphviz_*/
7272

7373
# Local experimental testing artifacts
7474
test_pretty_print.py
75-
testing/
75+
testing/*
76+
!testing/s3_transfer_benchmarks/
77+
!testing/s3_transfer_benchmarks/**
78+
testing/legacy/
7679
testing2/
7780
sort_inout_splits/
81+
pipeline_io/
Lines changed: 43 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,43 @@
1+
# S3 Transfer Benchmarks
2+
3+
This folder contains the organized benchmark assets used to compare data transfer paths for serverless PaSh testing.
4+
5+
## What is measured
6+
7+
1. `S3 -> EC2` full object download timing
8+
2. `S3 -> Lambda` full object download timing
9+
3. `S3 -> Lambda` byte-range download timing
10+
4. `S3 -> EC2 -> Lambda` streaming timing
11+
12+
## Layout
13+
14+
- `ec2_s3_benchmark.py`: direct S3-to-EC2 pull benchmark script
15+
- `orchestrators/`: manual orchestrators used to invoke Lambda sort workers
16+
- `lambda_workers/`: Lambda worker handlers and deployment scripts
17+
- `analysis/plot.py`: plotting script for benchmark results
18+
- `analysis/*.png`: generated comparison figures
19+
20+
## Quick usage
21+
22+
From this directory:
23+
24+
```bash
25+
# Deploy worker lambdas
26+
./lambda_workers/deploy_lambda_sort.sh
27+
./lambda_workers/deploy_lambda_sort_byte_ranges.sh
28+
29+
# Run byte-range orchestrator
30+
python3 orchestrators/manual_s3_orchestrator_byte_ranges.py \
31+
--bucket "$AWS_BUCKET" \
32+
--input oneliners/inputs/1G.txt \
33+
--output oneliners/outputs/byte-range-result.txt \
34+
--workers 2
35+
36+
# Plot current benchmark summary
37+
python3 analysis/plot.py
38+
```
39+
40+
## Notes
41+
42+
- Legacy/ad-hoc experimental artifacts were moved under `testing/legacy/` and are intentionally not committed.
43+
- Deployment scripts package Lambda zips locally in `lambda_workers/`.
Lines changed: 178 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,178 @@
1+
from pathlib import Path
2+
3+
import matplotlib.pyplot as plt
4+
import numpy as np
5+
6+
# Complete data
7+
sizes_mib = [100, 500, 1024]
8+
sizes_labels = ['100 MiB', '500 MiB', '1 GiB']
9+
10+
# S3 to Lambda (full files)
11+
lambda_full_avg = [1.14, 6.70, 13.55]
12+
lambda_full_min = [1.07, 5.28, 13.19]
13+
lambda_full_max = [1.38, 7.08, 14.13]
14+
lambda_full_p90 = [1.34, 7.06, 14.06]
15+
16+
17+
# S3 to EC2 (full files)
18+
ec2_avg = [1.14, 5.34, 10.85]
19+
ec2_min = [1.14, 5.33, 10.58]
20+
ec2_max = [1.15, 5.36, 11.88]
21+
ec2_p90 = [1.15, 5.35, 11.68]
22+
23+
# S3 to Lambda (byte range)
24+
lambda_range_avg = [1.13, 6.59, 13.35]
25+
lambda_range_min = [1.07, 5.26, 11.95]
26+
lambda_range_max = [1.38, 7.02, 14.11]
27+
lambda_range_p90 = [1.34, 6.99, 13.47]
28+
29+
# S3 to EC2 to Lambda (streaming, /dev/null)
30+
streaming_avg = [4.027, 7.840, 14.650]
31+
streaming_min = [3.957, 7.427, 14.542]
32+
streaming_max = [4.079, 7.999, 14.726]
33+
streaming_p90 = [4.079, 7.999, 14.726] # Using max as approximation
34+
35+
# Calculate error bars (distance from average to min/max)
36+
ec2_err_lower = [ec2_avg[i] - ec2_min[i] for i in range(3)]
37+
ec2_err_upper = [ec2_max[i] - ec2_avg[i] for i in range(3)]
38+
lambda_full_err_lower = [lambda_full_avg[i] - lambda_full_min[i] for i in range(3)]
39+
lambda_full_err_upper = [lambda_full_max[i] - lambda_full_avg[i] for i in range(3)]
40+
lambda_range_err_lower = [lambda_range_avg[i] - lambda_range_min[i] for i in range(3)]
41+
lambda_range_err_upper = [lambda_range_max[i] - lambda_range_avg[i] for i in range(3)]
42+
streaming_err_lower = [streaming_avg[i] - streaming_min[i] for i in range(3)]
43+
streaming_err_upper = [streaming_max[i] - streaming_avg[i] for i in range(3)]
44+
45+
# Create figure with 2 subplots
46+
fig = plt.figure(figsize=(14, 6))
47+
48+
# Plot 1: Average times comparison with error bars
49+
ax1 = plt.subplot(1, 2, 1)
50+
x = np.arange(len(sizes_labels))
51+
width = 0.2 # Narrower bars to fit 4 series
52+
53+
bars1 = ax1.bar(x - 1.5*width, ec2_avg, width, label='S3→EC2 (download)',
54+
color='#2ecc71', alpha=0.8,
55+
yerr=[ec2_err_lower, ec2_err_upper], capsize=4, error_kw={'linewidth': 1.5})
56+
bars2 = ax1.bar(x - 0.5*width, lambda_full_avg, width, label='S3→Lambda (direct)',
57+
color='#e74c3c', alpha=0.8,
58+
yerr=[lambda_full_err_lower, lambda_full_err_upper], capsize=4, error_kw={'linewidth': 1.5})
59+
bars3 = ax1.bar(x + 0.5*width, lambda_range_avg, width,
60+
label='S3→Lambda (byte range)', color='#3498db', alpha=0.8,
61+
yerr=[lambda_range_err_lower, lambda_range_err_upper], capsize=4, error_kw={'linewidth': 1.5})
62+
bars4 = ax1.bar(x + 1.5*width, streaming_avg, width,
63+
label='S3→EC2→Lambda (stream)', color='#9b59b6', alpha=0.8,
64+
yerr=[streaming_err_lower, streaming_err_upper], capsize=4, error_kw={'linewidth': 1.5})
65+
66+
ax1.set_xlabel('File Size', fontsize=12, fontweight='bold')
67+
ax1.set_ylabel('Average Time (seconds)', fontsize=12, fontweight='bold')
68+
ax1.set_title('S3 Download Performance Comparison', fontsize=14, fontweight='bold')
69+
ax1.set_xticks(x)
70+
ax1.set_xticklabels(sizes_labels)
71+
ax1.legend(fontsize=9)
72+
ax1.grid(axis='y', alpha=0.3)
73+
74+
# Add percentage labels showing streaming overhead vs direct Lambda
75+
for i in range(len(sizes_labels)):
76+
if streaming_avg[i] and lambda_full_avg[i]:
77+
overhead = ((streaming_avg[i] - lambda_full_avg[i]) / lambda_full_avg[i]) * 100
78+
if abs(overhead) > 1: # Only show if meaningful difference
79+
ax1.text(i + 1.5*width, streaming_avg[i] + 0.5, f'+{overhead:.0f}%',
80+
ha='center', fontsize=8, fontweight='bold', color='#8e44ad')
81+
82+
# Plot 2: Throughput (MB/s)
83+
ax2 = plt.subplot(1, 2, 2)
84+
85+
throughput_ec2 = [sizes_mib[i] / ec2_avg[i] for i in range(3)]
86+
throughput_lambda_full = [sizes_mib[i] / lambda_full_avg[i] for i in range(3)]
87+
throughput_lambda_range = [sizes_mib[i] / lambda_range_avg[i] for i in range(3)]
88+
throughput_streaming = [sizes_mib[i] / streaming_avg[i] for i in range(3)]
89+
90+
ax2.plot(sizes_labels, throughput_ec2, 'o-', color='#2ecc71',
91+
linewidth=2, markersize=10, label='S3→EC2')
92+
ax2.plot(sizes_labels, throughput_lambda_full, 's-', color='#e74c3c',
93+
linewidth=2, markersize=10, label='S3→Lambda (direct)')
94+
ax2.plot(sizes_labels, throughput_lambda_range, '^-', color='#3498db',
95+
linewidth=2, markersize=10, label='S3→Lambda (range)')
96+
ax2.plot(sizes_labels, throughput_streaming, 'd-', color='#9b59b6',
97+
linewidth=2, markersize=10, label='S3→EC2→Lambda (stream)')
98+
99+
ax2.set_xlabel('File Size', fontsize=12, fontweight='bold')
100+
ax2.set_ylabel('Throughput (MiB/s)', fontsize=12, fontweight='bold')
101+
ax2.set_title('Network Throughput by File Size', fontsize=14, fontweight='bold')
102+
ax2.legend(fontsize=9)
103+
ax2.grid(alpha=0.3)
104+
105+
# Add throughput values
106+
for i, size in enumerate(sizes_labels):
107+
ax2.text(i, throughput_ec2[i] + 2, f'{throughput_ec2[i]:.0f}',
108+
ha='center', fontsize=8, color='#27ae60')
109+
ax2.text(i, throughput_lambda_full[i] - 3, f'{throughput_lambda_full[i]:.0f}',
110+
ha='center', fontsize=8, color='#c0392b')
111+
112+
# Add note about sample size and warm-up
113+
fig.text(0.5, 0.02, 'N=9 runs per configuration (excluding 1 cold start run for Lambda - all measurements use warmed-up Lambdas)',
114+
ha='center', fontsize=9, style='italic', color='#555555')
115+
116+
plt.tight_layout(rect=[0, 0.03, 1, 1]) # Make room for the note at bottom
117+
output_path = Path(__file__).resolve().parent / "s3_performance_complete.png"
118+
plt.savefig(output_path, dpi=300, bbox_inches='tight')
119+
print(f"Saved: {output_path}")
120+
121+
# Print comprehensive summary
122+
print("\n" + "="*60)
123+
print("📊 S3 DOWNLOAD PERFORMANCE ANALYSIS")
124+
print("="*60)
125+
126+
print("\n🔍 KEY FINDINGS:\n")
127+
128+
print("1. EC2 Advantage Grows with File Size:")
129+
for i, size in enumerate(sizes_labels):
130+
speedup = ((lambda_full_avg[i] - ec2_avg[i]) / lambda_full_avg[i]) * 100
131+
print(f" {size:>8}: {speedup:>5.1f}% faster")
132+
133+
print("\n2. Throughput Analysis:")
134+
for i, size in enumerate(sizes_labels):
135+
print(f" {size:>8}: EC2={throughput_ec2[i]:>6.1f} MiB/s | Lambda={throughput_lambda_full[i]:>6.1f} MiB/s")
136+
137+
print("\n3. Consistency (Variance):")
138+
variance_ec2 = [ec2_max[i] - ec2_min[i] for i in range(3)]
139+
variance_lambda_full = [lambda_full_max[i] - lambda_full_min[i] for i in range(3)]
140+
variance_lambda_range = [lambda_range_max[i] - lambda_range_min[i] for i in range(3)]
141+
for i, size in enumerate(sizes_labels):
142+
print(f" {size:>8}: EC2={variance_ec2[i]:>5.2f}s | Lambda={variance_lambda_full[i]:>5.2f}s ({variance_lambda_full[i]/variance_ec2[i]:.1f}x more variable)")
143+
144+
print("\n4. Byte Range Overhead:")
145+
for i, size in enumerate(sizes_labels):
146+
overhead = ((lambda_range_avg[i] - lambda_full_avg[i]) / lambda_full_avg[i]) * 100
147+
print(f" {size:>8}: {overhead:>+5.1f}% (essentially zero!)")
148+
149+
print("\n5. Streaming S3→EC2→Lambda Overhead (vs Direct S3→Lambda):")
150+
variance_streaming = [streaming_max[i] - streaming_min[i] for i in range(3)]
151+
for i, size in enumerate(sizes_labels):
152+
overhead = ((streaming_avg[i] - lambda_full_avg[i]) / lambda_full_avg[i]) * 100
153+
print(f" {size:>8}: {overhead:>+5.1f}% slower (streaming overhead)")
154+
155+
print("\n✅ CONCLUSIONS:")
156+
print(" • EC2 is 0-25% faster than Lambda for downloads (scales with file size)")
157+
print(" • EC2 has 8-30x lower variance (much more consistent)")
158+
print(" • Byte ranges have ZERO performance penalty")
159+
print(" • Streaming S3→EC2→Lambda is SLOWER than direct S3→Lambda:")
160+
print(" - 100 MiB: 3.5x slower (extra hop overhead dominates)")
161+
print(" - 500 MiB: 1.2x slower")
162+
print(" - 1 GiB: 1.1x slower")
163+
print(" • Direct S3→Lambda is always faster - streaming adds latency without benefit")
164+
print("="*60)
165+
166+
# Create a detailed table
167+
print("\n📋 DETAILED TIMING TABLE:")
168+
print("-"*95)
169+
print(f"{'Size':<10} {'Method':<25} {'Avg':<8} {'Min':<8} {'Max':<8} {'P90':<8} {'Variance':<10} {'Throughput':<12}")
170+
print("-"*95)
171+
for i, size in enumerate(sizes_labels):
172+
print(f"{size:<10} {'EC2 (download)':<25} {ec2_avg[i]:<8.2f} {ec2_min[i]:<8.2f} {ec2_max[i]:<8.2f} {ec2_p90[i]:<8.2f} {variance_ec2[i]:<10.2f} {throughput_ec2[i]:<12.1f}")
173+
print(f"{'':<10} {'Lambda (direct)':<25} {lambda_full_avg[i]:<8.2f} {lambda_full_min[i]:<8.2f} {lambda_full_max[i]:<8.2f} {lambda_full_p90[i]:<8.2f} {variance_lambda_full[i]:<10.2f} {throughput_lambda_full[i]:<12.1f}")
174+
print(f"{'':<10} {'Lambda (byte range)':<25} {lambda_range_avg[i]:<8.2f} {lambda_range_min[i]:<8.2f} {lambda_range_max[i]:<8.2f} {lambda_range_p90[i]:<8.2f} {variance_lambda_range[i]:<10.2f} {throughput_lambda_range[i]:<12.1f}")
175+
print(f"{'':<10} {'S3→EC2→Lambda (stream)':<25} {streaming_avg[i]:<8.2f} {streaming_min[i]:<8.2f} {streaming_max[i]:<8.2f} {streaming_p90[i]:<8.2f} {variance_streaming[i]:<10.2f} {throughput_streaming[i]:<12.1f}")
176+
if i < len(sizes_labels) - 1:
177+
print("-"*95)
178+
print("-"*95)
172 KB
Loading
352 KB
Loading
Lines changed: 76 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,76 @@
1+
#!/usr/bin/env python3
2+
"""
3+
EC2 script to benchmark S3 download performance
4+
Measures time to pull 1G or 500M file from S3
5+
"""
6+
import boto3
7+
import botocore.config
8+
import time
9+
import uuid
10+
11+
BUCKET = "inout741448956691"
12+
13+
KEYS = [
14+
"unix50/inputs/1_20G.txt",
15+
"oneliners/inputs/1G.txt",
16+
"oneliners/inputs/500M.txt",
17+
"oneliners/inputs/100M.txt",
18+
]
19+
20+
KEY = KEYS[0] # Change this to select different file sizes
21+
22+
def run_benchmark(event=None, context=None):
23+
runtimes = []
24+
25+
for i in range(10):
26+
# Create a fresh S3 client each iteration to avoid connection reuse
27+
config = botocore.config.Config(
28+
max_pool_connections=1,
29+
retries={'max_attempts': 0}
30+
)
31+
s3 = boto3.client("s3", config=config)
32+
33+
# Add a random query param to avoid any caching
34+
unique_key = f"{KEY}?nocache={uuid.uuid4().hex}"
35+
36+
t0 = time.time()
37+
res = s3.get_object(Bucket=BUCKET, Key=KEY)
38+
for _ in res['Body'].iter_chunks(chunk_size=1024*1024):
39+
pass
40+
41+
t1 = time.time()
42+
43+
#if i == 0: # first run is cold start garbage
44+
# continue
45+
46+
dt = t1 - t0
47+
runtimes.append(dt)
48+
print(f"Run {i+1}: {dt:.2f}s")#, size pulled: {len(data)/1e6:.1f} MB")
49+
50+
#del data # free memory
51+
52+
# Compute statistics
53+
avg_ = sum(runtimes) / len(runtimes)
54+
mn = min(runtimes)
55+
mx = max(runtimes)
56+
sorted_r = sorted(runtimes)
57+
p90_index = max(0, int(len(sorted_r) * 0.9) - 1)
58+
p90 = sorted_r[p90_index]
59+
60+
print(f"Average: {avg_:.2f}s")
61+
print(f"Min: {mn:.2f}s")
62+
print(f"Max: {mx:.2f}s")
63+
print(f"P90: {p90:.2f}s")
64+
65+
return {
66+
"avg": avg_,
67+
"min": mn,
68+
"max": mx,
69+
"p90": p90,
70+
"runs": runtimes
71+
}
72+
73+
if __name__ == "__main__":
74+
# Run benchmark on EC2
75+
result = run_benchmark()
76+
print("\nFinal results:", result)
Lines changed: 66 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,66 @@
1+
#!/usr/bin/env bash
2+
# Deploy lambda-sort worker function
3+
4+
set -euo pipefail
5+
6+
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
7+
8+
echo "=========================================="
9+
echo "Deploying Lambda Sort Worker"
10+
echo "=========================================="
11+
12+
# Check environment variables
13+
if [ -z "$AWS_ACCOUNT_ID" ]; then
14+
echo "Error: AWS_ACCOUNT_ID environment variable not set"
15+
exit 1
16+
fi
17+
18+
# Package the Lambda function
19+
echo "[Step 1] Packaging Lambda function..."
20+
cd "$SCRIPT_DIR"
21+
zip -q lambda_sort_worker.zip lambda_sort_worker.py
22+
echo " ✓ Created lambda_sort_worker.zip"
23+
24+
# Check if function exists
25+
FUNCTION_EXISTS=$(aws lambda get-function --function-name lambda-sort 2>&1 | grep -c "ResourceNotFoundException" || true)
26+
27+
if [ "$FUNCTION_EXISTS" -eq 1 ]; then
28+
# Create new function
29+
echo "[Step 2] Creating new Lambda function 'lambda-sort'..."
30+
31+
aws lambda create-function \
32+
--function-name lambda-sort \
33+
--runtime python3.9 \
34+
--role arn:aws:iam::${AWS_ACCOUNT_ID}:role/pash-release-us-east-1-lambdaRole \
35+
--handler lambda_sort_worker.lambda_handler \
36+
--zip-file fileb://lambda_sort_worker.zip \
37+
--timeout 300 \
38+
--memory-size 3008 \
39+
--ephemeral-storage Size=2048 \
40+
--region us-east-1
41+
42+
echo " ✓ Lambda function created"
43+
else
44+
# Update existing function
45+
echo "[Step 2] Updating existing Lambda function 'lambda-sort'..."
46+
47+
aws lambda update-function-code \
48+
--function-name lambda-sort \
49+
--zip-file fileb://lambda_sort_worker.zip \
50+
--region us-east-1
51+
52+
echo " ✓ Lambda function updated"
53+
fi
54+
55+
echo
56+
echo "=========================================="
57+
echo "✓ Deployment complete!"
58+
echo "Function name: lambda-sort"
59+
echo "=========================================="
60+
echo
61+
echo "You can now test with:"
62+
echo " python3 ../orchestrators/manual_s3_orchestrator.py \\"
63+
echo " --bucket \$AWS_BUCKET \\"
64+
echo " --input oneliners/inputs/1M.txt \\"
65+
echo " --output oneliners/outputs/manual-sort-result.txt \\"
66+
echo " --workers 2"

0 commit comments

Comments
 (0)