Commit f9c2654

Update Results for DataFusion 51.0.0
1 parent c3a1e28 commit f9c2654

8 files changed

+272
-65
lines changed

datafusion-partitioned/README.md

Lines changed: 8 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,9 @@
1-
# DataFusion
1+
# Apache DataFusion
22

3-
DataFusion is an extensible query execution framework, written in Rust, that uses Apache Arrow as its in-memory format. For more information, please check <https://arrow.apache.org/datafusion/user-guide/introduction.html>
3+
[Apache DataFusion] is an extensible query execution framework, written in Rust, that uses [Apache Arrow] as its in-memory format. For more information, please check <https://arrow.apache.org/datafusion/user-guide/introduction.html>
4+
5+
[Apache DataFusion]: https://arrow.apache.org/datafusion/
6+
[Apache Arrow]: https://arrow.apache.org/
47

58
We use parquet file here and create an external table for it; and then execute the queries.
69

@@ -10,7 +13,7 @@ The benchmark should be completed in under an hour. On-demand pricing is $0.6 pe
1013

1114
1. manually start a AWS EC2 instance
1215
- `c6a.4xlarge`
13-
- Ubuntu 22.04 or later
16+
- Ubuntu 24.04 or later
1417
- Root 500GB gp2 SSD
1518
- no EBS optimized
1619
- no instance store
@@ -20,16 +23,16 @@ The benchmark should be completed in under an hour. On-demand pricing is $0.6 pe
2023
1. `vi benchmark.sh` and modify following line to target Datafusion version
2124

2225
```bash
23-
git checkout 46.0.0
26+
git checkout 51.0.0
2427
```
2528

2629
1. `bash benchmark.sh`
30+
1. `./save-result.sh c6a.4xlarge`
2731

2832
### Know Issues
2933

3034
1. importing parquet by `datafusion-cli` doesn't support schema, need to add some casting in queries.sql (e.g. converting EventTime from Int to Timestamp via `to_timestamp_seconds`)
3135
2. importing parquet by `datafusion-cli` make column name column name case-sensitive, i change all column name in queries.sql to double quoted literal (e.g. `EventTime` -> `"EventTime"`)
32-
3. `comparing binary with utf-8` and `group by binary` don't work in mac, if you run these queries in mac, you'll get some errors for queries contain binary format apache/arrow-datafusion#3050
3336
3437
## Generate full human readable results (for debugging)
3538

datafusion-partitioned/benchmark.sh

Lines changed: 3 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -11,9 +11,9 @@ sudo apt-get update -y
1111
sudo apt-get install -y gcc
1212

1313
echo "Install DataFusion main branch"
14-
git clone https://github.com/apache/arrow-datafusion.git
15-
cd arrow-datafusion/
16-
git checkout 47.0.0
14+
git clone https://github.com/apache/datafusion.git
15+
cd datafusion/
16+
git checkout 51.0.0
1717
CARGO_PROFILE_RELEASE_LTO=true RUSTFLAGS="-C codegen-units=1" cargo build --release --package datafusion-cli --bin datafusion-cli
1818
export PATH="`pwd`/target/release:$PATH"
1919
cd ..

datafusion-partitioned/result.csv

Lines changed: 129 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,129 @@
1+
1,1,0.108
2+
1,2,0.032
3+
1,3,0.031
4+
2,1,0.161
5+
2,2,0.054
6+
2,3,0.053
7+
3,1,0.307
8+
3,2,0.095
9+
3,3,0.098
10+
4,1,0.577
11+
4,2,0.112
12+
4,3,0.108
13+
5,1,1.160
14+
5,2,0.769
15+
5,3,0.757
16+
6,1,1.110
17+
6,2,0.829
18+
6,3,0.826
19+
7,1,0.112
20+
7,2,0.032
21+
7,3,0.032
22+
8,1,0.169
23+
8,2,0.056
24+
8,3,0.057
25+
9,1,1.099
26+
9,2,0.931
27+
9,3,0.914
28+
10,1,1.771
29+
10,2,1.007
30+
10,3,1.006
31+
11,1,0.667
32+
11,2,0.232
33+
11,3,0.236
34+
12,1,0.882
35+
12,2,0.257
36+
12,3,0.253
37+
13,1,1.204
38+
13,2,0.839
39+
13,3,0.833
40+
14,1,2.712
41+
14,2,1.391
42+
14,3,1.414
43+
15,1,1.228
44+
15,2,0.804
45+
15,3,0.813
46+
16,1,1.023
47+
16,2,0.870
48+
16,3,0.882
49+
17,1,2.751
50+
17,2,1.688
51+
17,3,1.681
52+
18,1,2.749
53+
18,2,1.683
54+
18,3,1.683
55+
19,1,5.618
56+
19,2,3.391
57+
19,3,3.380
58+
20,1,0.375
59+
20,2,0.103
60+
20,3,0.104
61+
21,1,10.142
62+
21,2,1.119
63+
21,3,1.114
64+
22,1,11.557
65+
22,2,1.381
66+
22,3,1.376
67+
23,1,22.326
68+
23,2,2.639
69+
23,3,2.549
70+
24,1,52.872
71+
24,2,9.353
72+
24,3,9.169
73+
25,1,0.390
74+
25,2,0.155
75+
25,3,0.165
76+
26,1,1.144
77+
26,2,0.261
78+
26,3,0.256
79+
27,1,0.380
80+
27,2,0.160
81+
27,3,0.157
82+
28,1,10.451
83+
28,2,1.511
84+
28,3,1.507
85+
29,1,9.596
86+
29,2,8.827
87+
29,3,9.053
88+
30,1,0.582
89+
30,2,0.430
90+
30,3,0.453
91+
31,1,3.205
92+
31,2,0.791
93+
31,3,0.802
94+
32,1,6.970
95+
32,2,0.976
96+
32,3,0.983
97+
33,1,5.111
98+
33,2,3.477
99+
33,3,3.508
100+
34,1,10.275
101+
34,2,3.680
102+
34,3,3.682
103+
35,1,10.314
104+
35,2,3.657
105+
35,3,3.658
106+
36,1,1.385
107+
36,2,1.231
108+
36,3,1.252
109+
37,1,0.357
110+
37,2,0.141
111+
37,3,0.134
112+
38,1,0.217
113+
38,2,0.075
114+
38,3,0.076
115+
39,1,0.341
116+
39,2,0.140
117+
39,3,0.142
118+
40,1,0.506
119+
40,2,0.208
120+
40,3,0.225
121+
41,1,0.199
122+
41,2,0.071
123+
41,3,0.075
124+
42,1,0.191
125+
42,2,0.068
126+
42,3,0.064
127+
43,1,0.178
128+
43,2,0.058
129+
43,3,0.062
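
The rows above follow a `<query>,<try>,<seconds>` layout: 43 queries with 3 tries each, for 129 lines. A minimal sketch of a sanity check for that layout before converting the file to JSON is shown here. The helper name `check_result_csv` is hypothetical and not part of this commit; it also assumes every timing row is present, with no handling for failed queries.

```bash
#!/bin/bash
# Hypothetical helper (not part of the commit): verify that a result CSV has
# three comma-separated fields per line and exactly three tries per query.
check_result_csv() {
    local file=$1 expected_queries=$2
    awk -F, -v n="$expected_queries" '
        # Every line must look like <query>,<try>,<seconds>.
        NF != 3 { printf "line %d: expected 3 fields, got %d\n", NR, NF; bad = 1 }
        # Count how many timings were recorded for each query number.
        { tries[$1]++ }
        END {
            for (q = 1; q <= n; q++)
                if (tries[q] != 3) {
                    printf "query %d: %d tries instead of 3\n", q, tries[q]
                    bad = 1
                }
            exit (bad ? 1 : 0)
        }' "$file"
}
```

For this benchmark it could be called as `check_result_csv result.csv 43`; a non-zero exit status means the file is incomplete or malformed.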
Lines changed: 49 additions & 49 deletions
Original file line numberDiff line numberDiff line change
@@ -1,56 +1,56 @@
11
{
2-
"system": "DataFusion (Parquet, partitioned)",
3-
"date": "2025-07-10",
2+
"system": "DataFusion (Parquet, partitioned)",
3+
"date": "2025-11-24",
44
"machine": "c6a.4xlarge",
55
"cluster_size": 1,
6-
"proprietary": "no",
7-
"tuned": "no",
8-
"tags": ["Rust","column-oriented","embedded","stateless", "lukewarm-cold-run"],
6+
"proprietary": "no",
7+
"tuned": "no",
8+
"tags": ["Rust","column-oriented","embedded","stateless"],
99
"load_time": 0,
1010
"data_size": 14737666736,
1111
"result": [
12-
[0.058, 0.017, 0.015],
13-
[0.116, 0.035, 0.037],
14-
[0.2, 0.084, 0.088],
15-
[0.43, 0.081, 0.084],
16-
[1.086, 0.78, 0.799],
17-
[0.977, 0.751, 0.756],
18-
[0.086, 0.026, 0.026],
19-
[0.125, 0.04, 0.037],
20-
[1.011, 0.882, 0.862],
21-
[1.349, 0.971, 0.983],
22-
[0.565, 0.231, 0.24],
23-
[0.677, 0.264, 0.265],
24-
[1.062, 0.816, 0.82],
25-
[2.769, 1.346, 1.201],
26-
[1.135, 0.792, 0.78],
27-
[1.021, 0.926, 0.916],
28-
[2.638, 1.639, 1.63],
29-
[2.585, 1.555, 1.592],
30-
[5.159, 3.238, 3.24],
31-
[0.26, 0.077, 0.077],
32-
[10.045, 1.067, 1.082],
33-
[11.424, 1.291, 1.269],
34-
[22.117, 2.487, 2.511],
35-
[55.492, 9.765, 9.851],
36-
[2.825, 0.432, 0.423],
37-
[0.853, 0.328, 0.33],
38-
[2.837, 0.508, 0.504],
39-
[9.744, 1.469, 1.478],
40-
[9.444, 9.445, 9.475],
41-
[0.515, 0.405, 0.415],
42-
[2.433, 0.729, 0.735],
43-
[6.158, 0.884, 0.891],
44-
[4.608, 3.342, 3.281],
45-
[10.221, 3.481, 3.455],
46-
[10.145, 3.486, 3.46],
47-
[1.261, 1.188, 1.168],
48-
[0.309, 0.114, 0.114],
49-
[0.175, 0.05, 0.048],
50-
[0.313, 0.099, 0.117],
51-
[0.451, 0.166, 0.192],
52-
[0.183, 0.04, 0.043],
53-
[0.171, 0.04, 0.041],
54-
[0.143, 0.035, 0.037]
55-
]
12+
[0.110,0.032,0.032],
13+
[0.159,0.054,0.053],
14+
[0.268,0.097,0.098],
15+
[0.609,0.111,0.111],
16+
[1.170,0.789,0.777],
17+
[1.147,0.834,0.823],
18+
[0.109,0.031,0.031],
19+
[0.173,0.056,0.055],
20+
[1.117,0.942,0.916],
21+
[1.778,0.997,0.994],
22+
[0.663,0.232,0.240],
23+
[0.864,0.258,0.258],
24+
[1.209,0.835,0.854],
25+
[2.715,1.370,1.386],
26+
[1.223,0.834,0.831],
27+
[1.054,0.882,0.876],
28+
[2.757,1.699,1.707],
29+
[2.737,1.670,1.688],
30+
[5.613,3.370,3.410],
31+
[0.377,0.104,0.102],
32+
[10.116,1.111,1.140],
33+
[11.557,1.408,1.365],
34+
[22.315,2.650,2.627],
35+
[52.820,9.173,9.215],
36+
[0.340,0.158,0.150],
37+
[1.177,0.254,0.264],
38+
[0.390,0.163,0.151],
39+
[10.337,1.480,1.508],
40+
[9.570,8.813,8.964],
41+
[0.585,0.454,0.446],
42+
[3.202,0.778,0.776],
43+
[6.962,0.959,0.994],
44+
[5.083,3.497,3.509],
45+
[10.231,3.706,3.661],
46+
[10.270,3.653,3.645],
47+
[1.411,1.223,1.278],
48+
[0.349,0.134,0.137],
49+
[0.210,0.071,0.075],
50+
[0.335,0.139,0.133],
51+
[0.487,0.210,0.211],
52+
[0.202,0.067,0.067],
53+
[0.187,0.063,0.065],
54+
[0.182,0.059,0.063]
55+
]
5656
}
Lines changed: 36 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,36 @@
1+
#!/bin/bash
2+
3+
# This script converts the raw result.csv data from `benchmark.sh` into the
4+
# final JSON format used by the benchmark dashboard.
5+
#
6+
# usage : ./save-result.sh <machine>
7+
#
8+
# example (save results/c6a.4xlarge.json)
9+
# ./save-result.sh c6a.4xlarge
10+
11+
MACHINE=$1
12+
OUTPUT_FILE="results/${MACHINE}.json"
13+
SYSTEM_NAME="DataFusion (Parquet, partitioned)"
14+
DATE=$(date +%Y-%m-%d)
15+
16+
17+
# Read the CSV and build the result array using sed
18+
RESULT_ARRAY=$(awk -F, '{arr[$1]=arr[$1]","$3} END {for (i=1;i<=length(arr);i++) {gsub(/^,/, "", arr[i]); printf " ["arr[i]"]"; if (i<length(arr)) printf ",\n"}}' result.csv)
19+
20+
# form the final JSON structure from the template
21+
cat <<EOF > "$OUTPUT_FILE"
22+
{
23+
"system": "$SYSTEM_NAME",
24+
"date": "$DATE",
25+
"machine": "$MACHINE",
26+
"cluster_size": 1,
27+
"proprietary": "no",
28+
"tuned": "no",
29+
"tags": ["Rust","column-oriented","embedded","stateless"],
30+
"load_time": 0,
31+
"data_size": 14737666736,
32+
"result": [
33+
$RESULT_ARRAY
34+
]
35+
}
36+
EOF
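
The awk one-liner in the script groups the third CSV column by query number and emits one JSON array row per query. The same idea can be sketched in a more readable form as follows. This is an illustrative rewrite under stated assumptions, not the script's actual code: the function name `csv_to_result_rows` is hypothetical, and it assumes query numbers are consecutive integers starting at 1.

```bash
#!/bin/bash
# Illustrative sketch (hypothetical name, not part of the commit): turn
# <query>,<try>,<seconds> lines into one "[t1,t2,t3]" row per query.
csv_to_result_rows() {
    awk -F, '
        {
            # Append this timing to the row for its query number.
            if ($1 in rows) rows[$1] = rows[$1] "," $3
            else            rows[$1] = $3
            # Track the highest query number seen so far.
            if ($1 + 0 > max) max = $1 + 0
        }
        END {
            # Print rows in query order, comma-separated except after the last.
            for (q = 1; q <= max; q++)
                printf "    [%s]%s\n", rows[q], (q < max ? "," : "")
        }' "$1"
}
```

Running `csv_to_result_rows result.csv` on a file whose first six lines are the query 1 and query 2 timings would begin with `[0.108,0.032,0.031],` followed by the row for query 2. Tracking the maximum query number explicitly avoids relying on `length(arr)`, which some older awk implementations do not support for arrays.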

datafusion/README.md

Lines changed: 8 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -1,16 +1,19 @@
11
# DataFusion
22

3-
DataFusion is an extensible query execution framework, written in Rust, that uses Apache Arrow as its in-memory format. For more information, please check <https://arrow.apache.org/datafusion/user-guide/introduction.html>
3+
[Apache DataFusion] is an extensible query execution framework, written in Rust, that uses [Apache Arrow] as its in-memory format. For more information, please check <https://arrow.apache.org/datafusion/user-guide/introduction.html>
4+
5+
[Apache DataFusion]: https://arrow.apache.org/datafusion/
6+
[Apache Arrow]: https://arrow.apache.org/
47

58
We use parquet file here and create an external table for it; and then execute the queries.
69

7-
## Generate benchmark results
10+
## Cookbook: Generate benchmark results
811

912
The benchmark should be completed in under an hour. On-demand pricing is $0.6 per hour while spot pricing is only $0.2 to $0.3 per hour (us-east-2).
1013

1114
1. manually start a AWS EC2 instance
1215
- `c6a.4xlarge`
13-
- Ubuntu 22.04 or later
16+
- Ubuntu 24.04 or later
1417
- Root 500GB gp2 SSD
1518
- no EBS optimized
1619
- no instance store
@@ -20,16 +23,16 @@ The benchmark should be completed in under an hour. On-demand pricing is $0.6 pe
2023
1. `vi benchmark.sh` and modify following line to target Datafusion version
2124

2225
```bash
23-
git checkout 46.0.0
26+
git checkout 51.0.0
2427
```
2528

2629
1. `bash benchmark.sh`
30+
1. `./save-result.sh c6a.4xlarge`
2731

2832
### Know Issues
2933

3034
1. importing parquet by `datafusion-cli` doesn't support schema, need to add some casting in queries.sql (e.g. converting EventTime from Int to Timestamp via `to_timestamp_seconds`)
3135
2. importing parquet by `datafusion-cli` make column name column name case-sensitive, i change all column name in queries.sql to double quoted literal (e.g. `EventTime` -> `"EventTime"`)
32-
3. `comparing binary with utf-8` and `group by binary` don't work in mac, if you run these queries in mac, you'll get some errors for queries contain binary format apache/arrow-datafusion#3050
3336
3437
## Generate full human readable results (for debugging)
3538

datafusion/benchmark.sh

Lines changed: 3 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -11,9 +11,9 @@ sudo apt-get update -y
1111
sudo apt-get install -y gcc
1212

1313
echo "Install DataFusion main branch"
14-
git clone https://github.com/apache/arrow-datafusion.git
15-
cd arrow-datafusion/
16-
git checkout 47.0.0
14+
git clone https://github.com/apache/datafusion.git
15+
cd datafusion/
16+
git checkout 51.0.0
1717
CARGO_PROFILE_RELEASE_LTO=true RUSTFLAGS="-C codegen-units=1" cargo build --release --package datafusion-cli --bin datafusion-cli
1818
export PATH="`pwd`/target/release:$PATH"
1919
cd ..
