Description
What happens?
When querying a Parquet file located on S3, we are noticing that the query plan differs depending on whether query parameters are present, and it is different enough that the parameterized query can end up 10x or more slower.
I have replicated this on both DuckDB 1.3.2 and 1.4.1; the Python client and the CLI show the same behavior.
I cannot share the Parquet file directly due to its size and contents, but the file is sorted by id, which should make this query very fast. My hunch is that something is skipping row-group filters when parameters are bound and doing a full table scan instead.
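One way to check that hunch is to compare the physical plans for the two forms with EXPLAIN and see whether the id predicate appears on the PARQUET_SCAN node (it shows up under a "Filters:" label in the plans I have looked at). A minimal sketch, assuming the Python client also accepts EXPLAIN on a statement with bound parameters:

def show_plan(conn, sql, parameters=None):
    # Print DuckDB's physical plan. With filter pushdown working, the
    # PARQUET_SCAN node should list the id predicate under "Filters:".
    # Assumption: EXPLAIN works with bound parameters in this client;
    # if not, this only applies to the literal form of the query.
    for _, plan in conn.execute("EXPLAIN " + sql, parameters=parameters).fetchall():
        print(plan)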
To Reproduce
Below is the Python script I was using for performance testing. The same behavior can be reproduced by running the same or similar commands in the CLI.
import duckdb
import time
def create_connection():
    conn = duckdb.connect()
    conn.execute("SET TimeZone = 'UTC';")
    conn.execute("SET autoinstall_known_extensions = false;")
    conn.execute("INSTALL spatial; LOAD spatial;")
    conn.execute("INSTALL parquet; LOAD parquet;")
    conn.execute("INSTALL httpfs;")
    conn.execute("INSTALL aws;")
    conn.execute("LOAD httpfs;")
    conn.execute("LOAD aws;")
    conn.execute("""
        CREATE SECRET aws_secret (
            TYPE S3,
            PROVIDER CREDENTIAL_CHAIN,
            CHAIN 'config'
        );
    """)
    return conn
def fast():
    # File is a parquet file ordered by id
    file_path = 's3://some_bucket/file_sorted_by_id.parquet'
    sql = f"""
SELECT id
FROM read_parquet('{file_path}')
WHERE id in ('0346EADE0DE2CFFC', '13BCADA2FAB1FD66', '1CBE1BD7B8D59C98', '1D91EB13DA3E2472', '1EDACF5BEDAB3C10',
'1FCA7BEDE0FD9A3B', '23A306DAD89EAF0F', '28E205CC6CEFFBB8', '2CDA4C50BA2DEC8A', '30B464675F23E14B',
'37CE5E04CDA59E0F', '397CB382EDE1AC16', '3BBAB85A2228ADF8', '4A0A4A4BC5EB5099', '4ABAF8EEBD9DFE32',
'4CD67B12AC7B3D9D', '52F2CBEBD25C545D', '68782CDCEB6CDDFF', '6ABC479EF31B4AD7', '6CC7B512B7AFCE14',
'6CDC0CC540AEF13E', '75FB83ABE2AE2C8C', '7C82C5CCB80EDFBA', '7CFF8E5778F0F71B', '7EAE4FF84AB243AC',
'82B34DB28DF1AD8B', '901A27257C3FD449', '9CAE544FACCDD41D', '9D4D30EA3FE9D3E5', '9DE0191AEAFB0B7B',
'A0AE0D1FE22AE828', 'A6CFD40963F5BD95', 'A78ED4C6608CFEE5', 'ADE7F802A7D6CEE5', 'B1A39BB3AD58CB9A',
'B2D9CFEEAAF5A5C4', 'B4B150599EAEBD65', 'B4CCFCE24ABFCBA1', 'BBBAC104BD067AC7', 'BFD7028C65EACA6D',
'C3747454BECF20A0', 'C763573A73A3DA79', 'C78B6D74CD14EDF9', 'CA541676C357F5FF', 'CAE9ACDEC7539DE0',
'CC65DCBFA10B3EEE', 'CE343BDF60B26BE5', 'CF0EE41CC3CE418B', 'D3A69E024760745C', 'D7FEAA2EA8D2FECC',
'D8FDBADEBAC4A739', 'D9DDEBD64CCDE7EF', 'DCB539EEA25CFEF3', 'DFEACCB88AEF0EFF', 'E6C8B41EBECA1CEF',
'E99E67CFF37F4D44', 'EBFAFE2E18EAC99B', 'EF5BC93B43FBF9EE', 'F41BAC78137112BE', 'FAE3DFAAC7DD5B2D',
'FBA9DADFFE2B3A5F', 'FBEFFEBA7AEEE43A', 'FFDA9F7FDC824D05', 'FFF6DD3CD3B2D90B');
"""
conn = create_connection()
s = time.monotonic()
print(conn.execute(sql).fetchall())
e = time.monotonic()
print(e - s)
conn.close()
def slow():
    file_path = 's3://some_bucket/file_sorted_by_id.parquet'
    # This is much slower and the query plan is different;
    # sometimes 10 to 20x slower for larger files.
    sql = f"""
SELECT id
FROM read_parquet('{file_path}')
WHERE id in $1;
"""
    query_values = ['0346EADE0DE2CFFC', '13BCADA2FAB1FD66', '1CBE1BD7B8D59C98', '1D91EB13DA3E2472', '1EDACF5BEDAB3C10',
'1FCA7BEDE0FD9A3B', '23A306DAD89EAF0F', '28E205CC6CEFFBB8', '2CDA4C50BA2DEC8A', '30B464675F23E14B',
'37CE5E04CDA59E0F', '397CB382EDE1AC16', '3BBAB85A2228ADF8', '4A0A4A4BC5EB5099', '4ABAF8EEBD9DFE32',
'4CD67B12AC7B3D9D', '52F2CBEBD25C545D', '68782CDCEB6CDDFF', '6ABC479EF31B4AD7', '6CC7B512B7AFCE14',
'6CDC0CC540AEF13E', '75FB83ABE2AE2C8C', '7C82C5CCB80EDFBA', '7CFF8E5778F0F71B', '7EAE4FF84AB243AC',
'82B34DB28DF1AD8B', '901A27257C3FD449', '9CAE544FACCDD41D', '9D4D30EA3FE9D3E5', '9DE0191AEAFB0B7B',
'A0AE0D1FE22AE828', 'A6CFD40963F5BD95', 'A78ED4C6608CFEE5', 'ADE7F802A7D6CEE5', 'B1A39BB3AD58CB9A',
'B2D9CFEEAAF5A5C4', 'B4B150599EAEBD65', 'B4CCFCE24ABFCBA1', 'BBBAC104BD067AC7', 'BFD7028C65EACA6D',
'C3747454BECF20A0', 'C763573A73A3DA79', 'C78B6D74CD14EDF9', 'CA541676C357F5FF', 'CAE9ACDEC7539DE0',
'CC65DCBFA10B3EEE', 'CE343BDF60B26BE5', 'CF0EE41CC3CE418B', 'D3A69E024760745C', 'D7FEAA2EA8D2FECC',
'D8FDBADEBAC4A739', 'D9DDEBD64CCDE7EF', 'DCB539EEA25CFEF3', 'DFEACCB88AEF0EFF', 'E6C8B41EBECA1CEF',
'E99E67CFF37F4D44', 'EBFAFE2E18EAC99B', 'EF5BC93B43FBF9EE', 'F41BAC78137112BE', 'FAE3DFAAC7DD5B2D',
'FBA9DADFFE2B3A5F', 'FBEFFEBA7AEEE43A', 'FFDA9F7FDC824D05', 'FFF6DD3CD3B2D90B']
    conn = create_connection()
    s = time.monotonic()
    print(conn.execute(sql, parameters=[query_values]).fetchall())
    e = time.monotonic()
    print(e - s)
    conn.close()


if __name__ == '__main__':
    fast()
    slow()
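For now my workaround is to inline the ids as SQL literals instead of binding them, which produces the same query (and fast plan) as fast() above. A minimal sketch; the quoting helper here is hand-rolled for these hex ids and is not a DuckDB API:

def select_ids_inlined(conn, file_path, ids):
    # Render the ids as quoted SQL literals so the planner sees a
    # constant IN list, matching the fast() case above. The escaping
    # is hand-rolled and only intended for trusted values like these ids.
    in_list = ", ".join("'" + i.replace("'", "''") + "'" for i in ids)
    sql = f"SELECT id FROM read_parquet('{file_path}') WHERE id IN ({in_list})"
    return conn.execute(sql).fetchall()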
OS:
macOS Sequoia 15.5
DuckDB Version:
1.4.1 & 1.3.2
DuckDB Client:
Python & CLI
Hardware:
Apple M4 Pro 48GB
Full Name:
Gregory Hartranft
Affiliation:
Prefer not to say
Did you include all relevant configuration (e.g., CPU architecture, Linux distribution) to reproduce the issue?
- Yes, I have
Did you include all code required to reproduce the issue?
- Yes, I have
Did you include all relevant data sets for reproducing the issue?
No - I cannot share the data sets because they are confidential