
Parametrized queries much slower when querying parquet files #169

@gdh12

Description

What happens?

When querying a Parquet file located on S3, we noticed that the query plan differs depending on whether query parameters are present, and the difference is large enough that a query can end up 10x or more slower.

I have reproduced this on both DuckDB 1.3.2 and 1.4.1; the Python client and the CLI show the same behavior.

I cannot share the Parquet file directly due to its size and confidentiality, but the file is sorted by id, which should make the query very fast. My hunch is that something is skipping row-group filters under this condition and instead doing a full table scan.
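
One way to check this is to compare the plans with EXPLAIN ANALYZE and see whether the id filter is applied inside the Parquet scan (filter pushdown) or only after reading the whole file. A minimal sketch, assuming create_connection() from the reproduction script below is in scope, with the same placeholder S3 path and a shortened id list:

# Assumes create_connection() from the reproduction script below.
conn = create_connection()
file_path = 's3://some_bucket/file_sorted_by_id.parquet'

# Inlined literals: the IN filter should appear inside the Parquet scan,
# so only the matching row groups are read.
print(conn.execute(f"""
    EXPLAIN ANALYZE
    SELECT id FROM read_parquet('{file_path}')
    WHERE id in ('0346EADE0DE2CFFC', '13BCADA2FAB1FD66');
""").fetchall()[0][1])

# Parametrized variant: check whether the filter is still pushed into
# the scan or applied only after a full scan.
print(conn.execute(f"""
    EXPLAIN ANALYZE
    SELECT id FROM read_parquet('{file_path}')
    WHERE id in $1;
""", parameters=[['0346EADE0DE2CFFC', '13BCADA2FAB1FD66']]).fetchall()[0][1])

conn.close()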

To Reproduce

Below is the Python script I used for performance testing; the same behavior can be reproduced by running the same or similar queries in the CLI.

import duckdb
import time


def create_connection():
    conn = duckdb.connect()

    conn.execute("""SET TimeZone = 'UTC';""")
    conn.execute("SET autoinstall_known_extensions = false;")
    conn.execute("INSTALL spatial; LOAD spatial;")
    conn.execute("INSTALL parquet; LOAD parquet;")
    conn.execute("INSTALL httpfs")
    conn.execute("INSTALL aws;")
    conn.execute("LOAD httpfs;")
    conn.execute("LOAD aws;")

    conn.execute("""
    CREATE SECRET aws_secret (
        TYPE S3,
        PROVIDER CREDENTIAL_CHAIN,
        CHAIN 'config'
    );
    """)
    return conn


def fast():
    # File is a parquet file ordered by id
    file_path = 's3://some_bucket/file_sorted_by_id.parquet'

    sql = f"""
          SELECT id
          FROM read_parquet('{file_path}')
          WHERE id in ('0346EADE0DE2CFFC', '13BCADA2FAB1FD66', '1CBE1BD7B8D59C98', '1D91EB13DA3E2472', '1EDACF5BEDAB3C10',
                    '1FCA7BEDE0FD9A3B', '23A306DAD89EAF0F', '28E205CC6CEFFBB8', '2CDA4C50BA2DEC8A', '30B464675F23E14B',
                    '37CE5E04CDA59E0F', '397CB382EDE1AC16', '3BBAB85A2228ADF8', '4A0A4A4BC5EB5099', '4ABAF8EEBD9DFE32',
                    '4CD67B12AC7B3D9D', '52F2CBEBD25C545D', '68782CDCEB6CDDFF', '6ABC479EF31B4AD7', '6CC7B512B7AFCE14',
                    '6CDC0CC540AEF13E', '75FB83ABE2AE2C8C', '7C82C5CCB80EDFBA', '7CFF8E5778F0F71B', '7EAE4FF84AB243AC',
                    '82B34DB28DF1AD8B', '901A27257C3FD449', '9CAE544FACCDD41D', '9D4D30EA3FE9D3E5', '9DE0191AEAFB0B7B',
                    'A0AE0D1FE22AE828', 'A6CFD40963F5BD95', 'A78ED4C6608CFEE5', 'ADE7F802A7D6CEE5', 'B1A39BB3AD58CB9A',
                    'B2D9CFEEAAF5A5C4', 'B4B150599EAEBD65', 'B4CCFCE24ABFCBA1', 'BBBAC104BD067AC7', 'BFD7028C65EACA6D',
                    'C3747454BECF20A0', 'C763573A73A3DA79', 'C78B6D74CD14EDF9', 'CA541676C357F5FF', 'CAE9ACDEC7539DE0',
                    'CC65DCBFA10B3EEE', 'CE343BDF60B26BE5', 'CF0EE41CC3CE418B', 'D3A69E024760745C', 'D7FEAA2EA8D2FECC',
                    'D8FDBADEBAC4A739', 'D9DDEBD64CCDE7EF', 'DCB539EEA25CFEF3', 'DFEACCB88AEF0EFF', 'E6C8B41EBECA1CEF',
                    'E99E67CFF37F4D44', 'EBFAFE2E18EAC99B', 'EF5BC93B43FBF9EE', 'F41BAC78137112BE', 'FAE3DFAAC7DD5B2D',
                    'FBA9DADFFE2B3A5F', 'FBEFFEBA7AEEE43A', 'FFDA9F7FDC824D05', 'FFF6DD3CD3B2D90B');
          """
    conn = create_connection()
    s = time.monotonic()
    print(conn.execute(sql).fetchall())
    e = time.monotonic()
    print(e - s)
    conn.close()


def slow():
    file_path = 's3://some_bucket/file_sorted_by_id.parquet'
    # This is much slower; the query plan is different, sometimes 10-20x
    # slower for larger files (a workaround sketch follows the script).
    sql = f"""
          SELECT id
          FROM read_parquet('{file_path}')
          WHERE id in $1;
          """
    query_values = ['0346EADE0DE2CFFC', '13BCADA2FAB1FD66', '1CBE1BD7B8D59C98', '1D91EB13DA3E2472', '1EDACF5BEDAB3C10',
                    '1FCA7BEDE0FD9A3B', '23A306DAD89EAF0F', '28E205CC6CEFFBB8', '2CDA4C50BA2DEC8A', '30B464675F23E14B',
                    '37CE5E04CDA59E0F', '397CB382EDE1AC16', '3BBAB85A2228ADF8', '4A0A4A4BC5EB5099', '4ABAF8EEBD9DFE32',
                    '4CD67B12AC7B3D9D', '52F2CBEBD25C545D', '68782CDCEB6CDDFF', '6ABC479EF31B4AD7', '6CC7B512B7AFCE14',
                    '6CDC0CC540AEF13E', '75FB83ABE2AE2C8C', '7C82C5CCB80EDFBA', '7CFF8E5778F0F71B', '7EAE4FF84AB243AC',
                    '82B34DB28DF1AD8B', '901A27257C3FD449', '9CAE544FACCDD41D', '9D4D30EA3FE9D3E5', '9DE0191AEAFB0B7B',
                    'A0AE0D1FE22AE828', 'A6CFD40963F5BD95', 'A78ED4C6608CFEE5', 'ADE7F802A7D6CEE5', 'B1A39BB3AD58CB9A',
                    'B2D9CFEEAAF5A5C4', 'B4B150599EAEBD65', 'B4CCFCE24ABFCBA1', 'BBBAC104BD067AC7', 'BFD7028C65EACA6D',
                    'C3747454BECF20A0', 'C763573A73A3DA79', 'C78B6D74CD14EDF9', 'CA541676C357F5FF', 'CAE9ACDEC7539DE0',
                    'CC65DCBFA10B3EEE', 'CE343BDF60B26BE5', 'CF0EE41CC3CE418B', 'D3A69E024760745C', 'D7FEAA2EA8D2FECC',
                    'D8FDBADEBAC4A739', 'D9DDEBD64CCDE7EF', 'DCB539EEA25CFEF3', 'DFEACCB88AEF0EFF', 'E6C8B41EBECA1CEF',
                    'E99E67CFF37F4D44', 'EBFAFE2E18EAC99B', 'EF5BC93B43FBF9EE', 'F41BAC78137112BE', 'FAE3DFAAC7DD5B2D',
                    'FBA9DADFFE2B3A5F', 'FBEFFEBA7AEEE43A', 'FFDA9F7FDC824D05', 'FFF6DD3CD3B2D90B']
    conn = create_connection()
    s = time.monotonic()
    print(conn.execute(sql, parameters=[query_values]).fetchall())
    e = time.monotonic()
    print(e - s)
    conn.close()


if __name__ == '__main__':
    fast()
    slow()
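
For anyone hitting the same thing: a possible stopgap is to inline the values as SQL literals, which is effectively what fast() does above. A minimal sketch; inline_in_list is a hypothetical helper, not part of DuckDB, and it bypasses parametrization, so it is only safe for values you trust:

def inline_in_list(values):
    # Hypothetical helper: render a list of strings as an IN list of
    # SQL literals, doubling embedded single quotes. This bypasses
    # parametrization, so only use it for trusted values.
    return "(" + ", ".join("'" + v.replace("'", "''") + "'" for v in values) + ")"


def workaround():
    file_path = 's3://some_bucket/file_sorted_by_id.parquet'
    # Same query_values list as in slow(); shortened here.
    query_values = ['0346EADE0DE2CFFC', '13BCADA2FAB1FD66']
    sql = f"""
          SELECT id
          FROM read_parquet('{file_path}')
          WHERE id in {inline_in_list(query_values)};
          """
    conn = create_connection()
    print(conn.execute(sql).fetchall())
    conn.close()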

OS:

macOS Sequoia 15.5

DuckDB Version:

1.4.1 & 1.3.2

DuckDB Client:

Python & CLI

Hardware:

Apple M4 Pro 48GB

Full Name:

Gregory Hartranft

Affiliation:

Prefer not to say

Did you include all relevant configuration (e.g., CPU architecture, Linux distribution) to reproduce the issue?

  • Yes, I have

Did you include all code required to reproduce the issue?

  • Yes, I have

Did you include all relevant data sets for reproducing the issue?

No - I cannot share the data sets because they are confidential
