Description
I have code that uses applyInPandas with a UDF that processes two DataFrames: the one the groupBy is applied to, and a second one that is passed as a parameter to the UDF.
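For readers unfamiliar with this pattern, here is a minimal sketch of what such a setup can look like. It is not the code from this report: the names `process_group`, `apply_per_group`, `lookup_pdf` and the columns `key`, `value`, `label` are all hypothetical, and the merge stands in for whatever the real per-group logic does. It assumes the second DataFrame is a pandas DataFrame small enough to be captured in a closure.

```python
import pandas as pd


def process_group(group_pdf: pd.DataFrame, lookup_pdf: pd.DataFrame) -> pd.DataFrame:
    """Per-group logic: enrich one group's rows using the second DataFrame."""
    # Hypothetical logic: left-join the group against the lookup table on "key".
    return group_pdf.merge(lookup_pdf, on="key", how="left")


def apply_per_group(sdf, lookup_pdf: pd.DataFrame, schema: str):
    """Run process_group once per "key" group of the Spark DataFrame `sdf`."""

    def _per_group(group_pdf: pd.DataFrame) -> pd.DataFrame:
        # applyInPandas passes a single pandas DataFrame per group, so the
        # second DataFrame is captured by this closure instead.
        return process_group(group_pdf, lookup_pdf)

    return sdf.groupBy("key").applyInPandas(_per_group, schema=schema)


# Local check of the per-group function; no Spark session is needed for this part.
group = pd.DataFrame({"key": ["a", "a"], "value": [1, 2]})
lookup = pd.DataFrame({"key": ["a"], "label": ["alpha"]})
result = process_group(group, lookup)
```

On a cluster, the call would look something like `apply_per_group(sdf, lookup, "key string, value long, label string")`, where the schema string describes the columns the per-group function returns.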
When I run the function on a smaller dataset, around ~200k records, it runs smoothly and finishes within an hour.
But when the data size increases to >800k records, it throws the following error:
Error Description
"Caused by: org.apache.spark.api.python.PythonException: 'TypeError: Expected Array, got <class 'pyarrow.lib.ChunkedArray'>'. Full traceback below: Traceback (most recent call last): File "pyarrow/array.pxi", line 2377, in pyarrow.lib.StructArray.from_arrays TypeError: Expected Array, got <class 'pyarrow.lib.ChunkedArray'>"
I am running this code on Databricks.
Software information
- Databricks Runtime 15.4 ML (includes Apache Spark 3.5.0)
- Python version: 3.11
Additional information
Below are the Spark configurations set on the cluster:
spark.databricks.service.server.enabled true
spark.databricks.service.port 15001
spark.databricks.delta.preview.enabled true