BUG: partitioning parquet by pyarrow.date32 fails when reading #53008
Comments
There is a suggestion in apache/arrow#22510 that the pandas metadata could be used to specify the schema needed for the pyarrow Dataset.
I'm not familiar with the pandas
Would you be interested in changing the
@phofl @jorisvandenbossche @mroeschke I see you are the main contributors of this code.
For the partitioning column, I believe the data is not stored in the parquet files themselves. When reading back in, the column is reconstructed from the file paths instead, so round-tripping the dtypes is not possible. In any case, this is a pyarrow issue and not a pandas one.
@rhshadrach pandas stores metadata in the parquet files, and this extra data makes it possible to restore the schema and dtypes. Since it needs pandas-specific info, it's debatable which of the two projects (pyarrow or pandas) should implement it. This is the same suggestion @jorisvandenbossche made in the linked JIRA issue.
I am no expert here, so correct me if this is wrong, but I believe this is done on the pyarrow side and not within pandas. |
@rhshadrach looks like it is there. I guess in this case the only question is whether it's acceptable that storing and then loading a pandas DataFrame raises an exception. Note that storing and loading the pyarrow table has no issues. Regardless, this can be closed; I'm adding the notes to the pyarrow issue now.
I think there are multiple aspects that interact:
And it's the combination of those three items that results in a dictionary-typed column with string categories (the result of the first two bullet points) that we then try to convert back to the original date dtype. And it's actually the pandas implementation of this conversion that raises the error here.
@jorisvandenbossche - is there something that should be done here on the pandas side?
I was still contemplating that; it might depend on the exact behaviour we would prefer to see in practice?
As a user, I desire round tripping regardless of whether partitions are used or not. For example:
If I call
I personally do not find getting a dictionary array / Categorical for partition columns to be a feature.
I don't have a strong opinion on categorical vs non-categorical (though if the default is to convert, a flag to disable it is always nice). The int vs string vs date issue is more annoying, mainly because it raises an exception. I've just realized that I had similar issues with pyarrow partition filters as well (the behaviour has changed in recent versions): apache/arrow#34727
Hmm, that's not what I see:
So the categories are correctly converted back to integers, but the column itself is categorical. (The above is with pyarrow 11.0.0 and pandas 1.5.3.)
@jorisvandenbossche: #53008 (comment) was my desired behavior, not the current state. But I'd be more than happy with a flag to disable categorical conversion, and I'm even really okay if I'm stuck with categories. It's the dtype conversion (string -> int), especially the stripping of leading 0s, that is a pain point for me.
The partition values of df2 are also converted to integers, even though they start out as strings.
Pandas version checks
- [x] I have checked that this issue has not already been reported.
- [x] I have confirmed this bug exists on the latest version of pandas.
- [x] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Issue Description
When partitioning is used, the pyarrow date32 value is written to the path and read back as a dictionary of strings instead of a dictionary of date32 values (or simply date32; I was surprised that dataset writing converts to a category type automatically). When trying to cast the strings back to date32, an exception is thrown.
Expected Behavior
Something similar to this:
This returns the original DataFrame.
Installed Versions