BUG: partitioning parquet by pyarrow.date32 fails when reading #53008
Comments
There is a suggestion in apache/arrow#22510 that the pandas metadata could be used to specify the schema needed for the pyarrow Dataset.
I'm not familiar with the pandas
Would you be interested in changing the
@phofl @jorisvandenbossche @mroeschke I see you are the main contributors of this code.
For the partitioning column, I believe the data is not stored in the parquet files themselves. When reading back in, the column is reconstructed from the file paths instead, so round-tripping the dtypes is not possible. In any case, this is a pyarrow issue and not a pandas one.
@rhshadrach pandas stores metadata in the parquet files, and this extra data makes it possible to restore the schema and dtypes. Since it needs pandas-specific info, it's debatable which of the two projects (pyarrow or pandas) should implement it. This is the same suggestion @jorisvandenbossche made in the linked JIRA issue.
I am no expert here, so correct me if this is wrong, but I believe this is done on the pyarrow side and not within pandas. |
@rhshadrach looks like it is there. I guess in this case the only question is whether it's acceptable that storing and then loading a pandas DataFrame raises an exception. Note that storing and loading the pyarrow table has no issues. Regardless, this can be closed; I'm adding the notes to the pyarrow issue now.
I think there are multiple aspects that interact:
And it's the combination of those three items that results in a dictionary-typed column with string categories (the result of the first two bullet points) that we then try to convert back to the original date dtype. And it's actually the pandas implementation of this conversion that raises the error here.
@jorisvandenbossche - is there something that should be done here on the pandas side?
I was still contemplating that; it might depend on the exact behaviour we would prefer to see in practice?
As a user, I desire round tripping regardless of whether partitions are used or not. For example:
If I call
I personally do not find getting a dictionary array / Categorical for partition columns to be a feature.
I don't have a strong opinion on categorical vs non-categorical (though if the default is to convert, a flag to disable it is always nice). The int vs string vs date issue is more annoying, mainly because it raises an exception. I've just realized that I had similar issues with pyarrow partition filters as well (the behaviour has changed in recent versions): apache/arrow#34727
Hmm, that's not what I see:
So the categories are correctly converted back to integers, but the column itself is categorical. (The above is with pyarrow 11.0.0 and pandas 1.5.3.)
@jorisvandenbossche: #53008 (comment) was my desired behavior, not the current state. But I'd be more than happy with a flag to disable categorical conversion, and I'm even really okay if I'm stuck with categories. It's the dtype conversion (string -> int), especially the stripping of leading 0s, that is a pain point for me.
The partition values of df2 are also converted to integers, even though they start out as strings.
Pandas version checks
- [x] I have checked that this issue has not already been reported.
- [x] I have confirmed this bug exists on the latest version of pandas.
- [x] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Issue Description
When partitioning is used, the pyarrow date32 value is written to the path and read back as a dictionary of strings instead of a dictionary of date32 values (or simply date32; I was surprised that dataset writing converts to a category type automatically). When trying to cast the strings back to date32, an exception is thrown.
Expected Behavior
Something similar to this:
This returns the original DataFrame.
Installed Versions