Fix roundtripping with pyarrow schema #54768
Nice, thanks @phofl
…hema) (#54773) Backport PR #54768: Fix roundtripping with pyarrow schema Co-authored-by: Patrick Hoefler <[email protected]>
path = tmp_path / "decimal.p"
df = pd.DataFrame({"a": [Decimal("123.00")]}, dtype="string[pyarrow]")
df.to_parquet(path, schema=pa.schema([("a", pa.decimal128(5))]))
result = read_parquet(path)
expected = pd.DataFrame({"a": ["123"]}, dtype="string[python]")
tm.assert_frame_equal(result, expected)
This doesn't look correct to me? If you specify that the schema should use decimals, it should not come back as string; I would expect object dtype (with decimal.Decimal objects)?
Ah, that's because regardless of specifying schema, the original dtype (string) is still stored in the "pandas_metadata", which is used when reading back and converting to pandas (this is something on the pyarrow side, but to be honest this doesn't feel right to me when the user also specifies a schema).
Yeah, this looks buggy, but probably on the arrow side. That said, strings are better than all NA, and we are only restoring the previous behaviour.
* Fix roundtripping with pyarrow schema
* Skip for lower versions
That's a regression on main as well; it's currently returning NA.
cc @mroeschke