BUG: avoid specifying default coerce_timestamps in to_parquet #31652
pandas/tests/io/test_parquet.py
Outdated
@td.skip_if_no("pyarrow", min_version="0.14")
def test_timestamp_nanoseconds(self, pa):
    # with version 2.0, pyarrow defaults to writing the nanoseconds, so
    # this should work with error
Should this say "work without error"?
yes, of course ... !
Co-Authored-By: Tom Augspurger <[email protected]>
thanks @jorisvandenbossche
For us this change caused a nasty bug after upgrading from pandas 1.0.3 to 1.1.x. Apparently AWS Athena/Presto doesn't support nanosecond precision, so our timestamps started appearing with year 52000 when we created the parquet files using pandas 1.1.x. Just FYI for anyone else having the same issue. Adding |
@mnylen FWIW, Presto has recently added support for nanosecond precision in date/time types, including timestamps (see trinodb/trino#1284 for more info).
@findepi I mentioned Presto only because AWS Athena uses it underneath. Athena is based on Presto 0.172, so even if Presto now supports nanoseconds for timestamps, it will likely take a long time before Athena upgrades to a newer version.
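For anyone wondering where a year in the 52,000s can come from: one plausible explanation (an assumption, not something confirmed in this thread) is that the reader interprets nanoseconds-since-epoch values as microseconds-since-epoch, which inflates the elapsed time by a factor of 1000. The arithmetic can be sketched with the standard library only; `misread_year` is a hypothetical helper written for this illustration.

```python
from datetime import datetime, timezone

# Mean length of a Gregorian year in seconds (approximation for the estimate).
SECONDS_PER_YEAR = 365.2425 * 24 * 3600


def misread_year(ts: datetime) -> int:
    """Approximate year a reader would show if it treated a
    nanoseconds-since-epoch value as microseconds-since-epoch.

    Hypothetical helper: this models the assumed misinterpretation only,
    not the actual Athena/Presto code path.
    """
    epoch_seconds = int(ts.timestamp())
    ns_value = epoch_seconds * 1_000_000_000       # value stored in the file
    seconds_if_read_as_us = ns_value / 1_000_000   # reader assumes microseconds
    return 1970 + int(seconds_if_read_as_us / SECONDS_PER_YEAR)


# A timestamp from 2020 ends up tens of thousands of years in the future.
year = misread_year(datetime(2020, 9, 1, tzinfo=timezone.utc))
print(year)
```

Python's own `datetime` cannot represent years past 9999, which is why the sketch estimates the year numerically instead of constructing a date.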
Looking into the usage question of #31572, I noticed that specifying the version to allow writing nanoseconds to parquet worked in plain pyarrow code, but not with pandas' `to_parquet`. This is because we hardcode `coerce_timestamps="ms"` while the default is `None`, which has version-dependent behaviour (e.g. if `version="2.0"`, actually write the nanosecond data).