BUG: avoid specifying default coerce_timestamps in to_parquet #31652

Merged
merged 3 commits into from
Feb 5, 2020

jorisvandenbossche
Member

Looking into the usage question of #31572, I noticed that specifying the version to allow writing nanoseconds to parquet worked in plain pyarrow code, but not with pandas' to_parquet.
This is because we hardcode coerce_timestamps="ms", whereas pyarrow's default is None, which has version-dependent behaviour (e.g. with version="2.0", the nanosecond data is actually written).
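
As a rough illustration of the point above, the following toy model shows why a hardcoded default in a wrapper hides the backend's version-dependent behaviour. All names here (backend_write, to_parquet_before, to_parquet_after) are hypothetical stand-ins, not pandas or pyarrow code, and the "us" fallback for version="1.0" is an assumption about the backend.

```python
from typing import Optional


def backend_write(version: str = "1.0",
                  coerce_timestamps: Optional[str] = None) -> str:
    """Hypothetical stand-in for a Parquet writer.

    Returns the timestamp unit that would end up on disk for
    nanosecond input data.
    """
    if coerce_timestamps is not None:
        # An explicit request always wins, whatever the version.
        return coerce_timestamps
    # Default (None): the result depends on the format version.
    # Assumption: "2.0" keeps nanoseconds, anything else falls back to "us".
    return "ns" if version == "2.0" else "us"


def to_parquet_before(coerce_timestamps: Optional[str] = "ms", **kwargs) -> str:
    # Old wrapper: always forwards a value, "ms" unless the caller overrides it.
    return backend_write(coerce_timestamps=coerce_timestamps, **kwargs)


def to_parquet_after(**kwargs) -> str:
    # New wrapper: forwards only what the caller passed,
    # so the backend's own default (None) applies.
    return backend_write(**kwargs)


print(to_parquet_before(version="2.0"))
print(to_parquet_after(version="2.0"))
print(to_parquet_after(version="2.0", coerce_timestamps="ms"))
```

In this sketch, the old wrapper truncates to milliseconds even when version="2.0" is requested, while the new wrapper lets the version decide and still honours an explicit coerce_timestamps.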

@jorisvandenbossche jorisvandenbossche added Bug IO Parquet parquet, feather labels Feb 4, 2020
@jorisvandenbossche jorisvandenbossche added this to the 1.1 milestone Feb 4, 2020
@td.skip_if_no("pyarrow", min_version="0.14")
def test_timestamp_nanoseconds(self, pa):
    # with version 2.0, pyarrow defaults to writing the nanoseconds, so
    # this should work with error
Contributor

Should this say "work without error"?

Member Author

yes, of course ... !

@jreback jreback merged commit be9ee6d into pandas-dev:master Feb 5, 2020
Contributor

jreback commented Feb 5, 2020

thanks @jorisvandenbossche

mnylen commented Oct 1, 2020

For us this change caused a nasty bug after upgrading from pandas 1.0.3 to 1.1.x. Apparently AWS Athena/Presto doesn't support nanosecond precision, so our timestamps started appearing with year 52000 when we created the parquet files using pandas 1.1.x. Just FYI for anyone else having the same issue. Adding coerce_timestamps="ms" (the previous default) to the to_parquet() call fixes the issue.
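
To make the "year 52000" symptom concrete, here is a rough back-of-the-envelope sketch in Python. It assumes the reader interprets the stored integer in a unit 1000 times coarser than the one it was written in (for example microseconds read as milliseconds); that cause is an assumption, not something confirmed in this thread, and apparent_year is a hypothetical helper.

```python
from datetime import datetime, timezone

# Mean Gregorian year in seconds; good enough for a rough estimate.
SECONDS_PER_YEAR = 365.2425 * 24 * 3600


def apparent_year(ts: datetime, written_per_second: int,
                  read_per_second: int) -> int:
    """Approximate year a reader sees if it misjudges the stored unit.

    written_per_second: ticks per second used when writing (1_000_000 for us).
    read_per_second: ticks per second the reader assumes (1_000 for ms).
    """
    # Integer stored on disk, counted from the Unix epoch.
    stored = int(ts.timestamp()) * written_per_second
    # Seconds since the epoch as the reader understands them.
    seconds_as_read = stored / read_per_second
    return 1970 + int(seconds_as_read // SECONDS_PER_YEAR)


ts = datetime(2020, 10, 1, tzinfo=timezone.utc)
print(apparent_year(ts, 1_000_000, 1_000_000))  # units agree
print(apparent_year(ts, 1_000_000, 1_000))      # reader's unit is 1000x coarser
```

With matching units the year comes out as 2020; with a factor-1000 mismatch it lands in the 52000s, which is consistent with the symptom described above.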

findepi commented Oct 1, 2020

@mnylen FWIW, Presto has recently gained support for nanosecond precision in date/time types, including timestamps (see trinodb/trino#1284 for more info).
However, this seems to be a different problem. If you have time, would you be able to answer a question in trinodb/trino#4662 (comment) so that we know what needs to be improved?

mnylen commented Oct 3, 2020

@findepi I mentioned Presto only because AWS Athena uses that underneath. Athena is based on Presto 0.172, so it's likely that even if Presto nowadays supports nanoseconds for timestamps, it'll take a long time before Athena upgrades to the newer version.
