
BUG: to_csv with chunksize mismatched formatting of dt64 #55481

Open
jbrockmendel opened this issue Oct 10, 2023 · 5 comments
Labels
Bug Datetime Datetime data dtype IO CSV read_csv, to_csv

Comments

@jbrockmendel
Member

jbrockmendel commented Oct 10, 2023

import pandas as pd

dti = pd.date_range("2016-01-01", periods=3, freq="D")
df = pd.DataFrame({"A": dti})
df.iloc[-1, -1] += pd.Timedelta(minutes=1)

res = df.to_csv(chunksize=1)

>>> print(res)
,A
0,2016-01-01
1,2016-01-02
2,2016-01-03 00:01:00

In io.formats.csvs, _save_body iterates over chunks and calls _save_chunk on each one, so the format is determined chunk by chunk. For dt64 values (and td64), however, most rendering contexts determine the format by looking at the entire array.

Note this doesn't actually require chunksize to be specified to get the bad behavior, just to get a concise example.
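
To make the mechanism concrete, here is a toy model in plain Python. It is not the pandas implementation: `infer_format` and `render` are hypothetical helpers, and the "date-only if every value is at midnight" rule is only an assumed stand-in for the real inference. It shows how inferring the format per chunk can disagree with inferring it over the whole column.

```python
from datetime import datetime, timedelta


def infer_format(values):
    """Hypothetical rule: date-only format if every value is at midnight."""
    all_midnight = all(v.time() == datetime.min.time() for v in values)
    return "%Y-%m-%d" if all_midnight else "%Y-%m-%d %H:%M:%S"


def render(values, chunksize):
    """Format values chunk by chunk, inferring the format separately per chunk."""
    out = []
    for start in range(0, len(values), chunksize):
        chunk = values[start:start + chunksize]
        fmt = infer_format(chunk)  # decided from this chunk only
        out.extend(v.strftime(fmt) for v in chunk)
    return out


# Same data as the example above: three daily values, the last shifted by one minute.
values = [datetime(2016, 1, 1) + timedelta(days=i) for i in range(3)]
values[-1] += timedelta(minutes=1)

per_chunk = render(values, chunksize=1)          # mixed formats
whole = render(values, chunksize=len(values))    # one consistent format
```

With `chunksize=1` the first two values come out date-only while the third carries a time component. With a single chunk covering the whole column, all three share one format.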

Expected Behavior

>>> print(res)
,A
0,2016-01-01 00:00:00
1,2016-01-02 00:00:00
2,2016-01-03 00:01:00
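
Until this is fixed, one possible way to get consistent output is to pass an explicit `date_format` to `to_csv`, which should bypass the per-chunk inference. This is a sketch of a workaround, not a fix for the underlying bug, and the format string is just one reasonable choice.

```python
import pandas as pd

dti = pd.date_range("2016-01-01", periods=3, freq="D")
df = pd.DataFrame({"A": dti})
df.iloc[-1, -1] += pd.Timedelta(minutes=1)

# An explicit date_format is applied to every value, regardless of chunking.
res = df.to_csv(chunksize=1, date_format="%Y-%m-%d %H:%M:%S")
lines = res.splitlines()
```

The trade-off is that the caller has to pick a resolution up front. A format without `%f` would drop sub-second precision if the column contains any.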

jbrockmendel added the Bug and Needs Triage labels on Oct 10, 2023
lithomas1 added the Datetime and IO CSV labels and removed the Needs Triage label on Jan 11, 2024

@xiaohuanlin
Contributor

@jbrockmendel Hi, I think your PR introduced this problem. Would you like to submit a new PR to fix it, or should I?
The reason is that the first two rows were treated as dates, so they were printed with the format "%Y-%m-%d".

@jbrockmendel
Member Author

Go for it.

@xiaohuanlin
Contributor

take

@xiaohuanlin
Contributor

After investigating carefully, I found that this is not easy to fix.
The date-only format is a special case: https://github.com/pandas-dev/pandas/blob/main/pandas/core/arrays/datetimes.py#L762

It is also tough to infer the format by looking at the entire array.
When the date_format parameter is not passed, the representation of each datetime is controlled by low-level code: https://github.com/pandas-dev/pandas/blob/main/pandas/_libs/tslib.pyx#L163. This means a single array can end up with many different formats.
For example:

dti = pd.date_range("2016-01-01", periods=3, freq="D")
df = pd.DataFrame({"A": dti})
df.iloc[-1, -1] += pd.Timedelta(microseconds=1)
df.iloc[-2, -1] += pd.Timedelta(minutes=1)

res = df.to_csv(chunksize=1)
>>> print(res)
,A
0,2016-01-01
1,2016-01-02 00:01:00
2,2016-01-03 00:00:00.000001

I considered removing the special case, but the code has existed for over 10 years, so removing it would probably break compatibility. Could you provide more information about your original expectations?

@jbrockmendel
Member Author

Best guess is that to address this we would need to find the appropriate format by looking at the entire column(s) in/before _save_body, then pass that to _save_chunk.
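
A rough illustration of that idea in plain Python, not pandas internals: `infer_column_format`, `save_body`, and `save_chunk` are hypothetical stand-ins for the real methods, and the three-level format rule is an assumption made for the sketch.

```python
from datetime import datetime, timedelta


def infer_column_format(values):
    """Pick the coarsest format that loses no information for any value."""
    if any(v.microsecond for v in values):
        return "%Y-%m-%d %H:%M:%S.%f"
    if any((v.hour, v.minute, v.second) != (0, 0, 0) for v in values):
        return "%Y-%m-%d %H:%M:%S"
    return "%Y-%m-%d"


def save_chunk(chunk, fmt):
    """Format one chunk using the format chosen by the caller."""
    return [v.strftime(fmt) for v in chunk]


def save_body(values, chunksize):
    """Infer the format once over the whole column, then pass it to each chunk."""
    fmt = infer_column_format(values)
    rows = []
    for start in range(0, len(values), chunksize):
        rows.extend(save_chunk(values[start:start + chunksize], fmt))
    return rows


# Data from the mixed-resolution example earlier in the thread.
values = [datetime(2016, 1, 1) + timedelta(days=i) for i in range(3)]
values[-1] += timedelta(microseconds=1)
values[-2] += timedelta(minutes=1)

rows = save_body(values, chunksize=1)
```

Because the format is chosen before chunking, every row is rendered at microsecond resolution here, whatever the chunk size. The cost is a full pass over the column before any chunk is written.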

This is probably not a Good First Issue unless you really need this to be fixed in the short term.

3 participants