BUG: to_csv with chunksize mismatched formatting of dt64 #55481

jbrockmendel · 2023-10-10T23:35:06Z

import pandas as pd

dti = pd.date_range("2016-01-01", periods=3, freq="D")
df = pd.DataFrame({"A": dti})
df.iloc[-1, -1] += pd.Timedelta(minutes=1)

res = df.to_csv(chunksize=1)

>>> print(res)
,A
0,2016-01-01
1,2016-01-02
2,2016-01-03 00:01:00

In io.formats.csvs, _save_body iterates over chunks and calls _save_chunk on each one, so the format is determined chunk by chunk. In most other rendering contexts, the format for dt64 (and td64) values is determined by looking at the entire array.

Note that chunksize does not need to be specified to trigger the bad behavior; specifying it just makes for a concise example.

Expected Behavior

>>> print(res)
,A
0,2016-01-01 00:00:00
1,2016-01-02 00:00:00
2,2016-01-03 00:01:00
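
As a possible workaround until the inference is fixed, the expected output above can be produced by passing an explicit date_format to to_csv, which bypasses per-chunk format inference. This is a minimal sketch, not a fix for the underlying bug; the format string is an assumption chosen to match the expected output.

```python
import pandas as pd

dti = pd.date_range("2016-01-01", periods=3, freq="D")
df = pd.DataFrame({"A": dti})
df.iloc[-1, -1] += pd.Timedelta(minutes=1)

# An explicit date_format is applied to every value, so each chunk
# is rendered the same way regardless of its contents.
res = df.to_csv(chunksize=1, date_format="%Y-%m-%d %H:%M:%S")
lines = res.splitlines()
print(res)
```

The trade-off is that the caller must pick a format up front, so sub-second precision would be dropped with this particular format string.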


xiaohuanlin · 2024-01-20T16:02:56Z

@jbrockmendel Hi, I think your PR introduced this problem. Would you like to submit a new PR to fix it, or should I?
The cause is that the first two rows were treated as dates, so they were printed as "%Y-%m-%d".

jbrockmendel · 2024-01-20T23:05:30Z

Go for it.

xiaohuanlin · 2024-01-22T19:57:47Z

take

xiaohuanlin · 2024-01-23T19:43:59Z

After investigating carefully, I found that this is not easy to fix.
The date-only format is a special case: https://github.com/pandas-dev/pandas/blob/main/pandas/core/arrays/datetimes.py#L762

Meanwhile, it is tough to infer the format by looking at the entire array.
When the date_format parameter is not passed, the representation of each datetime is controlled by low-level code: https://github.com/pandas-dev/pandas/blob/main/pandas/_libs/tslib.pyx#L163. This means a single array can end up with several different formats.
For example:

dti = pd.date_range("2016-01-01", periods=3, freq="D")
df = pd.DataFrame({"A": dti})
df.iloc[-1, -1] += pd.Timedelta(microseconds=1)
df.iloc[-2, -1] += pd.Timedelta(minutes=1)

res = df.to_csv(chunksize=1)
>>> print(res)
,A
0,2016-01-01
1,2016-01-02 00:01:00
2,2016-01-03 00:00:00.000001

I considered removing the special case, but the code has existed for over 10 years, so removing it would probably break compatibility. Could you give me more information about what you originally expected?

jbrockmendel · 2024-01-23T20:06:20Z

Best guess is that to address this we would need to find the appropriate format by looking at the entire column(s) in/before _save_body, then pass that to _save_chunk.
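
The idea above can be sketched with a simplified stdlib model. This is not pandas code: infer_format, format_chunk, and write_rows are hypothetical helpers standing in for the format inference, _save_chunk, and _save_body respectively, and the three-level format choice is an assumption made for illustration.

```python
from datetime import datetime


def infer_format(values):
    """Pick the shortest strftime format that loses nothing for any value."""
    if any(v.microsecond for v in values):
        return "%Y-%m-%d %H:%M:%S.%f"
    if any(v.hour or v.minute or v.second for v in values):
        return "%Y-%m-%d %H:%M:%S"
    return "%Y-%m-%d"


def format_chunk(chunk, fmt=None):
    # fmt=None models the reported behavior: the chunk infers its own format.
    fmt = fmt or infer_format(chunk)
    return [v.strftime(fmt) for v in chunk]


def write_rows(values, chunksize, whole_column):
    # With whole_column=True the format is inferred once, before chunking,
    # and passed down to every chunk.
    fmt = infer_format(values) if whole_column else None
    out = []
    for start in range(0, len(values), chunksize):
        out.extend(format_chunk(values[start:start + chunksize], fmt))
    return out


values = [datetime(2016, 1, 1), datetime(2016, 1, 2), datetime(2016, 1, 3, 0, 1)]
per_chunk = write_rows(values, chunksize=1, whole_column=False)
whole = write_rows(values, chunksize=1, whole_column=True)
print(per_chunk)
print(whole)
```

In this model, per-chunk inference reproduces the mixed output from the report, while inferring once over the whole column gives every row the same format.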

This is probably not a Good First Issue unless you really need this to be fixed in the short term.

jbrockmendel added the Bug and Needs Triage labels on Oct 10, 2023

lithomas1 added the Datetime and IO CSV labels and removed the Needs Triage label on Jan 11, 2024

github-actions bot assigned xiaohuanlin Jan 22, 2024
