PERF: Fix performance regression in read_csv when converting datetimes #52057

phofl · 2023-03-17T18:48:12Z

closes #xxxx (Replace xxxx with the GitHub issue number)
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

The read-only PR changed this to _values, which returns a DatetimeArray instead of an ndarray; that caused a performance regression.

https://asv-runner.github.io/asv-collection/pandas/#io.csv.ReadCSVConcatDatetime.time_read_csv
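
To illustrate the difference described above, here is a minimal sketch, assuming a recent pandas and timezone-naive input. It is not the parser code itself: the variable names are made up for illustration, and _values is a private attribute used here only to show what it returns.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a parsed, tz-naive datetime column.
result = pd.to_datetime(["2023-03-17", "2023-03-29"])

# Private accessor: yields pandas' DatetimeArray extension array.
values = result._values

# Public accessor: yields a plain NumPy datetime64 array for tz-naive data.
arr = result.to_numpy()

print(type(values).__name__)               # extension array type
print(type(arr).__name__, arr.dtype.kind)  # ndarray, kind "M" (datetime64)
```

Downstream code that expects an ndarray can then work on arr directly instead of going through the extension array.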

jbrockmendel · 2023-03-17T19:15:10Z

pandas/io/parsers/base_parser.py

-            )._values
+            )
+            if isinstance(result, DatetimeIndex):
+                arr = result.to_numpy()


do we know it is timezone-naive here?

No, tz-aware.

I guess you are referring to result._values._ndarray? I tried this first, but it breaks tests.

If it's tz-aware then to_numpy() should convert to object, which I haven't checked but assume we don't want here.

Yeah it does, but I am not too concerned by this since this is the same behavior as before. This keeps performance at least stable.

I have to clarify my initial response: the result can be either tz-aware or naive, depending on the input.
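
The to_numpy() behavior discussed in this thread can be checked with a small sketch like the one below, assuming a recent pandas. The index values are made up for illustration, and the tz_convert(None) step at the end is only one possible way to get datetime64 back, not something this PR does.

```python
import numpy as np
import pandas as pd

# Made-up example data: the same instants, once tz-naive and once tz-aware.
naive = pd.date_range("2023-03-17", periods=3, freq="D")
aware = naive.tz_localize("UTC")

naive_arr = naive.to_numpy()
aware_arr = aware.to_numpy()

# dtype.kind "M" means datetime64; "O" means object (an array of Timestamps).
print(naive_arr.dtype.kind)
print(aware_arr.dtype.kind)

# One option if datetime64 is wanted for tz-aware data (an assumption, not
# what the PR does): convert to UTC and drop the timezone first.
utc_arr = aware.tz_convert(None).to_numpy()
print(utc_arr.dtype.kind)
```

The object-dtype result for tz-aware input is the case the reviewer raises; the author notes it matches the earlier behavior.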

@jbrockmendel ok with merging? We should get this into 2.0

…in read_csv when converting datetimes) (#52278) Backport PR #52057: PERF: Fix performance regression in read_csv when converting datetimes Co-authored-by: Patrick Hoefler <[email protected]>

phofl added 2 commits March 17, 2023 19:41

PERF: Fix performance regression in read_csv when converting datetimes

87da43f

PERF: Fix performance regression in read_csv when converting datetimes

4066884

phofl requested a review from jorisvandenbossche March 17, 2023 18:48

phofl added the IO CSV read_csv, to_csv label Mar 17, 2023

phofl added this to the 2.0 milestone Mar 17, 2023

jbrockmendel reviewed Mar 17, 2023

phofl added the Performance Memory or execution speed performance label Mar 29, 2023

phofl merged commit beec0e8 into pandas-dev:main Mar 29, 2023

meeseeksmachine mentioned this pull request Mar 29, 2023

Backport PR #52057 on branch 2.0.x (PERF: Fix performance regression in read_csv when converting datetimes) #52278

Merged

meeseeksmachine pushed a commit to meeseeksmachine/pandas that referenced this pull request Mar 29, 2023

Backport PR pandas-dev#52057: PERF: Fix performance regression in rea…

dc49903

…d_csv when converting datetimes

phofl deleted the perf_read_csv_datetime branch March 29, 2023 15:33


