Skip to content

PERF: Fix performance regression in read_csv when converting datetimes #52057

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Mar 29, 2023

Conversation

phofl
Copy link
Member

@phofl phofl commented Mar 17, 2023

  • closes #xxxx (Replace xxxx with the GitHub issue number)
  • Tests added and passed if fixing a bug or adding a new feature
  • All code checks passed.
  • Added type annotations to new arguments/methods/functions.
  • Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

the read-only pr changed this to _values which returns a DatetimeArray instead of a ndarray, which caused a performance regression

https://asv-runner.github.io/asv-collection/pandas/#io.csv.ReadCSVConcatDatetime.time_read_csv

@phofl phofl added the IO CSV read_csv, to_csv label Mar 17, 2023
@phofl phofl added this to the 2.0 milestone Mar 17, 2023
)._values
)
if isinstance(result, DatetimeIndex):
arr = result.to_numpy()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we know it is timezone-naive here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no tz-aware.

I guess you are referring to result._values._ndarray? Tried this first, but breaks tests

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

its its tzaware then to_numpy() should convert to object, which i havent checked but assume we dont want here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah it does, but I am not too concerned by this since this is the same behavior as before. This keeps performance at least stable.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have to clarify my initial response: we can either be tz aware or naive, depends on the input

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jbrockmendel ok with merging? We should get this into 2.0

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK with me

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thx

@phofl phofl added the Performance Memory or execution speed performance label Mar 29, 2023
@phofl phofl merged commit beec0e8 into pandas-dev:main Mar 29, 2023
meeseeksmachine pushed a commit to meeseeksmachine/pandas that referenced this pull request Mar 29, 2023
@phofl phofl deleted the perf_read_csv_datetime branch March 29, 2023 15:33
mroeschke pushed a commit that referenced this pull request Mar 29, 2023
…in read_csv when converting datetimes) (#52278)

Backport PR #52057: PERF: Fix performance regression in read_csv when converting datetimes

Co-authored-by: Patrick Hoefler <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO CSV read_csv, to_csv Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants