PERF: parse non-ambiguous date formats in c #12667
Comments
See the 0.18.0 perf improvements
|
Note: if #11665 were implemented, this would easily be a 10x improvement here, as there are lots of repeats. Pull requests on that are welcome! |
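To illustrate why repeated values matter, here is a minimal sketch in plain Python of parsing each distinct string only once and reusing the result. It is not the pandas implementation and not what #11665 specifies; the function name `parse_dates_with_cache` and the sample values are assumptions made up for this example.

```python
from datetime import datetime


def parse_dates_with_cache(values, fmt="%m.%d.%Y"):
    """Parse each distinct string once and reuse the result for repeats.

    Hypothetical helper for illustration only; returns the parsed list
    and the number of distinct strings that actually had to be parsed.
    """
    cache = {}
    parsed = []
    for text in values:
        if text not in cache:
            # The expensive strptime call happens only on a cache miss.
            cache[text] = datetime.strptime(text, fmt)
        parsed.append(cache[text])
    return parsed, len(cache)


# Made-up sample data with many repeats, as in a typical daily time series.
raw = ["01.02.2015", "01.02.2015", "01.03.2015", "01.02.2015"]
dates, unique_count = parse_dates_with_cache(raw)
```

With a column that has few distinct dates relative to its length, the number of real parse calls drops to the number of distinct strings, which is where a large speedup would come from.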
I realize you actually want |
|
btw, even though |
So there was a bunch of discussion on #12585 about how this could be fixed. The primary way would be to allow additional C code to parse some other non-ambiguous formats. |
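As a rough idea of what a dedicated parser for one non-ambiguous format could look like, here is a sketch in Python rather than C. It reads a fixed-width "MM.DD.YYYY" string by position instead of going through a general format engine. The function name `parse_mdy_dotted` is hypothetical and this is not the approach discussed in #12585, only an illustration of the fast-path idea.

```python
from datetime import datetime


def parse_mdy_dotted(text):
    """Parse a fixed-width 'MM.DD.YYYY' string by character position.

    Hypothetical fast path for illustration; raises ValueError for any
    input that does not match the layout or is not a real calendar date.
    """
    if len(text) != 10 or text[2] != "." or text[5] != ".":
        raise ValueError(f"not in MM.DD.YYYY form: {text!r}")
    month, day, year = text[0:2], text[3:5], text[6:10]
    if not (month.isdigit() and day.isdigit() and year.isdigit()):
        raise ValueError(f"non-digit field in: {text!r}")
    # datetime() itself rejects out-of-range months and days.
    return datetime(int(year), int(month), int(day))


sample = parse_mdy_dotted("12.31.2015")
```

Because the separators and field widths are fixed, there is nothing to guess, which is what makes such a format a candidate for a specialized parser.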
Jeff, thanks for taking the time to look at this.
Honestly, if this is a non-ISO format then you guys should probably not
It just so happens that our data storage ends up dumping the timestamps in
I brought this up because I just thought there might be something out there
If someone takes this on and looks to make a more generalizable date
Alex
|
I should clarify, I think the support in pandas provided for messy/non
|
FWIW format caching is pretty high on the python-dateutil todo list |
Caching is now supported, and this format isn't ISO 8601, so I think we can close this. Thanks for the suggestion! |
Date parsing is slow on dates of the form "%m.%d.%Y" compared to "%Y-%m-%d". I am using 0.17.1 so maybe it's improved in 0.18 but I didn't see anything about it. I provide a benchmark below to compare the two datetime formats.
Some observations:
infer_datetime_format is the next best option, but it appears to be a massive slowdown. Compare the relative times on the large file to the relative times of the pure read on the small file and date parsing with infer_datetime_format.
Code Sample
test.csv (1 million lines)
test2.csv (5000 lines)
Output (2012 MacBook Air)
output of pd.show_versions()