-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: add date_format to read_csv / Date parsing mistake. read_csv #2586
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Try using the In [4]: read_csv('test.csv', index_col=0, parse_dates=True, dayfirst=True)
Out[4]:
b
a
2003-03-27 14:55:00 1
2003-08-03 15:20:00 2 If you want to specify the datetime format you can use the In [5]: read_csv('test.csv', index_col=0, date_parser=lambda x: datetime.strptime(x, '%d.%m.%Y %H:%M:%S.%f'))
Out[5]:
b
a
2003-03-27 14:55:00 1
2003-08-03 15:20:00 2 |
sorry, my bad for not looking at the API reference closely enough to spot the dayfirst flag. the date_parser parameter is what I was using to test using the datetime constructor. In my experience strptime is very slow compared to using the constructor and string slicing. Having some default formats to choose from that then call an optimised parser would seem a good idea. It would speed up parsing for those formats while also allowing the user to be strict with the format, without them having to write their own, potentially slow, parsing function. It seems worthwhile, what with the recent efforts to make file parsing much faster the date parsing really dominates the runtime of read_csv |
Here are the benchmarks I'm getting (50k of the format In [68]: %timeit read_csv('dates.csv', parse_dates=[0], dayfirst=True)
1 loops, best of 3: 4.6 s per loop
In [69]: %timeit read_csv('dates.csv', parse_dates=[0], date_parser=lambda x: datetime.strptime(x, '%d/%M/%Y'), dayfirst=True)
1 loops, best of 3: 1.18 s per loop And for passing directly to the constructor: In [70]: %timeit data = read_csv('dates.csv')
10 loops, best of 3: 21.1 ms per loop
In [71]: %time data.Date = map(lambda x, y, z: datetime(x, y, z), data.Date.map(lambda a: int(a[4:])), data.Date.map(lambda a: int(a[2:3])), data.Date.map(lambda a: int(a[:1])))
CPU times: user 0.35 s, sys: 0.00 s, total: 0.35 s
Wall time: 0.35 s So there's a ~3x speedup by specifying the format, and another ~3x from manually passing to the constructor. I don't think the user should be forced to pass a date format to get faster performance, especially when there might be multiple columns with different formats. We're currently assuming that a date column is not of a homogeneous format (by calling If we have a homogeneity flag, we could call |
I actually added a |
A 'format' argument would be a great feature in read_csv(). Would save you from writhing the 'fast' parser lambda function with strptime() while also handling the special non-parseable case where you want to just fill parsing errors with a np.nan or NaT. I'm assuming this would use datetime's format strings. |
the new infer_datetime_format flag solves this problem (though leaving open because ideally you specify the format directly) |
Small point, but just to give an example of a similar use case that should be handled - I have a csv in which some columns are dates and some are datetimes. I need to be able to provide a different format for each column that I want to parse (as per the OP's suggested specification). |
@maxgrenderjones interested in doing a PR for this? as an aside, try |
@mroeschke has this been addressed? Closeable? |
Doesn't appear so. But given that you can also just pass |
Yep, agreed that given the huge amount of keywords, we should maybe not add yet another one (although I would personally find a |
I still think is useful. Having to provide a lambda in the current state is rather wonky from an end user perspective and an explicit format keyword could probably enable much better performance on larger datasets |
Just had this problem in my code -
still gets "2010-10-01" for "10/01/2010" |
@alexbk66 that seems like a bug (maybe
|
What do you guys think about expanding
This would address the need for greater customized keyword arguments without bloating or requiring wonky lambda arguments. Also, if we allow |
That sounds like a nice idea to me! |
Well, who says that we couldn't accept a string (i.e. date format) for |
Is the only downside to having a dedicated |
API bloat isn't the only reason. Another benefit of this consolidation is that the I also might add that adapting |
Solved as of PDEP4, let's close In [53]: data = """a,b
...: 27.03.2003 14:55:00.000,1
...: 03.08.2003 15:20:00.000,2"""
In [54]: read_csv(io.StringIO(data),index_col=0, parse_dates=True)
Out[54]:
b
a
2003-03-27 14:55:00 1
2003-03-08 15:20:00 2 |
Sorry, I was too fast This is wrong, and is a serious bug. It should've been addressed by PDEP4, but I'm working on a fix, but in any case, we shouldn't be falling back to dateutil here #50319 will fix this, but I'll make a separate PR to remove this fallback and ensure this doesn't happen again pandas/pandas/io/parsers/base_parser.py Lines 1122 to 1134 in 7b0d4dd
|
Only took 10 years and 5 days, but this is actually fixed now: In [1]: data = """a,b
...: 27.03.2003 14:55:00.000,1
...: 03.08.2003 15:20:00.000,2"""
In [2]: read_csv(io.StringIO(data),index_col=0, parse_dates=True)
<ipython-input-2-1174b5f5763d>:1: UserWarning: Parsing dates in %d.%m.%Y %H:%M:%S.%f format when dayfirst=False was specified. Pass `dayfirst=True` or specify a format to silence this warning.
read_csv(io.StringIO(data),index_col=0, parse_dates=True)
Out[2]:
b
a
2003-03-27 14:55:00 1
2003-08-03 15:20:00 2 As the saying goes - "in open source, bugs get fixed...eventually" 🥳 |
While the original use case in the top post (specific example dates that got silently inconsitently parsed with day first vs last) has been resolved (we are now strict, i.e. PDEP-4), the general discussion here was about how we can expose a Of course, with the fact that with PDEP-4 we now try to guess the format by default, there is probably less need to pass custom For example, for the specific case here, the new warning message says "... or specify a format to silence this warning". But you can't actually specify a format through (if we still want to consider this feature request, it might be better to open a new, dedicated issue about it though, and keep this one closed) |
I read this as "here's a date parsing mistake, and a possible solution would be to add a You can already pass
I think that'd be better, the conversation here has already got quite long and many comments are now outdated - I've opened #50528 |
I've looked into this further, and I'll take back
As mentioned in #50601, Let's take the conversation over to #50601 then - closing again, sorry for the open/close ping pong here |
date_format
keyword could take the format, dict of columns to format, or list of formats (and could then obviate the need for parse_dates)Sometimes months and days get mixed up.
E.g.
test.csv:
a,b
27.03.2003 14:55:00.000,1
03.08.2003 15:20:00.000,2
read_csv("/home/john/Documents/test.csv",index_col=0, parse_dates=True)
b
a
2003-03-27 14:55:00 1
2003-03-08 15:20:00 2
There doesn't appear to be any continuity in the date parsing over the rows. As well as meaning things can easily get switched around, this makes date parsing VERY slow. Once you know the format, using the datetime constructor with string slicing as a parser makes read_csv 20x faster on my machine.
I think there needs to be some more parameters for specifying date formats. seeing as in the general case dates a string of dates can be ambiguous (see above).
A possible approach: Have a few default formats to choose from, as well as a more general format string approach. Obviously the defaults could use the datetime constructor with string slicing, which is very fast.
Perhaps have a dayfirst and yearfirst flag that gets passed to dateutil.parser.parse to solve ambiguities when using automatic parsing.
The text was updated successfully, but these errors were encountered: