Skip to content

BUG: add date_format to read_csv / Date parsing mistake. read_csv #2586

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
John-Colvin opened this issue Dec 22, 2012 · 25 comments · Fixed by #50320
Closed

BUG: add date_format to read_csv / Date parsing mistake. read_csv #2586

John-Colvin opened this issue Dec 22, 2012 · 25 comments · Fixed by #50320
Labels
Bug Datetime Datetime data dtype IO CSV read_csv, to_csv

Comments

@John-Colvin
Copy link
Contributor

date_format keyword could take the format, dict of columns to format, or list of formats (and could then obviate the need for parse_dates)

Sometimes months and days get mixed up.

E.g.
test.csv:

a,b
27.03.2003 14:55:00.000,1
03.08.2003 15:20:00.000,2

read_csv("/home/john/Documents/test.csv",index_col=0, parse_dates=True)
b
a
2003-03-27 14:55:00 1
2003-03-08 15:20:00 2

There doesn't appear to be any continuity in the date parsing over the rows. As well as meaning things can easily get switched around, this makes date parsing VERY slow. Once you know the format, using the datetime constructor with string slicing as a parser makes read_csv 20x faster on my machine.

I think there needs to be some more parameters for specifying date formats. seeing as in the general case dates a string of dates can be ambiguous (see above).

A possible approach: Have a few default formats to choose from, as well as a more general format string approach. Obviously the defaults could use the datetime constructor with string slicing, which is very fast.

Perhaps have a dayfirst and yearfirst flag that gets passed to dateutil.parser.parse to solve ambiguities when using automatic parsing.

@qwhelan
Copy link
Contributor

qwhelan commented Dec 23, 2012

Try using the dayfirst parameter (new in v0.10.0):

In [4]: read_csv('test.csv', index_col=0, parse_dates=True, dayfirst=True)
Out[4]: 
                     b
a                     
2003-03-27 14:55:00  1
2003-08-03 15:20:00  2

If you want to specify the datetime format you can use the date_parser parameter:

In [5]: read_csv('test.csv', index_col=0, date_parser=lambda x: datetime.strptime(x, '%d.%m.%Y %H:%M:%S.%f'))
Out[5]: 
                     b
a                     
2003-03-27 14:55:00  1
2003-08-03 15:20:00  2

@John-Colvin
Copy link
Contributor Author

sorry, my bad for not looking at the API reference closely enough to spot the dayfirst flag.

the date_parser parameter is what I was using to test using the datetime constructor. In my experience strptime is very slow compared to using the constructor and string slicing. Having some default formats to choose from that then call an optimised parser would seem a good idea.

It would speed up parsing for those formats while also allowing the user to be strict with the format, without them having to write their own, potentially slow, parsing function.

It seems worthwhile, what with the recent efforts to make file parsing much faster the date parsing really dominates the runtime of read_csv

@qwhelan
Copy link
Contributor

qwhelan commented Dec 24, 2012

Here are the benchmarks I'm getting (50k of the format %d/%M/%Y):

In [68]: %timeit read_csv('dates.csv', parse_dates=[0], dayfirst=True)
1 loops, best of 3: 4.6 s per loop

In [69]: %timeit read_csv('dates.csv', parse_dates=[0], date_parser=lambda x: datetime.strptime(x, '%d/%M/%Y'), dayfirst=True)
1 loops, best of 3: 1.18 s per loop

And for passing directly to the constructor:

In [70]: %timeit data = read_csv('dates.csv')
10 loops, best of 3: 21.1 ms per loop

In [71]: %time data.Date = map(lambda x, y, z: datetime(x, y, z), data.Date.map(lambda a: int(a[4:])), data.Date.map(lambda a: int(a[2:3])), data.Date.map(lambda a: int(a[:1])))
CPU times: user 0.35 s, sys: 0.00 s, total: 0.35 s
Wall time: 0.35 s

So there's a ~3x speedup by specifying the format, and another ~3x from manually passing to the constructor.

I don't think the user should be forced to pass a date format to get faster performance, especially when there might be multiple columns with different formats. We're currently assuming that a date column is not of a homogeneous format (by calling parse() independently on every value). All those independent calls probably means we're seeing a lot of object creation/destruction (ie, compiled regexes that are used once).

If we have a homogeneity flag, we could call parse() on the first row, get the pattern it settled on (may not be currently available from parse()), and then construct an array of datetimes.

@wesm
Copy link
Member

wesm commented Apr 7, 2013

I actually added a format argument to to_datetime-- it might be worth adding to read_csv, maybe. Alternately you can parse without using parse_dates and use to_datetime yourself and it should be reasonably fast.

@dragoljub
Copy link

A 'format' argument would be a great feature in read_csv(). Would save you from writhing the 'fast' parser lambda function with strptime() while also handling the special non-parseable case where you want to just fill parsing errors with a np.nan or NaT. I'm assuming this would use datetime's format strings.

@jreback
Copy link
Contributor

jreback commented Feb 28, 2014

the new infer_datetime_format flag solves this problem (though leaving open because ideally you specify the format directly)

@jreback jreback modified the milestones: 0.15.0, 0.14.0 Feb 28, 2014
@maxgrenderjones
Copy link
Contributor

Small point, but just to give an example of a similar use case that should be handled - I have a csv in which some columns are dates and some are datetimes. I need to be able to provide a different format for each column that I want to parse (as per the OP's suggested specification).

@jreback
Copy link
Contributor

jreback commented Mar 21, 2014

@maxgrenderjones interested in doing a PR for this?

as an aside, try infer_datetime_format kw (its in 0.13.1), should be able to figure out most common formats.

@jreback jreback modified the milestones: 0.16.0, 0.17.0 Jan 26, 2015
@datapythonista datapythonista modified the milestones: Contributions Welcome, Someday Jul 8, 2018
@jbrockmendel
Copy link
Member

@mroeschke has this been addressed? Closeable?

@mroeschke
Copy link
Member

Doesn't appear so. read_csv has a date_parser argument that accepts a generic parsing function but I think the ask is for a format argument like pd.to_datetime(..., format='...')

But given that you can also just pass pd.read_csv(..., date_parser=lambda x: pd.to_datetime(x, format='...')) as a suitable alternative, I am not sure how on board I am with added a separate format argument to read_csv. I'd be more clean, but it would grow the already large number of kwargs.

@jorisvandenbossche
Copy link
Member

Yep, agreed that given the huge amount of keywords, we should maybe not add yet another one (although I would personally find a format keyword more useful than the current date_parser)

@WillAyd
Copy link
Member

WillAyd commented Oct 26, 2018

I still think is useful. Having to provide a lambda in the current state is rather wonky from an end user perspective and an explicit format keyword could probably enable much better performance on larger datasets

@alexbk66
Copy link

Just had this problem in my code -

dayfirst = True,
infer_datetime_format = True

still gets "2010-10-01" for "10/01/2010"

@jorisvandenbossche
Copy link
Member

@alexbk66 that seems like a bug (maybe dayfirst is not passed through correctly somewhere), as this combination works fine with to_datetime:

In [81]: pd.to_datetime("10/01/2010", infer_datetime_format=True)                                                                                                                                                   
Out[81]: Timestamp('2010-10-01 00:00:00')

In [82]: pd.to_datetime("10/01/2010", infer_datetime_format=True, dayfirst=True)                                                                                                                                    
Out[82]: Timestamp('2010-01-10 00:00:00')

@gfyoung
Copy link
Member

gfyoung commented Nov 21, 2018

What do you guys think about expanding date_parser to be the following:

  • callable: custom date parser (existing)
  • kwargs: parameters to pass to default date converter (proposed)

This would address the need for greater customized keyword arguments without bloating or requiring wonky lambda arguments. Also, if we allow date_parser to be kwargs, we could then deprecate / remove dayfirst and infer_datetime_format from the signature and require them to be passed as a dict via date_parser (i.e. less API bloat).

@gfyoung gfyoung modified the milestones: Someday, Contributions Welcome Nov 21, 2018
@jorisvandenbossche
Copy link
Member

That sounds like a nice idea to me!
It's of course not yet that user friendly as a date_format keyword, but I would say better than a lambda function that calls to_datetime.

@gfyoung
Copy link
Member

gfyoung commented Nov 21, 2018

It's of course not yet that user friendly as a date_format keyword, but I would say better than a lambda function that calls to_datetime.

Well, who says that we couldn't accept a string (i.e. date format) for date_parser ? 😉 I think that could be a good enhancement after the refactor.

@WillAyd
Copy link
Member

WillAyd commented Nov 21, 2018

Is the only downside to having a dedicated format argument that we already have enough keywords? I still think that as a dedicated keyword is best here, especially given we already have that in to_datetime

@gfyoung
Copy link
Member

gfyoung commented Nov 21, 2018

API bloat isn't the only reason. Another benefit of this consolidation is that the read_csv API becomes agnostic to the signature of to_datetime. We could have 20 keyword arguments for to_datetime, but we wouldn't have to change the signature of read_csv to accommodate. That's what we're running into now. Adding keyword arguments as has been suggested is only a band-aid IMO.

I also might add that adapting date_parser to perform different functions based on type would be consistent with other keywords such as usecols, skiprows, and parse_dates.

@jbrockmendel jbrockmendel removed the IO Data IO issues that don't fit into a more specific label label Dec 1, 2019
@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
@MarcoGorelli
Copy link
Member

Solved as of PDEP4, let's close

In [53]: data = """a,b
    ...: 27.03.2003 14:55:00.000,1
    ...: 03.08.2003 15:20:00.000,2"""

In [54]: read_csv(io.StringIO(data),index_col=0, parse_dates=True)
Out[54]: 
                     b
a                     
2003-03-27 14:55:00  1
2003-03-08 15:20:00  2

@MarcoGorelli
Copy link
Member

MarcoGorelli commented Dec 17, 2022

Sorry, I was too fast

This is wrong, and is a serious bug. It should've been addressed by PDEP4, but guess_datetime_format errors on this input #50317

I'm working on a fix, but in any case, we shouldn't be falling back to dateutil here


#50319 will fix this, but I'll make a separate PR to remove this fallback and ensure this doesn't happen again

try:
return tools.to_datetime(
ensure_object(strs),
utc=False,
dayfirst=dayfirst,
errors="ignore",
cache=cache_dates,
).to_numpy()
except ValueError:
return tools.to_datetime(
parsing.try_parse_dates(strs, dayfirst=dayfirst), cache=cache_dates
)

@MarcoGorelli MarcoGorelli changed the title ENH: add date_format to read_csv / Date parsing mistake. read_csv BUG: add date_format to read_csv / Date parsing mistake. read_csv Dec 17, 2022
@MarcoGorelli
Copy link
Member

MarcoGorelli commented Dec 27, 2022

Only took 10 years and 5 days, but this is actually fixed now:

In [1]: data = """a,b
   ...:     27.03.2003 14:55:00.000,1
   ...:     03.08.2003 15:20:00.000,2"""

In [2]: read_csv(io.StringIO(data),index_col=0, parse_dates=True)
<ipython-input-2-1174b5f5763d>:1: UserWarning: Parsing dates in     %d.%m.%Y %H:%M:%S.%f format when dayfirst=False was specified. Pass `dayfirst=True` or specify a format to silence this warning.
  read_csv(io.StringIO(data),index_col=0, parse_dates=True)
Out[2]: 
                     b
a                     
2003-03-27 14:55:00  1
2003-08-03 15:20:00  2

As the saying goes - "in open source, bugs get fixed...eventually"

🥳

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Jan 2, 2023

While the original use case in the top post (specific example dates that got silently inconsitently parsed with day first vs last) has been resolved (we are now strict, i.e. PDEP-4), the general discussion here was about how we can expose a format keyword (or a more general mechanism to expose to_datetime keywords in read_csv), i.e. the "add date_format to read_csv" part of the title. And I don't think that is already resolved?

Of course, with the fact that with PDEP-4 we now try to guess the format by default, there is probably less need to pass custom to_datetime keywords. But for the subset where you still need to specify a keyword, having a mechanism to pass it through could still be useful.

For example, for the specific case here, the new warning message says "... or specify a format to silence this warning". But you can't actually specify a format through pd.read_csv

(if we still want to consider this feature request, it might be better to open a new, dedicated issue about it though, and keep this one closed)

@MarcoGorelli
Copy link
Member

I read this as "here's a date parsing mistake, and a possible solution would be to add a date_format argument" - as the parsing mistake is addressed, that's why I considered it closed

You can already pass date_parser = lambda x: to_datetime(x, format=fmt), not sure there needs to be a separate argument

(if we still want to consider this feature request, it might be better to open a new, dedicated issue about it though, and keep this one closed)

I think that'd be better, the conversation here has already got quite long and many comments are now outdated - I've opened #50528

@MarcoGorelli
Copy link
Member

I've looked into this further, and I'll take back

You can already pass date_parser = lambda x: to_datetime(x, format=fmt), not sure there needs to be a separate argument

As mentioned in #50601, date_parser is actually a net negative (if anyone has a counter-example, please do bring it up) - so, I'd be totally onboard with adding date_format, and deprecating date_parser

Let's take the conversation over to #50601 then - closing again, sorry for the open/close ping pong here

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Datetime Datetime data dtype IO CSV read_csv, to_csv
Projects
None yet
Development

Successfully merging a pull request may close this issue.