BUG: add date_format to read_csv / Date parsing mistake. read_csv #2586

John-Colvin · 2012-12-22T23:57:52Z

date_format keyword could take the format, dict of columns to format, or list of formats (and could then obviate the need for parse_dates)

Sometimes months and days get mixed up.

E.g.
test.csv:

a,b
27.03.2003 14:55:00.000,1
03.08.2003 15:20:00.000,2

read_csv("/home/john/Documents/test.csv",index_col=0, parse_dates=True)
b
a
2003-03-27 14:55:00 1
2003-03-08 15:20:00 2

There doesn't appear to be any continuity in the date parsing over the rows. As well as meaning things can easily get switched around, this makes date parsing VERY slow. Once you know the format, using the datetime constructor with string slicing as a parser makes read_csv 20x faster on my machine.

I think there needs to be some more parameters for specifying date formats. seeing as in the general case dates a string of dates can be ambiguous (see above).

A possible approach: Have a few default formats to choose from, as well as a more general format string approach. Obviously the defaults could use the datetime constructor with string slicing, which is very fast.

Perhaps have a dayfirst and yearfirst flag that gets passed to dateutil.parser.parse to solve ambiguities when using automatic parsing.

The text was updated successfully, but these errors were encountered:

qwhelan · 2012-12-23T10:24:38Z

Try using the dayfirst parameter (new in v0.10.0):

In [4]: read_csv('test.csv', index_col=0, parse_dates=True, dayfirst=True)
Out[4]: 
                     b
a                     
2003-03-27 14:55:00  1
2003-08-03 15:20:00  2

If you want to specify the datetime format you can use the date_parser parameter:

In [5]: read_csv('test.csv', index_col=0, date_parser=lambda x: datetime.strptime(x, '%d.%m.%Y %H:%M:%S.%f'))
Out[5]: 
                     b
a                     
2003-03-27 14:55:00  1
2003-08-03 15:20:00  2

John-Colvin · 2012-12-23T16:03:05Z

sorry, my bad for not looking at the API reference closely enough to spot the dayfirst flag.

the date_parser parameter is what I was using to test using the datetime constructor. In my experience strptime is very slow compared to using the constructor and string slicing. Having some default formats to choose from that then call an optimised parser would seem a good idea.

It would speed up parsing for those formats while also allowing the user to be strict with the format, without them having to write their own, potentially slow, parsing function.

It seems worthwhile, what with the recent efforts to make file parsing much faster the date parsing really dominates the runtime of read_csv

qwhelan · 2012-12-24T08:33:35Z

Here are the benchmarks I'm getting (50k of the format %d/%M/%Y):

In [68]: %timeit read_csv('dates.csv', parse_dates=[0], dayfirst=True)
1 loops, best of 3: 4.6 s per loop

In [69]: %timeit read_csv('dates.csv', parse_dates=[0], date_parser=lambda x: datetime.strptime(x, '%d/%M/%Y'), dayfirst=True)
1 loops, best of 3: 1.18 s per loop

And for passing directly to the constructor:

In [70]: %timeit data = read_csv('dates.csv')
10 loops, best of 3: 21.1 ms per loop

In [71]: %time data.Date = map(lambda x, y, z: datetime(x, y, z), data.Date.map(lambda a: int(a[4:])), data.Date.map(lambda a: int(a[2:3])), data.Date.map(lambda a: int(a[:1])))
CPU times: user 0.35 s, sys: 0.00 s, total: 0.35 s
Wall time: 0.35 s

So there's a ~3x speedup by specifying the format, and another ~3x from manually passing to the constructor.

I don't think the user should be forced to pass a date format to get faster performance, especially when there might be multiple columns with different formats. We're currently assuming that a date column is not of a homogeneous format (by calling parse() independently on every value). All those independent calls probably means we're seeing a lot of object creation/destruction (ie, compiled regexes that are used once).

If we have a homogeneity flag, we could call parse() on the first row, get the pattern it settled on (may not be currently available from parse()), and then construct an array of datetimes.

wesm · 2013-04-07T02:49:17Z

I actually added a format argument to to_datetime-- it might be worth adding to read_csv, maybe. Alternately you can parse without using parse_dates and use to_datetime yourself and it should be reasonably fast.

dragoljub · 2013-04-08T03:48:55Z

A 'format' argument would be a great feature in read_csv(). Would save you from writhing the 'fast' parser lambda function with strptime() while also handling the special non-parseable case where you want to just fill parsing errors with a np.nan or NaT. I'm assuming this would use datetime's format strings.

jreback · 2014-02-28T11:39:58Z

the new infer_datetime_format flag solves this problem (though leaving open because ideally you specify the format directly)

maxgrenderjones · 2014-03-21T12:45:16Z

Small point, but just to give an example of a similar use case that should be handled - I have a csv in which some columns are dates and some are datetimes. I need to be able to provide a different format for each column that I want to parse (as per the OP's suggested specification).

jreback · 2014-03-21T12:47:45Z

@maxgrenderjones interested in doing a PR for this?

as an aside, try infer_datetime_format kw (its in 0.13.1), should be able to figure out most common formats.

jbrockmendel · 2018-10-25T03:14:04Z

@mroeschke has this been addressed? Closeable?

mroeschke · 2018-10-25T18:21:42Z

Doesn't appear so. read_csv has a date_parser argument that accepts a generic parsing function but I think the ask is for a format argument like pd.to_datetime(..., format='...')

But given that you can also just pass pd.read_csv(..., date_parser=lambda x: pd.to_datetime(x, format='...')) as a suitable alternative, I am not sure how on board I am with added a separate format argument to read_csv. I'd be more clean, but it would grow the already large number of kwargs.

jorisvandenbossche · 2018-10-26T10:54:11Z

Yep, agreed that given the huge amount of keywords, we should maybe not add yet another one (although I would personally find a format keyword more useful than the current date_parser)

WillAyd · 2018-10-26T15:00:05Z

I still think is useful. Having to provide a lambda in the current state is rather wonky from an end user perspective and an explicit format keyword could probably enable much better performance on larger datasets

alexbk66 · 2018-11-21T00:57:23Z

Just had this problem in my code -

dayfirst = True,
infer_datetime_format = True

still gets "2010-10-01" for "10/01/2010"

jorisvandenbossche · 2018-11-21T09:09:36Z

@alexbk66 that seems like a bug (maybe dayfirst is not passed through correctly somewhere), as this combination works fine with to_datetime:

In [81]: pd.to_datetime("10/01/2010", infer_datetime_format=True)                                                                                                                                                   
Out[81]: Timestamp('2010-10-01 00:00:00')

In [82]: pd.to_datetime("10/01/2010", infer_datetime_format=True, dayfirst=True)                                                                                                                                    
Out[82]: Timestamp('2010-01-10 00:00:00')

gfyoung · 2018-11-21T18:14:22Z

What do you guys think about expanding date_parser to be the following:

callable: custom date parser (existing)
kwargs: parameters to pass to default date converter (proposed)

This would address the need for greater customized keyword arguments without bloating or requiring wonky lambda arguments. Also, if we allow date_parser to be kwargs, we could then deprecate / remove dayfirst and infer_datetime_format from the signature and require them to be passed as a dict via date_parser (i.e. less API bloat).

jorisvandenbossche · 2018-11-21T20:22:31Z

That sounds like a nice idea to me!
It's of course not yet that user friendly as a date_format keyword, but I would say better than a lambda function that calls to_datetime.

gfyoung · 2018-11-21T20:28:16Z

It's of course not yet that user friendly as a date_format keyword, but I would say better than a lambda function that calls to_datetime.

Well, who says that we couldn't accept a string (i.e. date format) for date_parser ? 😉 I think that could be a good enhancement after the refactor.

WillAyd · 2018-11-21T20:51:28Z

Is the only downside to having a dedicated format argument that we already have enough keywords? I still think that as a dedicated keyword is best here, especially given we already have that in to_datetime

gfyoung · 2018-11-21T21:45:56Z

API bloat isn't the only reason. Another benefit of this consolidation is that the read_csv API becomes agnostic to the signature of to_datetime. We could have 20 keyword arguments for to_datetime, but we wouldn't have to change the signature of read_csv to accommodate. That's what we're running into now. Adding keyword arguments as has been suggested is only a band-aid IMO.

I also might add that adapting date_parser to perform different functions based on type would be consistent with other keywords such as usecols, skiprows, and parse_dates.

MarcoGorelli · 2022-12-17T15:02:24Z

Solved as of PDEP4, let's close

In [53]: data = """a,b
    ...: 27.03.2003 14:55:00.000,1
    ...: 03.08.2003 15:20:00.000,2"""

In [54]: read_csv(io.StringIO(data),index_col=0, parse_dates=True)
Out[54]: 
                     b
a                     
2003-03-27 14:55:00  1
2003-03-08 15:20:00  2

MarcoGorelli · 2022-12-17T15:03:55Z

Sorry, I was too fast

This is wrong, and is a serious bug. It should've been addressed by PDEP4, but guess_datetime_format errors on this input #50317

I'm working on a fix, but in any case, we shouldn't be falling back to dateutil here

#50319 will fix this, but I'll make a separate PR to remove this fallback and ensure this doesn't happen again

pandas/pandas/io/parsers/base_parser.py

Lines 1122 to 1134 in 7b0d4dd

    
           try: 
        
               return tools.to_datetime( 
        
                   ensure_object(strs), 
        
                   utc=False, 
        
                   dayfirst=dayfirst, 
        
                   errors="ignore", 
        
                   cache=cache_dates, 
        
               ).to_numpy() 
        
           except ValueError: 
        
               return tools.to_datetime( 
        
                   parsing.try_parse_dates(strs, dayfirst=dayfirst), cache=cache_dates 
        
               )

MarcoGorelli · 2022-12-27T20:59:09Z

Only took 10 years and 5 days, but this is actually fixed now:

In [1]: data = """a,b
   ...:     27.03.2003 14:55:00.000,1
   ...:     03.08.2003 15:20:00.000,2"""

In [2]: read_csv(io.StringIO(data),index_col=0, parse_dates=True)
<ipython-input-2-1174b5f5763d>:1: UserWarning: Parsing dates in     %d.%m.%Y %H:%M:%S.%f format when dayfirst=False was specified. Pass `dayfirst=True` or specify a format to silence this warning.
  read_csv(io.StringIO(data),index_col=0, parse_dates=True)
Out[2]: 
                     b
a                     
2003-03-27 14:55:00  1
2003-08-03 15:20:00  2

As the saying goes - "in open source, bugs get fixed...eventually"

🥳

jorisvandenbossche · 2023-01-02T13:37:46Z

While the original use case in the top post (specific example dates that got silently inconsitently parsed with day first vs last) has been resolved (we are now strict, i.e. PDEP-4), the general discussion here was about how we can expose a format keyword (or a more general mechanism to expose to_datetime keywords in read_csv), i.e. the "add date_format to read_csv" part of the title. And I don't think that is already resolved?

Of course, with the fact that with PDEP-4 we now try to guess the format by default, there is probably less need to pass custom to_datetime keywords. But for the subset where you still need to specify a keyword, having a mechanism to pass it through could still be useful.

For example, for the specific case here, the new warning message says "... or specify a format to silence this warning". But you can't actually specify a format through pd.read_csv

(if we still want to consider this feature request, it might be better to open a new, dedicated issue about it though, and keep this one closed)

MarcoGorelli · 2023-01-02T14:20:48Z

I read this as "here's a date parsing mistake, and a possible solution would be to add a date_format argument" - as the parsing mistake is addressed, that's why I considered it closed

You can already pass date_parser = lambda x: to_datetime(x, format=fmt), not sure there needs to be a separate argument

(if we still want to consider this feature request, it might be better to open a new, dedicated issue about it though, and keep this one closed)

I think that'd be better, the conversation here has already got quite long and many comments are now outdated - I've opened #50528

MarcoGorelli · 2023-01-06T14:10:29Z

I've looked into this further, and I'll take back

You can already pass date_parser = lambda x: to_datetime(x, format=fmt), not sure there needs to be a separate argument

As mentioned in #50601, date_parser is actually a net negative (if anyone has a counter-example, please do bring it up) - so, I'd be totally onboard with adding date_format, and deprecating date_parser

Let's take the conversation over to #50601 then - closing again, sorry for the open/close ping pong here

jreback modified the milestones: 0.15.0, 0.14.0 Feb 28, 2014

jreback modified the milestones: 0.16.0, 0.17.0 Jan 26, 2015

jorisvandenbossche mentioned this issue Jan 31, 2015

DOC: Clarify how date_parser is called (GH9376) #9377

Merged

datapythonista modified the milestones: Contributions Welcome, Someday Jul 8, 2018

gfyoung mentioned this issue Nov 21, 2018

REF/TST: Add more pytest idiom to parsers tests #23832

Merged

gfyoung modified the milestones: Someday, Contributions Welcome Nov 21, 2018

jbrockmendel removed the IO Data IO issues that don't fit into a more specific label label Dec 1, 2019

mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022

MarcoGorelli closed this as completed Dec 17, 2022

MarcoGorelli reopened this Dec 17, 2022

MarcoGorelli changed the title ~~ENH: add date_format to read_csv / Date parsing mistake. read_csv~~ BUG: add date_format to read_csv / Date parsing mistake. read_csv Dec 17, 2022

MarcoGorelli added Bug and removed Enhancement labels Dec 17, 2022

This was referenced Dec 17, 2022

BUG: add date_format to read_csv / Date parsing mistake. read_csv #50320

Merged

BUG guess_Datetime_format doesn't guess 27.03.2003 14:55:00.000 #50319

Merged

mroeschke closed this as completed in #50320 Dec 27, 2022

jorisvandenbossche reopened this Jan 2, 2023

MarcoGorelli mentioned this issue Jan 2, 2023

API add date_format to read_csv #50528

Closed

MarcoGorelli closed this as completed Jan 6, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: add date_format to read_csv / Date parsing mistake. read_csv #2586

BUG: add date_format to read_csv / Date parsing mistake. read_csv #2586

John-Colvin commented Dec 22, 2012

qwhelan commented Dec 23, 2012

John-Colvin commented Dec 23, 2012

qwhelan commented Dec 24, 2012

wesm commented Apr 7, 2013

dragoljub commented Apr 8, 2013

jreback commented Feb 28, 2014

maxgrenderjones commented Mar 21, 2014

jreback commented Mar 21, 2014

jbrockmendel commented Oct 25, 2018

mroeschke commented Oct 25, 2018

jorisvandenbossche commented Oct 26, 2018

WillAyd commented Oct 26, 2018

alexbk66 commented Nov 21, 2018

jorisvandenbossche commented Nov 21, 2018

gfyoung commented Nov 21, 2018 •

edited

Loading

jorisvandenbossche commented Nov 21, 2018

gfyoung commented Nov 21, 2018

WillAyd commented Nov 21, 2018

gfyoung commented Nov 21, 2018 •

edited

Loading

MarcoGorelli commented Dec 17, 2022

MarcoGorelli commented Dec 17, 2022 •

edited

Loading

MarcoGorelli commented Dec 27, 2022 •

edited

Loading

jorisvandenbossche commented Jan 2, 2023 •

edited

Loading

MarcoGorelli commented Jan 2, 2023

MarcoGorelli commented Jan 6, 2023

BUG: add date_format to read_csv / Date parsing mistake. read_csv #2586

BUG: add date_format to read_csv / Date parsing mistake. read_csv #2586

Comments

John-Colvin commented Dec 22, 2012

qwhelan commented Dec 23, 2012

John-Colvin commented Dec 23, 2012

qwhelan commented Dec 24, 2012

wesm commented Apr 7, 2013

dragoljub commented Apr 8, 2013

jreback commented Feb 28, 2014

maxgrenderjones commented Mar 21, 2014

jreback commented Mar 21, 2014

jbrockmendel commented Oct 25, 2018

mroeschke commented Oct 25, 2018

jorisvandenbossche commented Oct 26, 2018

WillAyd commented Oct 26, 2018

alexbk66 commented Nov 21, 2018

jorisvandenbossche commented Nov 21, 2018

gfyoung commented Nov 21, 2018 • edited Loading

jorisvandenbossche commented Nov 21, 2018

gfyoung commented Nov 21, 2018

WillAyd commented Nov 21, 2018

gfyoung commented Nov 21, 2018 • edited Loading

MarcoGorelli commented Dec 17, 2022

MarcoGorelli commented Dec 17, 2022 • edited Loading

MarcoGorelli commented Dec 27, 2022 • edited Loading

jorisvandenbossche commented Jan 2, 2023 • edited Loading

MarcoGorelli commented Jan 2, 2023

MarcoGorelli commented Jan 6, 2023

gfyoung commented Nov 21, 2018 •

edited

Loading

gfyoung commented Nov 21, 2018 •

edited

Loading

MarcoGorelli commented Dec 17, 2022 •

edited

Loading

MarcoGorelli commented Dec 27, 2022 •

edited

Loading

jorisvandenbossche commented Jan 2, 2023 •

edited

Loading