Skip to content

PERF: parse non-ambiguous date formats in c #12667

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
apapanico opened this issue Mar 18, 2016 · 10 comments
Closed

PERF: parse non-ambiguous date formats in c #12667

apapanico opened this issue Mar 18, 2016 · 10 comments
Labels
Datetime Datetime data dtype Performance Memory or execution speed performance

Comments

@apapanico
Copy link

Date parsing is slow on dates of the form "%m.%d.%Y" compared to "%Y-%m-%d". I am using 0.17.1 so maybe it's improved in 0.18 but I didn't see anything about it. I provide a benchmark below to compare the two datetime formats.

Some observations:

  • Letting pandas figure it out is unacceptably slow. In small files it won't make much of a difference but it became a massive bottleneck in an application.
  • In my particular use case, I know each timestamp in my file will have the same date so it is very easy to do a string replace. And this turns out to be by far the fastest option here. Even with the actual string replacement time.
  • Regex is slow. I'm sure that would have been obvious but I tried it out for comparison.
  • infer_datetime_format is the next best option but it appears to be a massive slowdown. Compare the relative times on the large file to the relative times of the pure read on the small file and date parsing with infer_datetime_format.
  • A custom date parser doesn't help. Not sure why it doesn't help. Maybe it's not compiled?

Code Sample

test.csv (1 million lines)

2016-03-12, 1
2016-03-12, 1
2016-03-12, 1
2016-03-12, 1
...

test2.csv (5000 lines)

03.12.2016, 1
03.12.2016, 1
03.12.2016, 1
03.12.2016, 1
...
import pandas as pd
from cStringIO import StringIO
import datetime
import re
print "Large file: pure read"
%timeit pd.read_csv('test.csv')

print "Large file: parse dates"
%timeit pd.read_csv('test.csv', parse_dates=[0])

print "Small file: pure read"
%timeit pd.read_csv('test2.csv')

print "Small file: parse dates"
%timeit pd.read_csv('test2.csv', parse_dates=[0])

print "Small file: parse dates with infer datetime format"
%timeit pd.read_csv('test2.csv', parse_dates=[0], infer_datetime_format=True)

print "Small file: string replace then parse dates "
def test_replace():
    s = file('test2.csv', 'r').read()
    s = s.replace('03.12.2016', '2016-03-12')
    pd.read_csv(StringIO(s), parse_dates=[0])
%timeit test_replace()

print "Small file: parse dates with custom date parser"
date_parser=lambda t: datetime.datetime.strptime(t, "%m.%d.%Y")
%timeit pd.read_csv('test2.csv', parse_dates=[0], date_parser=date_parser)

print "Small file: regex sub then parse dates"
def test_re_sub():
    s = file('test2.csv', 'r').read()
    p = re.compile('(?P<month> \d{2}).(?P<day> \d{2}).(?P<year> \d{4})')
    s = p.sub('\g<year>-\g<month>-\g<day>', s)
    pd.read_csv(StringIO(s), parse_dates=[0])
%timeit test_re_sub()

Output (2012 MacBook Air)

Large file: pure read
1 loop, best of 3: 253 ms per loop
Large file: parse dates
1 loop, best of 3: 790 ms per loop

Small file: pure read
100 loops, best of 3: 2.9 ms per loop
Small file: parse dates
1 loop, best of 3: 1.22 s per loop
Small file: parse dates with infer datetime format
10 loops, best of 3: 32.7 ms per loop
Small file: string replace then parse dates 
100 loops, best of 3: 4.97 ms per loop
Small file: parse dates with custom date parser
10 loops, best of 3: 90.6 ms per loop
Small file: regex sub then parse dates
1 loop, best of 3: 1.08 s per loop

output of pd.show_versions()

>> pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Darwin
OS-release: 14.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.17.1
nose: 1.3.7
pip: 8.1.0
setuptools: 20.2.2
Cython: 0.23.4
numpy: 1.10.4
scipy: 0.17.0
statsmodels: None
IPython: 4.1.2
sphinx: 1.3.6
patsy: None
dateutil: 2.5.1
pytz: 2016.1
blosc: None
bottleneck: None
tables: None
numexpr: 2.4.6
matplotlib: 1.5.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.12
pymysql: None
psycopg2: 2.6.1 (dt dec pq3 ext lo64)
Jinja2: None
@jreback
Copy link
Contributor

jreback commented Mar 18, 2016

See the 0.18.0 perf improvements

In [16]: pd.__version__
Out[16]: u'0.18.0'

In [7]: s = pd.concat([Series(pd.date_range('20150101',periods=1000))]*1000)

In [8]: len(s)
Out[8]: 1000000

In [9]: s1 = s.astype(str)

In [10]: s1 = s.dt.strftime('%Y-%m-%d')

In [11]: s2 = s.dt.strftime('%Y.%m.%d')

In [12]: s1.head()
Out[12]: 
0    2015-01-01
1    2015-01-02
2    2015-01-03
3    2015-01-04
4    2015-01-05
dtype: object

In [13]: s2.head()
Out[13]: 
0    2015.01.01
1    2015.01.02
2    2015.01.03
3    2015.01.04
4    2015.01.05
dtype: object

In [14]: %timeit pd.to_datetime(s1)
10 loops, best of 3: 170 ms per loop

In [15]: %timeit pd.to_datetime(s2)
10 loops, best of 3: 175 ms per loop

@jreback jreback closed this as completed Mar 18, 2016
@jreback
Copy link
Contributor

jreback commented Mar 18, 2016

note if this were implemented: #11665

this would easily be 10x improvement here as there are lots of repeats.

pull-requests on that are welcome!

@jreback jreback added Datetime Datetime data dtype Performance Memory or execution speed performance labels Mar 18, 2016
@jreback
Copy link
Contributor

jreback commented Mar 18, 2016

I realize you actualy want %m.%d.%Y. This is not an ISO8601 format. So not really sure what/if anything could be done here. (The reason its not is its ambiguous to dayfirst)

@jreback jreback reopened this Mar 18, 2016
@jreback
Copy link
Contributor

jreback commented Mar 18, 2016

In [38]: s3 = s.dt.strftime('%m.%d.%Y')
s3.head()

In [39]: s3.head()
Out[39]: 
0    01.01.2015
1    01.02.2015
2    01.03.2015
3    01.04.2015
4    01.05.2015
dtype: object

# this may work for you
In [36]: %timeit pd.to_datetime(s3.str[6:10] + s3.str[0:2] + s3.str[3:5])
1 loop, best of 3: 1.24 s per loop

In [37]: %timeit pd.to_datetime(s3,format='%m.%d.%Y')
1 loop, best of 3: 3.35 s per loop

@jreback
Copy link
Contributor

jreback commented Mar 18, 2016

btw, even though parse_dates does dispatch to pd.to_datetime, its meant for the simplest cases. Any heavy lifing you should do after with pd.to_datetime, which provides a bunch more options.

@jreback
Copy link
Contributor

jreback commented Mar 18, 2016

so there was a bunch of discussion on #12585 on how this could be fixed. The primary way would be to allow additional c-code to parse some other non-ambiguous formats.

@apapanico
Copy link
Author

Jeff, thanks for taking the time to look at this.

Honestly, if this is a non ISO format then you guys should probably not
bother supporting it.

It just so happens that our data storage ends up dumping the timestamps in
this format. Luckily for me, the simple string replace method works
incredibly well when I read in the csv string.

I brought this up because I just thought there might be something out there
that can create a kind of custom pre-compiled format that you can define
(like I tried to do in my date parser example) and I think this is what
you're getting at in your last comment.

If someone takes this on and looks to make a more generalizeable date
parser, then great. But like I said, if it's a non standard format, it
should probably be ignored for the sake of everyone's sanity.

Alex
On Mar 18, 2016 12:51 PM, "Jeff Reback" [email protected] wrote:

so there was a bunch of discussion on #12585
#12585 on how this could be
fixed. The primary way would be to allow additional c-code to parse some
other non-ambiguous formats.


You are receiving this because you authored the thread.
Reply to this email directly or view it on GitHub
#12667 (comment)

@apapanico
Copy link
Author

I should clarify, I think the support in pandas provided for messy/non
standard formats is great. I love it. I meant with respect performance
for those formats.
On Mar 18, 2016 12:59 PM, "Alex Papanicolaou" [email protected]
wrote:

Jeff, thanks for taking the time to look at this.

Honestly, if this is a non ISO format then you guys should probably not
bother supporting it.

It just so happens that our data storage ends up dumping the timestamps in
this format. Luckily for me, the simple string replace method works
incredibly well when I read in the csv string.

I brought this up because I just thought there might be something out
there that can create a kind of custom pre-compiled format that you can
define (like I tried to do in my date parser example) and I think this is
what you're getting at in your last comment.

If someone takes this on and looks to make a more generalizeable date
parser, then great. But like I said, if it's a non standard format, it
should probably be ignored for the sake of everyone's sanity.

Alex
On Mar 18, 2016 12:51 PM, "Jeff Reback" [email protected] wrote:

so there was a bunch of discussion on #12585
#12585 on how this could be
fixed. The primary way would be to allow additional c-code to parse some
other non-ambiguous formats.


You are receiving this because you authored the thread.
Reply to this email directly or view it on GitHub
#12667 (comment)

@jreback jreback added this to the Next Major Release milestone Mar 19, 2016
@jreback jreback changed the title Slow performance of date parsing in read_csv PERF: parse non-ambiguous date formats in c Mar 19, 2016
@jbrockmendel
Copy link
Member

FWIW format caching is pretty high on the python-dateutil todo list

@MarcoGorelli
Copy link
Member

caching is now supported, and this format isn't ISO8601, so I think we can close this

thanks for the suggestion!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Datetime Datetime data dtype Performance Memory or execution speed performance
Projects
None yet
Development

No branches or pull requests

5 participants