PERF: add exact kw to to_datetime to enable faster regex format parsing #8904

Merged
merged 2 commits into from
Dec 5, 2014

Conversation

jreback
Contributor

@jreback jreback commented Nov 27, 2014

closes #8989
closes #8903

The default is exact=True for back-compat.
With exact=False, the match still starts at the beginning of the string (it has always been like this),
but the string is no longer required to match ONLY the format; IOW, there can be extra stuff after the match.

Avoids having to do a regex replace first.

In [21]: s = Series(['19MAY11','19MAY11:00:00:00']*100000)

In [22]: %timeit pd.to_datetime(s.str.replace(':\S+$',''),format='%d%b%y')
1 loops, best of 3: 828 ms per loop

In [23]: %timeit pd.to_datetime(s,format='%d%b%y',exact=False)
1 loops, best of 3: 603 ms per loop

@jreback jreback added the API Design, Strings, Datetime, and Performance labels Nov 27, 2014
@jreback jreback added this to the 0.15.2 milestone Nov 27, 2014
@jorisvandenbossche
Member

  • is the proposed keyword exact or complete?
  • the match can be anywhere? (not only at the start) but I suppose the most common use case is with / without a time part, or a subsecond part, or a timezone part (endings)

@jreback
Contributor Author

jreback commented Nov 27, 2014

yeh, changing back to exact. Also going to change it to do a search rather than a match if exact=False, so it can find the date anywhere in the string.

Actually, there are other use-cases for this:

In [4]: s = Series(['19MAY11','foobar19MAY11','19MAY11:00:00:00','19MAY11 00:00:00Z'])

In [5]: s
Out[5]: 
0              19MAY11
1        foobar19MAY11
2     19MAY11:00:00:00
3    19MAY11 00:00:00Z
dtype: object

In [6]: pd.to_datetime(s,format='%d%b%y',exact=False)
Out[6]: 
0   2011-05-19
1   2011-05-19
2   2011-05-19
3   2011-05-19
dtype: datetime64[ns]
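
The difference between the two modes can be sketched with the standard library alone. This is only an illustration of anchored matching versus searching, not the pandas implementation: the function name `parse_ddmonyy`, the hand-written pattern, and the fixed 20xx century are all assumptions made for the example.

```python
import re
from datetime import datetime

# Hypothetical hand-written pattern standing in for the format '%d%b%y'.
PATTERN = re.compile(r"(?P<d>\d{2})(?P<b>[A-Za-z]{3})(?P<y>\d{2})")
MONTHS = {
    name: number
    for number, name in enumerate(
        ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
         "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"],
        start=1,
    )
}


def parse_ddmonyy(text, exact=True):
    """Parse a date such as '19MAY11' (illustrative helper, not a pandas API)."""
    if exact:
        # Anchored at the start, and nothing may follow the match.
        found = PATTERN.match(text)
        if found is None or found.end() != len(text):
            raise ValueError("no exact match for %r" % text)
    else:
        # The date may appear anywhere; surrounding text is ignored.
        found = PATTERN.search(text)
        if found is None:
            raise ValueError("no match found in %r" % text)
    month = MONTHS.get(found.group("b").upper())
    if month is None:
        raise ValueError("unknown month in %r" % text)
    # Simplification for this sketch: every two-digit year is read as 20xx.
    year = 2000 + int(found.group("y"))
    return datetime(year, month, int(found.group("d")))


samples = ["19MAY11", "foobar19MAY11", "19MAY11:00:00:00", "19MAY11 00:00:00Z"]
parsed = [parse_ddmonyy(s, exact=False) for s in samples]
```

With `exact=False` all four sample strings give the same date, while `exact=True` accepts only the first one and raises `ValueError` for the rest.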

@jreback jreback force-pushed the timere branch 4 times, most recently from 8c7f505 to cc8a3de Compare November 30, 2014 17:24
@jreback
Contributor Author

jreback commented Dec 1, 2014

@jorisvandenbossche ?

@jreback
Contributor Author

jreback commented Dec 3, 2014

@jorisvandenbossche ?

@@ -195,6 +195,9 @@ def to_datetime(arg, errors='ignore', dayfirst=False, utc=None, box=True,
If True returns a DatetimeIndex, if False returns ndarray of values
format : string, default None
strftime to parse time, eg "%d/%m/%Y"
exact : boolean, True by default
if True, require an exact format match
Member

Can you add punctuation and capitals here? (when rendered html, sphinx does not keep the newline, so the visual formatting does not work there)

@jorisvandenbossche
Member

Does this also work for a case that currently fails, for example reading in nanoseconds when specifying a format:

In [11]: pd.to_datetime("2012-01-01 09:00:00.000000001", format="%Y-%m-%d %H:%M:%S.%f")
-> 
ValueError: unconverted data remains: 001

@@ -79,6 +79,8 @@ Performance
~~~~~~~~~~~
- Reduce memory usage when skiprows is an integer in read_csv (:issue:`8681`)

- Performance boost for ``to_datetime`` conversions with a passed ``format=`` kw, and the new ``exact=False`` for kw with non-exact matching (:issue:`8904`)
Member

Can you add the new exact keyword as a separate entry in the enhancements section?

@jorisvandenbossche
Member

for the keyword name: match is also a possibility, but not sure it is better

Out of curiosity, where does the speed-up come from? The extra cython type declarations?

@jreback
Contributor Author

jreback commented Dec 3, 2014

@jorisvandenbossche the speedup comes from not having to do an .str.extract(.....) first and then parse the result as usual. Currently the regex parser uses .match (and allows nothing trailing). With exact=False it can use .search instead.
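
A rough stdlib sketch of why one pass is enough: cleaning the strings first and then requiring a full match gives the same dates as a single search over the raw strings. `DATE_RE` is a hypothetical stand-in for the regex built from format='%d%b%y'; this shows the idea only and is not how pandas does it internally.

```python
import re

# Hypothetical stand-in for the regex compiled from format='%d%b%y'.
DATE_RE = re.compile(r"\d{2}[A-Za-z]{3}\d{2}")

values = ["19MAY11", "19MAY11:00:00:00"]


def two_pass(values):
    # Pass 1: strip the trailing ':...' part with a regex replace.
    cleaned = [re.sub(r":\S+$", "", v) for v in values]
    # Pass 2: require the cleaned string to match the format and nothing else.
    return [DATE_RE.fullmatch(v).group(0) for v in cleaned]


def one_pass(values):
    # Single pass: search finds the date and ignores whatever surrounds it.
    return [DATE_RE.search(v).group(0) for v in values]


two_pass_result = two_pass(values)
one_pass_result = one_pass(values)
```

Both functions return the same date substrings for these inputs; the second one touches each string only once.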

@jreback
Contributor Author

jreback commented Dec 3, 2014

ok with match rather than exact

@jreback
Contributor Author

jreback commented Dec 3, 2014

This works; is there an issue about this?

In [3]: pd.to_datetime("2012-01-01 09:00:00.000000001", format="%Y-%m-%d %H:%M:",exact=False)
Out[3]: Timestamp('2012-01-01 09:00:00')

@jorisvandenbossche
Member

No, no issue I think. Using this keyword solves the error, but of course not the problem that the nanoseconds are not read in. I don't know if we can do something about that, as this is defined by the Python format strings, where %f means microseconds.
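
The microsecond limit of %f can be reproduced with plain `datetime.strptime`, which accepts at most six fractional digits for that directive. A minimal sketch using only the standard library (the variable names are for illustration):

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"

# Six fractional digits (microseconds) parse fine.
micro = datetime.strptime("2012-01-01 09:00:00.000001", FMT)

# Nine fractional digits (nanoseconds) leave three digits unconsumed.
message = None
try:
    datetime.strptime("2012-01-01 09:00:00.000000001", FMT)
except ValueError as err:
    message = str(err)
```

Here `message` holds the same "unconverted data remains" error as in the traceback quoted earlier in the thread.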

@jreback
Contributor Author

jreback commented Dec 3, 2014

You don't actually need to use a formatting string. But we could easily change the 'f' to allow for nanoseconds I think. let me see.

In [4]: pd.to_datetime("2012-01-01 09:00:00.000000001")
Out[4]: Timestamp('2012-01-01 09:00:00.000000001')

"""
timeseries_with_format = Benchmark("to_datetime(s,format='%d%b%y',exact=False)", \
setup, start_date=datetime(2014, 11, 26))
timeseries_with_format_replace = Benchmark("to_datetime(s.str.replace(':\S+$',''),format='%d%b%y')", \
Member

maybe timeseries_with_format_exact is a better name? (the replace does not really matter for what it is about)

@jorisvandenbossche
Member

@jreback yes, I know, the format string is not needed in this example because I used a default format, but in general that does not always hold

@jorisvandenbossche
Member

opened an issue for the nanoseconds #8989

@jorisvandenbossche
Member

And I copied only part of the line. I meant pd.to_datetime("2012-01-01 09:00:00.000000001", format="%Y-%m-%d %H:%M:%S.%f") above

@jreback jreback force-pushed the timere branch 5 times, most recently from 2e165e2 to 5d972d0 Compare December 4, 2014 11:19
@jreback
Contributor Author

jreback commented Dec 5, 2014

@jorisvandenbossche ok on the kw?

@jorisvandenbossche
Member

you mean exact vs match ? No strong opinion, exact seems a bit more self-explanatory, match is a bit more in line with the regex naming.

@jreback
Contributor Author

jreback commented Dec 5, 2014

ok, merging then.

jreback added a commit that referenced this pull request Dec 5, 2014
PERF: add exact kw to to_datetime to enable faster regex format parsing
@jreback jreback merged commit 6f7f5f8 into pandas-dev:master Dec 5, 2014