Skip to content

#18058: improve DatetimeIndex.date performance #18163

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 9 commits into from
Nov 22, 2017
Merged

#18058: improve DatetimeIndex.date performance #18163

merged 9 commits into from
Nov 22, 2017

Conversation

jamestran201
Copy link

@jamestran201 jamestran201 commented Nov 8, 2017

@jamestran201
Copy link
Author

I ran asv with the tests from indexing.py, the output was "Benchmarks not significantly changed".

A quick test with %timeit:

In [1]: import pandas as pd

In [2]: rng = pd.date_range('2000-04-03', periods=200000, freq='2H')

In [3]: %timeit rng.date
555 ms ± 14.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [4]: %timeit rng.date_new
90.4 ms ± 5.79 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [5]: %timeit rng.normalize().to_pydatetime()
121 ms ± 3.08 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [6]: rng.date
Out[6]:
array([datetime.date(2000, 4, 3), datetime.date(2000, 4, 3),
       datetime.date(2000, 4, 3), ..., datetime.date(2045, 11, 19),
       datetime.date(2045, 11, 19), datetime.date(2045, 11, 19)], dtype=object)

In [7]: rng.date_new
Out[7]:
array([datetime.date(2000, 4, 3), datetime.date(2000, 4, 3),
       datetime.date(2000, 4, 3), ..., datetime.date(2045, 11, 19),
       datetime.date(2045, 11, 19), datetime.date(2045, 11, 19)], dtype=object)

In [8]: rng.normalize().to_pydatetime()
Out[8]:
array([datetime.datetime(2000, 4, 3, 0, 0),
       datetime.datetime(2000, 4, 3, 0, 0),
       datetime.datetime(2000, 4, 3, 0, 0), ...,
       datetime.datetime(2045, 11, 19, 0, 0),
       datetime.datetime(2045, 11, 19, 0, 0),
       datetime.datetime(2045, 11, 19, 0, 0)], dtype=object)

Please let me know if there are some other asv tests that need to be run.

@codecov
Copy link

codecov bot commented Nov 8, 2017

Codecov Report

Merging #18163 into master will decrease coverage by 0.02%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #18163      +/-   ##
==========================================
- Coverage   91.35%   91.33%   -0.03%     
==========================================
  Files         163      163              
  Lines       49714    49714              
==========================================
- Hits        45415    45405      -10     
- Misses       4299     4309      +10
Flag Coverage Δ
#multiple 89.13% <100%> (-0.01%) ⬇️
#single 39.63% <0%> (-0.07%) ⬇️
Impacted Files Coverage Δ
pandas/core/dtypes/concat.py 99.13% <ø> (ø) ⬆️
pandas/core/indexes/datetimes.py 95.39% <100%> (-0.1%) ⬇️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/frame.py 97.8% <0%> (-0.1%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 103ea6f...3b7fa77. Read the comment docs.


if tz is not None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use the box param and have it take strings instead of boolean

iow box can be date/datetime/timestamp

you can assert that it is one of those

you will have to change places where this function. is called (eg box=True -> timestamp) etc

@jreback jreback added Performance Memory or execution speed performance Datetime Datetime data dtype labels Nov 8, 2017

def ints_to_pydatetime(ndarray[int64_t] arr, tz=None, freq=None,
box="datetime"):
# convert an i8 repr to an ndarray of datetimes (if box == "datetime"),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a Parameters section here

func_create = create_datetime_from_ts

if tz is not None:
if tz is not None and box != "date":
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of this, put an assert in the box == 'date' section that tz is None

Copy link
Author

@jamestran201 jamestran201 Nov 11, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if box == "date":
        assert (tz is not None), "tz should be None when converting to date"

        func_create = create_date_from_ts

@jreback Can I do something like this instead? That way the error message will be more specific when tz is used with date.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes absolutely. The point is we should not be passing a tz with the date routine.

x = x.reshape(shape)

elif x.dtype == _TD_DTYPE:
shape = x.shape
x = tslib.ints_to_pytimedelta(x.view(np.int64).ravel(), box=True)
x = tslib.ints_to_pytimedelta(x.view(np.int64).ravel(), box="True")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a different routine

func_create = create_datetime_from_ts

if tz is not None:
if tz is not None and box != "date":
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes absolutely. The point is we should not be passing a tz with the date routine.

@jreback
Copy link
Contributor

jreback commented Nov 12, 2017

can u add an asv for this?

@jreback
Copy link
Contributor

jreback commented Nov 12, 2017

add a whatsnew note for 0.22, perf section

@jamestran201
Copy link
Author

Is the change in the timeseries.py file what you meant? Do you know how I can resolve the conflict?

@@ -107,6 +107,8 @@ def time_infer_freq_daily(self):
def time_infer_freq_business(self):
infer_freq(self.b_freq)

def time_to_date(self):
self.rng.date
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

might as well as ones for self.rng.time() and self.rng.to_pydatetime() while at it here.

note that this should be self.rng.date()

Copy link
Author

@jamestran201 jamestran201 Nov 12, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DatetimeIndex.date and DatetimeIndex.time are tagged with the @property tag at the moment. Do you want me to remove that tag so that they can be used as functions instead?

https://github.com/tmnhat2001/pandas/blob/e1b6d6230f54f3eee90225b25f8907989e3828df/pandas/core/indexes/datetimes.py#L1631

https://github.com/tmnhat2001/pandas/blob/e1b6d6230f54f3eee90225b25f8907989e3828df/pandas/core/indexes/datetimes.py#L1640

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry that was a typo;

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, it's ok

@@ -69,8 +69,9 @@ Removal of prior version deprecations/changes
Performance Improvements
~~~~~~~~~~~~~~~~~~~~~~~~

- Indexers on ``Series`` or ``DataFrame`` no longer create a reference cycle (:issue:`17956`)
- Indexers on Series or DataFrame no longer create a reference cycle (:issue:`17956`)
- Added a keyword argument, ``cache``, to :func:`to_datetime` that improved the performance of converting duplicate datetime arguments (:issue:`11665`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should only include your change.

@@ -1633,8 +1643,7 @@ def date(self):
Returns numpy array of python datetime.date objects (namely, the date
part of Timestamps without timezone information).
"""
return self._maybe_mask_results(libalgos.arrmap_object(
self.asobject.values, lambda x: x.date()))
return self.normalize().to_pydate()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just call the code for what you have in to_pydate() above, and remove that method; we don't actually want to expose this to the user.

- Added a keyword argument, ``cache``, to :func:`to_datetime` that improved the performance of converting duplicate datetime arguments (:issue:`11665`)
- DatetimeIndex.date skips the iterations that create many python objects (:issue:`18058`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Improved performance of :func:`Series.dt.date` and :func:`DatetimeIndex.date`

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

small comments. needs a rebase. can you show the updated asv results. (for those 3 that we added), you can do --bench timeseries

Parameters
----------
arr : array of i8 repr
tz : the timezone to convert to,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so doc-strings should be slightly different

arr : array of i8
tz : str, default None
    convert to this timezone
freq : str/Offset, default None
    freq to convert
box : {'datetime', 'timestamp', 'date'}, default 'datetime'
    # add what you have below here

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The asv results are:
asv continuous -f 1.1 upstream/master issue18058 -b timeseries.DatetimeIndex.time_dti_time -b timeseries.DatetimeIndex.time_to_date -b timeseries.DatetimeIndex.time_to_pydatetime

conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
·· Installing into conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
· Running 6 total benchmarks (2 commits * 1 environments * 3 benchmarks)
[  0.00%] · For pandas commit hash 04c90d90:
[  0.00%] ·· Building for conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[  0.00%] ·· Benchmarking conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 16.67%] ··· Running timeseries.DatetimeIndex.time_dti_time                                312ms
[ 33.33%] ··· Running timeseries.DatetimeIndex.time_to_date                                 40.6ms
[ 50.00%] ··· Running timeseries.DatetimeIndex.time_to_pydatetime                           46.9ms
[ 50.00%] · For pandas commit hash c176a3c2:
[ 50.00%] ·· Building for conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 50.00%] ·· Benchmarking conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 66.67%] ··· Running timeseries.DatetimeIndex.time_dti_time                               375ms
[ 83.33%] ··· Running timeseries.DatetimeIndex.time_to_date                                258±4ms
[100.00%] ··· Running timeseries.DatetimeIndex.time_to_pydatetime                          48.8±1ms       
      before           after          ratio
     [c176a3c2]       [04c90d90]
-           375ms            312ms     0.83  timeseries.DatetimeIndex.time_dti_time
-         258±4ms           40.6ms     0.16  timeseries.DatetimeIndex.time_to_date

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great!

we should fix time too :) (another PR)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, yeah, i was about to ask

@jreback
Copy link
Contributor

jreback commented Nov 13, 2017

can u rebase again
sorry lots of merges happening

@jamestran201
Copy link
Author

yeah, i'm on it

@jamestran201
Copy link
Author

just finished it, do i need to push again?

@jreback
Copy link
Contributor

jreback commented Nov 13, 2017

yes need to push again.

@pep8speaks
Copy link

pep8speaks commented Nov 14, 2017

Hello @tmnhat2001! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on November 22, 2017 at 04:46 Hours UTC

@jreback
Copy link
Contributor

jreback commented Nov 22, 2017

can you rebase

@jreback jreback added this to the 0.22.0 milestone Nov 22, 2017
@jreback jreback merged commit 9776868 into pandas-dev:master Nov 22, 2017
@jreback
Copy link
Contributor

jreback commented Nov 22, 2017

thanks @tmnhat2001

@jreback
Copy link
Contributor

jreback commented Nov 22, 2017

FYI @jbrockmendel IIRC you may have a PR to move some of this code to tslibs.conversion

@jamestran201
Copy link
Author

thanks @jreback
I'll open a pr for dti.time as well

@jbrockmendel
Copy link
Member

IIRC you may have a PR to move some of this code to tslibs.conversion

@jreback #18389 is close, but shouldn't overlap with this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Datetime Datetime data dtype Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

PERF: making DatetimeIndex.date more performant
5 participants