Skip to content

PERF: Make DateOffset (partially) immutable, cache _params for __eq__ speedup #17137

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

jbrockmendel
Copy link
Member

Any time two pd.Period objects are compared, it starts off by checking self.freq != other.freq. These freq attributes are DateOffset objects defined in tseries.offsets. DateOffset.__ne__ calls self._params(), which is the bottleneck in a lot of Period-based work.

import pandas as pd
import cProfile, pstats
nobs = 2000
dr = pd.date_range('1901-01-01', freq='M', periods=nobs)
ser = dr.to_series().apply(pd.Period, freq='M')
pr = cProfile.Profile()
pr.enable()
fr = ser.to_frame().set_index(0, append=True)
pr.disable()
p = pstats.Stats(pr)
p.sort_stats('cumulative')
p.print_stats(20)

Results under the status quo:

         340859 function calls (340834 primitive calls) in 1.564 seconds

   Ordered by: cumulative time
   List reduced from 279 to 28 due to restriction <0.1>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    1.562    1.562 /usr/local/lib/python2.7/site-packages/pandas/core/frame.py:2761(set_index)
        1    0.000    0.000    1.555    1.555 /usr/local/lib/python2.7/site-packages/pandas/core/indexes/multi.py:1059(from_arrays)
        1    0.000    0.000    1.555    1.555 /usr/local/lib/python2.7/site-packages/pandas/core/categorical.py:2171(_factorize_from_iterables)
        2    0.000    0.000    1.555    0.777 /usr/local/lib/python2.7/site-packages/pandas/core/categorical.py:2134(_factorize_from_iterable)
        2    0.000    0.000    1.555    0.777 /usr/local/lib/python2.7/site-packages/pandas/core/categorical.py:251(__init__)
        2    0.001    0.001    1.543    0.772 /usr/local/lib/python2.7/site-packages/pandas/core/algorithms.py:527(factorize)
    20013    0.037    0.000    1.453    0.000 /usr/local/lib/python2.7/site-packages/pandas/tseries/offsets.py:374(__ne__)
    20013    0.143    0.000    1.417    0.000 /usr/local/lib/python2.7/site-packages/pandas/tseries/offsets.py:360(__eq__)
        2    0.000    0.000    1.381    0.691 /usr/local/lib/python2.7/site-packages/pandas/core/algorithms.py:429(safe_sort)
        2    0.053    0.026    1.381    0.690 {method 'argsort' of 'numpy.ndarray' objects}
    40026    1.029    0.000    1.240    0.000 /usr/local/lib/python2.7/site-packages/pandas/tseries/offsets.py:311(_params)
        1    0.016    0.016    0.157    0.157 {method 'get_labels' of 'pandas._libs.hashtable.PyObjectHashTable' objects}
    40026    0.102    0.000    0.102    0.000 {sorted}
   120079    0.074    0.000    0.074    0.000 {method 'items' of 'dict' objects}
    40026    0.036    0.000    0.036    0.000 {vars}
    40431    0.034    0.000    0.034    0.000 {isinstance}
     9578    0.017    0.000    0.022    0.000 /usr/local/lib/python2.7/site-packages/pandas/tseries/offsets.py:495(freqstr)
        2    0.000    0.000    0.011    0.005 /usr/local/lib/python2.7/site-packages/pandas/core/categorical.py:543(_validate_categories)
        1    0.000    0.000    0.010    0.010 /usr/local/lib/python2.7/site-packages/pandas/core/indexes/base.py:152(__new__)
        1    0.000    0.000    0.009    0.009 /usr/local/lib/python2.7/site-packages/pandas/core/indexes/period.py:202(__new__)

Results under this PR:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.141    0.141 pandas/core/frame.py:2862(set_index)
        1    0.000    0.000    0.139    0.139 pandas/core/indexes/multi.py:1054(from_arrays)
        1    0.000    0.000    0.139    0.139 pandas/core/categorical.py:2146(_factorize_from_iterables)
        2    0.000    0.000    0.139    0.069 pandas/core/categorical.py:2109(_factorize_from_iterable)
        2    0.000    0.000    0.139    0.069 pandas/core/categorical.py:251(__init__)
        2    0.000    0.000    0.132    0.066 pandas/core/algorithms.py:441(factorize)
    20013    0.012    0.000    0.098    0.000 pandas/tseries/offsets.py:383(__ne__)
    20013    0.036    0.000    0.086    0.000 pandas/tseries/offsets.py:369(__eq__)
        1    0.008    0.008    0.068    0.068 {method 'get_labels' of 'pandas._libs.hashtable.PyObjectHashTable' objects}
        2    0.000    0.000    0.063    0.031 pandas/core/sorting.py:383(safe_sort)
        2    0.016    0.008    0.063    0.031 {method 'argsort' of 'numpy.ndarray' objects}
     2000    0.030    0.000    0.035    0.000 pandas/tseries/offsets.py:319(_params)
     9578    0.009    0.000    0.012    0.000 pandas/tseries/offsets.py:504(freqstr)
    40456    0.011    0.000    0.011    0.000 {isinstance}
        2    0.000    0.000    0.006    0.003 pandas/core/categorical.py:544(_validate_categories)
        1    0.000    0.000    0.006    0.006 pandas/core/indexes/base.py:154(__new__)
        1    0.000    0.000    0.006    0.006 pandas/core/indexes/period.py:194(__new__)
        1    0.002    0.002    0.005    0.005 {pandas._libs.period.extract_ordinals}
     2000    0.004    0.000    0.004    0.000 pandas/tseries/offsets.py:193(__setattr__)
     9578    0.003    0.000    0.003    0.000 pandas/tseries/offsets.py:500(rule_code)

Note that in this example the periods were already ordered, so the number of comparisons (20013) is pretty much best-case. After shuffling them (sample size=1) there were 35181 comparisons.

There are some things I'd like to touch up to move towards actual-immutability, but the caching here is a big win.

@codecov
Copy link

codecov bot commented Aug 1, 2017

Codecov Report

Merging #17137 into master will decrease coverage by 0.07%.
The diff coverage is 80%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #17137      +/-   ##
==========================================
- Coverage   91.03%   90.95%   -0.08%     
==========================================
  Files         161      161              
  Lines       49405    49411       +6     
==========================================
- Hits        44974    44943      -31     
- Misses       4431     4468      +37
Flag Coverage Δ
#multiple 88.74% <80%> (-0.05%) ⬇️
#single 40.2% <70%> (-0.13%) ⬇️
Impacted Files Coverage Δ
pandas/tseries/offsets.py 97.07% <80%> (-0.06%) ⬇️
pandas/io/feather_format.py 20% <0%> (-65.72%) ⬇️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/frame.py 97.56% <0%> (-0.2%) ⬇️
pandas/core/series.py 94.85% <0%> (-0.1%) ⬇️
pandas/core/generic.py 91.96% <0%> (-0.06%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f2b0bdc...2e3e639. Read the comment docs.

@gfyoung gfyoung added Performance Memory or execution speed performance Enhancement labels Aug 1, 2017

# default for prior pickles
normalize = False

def __setattr__(self, key, value):
if key in ['n', '_offset'] and hasattr(self, key):
raise TypeError('%s is intended to be immutable; "%s" '
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a test to hit this TypeError.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Eventually. First there are some lingering corner cases in which the current implementation can be incorrect.

@jreback
Copy link
Contributor

jreback commented Aug 1, 2017

see if any asv's are needed for this (I don't think so actually), and show a representative speedup (not the profile), just a before / after timeit.

if it looks good, pls add a whatsnew in perf.

@jreback jreback changed the title Make DateOffset (partially) immutable, cache _params for __eq__ speedup PERF: Make DateOffset (partially) immutable, cache _params for __eq__ speedup Aug 1, 2017
@jbrockmendel
Copy link
Member Author

Will do. There are some cases I want to take a closer look at before calling this ready; parking some timeit results here in the meantime:

import random
import timeit

import pandas as pd

nobs = 1000

months = pd.date_range('1901-01-01', freq='M', periods=nobs)
days = pd.date_range('1923-04-05', freq='D', periods=nobs)

dser = days.to_series().apply(pd.Period, freq='D')
mser = months.to_series().apply(pd.Period, freq='M')

dlist = list(dser)
mlist = list(mser)

random.shuffle(dlist)
random.shuffle(mlist)

mser2 = pd.Series(mlist)
dser2 = pd.Series(dlist)

timeit.timeit(mser.sort_values, number=100)
timeit.timeit(dser.sort_values, number=100)
timeit.timeit(mser2.sort_values, number=100)
timeit.timeit(dser2.sort_values, number=100)

Old Results

>>> timeit.timeit(mser.sort_values, number=100)
29.13339614868164
>>> timeit.timeit(dser.sort_values, number=100)
46.98428416252136
>>> timeit.timeit(mser2.sort_values, number=100)
107.7828049659729
>>> timeit.timeit(dser2.sort_values, number=100)
113.00968098640442

New Results

>>> timeit.timeit(mser.sort_values, number=100)
2.2172720432281494
>>> timeit.timeit(dser.sort_values, number=100)
1.7678871154785156
>>> timeit.timeit(mser2.sort_values, number=100)
3.523206949234009
>>> timeit.timeit(dser2.sort_values, number=100)
2.4215428829193115

@jbrockmendel
Copy link
Member Author

There may be a problem. A lot of cross-design appears to have built up over the years including at least two approaches to caching.

It looks like the original author had caching in mind, but it got dropped here. There is a _cacheable attribute that is False everywhere except one unused mixin class, and a _should_cache that is referenced over in core.indexes.datetimes but is apparently always False.

AFAICT the idea in tseries.frequencies.get_offset is that the user calls e.g. get_offset(B), gets a copy of the cached <BusinessDay> offset, and then can modify if desired. My approach of making DateOffset immutable would break that.

The motivation is that __eq__ is currently very slow. It is slow because it re-evaluates self._params() on every call. I'm advocating caching self._params(). Doing this requires ensuring that cache remains valid, so freezing all the relevant attributes. But those attributes are blacklist-based instead of whitelist-based:

def _params(self):
        all_paras = dict(list(vars(self).items()) + list(self.kwds.items()))
        if 'holidays' in all_paras and not all_paras['holidays']:
            all_paras.pop('holidays')
        exclude = ['kwds', 'name', 'normalize', 'calendar']
        attrs = [(k, v) for k, v in all_paras.items()
                 if (k not in exclude) and (k[0] != '_')]
        attrs = sorted(set(attrs))
        params = tuple([str(self.__class__)] + attrs)
        return params

(There is a TODO note further down suggesting this should be changed:

    # TODO: Combine this with DateOffset by defining a whitelisted set of
    # attributes on each object rather than the existing behavior of iterating
    # over internal ``__dict__``

)

Making _params cacheable for the general case would require basically deprecating the self.kwds attribute, which is I'm reticent to do as users may be using it in creative ways. My inclination is to have _params defer to a _cached_params iff len(self.kwds) == 0. Then __setattr__ can refuse to change a handful of attributes, but it doesn't need to be an exhaustive list.

I'm still not sure if this approach would play nicely with the frequencies.get_offset cache.

@jreback
Copy link
Contributor

jreback commented Aug 8, 2017

closing in favor of #17181

@jreback jreback closed this Aug 8, 2017
@jbrockmendel jbrockmendel deleted the date_offset_immutable branch October 30, 2017 16:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants