PERF: Make DateOffset (partially) immutable, cache _params for eq speedup #17137

jbrockmendel · 2017-08-01T04:38:26Z

Any time two pd.Period objects are compared, it starts off by checking self.freq != other.freq. These freq attributes are DateOffset objects defined in tseries.offsets. DateOffset.__ne__ calls self._params(), which is the bottleneck in a lot of Period-based work.

import pandas as pd
import cProfile, pstats
nobs = 2000
dr = pd.date_range('1901-01-01', freq='M', periods=nobs)
ser = dr.to_series().apply(pd.Period, freq='M')
pr = cProfile.Profile()
pr.enable()
fr = ser.to_frame().set_index(0, append=True)
pr.disable()
p = pstats.Stats(pr)
p.sort_stats('cumulative')
p.print_stats(20)

Results under the status quo:

         340859 function calls (340834 primitive calls) in 1.564 seconds

   Ordered by: cumulative time
   List reduced from 279 to 28 due to restriction <0.1>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    1.562    1.562 /usr/local/lib/python2.7/site-packages/pandas/core/frame.py:2761(set_index)
        1    0.000    0.000    1.555    1.555 /usr/local/lib/python2.7/site-packages/pandas/core/indexes/multi.py:1059(from_arrays)
        1    0.000    0.000    1.555    1.555 /usr/local/lib/python2.7/site-packages/pandas/core/categorical.py:2171(_factorize_from_iterables)
        2    0.000    0.000    1.555    0.777 /usr/local/lib/python2.7/site-packages/pandas/core/categorical.py:2134(_factorize_from_iterable)
        2    0.000    0.000    1.555    0.777 /usr/local/lib/python2.7/site-packages/pandas/core/categorical.py:251(__init__)
        2    0.001    0.001    1.543    0.772 /usr/local/lib/python2.7/site-packages/pandas/core/algorithms.py:527(factorize)
    20013    0.037    0.000    1.453    0.000 /usr/local/lib/python2.7/site-packages/pandas/tseries/offsets.py:374(__ne__)
    20013    0.143    0.000    1.417    0.000 /usr/local/lib/python2.7/site-packages/pandas/tseries/offsets.py:360(__eq__)
        2    0.000    0.000    1.381    0.691 /usr/local/lib/python2.7/site-packages/pandas/core/algorithms.py:429(safe_sort)
        2    0.053    0.026    1.381    0.690 {method 'argsort' of 'numpy.ndarray' objects}
    40026    1.029    0.000    1.240    0.000 /usr/local/lib/python2.7/site-packages/pandas/tseries/offsets.py:311(_params)
        1    0.016    0.016    0.157    0.157 {method 'get_labels' of 'pandas._libs.hashtable.PyObjectHashTable' objects}
    40026    0.102    0.000    0.102    0.000 {sorted}
   120079    0.074    0.000    0.074    0.000 {method 'items' of 'dict' objects}
    40026    0.036    0.000    0.036    0.000 {vars}
    40431    0.034    0.000    0.034    0.000 {isinstance}
     9578    0.017    0.000    0.022    0.000 /usr/local/lib/python2.7/site-packages/pandas/tseries/offsets.py:495(freqstr)
        2    0.000    0.000    0.011    0.005 /usr/local/lib/python2.7/site-packages/pandas/core/categorical.py:543(_validate_categories)
        1    0.000    0.000    0.010    0.010 /usr/local/lib/python2.7/site-packages/pandas/core/indexes/base.py:152(__new__)
        1    0.000    0.000    0.009    0.009 /usr/local/lib/python2.7/site-packages/pandas/core/indexes/period.py:202(__new__)

Results under this PR:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.141    0.141 pandas/core/frame.py:2862(set_index)
        1    0.000    0.000    0.139    0.139 pandas/core/indexes/multi.py:1054(from_arrays)
        1    0.000    0.000    0.139    0.139 pandas/core/categorical.py:2146(_factorize_from_iterables)
        2    0.000    0.000    0.139    0.069 pandas/core/categorical.py:2109(_factorize_from_iterable)
        2    0.000    0.000    0.139    0.069 pandas/core/categorical.py:251(__init__)
        2    0.000    0.000    0.132    0.066 pandas/core/algorithms.py:441(factorize)
    20013    0.012    0.000    0.098    0.000 pandas/tseries/offsets.py:383(__ne__)
    20013    0.036    0.000    0.086    0.000 pandas/tseries/offsets.py:369(__eq__)
        1    0.008    0.008    0.068    0.068 {method 'get_labels' of 'pandas._libs.hashtable.PyObjectHashTable' objects}
        2    0.000    0.000    0.063    0.031 pandas/core/sorting.py:383(safe_sort)
        2    0.016    0.008    0.063    0.031 {method 'argsort' of 'numpy.ndarray' objects}
     2000    0.030    0.000    0.035    0.000 pandas/tseries/offsets.py:319(_params)
     9578    0.009    0.000    0.012    0.000 pandas/tseries/offsets.py:504(freqstr)
    40456    0.011    0.000    0.011    0.000 {isinstance}
        2    0.000    0.000    0.006    0.003 pandas/core/categorical.py:544(_validate_categories)
        1    0.000    0.000    0.006    0.006 pandas/core/indexes/base.py:154(__new__)
        1    0.000    0.000    0.006    0.006 pandas/core/indexes/period.py:194(__new__)
        1    0.002    0.002    0.005    0.005 {pandas._libs.period.extract_ordinals}
     2000    0.004    0.000    0.004    0.000 pandas/tseries/offsets.py:193(__setattr__)
     9578    0.003    0.000    0.003    0.000 pandas/tseries/offsets.py:500(rule_code)

Note that in this example the periods were already ordered, so the number of comparisons (20013) is pretty much best-case. After shuffling them (sample size=1) there were 35181 comparisons.

There are some things I'd like to touch up to move towards actual-immutability, but the caching here is a big win.

codecov · 2017-08-01T05:12:19Z

Codecov Report

Merging #17137 into master will decrease coverage by 0.07%.
The diff coverage is 80%.

@@            Coverage Diff             @@
##           master   #17137      +/-   ##
==========================================
- Coverage   91.03%   90.95%   -0.08%     
==========================================
  Files         161      161              
  Lines       49405    49411       +6     
==========================================
- Hits        44974    44943      -31     
- Misses       4431     4468      +37

Flag	Coverage Δ
#multiple	`88.74% <80%> (-0.05%)`	⬇️
#single	`40.2% <70%> (-0.13%)`	⬇️

Impacted Files	Coverage Δ
pandas/tseries/offsets.py	`97.07% <80%> (-0.06%)`	⬇️
pandas/io/feather_format.py	`20% <0%> (-65.72%)`	⬇️
pandas/io/gbq.py	`25% <0%> (-58.34%)`	⬇️
pandas/core/frame.py	`97.56% <0%> (-0.2%)`	⬇️
pandas/core/series.py	`94.85% <0%> (-0.1%)`	⬇️
pandas/core/generic.py	`91.96% <0%> (-0.06%)`	⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f2b0bdc...2e3e639. Read the comment docs.

gfyoung · 2017-08-01T06:40:17Z

pandas/tseries/offsets.py


    # default for prior pickles
    normalize = False

+    def __setattr__(self, key, value):
+        if key in ['n', '_offset'] and hasattr(self, key):
+            raise TypeError('%s is intended to be immutable; "%s" '


Add a test to hit this TypeError.

Eventually. First there are some lingering corner cases in which the current implementation can be incorrect.

jreback · 2017-08-01T22:36:08Z

see if any asv's are needed for this (I don't think so actually), and show a representative speedup (not the profile), just a before / after timeit.

if it looks good, pls add a whatsnew in perf.

jbrockmendel · 2017-08-02T01:05:11Z

Will do. There are some cases I want to take a closer look at before calling this ready; parking some timeit results here in the meantime:

import random
import timeit

import pandas as pd

nobs = 1000

months = pd.date_range('1901-01-01', freq='M', periods=nobs)
days = pd.date_range('1923-04-05', freq='D', periods=nobs)

dser = days.to_series().apply(pd.Period, freq='D')
mser = months.to_series().apply(pd.Period, freq='M')

dlist = list(dser)
mlist = list(mser)

random.shuffle(dlist)
random.shuffle(mlist)

mser2 = pd.Series(mlist)
dser2 = pd.Series(dlist)

timeit.timeit(mser.sort_values, number=100)
timeit.timeit(dser.sort_values, number=100)
timeit.timeit(mser2.sort_values, number=100)
timeit.timeit(dser2.sort_values, number=100)

Old Results

>>> timeit.timeit(mser.sort_values, number=100)
29.13339614868164
>>> timeit.timeit(dser.sort_values, number=100)
46.98428416252136
>>> timeit.timeit(mser2.sort_values, number=100)
107.7828049659729
>>> timeit.timeit(dser2.sort_values, number=100)
113.00968098640442

New Results

>>> timeit.timeit(mser.sort_values, number=100)
2.2172720432281494
>>> timeit.timeit(dser.sort_values, number=100)
1.7678871154785156
>>> timeit.timeit(mser2.sort_values, number=100)
3.523206949234009
>>> timeit.timeit(dser2.sort_values, number=100)
2.4215428829193115

jbrockmendel · 2017-08-02T04:26:29Z

There may be a problem. A lot of cross-design appears to have built up over the years including at least two approaches to caching.

It looks like the original author had caching in mind, but it got dropped here. There is a _cacheable attribute that is False everywhere except one unused mixin class, and a _should_cache that is referenced over in core.indexes.datetimes but is apparently always False.

AFAICT the idea in tseries.frequencies.get_offset is that the user calls e.g. get_offset(B), gets a copy of the cached <BusinessDay> offset, and then can modify if desired. My approach of making DateOffset immutable would break that.

The motivation is that __eq__ is currently very slow. It is slow because it re-evaluates self._params() on every call. I'm advocating caching self._params(). Doing this requires ensuring that cache remains valid, so freezing all the relevant attributes. But those attributes are blacklist-based instead of whitelist-based:

def _params(self):
        all_paras = dict(list(vars(self).items()) + list(self.kwds.items()))
        if 'holidays' in all_paras and not all_paras['holidays']:
            all_paras.pop('holidays')
        exclude = ['kwds', 'name', 'normalize', 'calendar']
        attrs = [(k, v) for k, v in all_paras.items()
                 if (k not in exclude) and (k[0] != '_')]
        attrs = sorted(set(attrs))
        params = tuple([str(self.__class__)] + attrs)
        return params

(There is a TODO note further down suggesting this should be changed:

    # TODO: Combine this with DateOffset by defining a whitelisted set of
    # attributes on each object rather than the existing behavior of iterating
    # over internal ``__dict__``

)

Making _params cacheable for the general case would require basically deprecating the self.kwds attribute, which is I'm reticent to do as users may be using it in creative ways. My inclination is to have _params defer to a _cached_params iff len(self.kwds) == 0. Then __setattr__ can refuse to change a handful of attributes, but it doesn't need to be an exhaustive list.

I'm still not sure if this approach would play nicely with the frequencies.get_offset cache.

jreback · 2017-08-08T23:55:36Z

closing in favor of #17181

Make DateOffset (partially) immutable, cache _params for __eq__ speedup

2e3e639

gfyoung added Performance Memory or execution speed performance Enhancement labels Aug 1, 2017

gfyoung reviewed Aug 1, 2017

View reviewed changes

jreback changed the title ~~Make DateOffset (partially) immutable, cache _params for __eq__ speedup~~ PERF: Make DateOffset (partially) immutable, cache _params for __eq__ speedup Aug 1, 2017

This was referenced Aug 5, 2017

DateOffset.offset vs DateOffset._offset #17176

Closed

Make some attributes of DateOffset read-only; add tests #17181

Closed

jreback closed this Aug 8, 2017

jbrockmendel deleted the date_offset_immutable branch October 30, 2017 16:24

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

PERF: Make DateOffset (partially) immutable, cache _params for eq speedup #17137

PERF: Make DateOffset (partially) immutable, cache _params for eq speedup #17137

Uh oh!

jbrockmendel commented Aug 1, 2017

Uh oh!

codecov bot commented Aug 1, 2017 •

edited

Loading

Uh oh!

gfyoung Aug 1, 2017

Uh oh!

jbrockmendel Aug 1, 2017

Uh oh!

jreback commented Aug 1, 2017

Uh oh!

jbrockmendel commented Aug 2, 2017

Uh oh!

jbrockmendel commented Aug 2, 2017

Uh oh!

jreback commented Aug 8, 2017

Uh oh!

Uh oh!

Uh oh!

PERF: Make DateOffset (partially) immutable, cache _params for __eq__ speedup #17137

PERF: Make DateOffset (partially) immutable, cache _params for __eq__ speedup #17137

Uh oh!

Conversation

jbrockmendel commented Aug 1, 2017

Uh oh!

codecov bot commented Aug 1, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

gfyoung Aug 1, 2017

Choose a reason for hiding this comment

Uh oh!

jbrockmendel Aug 1, 2017

Choose a reason for hiding this comment

Uh oh!

jreback commented Aug 1, 2017

Uh oh!

jbrockmendel commented Aug 2, 2017

Uh oh!

jbrockmendel commented Aug 2, 2017

Uh oh!

jreback commented Aug 8, 2017

Uh oh!

Uh oh!

PERF: Make DateOffset (partially) immutable, cache _params for eq speedup #17137

PERF: Make DateOffset (partially) immutable, cache _params for eq speedup #17137

codecov bot commented Aug 1, 2017 •

edited

Loading