Skip to content

PERF: regression in getattr for IntervalIndex #30742

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
jorisvandenbossche opened this issue Jan 6, 2020 · 20 comments · Fixed by #30797
Closed

PERF: regression in getattr for IntervalIndex #30742

jorisvandenbossche opened this issue Jan 6, 2020 · 20 comments · Fixed by #30797
Labels
Interval Interval data type Performance Memory or execution speed performance

Comments

@jorisvandenbossche
Copy link
Member

Master:

In [14]: idx = pd.interval_range(0, 1000, 1000) 

In [15]: %timeit getattr(idx, '_ndarray_values', idx)
1.29 ms ± 30.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [16]: %timeit idx.closed
321 ns ± 2.66 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

while on 0.25.3:

In [13]: idx = pd.interval_range(0, 1000, 1000) 

In [14]: %timeit getattr(idx, '_ndarray_values', idx)  
90.5 ns ± 2.09 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

In [15]: %timeit idx.closed 
105 ns ± 1.61 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

(just checked a few attributes, didn't check if it is related to those specific ones or getattr in general)

I think this is a cause / one of the causes of several regressions that can currently be seen at https://pandas.pydata.org/speed/pandas/ (eg https://pandas.pydata.org/speed/pandas/#reshape.Cut.time_cut_timedelta?p-bins=1000&commits=6efc2379-b9de33e3)

@jorisvandenbossche jorisvandenbossche added Performance Memory or execution speed performance Interval Interval data type labels Jan 6, 2020
@jorisvandenbossche jorisvandenbossche added this to the 1.0 milestone Jan 6, 2020
@jorisvandenbossche
Copy link
Member Author

cc @jbrockmendel this might be related: #30605

@jbrockmendel
Copy link
Member

#30605 changed _ndarray_values from a cache_readonly to a property. Easy to update.

@jorisvandenbossche
Copy link
Member Author

That might be easy to update, but I am not sure that this explains the slowdown in the other attributes / slowdown in dependent functionality? Eg pd.cut (the actual benchmark I was investigating):

bins = 1000 
N = 10 ** 5
timedelta_series = pd.Series(np.random.randint(N, size=N), dtype="timedelta64[ns]")
pd.cut(timedelta_series, bins)

This pd.cut call also shows a slowdown, so would need to check if the caching improves that again (note, I also didn't fully investigate if the slowdown here is caused by the IntervalIndex attributes slowdown, there might also be other reasons).

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Jan 6, 2020

On master right now, CategoricalIndex._ndarray_values is using Index._ndarray_values, which is something like 6x slower than the old implementation (just an attribute lookup).

https://github.com/pandas-dev/pandas/pull/30717/files is adding an ExtensionIndex._ndarray_values which would I think fix at least one source of the slowdown. Or we can have a targeted fix restoring CategoricalIndex._ndarray_values as _data.values.

@jorisvandenbossche
Copy link
Member Author

Reopening this because the other attributes are still slower (the _ndarray_values was solved in #30797), eg the original example in the top post for idx.closes is still 3x slower.

@jorisvandenbossche
Copy link
Member Author

The original cut benchmark (see example in #30742 (comment)) seems to be fixed with #30797!

@jreback
Copy link
Contributor

jreback commented Jan 23, 2020

The original cut benchmark (see example in #30742 (comment)) seems to be fixed with #30797!

is this fixed @jorisvandenbossche

@jorisvandenbossche
Copy link
Member Author

Not in general, only the specific case of the _ndarray_values attribute

@TomAugspurger
Copy link
Contributor

Any suggestions for the .closed regression?

IIUC, the additional overhead comes from going through accessor.PandasDelegate. Some options

  1. Just manually write the definition of IntervalIndex.closed
  2. Make IntervalIndex.closed cache readonly
  3. Some simpler mixin, such that the definition of IntervalIndex.closed becomes just ._data.closed, rather than the dynamic _delegate_property_get?

@jbrockmendel
Copy link
Member

indexes.extension.inherit_names has a cache kwarg to turn a property into a cache_readonly. I can update #30860 to make closed (and anything else?) cached.

@TomAugspurger
Copy link
Contributor

I think just closed IntervalIndex.

@jorisvandenbossche
Copy link
Member Author

I don't think it's just closed, it's all attributes that use the same mechanism.

I think we could also have a version of PandasDelagate that is specific for the indexes that have _data as the underlying data as extension array. I think that way, it can be written more specific / more efficient.

@TomAugspurger
Copy link
Contributor

@jbrockmendel does #30860 close this, or is the caching being done in a separate PR?

Or do we want a PandasDelegate subclass that's specific to cases where Index.attr is just Index._data.attr?

@jorisvandenbossche
Copy link
Member Author

BTW, I don't think this should necessarily be a blocker for 1.0.
But we should keep this in mind for all index->EA delegation refactor that is happening, and look into improving this.

@jorisvandenbossche jorisvandenbossche modified the milestones: 1.0.0, 1.1 Jan 28, 2020
@jbrockmendel
Copy link
Member

#30860 did the caching for closed but there may be other properties that can be made cached

@jreback jreback modified the milestones: 1.1, 1.1.1 Jul 10, 2020
@simonjayhawkins
Copy link
Member

moved off 1.1.1 milestone (scheduled for this week) as no PRs to fix in the pipeline

@simonjayhawkins simonjayhawkins modified the milestones: 1.1.1, 1.1.2 Aug 17, 2020
@simonjayhawkins simonjayhawkins modified the milestones: 1.1.2, 1.1.3 Sep 7, 2020
@simonjayhawkins
Copy link
Member

moved off 1.1.2 milestone (scheduled for this week) as no PRs to fix in the pipeline

@simonjayhawkins simonjayhawkins modified the milestones: 1.1.3, 1.1.4 Oct 5, 2020
@simonjayhawkins
Copy link
Member

moved off 1.1.3 milestone (overdue) as no PRs to fix in the pipeline

@simonjayhawkins simonjayhawkins modified the milestones: 1.1.4, 1.1.5 Oct 29, 2020
@simonjayhawkins
Copy link
Member

moved off 1.1.4 milestone (scheduled for release tomorrow) as no PRs to fix in the pipeline

@jreback jreback modified the milestones: 1.1.5, Contributions Welcome Nov 25, 2020
@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
@jbrockmendel
Copy link
Member

Closing as fixed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Interval Interval data type Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants