
PERF: improve access of .array #31037


Merged: 9 commits into pandas-dev:master on Jan 24, 2020

Conversation

jorisvandenbossche (Member)

xref #30790

This improves the performance of .array access considerably (and thus of extract_array, which, e.g., caused the slowdown in the indexing benchmark). The speed-up depends on the dtype (the numpy dtypes are still quite slow).

This touches quite a lot of code, but I think this is OK for 1.0.0 (although I mixed a bit of clean-up into it; I can also remove that again).
@jbrockmendel I went the way of adding "yet another array/values-like thing", but only on the Block (Block.array_values()). I think it might be possible to let _values return a DatetimeArray instead of datetime64 ndarray for datetime-like data, but that's something that needs more investigation, and even if possible, it is not something I would like to do for 1.0.0, only for 1.1.
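
(To illustrate the idea, here is a simplified sketch of the block-level dispatch, not the PR's actual diff; the point is that the block already knows its dtype, so no runtime inference is needed. DatetimeArray._simple_new is used as an example of a cheap, validation-free private constructor:)

    # Sketch only (not the PR's literal diff): what Block.array_values()
    # conceptually returns per block type.
    import numpy as np
    from pandas.arrays import DatetimeArray, PandasArray

    def array_values_sketch(block_values):
        if isinstance(block_values, np.ndarray):
            if block_values.dtype.kind == "M":  # datetime64 -> DatetimeArray
                return DatetimeArray._simple_new(block_values)
            return PandasArray(block_values)    # other numpy dtypes -> PandasArray
        return block_values                     # extension blocks already hold an EA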

On master:

In [1]: s = pd.Series(pd.date_range("2012-01-01", periods=3, tz='UTC')) 

In [3]: %timeit s.array   
17.2 µs ± 219 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [4]: s = pd.Series(pd.date_range("2012-01-01", periods=3)) 

In [6]: %timeit s.array   
4.3 µs ± 205 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [7]: s = pd.Series([1, 2, 3]) 

In [9]: %timeit s.array 
6.76 µs ± 208 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

With this PR:

In [1]: s = pd.Series(pd.date_range("2012-01-01", periods=3, tz='UTC'))  

In [3]: %timeit s.array 
290 ns ± 6.26 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [4]: s = pd.Series(pd.date_range("2012-01-01", periods=3))   

In [6]: %timeit s.array 
809 ns ± 7.05 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [7]: s = pd.Series([1, 2, 3])   

In [9]: %timeit s.array 
2.34 µs ± 93.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

@jorisvandenbossche added the Performance (memory or execution speed) label Jan 15, 2020
@jorisvandenbossche added this to the 1.0.0 milestone Jan 15, 2020
@jbrockmendel (Member)

I think it might be possible to let _values return a DatetimeArray instead of datetime64 ndarray for datetime-like data, but that's something that needs more investigation,

I've been looking at something similar. If we do the same for TimedeltaArray, this would allow us to clean up a lot of code that we currently use for boxing dt64/td64 ndarrays (com.maybe_box_datetimelike, libindex.convert_scalar). So generally +1 on this idea in a separate PR.

@jbrockmendel (Member)

I generally like what's going on here, though as Joris mentioned, the proliferation of foo_values is not ideal.

It would help if the cleanups could be separated out to limit the scope.

If the motivation is the performance of extract_array, then I think a much less invasive solution would be to call obj._values rather than obj.array. Slightly different behavior, but I think a large majority of use cases use extract_numpy=True anyway.
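
(The suggested change, roughly; a simplified sketch of extract_array with the one-line swap, not the real function from pandas.core.construction:)

    # Sketch: extract_array calling obj._values instead of obj.array (simplified).
    import pandas as pd
    from pandas.arrays import PandasArray

    def extract_array_sketch(obj, extract_numpy=False):
        if isinstance(obj, (pd.Series, pd.Index)):
            obj = obj._values        # was: obj.array (skips the boxing cost)
        if extract_numpy and isinstance(obj, PandasArray):
            obj = obj.to_numpy()     # unwrap PandasArray back to a plain ndarray
        return obj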

@jorisvandenbossche (Member, Author)

If the motivation is the performance of extract_array, then I think a much less invasive solution would be to call obj._values rather than obj.array

It's also the performance of .array itself, I would say. But for extract_array, that means we would still need to do a dtype check, as _values gives datetime64/timedelta64 and needs to be converted to DatetimeArray/TimedeltaArray.

@jbrockmendel (Member)

But for extract_array, that means we would still need to do a dtype check, as _values gives datetime64/timedelta64 and needs to be converted to DatetimeArray/TimedeltaArray.

To retain the current behavior, yes, but the original intention for extract_array didn't involve that casting; it was just unwrapping whatever we already had. It was pretty specifically a replacement for the getattr(values, "_values", getattr(values, "values", ...)) patterns.
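
(Spelled out, the idiom being referred to is roughly:)

    # The old unwrapping idiom: take _values if present, else values,
    # else the object itself -- no dtype-based casting involved.
    def unwrap(values):
        return getattr(values, "_values", getattr(values, "values", values))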

That said, I do like the idea of making Series._values return DTA/TDA instead of ndarray (or, if necessary, under a different name), as that would allow de-duplicating a bunch of code.

@jorisvandenbossche (Member, Author)

To retain the current behavior, yes,

But changing that behaviour is not something I want to do for 1.0 I think. So if that is the goal, are you OK with the approach in this PR?

@jbrockmendel (Member)

But changing that behaviour is not something I want to do for 1.0 I think. So if that is the goal, are you OK with the approach in this PR?

I think all the relevant methods are private, so we retain the option to revisit in the future; this works for me.

@jorisvandenbossche (Member, Author)

Updated this to not include the cleanup part. What's left is just the .array performance improvement, with limited changes.

@TomAugspurger (Contributor)

Thanks Joris, the changes look good.

Can you add a comment somewhere documenting the differences between .array, array_values, _values?

@jorisvandenbossche (Member, Author) commented Jan 18, 2020

Can you add a comment somewhere documenting the differences between .array, array_values, _values?

Can you indicate what you find missing in the docs? See also #31103 (I split that off from here on request) which included the docstring updates (e.g. I updated the Index._values docstring, and clarified internal_values/external_values on the blocks).

@@ -3923,6 +3923,16 @@ def values(self):
"""
return self._data.view(np.ndarray)

@Appender(IndexOpsMixin.array.__doc__) # type: ignore
@property
Contributor

this could be a cached_property I think?

Contributor

the Index is immutable but not the underlying, hmm.

Member (Author)

the Index is immutable but not the underlying, hmm.

That has always been the case for Index?

Contributor

We fixed the issue with cache_readonly docs not working, right? If so, +1 for using it as long as we're micro-optimizing.

Member (Author)

when putting them in the order of

    @cache_readonly
    @Appender(IndexOpsMixin.array.__doc__)  # type: ignore
    def array(self) -> ExtensionArray:

it seems to work

@@ -208,6 +216,12 @@ def internal_values(self):
"""
return self.values

def array_values(self) -> ExtensionArray:
Contributor

do we need external_values at all? this is getting very confusing with all of the values

Member (Author)

do we need external_values at all? this is getting very confusing with all of the values

I clarified the docstrings (in this PR, but split off and already merged in #31103) to make clear what external_values (-> .values), internal_values (-> _values), and array_values (-> .array) are used for. As long as those all have different behaviour (something that can be gradually cleaned up, but not here), we need a mechanism to create those different results.

@jbrockmendel (Member)

What if we override Block.internal_values for DatetimeBlock and TimedeltaBlock to return DTA/TDA in 1D cases? I.e. block.internal_values would behave like block.array_values does in this PR at the moment, but without adding a new foo_values.

IIUC, the main thing this would affect would be Series._values, by making it match Index._values, which I view as desirable.
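
(Illustratively, the observable effect of that proposal; this is not code from this PR:)

    # Sketch of the behavior change under the proposed override (not this PR).
    import pandas as pd

    s = pd.Series(pd.date_range("2012-01-01", periods=3))
    s._values   # currently: a numpy datetime64[ns] ndarray
    # under the proposal: a DatetimeArray, matching Index._values for the same data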

@TomAugspurger (Contributor)

See also #31103 (I split that off from here on request) which included the docstring updates

Thanks, that looks good.

@jorisvandenbossche (Member, Author) commented Jan 19, 2020

What if we override Block.internal_values for DatetimeBlock and TimedeltaBlock to return DTA/TDA in 1D cases? I.e. block.internal_values would behave like block.array_values does in this PR at the moment, but without adding a new foo_values.

IIUC, the main thing this would affect would be Series._values, by making it match Index._values, which I view as desirable.

@jbrockmendel Isn't that in practice the same as what we discussed before about using _values and changing it to return DTA/TDA? To which I said: I think that would be a nice clean-up, but not something I want to do for 1.0.

@jreback (Contributor) commented Jan 19, 2020

What if we override Block.internal_values for DatetimeBlock and TimedeltaBlock to return DTA/TDA in 1D cases? I.e. block.internal_values would behave like block.array_values does in this PR at the moment, but without adding a new foo_values.
IIUC, the main thing this would affect would be Series._values, by making it match Index._values, which I view as desirable.

@jbrockmendel Isn't that in practice the same as what we discussed before about using _values and changing it to return DTA/TDA? To which I said: I think that would be a nice clean-up, but not something I want to do for 1.0.

let’s make a function to do this then; having 4 properties to do very similar things is so confusing

@jorisvandenbossche (Member, Author)

let’s make a function to do this then; having 4 properties to do very similar things is so confusing

Can you be more specific about what you are suggesting?
There are 3 properties related to this PR: values, _values, and array, and those are obviously not going away.
Currently, those are implemented by dispatching each of them to an equivalent method on the Block (well, two already did it this way, and in this PR I make the third, array, consistent with that). For sure, there are other ways to implement this than dispatching to the Block, but such a refactor is out of scope for this PR, I think. Here, I looked for a minimal change, consistent with the code we already have, to fix the performance issue for 1.0.
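
(Concretely, the dispatch pattern described here looks roughly like the following simplified sketch; attribute and method names are as used in this thread, and the block-manager indirection is elided:)

    # Simplified sketch of the property-to-Block dispatch (not actual pandas code).
    class SeriesSketch:
        def __init__(self, block):
            self._block = block

        @property
        def values(self):
            return self._block.external_values()   # public .values (numpy-facing)

        @property
        def _values(self):
            return self._block.internal_values()   # internal _values

        @property
        def array(self):
            return self._block.array_values()      # .array: always an ExtensionArray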

@jbrockmendel (Member)

I tried out a branch where I made Block.internal_values return TDA/DTA for TimedeltaBlock/DatetimeBlock. What I learned:

  • a 1-line patch is needed in core.apply to prevent passing TDA/DTA to the cython fastpath
  • a 3-line patch is needed in IndexOpsMixin.unique to keep the return type unchanged
  • s.array is slightly more performant than this PR for the tz-naive dt64 case, and less performant in the other two cases (though still better than master, especially for the tz-aware case); see the benchmarks below
import pandas as pd
from pandas.core.construction import extract_array

s1 = pd.Series(pd.date_range("2012-01-01", periods=3, tz='UTC'))
s2 = pd.Series(pd.date_range("2012-01-01", periods=3))
s3 = pd.Series([1, 2, 3])

%timeit s1.array
21.9 µs ± 706 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)    <-- master
738 ns ± 10.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)  <-- _values branch
357 ns                                                                      <-- #31037 (est)

%timeit s2.array
5.39 µs ± 155 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)    <-- master
783 ns ± 22.6 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)   <-- _values branch
998 ns                                                                       <-- #31037 (est)

%timeit s3.array
7.94 µs ± 145 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)  <-- master
5.81 µs ± 224 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)  <-- _values branch
2.88 µs                                                                    <-- #31037 (est)

%timeit extract_array(s1, extract_numpy=True)
23.2 µs ± 371 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)     <-- master
1.32 µs ± 13.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)  <-- _values branch

%timeit extract_array(s2, extract_numpy=True)
6.73 µs ± 136 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)    <-- master
1.38 µs ± 13.4 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)  <-- _values branch

%timeit extract_array(s3, extract_numpy=True)
10.3 µs ± 252 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)    <-- master
1.33 µs ± 9.33 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)  <-- _values branch

%timeit extract_array(s3, extract_numpy=False)
8.89 µs ± 93.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)  <-- master
7.49 µs ± 276 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)   <-- _values branch

@jreback (Contributor) commented Jan 19, 2020

let’s make a function to do this then; having 4 properties to do very similar things is so confusing

Can you be more specific about what you are suggesting?
There are 3 properties related to this PR: values, _values, and array, and those are obviously not going away.
Currently, those are implemented by dispatching each of them to an equivalent method on the Block (well, two already did it this way, and in this PR I make the third, array, consistent with that). For sure, there are other ways to implement this than dispatching to the Block, but such a refactor is out of scope for this PR, I think. Here, I looked for a minimal change, consistent with the code we already have, to fix the performance issue for 1.0.

I don't think this patch is worth the added complexity in 1.0.0; happy to entertain it in 1.1. Even there, having 4 internal properties is just too much; we need to find a better way.

@jorisvandenbossche (Member, Author)

I don't think this patch is worth the added complexity in 1.0.0; happy to entertain it in 1.1. Even there, having 4 internal properties is just too much; we need to find a better way.

If you have a proposal for a simpler way to fix the performance regression, that's very welcome; I am all ears. But I think this PR is really not that complex: it's not adding more properties, it's using patterns similar to what we already have, and it's only a 40-line diff.

@jreback (Contributor) commented Jan 20, 2020

Regardless, the PandasDtype.name thing should be done. @jorisvandenbossche any idea what portion of the perf bump we get from that?

Happy with that. Also, your _values branch @jbrockmendel could be a substitute for this.

@jorisvandenbossche (Member, Author)

Regardless, the PandasDtype.name thing should be done. @jorisvandenbossche any idea what portion of the perf bump we get from that?

See #31037 (comment). There I timed this at around 1.8 µs (of a total of around 8 µs). So it is a smaller but significant part (though only for pandas arrays, of course).
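
(For context, a hypothetical sketch of the kind of PandasDtype.name fix being discussed, assuming the name was previously recomputed from the wrapped numpy dtype on every access; the names here are illustrative, not the merged diff:)

    # Hypothetical sketch: cache the name at construction instead of
    # recomputing it on every .name access.
    import numpy as np

    class PandasDtypeSketch:
        def __init__(self, dtype):
            self._dtype = np.dtype(dtype)
            self._name = self._dtype.name   # computed once

        @property
        def name(self):
            return self._name               # cheap attribute lookup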

@TomAugspurger (Contributor) left a comment

I think https://github.com/pandas-dev/pandas/pull/31037/files#r368317201 clears things up nicely for me.

It's a fact that we have inconsistencies between values, _values, and array, and cleaning up those inconsistencies is probably worthwhile. In the meantime, we need a way to refer to each of them (external_values -> .values, internal_values -> _values, and array_values -> array).

@jreback (Contributor) commented Jan 23, 2020

@jbrockmendel what happened to your PR (_values)?

@TomAugspurger (Contributor)

#31182

@jbrockmendel thinks those changes are too large for 1.0.0.

@jreback (Contributor) commented Jan 23, 2020

@jorisvandenbossche ok, fine. However, please make a BIG GIANT section (in the code somewhere) where you explain exactly what .values, _values, ._ndarray_values, .array_values, .external_values, internal_values, and get_values are.

@jorisvandenbossche (Member, Author)

However, please make a BIG GIANT section (in the code somewhere) where you explain exactly what .values, _values, ._ndarray_values, .array_values, .external_values, internal_values, and get_values are.

I already updated the docstrings of internal_values and external_values to be much clearer in a previous PR (and did the same for array_values in this PR).
I will make the Series._values one more explicit (the Index._values one is already very elaborate).

@jorisvandenbossche (Member, Author)

@jreback can you take a look at the docstring I added for Series._values?

@jreback (Contributor) commented Jan 23, 2020

I want something central that lists all of these.

@TomAugspurger (Contributor)

What does "central" mean here? https://github.com/pandas-dev/pandas/pull/31037/files#diff-150fd3c5a732ae915ec47bc54a933c41R515 does cover everything I think.

@jbrockmendel (Member)

I want something central that lists all of these.

Maybe this becomes easier if/when we do #31182 for 1.1.0 and can share a docstring between Series._values/Index._values?

@jreback (Contributor) commented Jan 23, 2020

OK, that's better, though it still does not cover the internals (e.g. internal_values, external_values); that's worth a small doc in internals.py.

The reason I am being so annoying/pedantic about this is that it is creating massive technical debt: it looks OK in the short term because it allows us to move forward, but it is a huge long-term problem. The development velocity of pandas is decreasing because of these differences, bugs appear, and it's extremely hard for new (or existing) folks to make a pass at the internals.

@jorisvandenbossche (Member, Author)

The reason I am being so annoying/pedantic about this is that it is creating massive technical debt

And, in my turn to be pedantic: I fully disagree with that. If anything happened due to this PR and the PR that was split off, it is rather a clear net improvement: they made the dependencies of the Series attributes (values, _values, array) on the Block methods (internal_values, external_values) much clearer, and improved the docs on this considerably.

Yes, we have a lot of different attributes (values, _values, array, _ndarray_values); yes, this is unfortunate technical debt; and yes, we should improve this (and thanks to this PR we have already actively discussed it and have plans for it). But all of this already exists: it is not caused by this PR, which did not add any attribute on Series or Index.

So I would say that further doc improvements and attempts to reduce technical debt on this are out of scope for this PR.

@jreback (Contributor) commented Jan 23, 2020

The reason I am being so annoying/pedantic about this is that it is creating massive technical debt

And, in my turn to be pedantic: I fully disagree with that. If anything happened due to this PR and the PR that was split off, it is rather a clear net improvement: they made the dependencies of the Series attributes (values, _values, array) on the Block methods (internal_values, external_values) much clearer, and improved the docs on this considerably.

Yes, we have a lot of different attributes (values, _values, array, _ndarray_values); yes, this is unfortunate technical debt; and yes, we should improve this (and thanks to this PR we have already actively discussed it and have plans for it). But all of this already exists: it is not caused by this PR, which did not add any attribute on Series or Index.

So I would say that further doc improvements and attempts to reduce technical debt on this are out of scope for this PR.

You are missing the point. We need a serious cleanup in this codebase. There have been quite a few already, so I am certainly not discounting anything you, Tom, or Brock are doing (encouraging it, actually).

But rather than adding workarounds all over the place, it would be much better to tackle the internals. I am not saying it is easy at all, but I constantly question the worth of a PR that fixes something in a tactical way without making any progress towards strategic things.

Sure, this might be an aspirational goal, but without refactorings the codebase becomes less and less approachable and likely produces more bugs.

I applaud that you are fixing some performance issues; that is great. But don't lose sight of the larger issues.

All that said, this is fine.

@jreback merged commit 38ea154 into pandas-dev:master Jan 24, 2020
lumberbot-app (bot) commented Jan 24, 2020

Owee, I'm MrMeeseeks, Look at me.

There seems to be a conflict; please backport manually. Here are approximate instructions:

  1. Check out the backport branch and update it:
     $ git checkout 1.0.x
     $ git pull
  2. Cherry-pick the first parent of this PR on top of the older branch:
     $ git cherry-pick -m1 38ea15482aaef03647e4bac6ccb49a4667a0e2ac
  3. You will likely have some merge/cherry-pick conflicts here; fix them and commit:
     $ git commit -am 'Backport PR #31037: PERF: improve access of .array'
  4. Push to a named branch:
     $ git push YOURFORK 1.0.x:auto-backport-of-pr-31037-on-1.0.x
  5. Create a PR against branch 1.0.x. I would have named this PR:

     "Backport PR #31037 on branch 1.0.x"

     And apply the correct labels and milestones.

Congratulations, you did some good work! Hopefully your backport PR will be tested by the continuous integration and merged soon!

If these instructions are inaccurate, feel free to suggest an improvement.
