
PERF: improve access of .array #31037


Merged: 9 commits into pandas-dev:master on Jan 24, 2020

Conversation

jorisvandenbossche (Member)

xref #30790

This improves the performance of .array access considerably (and thus of extract_array, which, e.g., caused the slowdown in the indexing benchmark). The speed-up depends on the dtype (the numpy dtypes are still quite slow).

This touches quite a lot of code, but I think this is OK for 1.0.0 (although I mixed a bit of clean-up into it; I can also remove that again).
@jbrockmendel I went the way of adding "yet another array/values-like thing", but only on the Block (Block.array_values()). I think it might be possible to let _values return a DatetimeArray instead of datetime64 ndarray for datetime-like data, but that's something that needs more investigation, and even if possible, it is not something I would like to do for 1.0.0, only for 1.1.
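
(To illustrate the idea, here is a simplified sketch of the block-level dispatch, not the PR's actual diff; the point is that the block already knows its dtype, so no runtime inference is needed. DatetimeArray._simple_new is used as an example of a cheap, validation-free private constructor:)

    # Sketch only (not the PR's literal diff): what Block.array_values()
    # conceptually returns per block type.
    import numpy as np
    from pandas.arrays import DatetimeArray, PandasArray

    def array_values_sketch(block_values):
        if isinstance(block_values, np.ndarray):
            if block_values.dtype.kind == "M":  # datetime64 -> DatetimeArray
                return DatetimeArray._simple_new(block_values)
            return PandasArray(block_values)    # other numpy dtypes -> PandasArray
        return block_values                     # extension blocks already hold an EA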

On master:

In [1]: s = pd.Series(pd.date_range("2012-01-01", periods=3, tz='UTC')) 

In [3]: %timeit s.array   
17.2 µs ± 219 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [4]: s = pd.Series(pd.date_range("2012-01-01", periods=3)) 

In [6]: %timeit s.array   
4.3 µs ± 205 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [7]: s = pd.Series([1, 2, 3]) 

In [9]: %timeit s.array 
6.76 µs ± 208 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

With this PR:

In [1]: s = pd.Series(pd.date_range("2012-01-01", periods=3, tz='UTC'))  

In [3]: %timeit s.array 
290 ns ± 6.26 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [4]: s = pd.Series(pd.date_range("2012-01-01", periods=3))   

In [6]: %timeit s.array 
809 ns ± 7.05 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [7]: s = pd.Series([1, 2, 3])   

In [9]: %timeit s.array 
2.34 µs ± 93.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

@jorisvandenbossche added the Performance (memory or execution speed) label Jan 15, 2020
@jorisvandenbossche added this to the 1.0.0 milestone Jan 15, 2020
@jbrockmendel (Member)

I think it might be possible to let _values return a DatetimeArray instead of datetime64 ndarray for datetime-like data, but that's something that needs more investigation,

I've been looking at something similar. If we do the same for TimedeltaArray, this would allow us to clean up a lot of code that we currently use for boxing dt64/td64 ndarrays (com.maybe_box_datetimelike, libindex.convert_scalar). So generally +1 on this idea in a separate PR.

@jbrockmendel (Member)

I generally like what's going on here, though as Joris mentioned, the proliferation of foo_values is not ideal.

It would help if the cleanups could be separated out to limit the scope.

If the motivation is the performance of extract_array, then I think a much less invasive solution would be to call obj._values rather than obj.array. Slightly different behavior, but I think a large majority of use cases use extract_numpy=True anyway.
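
(The suggested change, roughly; a simplified sketch of extract_array with the one-line swap, not the real function from pandas.core.construction:)

    # Sketch: extract_array calling obj._values instead of obj.array (simplified).
    import pandas as pd
    from pandas.arrays import PandasArray

    def extract_array_sketch(obj, extract_numpy=False):
        if isinstance(obj, (pd.Series, pd.Index)):
            obj = obj._values        # was: obj.array (skips the boxing cost)
        if extract_numpy and isinstance(obj, PandasArray):
            obj = obj.to_numpy()     # unwrap PandasArray back to a plain ndarray
        return obj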

@jorisvandenbossche (Member, Author)

If the motivation is the performance of extract_array, then I think a much less invasive solution would be to call obj._values rather than obj.array

It's also the performance of .array itself, I would say. But for extract_array, that means we would still need to do a dtype check, as _values gives datetime64/timedelta64 and needs to be converted to DatetimeArray/TimedeltaArray.

@jbrockmendel (Member)

But for extract_array, that means we would still need to do a dtype check, as _values gives datetime64/timedelta64 and needs to be converted to DatetimeArray/TimedeltaArray.

To retain the current behavior, yes, but the original intention for extract_array didn't involve that casting; it was just unwrapping whatever we already had. It was pretty specifically a replacement for the getattr(values, "_values", getattr(values, "values", ...)) patterns.
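
(Spelled out, the idiom being referred to is roughly:)

    # The old unwrapping idiom: take _values if present, else values,
    # else the object itself -- no dtype-based casting involved.
    def unwrap(values):
        return getattr(values, "_values", getattr(values, "values", values))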

That said, I do like the idea of making Series._values return DTA/TDA instead of ndarray (or, if necessary, under a different name), as that would allow de-duplicating a bunch of code.

@jorisvandenbossche (Member, Author)

To retain the current behavior, yes,

But changing that behaviour is not something I want to do for 1.0 I think. So if that is the goal, are you OK with the approach in this PR?

@jbrockmendel (Member)

But changing that behaviour is not something I want to do for 1.0 I think. So if that is the goal, are you OK with the approach in this PR?

I think all the relevant methods are private, so we retain the option to revisit in the future; this works for me.

@jorisvandenbossche (Member, Author)

Updated this to not include the cleanup part. What's left is just the .array performance improvement, with limited changes.

@TomAugspurger (Contributor)

Thanks Joris, the changes look good.

Can you add a comment somewhere documenting the differences between .array, array_values, _values?

@jorisvandenbossche (Member, Author) commented Jan 18, 2020

Can you add a comment somewhere documenting the differences between .array, array_values, _values?

Can you indicate what you find missing in the docs? See also #31103 (I split that off from here on request) which included the docstring updates (e.g. I updated the Index._values docstring, and clarified internal_values/external_values on the blocks).

@@ -3923,6 +3923,16 @@ def values(self):
"""
return self._data.view(np.ndarray)

@Appender(IndexOpsMixin.array.__doc__) # type: ignore
@property
Contributor

this could be a cached_property I think?

Contributor

the Index is immutable but not the underlying, hmm.

Member (Author)

the Index is immutable but not the underlying, hmm.

That has always been the case for Index?

Contributor

We fixed the issue with cache_readonly docs not working, right? If so, +1 for using it as long as we're micro-optimizing.

Member (Author)

when putting them in the order of

    @cache_readonly
    @Appender(IndexOpsMixin.array.__doc__)  # type: ignore
    def array(self) -> ExtensionArray:

it seems to work

@@ -208,6 +216,12 @@ def internal_values(self):
"""
return self.values

def array_values(self) -> ExtensionArray:
Contributor

do we need external_values at all? this is getting very confusing with all of the values

Member (Author)

do we need external_values at all? this is getting very confusing with all of the values

I clarified the docstrings (in this PR, but split off and already merged in #31103) to make clear what external_values (-> .values), internal_values (-> _values), and array_values (-> .array) are used for. As long as those all have different behaviour (something that can be gradually cleaned up, but not here), we need a mechanism to create those different results.

@jbrockmendel (Member)

What if we override Block.internal_values for DatetimeBlock and TimedeltaBlock to return DTA/TDA in 1D cases? I.e. block.internal_values would behave like block.array_values does in this PR at the moment, but without adding a new foo_values.

IIUC, the main thing this would affect would be Series._values, by making it match Index._values, which I view as desirable.
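
(Illustratively, the observable effect of that proposal; this is not code from this PR:)

    # Sketch of the behavior change under the proposed override (not this PR).
    import pandas as pd

    s = pd.Series(pd.date_range("2012-01-01", periods=3))
    s._values   # currently: a numpy datetime64[ns] ndarray
    # under the proposal: a DatetimeArray, matching Index._values for the same data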

@TomAugspurger (Contributor)

See also #31103 (I split that off from here on request) which included the docstring updates

Thanks, that looks good.

@jorisvandenbossche (Member, Author) commented Jan 19, 2020

What if we override Block.internal_values for DatetimeBlock and TimedeltaBlock to return DTA/TDA in 1D cases? I.e. block.internal_values would behave like block.array_values does in this PR at the moment, but without adding a new foo_values.

IIUC, the main thing this would affect would be Series._values, by making it match Index._values, which I view as desirable.

@jbrockmendel Isn't that in practice the same as what we discussed before about using _values and changing it to return DTA/TDA? To which I said: I think that would be a nice clean-up, but not something I want to do for 1.0.

@jreback (Contributor) commented Jan 19, 2020

What if we override Block.internal_values for DatetimeBlock and TimedeltaBlock to return DTA/TDA in 1D cases? I.e. block.internal_values would behave like block.array_values does in this PR at the moment, but without adding a new foo_values.
IIUC, the main thing this would affect would be Series._values, by making it match Index._values, which I view as desirable.

@jbrockmendel Isn't that in practice the same as what we discussed before about using _values and changing it to return DTA/TDA? To which I said: I think that would be a nice clean-up, but not something I want to do for 1.0.

let’s make a function to do this then; having 4 properties to do very similar things is so confusing

@jorisvandenbossche (Member, Author)

let’s make a function to do this then; having 4 properties to do very similar things is so confusing

Can you be more specific about what you are suggesting?
There are 3 properties related to this PR: values, _values, and array, and those are obviously not going away.
Currently, those are implemented by dispatching each of them to an equivalent method on the Block (well, two already did it this way, and in this PR I make the third, array, consistent with that). For sure, there are other ways to implement this than dispatching to the Block, but such a refactor is out of scope for this PR, I think. Here, I looked for a minimal change, consistent with the code we already have, to fix the performance issue for 1.0.
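
(Concretely, the dispatch pattern described here looks roughly like the following simplified sketch; attribute and method names are as used in this thread, and the block-manager indirection is elided:)

    # Simplified sketch of the property-to-Block dispatch (not actual pandas code).
    class SeriesSketch:
        def __init__(self, block):
            self._block = block

        @property
        def values(self):
            return self._block.external_values()   # public .values (numpy-facing)

        @property
        def _values(self):
            return self._block.internal_values()   # internal _values

        @property
        def array(self):
            return self._block.array_values()      # .array: always an ExtensionArray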

@jbrockmendel (Member)

I tried out a branch where I made Block.internal_values return TDA/DTA for TimedeltaBlock/DatetimeBlock. What I learned:

  • a 1-line patch is needed in core.apply to prevent passing TDA/DTA to the cython fastpath
  • a 3-line patch is needed in IndexOpsMixin.unique to keep the return type unchanged
  • s.array is slightly more performant than this PR for the tz-naive dt64 case, and less performant in the other two cases (though still better than master, especially for the tz-aware case); see the benchmarks below
import pandas as pd
from pandas.core.construction import extract_array

s1 = pd.Series(pd.date_range("2012-01-01", periods=3, tz='UTC'))
s2 = pd.Series(pd.date_range("2012-01-01", periods=3))
s3 = pd.Series([1, 2, 3])

%timeit s1.array
21.9 µs ± 706 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)    <-- master
738 ns ± 10.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)  <-- _values branch
357 ns                                                                      <-- #31037 (est)

%timeit s2.array
5.39 µs ± 155 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)    <-- master
783 ns ± 22.6 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)   <-- _values branch
998 ns                                                                       <-- #31037 (est)

%timeit s3.array
7.94 µs ± 145 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)  <-- master
5.81 µs ± 224 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)  <-- _values branch
2.88 µs                                                                    <-- #31037 (est)

%timeit extract_array(s1, extract_numpy=True)
23.2 µs ± 371 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)     <-- master
1.32 µs ± 13.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)  <-- _values branch

%timeit extract_array(s2, extract_numpy=True)
6.73 µs ± 136 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)    <-- master
1.38 µs ± 13.4 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)  <-- _values branch

%timeit extract_array(s3, extract_numpy=True)
10.3 µs ± 252 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)    <-- master
1.33 µs ± 9.33 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)  <-- _values branch

%timeit extract_array(s3, extract_numpy=False)
8.89 µs ± 93.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)  <-- master
7.49 µs ± 276 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)   <-- _values branch

@jreback (Contributor) commented Jan 19, 2020

let’s make a function to do this then; having 4 properties to do very similar things is so confusing

Can you be more specific about what you are suggesting?
There are 3 properties related to this PR: values, _values, and array, and those are obviously not going away.
Currently, those are implemented by dispatching each of them to an equivalent method on the Block (well, two already did it this way, and in this PR I make the third, array, consistent with that). For sure, there are other ways to implement this than dispatching to the Block, but such a refactor is out of scope for this PR, I think. Here, I looked for a minimal change, consistent with the code we already have, to fix the performance issue for 1.0.

I don't think this patch is worth the added complexity in 1.0.0; happy to entertain it in 1.1. Even there, having 4 internal properties is just too much; we need to find a better way.

@jorisvandenbossche (Member, Author)

I don't think this patch is worth the added complexity in 1.0.0; happy to entertain it in 1.1. Even there, having 4 internal properties is just too much; we need to find a better way.

If you have a proposal for a simpler way to fix the performance regression, that's very welcome; I am all ears. But I think this PR is really not that complex: it's not adding more properties, it's using patterns similar to what we already have, and it's only a 40-line diff.

@jreback (Contributor) commented Jan 20, 2020

Regardless, the PandasDtype.name thing should be done. @jorisvandenbossche any idea what portion of the perf bump we get from that?

Happy with that. Also, your _values branch @jbrockmendel could be a substitute for this.

@jorisvandenbossche (Member, Author)

Regardless, the PandasDtype.name thing should be done. @jorisvandenbossche any idea what portion of the perf bump we get from that?

See #31037 (comment). There I timed this at around 1.8 µs (of a total of around 8 µs). So it is a smaller but significant part (though only for pandas arrays, of course).
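
(For context, a hypothetical sketch of the kind of PandasDtype.name fix being discussed, assuming the name was previously recomputed from the wrapped numpy dtype on every access; the names here are illustrative, not the merged diff:)

    # Hypothetical sketch: cache the name at construction instead of
    # recomputing it on every .name access.
    import numpy as np

    class PandasDtypeSketch:
        def __init__(self, dtype):
            self._dtype = np.dtype(dtype)
            self._name = self._dtype.name   # computed once

        @property
        def name(self):
            return self._name               # cheap attribute lookup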

@TomAugspurger (Contributor) left a comment

I think https://github.com/pandas-dev/pandas/pull/31037/files#r368317201 clears things up nicely for me.

It's a fact that we have inconsistencies between values, _values, and array, and cleaning up those inconsistencies is probably worthwhile. In the meantime, we need a way to refer to each of them (external_values -> .values, internal_values -> _values, and array_values -> array).

@jreback (Contributor) commented Jan 23, 2020

@jbrockmendel what happened to your PR (_values)?

@TomAugspurger (Contributor)

#31182

@jbrockmendel thinks those changes are too large for 1.0.0.

@jreback (Contributor) commented Jan 23, 2020

@jorisvandenbossche ok, fine. However, please make a BIG GIANT section (in the code somewhere) where you explain exactly what .values, _values, ._ndarray_values, .array_values, .external_values, internal_values, and get_values are.

@jorisvandenbossche (Member, Author)

However, please make a BIG GIANT section (in the code somewhere) where you explain exactly what .values, _values, ._ndarray_values, .array_values, .external_values, internal_values, and get_values are.

I already updated the docstrings of internal_values and external_values to be much clearer in a previous PR (and did the same for array_values in this PR).
I will make the Series._values one more explicit (the Index._values one is already very elaborate).

@jorisvandenbossche (Member, Author)

@jreback can you take a look at the docstring I added for Series._values?

@jreback (Contributor) commented Jan 23, 2020

I want something central that lists all of these.

@TomAugspurger (Contributor)

What does "central" mean here? https://github.com/pandas-dev/pandas/pull/31037/files#diff-150fd3c5a732ae915ec47bc54a933c41R515 does cover everything I think.

@jbrockmendel (Member)

I want something central that lists all of these.

Maybe this becomes easier if/when we do #31182 for 1.1.0 and can share a docstring between Series._values/Index._values?

@jreback (Contributor) commented Jan 23, 2020

OK, that's better, though it still does not cover the internals (e.g. internal_values, external_values); that's worth a small doc in internals.py.

The reason I am being so annoying/pedantic about this is that it is creating massive technical debt: it looks OK in the short term because it allows us to move forward, but it is a huge long-term problem. The development velocity of pandas is decreasing because of these differences, bugs appear, and it's extremely hard for new (or existing) folks to make a pass at the internals.

@jorisvandenbossche (Member, Author)

The reason I am being so annoying/pedantic about this is that it is creating massive technical debt

And, in my turn to be pedantic: I fully disagree with that. If anything happened due to this PR and the PR that was split off, it is rather a clear net improvement: they made the dependencies of the Series attributes (values, _values, array) on the Block methods (internal_values, external_values) much clearer, and improved the docs on this considerably.

Yes, we have a lot of different attributes (values, _values, array, _ndarray_values); yes, this is unfortunate technical debt; and yes, we should improve this (and thanks to this PR we have already actively discussed it and have plans for it). But all of this already exists: it is not caused by this PR, which did not add any attribute on Series or Index.

So I would say that further doc improvements and attempts to reduce technical debt on this are out of scope for this PR.

@jreback (Contributor) commented Jan 23, 2020

The reason I am being so annoying/pedantic about this is that it is creating massive technical debt

And, in my turn to be pedantic: I fully disagree with that. If anything happened due to this PR and the PR that was split off, it is rather a clear net improvement: they made the dependencies of the Series attributes (values, _values, array) on the Block methods (internal_values, external_values) much clearer, and improved the docs on this considerably.

Yes, we have a lot of different attributes (values, _values, array, _ndarray_values); yes, this is unfortunate technical debt; and yes, we should improve this (and thanks to this PR we have already actively discussed it and have plans for it). But all of this already exists: it is not caused by this PR, which did not add any attribute on Series or Index.

So I would say that further doc improvements and attempts to reduce technical debt on this are out of scope for this PR.

You are missing the point. We need a serious cleanup in this codebase. There have been quite a few already, so I am certainly not discounting anything you, Tom, or Brock are doing (encouraging it, actually).

But rather than adding workarounds all over the place, it would be much better to tackle the internals. I am not saying it is easy at all, but I constantly question the worth of a PR that fixes something in a tactical way without making any progress towards strategic things.

Sure, this might be an aspirational goal, but without refactorings the codebase becomes less and less approachable and likely produces more bugs.

I applaud that you are fixing some performance issues; that is great. But don't lose sight of the larger issues.

All that said, this is fine.

@jreback merged commit 38ea154 into pandas-dev:master Jan 24, 2020
lumberbot-app (bot) commented Jan 24, 2020

Owee, I'm MrMeeseeks, Look at me.

There seems to be a conflict; please backport manually. Here are approximate instructions:

  1. Check out the backport branch and update it:
     $ git checkout 1.0.x
     $ git pull
  2. Cherry-pick the first parent of this PR on top of the older branch:
     $ git cherry-pick -m1 38ea15482aaef03647e4bac6ccb49a4667a0e2ac
  3. You will likely have some merge/cherry-pick conflicts here; fix them and commit:
     $ git commit -am 'Backport PR #31037: PERF: improve access of .array'
  4. Push to a named branch:
     $ git push YOURFORK 1.0.x:auto-backport-of-pr-31037-on-1.0.x
  5. Create a PR against branch 1.0.x. I would have named this PR:

     "Backport PR #31037 on branch 1.0.x"

     And apply the correct labels and milestones.

Congratulations, you did some good work! Hopefully your backport PR will be tested by the continuous integration and merged soon!

If these instructions are inaccurate, feel free to suggest an improvement.
