-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
PERF: improve access of .array #31037
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PERF: improve access of .array #31037
Conversation
I've been looking at something similar. If we do the same for TimedeltaArray, this would allow us to clean up a lot of code that we currently use for boxing dt64/td64 ndarrays (com.maybe_box_datetimelike, libindex.convert_scalar). So generally +1 on this idea in a separate PR. |
I generally like what's going on here, though as Joris mentioned the proliferation of foo_value is not ideal. It would help if the cleanups could be separated out to limit the scope. If the motivation is the performance of |
It's also the performance of |
To retain the current behavior yes, but the original intention for extract_array didnt do that casting, it was just un-wrapping whatever we already had. It was pretty specifically a replacement for the That said, I do like the idea of making Series._values return DTA/TDA for instead of ndarray (or if necessary a different name), as that would allow de-duplicating a bunch of code. |
But changing that behaviour is not something I want to do for 1.0 I think. So if that is the goal, are you OK with the approach in this PR? |
I think all the relevant methods are private so we retain the option to revisit in the future, so this works for me. |
Updated this to not include the cleaning part. What's left is just for improving |
Thanks Joris, the changes look good. Can you add a comment somewhere documenting the differences between |
Can you indicate what you find missing in the docs? See also #31103 (I split that off from here on request) which included the docstring updates (eg I updated |
pandas/core/indexes/base.py
Outdated
@@ -3923,6 +3923,16 @@ def values(self): | |||
""" | |||
return self._data.view(np.ndarray) | |||
|
|||
@Appender(IndexOpsMixin.array.__doc__) # type: ignore | |||
@property |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this could be a cached_property I think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the Index is immutable but not the underlying, hmm.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the Index is immutable but not the underlying, hmm.
That has always been the case for Index?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We fixed the issue with cache_readonly docs not working right? If so, +1 for using it as long as we're micro-optimizing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when putting them in the order of
@cache_readonly
@Appender(IndexOpsMixin.array.__doc__) # type: ignore
def array(self) -> ExtensionArray:
it seems to work
@@ -208,6 +216,12 @@ def internal_values(self): | |||
""" | |||
return self.values | |||
|
|||
def array_values(self) -> ExtensionArray: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need external_values at all? this is getting very confusing with all of the values
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need external_values at all? this is getting very confusing with all of the values
I clarified the docstrings (in this PR, but split off and already merged in #31103) to make it clear what external_values
(-> .values
), internal_values
(-> _values
) and array_values
(-> .array
) are used for. As long as those all have different behaviour (something that can be gradually cleaned up, but not here), we need a mechanism to create those different results.
What if we override Block.internal_values for DatetimeBlock and TimedeltaBlock to return DTA/TDA in 1D cases? i.e. block.internal_values would behave like block.array_values does in this PR ATM, but without adding a new foo_values. IIUC the main thing this would effect would be Series._values by making it match Index._values, which I view as desirable. |
Thanks, that looks good. |
@jbrockmendel Isn't that in practice the same as what we discussed before on using |
let’s make a function to do this then; having 4 properties to do very similar things is so confusing |
Can you be more specific in what you are suggesting? |
I tried out a branch where I made Block.internal_values return TDA/DTA for TimedeltaBlock/DatetimeBlock. What I learned:
|
I don't think this patch is worth the added complexity in 1.0.0; happy to entertain in 1.1, even there having now 4 internal properties is just too much. need to find a better way. |
If you have a proposal for a simpler way to fix the performance regression, that's very welcome, I am all ears. But I think this PR is really not that complex: it's not adding more properties, it's using similar patterns as we already use, and it's only a 40-lines diff. |
happy for that. also your _values branch @jbrockmendel could be a substitute for this. |
See #31037 (comment). There I timed this to be around 1.8µs (on a total of around 8µs). So it is a smaller, but significant part (but only for pandas arrays of course) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think https://github.com/pandas-dev/pandas/pull/31037/files#r368317201 clears things up nicely for me.
It's a fact that we have inconsistencies between values
, _values
, and array
, and cleaning up those inconsistencies is probably worthwhile. In the meantime, we need a way to refer to each of them (external_values
-> .values
, internal_values
-> _values
, and array_values
-> array
).
pandas/core/indexes/base.py
Outdated
@@ -3923,6 +3923,16 @@ def values(self): | |||
""" | |||
return self._data.view(np.ndarray) | |||
|
|||
@Appender(IndexOpsMixin.array.__doc__) # type: ignore | |||
@property |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We fixed the issue with cache_readonly docs not working right? If so, +1 for using it as long as we're micro-optimizing.
@jbrockmendel what happened to your PR (values?) |
@jbrockmendel thinks those changes are too large for 1.0.0. |
@jorisvandenbossche ok fine. However pls make a BIG GIANT section (in the code somewhere) where you explain exactly what .value, _values, ._ndarray_values, .array_values, .external_values, internal_values, get_values are. |
I already updated the docstrings of |
@jreback can you take a look at the docstring I added for |
i want something central that lists all of these |
What does "central" mean here? https://github.com/pandas-dev/pandas/pull/31037/files#diff-150fd3c5a732ae915ec47bc54a933c41R515 does cover everything I think. |
Maybe this becomes easier if/when we do #31182 for 1.1.0 and can share a docstring between Series._values/Index._values? |
ok that's better, though it still does not cover the internals (e.g. interval_values, external_values), e.g. that's worth a small doc in internals.py. The reason I am being so annoying / pedantic about this, is because this is just creating massive technical debt, that while in the short term looks ok because it allows us to move forward. This is a huge long term problem. The development velocity of pandas is decreasing because of these differences, bugs appear, and its extremely hard for new (or existing folks) to make a pass at the internals. |
And in my turn to be pedantic: I fully disagree with that. If something happened due to this PR and the PR that was split off, is rather a clear net improvement: it made the dependencies between the Series attributes (values, _values, array) with the Block (internal_values, external_values) much clearer and improved the docs on this considerably. Yes, we have a lot of different attributes (values, _values, array, _ndarray_values), and yes this is an unfortunate technical debt, and yes we should improve this (and we have already actively discussed this and have plans for it due to this PR). But this all already exists, this is not caused by this PR, this PR did not add any attribute on Series or Index. So I would say that further doc improvements and attempts to reduce technical debt on this are out of scope for this PR. |
you are missing the point. We need a serious cleanup in this codebase. There have quite some already, so certainly not discounting anything you, tom or brock is doing (enouraging it actually). but rather than adding workaround all over the place would be much better to tackle the internals. Not saying it is easy at all, but I constantly the worth of a PR when to fix something in a tactical way and not make any progress towards strategic things. Sure this might be an aspirational goal, but w/o refactorings, the codebase become less and less approachable and likely produces more bugs. I applaud that you are fixing some performance issues that is great. But don't lose sight of the larger issues. All that said this is fine. |
Owee, I'm MrMeeseeks, Look at me. There seem to be a conflict, please backport manually. Here are approximate instructions:
And apply the correct labels and milestones. Congratulation you did some good work ! Hopefully your backport PR will be tested by the continuous integration and merged soon! If these instruction are inaccurate, feel free to suggest an improvement. |
xref #30790
This improves the performance of
.array
access considerably (and thus ofextract_array
, which eg caused the slowdown in the indexing benchmark). The speed-up depends on the dtype (the numpy dtypes are still quite slow ..).This touches quite some code, but I think this is OK for 1.0.0 (although I mixed a bit of clean-up into it, I can also remove that again).
@jbrockmendel I went the way of adding "yet another array/values like thing", but only on the Block (
Block.array_values()
). I think it might be possible to let_values
return a DatetimeArray instead of datetime64 ndarray for datetime-like data, but that's something that needs more investigation, and also, even if possible, not something I would like to do on 1.0.0, but only for 1.1.On master:
With this PR: