Skip to content

BUG: raise on RangeIndex.array #40022

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 27 commits into from
Apr 8, 2021

Conversation

jbrockmendel
Copy link
Member

  • closes #xxxx
  • tests added / passed
  • Ensure all linting tests pass, see here for how to run them
  • whatsnew entry

xerf #37277 cc @TomAugspurger this is failing one of the dask tests locally, looks like dask is calling index.array. Thoughts on how to handle this?

Might be worth folding the RangeIndex check into a keyword in extract_array.

@jreback
Copy link
Contributor

jreback commented Feb 24, 2021

Might be worth folding the RangeIndex check into a keyword in extract_array.

yes please as we already special case MI there.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah let's fold this into extract_array

@jreback jreback added the Compat pandas objects compatability with Numpy or Python functions label Feb 24, 2021
@TomAugspurger
Copy link
Contributor

This can and should go through a deprecation warning IMO.

@jorisvandenbossche
Copy link
Member

This can and should go through a deprecation warning IMO.

Agreed.

In addition, are we sure that we actually want to raise an error? Eg .values also works. And if for some reason you need an array, it seems a bit annoying to need to check for RangeIndex.

(the public .array API can be different than extract_array)

@jbrockmendel
Copy link
Member Author

This can and should go through a deprecation warning IMO.

Agreed.

In addition, are we sure that we actually want to raise an error? Eg .values also works. And if for some reason you need an array, it seems a bit annoying to need to check for RangeIndex.

The only strong opinion i have on this is that RangeIndex behavior should match MultiIndex behavior.

@TomAugspurger
Copy link
Contributor

In addition, are we sure that we actually want to raise an error?

I'm not sure. I think it's surprising that RangeIndex.array allocates memory though. I suppose we could implement a RangeArray extension array? And it wouldn't implement __setitem__, so we can support the use case where a user wants read-only access to the array, but defer raising till they do an unsupported operation?

@jorisvandenbossche
Copy link
Member

There are many operations on RangeIndex that do allocate memory though. For me, I see RangeIndex a bit like Int64Index but without materializing "as long as possible". And when asking for its array representation, it is no longer possible.

The only strong opinion i have on this is that RangeIndex behavior should match MultiIndex behavior.

@jbrockmendel Can you explain why you want this? A practical reason? Or conceptually since both are not directly backed by a single array? (on which I would say: that's true, but RangeIndex still "represents" a single array, while MultiIndex does not. So I think there are also reasons to give them different behaviour)

@jbrockmendel
Copy link
Member Author

I suppose we could implement a RangeArray extension array?

id like to see this. among other things it would allows set_index(rangecol).reset_index() to round-trip nicely

The only strong opinion i have on this is that RangeIndex behavior should match MultiIndex behavior.

@jbrockmendel Can you explain why you want this? A practical reason? Or conceptually since both are not directly backed by a single array? (on which I would say: that's true, but RangeIndex still "represents" a single array, while MultiIndex does not. So I think there are also reasons to give them different behaviour)

(At least until there is a RangeArray) RangeIndex doesn't represent a single array. As you mentioned above, it has .values, but if that is the standard, then we should do the same for MultiIndex.

(I actually dont like the .array property in most cases, pretty much always find ._values more useful)

I am more concerned with extract_array than with .array, and extract_array definitely shouldn't be allocating an entirely new array.

@jorisvandenbossche
Copy link
Member

I am more concerned with extract_array than with .array, and extract_array definitely shouldn't be allocating an entirely new array.

Yes, as mentioned above, IMO the two don't necessarily need to do the same. I certainly follow that extract_array shouldn't allocate a new array.

(I actually dont like the .array property in most cases, pretty much always find ._values more useful)

The .array us public API, and being able to use array-like values (eg to circumvent alignment) reliably is nice. For this use case, having RangeIndex raise seems more annoying than useful? (of course, you still have the memory allocation you might not expect)

I suppose we could implement a RangeArray extension array?

id like to see this. among other things it would allows set_index(rangecol).reset_index() to round-trip nicely

Let's open a separate issue to discuss this if people would like to see this.
(I am personally not convinced that we should add it: if we have it, it needs to be a proper dtype and allowed in columns, and IMO it is not needed to support a "range" dtype)

Co-authored-by: Joris Van den Bossche <[email protected]>
@pep8speaks
Copy link

pep8speaks commented Mar 8, 2021

Hello @jbrockmendel! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-04-07 17:00:09 UTC

@jbrockmendel
Copy link
Member Author

Should be ready for another look

@jorisvandenbossche
Copy link
Member

Apart from #40022 (comment), not further comments from my side

@jreback
Copy link
Contributor

jreback commented Mar 30, 2021

can you merge master. i will look at this soon.

@jbrockmendel
Copy link
Member Author

rebased + greenish

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you rebase one more time, ping on green

@@ -367,9 +370,12 @@ def nargsort(
mask=mask,
)

items = extract_array(items)
if isinstance(items, ABCRangeIndex):
return items.argsort(ascending=ascending) # TODO: test coverage with key?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm if you use extract_range on L376 i think you can just fall thru no? (or I guess need L380 to change), anyhow for the future

@jreback jreback added this to the 1.3 milestone Apr 2, 2021
@jreback
Copy link
Contributor

jreback commented Apr 2, 2021

also do you need a whatnew? e.g. what exactly is the bug

@jbrockmendel
Copy link
Member Author

also do you need a whatnew? e.g. what exactly is the bug

there shouldnt be anything user-facing

@jreback jreback merged commit 36b4990 into pandas-dev:master Apr 8, 2021
@jbrockmendel jbrockmendel deleted the util-extract_array branch April 8, 2021 01:15
JulianWgs pushed a commit to JulianWgs/pandas that referenced this pull request Jul 3, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Compat pandas objects compatability with Numpy or Python functions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants