Skip to content

REGR: Index.union preserve numerical index for range/float and int/float combinations #26778

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
jorisvandenbossche opened this issue Jun 11, 2019 · 13 comments · Fixed by #27034
Closed
Labels
Blocker Blocking issue or pull request for an upcoming release Index Related to the Index class or subclasses
Milestone

Comments

@jorisvandenbossche
Copy link
Member

Recent change in master:

In [11]: index = pd.RangeIndex(0, 10)

In [12]: index.union(pd.Index([0.5, 1.5]))
Out[12]: Index([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0], dtype='object')

the dtype should be float.

cc @topper-123

@topper-123
Copy link
Contributor

Yeah, I'll look into it tonight

@jreback
Copy link
Contributor

jreback commented Jun 11, 2019

this follows our general rule which is that dissimilar cast to object

are you proposing a rule that we cast to float if float and numbering ?

@jorisvandenbossche
Copy link
Member Author

jorisvandenbossche commented Jun 11, 2019

this follows our general rule which is that dissimilar cast to object

That is not the "general" rule, in case of numerical indices. Eg (released pandas):

In [1]: pd.Index([1, 2, 3]).union(pd.Index([.5, 1.5]))
Out[1]: Float64Index([0.5, 1.0, 1.5, 2.0, 3.0], dtype='float64')

Or at least, was not the general rule, as I see that the above example also changed on master. So @topper-123 it is not specific to RangeIndex (I was thinking it might have been related to some of the changes to it you have been doing recently).

are you proposing a rule that we cast to float if float and numbering ?

For me, that seems very logical that int + float = float (for set ops), and this is also what we have been doing up to recently.

@jorisvandenbossche
Copy link
Member Author

jorisvandenbossche commented Jun 11, 2019

Possibly related to #23538

@jorisvandenbossche jorisvandenbossche added this to the 0.25.0 milestone Jun 11, 2019
@topper-123
Copy link
Contributor

If this has been done by me, it has not been on purpose. I agrre It is more logical and more efficient that unions of numerical indices return a numerical index rather than an object index

@jreback
Copy link
Contributor

jreback commented Jun 11, 2019

http://pandas-docs.github.io/pandas-docs-travis/whatsnew/v0.25.0.html#incompatible-index-type-unions

not averse to coercing to float if we have a float for one
we already coerce between range / int indices

@jorisvandenbossche
Copy link
Member Author

Yes, that's the PR I was pointing at.
Updating the issue to reflect the discussion.

@jorisvandenbossche jorisvandenbossche changed the title REGR: RangeIndex.union converts to object dtype instead of float REGR: Index.union preserve numerical index for range/float int/float combinations Jun 11, 2019
@jorisvandenbossche jorisvandenbossche changed the title REGR: Index.union preserve numerical index for range/float int/float combinations REGR: Index.union preserve numerical index for range/float and int/float combinations Jun 11, 2019
@jorisvandenbossche jorisvandenbossche added Blocker Blocking issue or pull request for an upcoming release Index Related to the Index class or subclasses labels Jun 11, 2019
@topper-123
Copy link
Contributor

I've bisected this and the cause of the change shown by @jorisvandenbossche is indeed #23538 (commit 20d0ad1).

I also agree this is a blocker a s this should not return a object dtype, as that will likely cause slowness and cause memory-usage to increase.

@jorisvandenbossche
Copy link
Member Author

cc ArtinSarraf would you be interested to look into this?

@TomAugspurger
Copy link
Contributor

Int64Index.union(Float64Index) -> float64 introduces a slight disagreement with how we decided to handle union(int, uint) in that PR. I think we have a choice between

  1. dtype stability of the output
  2. full precision
  3. speed

We can pick two.

With union(int64, uint64) -> object we chose 1 and 2, sacrificing speed.

With union(int64, float64) -> float64 we're choosing 1 and 3, sacrificing precision of integers that cannot be represented as floating point values.

Is that what we want? With index labels, perhaps full precision is more important?

@jorisvandenbossche
Copy link
Member Author

Personally, I think the easiest logic to explain would be to use the "normal" dtype promotion rules that we also use in eg Index/Series.append or concat.

A user can always cast to object to retain full information.

I understand the argument that for index the exact value is more important. But shouldn't that then be the same for Index.append as well? And where to draw the line? Also object for int + float?

@TomAugspurger
Copy link
Contributor

Yeah, it's fuzzy, especially when we consider backwards compatibility. My preference long-term would for append, union, etc. to preserve labels.

You wouldn't normally expect an append to change other values in a container:

In [34]: a = pd.UInt64Index([2 ** 64 - 1])

In [35]: a.append(pd.Index([0.1]))
Out[35]: Float64Index([1.8446744073709552e+19, 0.1], dtype='float64')

But, for now we care about backwards compatibility. To match 0.24.x, we should have

  1. union(int, float) -> float
  2. union(unit, int) -> object
    • In 0.24.x, this was uint64 if possible to safely cast, object otherwise. Just doing object for now.

A user can always cast to object to retain full information.

I'm worried that's an unsafe default, though if we've gotten this far, maybe it's not such a big deal.

@TomAugspurger
Copy link
Contributor

OK here's my proposal.

left right output of left.union(right)
int float float64
int uint object
float uint float64

AFAICT the changes from 0.24.x are that

  1. Int64Index | UInt64Index -> Index[object] always, rather than if there's overflow
  2. (left | right).dtype == (right | left).dtype (the original PR)

I think that's fine for 0.25. But I would be OK with revisiting things to make them consistent across dtypes.

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Jun 25, 2019
This restores the 0.24.x behavior of Index.union(other) between
Float and (U)Int indexes. These are now floating dtype.

left  | right | output of left.union(right)
----- | ----- | ------
int   |float  | float64
int   |uint   | object
float | uint  | float64

pandas-dev#26778 (comment)

Closes pandas-dev#26778
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Blocker Blocking issue or pull request for an upcoming release Index Related to the Index class or subclasses
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants