
BUG: Index.union with both bools and ints, duplicates #44000


Open
3 tasks done
jbrockmendel opened this issue Oct 12, 2021 · 3 comments
Labels
Bug; Index (Related to the Index class or subclasses); setops (union, intersection, difference, symmetric_difference)

Comments

@jbrockmendel
Member

jbrockmendel commented Oct 12, 2021

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the master branch of pandas.

Reproducible Example

>>> import pandas as pd
>>> idx = pd.Index([1, True, 0, False])
>>> idx2 = idx[1:]
>>> idx.union(idx2)
Index([0, 0, 1, 1], dtype='object')

Issue Description

Best guess is union_with_duplicates cc @phofl

Breaks at least one test in #43930

Expected Behavior

Index([1, True, 0, False], dtype=object)

@jbrockmendel added the Bug and Needs Triage labels Oct 12, 2021
@CloseChoice
Member

CloseChoice commented Oct 13, 2021

The problem here is the value_counts function, which is called inside union_with_duplicates:

>>> import pandas as pd
>>> pd.value_counts([0, True, 1])
True    2
0       1
dtype: int64
>>> pd.value_counts([0, 1, True])
1    2
0    1
dtype: int64

Edit:
I dug a bit deeper: in this case value_counts calls the value_count_object function. It looks to me like the bug needs to be fixed in the linked function. Looking even deeper, it seems that the linked line casts 1 to True if True is already in the table, and vice versa.
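The order dependence described above can be reproduced without pandas. The following is only a pure-Python analogy, not the pandas hashtable code: `count_values` is a hypothetical helper that counts with a plain dict. Because `True == 1` and `hash(True) == hash(1)`, both values land in the same slot, and the dict keeps whichever key was inserted first.

```python
def count_values(values):
    """Count occurrences with a plain dict (hypothetical helper, not pandas code)."""
    counts = {}
    for v in values:
        # True and 1 compare equal and hash equal, so they share one entry;
        # the key that was stored first is the one that survives.
        counts[v] = counts.get(v, 0) + 1
    return counts


bool_first = count_values([0, True, 1])
int_first = count_values([0, 1, True])

print(bool_first)  # {0: 1, True: 2}
print(int_first)   # {0: 1, 1: 2}
```

The surviving key mirrors the two value_counts outputs shown earlier: the label is `True` when the bool comes first and `1` when the int comes first.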

@CloseChoice added the Algos label Oct 13, 2021
@mroeschke added the Index and setops labels and removed the Algos and Needs Triage labels Oct 16, 2021
@jorisvandenbossche
Member

There are several other places where mixed ints and bools are treated as interchangeable:

# unique doesn't distinguish, but result depends on first encountered value
In [12]: pd.Index([True, 1]).unique()
Out[12]: Index([True], dtype='object')

In [13]: pd.Index([1, True]).unique()
Out[13]: Index([1], dtype='object')

# indexing doesn't distinguish True and 1, and will find both
In [15]: pd.Index([1, True]).get_loc(1)
Out[15]: slice(0, 2, None)

In [16]: pd.Index([1, True]).get_loc(True)
Out[16]: slice(0, 2, None)

So unless we want to distinguish True and 1 in all those cases, I am not sure the example in the top post is actually "wrong".
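For illustration only, here is a minimal sketch of what "distinguishing True and 1" could mean in plain Python, assuming one keyed each value on its type as well as its value. The helper name `unique_type_aware` is hypothetical and this is not how pandas implements `unique`; it only shows the behavior such a choice would produce.

```python
def unique_type_aware(values):
    """Deduplicate while treating 1 and True (or 0 and False) as distinct.

    Hypothetical helper for illustration; not part of pandas.
    """
    seen = set()
    result = []
    for v in values:
        # Pairing the value with its type makes (int, 1) and (bool, True)
        # different keys even though 1 == True.
        key = (type(v), v)
        if key not in seen:
            seen.add(key)
            result.append(v)
    return result


deduped = unique_type_aware([1, True, 0, False, True, 1])
print(deduped)  # [1, True, 0, False]
```

Applying the same rule everywhere would change the `unique` and `get_loc` results above as well, which is the consistency question raised in this comment.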

@jbrockmendel
Member Author

> So unless we want to distinguish True and 1 in all those cases, I am not sure the example in the top post is actually "wrong".

Fair enough. The OP was an attempt at boiling down a more definitely-wrong behavior:

idx = pd.Index([True, False, True, False])
idx2 = pd.Index([0, 0, 1, 1, 2, 2])

>>> idx.union(idx2)
[...]
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
