BUG: indexing with boolean array and categoricals #12861

Closed
jreback opened this issue Apr 11, 2016 · 2 comments
Labels
Bug; Categorical (Categorical Data Type); Indexing (Related to indexing on series/frames, not to indexes themselves)

Comments

@jreback
Contributor

jreback commented Apr 11, 2016

xref #12857

In [24]: df2 = df.apply(lambda x: x.astype('category',categories=(np.sort(pd.unique(df.values.ravel())))))

In [25]: df2
Out[25]: 
   A  B
0  a  b
1  a  c
2  b  d
3  c  a
4  d  a
5  a  e

after #12564, the boolean array is ok

In [28]: df2=='d'
Out[28]: 
       A      B
0  False  False
1  False  False
2  False   True
3  False  False
4   True  False
5  False  False

but indexing is broken

In [26]: df2[df2=='c']
ValueError: Wrong number of dimensions
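
For anyone trying to reproduce this without the original `df`: the frame below is reconstructed from the printed `Out[25]` above, so it is an assumption rather than the original data. The `astype('category', categories=...)` form used in the report was removed in later pandas releases, so this sketch uses `pd.CategoricalDtype` instead. It does not claim what the final indexing step does, since that depends on the pandas version.

```python
import numpy as np
import pandas as pd

# Assumption: values copied from the Out[25] display in the report.
df = pd.DataFrame({"A": list("aabcda"), "B": list("bcdaae")})

# Shared, sorted categories across all columns, as in the original snippet.
categories = np.sort(pd.unique(df.to_numpy().ravel()))
df2 = df.astype(pd.CategoricalDtype(categories=categories))

# The comparison that was reported as working.
mask = df2 == "d"
print(mask)

# The step reported as broken; capture whatever happens on this version.
try:
    outcome = df2[df2 == "c"]
except Exception as exc:
    outcome = exc
print(outcome)
```

If `outcome` prints as a frame, the indexing works on the installed version; if it prints as an exception, the problem is still reproducible.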
@jreback added the Bug, Indexing, Categorical, and Difficulty Intermediate labels Apr 11, 2016
@jreback added this to the 0.18.1 milestone Apr 11, 2016
@cchrysostomou

I have been looking into this bug and found a change in internals.py that avoids the ValueError and appears to slice as expected, but I am not sure it is the safest change to make. I made it after noticing that the transpose of each block in the block manager did not have the dimensions expected from the boolean condition DataFrame.

Here's what I did:
In internals.py -> where():

  1. At line 1186, rather than transposing values, I transposed cond:
        if transpose:
            # values = values.T
            cond = cond.T
  2. Since values is no longer transposed, I commented out lines 1234-1235, where result is transposed back:
        if self._can_hold_na or self.ndim == 1:

            #if transpose:
            #    result = result.T

            # try to cast if requested
            if try_cast:
                result = self._try_cast_result(result)
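
For plain 2-D NumPy arrays, the two approaches (transpose `values` and transpose the result back, versus transpose only `cond`) give the same output. The sketch below shows just that array identity; the shapes and names are hypothetical stand-ins for what `where()` receives, and it says nothing about whether the change is safe for categorical blocks.

```python
import numpy as np

# Hypothetical stand-ins: block values laid out as (n_columns, n_rows),
# and the condition in frame layout (n_rows, n_columns).
values = np.arange(12).reshape(2, 6)
cond = (values.T % 2) == 0
other = -1

# Original approach: transpose values, apply where, transpose back.
via_values = np.where(cond, values.T, other).T

# Patched approach: transpose cond only; the result needs no transpose.
via_cond = np.where(cond.T, values, other)

print(np.array_equal(via_values, via_cond))
```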

This did not solve the issue completely: I found additional bugs where the categorical DataFrame does not keep its category dtype and is converted to object instead.

First, with the changes above applied, run the following:
df = pd.DataFrame([list('ddddedddddddddcdddddddf'),list('cddddddddddddddddddddde')])
df2 = df.apply(lambda x: x.astype('category',categories=np.sort(pd.unique(df.values.ravel()))))
df[df=='d']  # this still works as expected
df2[df2=='d']  # this now returns a "mixture" of objects and categories
df2.info()

I traced this to _where_standard() in expressions.py:

np.where(_values_from_object(cond), _values_from_object(a),
                    _values_from_object(b))

If "a" is categorical, the result of this call is an object array rather than a categorical.
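
The dtype loss can be seen in isolation, outside of pandas internals. This is a minimal sketch with a made-up `Categorical`; `np.asarray` stands in for `_values_from_object`, which is an assumption about what that helper amounts to here. The last line prints the dtype from `Series.where` for comparison, without claiming what it is on any given pandas version.

```python
import numpy as np
import pandas as pd

# Hypothetical small input standing in for "a" and "cond".
a = pd.Categorical(list("cdde"), categories=list("cde"))
cond = np.array([True, False, True, False])

# Converting to an ndarray drops the categorical dtype before np.where runs.
result = np.where(cond, np.asarray(a), np.nan)
print(result.dtype)  # object

# For comparison: the dtype produced when the operation stays inside pandas.
print(pd.Series(a).where(cond).dtype)
```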

Second, I noticed a similar bug when transposing categorical data (the result comes back as object dtype):
df = pd.DataFrame([list('ddddedddddddddcdddddddf'),list('cddddddddddddddddddddde')])
df2 = df.apply(lambda x: x.astype('category',categories=np.sort(pd.unique(df.values.ravel()))))
df2.info()  # data format is categories
df2.transpose().transpose().info()  # data format becomes objects
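
A smaller check of the same round trip, comparing dtypes directly instead of reading `info()` output. The frame is a made-up example, and it uses `pd.CategoricalDtype` because the `astype('category', categories=...)` form no longer exists in later pandas releases. The printed dtypes show whether the round trip preserves `category` on the installed version.

```python
import pandas as pd

# Hypothetical small frame with one shared categorical dtype for all columns.
dtype = pd.CategoricalDtype(categories=list("cdef"))
df2 = pd.DataFrame([list("dde"), list("cdf")]).astype(dtype)

round_trip = df2.transpose().transpose()

# Compare column dtypes before and after the double transpose.
print(df2.dtypes.unique())
print(round_trip.dtypes.unique())
```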

@jreback modified the milestones: 0.18.1, 0.18.2 Apr 26, 2016
@jorisvandenbossche modified the milestones: 0.20.0, 0.19.0 Sep 1, 2016
@jreback modified the milestones: 0.20.0, Next Major Release Mar 23, 2017
@mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
@mroeschke
Member

It is pretty difficult to see the buggy behavior without the original DataFrame, so closing; this can be reopened at a later date.
