BUG: merging with a boolean/int categorical column #17187

Closed
lvphj opened this issue Aug 7, 2017 · 2 comments · Fixed by #17841
Labels: Bug; Categorical (Categorical Data Type); Reshaping (Concat, Merge/Join, Stack/Unstack, Explode)

lvphj commented Aug 7, 2017

Code Sample, a copy-pastable example if possible

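import pandas as pd

dfA = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'colA': [3, 4, 2, 4, 3, 4, 5, 4, 5, 6],
    'colB': [7, 6, 5, 6, 5, 7, 8, 7, 6, 7],
    'colC': [False, True, True, False, False, True, False, True, True, True],
})
# Convert colC to an ordered categorical (astype keywords as accepted by pandas 0.20.x)
dfA['colC'] = dfA['colC'].astype('category', categories=[True, False], ordered=True)
dfB = pd.DataFrame({'id': [2, 5, 7, 8], 'colD': [1, 9, 7, 3]})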

print("Before\n====")
print('dfA dtypes\n------')
print(dfA.dtypes)
print('\ndfA\n---')
print(dfA)
print('\ndfB\n---')
print(dfB)

dfA = pd.merge(left=dfA,right=dfB,how='left',on='id')
print("\nAfter\n=====")
print(dfA)

Problem description

This question was asked on Stack Overflow at https://stackoverflow.com/questions/45538092/merging-pandas-dataframes-containing-a-categorical-variable-fails-with-valueerr, where it was suggested that the behaviour is a bug.

Two dataframes containing different columns can be combined using the pandas.merge() function. This normally works well, but in the above example, converting one of the columns of the left dataframe to a categorical variable causes the merge to fail with this error:

/Users/.../env3/lib/python3.4/site-packages/pandas/core/internals.py in __init__(self, values, placement, ndim, fastpath)
    104             ndim = values.ndim
    105         elif values.ndim != ndim:
--> 106             raise ValueError('Wrong number of dimensions')
    107         self.ndim = ndim
    108 

ValueError: Wrong number of dimensions

Checking df.ndim (an attribute, not a method) shows that both dataframes have 2 dimensions, so the inputs themselves are not the source of the dimension mismatch.
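For anyone repeating that check, here is a minimal sketch of it. The names df_a and df_b are stand-ins for the frames in the report, cut down to a few rows; the point is only that ndim is read without parentheses.

```python
import pandas as pd

# Cut-down stand-ins for dfA and dfB from the report (hypothetical names).
df_a = pd.DataFrame({"id": [1, 2, 3], "colC": [False, True, True]})
df_b = pd.DataFrame({"id": [2], "colD": [1]})

# ndim is a plain attribute; calling it as df_a.ndim() would raise a TypeError.
dims = (df_a.ndim, df_b.ndim)
print(dims)
```

Both values are 2 for any DataFrame, which is why the check does not help to localise the failure.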

Expected Output

The expected output can be generated simply by commenting out the second line in the above code, the line that converts one of the columns to a categorical variable.

   colA  colB   colC  id  colD
0     3     7  False   1   NaN
1     4     6   True   2   1.0
2     2     5   True   3   NaN
3     4     6  False   4   NaN
4     3     5  False   5   9.0
5     4     7   True   6   NaN
6     5     8  False   7   7.0
7     4     7   True   8   3.0
8     5     6   True   9   NaN
9     6     7   True  10   NaN
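A possible workaround, not suggested anywhere in this thread and offered only as a sketch, is to merge while colC is still a plain boolean column and convert it to a categorical afterwards. The snippet assumes pandas 0.21 or later, where CategoricalDtype accepts categories and ordered; the names df_a, df_b, and bool_cat are illustrative.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

df_a = pd.DataFrame({
    "id": list(range(1, 11)),
    "colA": [3, 4, 2, 4, 3, 4, 5, 4, 5, 6],
    "colB": [7, 6, 5, 6, 5, 7, 8, 7, 6, 7],
    "colC": [False, True, True, False, False, True, False, True, True, True],
})
df_b = pd.DataFrame({"id": [2, 5, 7, 8], "colD": [1, 9, 7, 3]})

# Merge first, while colC is still an ordinary bool column.
merged = pd.merge(left=df_a, right=df_b, how="left", on="id")

# Convert to an ordered categorical only after the merge.
# Assumption: CategoricalDtype(categories=..., ordered=...) is available (pandas >= 0.21).
bool_cat = CategoricalDtype(categories=[True, False], ordered=True)
merged["colC"] = merged["colC"].astype(bool_cat)

print(merged.dtypes)
```

This sidesteps the failing code path rather than fixing it, so it is only useful on versions that predate the fix.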

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.4.1.final.0
python-bits: 64
OS: Darwin
OS-release: 15.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: en_GB.UTF-8

pandas: 0.20.1
pytest: None
pip: 9.0.1
setuptools: 34.1.0
Cython: None
numpy: 1.12.1
scipy: 0.16.1
xarray: None
IPython: 4.1.1
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 1.5.3
openpyxl: 2.4.7
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: 0.7.11.None
psycopg2: None
jinja2: 2.8
s3fs: None
pandas_gbq: None
pandas_datareader: None

@gfyoung gfyoung added Categorical Categorical Data Type Bug labels Aug 7, 2017
gfyoung (Member) commented Aug 7, 2017

@lvphj : Thanks for the issue! I don't see why this shouldn't work given that only the dtype changed. PR to patch this is welcome!

jreback (Contributor) commented Aug 7, 2017

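Something odd is going on in the actual merge. If we replace the boolean categories with strings, it works; with ints, the merge returns a result but reports a buffer dtype mismatch from is_bool_array, which looks like some odd type mapping.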

In [24]: pd.merge(left=dfA.assign(colC=dfA.colC.cat.rename_categories(['foo', 'bar'])),right=dfB,how='left',on='id')
Out[24]: 
   colA  colB colC  id  colD
0     3     7  bar   1   NaN
1     4     6  foo   2   1.0
2     2     5  foo   3   NaN
3     4     6  bar   4   NaN
4     3     5  bar   5   9.0
5     4     7  foo   6   NaN
6     5     8  bar   7   7.0
7     4     7  foo   8   3.0
8     5     6  foo   9   NaN
9     6     7  foo  10   NaN

In [25]: pd.merge(left=dfA.assign(colC=dfA.colC.cat.rename_categories([1,2])),right=dfB,how='left',on='id')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
ValueError: Buffer dtype mismatch, expected 'Python object' but got 'long'
Exception ignored in: 'pandas._libs.lib.is_bool_array'
ValueError: Buffer dtype mismatch, expected 'Python object' but got 'long'
Out[25]: 
   colA  colB colC  id  colD
0     3     7    2   1   NaN
1     4     6    1   2   1.0
2     2     5    1   3   NaN
3     4     6    2   4   NaN
4     3     5    2   5   9.0
5     4     7    1   6   NaN
6     5     8    2   7   7.0
7     4     7    1   8   3.0
8     5     6    1   9   NaN
9     6     7    1  10   NaN
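The three cases above (boolean, string, and integer categories) can be turned into a small check to run against a pandas build that includes the fix. This is a sketch under assumptions, not the test added by the linked pull request: it uses CategoricalDtype in place of the 0.20-era astype keywords, and the names flags, variants, and results are made up for illustration.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

flags = [False, True, True, False, False, True, False, True, True, True]
df_a = pd.DataFrame({"id": list(range(1, 11)), "colC": flags})
df_a["colC"] = df_a["colC"].astype(CategoricalDtype([True, False], ordered=True))
df_b = pd.DataFrame({"id": [2, 5, 7, 8], "colD": [1, 9, 7, 3]})

# Same column with boolean, string, and integer categories,
# mirroring the rename_categories calls in the comment above.
variants = {
    "bool": df_a,
    "str": df_a.assign(colC=df_a["colC"].cat.rename_categories(["foo", "bar"])),
    "int": df_a.assign(colC=df_a["colC"].cat.rename_categories([1, 2])),
}

# A left merge on id should succeed for every variant.
results = {
    name: pd.merge(left=frame, right=df_b, how="left", on="id")
    for name, frame in variants.items()
}

for name, merged in results.items():
    print(name, merged["colC"].dtype, merged["colC"].tolist()[:3])
```

On an affected version, the "bool" variant would be expected to raise the ValueError from the report, so running this is one way to tell whether a given install has the fix.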

@jreback jreback added this to the Interesting Issues milestone Aug 7, 2017
@jreback jreback added Difficulty Intermediate Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Aug 7, 2017
@jreback jreback changed the title from Merging Pandas dataframes containing a categorical variable fails with “ValueError: Wrong number of dimensions” to BUG: merging with a boolean/int categorical column Aug 7, 2017
jdrudolph pushed a commit to jdrudolph/pandas that referenced this issue Oct 10, 2017
@jreback jreback modified the milestones: Interesting Issues, 0.21.0 Oct 14, 2017
jreback pushed a commit that referenced this issue Oct 14, 2017
* BUG: merging with a boolean/int categorical column #17187
ghost pushed a commit to reef-technologies/pandas that referenced this issue Oct 16, 2017
alanbato pushed a commit to alanbato/pandas that referenced this issue Nov 10, 2017
No-Stream pushed a commit to No-Stream/pandas that referenced this issue Nov 28, 2017