Skip to content

BUG: merge with categoricals does not preserve categories dtype #10409

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
amelio-vazquez-reina opened this issue Jun 23, 2015 · 8 comments
Closed
Assignees
Labels
API Design Categorical Categorical Data Type Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Milestone

Comments

@amelio-vazquez-reina
Copy link
Contributor

amelio-vazquez-reina commented Jun 23, 2015

xref #14351

None of the following merge operations retain the category types. Is this expected? How can I keep them?

Merging on a category type:

Consider the following:

A = pd.DataFrame({'X': np.random.choice(['foo', 'bar'],size=(10,)), 
                  'Y': np.random.choice(['one', 'two', 'three'], size=(10,))})
A['X'] = A['X'].astype('category')

B = pd.DataFrame({'X': np.random.choice(['foo', 'bar'],size=(10,)), 
                  'Z': np.random.choice(['jjj', 'kkk', 'sss'], size=(10,))})
B['X'] = B['X'].astype('category')

if I do the merge, we end up with:

> pd.merge(A, B, on='X').dtypes 
X    object
Y    object
Z    object
dtype: object

Merging on a non-category type:

A = pd.DataFrame({'X': np.random.choice(['foo', 'bar'],size=(10,)), 
                  'Y': np.random.choice(['one', 'two', 'three'], size=(10,))})
A['Y'] = A['Y'].astype('category')

B = pd.DataFrame({'X': np.random.choice(['foo', 'bar'],size=(10,)), 
                  'Z': np.random.choice(['jjj', 'kkk', 'sss'], size=(10,))})
B['Z'] = B['Z'].astype('category')

if I do the merge, we end up with:

pd.merge(A, B, on='X').dtypes
X    object
Y    object
Z    object
dtype: object
@amelio-vazquez-reina amelio-vazquez-reina changed the title Merging with categories Merging with categories. Type retention Jun 23, 2015
@jreback
Copy link
Contributor

jreback commented Jun 23, 2015

ok, this works for a simple concat, but merging will not preserve the categoricals. Needs some post processing logic. Keep in mind this will ONLY work for identically categorized blocks. I think we originally left this out and didn't automagically recategorize, but let me see what I can do.

@jreback jreback added Reshaping Concat, Merge/Join, Stack/Unstack, Explode API Design Categorical Categorical Data Type Difficulty Advanced labels Jun 23, 2015
@jreback jreback added this to the 0.17.0 milestone Jun 23, 2015
@wikiped
Copy link

wikiped commented Jun 25, 2015

@jreback

Keep in mind this will ONLY work for identically categorized blocks.

Do you mean that attempting to merge on Categorical with one of categories missing should normally fail like this:

A = pd.DataFrame({'X': np.random.choice(['foo', 'bar'],size=(5,)), 
                  'Y': np.random.choice(['one', 'two', 'three'], size=(5,))})
A['X'] = A['X'].astype('category')

B = pd.DataFrame({'X': np.random.choice(['foo', 'baz'],size=(5,)),   # baz instead of bar
                  'Z': np.random.choice(['jjj', 'kkk', 'sss'], size=(5,))})
B['X'] = B['X'].astype('category')

pd.merge(A, B, on='X', how='outer')

Returning error:

d:\Anaconda\envs\py2k\lib\site-packages\pandas\core\common.pyc in take_nd(arr, indexer, axis, out, fill_value, mask_info, allow_fill)
    829         out_shape[axis] = len(indexer)
    830         out_shape = tuple(out_shape)
--> 831         if arr.flags.f_contiguous and axis == arr.ndim - 1:
    832             # minor tweak that can make an order-of-magnitude difference
    833             # for dataframes initialized directly from 2-d ndarrays

AttributeError: 'Categorical' object has no attribute 'flags'

Perhaps it is linked to issue #10374...

Ideally this would return a new Categorical with inserted NaNs.

The above was tried on pandas 0.16.2

@jreback
Copy link
Contributor

jreback commented Jun 25, 2015

well aside from the bug in #10324

  • An ordered Categorical merge should raise as the new ordering would be impossible to determine.
  • A merge on a Categorical and an object column would be ok with nan for any non-represented categories.
  • A merge with 2 different Categoricals I suppose could succeed with a union of the categoricals.

@jreback jreback self-assigned this Jul 7, 2015
@joshlk
Copy link

joshlk commented Jul 7, 2015

Additionally the performance benefits of the underlying int representation of the categories are lost when doing the merge. For example:

# Here we are joining on a int column and appending two string columns
A = pd.DataFrame({'X': np.random.choice(range(0, 10), size=(100000,)), 
                                'Y': np.random.choice(['one', 'two', 'three'], size=(100000,))})

B = pd.DataFrame({'X': np.random.choice(range(0, 10), size=(100000,)), 
                                'Z': np.random.choice(['jjj', 'kkk', 'sss'], size=(100000,))})

%time r = pd.merge(A, B, on='X')
Wall time: 57.3 s

# Lets do the same but convert the appending columns 'Y' and 'Z' to a category
A['Y'] = A['Y'].astype('category')
B['Z'] = B['Z'].astype('category')

%time r = pd.merge(A, B, on='X')
Wall time: 54.9 s          # Slightly less

# Now lets use the int representation of the categorical columns and merge
A['Y'] = A['Y'].cat.codes
B['Z'] = B['Z'].cat.codes

%time r = pd.merge(A, B, on='X')
Wall time: 41.7 s           # 72% of the original time taken

@jreback
Copy link
Contributor

jreback commented Jul 7, 2015

@joshlk all true

if u would like to try an impl would be great
I will get to this but prob not for a little bit

@littlegreenbean33
Copy link

Hi,

is there any plans to solve this issue ? losing categorical types when merging data frames ?

Thanks.

JCG

@jorisvandenbossche jorisvandenbossche changed the title Merging with categories. Type retention BUG: merge with categoricals does not preserve categories dtype Feb 6, 2017
jreback added a commit to jreback/pandas that referenced this issue Feb 6, 2017
jreback added a commit to jreback/pandas that referenced this issue Feb 6, 2017
jreback added a commit to jreback/pandas that referenced this issue Feb 7, 2017
jreback added a commit to jreback/pandas that referenced this issue Feb 7, 2017
jreback added a commit to jreback/pandas that referenced this issue Feb 9, 2017
jreback added a commit to jreback/pandas that referenced this issue Feb 10, 2017
jreback added a commit to jreback/pandas that referenced this issue Feb 10, 2017
jreback added a commit to jreback/pandas that referenced this issue Feb 12, 2017
jreback added a commit to jreback/pandas that referenced this issue Feb 15, 2017
jreback added a commit to jreback/pandas that referenced this issue Feb 15, 2017
jreback added a commit to jreback/pandas that referenced this issue Feb 15, 2017
jreback added a commit to jreback/pandas that referenced this issue Feb 15, 2017
jreback added a commit to jreback/pandas that referenced this issue Feb 16, 2017
jreback added a commit to jreback/pandas that referenced this issue Feb 17, 2017
jreback added a commit to jreback/pandas that referenced this issue Feb 23, 2017
jreback added a commit to jreback/pandas that referenced this issue Mar 2, 2017
jreback added a commit to jreback/pandas that referenced this issue Mar 5, 2017
jreback added a commit to jreback/pandas that referenced this issue Mar 8, 2017
jreback added a commit to jreback/pandas that referenced this issue Mar 9, 2017
jreback added a commit to jreback/pandas that referenced this issue Mar 10, 2017
jreback added a commit to jreback/pandas that referenced this issue Mar 10, 2017
jreback added a commit to jreback/pandas that referenced this issue Mar 10, 2017
jreback added a commit to jreback/pandas that referenced this issue Mar 10, 2017
AnkurDedania pushed a commit to AnkurDedania/pandas that referenced this issue Mar 21, 2017
…erve category dtype

closes pandas-dev#10409

Author: Jeff Reback <[email protected]>

Closes pandas-dev#15321 from jreback/merge_cat and squashes the following commits:

3671dad [Jeff Reback] DOC: merge docs
a4b2ee6 [Jeff Reback] BUG/API: .merge() and .join() on category dtype columns will now preserve the category dtype when possible
has2k1 added a commit to has2k1/plotnine that referenced this issue Jun 20, 2017
The workaround was due to a bug in pandas,
pandas-dev/pandas#10409 that has been fixed.
When that was fixed upstream, the local fix led to another
bug, pandas-dev/pandas#10696!!
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Categorical Categorical Data Type Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
6 participants