BUG: when inner/outer-joining dataframes with categorical MultiIndex, the output index dtype depends on row ordering #50906
Comments
While working on a project, I noticed that the categorical levels of my dataframes were changing to
This appears to be fixed on the main branch. I see all cases returning categorical dtypes. Could use a test.
Yeah, the join logic was different when your index was non-monotonic before, but we fixed this on main.
@phofl I'm seeing the same issue arise even when both indexes are monotonic increasing. Here is a fairly simple example:
import pandas as pd
df1 = pd.DataFrame({
"idx1": pd.Categorical(['a', 'a', 'a']),
"idx2": pd.Categorical(['a', 'a', 'b']),
"data": [1, 2, 3]
}).set_index(["idx1", "idx2"])
df2 = pd.DataFrame({
"idx1": pd.Categorical(['a', 'a', 'a']),
"idx2": pd.Categorical(['a', 'b', 'b']),
"data2": [1, 2, 3]
}).set_index(["idx1", "idx2"])
df3 = df1.join(df2, how="outer")
Note that:
If the thinking was that this was due to monotonicity, I'm not sure the fix on the main branch fully resolves it (I'm not set up to check that at the moment, but I can work to do so if it will help).
The initial cases all work correctly on main. We did some refactoring in how MultiIndex ops work with regard to materialising values; I guess this had an impact here too.
Hi, I will work on the test for the initial cases.
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Issue Description
If two dataframes are both multi-indexed with categorical levels, then performing a join operation can leave the index levels un-categorized, depending on the row ordering of the inputs. If the indexes match in ordering, the output is categorical; if the indexes have different ordering, the output is cast to the dtype of the underlying categories.
Expected Behavior
All joins shown above should produce categorical index levels.
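A minimal sketch of how that expectation could be checked, reusing the two frames from the comment above. The helpers `make_frame` and `level_dtypes` are hypothetical names introduced here for illustration and are not part of pandas. The printed result depends on the installed pandas version, so no expected output is shown.

```python
import pandas as pd


def make_frame(idx1, idx2, column):
    """Build a frame indexed by two categorical levels (hypothetical helper)."""
    return pd.DataFrame({
        "idx1": pd.Categorical(idx1),
        "idx2": pd.Categorical(idx2),
        column: list(range(len(idx1))),
    }).set_index(["idx1", "idx2"])


def level_dtypes(index):
    """Map each MultiIndex level name to the dtype of its values (hypothetical helper)."""
    return {name: index.get_level_values(name).dtype for name in index.names}


df1 = make_frame(["a", "a", "a"], ["a", "a", "b"], "data")
df2 = make_frame(["a", "a", "a"], ["a", "b", "b"], "data2")

for how in ("inner", "outer"):
    joined = df1.join(df2, how=how)
    dtypes = level_dtypes(joined.index)
    # True only if every level of the joined index kept a categorical dtype.
    all_categorical = all(
        isinstance(dtype, pd.CategoricalDtype) for dtype in dtypes.values()
    )
    print(how, dtypes, all_categorical)
```

If `all_categorical` prints `False` for either join type, the behavior described in this issue is present in that pandas version.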
Installed Versions