Skip to content

BUG: fix reindexing MultiIndex with categorical datetime-like level #21657

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

jorisvandenbossche
Copy link
Member

@jorisvandenbossche jorisvandenbossche commented Jun 27, 2018

closes #21390

This fixes the bug, but is not really a general solution. However, I would like to keep that for another PR (won't have time for this before 0.23.2), also because the underlying reason is more widely present than in just the MultiIndex.values (will open a separate issue about this -> #21658).

I added some tests of the bug at several levels where it surfaces (groupby, reindex, index.get_indexer)

@jorisvandenbossche jorisvandenbossche added Regression Functionality that used to work in a prior pandas version MultiIndex Categorical Categorical Data Type labels Jun 27, 2018
@jorisvandenbossche jorisvandenbossche added this to the 0.23.2 milestone Jun 27, 2018
@pep8speaks
Copy link

pep8speaks commented Jun 27, 2018

Hello @jorisvandenbossche! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on June 29, 2018 at 09:13 Hours UTC

# Need to box timestamps, etc.
box = hasattr(lev, '_box_values')
if is_categorical_dtype(lev):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you need and len(lev) > len(lab) here? I'm not sure what it's for, but the other condition had it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so. Below it is too switch between "first boxing, then take" and "first take, then boxing", as far as I understand "Try to minimize boxing." for performance reasons? (both should be the same).
Anyhow, I don't think it is relevant here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a gigantic hack, I would really not do this. Simply push to 0.23.3

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have ideas where to look for a better solution?

# Need to box timestamps, etc.
box = hasattr(lev, '_box_values')
if is_categorical_dtype(lev):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a gigantic hack, I would really not do this. Simply push to 0.23.3

@jreback jreback modified the milestones: 0.23.2, 0.23.3 Jun 28, 2018
@codecov
Copy link

codecov bot commented Jun 28, 2018

Codecov Report

Merging #21657 into master will decrease coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #21657      +/-   ##
==========================================
- Coverage    91.9%    91.9%   -0.01%     
==========================================
  Files         154      154              
  Lines       49656    49559      -97     
==========================================
- Hits        45637    45546      -91     
+ Misses       4019     4013       -6
Flag Coverage Δ
#multiple 90.27% <100%> (-0.01%) ⬇️
#single 42.03% <50%> (ø) ⬆️
Impacted Files Coverage Δ
pandas/core/indexes/multi.py 94.9% <100%> (-0.06%) ⬇️
pandas/core/arrays/base.py 83.95% <0%> (-3.65%) ⬇️
pandas/core/ops.py 96.41% <0%> (-0.06%) ⬇️
pandas/core/dtypes/dtypes.py 95.9% <0%> (-0.05%) ⬇️
pandas/util/testing.py 85.19% <0%> (-0.04%) ⬇️
pandas/core/algorithms.py 94.83% <0%> (-0.02%) ⬇️
pandas/core/indexes/base.py 96.63% <0%> (-0.01%) ⬇️
pandas/core/reshape/reshape.py 99.78% <0%> (-0.01%) ⬇️
pandas/core/indexes/period.py 92.67% <0%> (ø) ⬆️
... and 7 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0b63e81...2261eef. Read the comment docs.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

much nicer! minor comment

vals = self._get_level_values(i)
if is_categorical_dtype(vals):
vals = vals.get_values()
if (isinstance(vals.dtype, (PandasExtensionDtype, ExtensionDtype))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

elif here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I need the 'if' here because the result of categorical.get_values() can still be an Index with extension dtype / datetime dtype.

I would like to explore a bit more how to streamline the path from series/index/array object -> numpy array that is boxed if needed (currenlty that doesn't seem to easy, and is handled again in many different places), but that is for another PR

@jreback jreback modified the milestones: 0.23.3, 0.23.2 Jun 28, 2018
@jorisvandenbossche jorisvandenbossche merged commit 1cc5471 into pandas-dev:master Jul 2, 2018
@jorisvandenbossche jorisvandenbossche deleted the bug-groupby-categorical-21390 branch July 2, 2018 15:26
jorisvandenbossche added a commit to jorisvandenbossche/pandas that referenced this pull request Jul 2, 2018
jorisvandenbossche added a commit that referenced this pull request Jul 5, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Categorical Categorical Data Type MultiIndex Regression Functionality that used to work in a prior pandas version
Projects
None yet
4 participants