Skip to content

fix a indices bug for categorical-datetime columns #26860

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 20 commits into from

Conversation

alexifm
Copy link

@alexifm alexifm commented Jun 14, 2019

This is to fix a bug reported in #26859

@codecov
Copy link

codecov bot commented Jun 14, 2019

Codecov Report

Merging #26860 into master will decrease coverage by 1.41%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #26860      +/-   ##
==========================================
- Coverage   91.88%   90.46%   -1.42%     
==========================================
  Files         179      179              
  Lines       50696    50697       +1     
==========================================
- Hits        46581    45865     -716     
- Misses       4115     4832     +717
Flag Coverage Δ
#multiple 90.46% <100%> (ø) ⬆️
#single ?
Impacted Files Coverage Δ
pandas/core/groupby/ops.py 96% <100%> (ø) ⬆️
pandas/core/computation/pytables.py 62.5% <0%> (-27.75%) ⬇️
pandas/io/pytables.py 63.82% <0%> (-26.48%) ⬇️
pandas/io/gbq.py 88.88% <0%> (-11.12%) ⬇️
pandas/core/computation/common.py 84.21% <0%> (-5.27%) ⬇️
pandas/core/computation/expr.py 94.78% <0%> (-3.03%) ⬇️
pandas/io/clipboard/clipboards.py 31.88% <0%> (-2.9%) ⬇️
pandas/io/formats/printing.py 84.49% <0%> (-1.07%) ⬇️
pandas/core/indexes/datetimes.py 96.21% <0%> (-0.17%) ⬇️
pandas/core/arrays/categorical.py 95.8% <0%> (-0.13%) ⬇️
... and 2 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 430f0fd...6f8fdc0. Read the comment docs.

@codecov
Copy link

codecov bot commented Jun 14, 2019

Codecov Report

❗ No coverage uploaded for pull request base (master@d91ffa6). Click here to learn what that means.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master   #26860   +/-   ##
=========================================
  Coverage          ?   91.87%           
=========================================
  Files             ?      179           
  Lines             ?    50697           
  Branches          ?        0           
=========================================
  Hits              ?    46578           
  Misses            ?     4119           
  Partials          ?        0
Flag Coverage Δ
#multiple 90.46% <100%> (?)
#single 41.11% <0%> (?)
Impacted Files Coverage Δ
pandas/core/groupby/ops.py 96% <100%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d91ffa6...85b3b1a. Read the comment docs.

@WillAyd
Copy link
Member

WillAyd commented Jun 15, 2019

Can you add a test? That's typically the first thing we look for with PRs

Adds a test for a bug fix for DataFrameGroupby.indices in  pandas-dev#26860
@pep8speaks
Copy link

pep8speaks commented Jun 15, 2019

Hello @alexifm! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-08-29 03:58:29 UTC

cleans up the test to adhere to pep8 formatting
@alexifm
Copy link
Author

alexifm commented Jun 15, 2019

I need to find a better way to test to equality of the dictionaries in the output for 3.5.

alexifm added 2 commits June 14, 2019 20:14
The test no longer depends on ordering of a dictionary.  Also, the test matches the timestamp/datetime outputs that are the current standard in the code.
Handle Py3.5 dict ordering issues.  Cleanup for Pep8.  No longer using numpy testing utility.
@topper-123
Copy link
Contributor

This should also have a entry in the whatsnew.

@topper-123 topper-123 added the Bug label Jun 16, 2019
@topper-123 topper-123 added this to the 0.25.0 milestone Jun 16, 2019
@alexifm
Copy link
Author

alexifm commented Jun 17, 2019

Thanks for the feedback. Will update.

@jreback
Copy link
Contributor

jreback commented Jun 28, 2019

@alexifm can you merge master and update

@jreback jreback removed this from the 0.25.0 milestone Jun 28, 2019
@jbrockmendel
Copy link
Member

@alexifm can you merge master and ill take a look

@alexifm
Copy link
Author

alexifm commented Aug 28, 2019

Updated it and addressed the comments in the review. Sorry it took so long.

@@ -99,7 +99,7 @@ Other
^^^^^

- Compatibility with Python 3.8 in :meth:`DataFrame.query` (:issue:`27261`)
-
- Bug in :func:`get_indexer_dict` when passed keys are not numpy array. (:issue:`26860`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is a public function. Can you rephrase this to make sense for an end user?

And can you move the release note to the 1.0.0 whatsenew?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make more sense to put it in terms of gb.indices which was where the problem originally came about?

@@ -305,6 +305,8 @@ def get_flattened_iterator(comp_ids, ngroups, levels, labels):

def get_indexer_dict(label_list, keys):
""" return a diction of {labels} -> {indexers} """
# address GH 26860
keys = [np.asarray(key) for key in keys]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are the types on key here? Series, Index, Array?

I worry a bit about doing this on a DatetimeIndex with tz. That will emit a warning, since we're changing how we handle datetimes in np.asarray.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly, I'm not all that sure what is going into get_indexer_dict which was why I put the fix under the indices property since it was more about fixing that particular input.

]),
ids=lambda cols: ",".join(cols)
)
def test_groupby_indices(gb_cols):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test seems very complicated. I haven't gone through it yet, but I would appreciate at least one test as simple as the example from the issue.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, I think it makes sense to simplify. My idea was that gb.indices fails under certain combinations of types of columns and I wanted to enumerate as many of the combinations as possible. The original iteration was inside the test but it was a mess. I can still do the iteration inside the test but in a much cleaner manner.

ids=lambda cols: ",".join(cols)
)
def test_groupby_indices(gb_cols):
def test_groupby_indices_output():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is parametrizing no longer viable?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, I can revert it or find a middle ground so the parameterization isn't overkill. Thoughts on that?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I simplified the test and parametrized it.

Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, there's a merge conflict in the whatsnew. Can you merge master and resolve that?

@jbrockmendel do you have thoughts here?

@@ -176,7 +176,7 @@ Groupby/resample/rolling
^^^^^^^^^^^^^^^^^^^^^^^^

-
-
- Bug in :meth:`DataFrameGroupBy.indices` raises exception when grouping on multiple columns and one is a categorical with datetime values. (:issue:`26860`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"raises" -> "raising an"

"and one" -> "when one"

@jbrockmendel
Copy link
Member

i think we need the test cases to include Categorical[datetimetz]

@jreback
Copy link
Contributor

jreback commented Oct 6, 2019

can you merge master and update to comments

@jreback
Copy link
Contributor

jreback commented Oct 18, 2019

can you merge master

@jbrockmendel
Copy link
Member

@alexifm can you rebase. this would be nice to get in

@jreback
Copy link
Contributor

jreback commented Nov 13, 2019

closing as stale, if you want to continue pls ping.

@jreback jreback closed this Nov 13, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Groupby indices error with datetime categorical
7 participants