-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
Fix 'observed' kwarg not doing anything on SeriesGroupBy #26463
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
when u make changes - do it in the original PR otherwise this is very confusing |
Sorry, I did not make any changes between the two PRs, except changing the branch. I wanted to switch because the first one was pushed from master. |
ok fair enough - pls close one of them |
ok, closed the first one |
Codecov Report
@@ Coverage Diff @@
## master #26463 +/- ##
==========================================
- Coverage 91.77% 91.76% -0.01%
==========================================
Files 174 174
Lines 50649 50652 +3
==========================================
- Hits 46483 46482 -1
- Misses 4166 4170 +4
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good just a few questions and comments on tests
df['a'] = df['a'].astype('category') | ||
df['b'] = df['b'].astype('category') | ||
|
||
# test .agg and .apply when observed == False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than duplicate things in individual sections here should try and parametrize on agg/apply and True/False (for observed)
df['a'] = df['a'].astype('category') | ||
df['b'] = df['b'].astype('category') | ||
|
||
# observed == False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Parametrize on True/False here as well. There should already be an observed fixture defined in the top level conftest.py that you can use
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I couldn't make use of the existing fixture due to other parameters. I also included the cases for None, in case the default changes. The tests look much cleaner now, but there's a couple of fixtures that probably won't be used often.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great - just a few more things from my perspective
|
||
|
||
@pytest.mark.parametrize("observed, index, op, data", [ | ||
(True, 'multi_index_cat_partial', 'agg', [3, 3, 4]), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can add a separate decorator for op
parametrized on agg/apply rather than having to duplicate each time here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could also use the observed fixture separately
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think different values for index prevent me from doing that: I don't want apply
to run with index multi_index_cat_partial
(line 970), and I don't want to run agg
on multi_index_non_cat_partial
(line 971). BaseGrouper.apply changes the index when observed=True
and I did not find a simple way to keep it, thus the differences in indices. Looking at values only would make this more simple, but that does not feel sufficient.
The same goes for observed
, as observed=True
and observed=False
return different indices, I can't run all values of observed
against all values of expected indices.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh OK I misread those values. You can also split those out into a separate test if they should exhibit different behavior
expected = pd.Series(data=data, index=index, name='c') | ||
grouped = df_cat.groupby(['a', 'b'], observed=observed).c | ||
actual = getattr(grouped, op)(sum) | ||
assert_series_equal(expected, actual) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you change actual
to result and do assert_series_equal(result, expected)
?
# GH 24880 | ||
index = request.getfixturevalue(index) | ||
expected = pd.Series(data=data, index=index, name='c') | ||
grouped = df_cat.groupby(['a', 'b'], observed=observed).c |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Stylistic nit but can you use bracket notation instead of dot notation?
# GH 24880 | ||
index = request.getfixturevalue(index) | ||
expected = pd.Series(data=data, index=index, name='c') | ||
actual = df_cat.groupby(['a', 'b'], observed=observed).c.\ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also stylistic but we only use implicit line continuations in code base, so could just break after the opening paranethes to apply here if not too long
pandas/tests/groupby/conftest.py
Outdated
|
||
|
||
@pytest.fixture | ||
def multi_index_cat_partial(df_cat): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question on naming - what does partial
mean here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By partial
I mean that it's not a full product of index level values. For example, if first level has ['x', 'y']
and second level ['a', 'b']
, then a complete index is [('x', 'a'), ('x', 'b'), ('y', 'a'), ('y', 'b')]
and I construct that with .from_product
, and if not all combinations are in, then it's partial
(for example, [('x', 'a'), ('x', 'b'), ('y', 'a')]
where ('y', 'b')
is not in.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm OK. I find this rather confusing though that the determination as to whether or not this is "partial" is done entirely outside the scope of the fixture (i.e. it is up to the injected test to only partially align).
Is there not a way to make the fixture self contained to provide a Series or Frame instead? May require a rewrite of your tests but I think this is just confusing
@krsnik93 note that you also have an isort failure |
pandas/tests/groupby/conftest.py
Outdated
|
||
|
||
@pytest.fixture | ||
def multi_index_cat_partial(df_cat): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm OK. I find this rather confusing though that the determination as to whether or not this is "partial" is done entirely outside the scope of the fixture (i.e. it is up to the injected test to only partially align).
Is there not a way to make the fixture self contained to provide a Series or Frame instead? May require a rewrite of your tests but I think this is just confusing
pandas/tests/groupby/conftest.py
Outdated
def multi_index_cat_complete(): | ||
lvls = [CategoricalIndex(['x', 'y'], categories=['x', 'y'], ordered=False), | ||
CategoricalIndex(['a', 'b'], categories=['a', 'b'], ordered=False)] | ||
index = MultiIndex.from_product(lvls, names=['a', 'b']) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since 'a' and 'b' are categories can you use distinct names? 'foo' and 'bar' are fine
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment in all places; I think these can be made more readable if the code gets reduced
Otherwise lgtm so over to @jreback
def test_groupby_series_observed_true(df_cat, operation): | ||
# GH 24880 | ||
index = { | ||
'agg': MultiIndex.from_frame(df_cat[['a', 'b']].drop_duplicates()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be better to just use literal values here instead of deriving via these expressions; it would be easier to read and could be passed in via parametrization instead of using this local dict
index_names = df_cat.select_dtypes( | ||
'category').columns.values.tolist() + [None] | ||
index = { | ||
True: MultiIndex.from_tuples( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment
@pytest.mark.parametrize('operation', ['agg', 'apply']) | ||
def test_groupby_series_observed_true(df_cat, operation): | ||
# GH 24880 | ||
index = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you just parametrize this instead of using the dict this way?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also using MultiIndex.from_* methods would be preferable for readability
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm!
pandas/core/groupby/groupby.py
Outdated
@@ -2301,6 +2302,69 @@ def tail(self, n=5): | |||
mask = self._cumcount_array(ascending=False) < n | |||
return self._selected_obj[mask] | |||
|
|||
def _reindex_output(self, result): | |||
""" | |||
If we have categorical groupers, then we want to make sure that |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you update the doc-string with Parameters / Results; type things if you can
does this have an associated issue? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
needs a whatsnew note as well
pandas/tests/groupby/conftest.py
Outdated
@@ -76,3 +76,13 @@ def three_group(): | |||
'D': np.random.randn(11), | |||
'E': np.random.randn(11), | |||
'F': np.random.randn(11)}) | |||
|
|||
|
|||
@pytest.fixture |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can this be more generally used in groupby/test_categorical.py?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would have to keep doing .astype('category')
in each of my tests for this. Either that, or derive another fixture from three_group
, so it would not decrease the number of fixtures.
I also preferred literal values to random ones for easier equality checks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it’s not about decreasing the number of fixtures
but rather reusing them across tests as much as possible
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
literal values are fine (u could just replace the random with fixed values)
changing existing to accommodate new tests is better than rolling new
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
change looks fine, can handle the cleaning up of the _wrap* at a later date, just a test comment
class TestSeriesGroupByObservedKwarg: | ||
# GH 24880 | ||
|
||
@pytest.fixture(autouse=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't use setup_method, this creates a rather opaque path. I don't think the class adds anything here (or rather it might but we don't use any classes in this file, so rather refactor in a later change).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I used a class not to have to do this bit:
df = df.copy()[:4] # leave out some groups
df['A'] = df['A'].astype('category')
df['B'] = df['B'].astype('category')
df['C'] = pd.Series([1, 2, 3, 4])
in each of the 3 test functions. An alternative is to put the above in another fixture derived from the first one, but that would probably not be reused much.
So do I just rename setup_method
to something like setup_df
or do something else?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just define the fixture as a function outside of the class (and remove the class as indicated)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a few minor points regarding the tests.
|
||
|
||
@pytest.fixture | ||
def df_cat(df): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a doctring, xref #19159
df_cat['A'] = df_cat['A'].astype('category') | ||
df_cat['B'] = df_cat['B'].astype('category') | ||
df_cat['C'] = pd.Series([1, 2, 3, 4]) | ||
yield df_cat |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yield is perfectly valid, but for consistency with the rest of the fixtures can you use a return. There's no context, teardown or finalization here.
'B': ['one', 'two', 'one', 'three'] | ||
}, dtype='category'))), | ||
('apply', MultiIndex.from_frame( | ||
pd.DataFrame({'A': ['foo', 'foo', 'bar', 'bar'], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if the index is the same as above with just the dtype being different, might be clearer if you parametrize over just dtype, something like...
@pytest.mark.paramtrize(...., 'kwargs', [(..., None), (..., dict(dtype='category'))]
and then
MultiIndex.from_frame(...., **kwargs)
})))]) | ||
def test_seriesgroupby_observed_true(df_cat, operation, index): | ||
# GH 24880 | ||
expected = pd.Series(data=[1, 3, 2, 4], index=index, name='C') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Series and DataFrame are already imported, so you could remove the pd prefixes, here and elsewhere in the tests.
Am I seeing this correctly? I don't see how test_constructor_list_frames could be failing from my changes... |
It is unrelated; see #26546 |
I see, thanks. |
@krsnik93 : the CI failures are fixed. can you merge master. |
very nice @krsnik93 would love to have followups as discussed above (or other issues) if you can. |
observed
keyword for SeriesGroupBy Ignored #24880git diff upstream/master -u -- "*.py" | flake8 --diff