Skip to content

Fix 'observed' kwarg not doing anything on SeriesGroupBy #26463

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 24 commits into from
May 30, 2019

Conversation

krsnik93
Copy link
Contributor

@krsnik93 krsnik93 commented May 19, 2019

@jreback
Copy link
Contributor

jreback commented May 19, 2019

when u make changes - do it in the original PR otherwise this is very confusing

@krsnik93
Copy link
Contributor Author

Sorry, I did not make any changes between the two PRs, except changing the branch. I wanted to switch because the first one was pushed from master.

@jreback
Copy link
Contributor

jreback commented May 19, 2019

ok fair enough - pls close one of them

@krsnik93
Copy link
Contributor Author

ok, closed the first one

@codecov
Copy link

codecov bot commented May 19, 2019

Codecov Report

Merging #26463 into master will decrease coverage by <.01%.
The diff coverage is 96.77%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #26463      +/-   ##
==========================================
- Coverage   91.77%   91.76%   -0.01%     
==========================================
  Files         174      174              
  Lines       50649    50652       +3     
==========================================
- Hits        46483    46482       -1     
- Misses       4166     4170       +4
Flag Coverage Δ
#multiple 90.3% <96.77%> (ø) ⬆️
#single 41.69% <12.9%> (-0.07%) ⬇️
Impacted Files Coverage Δ
pandas/core/groupby/generic.py 88.81% <100%> (-0.16%) ⬇️
pandas/core/groupby/groupby.py 97.17% <95.65%> (-0.06%) ⬇️
pandas/io/gbq.py 78.94% <0%> (-10.53%) ⬇️
pandas/core/frame.py 97% <0%> (-0.12%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e7ad884...e6bca5e. Read the comment docs.

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good just a few questions and comments on tests

df['a'] = df['a'].astype('category')
df['b'] = df['b'].astype('category')

# test .agg and .apply when observed == False
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than duplicate things in individual sections here should try and parametrize on agg/apply and True/False (for observed)

df['a'] = df['a'].astype('category')
df['b'] = df['b'].astype('category')

# observed == False
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Parametrize on True/False here as well. There should already be an observed fixture defined in the top level conftest.py that you can use

Copy link
Contributor Author

@krsnik93 krsnik93 May 20, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't make use of the existing fixture due to other parameters. I also included the cases for None, in case the default changes. The tests look much cleaner now, but there's a couple of fixtures that probably won't be used often.

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great - just a few more things from my perspective



@pytest.mark.parametrize("observed, index, op, data", [
(True, 'multi_index_cat_partial', 'agg', [3, 3, 4]),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can add a separate decorator for op parametrized on agg/apply rather than having to duplicate each time here

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could also use the observed fixture separately

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think different values for index prevent me from doing that: I don't want apply to run with index multi_index_cat_partial (line 970), and I don't want to run agg on multi_index_non_cat_partial (line 971). BaseGrouper.apply changes the index when observed=True and I did not find a simple way to keep it, thus the differences in indices. Looking at values only would make this more simple, but that does not feel sufficient.

The same goes for observed, as observed=True and observed=False return different indices, I can't run all values of observed against all values of expected indices.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh OK I misread those values. You can also split those out into a separate test if they should exhibit different behavior

expected = pd.Series(data=data, index=index, name='c')
grouped = df_cat.groupby(['a', 'b'], observed=observed).c
actual = getattr(grouped, op)(sum)
assert_series_equal(expected, actual)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you change actual to result and do assert_series_equal(result, expected)?

# GH 24880
index = request.getfixturevalue(index)
expected = pd.Series(data=data, index=index, name='c')
grouped = df_cat.groupby(['a', 'b'], observed=observed).c
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stylistic nit but can you use bracket notation instead of dot notation?

# GH 24880
index = request.getfixturevalue(index)
expected = pd.Series(data=data, index=index, name='c')
actual = df_cat.groupby(['a', 'b'], observed=observed).c.\
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also stylistic but we only use implicit line continuations in code base, so could just break after the opening paranethes to apply here if not too long



@pytest.fixture
def multi_index_cat_partial(df_cat):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question on naming - what does partial mean here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By partial I mean that it's not a full product of index level values. For example, if first level has ['x', 'y'] and second level ['a', 'b'], then a complete index is [('x', 'a'), ('x', 'b'), ('y', 'a'), ('y', 'b')] and I construct that with .from_product, and if not all combinations are in, then it's partial (for example, [('x', 'a'), ('x', 'b'), ('y', 'a')] where ('y', 'b') is not in.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm OK. I find this rather confusing though that the determination as to whether or not this is "partial" is done entirely outside the scope of the fixture (i.e. it is up to the injected test to only partially align).

Is there not a way to make the fixture self contained to provide a Series or Frame instead? May require a rewrite of your tests but I think this is just confusing

@WillAyd
Copy link
Member

WillAyd commented May 20, 2019

@krsnik93 note that you also have an isort failure



@pytest.fixture
def multi_index_cat_partial(df_cat):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm OK. I find this rather confusing though that the determination as to whether or not this is "partial" is done entirely outside the scope of the fixture (i.e. it is up to the injected test to only partially align).

Is there not a way to make the fixture self contained to provide a Series or Frame instead? May require a rewrite of your tests but I think this is just confusing

def multi_index_cat_complete():
lvls = [CategoricalIndex(['x', 'y'], categories=['x', 'y'], ordered=False),
CategoricalIndex(['a', 'b'], categories=['a', 'b'], ordered=False)]
index = MultiIndex.from_product(lvls, names=['a', 'b'])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since 'a' and 'b' are categories can you use distinct names? 'foo' and 'bar' are fine

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment in all places; I think these can be made more readable if the code gets reduced

Otherwise lgtm so over to @jreback

def test_groupby_series_observed_true(df_cat, operation):
# GH 24880
index = {
'agg': MultiIndex.from_frame(df_cat[['a', 'b']].drop_duplicates()),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be better to just use literal values here instead of deriving via these expressions; it would be easier to read and could be passed in via parametrization instead of using this local dict

index_names = df_cat.select_dtypes(
'category').columns.values.tolist() + [None]
index = {
True: MultiIndex.from_tuples(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment

@WillAyd WillAyd added this to the 0.25.0 milestone May 21, 2019
@pytest.mark.parametrize('operation', ['agg', 'apply'])
def test_groupby_series_observed_true(df_cat, operation):
# GH 24880
index = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you just parametrize this instead of using the dict this way?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also using MultiIndex.from_* methods would be preferable for readability

@pep8speaks
Copy link

pep8speaks commented May 22, 2019

Hello @krsnik93! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-05-29 09:02:24 UTC

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm!

@@ -2301,6 +2302,69 @@ def tail(self, n=5):
mask = self._cumcount_array(ascending=False) < n
return self._selected_obj[mask]

def _reindex_output(self, result):
"""
If we have categorical groupers, then we want to make sure that
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you update the doc-string with Parameters / Results; type things if you can

@jreback
Copy link
Contributor

jreback commented May 26, 2019

does this have an associated issue?

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

needs a whatsnew note as well

@@ -76,3 +76,13 @@ def three_group():
'D': np.random.randn(11),
'E': np.random.randn(11),
'F': np.random.randn(11)})


@pytest.fixture
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can this be more generally used in groupby/test_categorical.py?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would have to keep doing .astype('category') in each of my tests for this. Either that, or derive another fixture from three_group, so it would not decrease the number of fixtures.

I also preferred literal values to random ones for easier equality checks.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it’s not about decreasing the number of fixtures
but rather reusing them across tests as much as possible

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

literal values are fine (u could just replace the random with fixed values)
changing existing to accommodate new tests is better than rolling new

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change looks fine, can handle the cleaning up of the _wrap* at a later date, just a test comment

class TestSeriesGroupByObservedKwarg:
# GH 24880

@pytest.fixture(autouse=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't use setup_method, this creates a rather opaque path. I don't think the class adds anything here (or rather it might but we don't use any classes in this file, so rather refactor in a later change).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I used a class not to have to do this bit:

df = df.copy()[:4]  # leave out some groups
df['A'] = df['A'].astype('category')
df['B'] = df['B'].astype('category')
df['C'] = pd.Series([1, 2, 3, 4])

in each of the 3 test functions. An alternative is to put the above in another fixture derived from the first one, but that would probably not be reused much.

So do I just rename setup_method to something like setup_df or do something else?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just define the fixture as a function outside of the class (and remove the class as indicated)

Copy link
Member

@simonjayhawkins simonjayhawkins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a few minor points regarding the tests.



@pytest.fixture
def df_cat(df):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a doctring, xref #19159

df_cat['A'] = df_cat['A'].astype('category')
df_cat['B'] = df_cat['B'].astype('category')
df_cat['C'] = pd.Series([1, 2, 3, 4])
yield df_cat
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yield is perfectly valid, but for consistency with the rest of the fixtures can you use a return. There's no context, teardown or finalization here.

'B': ['one', 'two', 'one', 'three']
}, dtype='category'))),
('apply', MultiIndex.from_frame(
pd.DataFrame({'A': ['foo', 'foo', 'bar', 'bar'],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the index is the same as above with just the dtype being different, might be clearer if you parametrize over just dtype, something like...

@pytest.mark.paramtrize(...., 'kwargs', [(..., None), (..., dict(dtype='category'))]

and then

MultiIndex.from_frame(...., **kwargs)

})))])
def test_seriesgroupby_observed_true(df_cat, operation, index):
# GH 24880
expected = pd.Series(data=[1, 3, 2, 4], index=index, name='C')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Series and DataFrame are already imported, so you could remove the pd prefixes, here and elsewhere in the tests.

@krsnik93
Copy link
Contributor Author

Am I seeing this correctly? I don't see how test_constructor_list_frames could be failing from my changes...

@WillAyd
Copy link
Member

WillAyd commented May 28, 2019

It is unrelated; see #26546

@krsnik93
Copy link
Contributor Author

I see, thanks.

@simonjayhawkins
Copy link
Member

@krsnik93 : the CI failures are fixed. can you merge master.

@jreback jreback merged commit 9e76f4a into pandas-dev:master May 30, 2019
@jreback
Copy link
Contributor

jreback commented May 30, 2019

very nice @krsnik93

would love to have followups as discussed above (or other issues) if you can.

@krsnik93 krsnik93 deleted the GH24880 branch June 17, 2019 14:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

observed keyword for SeriesGroupBy Ignored
5 participants