Skip to content

BUG: Groupby.apply wasn't allowing for functions which return lists #31456

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 13 commits into from
Feb 1, 2020

Conversation

MarcoGorelli
Copy link
Member

@MarcoGorelli MarcoGorelli changed the title wip BUG: Groupby.apply wasn't allowing for functions which return lists Jan 30, 2020
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think also need to run some perf checks to make sure this isn't super expensive.

@@ -499,7 +499,9 @@ def apply_frame_axis0(object frame, object f, object names,
# `piece` might not have an index, could be e.g. an int
pass

if not is_scalar(piece):
if isinstance(piece, list):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not the correct patch,

instead lets change line 507 (in revised code) to be

piece = deepcopy(piece)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not the correct patch,

instead lets change line 507 (in revised code) to be

piece = deepcopy(piece)

This causes some failures, e.g.

$ pytest pandas/tests/groupby/test_groupby.py::test_no_mutate_but_looks_like

which doesn't fail with

            if isinstance(piece, list):
                piece = deepcopy(piece)
            elif not is_scalar(piece):
                # Need to copy data to avoid appending references
                if hasattr(piece, "copy"):
                    piece = piece.copy(deep='all')
                else:
                    piece = copy(piece)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MarcoGorelli what is the difference that this causes in that test? Would it return one of the results completely as a list instead of exploding to a series?

Returning a list from a groupby op is thorny to begin with so I think should streamline logic as much as possible

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@WillAyd here's the difference:

>>> df = DataFrame({"key": [1, 1, 1, 2, 2, 2, 3, 3, 3], "value": range(9)})

>>> result1 = df.groupby("key", group_keys=True).apply(lambda x: x[:].key)
>>> result2 = df.groupby("key", group_keys=True).apply(lambda x: x.key)
>>> result1
key   
1    0    1
     1    1
     2    1
2    3    2
     4    2
     5    2
3    6    3
     7    3
     8    3

>>> result2
key  0  1  2
key         
1    1  1  1
2    2  2  2
3    3  3  3

Returning a list from a groupby op is thorny to begin with so I think should streamline logic as much as possible

OK, I'll think of a workaround (but I realise that you'd like to get a v1.0.1 release out very soon so if you think this'll take me too long please do feel free to take over)

Copy link
Member Author

@MarcoGorelli MarcoGorelli Jan 31, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm somewhat confused here. If I use this, the test case passes:

pd._testing.assert_series_equal(deepcopy(piece), piece.copy(deep="all"))
piece = piece.copy(deep="all")

but if I use this, it doesn't!

pd._testing.assert_series_equal(deepcopy(piece), piece.copy(deep="all"))
piece = deepcopy(piece)

will look into it

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dsaxton see here - I share your puzzlement :) Will look into Series.copy tomorrow

@jreback jreback added Groupby Regression Functionality that used to work in a prior pandas version labels Jan 31, 2020
@MarcoGorelli
Copy link
Member Author

MarcoGorelli commented Jan 31, 2020

NOTE

The following refers to a commit with line 507 (in revised code) changed to:

piece = deepcopy(piece)

(so, without the if isinstance(piece, list) part)

i think also need to run some perf checks to make sure this isn't super expensive.

Sure, here's the output from a run I did:

$ python setup.py develop  [I've removed the output]

$ cd asv_bench/

$ asv continuous -f 1.1 upstream/master HEAD -b groupby.Apply
· Creating environments
· Discovering benchmarks
·· Uninstalling from conda-py3.6-Cython-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt.
·· Building 6369f768 <issue-31441> for conda-py3.6-Cython-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt......................................................
·· Installing 6369f768 <issue-31441> into conda-py3.6-Cython-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt...
· Running 10 total benchmarks (2 commits * 1 environments * 5 benchmarks)
[  0.00%] · For pandas commit d81e226b <issue-31441~3> (round 1/2):
[  0.00%] ·· Building for conda-py3.6-Cython-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt....
[  0.00%] ·· Benchmarking conda-py3.6-Cython-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[  5.00%] ··· Setting up groupby:72                                                                                                                                                                            ok
[  5.00%] ··· Running (groupby.Apply.time_copy_function_multi_col--).....
[ 25.00%] · For pandas commit 6369f768 <issue-31441> (round 1/2):
[ 25.00%] ·· Building for conda-py3.6-Cython-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt....
[ 25.00%] ·· Benchmarking conda-py3.6-Cython-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 30.00%] ··· Setting up groupby:72                                                                                                                                                                            ok
[ 30.00%] ··· Running (groupby.Apply.time_copy_function_multi_col--).....
[ 50.00%] · For pandas commit 6369f768 <issue-31441> (round 2/2):
[ 50.00%] ·· Benchmarking conda-py3.6-Cython-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 55.00%] ··· Setting up groupby:72                                                                                                                                                                            ok
[ 55.00%] ··· groupby.Apply.time_copy_function_multi_col                                                                                                                                               3.64±0.01s
[ 60.00%] ··· groupby.Apply.time_copy_overhead_single_col                                                                                                                                              1.44±0.01s
[ 65.00%] ··· groupby.Apply.time_scalar_function_multi_col                                                                                                                                             57.7±0.4ms
[ 70.00%] ··· groupby.Apply.time_scalar_function_single_col                                                                                                                                            17.8±0.4ms
[ 75.00%] ··· groupby.ApplyDictReturn.time_groupby_apply_dict_return                                                                                                                                    183±0.9ms
[ 75.00%] · For pandas commit d81e226b <issue-31441~3> (round 2/2):
[ 75.00%] ·· Building for conda-py3.6-Cython-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt....
[ 75.00%] ·· Benchmarking conda-py3.6-Cython-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 80.00%] ··· Setting up groupby:72                                                                                                                                                                            ok
[ 80.00%] ··· groupby.Apply.time_copy_function_multi_col                                                                                                                                                  5.07±0s
[ 85.00%] ··· groupby.Apply.time_copy_overhead_single_col                                                                                                                                                 2.02±0s
[ 90.00%] ··· groupby.Apply.time_scalar_function_multi_col                                                                                                                                               59.1±3ms
[ 95.00%] ··· groupby.Apply.time_scalar_function_single_col                                                                                                                                            18.1±0.5ms
[100.00%] ··· groupby.ApplyDictReturn.time_groupby_apply_dict_return                                                                                                                                      185±4ms
       before           after         ratio
     [d81e226b]       [6369f768]
     <issue-31441~3>       <issue-31441>
-         5.07±0s       3.64±0.01s     0.72  groupby.Apply.time_copy_function_multi_col
-         2.02±0s       1.44±0.01s     0.71  groupby.Apply.time_copy_overhead_single_col

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

$ asv continuous -f 1.1 upstream/master HEAD -b groupby.ApplyDictReturn
· Creating environments
· Discovering benchmarks
·· Uninstalling from conda-py3.6-Cython-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt.
·· Installing 6369f768 <issue-31441> into conda-py3.6-Cython-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt...
· Running 2 total benchmarks (2 commits * 1 environments * 1 benchmarks)
[  0.00%] · For pandas commit d81e226b <issue-31441~3> (round 1/2):
[  0.00%] ·· Building for conda-py3.6-Cython-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt....
[  0.00%] ·· Benchmarking conda-py3.6-Cython-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 25.00%] ··· Running (groupby.ApplyDictReturn.time_groupby_apply_dict_return--).
[ 25.00%] · For pandas commit 6369f768 <issue-31441> (round 1/2):
[ 25.00%] ·· Building for conda-py3.6-Cython-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt....
[ 25.00%] ·· Benchmarking conda-py3.6-Cython-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 50.00%] ··· Running (groupby.ApplyDictReturn.time_groupby_apply_dict_return--).
[ 50.00%] · For pandas commit 6369f768 <issue-31441> (round 2/2):
[ 50.00%] ·· Benchmarking conda-py3.6-Cython-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 75.00%] ··· groupby.ApplyDictReturn.time_groupby_apply_dict_return                                                                                                                                      186±2ms
[ 75.00%] · For pandas commit d81e226b <issue-31441~3> (round 2/2):
[ 75.00%] ·· Building for conda-py3.6-Cython-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt....
[ 75.00%] ·· Benchmarking conda-py3.6-Cython-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[100.00%] ··· groupby.ApplyDictReturn.time_groupby_apply_dict_return                                                                                                                                      187±2ms

BENCHMARKS NOT SIGNIFICANTLY CHANGED.

EDIT

seems like there's some test failures though

@MarcoGorelli
Copy link
Member Author

Performance results with current commit:

$ asv continuous -f 1.1 upstream/master HEAD -b groupby.Apply
· Creating environments
· Discovering benchmarks
· Running 10 total benchmarks (2 commits * 1 environments * 5 benchmarks)
[  0.00%] · For pandas commit d81e226b <issue-31441~3> (round 1/2):
[  0.00%] ·· Building for conda-py3.6-Cython-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt....
[  0.00%] ·· Benchmarking conda-py3.6-Cython-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[  5.00%] ··· Setting up groupby:72                                                                                                                                                                            ok
[  5.00%] ··· Running (groupby.Apply.time_copy_function_multi_col--).....
[ 25.00%] · For pandas commit 6369f768 <issue-31441> (round 1/2):
[ 25.00%] ·· Building for conda-py3.6-Cython-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt....
[ 25.00%] ·· Benchmarking conda-py3.6-Cython-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 30.00%] ··· Setting up groupby:72                                                                                                                                                                            ok
[ 30.00%] ··· Running (groupby.Apply.time_copy_function_multi_col--).....
[ 50.00%] · For pandas commit 6369f768 <issue-31441> (round 2/2):
[ 50.00%] ·· Benchmarking conda-py3.6-Cython-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 55.00%] ··· Setting up groupby:72                                                                                                                                                                            ok
[ 55.00%] ··· groupby.Apply.time_copy_function_multi_col                                                                                                                                               3.77±0.07s
[ 60.00%] ··· groupby.Apply.time_copy_overhead_single_col                                                                                                                                              1.46±0.02s
[ 65.00%] ··· groupby.Apply.time_scalar_function_multi_col                                                                                                                                             58.1±0.7ms
[ 70.00%] ··· groupby.Apply.time_scalar_function_single_col                                                                                                                                           17.7±0.08ms
[ 75.00%] ··· groupby.ApplyDictReturn.time_groupby_apply_dict_return                                                                                                                                    181±0.3ms
[ 75.00%] · For pandas commit d81e226b <issue-31441~3> (round 2/2):
[ 75.00%] ·· Building for conda-py3.6-Cython-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt....
[ 75.00%] ·· Benchmarking conda-py3.6-Cython-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 80.00%] ··· Setting up groupby:72                                                                                                                                                                            ok
[ 80.00%] ··· groupby.Apply.time_copy_function_multi_col                                                                                                                                                  5.14±0s
[ 85.00%] ··· groupby.Apply.time_copy_overhead_single_col                                                                                                                                              2.09±0.04s
[ 90.00%] ··· groupby.Apply.time_scalar_function_multi_col                                                                                                                                             57.4±0.8ms
[ 95.00%] ··· groupby.Apply.time_scalar_function_single_col                                                                                                                                            17.6±0.1ms
[100.00%] ··· groupby.ApplyDictReturn.time_groupby_apply_dict_return                                                                                                                                      181±1ms
       before           after         ratio
     [d81e226b]       [6369f768]
     <issue-31441~3>       <issue-31441>
-         5.14±0s       3.77±0.07s     0.73  groupby.Apply.time_copy_function_multi_col
-      2.09±0.04s       1.46±0.02s     0.70  groupby.Apply.time_copy_overhead_single_col

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

$ asv continuous -f 1.1 upstream/master HEAD -b groupby.ApplyDictReturn
· Creating environments
· Discovering benchmarks
·· Uninstalling from conda-py3.6-Cython-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt.
·· Installing 6369f768 <issue-31441> into conda-py3.6-Cython-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt...
· Running 2 total benchmarks (2 commits * 1 environments * 1 benchmarks)
[  0.00%] · For pandas commit d81e226b <issue-31441~3> (round 1/2):
[  0.00%] ·· Building for conda-py3.6-Cython-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt....
[  0.00%] ·· Benchmarking conda-py3.6-Cython-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 25.00%] ··· Running (groupby.ApplyDictReturn.time_groupby_apply_dict_return--).
[ 25.00%] · For pandas commit 6369f768 <issue-31441> (round 1/2):
[ 25.00%] ·· Building for conda-py3.6-Cython-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt....
[ 25.00%] ·· Benchmarking conda-py3.6-Cython-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 50.00%] ··· Running (groupby.ApplyDictReturn.time_groupby_apply_dict_return--).
[ 50.00%] · For pandas commit 6369f768 <issue-31441> (round 2/2):
[ 50.00%] ·· Benchmarking conda-py3.6-Cython-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 75.00%] ··· groupby.ApplyDictReturn.time_groupby_apply_dict_return                                                                                                                                    181±0.2ms
[ 75.00%] · For pandas commit d81e226b <issue-31441~3> (round 2/2):
[ 75.00%] ·· Building for conda-py3.6-Cython-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt....
[ 75.00%] ·· Benchmarking conda-py3.6-Cython-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[100.00%] ··· groupby.ApplyDictReturn.time_groupby_apply_dict_return                                                                                                                                    182±0.8ms

BENCHMARKS NOT SIGNIFICANTLY CHANGED.

@@ -499,7 +499,9 @@ def apply_frame_axis0(object frame, object f, object names,
# `piece` might not have an index, could be e.g. an int
pass

if not is_scalar(piece):
if isinstance(piece, list):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to special case lists specifically here? How about special casing only those times when copy is known to have the deep keyword for piece.copy(deep="all") and letting everything else use a deepcopy? (eg., something like df.groupby("a").apply(lambda x: set(x.index)) would still raise here)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about special casing only those times when copy is known to have the deep keyword

How - by using inspect?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had been thinking checking specifically for Series, DataFrame, or Index and trying to deepcopy everything else, but there may be a "prettier" way of approaching it. I remember getting the exact same test failure you're seeing now and am still puzzled by it.

def test_apply_function_returns_list():
# GH 31441
df = pd.DataFrame(["A", "A", "B", "B"], columns=["groups"])
result = df.groupby("groups").apply(lambda x: x.index.to_list())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be good to add tests for other return types as well (e.g., set or tuple)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, good suggestion, thanks!

@jreback jreback added this to the 1.0.1 milestone Feb 1, 2020
@pytest.mark.parametrize(
"function, expected_values",
[
(lambda x: x.index.to_list(), [[0, 1], [2, 3]]),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should be able to skip the conversion to list in the remaining test cases

@jreback jreback merged commit bfb80fa into pandas-dev:master Feb 1, 2020
@jreback
Copy link
Contributor

jreback commented Feb 1, 2020

thanks @MarcoGorelli

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Groupby Regression Functionality that used to work in a prior pandas version
Projects
None yet
Development

Successfully merging this pull request may close these issues.

TypeError: copy() takes no keyword arguments
4 participants