API: DataFrameGroupBy column subset selection with single list? #23566

jorisvandenbossche · 2018-11-08T14:18:22Z

I wouldn't be surprised if there is already an issue about this, but couldn't directly find one.

When selecting a subset of columns on a DataFrameGroupBy object, both a plain comma-separated list of names (so a tuple within the __getitem__ [] brackets) and double square brackets (a list inside the __getitem__ [] brackets) seem to work:

In [6]: df = pd.DataFrame(np.random.randint(10, size=(10, 4)), columns=['a', 'b', 'c', 'd'])

In [8]: df.groupby('a').sum()
Out[8]: 
    b   c   d
a            
0   0   5   7
3  18   6  12
4  16   6   9
6  10  11  11
9   3   3   0

In [9]: df.groupby('a')['b', 'c'].sum()
Out[9]: 
    b   c
a        
0   0   5
3  18   6
4  16   6
6  10  11
9   3   3

In [10]: df.groupby('a')[['b', 'c']].sum()
Out[10]: 
    b   c
a        
0   0   5
3  18   6
4  16   6
6  10  11
9   3   3

Personally I find this df.groupby('a')['b', 'c'].sum() a bit strange, and inconsistent with how DataFrame indexing works.

Of course, on a DataFrameGroupBy you don't have the possible confusion with indexing multiple dimensions (rows, columns), but still.

cc @jreback @WillAyd

TomAugspurger · 2018-11-08T14:28:34Z

You do have ambiguity with tuples though (not that anyone should do that)

In [14]: df = pd.DataFrame(np.random.randint(10, size=(10, 4)), columns=['a', 'b', 'c', ('a', 'b')])

In [15]: df.groupby('c')['a', 'b'].sum()
Out[15]:
    a   b
c
0   1   7
1   6   9
2   7   7
5   9  11
6   8   6
7  10   6
8  11   8

In [16]: df.groupby('c')[('a', 'b')].sum()
Out[16]:
    a   b
c
0   1   7
1   6   9
2   7   7
5   9  11
6   8   6
7  10   6
8  11   8

I think both of those are incorrect. It should rather be

In [19]: df.groupby('c').sum()[('a', 'b')]
Out[19]:
c
0     7
1     3
2     5
5     7
6     8
7    16
8    11
Name: (a, b), dtype: int64

WillAyd · 2018-11-13T15:41:43Z

I don't disagree here. There is a difference when selecting only one column (specifically returning a Series vs a DataFrame) but when selecting multiple columns it would be more consistent if we ALWAYS required double brackets. I assume this would also yield a simpler implementation.

Maybe a conversation piece for 1.0? Would be a breaking change for sure so probably best served in a major release like that

TomAugspurger · 2018-11-13T15:45:04Z

I think the hope is for 1.0 to be backwards compatible with 0.25.x.

Do we have a chance to detect this case and throw a FutureWarning (assuming we want to change)?

jorisvandenbossche · 2018-11-13T15:46:25Z

Yeah, if we want, I would think it should be possible with a deprecation cycle.

jreback · 2018-11-14T02:28:20Z

this i suspect is actually very common in the wild (not using the double brackets)

but i agree we should deprecate as it is inconsistent

yehoshuadimarsky · 2019-12-25T01:46:31Z

Can I take a crack at this or has it already been fixed?

Also, will this be a part of the 1.0 or other milestones?

WillAyd · 2019-12-25T02:01:20Z

Go for it!

yehoshuadimarsky · 2019-12-25T02:23:39Z

Thanks will do.

yehoshuadimarsky · 2019-12-25T04:32:35Z

take

yehoshuadimarsky · 2019-12-25T16:48:11Z

So this is my first time working on pandas code, and I'm a little confused here, so please bear with me. I'm also new to linking to code on GitHub.

As I understand it, when an object is indexed with brackets and you pass in several comma-separated keys, Python implicitly packs them into a single tuple key before calling __getitem__. So df['a','b'] is really df[('a','b')] under the hood.
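The tuple packing described above can be checked without pandas. The following is a minimal sketch; `KeyRecorder` is a hypothetical helper class invented for this illustration, and it only echoes back whatever key `__getitem__` receives.

```python
class KeyRecorder:
    """Hypothetical helper: returns the key exactly as __getitem__ receives it."""

    def __getitem__(self, key):
        return key


rec = KeyRecorder()

# Comma-separated keys arrive as one tuple.
print(rec['a', 'b'])      # ('a', 'b')

# Explicit parentheses produce the identical key, so the two spellings
# cannot be told apart inside __getitem__.
print(rec[('a', 'b')])    # ('a', 'b')

# Double brackets pass a list, which is distinguishable from a tuple.
print(rec[['a', 'b']])    # ['a', 'b']

# A single key is passed through unchanged.
print(rec['a'])           # a
```

Because the first two spellings are indistinguishable, a column whose label is itself the tuple `('a', 'b')` cannot be separated from a request for the two columns `'a'` and `'b'`, which is the ambiguity raised earlier in the thread.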

I'm having trouble in tracing the code path to figure out where exactly the __getitem__ on the GroupBy is actually implemented here:

1. DataFrame.groupby is called on the superclass NDFrame here
2. This eventually creates the specific DataFrameGroupBy object here
3. Which is a subclass of GroupBy
4. Which is a subclass of _GroupBy
5. Which has the mixin named SelectionMixin, defined here
6. Which implements __getitem__ here
7. Which, if the key is a list or tuple, returns self._gotitem(list(key), ndim=2)
8. self._gotitem needs to be implemented by the respective subclasses, which in this case is the DataFrameGroupBy object, and is implemented here
9. But all this does is simply create an instance of itself (DataFrameGroupBy) with the key (a list/tuple) passed as a slice to the selection parameter
10. The selection parameter is implemented in the parent _GroupBy object, which sets the internal self._selection attribute to the key here

This is where I'm lost. How does this actually slice the object and only return a subset of it?

Any help here would be greatly appreciated. Thanks.
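One way to picture the call chain traced above is a toy model in which `__getitem__` only records the selection and the subset is computed lazily on first access. This is a simplified sketch and not pandas code: `ToyGroupBy` is hypothetical, it stores columns in a plain dict, it uses `functools.cached_property` as a stand-in for a cached read-only attribute, and unlike pandas it treats a single key the same as a one-element list.

```python
from functools import cached_property


class ToyGroupBy:
    """Toy stand-in for a groupby object; hypothetical, not pandas code."""

    def __init__(self, obj, by, selection=None):
        self.obj = obj                # dict: column name -> list of values
        self.by = by                  # name of the grouping column
        self._selection = selection   # stored only; nothing is sliced yet

    def __getitem__(self, key):
        # Lists and tuples are both normalised to a list of column names.
        if isinstance(key, (list, tuple)):
            return self._gotitem(list(key))
        return self._gotitem([key])

    def _gotitem(self, key):
        # Build a new object of the same type that remembers the selection.
        return type(self)(self.obj, self.by, selection=key)

    @cached_property
    def _selected_obj(self):
        # The subset is computed on first access and cached afterwards.
        if self._selection is None:
            return {k: v for k, v in self.obj.items() if k != self.by}
        return {k: self.obj[k] for k in self._selection}

    def sum(self):
        # Aggregate each selected column per group key.
        keys = self.obj[self.by]
        out = {}
        for col, values in self._selected_obj.items():
            totals = {}
            for k, v in zip(keys, values):
                totals[k] = totals.get(k, 0) + v
            out[col] = totals
        return out


data = {'a': [0, 0, 1], 'b': [1, 2, 3], 'c': [4, 5, 6], 'd': [7, 8, 9]}
sub = ToyGroupBy(data, 'a')[['b', 'c']]
print(sub._selection)   # ['b', 'c']
print(sub.sum())        # {'b': {0: 3, 1: 3}, 'c': {0: 9, 1: 6}}
```

In this model the indexing step never touches the data; the aggregation reads the cached subset, so the slicing effectively happens at aggregation time.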

yehoshuadimarsky · 2019-12-29T00:59:05Z

@WillAyd @jorisvandenbossche are you able to help point me in the right direction? ☝️

yehoshuadimarsky · 2019-12-29T23:27:48Z

Just to close the loop on my earlier question, I never fully figured out how the slicing happens, but it seems it happens at some Cython layer via the self._selected_obj attribute and its @cache_readonly decorator. Happily, it didn't end up mattering, as all I needed to do was intercept the call on the DataFrameGroupBy to the __getitem__ method of the SelectionMixin by adding its own __getitem__ method to check if the keys are a tuple and raise the warning there.
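The interception approach described above might look roughly like the sketch below. The class names are hypothetical stand-ins, the warning text is illustrative wording rather than a quotation from pandas, and the `len(key) > 1` check is an assumption of this sketch (a length-1 tuple passes through silently here).

```python
import warnings


class BaseSelection:
    """Hypothetical stand-in for the shared selection mixin."""

    def __getitem__(self, key):
        # Lists and tuples are both treated as a multi-column selection.
        if isinstance(key, (list, tuple)):
            return list(key)
        return key


class WarningGroupBy(BaseSelection):
    """Hypothetical subclass that checks for tuple keys before delegating."""

    def __getitem__(self, key):
        # Assumption: only tuples with more than one element trigger the warning.
        if isinstance(key, tuple) and len(key) > 1:
            warnings.warn(
                "Indexing with multiple keys (implicitly converted to a "
                "tuple of keys) is deprecated; use a list instead.",
                FutureWarning,
                stacklevel=2,
            )
        return super().__getitem__(key)


gb = WarningGroupBy()

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    gb['b', 'c']        # tuple key: one FutureWarning is recorded
    gb[['b', 'c']]      # list key: no warning
    gb['b']             # single key: no warning

print(len(caught))      # 1
```

Overriding `__getitem__` in the subclass keeps the shared selection logic untouched, so other users of the base class are unaffected by the warning.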

yehoshuadimarsky · 2019-12-29T23:29:54Z

Although I will confess to being a bit surprised (disappointed?) in the total lack of response from the pandas developers to my question above. Pandas has a reputation as being a welcoming OSS community, and the question was well-researched and clearly stated, so I thought I'd get a bit more feedback than that. Guess I'll attribute it to the holiday season.

jreback · 2019-12-29T23:40:01Z

@yehoshuadimarsky we have 3000+ issues and constant comments - to be honest we barely have time to triage on the PRs

even really important things are not necessarily discussed at length

just like everyone else has limited time - the best way to prompt a discussion is to push a change

yehoshuadimarsky · 2019-12-29T23:43:21Z

Totally understand. Thanks for acknowledging, and more importantly, thanks for the incredibly important work you do in maintaining pandas.

kusaasira · 2020-06-24T09:14:13Z

@yehoshuadimarsky , this was closed, right?

yehoshuadimarsky · 2020-06-25T00:48:34Z

yes

Thuoq · 2021-05-08T07:21:16Z

Thanks for this. In my version of pandas I use group[[colone_name,]], which is useful and makes the code clearer.

jorisvandenbossche added the API Design label Nov 8, 2018

WillAyd added the Groupby label Nov 13, 2018

WillAyd added the Deprecate Functionality to remove in pandas label Nov 14, 2018

github-actions bot assigned yehoshuadimarsky Dec 25, 2019

yehoshuadimarsky added a commit to yehoshuadimarsky/pandas that referenced this issue Dec 29, 2019

BUG: DataFrame GroupBy indexing with single items DeprecationWarning(p…

568c1ad

…andas-dev#23566)

yehoshuadimarsky mentioned this issue Dec 29, 2019

DEPR: DataFrame GroupBy indexing with single items DeprecationWarning #30546

Merged

5 tasks

jreback added this to the 1.0 milestone Jan 3, 2020

jreback closed this as completed in #30546 Jan 3, 2020

lorentzenchr pushed a commit to lorentzenchr/scikit-learn that referenced this issue May 18, 2020

MNT pandas 1.0.0 deprectation

3d1ac58

See pandas-dev/pandas#23566

lorentzenchr mentioned this issue May 18, 2020

DOC avoid FutureWarnings for deprecations examples scikit-learn/scikit-learn#17264

Merged

adrinjalali pushed a commit to scikit-learn/scikit-learn that referenced this issue May 18, 2020

DOC avoid FutureWarnings for deprecations examples (#17264)

79b0943

* MNT keyword only in examples * MNT pandas 1.0.0 deprectation See pandas-dev/pandas#23566 * MNT new keyword in 0.23

rhshadrach mentioned this issue Sep 12, 2020

BUG: DataFrameGroupBy.__getitem__ should warn on tuple of length 1 #36302

Open
