Skip to content

API: DataFrameGroupBy column subset selection with single list? #23566

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
jorisvandenbossche opened this issue Nov 8, 2018 · 18 comments · Fixed by #30546
Closed

API: DataFrameGroupBy column subset selection with single list? #23566

jorisvandenbossche opened this issue Nov 8, 2018 · 18 comments · Fixed by #30546
Assignees
Labels
API Design Deprecate Functionality to remove in pandas Groupby
Milestone

Comments

@jorisvandenbossche
Copy link
Member

I wouldn't be surprised if there is already an issue about this, but couldn't directly find one.

When doing a subselection of columns on a DataFrameGroupBy object, both a plain list (so a tuple within the __getitem__ [] brackets) as the double square brackets (a list inside the __getitem__ [] brackets) seems to work:

In [6]: df = pd.DataFrame(np.random.randint(10, size=(10, 4)), columns=['a', 'b', 'c', 'd'])

In [8]: df.groupby('a').sum()
Out[8]: 
    b   c   d
a            
0   0   5   7
3  18   6  12
4  16   6   9
6  10  11  11
9   3   3   0

In [9]: df.groupby('a')['b', 'c'].sum()
Out[9]: 
    b   c
a        
0   0   5
3  18   6
4  16   6
6  10  11
9   3   3

In [10]: df.groupby('a')[['b', 'c']].sum()
Out[10]: 
    b   c
a        
0   0   5
3  18   6
4  16   6
6  10  11
9   3   3

Personally I find this df.groupby('a')['b', 'c'].sum() a bit strange, and inconsistent with how DataFrame indexing works.

Of course, on a DataFrameGroupBy you don't have the possible confusion with indexing multiple dimensions (rows, columns), but still.

cc @jreback @WillAyd

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Nov 8, 2018

You do have ambiguity with tuples though (not that anyone should do that)

In [14]: df = pd.DataFrame(np.random.randint(10, size=(10, 4)), columns=['a', 'b', 'c', ('a', 'b')])

In [15]: df.groupby('c')['a', 'b'].sum()
Out[15]:
    a   b
c
0   1   7
1   6   9
2   7   7
5   9  11
6   8   6
7  10   6
8  11   8

In [16]: df.groupby('c')[('a', 'b')].sum()
Out[16]:
    a   b
c
0   1   7
1   6   9
2   7   7
5   9  11
6   8   6
7  10   6
8  11   8

I think both of those are incorrect. It should rather be

In [19]: df.groupby('c').sum()[('a', 'b')]
Out[19]:
c
0     7
1     3
2     5
5     7
6     8
7    16
8    11
Name: (a, b), dtype: int64

@WillAyd
Copy link
Member

WillAyd commented Nov 13, 2018

I don't disagree here. There is a difference when selecting only one column (specifically returning a Series vs a DataFrame) but when selecting multiple columns it would be more consistent if we ALWAYS required double brackets brackets. I assume this would also yield a simpler implementation.

Maybe a conversation piece for 1.0? Would be a breaking change for sure so probably best served in a major release like that

@TomAugspurger
Copy link
Contributor

I think the hope is for 1.0 to be backwards compatible with 0.25.x.

Do we have a chance to detect this case and throw a FutureWarning (assuming we want to change)?

@jorisvandenbossche
Copy link
Member Author

Yeah, if we want, I would think it should be possible with a deprecation cycle.

@jreback
Copy link
Contributor

jreback commented Nov 14, 2018

this i suspect is actually very common in the wild (not using the double brackets)

but i agree we should deprecate as it is inconsistent

@WillAyd WillAyd added the Deprecate Functionality to remove in pandas label Nov 14, 2018
@yehoshuadimarsky
Copy link
Contributor

yehoshuadimarsky commented Dec 25, 2019

Can I take a crack at this or has it already been fixed?

Also, will this be a part of the 1.0 or other milestones?

@WillAyd
Copy link
Member

WillAyd commented Dec 25, 2019 via email

@yehoshuadimarsky
Copy link
Contributor

Thanks will do.

@yehoshuadimarsky
Copy link
Contributor

take

@yehoshuadimarsky
Copy link
Contributor

yehoshuadimarsky commented Dec 25, 2019

So this is my first time working on pandas code, and I'm a little confused here, so please bear with me. I'm also new to linking to code on GitHub.

As I understand, when an object calls __getitem__ by using brackets, if you pass in several keys, they are implicitly converted to a tuple of one key. So df['a','b'] is really df[('a','b')] under the hood.

I'm having trouble in tracing the code path to figure out where exactly the __getitem__ on the GroupBy is actually implemented here:

  1. DataFrame.groupby is called on the superclass NDFrame here
  2. This eventually creates the specific DataFrameGroupBy object here
  3. Which is a subclass of GroupBy
  4. Which is a subclass of _GroupBy
  5. Which has the mixin named SelectionMixin, defined here
  6. Which implements __getitem__ here
  7. Which, if the key is a list or tuple, returns self._gotitem(list(key), ndim=2)
  8. self._gotitem needs to be implemented by the respective subclasses, which in this case is the DataFrameGroupBy object, and is implemented here
  9. But all this does is simply create an instance of itself (DataFrameGroupBy) with the key (a list/tuple) passed as a slice to the selection parameter
  10. The selection parameter is implemented in the parent _GroupBy object, which sets the internal self._selection attribute to the key here
  11. This is where I'm lost. How does this actually slice the object and only return a subset of it?

Any help here would be greatly appreciated. Thanks.

@yehoshuadimarsky
Copy link
Contributor

@WillAyd @jorisvandenbossche are you able to help point me in the right direction? ☝️

@yehoshuadimarsky
Copy link
Contributor

Just to close the loop on my earlier question, I never fully figured out how the slicing happens, but it seems it happens at some Cython layer via the self._selected_obj attribute and its @cache_readonly decorator. Happily, it didn't end up mattering, as all I needed to do was intercept the call on the DataFrameGroupBy to the __getitem__ method of the SelectionMixin by adding its own __getitem__ method to check if the keys are a tuple and raise the warning there.

@yehoshuadimarsky
Copy link
Contributor

Although I will confess to being a bit surprised (disappointed?) in the total lack of response from the pandas developers to my question above. Pandas has a reputation as being a welcoming OSS community, and the question was well-researched and clearly stated, so I thought I'd get a bit more feedback than that. Guess I'll attribute it to the holiday season.

@jreback
Copy link
Contributor

jreback commented Dec 29, 2019

@yehoshuadimarsky we have 3000+ issues and constant comments - to be honest we barely have time to triage on the PRs

even really important things are not necessarily discussed at length

just like everyone else has limited time - the best way to prompt a discussion is to push a change

@yehoshuadimarsky
Copy link
Contributor

Totally understand. Thanks for acknowledging, and more importantly, thanks for the incredibly important work you do in maintaining pandas.

@jreback jreback added this to the 1.0 milestone Jan 3, 2020
lorentzenchr pushed a commit to lorentzenchr/scikit-learn that referenced this issue May 18, 2020
lorentzenchr pushed a commit to lorentzenchr/scikit-learn that referenced this issue May 18, 2020
adrinjalali pushed a commit to scikit-learn/scikit-learn that referenced this issue May 18, 2020
* MNT keyword only in examples

* MNT pandas 1.0.0 deprectation

See pandas-dev/pandas#23566

* MNT new keyword in 0.23
adrinjalali pushed a commit to adrinjalali/scikit-learn that referenced this issue May 18, 2020
* MNT keyword only in examples

* MNT pandas 1.0.0 deprectation

See pandas-dev/pandas#23566

* MNT new keyword in 0.23
adrinjalali pushed a commit to scikit-learn/scikit-learn that referenced this issue May 19, 2020
* MNT keyword only in examples

* MNT pandas 1.0.0 deprectation

See pandas-dev/pandas#23566

* MNT new keyword in 0.23
@kusaasira
Copy link

@yehoshuadimarsky , this was closed, right?

@yehoshuadimarsky
Copy link
Contributor

yes

viclafargue pushed a commit to viclafargue/scikit-learn that referenced this issue Jun 26, 2020
* MNT keyword only in examples

* MNT pandas 1.0.0 deprectation

See pandas-dev/pandas#23566

* MNT new keyword in 0.23
@Thuoq
Copy link

Thuoq commented May 8, 2021

Thanks for , I at version Pandas I using group[[colone_name,]] so it is useful and clear code better

smileyrekiere added a commit to greearb/lanforge-scripts that referenced this issue Aug 31, 2023
…uble [[]]

https://stackoverflow.com/questions/60999753/pandas-future-warning-indexing-with-multiple-keys
pandas-dev/pandas#23566

Verification:
./test_l3.py --lfmgr 192.168.0.104 --test_duration 20s --polling_interval 5s --upstream_port 1.1.eth2 --radio 'radio==wiphy2,stations==1,ssid==axe11000_5g,ssid_pw==lf_axe11000_5g,security==wpa2,wifi_mode==0,wifi_settings==wifi_settings,enable_flags==(ht160_enable&&wpa2_enable)' --endp_type mc_udp --rates_are_totals --side_a_min_bps=20000 --side_b_min_bps=3000000 --tos BK --log_level debug --csv_data_to_report


Signed-off-by: Chuck SmileyRekiere <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Deprecate Functionality to remove in pandas Groupby
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants