BUG: slicing DataFrameGroupBy to SeriesGroupBy doesn't propagate dropna #35745

Closed
3 tasks done
arw2019 opened this issue Aug 15, 2020 · 4 comments

@arw2019
Member

arw2019 commented Aug 15, 2020

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


xref #9959, #35444

Slicing a DataFrameGroupBy object to SeriesGroupBy doesn't correctly propagate dropna.

The DataFrameGroupBy -> DataFrameGroupBy case is being handled in #35444. Opening this because, as far as I can tell, the DataFrameGroupBy -> SeriesGroupBy variant is an independent bug.

Code Sample, a copy-pastable example

In [11]: df = pd.DataFrame({"a": [1], "b": [2], "c": [3]}) 
    ...: gb = df.groupby('a', dropna=False)                                                                                                                                                                       

In [12]: gb.dropna                                                                                                                                                                                                
Out[12]: False

In [13]: gb['b'].dropna                                                                                                                                                                                   
Out[13]: True

Expected Output

I'd like to see:

In [22]: gb.dropna == gb['b'].dropna                                                                                                                                                                              
Out[22]: True
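
The expected behaviour could be pinned down with a check along these lines. This is only a sketch: it assumes a pandas version in which the fix has landed, the helper name `sliced_groupby_keeps_dropna` is hypothetical, and the NaN key is added here so that the setting has a visible effect on the result.

```python
import numpy as np
import pandas as pd


def sliced_groupby_keeps_dropna(dropna: bool) -> bool:
    """Return True if gb["b"] inherits `dropna` from the parent groupby.

    Hypothetical helper for illustration; not part of pandas.
    """
    # One key is NaN so that dropna changes the number of groups.
    df = pd.DataFrame({"a": [1.0, np.nan], "b": [2, 3]})
    gb = df.groupby("a", dropna=dropna)
    sliced = gb["b"]

    # The attribute on the sliced object should match the parent...
    same_flag = sliced.dropna == gb.dropna
    # ...and an aggregation should produce the same number of groups.
    same_groups = len(sliced.sum()) == len(gb.sum())
    return same_flag and same_groups


for flag in (True, False):
    print(flag, sliced_groupby_keeps_dropna(flag))
```

On a version affected by this bug, the `dropna=False` case would be expected to fail, because the sliced object falls back to the default `dropna=True`.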

Output of pd.show_versions()

INSTALLED VERSIONS

commit : a3f5c6a
python : 3.8.3.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.0-42-generic
Version : #46-Ubuntu SMP Fri Jul 10 00:24:02 UTC 2020
machine : x86_64
processor :
byteorder : little
LC_ALL : C.UTF-8
LANG : C.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.2.0.dev0+99.ga3f5c6a5a.dirty
numpy : 1.18.5
pytz : 2020.1
dateutil : 2.8.1
pip : 20.1.1
setuptools : 49.1.0.post20200704
Cython : 0.29.21
pytest : 5.4.3
hypothesis : 5.19.0
sphinx : 3.1.1
blosc : None
feather : None
xlsxwriter : 1.2.9
lxml.etree : 4.5.2
html5lib : 1.1
pymysql : None
psycopg2 : 2.8.5 (dt dec pq3 ext lo64)
jinja2 : 2.11.2
IPython : 7.16.1
pandas_datareader: None
bs4 : 4.9.1
bottleneck : 1.3.2
fsspec : 0.7.4
fastparquet : 0.4.0
gcsfs : 0.6.2
matplotlib : 3.2.2
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.4
pandas_gbq : None
pyarrow : 0.17.1
pytables : None
pyxlsb : None
s3fs : 0.4.2
scipy : 1.5.0
sqlalchemy : 1.3.18
tables : 3.6.1
tabulate : 0.8.7
xarray : 0.15.1
xlrd : 1.2.0
xlwt : 1.3.0
numba : 0.50.1

@arw2019 added the Bug and Needs Triage labels Aug 15, 2020
@rhshadrach
Member

Thanks for reporting this. I think it is part of #35443.

@arw2019
Member Author

arw2019 commented Aug 16, 2020

@rhshadrach I checked, it's not. This refers to the ndim=1 part of _gotitem:

def _gotitem(self, key, ndim: int, subset=None):
    """
    sub-classes to define
    return a sliced object

    Parameters
    ----------
    key : string / list of selections
    ndim : 1,2
        requested ndim of result
    subset : object, default None
        subset to act on
    """
    if ndim == 2:
        if subset is None:
            subset = self.obj
        return DataFrameGroupBy(
            subset,
            self.grouper,
            selection=key,
            grouper=self.grouper,
            exclusions=self.exclusions,
            as_index=self.as_index,
            observed=self.observed,
        )
    elif ndim == 1:
        if subset is None:
            subset = self.obj[key]
        return SeriesGroupBy(
            subset, selection=key, grouper=self.grouper, observed=self.observed
        )
    raise AssertionError("invalid ndim for _gotitem")

whereas currently in #35443 you fix the ndim=2 part. Would you be happy to handle this in #35443 or should I open a separate PR?
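
To illustrate the mechanism, here is a minimal toy model of the ndim=1 branch. It is not pandas code: `ToyGroupBy`, `getitem_buggy` and `getitem_fixed` are hypothetical names, and the only assumption carried over from the snippet above is that the sliced object is built without passing `dropna`, so it falls back to the constructor default.

```python
class ToyGroupBy:
    """Hypothetical stand-in for a groupby object; not pandas code."""

    def __init__(self, obj, key, observed=False, dropna=True):
        self.obj = obj
        self.key = key
        self.observed = observed
        self.dropna = dropna

    def getitem_buggy(self, column):
        # Mirrors the ndim == 1 branch: dropna is not forwarded, so the
        # new object silently falls back to the default (True).
        return ToyGroupBy(self.obj[column], self.key, observed=self.observed)

    def getitem_fixed(self, column):
        # Forward every option the parent object was created with.
        return ToyGroupBy(
            self.obj[column],
            self.key,
            observed=self.observed,
            dropna=self.dropna,
        )


parent = ToyGroupBy({"b": [2], "c": [3]}, key="a", dropna=False)
print(parent.getitem_buggy("b").dropna)  # True: the parent's setting is lost
print(parent.getitem_fixed("b").dropna)  # False: the parent's setting is kept
```

In this model the fix amounts to one extra keyword argument in the constructor call, which is consistent with the change being small enough to fold into an existing PR.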

I'm interested in getting this fixed because I have a solution to #35612 that relies on dropna propagating correctly

@rhshadrach
Member

Hmm, #35443 is an issue I opened when working on #35444 specifically to talk about the ndim=1 case, as I saw it would require fixes that could be considered API changes.

That said, propagating dropna would not involve anything that could be considered an API change, I'll add that (and possibly others) to #35444.

@arw2019
Member Author

arw2019 commented Aug 16, 2020

> Hmm, #35443 is an issue I opened when working on #35444 specifically to talk about the ndim=1 case, as I saw it would require fixes that could be considered API changes.
>
> That said, propagating dropna would not involve anything that could be considered an API change, I'll add that (and possibly others) to #35444.

Ah yeah, I should have looked. If you're happy to add just the dropna argument to the ndim=1 case, that would be fantastic.

Closing this in favour of #35443 as it's a duplicate

@arw2019 closed this as completed Aug 16, 2020
@simonjayhawkins removed the Needs Triage label Aug 17, 2020