GroupBy any/all Fails with Duplicate Column Names #21668

Closed
WillAyd opened this issue Jun 28, 2018 · 3 comments · Fixed by #29124

@WillAyd
Member

WillAyd commented Jun 28, 2018

Code Sample, a copy-pastable example if possible

>>> import pandas as pd
>>> df = pd.DataFrame([[True, True, True]], columns=['key', 'a', 'a'])
>>> df.groupby('key').any()

ValueError: Buffer has wrong number of dimensions (expected 1, got 2)

Problem description

This throws because of the below line:

mask = isnull(obj.values).view(np.uint8)

Cython expects a one-dimensional mask array, but the obj yielded by self._iterate_slices() and used in the statement above is two-dimensional when column names are duplicated. As a result, mask is a two-dimensional array when it is passed to the Cython function, which raises the ValueError.

I'm not sure if this is the expected behavior of self._iterate_slices() or not. If so, there may be implications in other parts of the module.
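
The dimension mismatch can be reproduced with public API alone. The sketch below is an illustration, not the GroupBy internals: it assumes that selecting the duplicated label with df["a"] stands in for the obj that the slice iteration yields, and it uses pd.isnull in place of the internal isnull import.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[True, True, True]], columns=["key", "a", "a"])

# Assumption: label-based selection stands in for the slice handed to Cython.
# With a duplicated label this is a DataFrame, so .values is two-dimensional.
dup_obj = df["a"]
dup_mask = pd.isnull(dup_obj.values).view(np.uint8)

# With a unique label the selection is a Series and the mask is one-dimensional.
unique_obj = df["key"]
unique_mask = pd.isnull(unique_obj.values).view(np.uint8)

print(type(dup_obj).__name__, dup_mask.ndim)
print(type(unique_obj).__name__, unique_mask.ndim)
```

A routine typed to accept a one-dimensional buffer would accept unique_mask but reject dup_mask, which matches the error message in the report.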

Output of pd.show_versions()

INSTALLED VERSIONS

commit: 0801b8c
python: 3.6.4.final.0
python-bits: 64
OS: Darwin
OS-release: 17.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.0.dev0+184.g0801b8c90
pytest: 3.4.1
pip: 10.0.1
setuptools: 38.5.1
Cython: 0.27.3
numpy: 1.14.1
scipy: 1.0.0
pyarrow: 0.8.0
xarray: 0.10.0
IPython: 6.2.1
sphinx: 1.7.0
patsy: 0.5.0
dateutil: 2.6.1
pytz: 2018.3
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: 0.4.0
matplotlib: 2.1.2
openpyxl: 2.5.0
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 4.1.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.5
pymysql: 0.8.0
psycopg2: 2.7.4 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: 0.1.3
fastparquet: 0.1.4
pandas_gbq: 0.4.1
pandas_datareader: None
gcsfs: None

@WillAyd
Member Author

WillAyd commented Jun 30, 2018

Just for a little more clarity on where I think this originates, here's the GroupBy iteration behavior below:

In [19]: df = pd.DataFrame([[True, True, True]], columns=['key', 'a', 'a'])
In [20]: grp = df.groupby('key')
In [21]: grp_iter = grp._iterate_slices()

In [22]: next(grp_iter)
Out[22]: 
('a',       a     a
 0  True  True)

In [23]: next(grp_iter)
Out[23]: 
('a',       a     a
 0  True  True)

In [24]: next(grp_iter)
StopIteration

The duplicated names cause the slices to be multi-dimensional and repetitive. I'm not sure whether this causes problems elsewhere in the code base, but it is at least responsible for the issue in the original post.
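
The same repetition shows up outside GroupBy when columns are walked by label. The sketch below compares label-based and position-based selection using only public API. It is a hypothetical illustration and makes no claim about how _iterate_slices is implemented or how #29124 changes it.

```python
import pandas as pd

df = pd.DataFrame([[True, True, True]], columns=["key", "a", "a"])

# Positions of the non-key columns: both of them carry the label 'a'.
value_positions = [i for i, name in enumerate(df.columns) if name != "key"]

# Selecting by label returns every column named 'a' each time,
# so the same two-column DataFrame comes back twice.
by_label = [df[df.columns[i]] for i in value_positions]

# Selecting by position returns exactly one column per step, as a Series.
by_position = [df.iloc[:, i] for i in value_positions]

for sliced in by_label:
    print(type(sliced).__name__, sliced.shape)
for sliced in by_position:
    print(type(sliced).__name__, sliced.shape)
```

Position-based selection keeps each slice one-dimensional regardless of how the columns are named.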

@mroeschke
Member

I think the correct result is returned on master. Could use a test.

In [206]: >>> df = pd.DataFrame([[True, True, True]], columns=['key', 'a', 'a'])
     ...: >>> df.groupby('key').any()
Out[206]:
         a
key
True  True

In [207]: df
Out[207]:
    key     a     a
0  True  True  True

In [208]: pd.__version__
Out[208]: '0.26.0.dev0+682.g08ab156eb'

mroeschke added the "good first issue" and "Needs Tests" labels and removed the "Groupby" label on Oct 27, 2019
@WillAyd
Member Author

WillAyd commented Oct 28, 2019

Hmm, well, this is still not technically correct since it drops one of the columns, though it is interesting that it no longer raises. The fix is in #29124. I'll see what I can find on why this no longer raises, though.
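
A regression check along the following lines would cover both the original error and the dropped column. It is a sketch under assumptions, not the test added in #29124: the helper name is hypothetical, and it assumes a pandas version that already includes the fix.

```python
import pandas as pd


def check_bool_agg_keeps_duplicate_columns(func_name):
    """Hypothetical helper: run any/all on the frame from the report."""
    df = pd.DataFrame([[True, True, True]], columns=["key", "a", "a"])
    result = getattr(df.groupby("key"), func_name)()

    # Both columns labelled 'a' should survive the aggregation.
    assert list(result.columns) == ["a", "a"]
    # Every input value is True, so any() and all() should both give True.
    assert result.to_numpy().all()
    return result


results = {
    name: check_bool_agg_keeps_duplicate_columns(name) for name in ("any", "all")
}
```

On a version without the fix, the column assertion is expected to fail for the single-column output shown earlier in the thread.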

mroeschke added the "Bug" and "Groupby" labels and removed the "Needs Tests" and "good first issue" labels on Oct 28, 2019