GroupBy any/all Fails with Duplicate Column Names #21668
Comments
Just for a little more clarity on where I think this originates, here's the GroupBy iteration behavior:
In [19]: df = pd.DataFrame([[True, True, True]], columns=['key', 'a', 'a'])
In [20]: grp = df.groupby('key')
In [21]: grp_iter = grp._iterate_slices()
In [22]: next(grp_iter)
Out[22]:
('a',       a     a
0  True  True)
In [23]: next(grp_iter)
Out[23]:
('a',       a     a
0  True  True)
In [24]: next(grp_iter)
StopIteration
The duplicated names cause the slices to be multi-dimensional and repetitive. Not sure if this causes problems elsewhere in the code base, but it is at least responsible for the issue in the original post.
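The same effect can be seen without touching GroupBy internals. As a small sketch using only public indexing (the variable names here are illustrative, not from the pandas source), selecting a unique label gives a one-dimensional Series, while selecting a duplicated label gives a two-dimensional DataFrame holding every column with that name:

```python
import pandas as pd

df = pd.DataFrame([[True, True, True]], columns=["key", "a", "a"])

# A unique label selects a single column: a 1-D Series.
unique_slice = df["key"]

# A duplicated label selects every matching column: a 2-D DataFrame.
dup_slice = df["a"]

print(type(unique_slice).__name__, unique_slice.ndim)
print(type(dup_slice).__name__, dup_slice.ndim, dup_slice.shape)
```

Any code that selects columns by label and assumes each result is one-dimensional will therefore see a 2-D object once per duplicated name.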
I think the correct result is returned on master. Could use a test.
Hmm, well, that is still not technically correct since it drops one of the columns, though it is interesting that this no longer raises. The fix is in #29124. I'll see what I can find on why this no longer raises, though.
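One way to keep both duplicated columns and still hand out one-dimensional slices is to iterate by position rather than by label. The following is only a sketch of that idea, not the change made in #29124; `iter_columns_by_position` is a hypothetical helper, and the second column is set to False here purely so the two `a` columns can be told apart:

```python
import pandas as pd

df = pd.DataFrame([[True, False, True]], columns=["key", "a", "a"])


def iter_columns_by_position(frame, exclude=()):
    """Yield (label, Series) pairs, one per column position.

    Hypothetical helper: positional access with iloc always returns a
    1-D Series, even when several columns share the same label.
    """
    for i, label in enumerate(frame.columns):
        if label in exclude:
            continue
        yield label, frame.iloc[:, i]


slices = list(iter_columns_by_position(df, exclude=("key",)))
for label, col in slices:
    print(label, col.ndim, col.tolist())
```

Each duplicated column is visited exactly once, so nothing is repeated and nothing is dropped.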
Code Sample, a copy-pastable example if possible
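The sample itself is missing from this copy of the issue. A minimal reproduction, assuming the same frame as in the IPython session from the comments and a call to `any()` as the title suggests, might look like the following. `run_any` is a hypothetical wrapper added only so the script finishes either way; whether the call raises depends on the pandas version.

```python
import pandas as pd

# Frame with a duplicated column name, as in the IPython session in the comments.
df = pd.DataFrame([[True, True, True]], columns=["key", "a", "a"])


def run_any(frame):
    """Run groupby('key').any() and report whether it raised ValueError."""
    try:
        return "ok", frame.groupby("key").any()
    except ValueError as exc:
        return "raised", exc


status, outcome = run_any(df)
print(status)
print(outcome)
```

The same call with `all()` in place of `any()` exercises the other method named in the title.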
Problem description
This throws because of the below line:
pandas/pandas/core/groupby/groupby.py
Line 2024 in 0801b8c
Cython expects a one-dimensional array of masks, but self._iterate_slices(), which contains the above statement, will provide a multi-dimensional array. Therefore, when the Cython function is called with mask, it receives a multi-dimensional object, which causes the ValueError.
I'm not sure whether this is the expected behavior of self._iterate_slices(). If so, there may be implications in other parts of the module.
Output of pd.show_versions()
INSTALLED VERSIONS
commit: 0801b8c
python: 3.6.4.final.0
python-bits: 64
OS: Darwin
OS-release: 17.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.24.0.dev0+184.g0801b8c90
pytest: 3.4.1
pip: 10.0.1
setuptools: 38.5.1
Cython: 0.27.3
numpy: 1.14.1
scipy: 1.0.0
pyarrow: 0.8.0
xarray: 0.10.0
IPython: 6.2.1
sphinx: 1.7.0
patsy: 0.5.0
dateutil: 2.6.1
pytz: 2018.3
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: 0.4.0
matplotlib: 2.1.2
openpyxl: 2.5.0
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 4.1.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.5
pymysql: 0.8.0
psycopg2: 2.7.4 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: 0.1.3
fastparquet: 0.1.4
pandas_gbq: 0.4.1
pandas_datareader: None
gcsfs: None