unique aggregation unexpectedly returning different type #22558

Closed
ibackus opened this issue Sep 1, 2018 · 5 comments · Fixed by #47603
Labels:
Algos (non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff)
Dtype Conversions (unexpected or buggy dtype conversions)
good first issue
Needs Tests (unit tests needed to prevent regressions)

Comments

@ibackus

ibackus commented Sep 1, 2018

Code Sample

import pandas as pd

x1 = pd.DataFrame({'a': [1, 2, 3], 'b': [1, 1, 1]})
x2 = pd.DataFrame({'a': [2, 2, 2], 'b': [1, 1, 1]})
aggregation = {'a': 'unique', 'b': 'unique'}

agg1 = x1.agg(aggregation)
agg2 = x2.agg(aggregation)

print("First aggregation:", type(agg1))
print(agg1)

print("Second aggregation:", type(agg2))
print(agg2)

Output

First aggregation: <class 'pandas.core.series.Series'>
a    [1, 2, 3]
b          [1]
dtype: object
Second aggregation: <class 'pandas.core.frame.DataFrame'>
   a  b
0  2  1

Problem description

When performing 'unique' aggregations on a dataframe, the results can be returned as different types in an unexpected manner.

Generally, when performing a 'unique' aggregation on several columns of a dataframe as done above, a pandas.Series of numpy arrays is returned, with one element per aggregation column. This, I think, is the expected behavior, and is demonstrated in the first aggregation above.

However, there is a special case. When all aggregation columns have exactly 1 unique element, a pandas.DataFrame with one row is returned instead. I'm pretty sure this is unintended behavior, and it requires special case handling when doing such aggregations.
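
One way to sidestep the special case, sketched here on the assumption that only the per-column unique values are needed, is to skip agg and assemble the Series by hand. The name unique_per_column is a hypothetical helper for illustration, not part of pandas.

```python
import numpy as np
import pandas as pd


def unique_per_column(frame, columns):
    """Hypothetical helper: one array of unique values per column, always a Series."""
    # Fill an object array element by element so that equal-length results
    # are never interpreted as a single 2-D block.
    values = np.empty(len(columns), dtype=object)
    for i, col in enumerate(columns):
        values[i] = frame[col].unique()
    return pd.Series(values, index=list(columns))


x1 = pd.DataFrame({'a': [1, 2, 3], 'b': [1, 1, 1]})
x2 = pd.DataFrame({'a': [2, 2, 2], 'b': [1, 1, 1]})

res1 = unique_per_column(x1, ['a', 'b'])
res2 = unique_per_column(x2, ['a', 'b'])
print(type(res1), type(res2))
```

Because the container is built explicitly, the return type does not depend on how many unique values each column happens to have.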

Expected Output

First aggregation: <class 'pandas.core.series.Series'>
a    [1, 2, 3]
b          [1]
dtype: object
Second aggregation: <class 'pandas.core.series.Series'>
a          [2]
b          [1]
dtype: object

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Darwin
OS-release: 17.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.1
pytest: None
pip: 9.0.3
setuptools: 39.0.1
Cython: None
numpy: 1.14.5
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: 1.7.5
patsy: 0.5.0
dateutil: 2.7.0
pytz: 2018.3
blosc: None
bottleneck: None
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.2.0
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: 1.2.5
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

gfyoung added the Algos and API Design labels Sep 1, 2018
@gfyoung
Member

gfyoung commented Sep 1, 2018

How weird! I would have expected a Series of lists for both as you said. Investigation and PR are welcome!

@alanbato
Contributor

alanbato commented Sep 5, 2018

After doing some digging I was able to find the cause for the mismatch in behavior.

pandas/pandas/core/base.py

Lines 538 to 544 in a5fe9cf

try:
    result = DataFrame(result)
except ValueError:
    # we have a dict of scalars
    result = Series(result,
                    name=getattr(self, 'name', None))

After getting the result of the 'unique' aggregation as an OrderedDict,
OrderedDict([('a', array([1, 2, 3])), ('b', array([1]))]) and
OrderedDict([('a', array([2])), ('b', array([1]))]),
the code I'm linking tries to construct a DataFrame from it, and returns a Series of arrays if the construction raises a ValueError. In other words, when the arrays in the values of result all have the same length, a DataFrame is returned; otherwise a Series is returned.
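
The fallback can be reproduced outside pandas internals with a small sketch. The function wrap_result below is a hypothetical stand-in that only mimics the quoted try/except; it is not the actual pandas code path, and it omits the name argument.

```python
from collections import OrderedDict

import numpy as np
import pandas as pd


def wrap_result(result):
    """Hypothetical stand-in for the quoted try/except block."""
    try:
        # Succeeds only when every array has the same length.
        return pd.DataFrame(result)
    except ValueError:
        # Arrays of different lengths cannot form columns, so fall back.
        return pd.Series(result)


unequal = OrderedDict([('a', np.array([1, 2, 3])), ('b', np.array([1]))])
equal = OrderedDict([('a', np.array([2])), ('b', np.array([1]))])

wrapped_unequal = wrap_result(unequal)
wrapped_equal = wrap_result(equal)
print(type(wrapped_unequal), type(wrapped_equal))
```

With these inputs the first call takes the except branch and the second does not, which matches the type mismatch reported in the issue.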

I've confirmed that we can get the desired behavior if we don't try to cast to a DataFrame at all.

Thoughts?

@gfyoung
Member

gfyoung commented Sep 5, 2018

I say give it a try, run the tests, and see what happens!

mroeschke added the Apply and Bug labels and removed the API Design and Algos labels May 13, 2020
@mroeschke
Member

These look correct on master now. Could use a test

In [24]: import pandas as pd
    ...:
    ...: x1 = pd.DataFrame({'a': [1, 2, 3], 'b': [1, 1, 1]})
    ...: x2 = pd.DataFrame({'a': [2, 2, 2], 'b': [1, 1, 1]})
    ...: aggregation = {'a': 'unique', 'b': 'unique'}
    ...:
    ...: agg1 = x1.agg(aggregation)
    ...: agg2 = x2.agg(aggregation)
    ...:
    ...: print("First aggregation:", type(agg1))
    ...: print(agg1)
    ...:
    ...: print("Second aggregation:", type(agg2))
    ...: print(agg2)
First aggregation: <class 'pandas.core.series.Series'>
a    [1, 2, 3]
b          [1]
dtype: object
Second aggregation: <class 'pandas.core.series.Series'>
a    [2]
b    [1]
dtype: object
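
A regression test along these lines could pin the behavior down. This is only a sketch, assuming a pandas version in which the output above holds; the function name check_unique_agg_is_series is hypothetical, and a test merged into pandas may be structured differently.

```python
import pandas as pd


def check_unique_agg_is_series(data, expected):
    """Hypothetical check: a 'unique' aggregation yields a Series of arrays."""
    frame = pd.DataFrame(data)
    result = frame.agg({col: 'unique' for col in frame.columns})
    assert isinstance(result, pd.Series)
    # Compare element by element, since each value is itself an array.
    assert {col: list(result[col]) for col in result.index} == expected
    return result


agg1 = check_unique_agg_is_series(
    {'a': [1, 2, 3], 'b': [1, 1, 1]}, {'a': [1, 2, 3], 'b': [1]}
)
agg2 = check_unique_agg_is_series(
    {'a': [2, 2, 2], 'b': [1, 1, 1]}, {'a': [2], 'b': [1]}
)
```

Running both inputs through the same check covers the general case and the single-unique-value case that used to return a DataFrame.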

mroeschke added the good first issue and Needs Tests labels and removed the Apply and Bug labels Jun 22, 2021
@srotondo
Contributor

take

srotondo pushed a commit to srotondo/pandas that referenced this issue Jul 5, 2022
srotondo pushed a commit to srotondo/pandas that referenced this issue Jul 5, 2022
srotondo pushed a commit to srotondo/pandas that referenced this issue Jul 6, 2022
mroeschke pushed a commit that referenced this issue Jul 7, 2022
* TST: Added test for consistent type with unique agg #22558

* TST: Added test for consistent type with unique agg #22558

* TST: Moved and restructured test #22558

* TST: Added test for nested series #22400

* TST: Added equality test for nested series #22400

Co-authored-by: Steven Rotondo <[email protected]>
jreback added the Dtype Conversions and Algos labels Jul 8, 2022
@jreback jreback added this to the 1.5 milestone Jul 8, 2022
mroeschke pushed a commit that referenced this issue Jul 8, 2022
* TST: Added test for consistent type with unique agg #22558

* TST: Added test for consistent type with unique agg #22558

* TST: Moved and restructured test #22558

* TYP: Fixed mypy issues in frequencies

* TYP: Removed accidental inclusion

Co-authored-by: Steven Rotondo <[email protected]>
srotondo pushed a commit to srotondo/pandas that referenced this issue Jul 15, 2022
mroeschke pushed a commit that referenced this issue Jul 26, 2022
* TST: Added test for consistent type with unique agg #22558

* TST: Added test for consistent type with unique agg #22558

* TST: Moved and restructured test #22558

* TST: Moved test to different file #22558

* TST: Changed scalars to 1-element lists

Co-authored-by: Steven Rotondo <[email protected]>
6 participants