BUG: pandas groupby.apply return type changes depending on number of unique groupkeys for same custom-function #54992


Open
3 tasks done
Jakobhenningjensen opened this issue Sep 4, 2023 · 12 comments
Labels
Apply (Apply, Aggregate, Transform, Map), Bug, Groupby

Comments

@Jakobhenningjensen

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
def return_series(x):
    return pd.Series([x["A"] + x["B"]], index=x.index)


a_one = pd.DataFrame({"A": [1], "B": [2], "C": ["c"]})
a_two = pd.DataFrame({"A": [1, 2], "B": [2, 2], "C": ["c", "d"]})

res_one = a_one.groupby("C").apply(return_series) # Returns dataframe
res_two = a_two.groupby("C").apply(return_series) # Returns series
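Running the example and checking the result types makes the divergence explicit (the setup is repeated here so the snippet is self-contained):

```python
import pandas as pd

def return_series(x):
    # returns a length-1 Series indexed like the group
    return pd.Series([x["A"] + x["B"]], index=x.index)

a_one = pd.DataFrame({"A": [1], "B": [2], "C": ["c"]})
a_two = pd.DataFrame({"A": [1, 2], "B": [2, 2], "C": ["c", "d"]})

res_one = a_one.groupby("C").apply(return_series)
res_two = a_two.groupby("C").apply(return_series)

print(type(res_one).__name__)  # DataFrame
print(type(res_two).__name__)  # Series
```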

Issue Description

When using groupby.apply with the same custom function, the result type is not consistent: it is a Series when there are multiple group keys, but a DataFrame when only one key value is present.

Expected Behavior

It is understandable that groupby.apply can return either a Series or a DataFrame depending on the function, but I would assume that the same function always returns the same type, i.e. that the example above consistently returns either a DataFrame or a Series.

Installed Versions

INSTALLED VERSIONS

commit : ba1cccd
python : 3.11.4.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.22621
machine : AMD64
processor : Intel64 Family 6 Model 141 Stepping 1, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : Danish_Denmark.1252
pandas : 2.1.0
numpy : 1.24.1
pytz : 2023.3
dateutil : 2.8.2
setuptools : 65.5.0
pip : 23.1.2
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : 1.11.1
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : None
pandas_datareader : None
bs4 : None
bottleneck : None
dataframe-api-compat: None
fastparquet : None
fsspec : 2023.6.0
gcsfs : None
matplotlib : 3.7.2
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : 0.19.2
pyarrow : 12.0.1
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.11.1
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None

@Jakobhenningjensen Jakobhenningjensen added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Sep 4, 2023
@rhshadrach rhshadrach added Groupby Apply Apply, Aggregate, Transform, Map and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Sep 4, 2023
@rhshadrach
Member

This is similar in spirit to #42749. Apply tries to infer how to best reshape the result, and uses whether the returned objects are indexed the same to make some decisions. When there is one group, it can't tell the difference between a UDF that always returns the same index vs one that does not. cc @topper-123

@Jakobhenningjensen - it looks like the example you posted is minimal (which is great), but I'd guess not similar to what you're really trying to accomplish via apply. I'd also be curious to understand the real nature of your operation if you're able to share. We are trying to understand how users are using apply and whether agg/transform are better alternatives.

@topper-123
Contributor

topper-123 commented Sep 4, 2023

Yeah, this method is inconsistent for a few known reasons. There are some other issues with groupby.apply as well, so it would help if you could give the details @rhshadrach asked about, so we can look into improving it or finding alternatives.

@Jakobhenningjensen
Author

Jakobhenningjensen commented Sep 4, 2023

@rhshadrach and @topper-123 Thank you for the swift reply; I'll try to elaborate with an example as close to the real code as possible.

Say I want to do some classification using TF-IDF + cosine similarity. Since I have a lot of users, I want to train a TF-IDF model for each user (Id).

I create a simple class which does the following:

  • Fit
    • Group all data within each Id
    • Transform Sound using TfIdf and store that (X) along with the corresponding animal (y)
    • Save the TF-IDF model, X and y in a dictionary, id_dict, with the Id as key
  • Predict
    • Group data within each Id
    • Fetch the information from id_dict, if possible. If not, return empty predictions
    • Transform the new data using the TF-IDF model, and get the y which has the greatest cosine similarity
    • Rearrange the predictions to ensure that the output order is the same as the input order (this is where it fails)
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

class AnimalPredictor:
    def __init__(self):
        self.id_dict = {}

    def _set_model_and_X_and_y(self, data_id):
        id_ = data_id.name
        tf_idf = TfidfVectorizer()
        X = tf_idf.fit_transform(data_id["Sound"])
        y = data_id["Animal"]

        temp_dict = {"model": tf_idf, "X": X, "y": y}
        self.id_dict[id_] = temp_dict

    def fit(self, X):
        X.groupby("Id").apply(self._set_model_and_X_and_y)

    def _predict_within_id(self, X_id):
        id_ = X_id.name
        try:
            id_data = self.id_dict[id_]
        except KeyError:
            return pd.Series([None] * len(X_id), index=X_id.index)

        model = id_data["model"]
        x_train = id_data["X"]
        y_train = id_data["y"]

        data_transformed = model.transform(X_id["Sound"])
        similarity = data_transformed @ x_train.T
        max_similarity = similarity.argmax(axis=1).A1
        predictions = y_train.iloc[max_similarity]
        predictions.index = X_id.index
        return predictions

    def predict(self, X):
        predictions = X.groupby("Id", group_keys=False).apply(self._predict_within_id)
        predictions = predictions.loc[X.index] #Very important that we get the same input/output order
        return predictions


animal_predictor = AnimalPredictor()

data_train = pd.DataFrame(
    {"Id": [1, 1, 1, 2, 2, 3, 3, 3], "Animal": ["dog", "dog", "bird", "bird", "cat", "dog", "cat", "bird"],
     "Sound": ["woof wow", "baw waw", "tweety", "peepy tweety", "miav", "miav woof", "hfhfhf", "hello world"]})

animal_predictor.fit(data_train)

data_test_multiple = pd.DataFrame({"Id": [1, 2, 3], "Sound": ["woof wow", "baw waw", "tweety"]}, index=[10, 9, 17]) 
data_test_single = pd.DataFrame({"Id": [1], "Sound": ["woof wow"]}) 

predictions_multiple = animal_predictor.predict(data_test_multiple) # Works fine
predictions_single = animal_predictor.predict(data_test_single) # Throws index error

The issue is that I want to rearrange the output from groupby.apply(_predict_within_id), but if we only have one unique Id then that result is a DataFrame which has the old index as columns.

If we have multiple Ids, the result is a Series, which works out well.

My main issue is not finding another solution; if we always got, say, a DataFrame returned, that would be fine. The issue is that the output is inconsistent, even though the function _predict_within_id always returns the same type: a Series. I understand that groupby.apply can return either a DataFrame or a Series depending on the function, but I have always assumed that the same function would return the same type; otherwise we have to write checks every time we use apply, covering (very many) cases.

I do admit that I might be using groupby.apply poorly here, and it might not be suited for such cases, but I still find it very dangerous that the output type varies.
I wrote the code, used some test data to verify the expected behaviour, and that was fine. Then in production it suddenly failed, because one request happened to contain data from only a single user; an issue like that is really difficult to track down.
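For what it's worth, one defensive pattern is to normalize the result of apply before reindexing. The sketch below is illustrative only: per_group is a simplified stand-in for _predict_within_id (it just returns A + B per row, not the real TF-IDF prediction), and normalize_to_series undoes the horizontal stacking that happens in the single-group case:

```python
import pandas as pd

def normalize_to_series(result):
    # When apply stacked horizontally (the single-group case), the
    # original row index ended up as columns; stack it back so the
    # shape matches the multi-group (vertically stacked) output.
    if isinstance(result, pd.DataFrame):
        return result.stack()
    return result

def per_group(x):
    # simplified stand-in for _predict_within_id
    return pd.Series(x["A"] + x["B"], index=x.index)

a_one = pd.DataFrame({"A": [1], "B": [2], "C": ["c"]})
a_two = pd.DataFrame({"A": [1, 2], "B": [2, 2], "C": ["c", "d"]})

out_one = normalize_to_series(a_one.groupby("C").apply(per_group))
out_two = normalize_to_series(a_two.groupby("C").apply(per_group))

print(out_one.tolist())  # [3]
print(out_two.tolist())  # [3, 4]
```

Both outputs are now Series with a (group key, original index) MultiIndex, so a subsequent .loc reorder works the same way in both cases.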

@rhshadrach
Member

rhshadrach commented Sep 5, 2023

@Jakobhenningjensen

I do admit that I might be using the groupby.apply poorly here, and it might not be suited for such cases, but I still find it very dangerous that the output-type varies.

Agreed, but the reason apply does this is for examples like:

def foo(x):
    return pd.Series({'mean': x['b'].mean(), 'median': x['b'].median()})

df = pd.DataFrame({'a': [1, 1, 1, 2, 2], 'b': [3, 4, 7, 8, 9]})
print(df.groupby('a').apply(foo))
#        mean  median
# a                  
# 1  4.666667     4.0
# 2  8.500000     8.5

I don't believe we can change this without breaking this behavior.

Thanks for all the details on your operation, this is very helpful! Does doing something like:

result = pd.concat({idx: foo(group) for idx, group in df.groupby('Id')})

work? This is essentially what apply does as well (with lots of logic for various cases), but now you have complete control over how the individual results are shaped back into the full result.
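Applied to the reproducible example from this issue, that suggestion might look like the following sketch; the container type is now the same regardless of the number of groups:

```python
import pandas as pd

def return_series(x):
    return pd.Series([x["A"] + x["B"]], index=x.index)

a_one = pd.DataFrame({"A": [1], "B": [2], "C": ["c"]})
a_two = pd.DataFrame({"A": [1, 2], "B": [2, 2], "C": ["c", "d"]})

# Concatenating the per-group results by hand: pd.concat of a dict of
# Series always produces a Series, however many groups there are.
res_one = pd.concat({k: return_series(g) for k, g in a_one.groupby("C")})
res_two = pd.concat({k: return_series(g) for k, g in a_two.groupby("C")})

print(type(res_one).__name__)  # Series
print(type(res_two).__name__)  # Series
```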

@Jakobhenningjensen
Author

Jakobhenningjensen commented Sep 6, 2023

I'll give it a shot indeed - thank you!

But I still struggle to figure out why it returns a DataFrame when there is only one unique groupby key but a Series otherwise. In your example it would always return a DataFrame, right?

Is there some kind of flowchart one could look at to determine what apply will return?

@rhshadrach
Member

But I still struggle to figure out why it returns a DataFrame when there is only one unique groupby key but a Series otherwise. In your example it would always return a DataFrame, right?

The issue is how to determine whether to stack results vertically or horizontally. pandas looks at the index of the returned objects (when they are Series) and if they are all the same it stacks them horizontally. When they are different, it stacks them vertically.

It's a little bit hard to see due to your example having a Series inside a Series. This might make it clearer:

def return_series(x):
    print(x.index, x["A"] + x["B"])
    return pd.Series([0, 1], index=[x.index[0], 'y'])

a_one = pd.DataFrame({"A": [1], "B": [2], "C": ["c"]})
a_two = pd.DataFrame({"A": [1, 2], "B": [2, 2], "C": ["c", "d"]})

print(a_one.groupby("C").apply(return_series))
#    0  y
# C      
# c  0  1

print(a_two.groupby("C").apply(return_series))
# C   
# c  0    0
#    y    1
# d  1    0
#    y    1
# dtype: int64

@sfc-gh-mvashishtha

sfc-gh-mvashishtha commented Feb 8, 2024

The issue is how to determine whether to stack results vertically or horizontally. pandas looks at the index of the returned objects (when they are Series) and if they are all the same it stacks them horizontally. When they are different, it stacks them vertically.

(by the way, we should document this behavior, maybe as part of #22545).

@rhshadrach is there a reason for this behavior? It makes more sense to me to always return a dataframe and "stack horizontally" / pivot. Here are my reasons:

  1. It's strange and unexpected that calling DataFrameGroupBy.apply() with the same parameters, including the same func, does not always return the same type of object, even though func is always returning the same type of object. @Jakobhenningjensen, this StackOverflow user, and I, at least, found this strange. I think it would be great if we could explain groupby behavior according to the return type of func, as someone attempted to do here.
  2. Stacking horizontally makes sense when the series all have the same index (as here), so if we want groupby.apply to be consistent, we should always stack horizontally.
  3. Always stacking horizontally is consistent with the behavior of df.apply(axis=1), which I think is very similar to groupby(axis=0).apply() in that it applies a function to subsets of the dataframe that include all the columns. For example, if we modify your example a bit into a df.apply(axis=1) that returns a differently indexed series for each row, pandas will pivot all the series results and align them for us instead of stacking the series vertically. The results for a_one and a_two are both dataframes, but a_two has an extra column because apply returned column d for one of the rows in a_two.
def return_series(x):
    return pd.Series([0, 1], index=[x['C'], 'y'])

a_one = pd.DataFrame({"A": [1], "B": [2], "C": ["c"]})
a_two = pd.DataFrame({"A": [1, 2], "B": [2, 2], "C": ["c", "d"]})

print(a_one.apply(return_series, axis=1))
"""
   c  y
0  0  1
"""

print(a_two.apply(return_series, axis=1))
"""
     c    d    y
0  0.0  NaN  1.0
1  NaN  0.0  1.0
"""

@rhshadrach
Member

is there a reason for this behavior? It makes more sense to me to always return a dataframe and "stack horizontally" / pivot.

I think the main reason is to support filters and transformations. Here is a filter example; replacing the lambda with lambda x: x.cumsum() gives the same issue.

import numpy as np
import pandas as pd

size = 10
df = pd.DataFrame(
    {
        "a": np.random.randint(0, 3, size),
        "b": np.random.random(size),
        "c": np.random.random(size),
    }
)
gb = df.groupby("a")
print(gb.apply(lambda x: x[x["b"] > 0.5]["c"], include_groups=False))
# Current behavior - stacking vertically
#             b         c
# a
# 0 2  0.669249  0.887126
# 1 3  0.694459  0.690484
#   5  0.878441  0.032467
#   9  0.630809  0.164901
# 2 7  0.647421  0.280941

# Stacking horizontally
# ValueError: all the input array dimensions except for the concatenation axis must match exactly, but along dimension 1, the array at index 0 has size 3 and the array at index 1 has size 1

Even if this were to work, I don't think stacking horizontally is the intended behavior.

@sfc-gh-mvashishtha

@rhshadrach your code is giving me a series, not a dataframe, on pandas 2.2.0, but I think that's what you intended:

"""
a
0  0    0.302575
   8    0.227312
1  3    0.777897
   6    0.978015
Name: c, dtype: float64
"""

I do see how vertical stacking makes sense here. That example isn't just a filter, though: it's a filter plus a scalar getitem, so it's filtering and selecting column c as a Series within each group. I think it's also clear to write df.groupby('a').apply(lambda x: x[x["b"] > 0.5][["c"]]) and then squeeze the DataFrame result if you want a Series. That version would work even with the new behavior I'm proposing because it returns a DataFrame.

replacing with lambda x: x.cumsum() gives the same issue.

Since this is a transform + scalar getitem, I think it's also clear to write df.groupby('a').apply(lambda x: x.cumsum()[["c"]]) and then squeeze the DataFrame result if you want a Series.
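A minimal sketch of that rewrite on made-up data (the numbers below are illustrative, not from the thread): selecting with [["c"]] keeps each per-group result a DataFrame, so apply stacks vertically and the overall result type is stable.

```python
import pandas as pd

df = pd.DataFrame({
    "a": [0, 0, 1, 1],
    "b": [0.2, 0.9, 0.7, 0.1],
    "c": [1.0, 2.0, 3.0, 4.0],
})

# Each group returns a one-column DataFrame, never a Series.
res = df.groupby("a").apply(lambda x: x[x["b"] > 0.5][["c"]])
print(type(res).__name__)  # DataFrame

# Selecting the column afterwards recovers a Series and avoids the
# squeeze-to-scalar pitfall of DataFrame.squeeze on a 1x1 result.
print(res["c"].tolist())  # [2.0, 3.0]
```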

For a counterpoint to the two examples you gave, consider this example where func is returning a series with the value counts of b within each group. The direction of stacking depends on which values of b are present in which group. Even if all values of b are present in each group, the direction of stack depends on the order in which func returns the keys for value counts in each group.

import pandas as pd

df = pd.DataFrame(
   [
      [0, 'b1'],
      [0, 'b1'],
      [1, 'b1'],
      [1, 'b2'],
   ],
   columns=['a', 'b']
)
df2 = pd.DataFrame(
   [
      [0, 'b1'],
      [0, 'b2'],
      [1, 'b1'],
      [1, 'b2'],
   ],
   columns=['a', 'b']
)

# currently, for df, we stack the value counts for each combination of group key + b value
# vertically because b2 never occurs in group 0
print(df.groupby('a').apply(lambda x: x['b'].value_counts()))
"""
a  b
0  b1    2
1  b1    1
   b2    1
Name: count, dtype: int64
"""

# ... but for df2, we stack the value counts for each combination of group key + b value
# horizontally because both groups include both values of column b!
print(df2.groupby('a').apply(lambda x: x['b'].value_counts()))
"""
b  b1  b2
a
0   1   1
1   1   1
"""

The current behavior is undesirable here, but this is the behavior we would keep if we continued to stack vertically for mismatching indexes as you suggest.

I still think all my points here apply. I would say the better consistency is worth users having to make a small change that may be a little unintuitive (replacing ["c"] with [["c"]] and then squeezing) when they're trying to end up with a Series that represents some version of a column of the original DataFrame.

@rhshadrach
Copy link
Member

I think it's also clear to write df.groupby('a').apply(lambda x: x[x["b"] > 0.5][["c"]]) and then squeeze the dataframe result if you want a series. That version would work even with the new behavior I'm proposing because it returns a dataframe.

No disagreement here, but the apply method should not raise on valid input.

@sfc-gh-mvashishtha

No disagreement here, but the apply method should not raise on valid input.

I agree with that, but if this is about the

ValueError: all the input array dimensions except for the concatenation axis must match exactly, but along dimension 1, the array at index 0 has size 3 and the array at index 1 has size 1

that you mentioned, I wouldn't want an error there. I don't know how you got the ValueError, but I think groupby.apply should concatenate the transposed series as if they're dataframes, so it should be able to handle different indexes along dimension 1. For an example like yours:

import pandas as pd

df = pd.DataFrame({
    'a': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 2, 7: 1, 8: 0, 9: 2},
    'b': {0: 0.4, 1: 0.13, 2: 0.32, 3: 0.08, 4: 0.72, 5: 0.28, 6: 0.28, 7: 0.04, 8: 0.27, 9: 0.95},
    'c': {0: 0.93, 1: 0.94, 2: 0.78, 3: 0.35, 4: 0.59, 5: 0.54, 6: 0.49, 7: 0.27, 8: 0.9, 9: 0.91},
})

# what we currently get from this:
print(df.groupby('a').apply(lambda x: x[x["b"] > 0.5]["c"]))
"""
a
0  4    0.59
2  9    0.91
Name: c, dtype: float64
"""
# would instead look like the result of this:
print(df.groupby('a').apply(lambda x: x[x["b"] > 0.5]["c"].to_frame().T))
"""
        4     9
a
0 c  0.59   NaN
1 c   NaN   NaN
2 c   NaN  0.91
"""

Admittedly, this doesn't look like a filter, but at least you don't get an error, and, as I said, I think the consistency in return type is worth having to rewrite filters like this so that they return frames instead of series.

@sfc-gh-mvashishtha

In #42608, someone else voted in favor of always returning a DataFrame when func returns a Series. The original post there also uses a value_counts example.
