BUG: pandas groupby.apply return type changes depending on number of unique groupkeys for same custom-function #54992
This is similar in spirit to #42749. Apply tries to infer how best to reshape the result, and uses whether the returned objects are indexed the same to make some decisions. When there is one group, it can't tell the difference between a UDF that always returns the same index and one that does not.

cc @topper-123

@Jakobhenningjensen - it looks like the example you posted is minimal (which is great), but I'd guess it's not similar to what you're really trying to accomplish via apply. I'd also be curious to understand the real nature of your operation if you're able to share. We are trying to understand how users are using apply and whether agg/transform are better alternatives.
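To make that concrete, here is a minimal sketch (illustrative data, not from the report): the same UDF comes back as a DataFrame with a single group but a Series with two, because a single returned index trivially "matches".

```python
import pandas as pd

def udf(g):
    # returns a Series indexed by the group's own row labels
    return g["x"] * 2

one_group = pd.DataFrame({"key": [1, 1], "x": [1.0, 2.0]})
two_groups = pd.DataFrame({"key": [1, 2], "x": [1.0, 2.0]})

# one group: the single returned index trivially matches itself,
# so apply stacks horizontally and returns a DataFrame
wide = one_group.groupby("key").apply(udf)

# two groups: the returned indexes differ, so apply concatenates
# vertically and returns a Series
tall = two_groups.groupby("key").apply(udf)

print(type(wide).__name__, type(tall).__name__)
```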
Yeah, this method is inconsistent for a few known reasons. There are some other issues with groupby.apply, so it would help if you could give the details on @rhshadrach's question, so we can look into improving it or finding alternatives.
@rhshadrach and @topper-123 Thank you for the swift reply; I'll try to elaborate as close to the real code as possible with an example. Say I want to do some classification using Tf-IDF + cosine similarity. Since I have a lot of users, I want to train a Tf-IDF model for each user. I create a simple class which does the following:
```python
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd


class AnimalPredictor:
    def __init__(self):
        self.id_dict = {}

    def _set_model_and_X_and_y(self, data_id):
        id_ = data_id.name
        tf_idf = TfidfVectorizer()
        X = tf_idf.fit_transform(data_id["Sound"])
        y = data_id["Animal"]
        temp_dict = {"model": tf_idf, "X": X, "y": y}
        self.id_dict[id_] = temp_dict

    def fit(self, X):
        X.groupby("Id").apply(self._set_model_and_X_and_y)

    def _predict_within_id(self, X_id):
        id_ = X_id.name
        try:
            id_data = self.id_dict[id_]
        except KeyError:
            return pd.Series([None] * len(X_id), index=X_id.index)
        model = id_data["model"]
        x_train = id_data["X"]
        y_train = id_data["y"]
        data_transformed = model.transform(X_id["Sound"])
        similarity = data_transformed @ x_train.T
        max_similarity = similarity.argmax(axis=1).A1
        predictions = y_train.iloc[max_similarity]
        predictions.index = X_id.index
        return predictions

    def predict(self, X):
        predictions = X.groupby("Id", group_keys=False).apply(self._predict_within_id)
        predictions = predictions.loc[X.index]  # Very important that we get the same input/output order
        return predictions


animal_predictor = AnimalPredictor()

data_train = pd.DataFrame(
    {"Id": [1, 1, 1, 2, 2, 3, 3, 3],
     "Animal": ["dog", "dog", "bird", "bird", "cat", "dog", "cat", "bird"],
     "Sound": ["woof wow", "baw waw", "tweety", "peepy tweety", "miav", "miav woof", "hfhfhf", "hello world"]})
animal_predictor.fit(data_train)

data_test_multiple = pd.DataFrame({"Id": [1, 2, 3], "Sound": ["woof wow", "baw waw", "tweety"]}, index=[10, 9, 17])
data_test_single = pd.DataFrame({"Id": [1], "Sound": ["woof wow"]})

predictions_multiple = animal_predictor.predict(data_test_multiple)  # Works fine
predictions_single = animal_predictor.predict(data_test_single)  # Throws index error
```

The issue is that I want to re-arrange the output from `predict` so that it matches the input order. If we have multiple ids, the result is a Series, which works out well; with a single id the `.loc[X.index]` re-ordering throws an index error. My main issue is not to find another solution: if we always got e.g. a DataFrame returned that would be fine. The issue is that the result is inconsistent even though the function is the same. I do admit that I might be using `apply` in a way it wasn't intended for.
Agreed, but the reason apply does this is for examples like:
I don't believe we can change this without breaking this behavior. Thanks for all the details on your operation, this is very helpful! Does doing something like:
work? This is essentially what apply does as well (with lots of logic for various cases), but now you have complete control over how the individual results are shaped back into the full result.
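The suggested snippet above did not survive formatting; a sketch of the manual alternative it describes (hypothetical `predict_fn`, toy data), where you iterate over the groups yourself and control the concatenation:

```python
import pandas as pd

def predict_fn(group):
    # stand-in for a per-group prediction such as _predict_within_id
    return pd.Series(["dog"] * len(group), index=group.index)

X = pd.DataFrame({"Id": [1, 1, 2], "Sound": ["woof", "wow", "miav"]})

# iterate over the groups explicitly instead of using groupby.apply
pieces = [predict_fn(group) for _, group in X.groupby("Id")]
predictions = pd.concat(pieces).loc[X.index]

# a Series regardless of how many unique Ids there are
print(type(predictions).__name__)
```

Because the concatenation is explicit, the result type no longer depends on the number of groups.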
I'll give it a shot indeed - thank you! But I still struggle to figure out why it returns a DataFrame when there's only one unique groupby key but a Series otherwise. In your example it would always return a DataFrame, right? Is there some kind of flowchart one would be able to look at to determine what is being returned by apply?
The issue is how to determine whether to stack results vertically or horizontally. pandas looks at the index of the returned objects (when they are Series) and if they are all the same it stacks them horizontally. When they are different, it stacks them vertically. It's a little bit hard to see due to your example having a Series inside a Series. This might make it clearer:
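An illustrative stand-in for that example (assumed data): when every group's Series carries the same index, apply pivots them into the rows of a DataFrame; when the indexes differ, it concatenates them into one long Series.

```python
import pandas as pd

df = pd.DataFrame({"g": [0, 0, 1, 1], "v": [1.0, 2.0, 3.0, 4.0]})

# same index ("lo"/"hi") from every group -> stacked horizontally -> DataFrame
wide = df.groupby("g").apply(
    lambda x: pd.Series([x["v"].min(), x["v"].max()], index=["lo", "hi"])
)

# different indexes (each group's own row labels) -> stacked vertically -> Series
tall = df.groupby("g").apply(lambda x: x["v"].iloc[: x.name + 1])

print(type(wide).__name__, type(tall).__name__)  # DataFrame Series
```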
(by the way, we should document this behavior, maybe as part of #22545). @rhshadrach is there a reason for this behavior? It makes more sense to me to always return a dataframe and "stack horizontally" / pivot. Here are my reasons:
```python
def return_series(x):
    return pd.Series([0, 1], index=[x['C'], 'y'])

a_one = pd.DataFrame({"A": [1], "B": [2], "C": ["c"]})
a_two = pd.DataFrame({"A": [1, 2], "B": [2, 2], "C": ["c", "d"]})

print(a_one.apply(return_series, axis=1))
"""
   c  y
0  0  1
"""
print(a_two.apply(return_series, axis=1))
"""
     c    d    y
0  0.0  NaN  1.0
1  NaN  0.0  1.0
"""
```
I think the main reason is to support filters and transformations. Here is a filter example:
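The filter snippet referenced here did not render; a stand-in with the same shape (assumed data) shows why the vertical path exists: the filtered pieces keep their original row labels, and concatenating them vertically preserves those labels.

```python
import pandas as pd

df = pd.DataFrame({"a": [0, 0, 1, 1],
                   "b": [0.1, 0.9, 0.2, 0.8],
                   "c": [10.0, 20.0, 30.0, 40.0]})

# keep only rows with b > 0.5 within each group, then select column c;
# the per-group results have different indexes, so apply stacks them
# vertically into a Series
out = df.groupby("a").apply(lambda x: x[x["b"] > 0.5]["c"])
print(out)
```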
Even if this were to work, I don't think stacking horizontally is the intended behavior.
@rhshadrach your code is giving me a Series, not a DataFrame, on pandas 2.2.0, but I think that's what you intended:

```
a
0  0    0.302575
   8    0.227312
1  3    0.777897
   6    0.978015
Name: c, dtype: float64
```

I do see how vertical stacking makes sense here. That example isn't just a filter, though: it's a filter plus a scalar getitem, so it's filtering and selecting column `c`.
Since this is a transform + scalar getitem, I also think there is an equally clear way to write it. For a counterpoint to the two examples you gave, consider this example:

```python
import pandas as pd

df = pd.DataFrame(
    [
        [0, 'b1'],
        [0, 'b1'],
        [1, 'b1'],
        [1, 'b2'],
    ],
    columns=['a', 'b']
)
df2 = pd.DataFrame(
    [
        [0, 'b1'],
        [0, 'b2'],
        [1, 'b1'],
        [1, 'b2'],
    ],
    columns=['a', 'b']
)

# currently, for df, we stack the value counts for each combination of
# group key + b value vertically because b2 never occurs in group 0
print(df.groupby('a').apply(lambda x: x['b'].value_counts()))
"""
a  b
0  b1    2
1  b1    1
   b2    1
Name: count, dtype: int64
"""
# ... but for df2, we stack the value counts for each combination of group
# key + b value horizontally because both groups include both values of column b!
print(df2.groupby('a').apply(lambda x: x['b'].value_counts()))
"""
b   b1  b2
a
0    1   1
1    1   1
"""
```

The current behavior is undesirable here, but this is the behavior we would keep if we continued to stack vertically for mismatching indexes as you suggest. I still think all my points here apply. I would say the better consistency is worth users having to make a small change that may be a little unintuitive (replacing a Series return with a frame).
No disagreement here, but the apply method should not raise on valid input.
I agree with that, but if this is about the example that you mentioned, I wouldn't want an error there. I don't know how you got the error:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'a': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 2, 7: 1, 8: 0, 9: 2},
                   'b': {0: 0.4, 1: 0.13, 2: 0.32, 3: 0.08, 4: 0.72, 5: 0.28, 6: 0.28, 7: 0.04, 8: 0.27, 9: 0.95},
                   'c': {0: 0.93, 1: 0.94, 2: 0.78, 3: 0.35, 4: 0.59, 5: 0.54, 6: 0.49, 7: 0.27, 8: 0.9, 9: 0.91}})

# what we currently get from this:
print(df.groupby('a').apply(lambda x: x[x["b"] > 0.5]["c"]))
"""
a
0  4    0.59
2  9    0.91
Name: c, dtype: float64
"""
# would instead look like the result of this:
print(df.groupby('a').apply(lambda x: x[x["b"] > 0.5]["c"].to_frame().T))
"""
        4     9
a
0 c  0.59   NaN
1 c   NaN   NaN
2 c   NaN  0.91
"""
```

Admittedly, this doesn't look like a filter, but at least you don't get an error, and as I said, I think the consistency in return type is worth having to rewrite filters like this so that they return frames instead of series.
In #42608, someone else voted in favor of always returning a DataFrame.
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
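The original snippet did not survive extraction; based on the description below, a reconstruction (hypothetical data) could be:

```python
import pandas as pd

def select_col(g):
    # identical UDF in both calls; only the number of unique keys differs
    return g["x"]

multi = pd.DataFrame({"key": [1, 2, 2], "x": [1.0, 2.0, 3.0]})
single = pd.DataFrame({"key": [1, 1], "x": [1.0, 2.0]})

print(type(multi.groupby("key").apply(select_col)).__name__)   # Series
print(type(single.groupby("key").apply(select_col)).__name__)  # DataFrame
```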
Issue Description
When using `groupby.apply` with the same custom function, the end result is not consistent: it is either a Series if there are multiple keys in the group-by, or a DataFrame if there is only one key value present.

Expected Behavior
It is understandable that some functions return either a Series or a DataFrame depending on the function, but I would assume that the same function always returns the same type, i.e. that the example above always returns either a DataFrame or a Series.
Installed Versions
INSTALLED VERSIONS
commit : ba1cccd
python : 3.11.4.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.22621
machine : AMD64
processor : Intel64 Family 6 Model 141 Stepping 1, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : Danish_Denmark.1252
pandas : 2.1.0
numpy : 1.24.1
pytz : 2023.3
dateutil : 2.8.2
setuptools : 65.5.0
pip : 23.1.2
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : 1.11.1
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : None
pandas_datareader : None
bs4 : None
bottleneck : None
dataframe-api-compat: None
fastparquet : None
fsspec : 2023.6.0
gcsfs : None
matplotlib : 3.7.2
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : 0.19.2
pyarrow : 12.0.1
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.11.1
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None