groupby and resample methods do not preserve subclassed data structures #28330
Comments
Worth noting that the group-by objects do store the object it started with in the attribute
import pandas as pd

class MySeries(pd.Series):
    pass

class MyDataFrame(pd.DataFrame):
    @property
    def _constructor(self):
        return MyDataFrame
    _constructor_sliced = MySeries

MySeries._constructor_expanddim = MyDataFrame

for cls in (pd.DataFrame, MyDataFrame):
    df = cls(
        {"a": reversed(range(10)), "b": list('aaaabbbccc')}
    )
    s = df.groupby("b").sum()
    print(type(df))
    print(type(s))
    print(type(s['a']))

<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>
<class '__main__.MyDataFrame'>
<class '__main__.MyDataFrame'>
<class '__main__.MySeries'>
Try it out from my fork here: https://github.com/alkasm/pandas/tree/groupby-preserve-subclass
I haven't taken a look at the resampling source code at all. However, it seems to use
import pandas as pd
import numpy as np
class MyDataFrame(pd.DataFrame):
    @property
    def _constructor(self):
        return MyDataFrame
dates = pd.date_range('2019', freq='H', periods=1000)
my_df = MyDataFrame(np.arange(len(dates)), index=dates)
print(type(my_df))
# <class '__main__.MyDataFrame'> (✓)
print(type(my_df.diff()))
# <class '__main__.MyDataFrame'> (✓)
print(type(my_df.sample(1)))
# <class '__main__.MyDataFrame'> (✓)
print(type(my_df.rolling('5H').mean()))
# <class '__main__.MyDataFrame'> (✓)
print(type(my_df.groupby(my_df.index.dayofweek).mean()))
# <class '__main__.MyDataFrame'> (✓)
print(type(my_df.resample('D').mean()))
# <class '__main__.MyDataFrame'> (✓)
I will open up a PR after I'm able to look into the resampling stuff a little more and confirm whether or not this covers the bases.
That was fast! I'll try out the fork when I've got access to a less locked-down PC. As far as I can tell this solves the problem perfectly.
AFAICT, resampling will do the right thing, as it just applies the functions/classes from the groupby module, so I don't think anything special is necessary. There isn't really hardcoded
Edit: PR submitted: #28573
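For anyone who wants to check this on their own pandas version, here is a minimal sketch of a type-preservation check. The helper name `subclass_preserved` and the column name `x` are made up for illustration and are not part of pandas or of the fork. The sketch only reports what it observes, since the groupby and resample results depend on the installed version.

```python
import numpy as np
import pandas as pd


class MyDataFrame(pd.DataFrame):
    @property
    def _constructor(self):
        return MyDataFrame


def subclass_preserved(frame, operations):
    """Hypothetical helper: map each operation name to whether its result
    is still an instance of the frame's own class."""
    return {
        name: isinstance(op(frame), type(frame))
        for name, op in operations.items()
    }


# Daily data with a weekly resample, chosen only to keep the example small.
dates = pd.date_range("2019-01-01", periods=60, freq="D")
my_df = MyDataFrame({"x": np.arange(len(dates), dtype=float)}, index=dates)

report = subclass_preserved(
    my_df,
    {
        "diff": lambda f: f.diff(),
        "groupby": lambda f: f.groupby(f.index.dayofweek).mean(),
        "resample": lambda f: f.resample("W").mean(),
    },
)
print(report)
```

Any entry that comes back `False` is an operation that fell back to the plain pandas type.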
Code sample
Problem description
Originally posted on SO.
The intended behaviour for chain-able methods on subclassed data structures is clearly that the operation returns an instance of the subclass (i.e. MyDataFrame), rather than the native type (i.e. DataFrame). This is the current behaviour for most operations (e.g., slicing, sampling, sorting) but not resample and groupby.
Currently groupby and resample both return explicitly constructed pandas datatypes, e.g. here:
pandas/pandas/core/groupby/generic.py
Line 338 in ac69333
To get the expected behaviour, the intermediary classes (e.g. DataFrameGroupBy) would need to retain information about the calling class so that the appropriate constructor can be used (i.e. one of _constructor or _constructor_sliced or _constructor_expanddim).
Note that operations that use Window and Rolling already appear to have the expected behaviour because these assemble their results via a call to concat such as this one:
pandas/pandas/core/window.py
Line 325 in 171c716
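To illustrate the idea of keeping a reference to the calling object and using its constructor, here is a rough sketch. It is not how pandas implements groupby internally: `rewrap` is a hypothetical helper that rebuilds an already computed plain result through the original object's `_constructor` or `_constructor_sliced`.

```python
import pandas as pd


class MySeries(pd.Series):
    pass


class MyDataFrame(pd.DataFrame):
    @property
    def _constructor(self):
        return MyDataFrame

    _constructor_sliced = MySeries


def rewrap(result, original):
    """Hypothetical helper: rebuild `result` with the constructors
    advertised by `original`, so the subclass survives."""
    if isinstance(result, pd.DataFrame):
        return original._constructor(result)
    if isinstance(result, pd.Series):
        return original._constructor_sliced(result)
    return result


df = MyDataFrame({"a": list(range(6)), "b": list("aabbcc")})

# Force a plain pandas result to stand in for the behaviour reported here.
plain = pd.DataFrame(df).groupby("b").sum()

wrapped_frame = rewrap(plain, df)
wrapped_series = rewrap(plain["a"], df)
print(type(wrapped_frame).__name__, type(wrapped_series).__name__)
```

The point of the sketch is only that the information needed to pick the right constructor lives on the original object, so an intermediary that keeps a reference to it can hand back the subclass.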
Output of pd.show_versions()
pandas : 0.25.1
numpy : 1.17.1
pytz : 2019.2
dateutil : 2.8.0
pip : 19.2.3
setuptools : 40.8.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.10.1
IPython : 7.8.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : 3.1.1
numexpr : 2.7.0
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : 1.3.1
sqlalchemy : None
tables : 3.5.2
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None