BUG: groupby first()/last() change datatype of all NaN columns to float; nth() preserves datatype #33591
Labels
Bug
Duplicate Report
Duplicate issue or pull request
Groupby
Missing-data
np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
Comments
I think this is fixed on master:
In [1]: import pandas as pd
...:
...: df = pd.DataFrame(
...: {
...: "id": ["a", "b", "b", "c"],
...: "sym": ["ibm", "msft", "msft", "goog"],
...: "sectype": ["E", "E", "E", "E"],
...: }
...: )
...: df["osi"] = df.sym.where(df.sectype == "O")
...: print(df.dtypes)
...: print(df.groupby("id")["osi"].first())
...: print(df.groupby("id")["osi"].last())
...: print(pd.__version__)
...:
id object
sym object
sectype object
osi object
dtype: object
id
a NaN
b NaN
c NaN
Name: osi, dtype: object
id
a NaN
b NaN
c NaN
Name: osi, dtype: object
1.1.0.dev0+1288.g3a5ae505b
The code sample is different from the OP. The issue in OP does not appear to be fixed.
@simonjayhawkins Oops, yeah I think you're right
If no one has taken this, I would like to help solve this issue.
pandas version 1.0.3 from the conda distribution.
Code Sample, a copy-pastable example
Problem description
Given a dataframe with an all-NaN column of str/objects, performing a groupby().first() will convert the all-NaN column to float. This is true for .last() as well, but not for .nth(0) and .nth(-1), so these are workarounds.
This is a problem for me because the groupby().first() sits in general-purpose code that is called iteratively, and the results are either pd.concat'd (resulting in a column with mixed types) or appended to an HDF file (causing the write to fail).
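The original code sample is not shown here, so the following is only a minimal sketch of one possible stopgap: cast the grouped result back to the input dtypes before concatenating or writing. The frame below is hypothetical, modeled on the example in the comments, with "osi" as an all-NaN column explicitly stored as object. The name `restored` is an assumption made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data modeled on the example in the comments; "osi" is an
# all-NaN column explicitly stored with object dtype.
df = pd.DataFrame(
    {
        "id": ["a", "b", "b", "c"],
        "sym": ["ibm", "msft", "msft", "goog"],
        "osi": pd.Series([np.nan] * 4, dtype=object),
    }
)

first = df.groupby("id").first()
print(first.dtypes)  # per the report, "osi" comes back as float on pandas 1.0.3

# Cast every result column back to the dtype it had in the input frame.
input_dtypes = {col: df[col].dtype for col in first.columns}
restored = first.astype(input_dtypes)
print(restored.dtypes)
```

The cast is a no-op for columns whose dtype was preserved, so it should be safe to apply unconditionally after each iteration.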
Expected Output
The resulting dataframe from a groupby().first()/.last() should have the same metadata (column structure and datatypes) as the input dataframe. The result of .first()/.last() should match that of .nth(0)/.nth(-1) .
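The expectation above can be written as a small check. This is a rough sketch, not part of pandas: `dtype_mismatches` is a hypothetical helper, and the float conversion is simulated with `astype` so the demo does not depend on which pandas version is installed.

```python
import numpy as np
import pandas as pd


def dtype_mismatches(before, after):
    """Return {column: (dtype_before, dtype_after)} for shared columns whose dtype changed."""
    shared = [col for col in after.columns if col in before.columns]
    return {
        col: (before[col].dtype, after[col].dtype)
        for col in shared
        if before[col].dtype != after[col].dtype
    }


# Hypothetical input with an all-NaN object column.
before = pd.DataFrame(
    {"id": ["a", "b"], "osi": pd.Series([np.nan, np.nan], dtype=object)}
)
# Simulate the reported conversion of the all-NaN column to float.
after = before.astype({"osi": "float64"})

mismatches = dtype_mismatches(before, after)
print(mismatches)
```

Running the same helper against the output of groupby().first() and the input frame would flag any column whose dtype was not preserved.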
Output of pd.show_versions()
INSTALLED VERSIONS
commit : None
python : 3.7.7.final.0
python-bits : 64
OS : Windows
OS-release : 10
machine : AMD64
processor : Intel64 Family 6 Model 85 Stepping 4, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None
pandas : 1.0.3
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.2
setuptools : 46.1.3.post20200330
Cython : 0.29.15
pytest : 5.4.1
hypothesis : 5.8.3
sphinx : 2.4.4
blosc : None
feather : None
xlsxwriter : 1.2.8
lxml.etree : 4.5.0
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.1
IPython : 7.13.0
pandas_datareader: None
bs4 : 4.8.2
bottleneck : 1.3.2
fastparquet : None
gcsfs : None
lxml.etree : 4.5.0
matplotlib : 3.1.3
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.3
pandas_gbq : None
pyarrow : None
pytables : None
pytest : 5.4.1
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : 1.3.15
tables : 3.6.1
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.2.8
numba : 0.48.0