-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
Groupby converting empty string to nan #24903
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
@djbarker Thanks for the report. I'm guessing it's an edge case when the result DataFrame is created from just a single series, evidenced by the series name from the series returned by the lambda becoming the columns index name in the result DataFrame. Investigation and PRs welcome |
@simonjayhawkins Surely the series name becomes the name of the index because it's what we grouped by? Also, I'm not sure I follow what you mean by it being created from a single series. The issue also occurs if we group by In [1]: pd.DataFrame(dict(a=[1, 1, 2, 2], b=["", "", "", ""], c=pd.to_datetime([1, 2, 3, 4], unit="s"))).groupby("a")
...: .apply(lambda df: df.iloc[-1])
Out[1]:
a b c
a
1 1 NaN 1970-01-01 00:00:02
2 2 NaN 1970-01-01 00:00:04
In [2]: pd.DataFrame(dict(a=[1, 1, 2, 2], b=["", "", "", ""], c=pd.to_datetime([1, 2, 3, 4], unit="s"))).groupby(["a"
...: , "b"]).apply(lambda df: df.iloc[-1])
Out[2]:
a b c
a b
1 1 NaN 1970-01-01 00:00:02
2 2 NaN 1970-01-01 00:00:04 |
sorry if I was not clear. the row index names are determined by the columns grouped on. in the first two examples you gave, the column index name is shown as 3.
correct, the additional examples show that it is not just confined to a single series |
Ah I see, I hadn't noticed that. I wonder where the 3 comes from? |
interestingly, if you change your lambda to return a DataFrame you won't get the NaNs but you'll get an extra index level.
|
I realised the In [33]: pd.DataFrame(dict(a=[1, 1, 2, 2], b=["", "", "", ""], c=pd.to_datetime([1, 2, 3, 4], unit="s"))).groupby(["a
...: "]).apply(lambda df: pd.DataFrame(dict(foo=["bar", "abc"], baz=["edf", "ghi"])).set_index("baz"))
Out[33]:
foo
a baz
1 edf bar
ghi abc
2 edf bar
ghi abc |
The original example looks to work on master. Could use a test
|
take |
Problem description
When using
groupBy
with a column of empty strings and a column ofTimestamp
s, the empty string can get converted to nan. See these three examples which display inconsistent behaviour:Expected Output
The output in
Out[3]
shouldn't contain a NaN but the empty string""
.Output of
pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Linux
OS-release: 4.18.16-200.fc28.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: en_GB.UTF-8
pandas: 0.23.4
pytest: 3.3.2
pip: 10.0.1
setuptools: 38.4.0
Cython: 0.27.3
numpy: 1.13.3
scipy: 0.19.1
pyarrow: 0.8.0
xarray: None
IPython: 6.2.1
sphinx: 1.6.6
patsy: 0.5.0
dateutil: 2.7.2
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: 0.4.0
matplotlib: 2.1.1
openpyxl: 2.4.10
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 4.1.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.1
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
The text was updated successfully, but these errors were encountered: