Groupby converting empty string to nan #24903

djbarker · 2019-01-24T13:47:41Z

In [1]: pd.DataFrame(dict(a=[1, 1, 2, 2], b=["", "", "", ""])).groupby("b").apply(lambda df: df.iloc[-1])
Out[1]: 
3  a b
b     
   2  

In [2]: pd.DataFrame(dict(a=[1, 1, 2, 2], b=["", "", "", ""], c=pd.to_datetime([1, 2, 3, 4], unit="s"))).groupby("b").apply(lambda df: df.iloc[-1])
Out[2]: 
3  a   b                   c
b                           
   2 NaN 1970-01-01 00:00:04                <------------ empty string converted to NaN

In [3]: pd.DataFrame(dict(a=[1, 1, 2, 2], b=["abc", "abc", "", ""], c=pd.to_datetime([1, 2, 3, 4], unit="s"))).groupby("b").apply(lambda df: df.iloc[-1])
Out[3]: 
     a    b                   c
b                              
     2      1970-01-01 00:00:04
abc  1  abc 1970-01-01 00:00:02

Problem description

When using groupBy with a column of empty strings and a column of Timestamps, the empty string can get converted to nan. See these three examples which display inconsistent behaviour:

Expected Output

The output in Out[3] shouldn't contain a NaN but the empty string "".

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Linux
OS-release: 4.18.16-200.fc28.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: en_GB.UTF-8

pandas: 0.23.4
pytest: 3.3.2
pip: 10.0.1
setuptools: 38.4.0
Cython: 0.27.3
numpy: 1.13.3
scipy: 0.19.1
pyarrow: 0.8.0
xarray: None
IPython: 6.2.1
sphinx: 1.6.6
patsy: 0.5.0
dateutil: 2.7.2
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: 0.4.0
matplotlib: 2.1.1
openpyxl: 2.4.10
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 4.1.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.1
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

The text was updated successfully, but these errors were encountered:

simonjayhawkins · 2019-01-24T15:14:09Z

@djbarker Thanks for the report. I'm guessing it's an edge case when the result DataFrame is created from just a single series, evidenced by the series name from the series returned by the lambda becoming the columns index name in the result DataFrame.

Investigation and PRs welcome

djbarker · 2019-01-24T15:32:49Z

@simonjayhawkins Surely the series name becomes the name of the index because it's what we grouped by? Also, I'm not sure I follow what you mean by it being created from a single series. The issue also occurs if we group by "a" or ["a", "b"]:

In [1]: pd.DataFrame(dict(a=[1, 1, 2, 2], b=["", "", "", ""], c=pd.to_datetime([1, 2, 3, 4], unit="s"))).groupby("a")
   ...: .apply(lambda df: df.iloc[-1])
Out[1]: 
   a   b                   c
a                           
1  1 NaN 1970-01-01 00:00:02
2  2 NaN 1970-01-01 00:00:04

In [2]: pd.DataFrame(dict(a=[1, 1, 2, 2], b=["", "", "", ""], c=pd.to_datetime([1, 2, 3, 4], unit="s"))).groupby(["a"
   ...: , "b"]).apply(lambda df: df.iloc[-1])
Out[2]: 
     a   b                   c
a b                           
1    1 NaN 1970-01-01 00:00:02
2    2 NaN 1970-01-01 00:00:04

simonjayhawkins · 2019-01-24T15:45:41Z

@djbarker

Surely the series name becomes the name of the index because it's what we grouped by?

sorry if I was not clear. the row index names are determined by the columns grouped on. in the first two examples you gave, the column index name is shown as 3.

Also, I'm not sure I follow what you mean by it being created from a single series. The issue also occurs if we group by "a" or ["a", "b"]

correct, the additional examples show that it is not just confined to a single series

djbarker · 2019-01-24T16:01:29Z

Ah I see, I hadn't noticed that. I wonder where the 3 comes from?

simonjayhawkins · 2019-01-24T16:03:38Z

interestingly, if you change your lambda to return a DataFrame you won't get the NaNs but you'll get an extra index level.

pd.DataFrame(dict(a=[1, 1, 2, 2], b=["", "", "", ""], c=pd.to_datetime([1, 2, 3, 4], unit="s"))).groupby("b").apply(lambda df: df.iloc[[-1]])

     a b                   c
b
  3  2   1970-01-01 00:00:04

pd.DataFrame(dict(a=[1, 1, 2, 2], b=["", "", "", ""], c=pd.to_datetime([1, 2, 3, 4], unit="s"))).groupby(["a", "b"]).apply(lambda df: df.iloc[[-1]])
       a b                   c
a b
1   1  1   1970-01-01 00:00:02
2   3  2   1970-01-01 00:00:04

djbarker · 2019-01-24T16:17:37Z

I realised the 3 is just the index of the last row in the DataFrame that the apply lambda sees in the case we return only one row. These become the extra index level if we return a DataFrame not a Series. This has to do with how apply handles joining the results back into one Dataframe. If the returned DataFrame has more than one row, one will be interested in keeping the index of these around:

In [33]: pd.DataFrame(dict(a=[1, 1, 2, 2], b=["", "", "", ""], c=pd.to_datetime([1, 2, 3, 4], unit="s"))).groupby(["a
    ...: "]).apply(lambda df: pd.DataFrame(dict(foo=["bar", "abc"], baz=["edf", "ghi"])).set_index("baz"))
Out[33]: 
       foo
a baz     
1 edf  bar
  ghi  abc
2 edf  bar
  ghi  abc

mroeschke · 2021-06-25T05:33:48Z

The original example looks to work on master. Could use a test

In [34]: In [2]: pd.DataFrame(dict(a=[1, 1, 2, 2], b=["", "", "", ""], c=pd.to_datetime([1, 2, 3, 4], unit="s"))).groupby("b").apply(lamb
    ...: da df: df.iloc[-1])
Out[34]:
3  a b                   c
b
   2   1970-01-01 00:00:04

tyvich · 2021-10-15T00:40:51Z

take

mroeschke added Bug Groupby labels May 27, 2019

mroeschke added good first issue Needs Tests Unit test(s) needed to prevent regressions and removed Bug Groupby labels Jun 25, 2021

github-actions bot assigned tyvich Oct 15, 2021

tyvich mentioned this issue Oct 15, 2021

TST: added groupby apply test for nan coerce #44038

Merged

4 tasks

mroeschke added this to the 1.4 milestone Oct 15, 2021

jreback closed this as completed in #44038 Oct 16, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Groupby converting empty string to nan #24903

Groupby converting empty string to nan #24903

djbarker commented Jan 24, 2019 •

edited by mroeschke

Loading

INSTALLED VERSIONS

simonjayhawkins commented Jan 24, 2019

djbarker commented Jan 24, 2019

simonjayhawkins commented Jan 24, 2019

djbarker commented Jan 24, 2019

simonjayhawkins commented Jan 24, 2019

djbarker commented Jan 24, 2019

mroeschke commented Jun 25, 2021

tyvich commented Oct 15, 2021

Groupby converting empty string to nan #24903

Groupby converting empty string to nan #24903

Comments

djbarker commented Jan 24, 2019 • edited by mroeschke Loading

Problem description

Expected Output

Output of pd.show_versions()

INSTALLED VERSIONS

simonjayhawkins commented Jan 24, 2019

djbarker commented Jan 24, 2019

simonjayhawkins commented Jan 24, 2019

djbarker commented Jan 24, 2019

simonjayhawkins commented Jan 24, 2019

djbarker commented Jan 24, 2019

mroeschke commented Jun 25, 2021

tyvich commented Oct 15, 2021

djbarker commented Jan 24, 2019 •

edited by mroeschke

Loading

Output of `pd.show_versions()`