Skip to content

Groupby converting empty string to nan #24903

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
djbarker opened this issue Jan 24, 2019 · 8 comments · Fixed by #44038
Closed

Groupby converting empty string to nan #24903

djbarker opened this issue Jan 24, 2019 · 8 comments · Fixed by #44038
Assignees
Labels
good first issue Needs Tests Unit test(s) needed to prevent regressions
Milestone

Comments

@djbarker
Copy link

djbarker commented Jan 24, 2019

In [1]: pd.DataFrame(dict(a=[1, 1, 2, 2], b=["", "", "", ""])).groupby("b").apply(lambda df: df.iloc[-1])
Out[1]: 
3  a b
b     
   2  

In [2]: pd.DataFrame(dict(a=[1, 1, 2, 2], b=["", "", "", ""], c=pd.to_datetime([1, 2, 3, 4], unit="s"))).groupby("b").apply(lambda df: df.iloc[-1])
Out[2]: 
3  a   b                   c
b                           
   2 NaN 1970-01-01 00:00:04                <------------ empty string converted to NaN

In [3]: pd.DataFrame(dict(a=[1, 1, 2, 2], b=["abc", "abc", "", ""], c=pd.to_datetime([1, 2, 3, 4], unit="s"))).groupby("b").apply(lambda df: df.iloc[-1])
Out[3]: 
     a    b                   c
b                              
     2      1970-01-01 00:00:04
abc  1  abc 1970-01-01 00:00:02

Problem description

When using groupBy with a column of empty strings and a column of Timestamps, the empty string can get converted to nan. See these three examples which display inconsistent behaviour:

Expected Output

The output in Out[3] shouldn't contain a NaN but the empty string "".

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Linux
OS-release: 4.18.16-200.fc28.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: en_GB.UTF-8

pandas: 0.23.4
pytest: 3.3.2
pip: 10.0.1
setuptools: 38.4.0
Cython: 0.27.3
numpy: 1.13.3
scipy: 0.19.1
pyarrow: 0.8.0
xarray: None
IPython: 6.2.1
sphinx: 1.6.6
patsy: 0.5.0
dateutil: 2.7.2
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: 0.4.0
matplotlib: 2.1.1
openpyxl: 2.4.10
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 4.1.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.1
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@simonjayhawkins
Copy link
Member

@djbarker Thanks for the report. I'm guessing it's an edge case when the result DataFrame is created from just a single series, evidenced by the series name from the series returned by the lambda becoming the columns index name in the result DataFrame.

Investigation and PRs welcome

@djbarker
Copy link
Author

@simonjayhawkins Surely the series name becomes the name of the index because it's what we grouped by? Also, I'm not sure I follow what you mean by it being created from a single series. The issue also occurs if we group by "a" or ["a", "b"]:

In [1]: pd.DataFrame(dict(a=[1, 1, 2, 2], b=["", "", "", ""], c=pd.to_datetime([1, 2, 3, 4], unit="s"))).groupby("a")
   ...: .apply(lambda df: df.iloc[-1])
Out[1]: 
   a   b                   c
a                           
1  1 NaN 1970-01-01 00:00:02
2  2 NaN 1970-01-01 00:00:04

In [2]: pd.DataFrame(dict(a=[1, 1, 2, 2], b=["", "", "", ""], c=pd.to_datetime([1, 2, 3, 4], unit="s"))).groupby(["a"
   ...: , "b"]).apply(lambda df: df.iloc[-1])
Out[2]: 
     a   b                   c
a b                           
1    1 NaN 1970-01-01 00:00:02
2    2 NaN 1970-01-01 00:00:04

@simonjayhawkins
Copy link
Member

@djbarker

Surely the series name becomes the name of the index because it's what we grouped by?

sorry if I was not clear. the row index names are determined by the columns grouped on. in the first two examples you gave, the column index name is shown as 3.

Also, I'm not sure I follow what you mean by it being created from a single series. The issue also occurs if we group by "a" or ["a", "b"]

correct, the additional examples show that it is not just confined to a single series

@djbarker
Copy link
Author

Ah I see, I hadn't noticed that. I wonder where the 3 comes from?

@simonjayhawkins
Copy link
Member

interestingly, if you change your lambda to return a DataFrame you won't get the NaNs but you'll get an extra index level.

pd.DataFrame(dict(a=[1, 1, 2, 2], b=["", "", "", ""], c=pd.to_datetime([1, 2, 3, 4], unit="s"))).groupby("b").apply(lambda df: df.iloc[[-1]])

     a b                   c
b
  3  2   1970-01-01 00:00:04

pd.DataFrame(dict(a=[1, 1, 2, 2], b=["", "", "", ""], c=pd.to_datetime([1, 2, 3, 4], unit="s"))).groupby(["a", "b"]).apply(lambda df: df.iloc[[-1]])
       a b                   c
a b
1   1  1   1970-01-01 00:00:02
2   3  2   1970-01-01 00:00:04

@djbarker
Copy link
Author

I realised the 3 is just the index of the last row in the DataFrame that the apply lambda sees in the case we return only one row. These become the extra index level if we return a DataFrame not a Series. This has to do with how apply handles joining the results back into one Dataframe. If the returned DataFrame has more than one row, one will be interested in keeping the index of these around:

In [33]: pd.DataFrame(dict(a=[1, 1, 2, 2], b=["", "", "", ""], c=pd.to_datetime([1, 2, 3, 4], unit="s"))).groupby(["a
    ...: "]).apply(lambda df: pd.DataFrame(dict(foo=["bar", "abc"], baz=["edf", "ghi"])).set_index("baz"))
Out[33]: 
       foo
a baz     
1 edf  bar
  ghi  abc
2 edf  bar
  ghi  abc

@mroeschke
Copy link
Member

The original example looks to work on master. Could use a test

In [34]: In [2]: pd.DataFrame(dict(a=[1, 1, 2, 2], b=["", "", "", ""], c=pd.to_datetime([1, 2, 3, 4], unit="s"))).groupby("b").apply(lamb
    ...: da df: df.iloc[-1])
Out[34]:
3  a b                   c
b
   2   1970-01-01 00:00:04

@mroeschke mroeschke added good first issue Needs Tests Unit test(s) needed to prevent regressions and removed Bug Groupby labels Jun 25, 2021
@tyvich
Copy link
Contributor

tyvich commented Oct 15, 2021

take

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good first issue Needs Tests Unit test(s) needed to prevent regressions
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants