BUG: assign with apply over axis=1 sometimes fails when the DataFrame has zero rows #50244
Comments
@cf-vrgl if this issue is not too complex, I would like to take it up. Could you guide me if possible? |
@ShisuiUzumaki, there seems to be some unreliable behavior when the DataFrame has no rows. Sometimes assign using an apply over rows succeeds; other times it does not. The latest version seems to raise an error more often than earlier versions:
import datetime
import pandas as pd
df = pd.DataFrame([
    [
        datetime.datetime(2022, 6, 1), datetime.datetime(2022, 6, 2)
    ]], columns=['a', 'b'])
def f(row):
    return min(row['a'] - datetime.timedelta(days=1), row['b'])
# apply returns 1 column:
print(df.apply(lambda row: f(row), axis=1).shape)
# no problem adding column when at least 1 row of data
df = df.assign(both_work=lambda d: d.apply(lambda row: f(row), axis=1))
# drop row
df = df.query("a==b")
# now that the row has been dropped, apply returns multiple columns
print(df.apply(lambda row: f(row), axis=1).shape)
# if the column being assigned to already exists, this is ok in pandas 1.2.4, but not in 1.5.2:
df = df.assign(old_ok=datetime.datetime(2000, 1, 1)).assign(old_ok=lambda df: df.apply(lambda row: f(row), axis=1))
# both versions fail if the column does not already exist
df = df.assign(both_fail=lambda df: df.apply(lambda row: f(row), axis=1)) |
What solution do you suggest? I mean, how should I approach the issue? |
I suggest stepping through pandas 1.2.4 and 1.5.2 for this line to see why 1.5.2 fails but 1.2.4 does not. Hopefully, the difference on this line will give some insight into how to resolve the issue:
df = df.assign(old_ok=datetime.datetime(2000, 1, 1)).assign(old_ok=lambda df: df.apply(lambda row: f(row), axis=1)) |
So, for starters, I should debug through this code snippet in both versions for identification of issue, right? |
Yes, I think that is a good first step. |
OK, I will start with that. |
@cf-vrgl - could you give this issue an informative title? |
It would be helpful if you can give a reproducible example that only involves apply to an explicitly constructed DataFrame (rather than relying on operations to get the input), and what the output you believe to be buggy is. |
import datetime
import pandas as pd
df = pd.DataFrame({
    'a': pd.Series([], dtype="datetime64[ns]", name='a'),
    'b': pd.Series([], dtype="datetime64[ns]", name='b')})
# expected result is 0 x 1, actual is 0 x 2
print(
    df
    .apply(lambda row: min(row['a'] - datetime.timedelta(days=1), row['b']),
           axis=1)
    .shape)
# expected result is a df with three datetime64[ns] columns and no rows, actual is fail
new_df = df.assign(
    new_datetime_col=lambda d: d.apply(lambda row: min(row['a'] - datetime.timedelta(days=1), row['b']),
                                       axis=1)) |
Thanks @cf-vrgl. I think your example above can be further simplified to just:
I understand the desire to have pandas return 0 x 1 here, but the issue is: how is pandas supposed to determine that? User Defined Functions (UDFs) are black boxes; pandas can only give them input and get their output. In this case, pandas tried to call the function but it raised, so pandas can't learn anything from the output. Compare this to
You would not expect this to return 0 x 1. However when the UDF raises, pandas has nothing to go on to determine what the resulting shape is. So pandas just returns the input's shape. This should be better documented, but prior to that I think we should establish what the desired behavior is first as in #47959. |
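The behavior described above can be observed with a small sketch. The helper names `earliest` and `constant` are made up for illustration, and the shapes noted here are what this thread reports for pandas 1.5.2; other versions may differ. When the UDF raises on the empty input, apply falls back to the input's shape (0 x 2). When the UDF returns a scalar, apply can tell it should reduce. Passing the documented `result_type="reduce"` argument tells apply to reduce without having to infer it from the UDF.

```python
import datetime

import pandas as pd

# Empty frame with two datetime columns, as in the report above.
df = pd.DataFrame({
    "a": pd.Series([], dtype="datetime64[ns]"),
    "b": pd.Series([], dtype="datetime64[ns]"),
})


def earliest(row):
    # Illustrative UDF from the report; it raises when pandas calls it
    # with placeholder input that holds no datetime values.
    return min(row["a"] - datetime.timedelta(days=1), row["b"])


def constant(row):
    # Illustrative UDF that returns a scalar regardless of its input.
    return 1


# UDF raises on the empty input: pandas returns a copy of the input frame.
raised_shape = df.apply(earliest, axis=1).shape
# UDF returns a scalar: pandas reduces to an empty Series.
scalar_shape = df.apply(constant, axis=1).shape
# Reduction is requested explicitly, so no inference from the UDF is needed.
forced_shape = df.apply(earliest, axis=1, result_type="reduce").shape

print(raised_shape)
print(scalar_shape)
print(forced_shape)
```

Note that the Series produced with `result_type="reduce"` on an empty frame will not carry a datetime dtype, so a cast may still be needed before assigning it as a datetime column.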
Thanks @rhshadrach, I have a habit of only using apply over axis 1 for functions that return a single value, and I overlooked other possibilities. At the risk of turning my bug report into an enhancement request/suggestion, here's a workaround that seems like it could be the start of a general solution:
import datetime
import numpy as np
import pandas as pd
from typing import Callable
df = pd.DataFrame({
    'a': pd.Series([], dtype="datetime64[ns]", name='a'),
    'b': pd.Series([], dtype="datetime64[ns]", name='b')})
def default_assign(df: pd.DataFrame, col_name: str, dtype: str, f: Callable):
    "assign(col_name=lambda df: f(df)) with a default dtype for zero row case"
    if len(df) > 0:
        return df.assign(**{col_name: lambda d: f(d)})
    if col_name in df.columns:
        df.loc[:, col_name] = pd.Series(name=col_name, dtype=dtype)
    else:
        df = pd.concat([df, pd.Series(name=col_name, dtype=dtype)], axis=1)
    return df
# expected result is a df with three datetime64[ns] columns and no rows
new_df = df.pipe(
    default_assign,
    f=lambda d: d.apply(lambda row: min(row['a'] - datetime.timedelta(days=1), row['b']), axis=1),
    col_name='new_datetime_col',
    dtype="datetime64[ns]",
)
print(new_df.loc[:, 'new_datetime_col'].dtype)
print(new_df) |
As far as I can tell, the issue isn't assign, it's apply. Even with the above code, apply would still be returning something you don't desire. So it wouldn't fix the root cause of the issue. |
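Since the point above is that the unexpected shape comes from apply rather than assign, a user-side guard could sit around the apply call itself. The following is only a sketch under that assumption: `apply_rows_or_empty` is a hypothetical helper, not part of pandas, and it requires the caller to declare the result dtype because pandas cannot infer it from a UDF it never successfully calls.

```python
import datetime
from typing import Callable

import pandas as pd


def apply_rows_or_empty(df: pd.DataFrame, func: Callable, dtype: str) -> pd.Series:
    """Hypothetical helper: row-wise apply that always yields a single column."""
    if len(df) == 0:
        # Skip apply entirely and return an empty Series with the declared dtype.
        return pd.Series(index=df.index, dtype=dtype)
    return df.apply(func, axis=1)


def earliest(row):
    # Illustrative UDF taken from the report.
    return min(row["a"] - datetime.timedelta(days=1), row["b"])


empty = pd.DataFrame({
    "a": pd.Series([], dtype="datetime64[ns]"),
    "b": pd.Series([], dtype="datetime64[ns]"),
})

# The guarded apply can be used inside assign, even with zero rows.
new_df = empty.assign(
    new_datetime_col=lambda d: apply_rows_or_empty(d, earliest, "datetime64[ns]")
)
print(new_df.shape)
print(new_df.dtypes)

# With at least one row, the helper simply defers to apply.
one_row = pd.DataFrame(
    [[datetime.datetime(2022, 6, 1), datetime.datetime(2022, 6, 2)]],
    columns=["a", "b"],
)
one_result = apply_rows_or_empty(one_row, earliest, "datetime64[ns]")
print(one_result)
```

This does not change what apply itself returns for empty input; it only avoids calling apply in that case, so it is a workaround rather than a fix for the underlying behavior.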
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Issue Description
When there are zero rows, apply(lambda row: , axis=1) returns all columns rather than a single result column, leading to a ValueError.
Expected Behavior
A new column should be created.
Installed Versions
INSTALLED VERSIONS
commit : 8dab54d
python : 3.8.10.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19045
machine : AMD64
processor : Intel64 Family 6 Model 165 Stepping 2, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : English_United States.1252
pandas : 1.5.2
numpy : 1.22.4
pytz : 2022.1
dateutil : 2.8.2
setuptools : 63.4.1
pip : 22.2.2
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 3.0.3
lxml.etree : 4.9.1
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.4.0
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : 2022.8.2
gcsfs : None
matplotlib : 3.6.0
numba : None
numexpr : None
odfpy : None
openpyxl : 3.0.10
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.7.1
snappy : None
sqlalchemy : 1.4.45
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None
tzdata : None