BUG: assign with apply over axis=1 sometimes fails when the DataFrame has zero rows #50244

Open
3 tasks done
cf-vrgl opened this issue Dec 13, 2022 · 13 comments
Labels
Apply Apply, Aggregate, Transform, Map Bug

Comments

@cf-vrgl

cf-vrgl commented Dec 13, 2022

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
import datetime

df = pd.DataFrame([
    [
    datetime.datetime(2022, 6, 1), datetime.datetime(2022, 6, 2)
    ]], columns=['a', 'b'])

# works
df = df.assign(c=lambda d: d.apply(lambda row: min(row['a'] - datetime.timedelta(days=1), row['b']), axis=1).values)

# drop row
df = df.query("a==b")

# show column expansion
print(df.apply(lambda row: min(row['a'] - datetime.timedelta(days=1), row['b']), axis=1))

# fails
df = df.assign(d=lambda ld: ld.apply(lambda row: min(row['a'] - datetime.timedelta(days=1), row['b']), axis=1).values)

Issue Description

When the DataFrame has zero rows, apply(lambda row: ..., axis=1) returns a DataFrame with all of the original columns rather than a single result column, which leads to a ValueError in assign.

Expected Behavior

A new column should be created.

Installed Versions

INSTALLED VERSIONS

commit : 8dab54d
python : 3.8.10.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19045
machine : AMD64
processor : Intel64 Family 6 Model 165 Stepping 2, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : English_United States.1252
pandas : 1.5.2
numpy : 1.22.4
pytz : 2022.1
dateutil : 2.8.2
setuptools : 63.4.1
pip : 22.2.2
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 3.0.3
lxml.etree : 4.9.1
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.4.0
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : 2022.8.2
gcsfs : None
matplotlib : 3.6.0
numba : None
numexpr : None
odfpy : None
openpyxl : 3.0.10
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.7.1
snappy : None
sqlalchemy : 1.4.45
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None
tzdata : None

@cf-vrgl cf-vrgl added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Dec 13, 2022
@ShisuiUzumaki
Contributor

@cf-vrgl if this issue is not too complex, I would like to take it up. Could you guide me if possible?

@cf-vrgl
Author

cf-vrgl commented Dec 14, 2022

@ShisuiUzumaki, it seems there is some unreliable behavior when the DataFrame has no rows. Sometimes assign using an apply over rows succeeds, other times it does not. The latest version seems to raise an error more often than earlier versions:

import pandas as pd
import datetime

df = pd.DataFrame([
    [
    datetime.datetime(2022, 6, 1), datetime.datetime(2022, 6, 2)
    ]], columns=['a', 'b'])

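def f(row):
    # earlier of (a - 1 day) and b; timedelta must be qualified because only `datetime` is imported
    return min(row['a'] - datetime.timedelta(days=1), row['b'])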

# apply returns 1 column:
print(df.apply(lambda row: f(row), axis=1).shape)

# no problem adding column when at least 1 row of data
df = df.assign(both_work=lambda d: d.apply(lambda row: f(row), axis=1))

# drop row
df = df.query("a==b")

# now that the row has been dropped, apply returns multiple columns
print(df.apply(lambda row: f(row), axis=1).shape)

# if the column being assigned to already exists, this is ok in pandas 1.2.4, but not in 1.5.2:
df = df.assign(old_ok=datetime.datetime(2000, 1, 1)).assign(old_ok=lambda df: df.apply(lambda row: f(row), axis=1))

# both versions fail if the column does not already exist
df = df.assign(both_fail=lambda df: df.apply(lambda row: f(row), axis=1))

@ShisuiUzumaki
Contributor

ShisuiUzumaki commented Dec 14, 2022

What solution do you suggest? I mean, how should I approach the issue?

@cf-vrgl
Author

cf-vrgl commented Dec 14, 2022

I suggest stepping through pandas 1.2.4 and 1.5.2 for this line to see why 1.5.2 fails but 1.2.4 does not. Hopefully, the difference on this line will give some insight on how to resolve the issue.

df = df.assign(old_ok=datetime.datetime(2000, 1, 1)).assign(old_ok=lambda df: df.apply(lambda row: f(row), axis=1))

@ShisuiUzumaki
Contributor

So, for starters, I should debug this code snippet in both versions to identify the issue, right?

@cf-vrgl
Author

cf-vrgl commented Dec 14, 2022

Yes, I think that is a good first step.

@ShisuiUzumaki
Contributor

OK, I will start with that.

@rhshadrach
Member

@cf-vrgl - could you give this issue an informative title?

@rhshadrach rhshadrach added the Apply Apply, Aggregate, Transform, Map label Dec 15, 2022
@rhshadrach
Member

When there are zero rows, apply(lambda row: , axis=1) returns all columns rather than a single result column leading to a value error.

It would be helpful if you can give a reproducible example that only involves apply to an explicitly constructed DataFrame (rather than relying on operations to get the input), and what the output you believe to be buggy is.

@rhshadrach rhshadrach added the Needs Info Clarification about behavior needed to assess issue label Dec 15, 2022
@cf-vrgl cf-vrgl changed the title BUG: BUG: assign with apply over axs=1 sometimes fails when the dataframe has zero rows Dec 15, 2022
@cf-vrgl
Author

cf-vrgl commented Dec 15, 2022

@rhshadrach

import datetime
import pandas as pd
df = pd.DataFrame({
    'a': pd.Series([], dtype="datetime64[ns]", name='a'),
    'b': pd.Series([], dtype="datetime64[ns]", name='b')})

# expected result is 0 x 1, actual is 0 x 2
print(
    df
    .apply(lambda row: min(row['a'] - datetime.timedelta(days=1), row['b']),
           axis=1)
    .shape)

# expected result is a df with three datetime64[ns] columns and no rows, actual is fail
new_df = df.assign(
    new_datetime_col=lambda d: d.apply(lambda row: min(row['a'] - datetime.timedelta(days=1), row['b']),
                                       axis=1))

@cf-vrgl cf-vrgl changed the title BUG: assign with apply over axs=1 sometimes fails when the dataframe has zero rows BUG: assign with apply over axis=1 sometimes fails when the DataFrame has zero rows Dec 15, 2022
@rhshadrach
Member

Thanks @cf-vrgl. I think your example above can be further simplified to just:

df.apply(
    lambda row: min(row['a'] - datetime.timedelta(days=1), row['b']),
    axis=1
)

I understand the desire to have pandas return 0 x 1 here, but the issue is: how is pandas supposed to determine that? User Defined Functions (UDFs) are black boxes; pandas can only give them input and get their output. In this case, pandas tried to call the function, but it raised, so pandas can't learn anything from the output.

Compare this to

df.apply(lambda row: row, axis=1)

You would not expect this to return 0 x 1. However, when the UDF raises, pandas has nothing to go on to determine what the resulting shape is, so it just returns the input's shape. This should be better documented, but before that I think we should establish what the desired behavior is, as in #47959.
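
To make the point about opaque UDFs concrete, here is a minimal sketch (not taken from the thread) that applies three functions to an explicitly constructed zero-row frame. The helper names `always_raises` and `returns_scalar` are hypothetical stand-ins, and the result types it prints reflect my understanding of how apply treats an empty axis; they may differ between pandas versions.

```python
import pandas as pd

# An explicitly constructed frame with two datetime columns and zero rows.
empty = pd.DataFrame({
    "a": pd.Series([], dtype="datetime64[ns]"),
    "b": pd.Series([], dtype="datetime64[ns]"),
})


def always_raises(row):
    # Hypothetical stand-in for a UDF that fails on whatever pandas passes in.
    raise ValueError("cannot handle this input")


def returns_scalar(row):
    # Hypothetical UDF that ignores its input, so calling it always succeeds.
    return 0


raised = empty.apply(always_raises, axis=1)       # pandas learns nothing
scalar = empty.apply(returns_scalar, axis=1)      # pandas sees a scalar come back
identity = empty.apply(lambda row: row, axis=1)   # pandas sees a Series come back

for name, result in [("raises", raised), ("scalar", scalar), ("identity", identity)]:
    print(name, type(result).__name__, result.shape)
```

If the raising UDF and the identity UDF print the same type and shape while the scalar UDF prints a one-dimensional result, that matches the description above: a UDF that raises is indistinguishable, from pandas' side, from one that returns the whole row.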

@cf-vrgl
Author

cf-vrgl commented Dec 16, 2022

Thanks @rhshadrach , I have a habit of only using apply over axis 1 for functions that returned a single value and overlooked other possibilities. At the risk of turning my bug report into an enhancement request/suggestion, here's a workaround that seems like it could be the start of a general solution:

import datetime
import numpy as np
import pandas as pd
from typing import Callable

df = pd.DataFrame({
    'a': pd.Series([], dtype="datetime64[ns]", name='a'),
    'b': pd.Series([], dtype="datetime64[ns]", name='b')})

def default_assign(df: pd.DataFrame, col_name: str, dtype: str, f: Callable):
    "assign(col_name=lambda df: f(df)) with a default dtype for zero row case"

    if len(df) > 0:
        return df.assign(**{col_name: lambda d: f(d)})

    # Zero rows: add (or replace) the column as an empty Series of the
    # requested dtype, without mutating the caller's DataFrame.
    return df.assign(**{col_name: pd.Series(name=col_name, dtype=dtype)})

# expected result is a df with three datetime64[ns] columns and no rows
new_df = df.pipe(
    default_assign,
    f=lambda d: d.apply(lambda row: min(row['a'] - datetime.timedelta(days=1), row['b']), axis=1),
    col_name='new_datetime_col',
    dtype="datetime64[ns]",
)


print(new_df.loc[:, 'new_datetime_col'].dtype)
print(new_df)

@rhshadrach
Member

As far as I can tell, the issue isn't assign, it's apply. Even with the above code, apply would still be returning something you don't desire. So it wouldn't fix the root cause of the issue.
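
One possible way to act at the apply level, offered as an assumption rather than something proposed in this thread: `DataFrame.apply` has a documented `result_type="reduce"` option that asks for a Series instead of leaving pandas to infer the output shape. The sketch below assumes this also covers the zero-row case; `earlier_of` is a hypothetical name for the UDF from the report.

```python
import datetime

import pandas as pd

df = pd.DataFrame({
    "a": pd.Series([], dtype="datetime64[ns]"),
    "b": pd.Series([], dtype="datetime64[ns]"),
})


def earlier_of(row):
    # Same logic as the UDF in the report: the earlier of (a - 1 day) and b.
    return min(row["a"] - datetime.timedelta(days=1), row["b"])


# Ask apply to reduce each row to one value rather than infer the shape.
result = df.apply(earlier_of, axis=1, result_type="reduce")
print(type(result).__name__, result.shape)

# With a one-dimensional result, assign can add the new column.
new_df = df.assign(new_datetime_col=result)
print(new_df.shape)
```

The empty column will not necessarily come back as datetime64[ns], so an explicit cast may still be needed afterwards. This does not settle what the default behavior should be; it only sidesteps the inference step for callers who already know their UDF returns a single value.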

@rhshadrach rhshadrach removed the Needs Info Clarification about behavior needed to assess issue label Apr 17, 2023
@mroeschke mroeschke removed the Needs Triage Issue that has not been reviewed by a pandas team member label Jul 16, 2024