BUG: assign with apply over axis=1 sometimes fails when the DataFrame has zero rows #50244

cf-vrgl · 2022-12-13T21:48:16Z

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
import datetime

df = pd.DataFrame([
    [
    datetime.datetime(2022, 6, 1), datetime.datetime(2022, 6, 2)
    ]], columns=['a', 'b'])

# works
df = df.assign(c=lambda d: d.apply(lambda row: min(row['a'] - datetime.timedelta(days=1), row['b']), axis=1).values)

# drop row
df = df.query("a==b")

# show column expansion
print(df.apply(lambda row: min(row['a'] - datetime.timedelta(days=1), row['b']), axis=1))

# fails
df = df.assign(d=lambda ld: ld.apply(lambda row: min(row['a'] - datetime.timedelta(days=1), row['b']), axis=1).values)

Issue Description

When there are zero rows, apply(lambda row: , axis=1) returns all columns rather than a single result column, which leads to a ValueError.

Expected Behavior

A new column should be created.

Installed Versions

INSTALLED VERSIONS

commit : 8dab54d
python : 3.8.10.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19045
machine : AMD64
processor : Intel64 Family 6 Model 165 Stepping 2, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : English_United States.1252
pandas : 1.5.2
numpy : 1.22.4
pytz : 2022.1
dateutil : 2.8.2
setuptools : 63.4.1
pip : 22.2.2
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 3.0.3
lxml.etree : 4.9.1
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.4.0
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : 2022.8.2
gcsfs : None
matplotlib : 3.6.0
numba : None
numexpr : None
odfpy : None
openpyxl : 3.0.10
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.7.1
snappy : None
sqlalchemy : 1.4.45
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None
tzdata : None

ShisuiUzumaki · 2022-12-14T06:41:06Z

@cf-vrgl if this issue is not too complex, I would like to take it up. Could you guide me if possible?

cf-vrgl · 2022-12-14T15:48:57Z

@ShisuiUzumaki, it seems there is some unreliable behavior when the DataFrame has no rows. Sometimes assign using an apply over rows succeeds; other times it does not. The latest version seems to raise an error more often than earlier versions:

import pandas as pd
import datetime

df = pd.DataFrame([
    [
    datetime.datetime(2022, 6, 1), datetime.datetime(2022, 6, 2)
    ]], columns=['a', 'b'])

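def f(row):
    # timedelta must be referenced through the datetime module, since only `import datetime` is used
    return min(row['a'] - datetime.timedelta(days=1), row['b'])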

# apply returns 1 column:
print(df.apply(lambda row: f(row), axis=1).shape)

# no problem adding column when at least 1 row of data
df = df.assign(both_work=lambda d: d.apply(lambda row: f(row), axis=1))

# drop row
df = df.query("a==b")

# now that a row has been dropped, apply returns multiple columns
print(df.apply(lambda row: f(row), axis=1).shape)

# if the column being assigned to already exists, this is ok in pandas 1.2.4, but not in 1.5.2:
df = df.assign(old_ok=datetime.datetime(2000, 1, 1)).assign(old_ok=lambda df: df.apply(lambda row: f(row), axis=1))

# both versions fail if the column does not already exist
df = df.assign(both_fail=lambda df: df.apply(lambda row: f(row), axis=1))

ShisuiUzumaki · 2022-12-14T15:55:52Z

What solution do you suggest? I mean, how should I approach the issue?

cf-vrgl · 2022-12-14T16:01:06Z

I suggest stepping through pandas 1.2.4 and 1.5.2 for this line to see why 1.5.2 fails but 1.2.4 does not. Hopefully, the difference on this line will give some insight on how to resolve the issue.

df = df.assign(old_ok=datetime.datetime(2000, 1, 1)).assign(old_ok=lambda df: df.apply(lambda row: f(row), axis=1))

ShisuiUzumaki · 2022-12-14T16:02:52Z

So, for starters, I should debug this code snippet in both versions to identify the issue, right?

cf-vrgl · 2022-12-14T16:04:48Z

Yes, I think that is a good first step.

ShisuiUzumaki · 2022-12-14T16:08:54Z

OK, I will start with that.

rhshadrach · 2022-12-15T01:02:56Z

@cf-vrgl - could you give this issue an informative title?

rhshadrach · 2022-12-15T01:10:13Z

When there are zero rows, apply(lambda row: , axis=1) returns all columns rather than a single result column, which leads to a ValueError.

It would be helpful if you could give a reproducible example that involves only apply on an explicitly constructed DataFrame (rather than relying on other operations to produce the input), and state which output you believe to be buggy.

cf-vrgl · 2022-12-15T04:41:56Z

@rhshadrach

import datetime
import pandas as pd
df = pd.DataFrame({
    'a': pd.Series([], dtype="datetime64[ns]", name='a'),
    'b': pd.Series([], dtype="datetime64[ns]", name='b')})

# expected result is 0 x 1, actual is 0 x 2
print(
    df
    .apply(lambda row: min(row['a'] - datetime.timedelta(days=1), row['b']),
           axis=1)
    .shape)

# expected result is a df with three datetime64[ns] columns and no rows, actual is fail
new_df = df.assign(
    new_datetime_col=lambda d: d.apply(lambda row: min(row['a'] - datetime.timedelta(days=1), row['b']),
                                       axis=1))

rhshadrach · 2022-12-16T03:28:13Z

Thanks @cf-vrgl. I think your example above can be further simplified to just:

df.apply(
    lambda row: min(row['a'] - datetime.timedelta(days=1), row['b']),
    axis=1
)

I understand the desire to have pandas return 0 x 1 here, but how is pandas supposed to determine that? User Defined Functions (UDFs) are black boxes: pandas can only give them input and observe their output. In this case, pandas tries to call the function, but the function raises, so pandas can't learn anything from the output.

Compare this to

df.apply(lambda row: row, axis=1)

You would not expect this to return 0 x 1. However, when the UDF raises, pandas has nothing to go on to determine the resulting shape, so it just returns a result with the input's shape. This should be better documented, but before that I think we should first establish what the desired behavior is, as in #47959.
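To make the shape inference concrete, here is a minimal sketch of the cases discussed above. It assumes the empty-input behavior described in this thread still holds in the installed pandas version. The function and variable names are illustrative only, and the `result_type="reduce"` call is a possible way to tell pandas that a reduction is intended, not a recommendation from the maintainers.

```python
import datetime
import pandas as pd

# Explicitly constructed zero-row frame with two datetime columns.
df = pd.DataFrame({
    "a": pd.Series([], dtype="datetime64[ns]"),
    "b": pd.Series([], dtype="datetime64[ns]"),
})

def raises_on_empty(row):
    # Per the discussion above, this UDF raises when pandas calls it
    # for a zero-row frame, so pandas learns nothing from it.
    return min(row["a"] - datetime.timedelta(days=1), row["b"])

def returns_scalar(row):
    # A UDF that never raises and always returns a scalar.
    return 0

raised = df.apply(raises_on_empty, axis=1)        # falls back to the input's shape
identity = df.apply(lambda row: row, axis=1)      # legitimately keeps all columns
scalar = df.apply(returns_scalar, axis=1)         # recognised as a reduction
# Assumption: result_type="reduce" skips the inference step for empty input.
forced = df.apply(raises_on_empty, axis=1, result_type="reduce")

print(raised.shape, identity.shape, scalar.shape, forced.shape)
```

The first two calls are indistinguishable from the outside, which is the ambiguity described above: pandas cannot tell a UDF that fails on empty input from one that genuinely returns a row.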

cf-vrgl · 2022-12-16T05:11:57Z

Thanks @rhshadrach, I have a habit of only using apply over axis 1 for functions that return a single value, and I overlooked other possibilities. At the risk of turning my bug report into an enhancement request/suggestion, here's a workaround that seems like it could be the start of a general solution:

import datetime
import numpy as np
import pandas as pd
from typing import Callable

df = pd.DataFrame({
    'a': pd.Series([], dtype="datetime64[ns]", name='a'),
    'b': pd.Series([], dtype="datetime64[ns]", name='b')})

def default_assign(df: pd.DataFrame, col_name: str, dtype: str, f: Callable) -> pd.DataFrame:
    "assign(col_name=lambda df: f(df)) with a default dtype for zero row case"

    # With at least one row, apply can infer the result shape, so defer to assign.
    if len(df) > 0:
        return df.assign(**{col_name: lambda d: f(d)})

    # With zero rows, skip f entirely and add (or overwrite) an empty column of
    # the requested dtype. assign returns a new object, so the input is not mutated.
    return df.assign(**{col_name: pd.Series(index=df.index, dtype=dtype)})

# expected result is a df with three datetime64[ns] columns and no rows
new_df = df.pipe(
    default_assign,
    f=lambda d: d.apply(lambda row: min(row['a'] - datetime.timedelta(days=1), row['b']), axis=1),
    col_name='new_datetime_col',
    dtype="datetime64[ns]",
)


print(new_df.loc[:, 'new_datetime_col'].dtype)
print(new_df)

rhshadrach · 2022-12-17T14:04:04Z

As far as I can tell, the issue isn't assign, it's apply. Even with the above code, apply would still be returning something you don't desire. So it wouldn't fix the root cause of the issue.
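Since the unexpected result comes from apply rather than assign, one way to guard at that level is sketched below. `apply_rows_scalar` is a hypothetical helper, not a pandas API. It assumes the caller knows the UDF returns one scalar per row and can name the result dtype up front.

```python
import datetime
import pandas as pd

def apply_rows_scalar(df, func, dtype):
    """Hypothetical helper: row-wise apply for UDFs known to return one scalar per row."""
    if len(df) == 0:
        # There is no row to call func on, so build the empty result directly
        # with the dtype supplied by the caller.
        return pd.Series(index=df.index, dtype=dtype)
    return df.apply(func, axis=1)

def f(row):
    return min(row["a"] - datetime.timedelta(days=1), row["b"])

empty = pd.DataFrame({
    "a": pd.Series([], dtype="datetime64[ns]"),
    "b": pd.Series([], dtype="datetime64[ns]"),
})
one_row = pd.DataFrame(
    [[datetime.datetime(2022, 6, 1), datetime.datetime(2022, 6, 2)]],
    columns=["a", "b"],
)

col = apply_rows_scalar(empty, f, "datetime64[ns]")
new_df = empty.assign(
    new_datetime_col=lambda d: apply_rows_scalar(d, f, "datetime64[ns]")
)
filled = apply_rows_scalar(one_row, f, "datetime64[ns]")

print(col.shape)
print(new_df.dtypes)
```

This does not change what `DataFrame.apply` itself returns for empty input; it only keeps the ambiguous call from being made when the caller already knows the answer.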

cf-vrgl added the Bug and Needs Triage labels Dec 13, 2022

rhshadrach added the Apply label Dec 15, 2022

rhshadrach added the Needs Info label Dec 15, 2022

cf-vrgl changed the title from "BUG:" to "BUG: assign with apply over axs=1 sometimes fails when the dataframe has zero rows" Dec 15, 2022

cf-vrgl changed the title from "BUG: assign with apply over axs=1 sometimes fails when the dataframe has zero rows" to "BUG: assign with apply over axis=1 sometimes fails when the DataFrame has zero rows" Dec 15, 2022

rhshadrach removed the Needs Info label Apr 17, 2023

mroeschke removed the Needs Triage label Jul 16, 2024
