BUG: df.apply(...) method will sometimes return DatetimeIndex on first iteration #56747

AlanCPSC · 2024-01-05T21:03:23Z

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

Create a file locally named example.json with the following contents:

{"Date":{"0":1703653200000,"1":1703566800000,"2":1703221200000,"3":1703134800000,"4":1703048400000,"5":1702962000000,"6":1702875600000,"7":1702616400000,"8":1702530000000,"9":1702443600000},"Revenue":{"0":3880359,"1":3139100,"2":2849700,"3":4884800,"4":4032200,"5":4979100,"6":6314700,"7":11503000,"8":8033300,"9":7727900}}

Create a file locally named example.py with the following contents:

import pandas as pd

# Assuming "example.json" is in the same directory as your Python script or notebook
file_path = 'example.json'

# Read DataFrame from JSON file
df = pd.read_json(file_path)

# Display the DataFrame
print(df)

# Recreate the problem
df['Date'].apply(lambda x: print(x, type(x)))

Unfortunately, I cannot replicate this error without revealing more sensitive sections of the codebase.

Issue Description

I currently have the following pandas dataframe object:

>>> print(my_df)
                        Date   Revenue
0  2023-12-27 00:00:00-05:00   3880359
1  2023-12-26 00:00:00-05:00   3139100
2  2023-12-22 00:00:00-05:00   2849700
3  2023-12-21 00:00:00-05:00   4884800
4  2023-12-20 00:00:00-05:00   4032200
5  2023-12-19 00:00:00-05:00   4979100
6  2023-12-18 00:00:00-05:00   6314700
7  2023-12-15 00:00:00-05:00  11503000
8  2023-12-14 00:00:00-05:00   8033300
9  2023-12-13 00:00:00-05:00   7727900

I get normal expected results when I loop through Revenue column:

>>> my_df['Revenue'].apply(lambda x: print(x, type(x)))
3880359 <class 'int'>
3139100 <class 'int'>
2849700 <class 'int'>
4884800 <class 'int'>
4032200 <class 'int'>
4979100 <class 'int'>
6314700 <class 'int'>
11503000 <class 'int'>
8033300 <class 'int'>
7727900 <class 'int'>

I get abnormal unexpected results when I loop through Date column:

>>> my_df['Date'].apply(lambda x: print(x, type(x)))
DatetimeIndex(['2023-12-27 00:00:00-05:00', '2023-12-26 00:00:00-05:00', '2023-12-22 00:00:00-05:00', '2023-12-21 00:00:00-05:00', '2023-12-20 00:00:00-05:00', '2023-12-19 00:00:00-05:00', '2023-12-18 00:00:00-05:00', '2023-12-15 00:00:00-05:00', '2023-12-14 00:00:00-05:00', '2023-12-13 00:00:00-05:00'], dtype='datetime64[ns, America/New_York]', freq=None) <class 'pandas.core.indexes.datetimes.DatetimeIndex'>
2023-12-27 00:00:00-05:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2023-12-26 00:00:00-05:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2023-12-22 00:00:00-05:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2023-12-21 00:00:00-05:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2023-12-20 00:00:00-05:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2023-12-19 00:00:00-05:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2023-12-18 00:00:00-05:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2023-12-15 00:00:00-05:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2023-12-14 00:00:00-05:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2023-12-13 00:00:00-05:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>

Why is this happening? Why do I get an index object at first?

Expected Behavior

I should be getting exclusively timestamp objects on iteration:

>>> my_df['Date'].apply(lambda x: print(x, type(x)))
2023-12-27 00:00:00-05:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2023-12-26 00:00:00-05:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2023-12-22 00:00:00-05:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2023-12-21 00:00:00-05:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2023-12-20 00:00:00-05:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2023-12-19 00:00:00-05:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2023-12-18 00:00:00-05:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2023-12-15 00:00:00-05:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2023-12-14 00:00:00-05:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2023-12-13 00:00:00-05:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>

Installed Versions

INSTALLED VERSIONS
------------------
commit           : 0f437949513225922d851e9581723d82120684a6
python           : 3.8.8.final.0
python-bits      : 64
OS               : Darwin
OS-release       : 23.2.0
Version          : Darwin Kernel Version 23.2.0: Wed Nov 15 21:54:10 PST 2023; root:xnu-10002.61.3~2/RELEASE_X86_64
machine          : x86_64
processor        : i386
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 2.0.3
numpy            : 1.22.4
pytz             : 2022.7
dateutil         : 2.8.2
setuptools       : 52.0.0.post20210125
pip              : 21.0.1
Cython           : 0.29.23
pytest           : 6.2.3
hypothesis       : None
sphinx           : 4.0.1
blosc            : None
feather          : None
xlsxwriter       : 1.3.8
lxml.etree       : 4.9.2
html5lib         : 1.1
pymysql          : None
psycopg2         : None
jinja2           : 2.11.3
IPython          : 7.22.0
pandas_datareader: None
bs4              : 4.11.1
bottleneck       : 1.3.2
brotli           : 
fastparquet      : None
fsspec           : 2023.1.0
gcsfs            : None
matplotlib       : 3.3.4
numba            : 0.53.1
numexpr          : 2.7.3
odfpy            : None
openpyxl         : 3.0.7
pandas_gbq       : None
pyarrow          : 11.0.0
pyreadstat       : None
pyxlsb           : None
s3fs             : None
scipy            : 1.6.2
snappy           : None
sqlalchemy       : 1.4.7
tables           : 3.6.1
tabulate         : None
xarray           : None
xlrd             : 2.0.1
zstandard        : None
tzdata           : 2023.3
qtpy             : 1.9.0
pyqt5            : None

AlanCPSC · 2024-01-05T21:13:46Z

There is also a related Stack Overflow post that, unfortunately, remains unresolved here:

Why does pandas.apply return an index on the first iteration instead of the actual element?

Thank you, team, for your work and looking into this strange and potentially widespread bug. 😊

rhshadrach · 2024-01-06T13:55:13Z

Unfortunately, I cannot replicate this error without revealing more sensitive sections of the codebase.

So the output you posted above is not the actual output you get?

In any case, can you try using Series.map instead? Right now apply will try row-by-row first, and then fall back to passing the entire Series to the provided function. However, I'm still confused because your output has this happening in the opposite order.
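A minimal sketch of the Series.map suggestion, assuming a small hand-built tz-aware column (the name `dates` and its three values are illustrative, not the reporter's data). It only records the type each element is seen as; whether map avoids the whole-column call on a given pandas version is not confirmed in this thread.

```python
import pandas as pd

# Hypothetical stand-in for the reporter's tz-aware column.
dates = pd.Series(
    pd.to_datetime(["2023-12-27", "2023-12-26", "2023-12-22"])
).dt.tz_localize("America/New_York")

# Map each element to the name of its type instead of printing it,
# so the result can be inspected afterwards.
type_names = dates.map(lambda x: type(x).__name__)
print(type_names.tolist())
```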

AlanCPSC · 2024-01-07T06:45:47Z

So the output you posted above not the actual output you get?

It is the actual output I get while looping through an internally generated dataframe timestamp column. However, I ran into an unexpected error where I sometimes get a DatetimeIndex as the first element. When I save the dataframe into a JSON and reinitialize it, I find that I am unable to reproduce this error. So, my assumption is that there might be some internal temporary variable in pandas itself that isn't being flushed correctly or something.

To create a "reproducible" code snippet, I'd need to reveal sensitive parts of the codebase, which I cannot do. Although this isn't particularly helpful, I wanted to highlight the issue anyways, ensuring that the staff is aware of an anomaly occurring in the apply(...) method.

However I'm still confused because your output has this happening in the opposite order.

Not sure if this is helpful, but the dataframe was flipped prior to calling apply(...).

actual_df = actual_df.iloc[::-1]
actual_df = actual_df.reset_index(drop=False)
...
actual_df['new_column'] = actual_df['time_column'].apply(lambda x: print(x, type(x)))

rhshadrach · 2024-01-07T13:12:00Z

When I save the dataframe into a JSON and reinitialize it, I find that I am unable to reproduce this error. So, my assumption is that there might be some internal temporary variable in pandas itself that isn't being flushed correctly or something.

I suspect that saving to JSON is lossy. Can you try the following:

df = pd.read_json(...)
pd.testing.assert_series_equal(actual_df['time_column'], df['Date'], check_names=False)

Also, just to be sure, what is the output of actual_df['time_column'].dtype and is lambda x: print(x, type(x)) the exact function that reproduces the error on the actual data?

rhshadrach · 2024-01-07T13:34:50Z

You've checked that you confirmed this exists on the main branch of pandas, is that accurate?

AlanCPSC · 2024-01-08T07:25:22Z

I suspect that saving to JSON is lossy. Can you try the following:

I get the following error

AssertionError: Attributes of Series are different

Attribute "dtype" are different
[left]:  datetime64[ns, America/New_York]
[right]: datetime64[ns]

Also, just to be sure, what is the output of actual_df['time_column'].dtype

I get the following output

datetime64[ns, America/New_York]

Is lambda x: print(x, type(x)) the exact function that reproduces the error on the actual data?

No, the actual function is doing something actually useful. It just blows up with a TypeError instead.

You've checked that you confirmed this exists on the main branch of pandas, is that accurate?

Yes. But don't break your neck trying to track it down. I have a feeling this would be one of the more elusive bugs. The point of this ticket was just to bubble up that there IS an anomaly going on and make a record of it.

AlanCPSC · 2024-01-08T09:52:45Z

Right now apply will try row-by-row first, and then fallback to passing the entire Series to the provided function.

Sorry, could you please elaborate on this? How will it know when the method has failed and when to "fallback"? Does it consider the None return as a failure, or is it checking for exceptions?

rhshadrach · 2024-01-08T21:13:58Z

Is lambda x: print(x, type(x)) the exact function that reproduces the error on the actual data?

No, the actual function is doing something actually useful. It just blows up with a TypeError instead.

I think I was not clear. In the OP you posted the output

DatetimeIndex(['2023-12-27 00:00:00-05:00', '2023-12-26 00:00:00-05:00', '2023-12-22 00:00:00-05:00', '2023-12-21 00:00:00-05:00', '2023-12-20 00:00:00-05:00', '2023-12-19 00:00:00-05:00', '2023-12-18 00:00:00-05:00', '2023-12-15 00:00:00-05:00', '2023-12-14 00:00:00-05:00', '2023-12-13 00:00:00-05:00'], dtype='datetime64[ns, America/New_York]', freq=None) <class 'pandas.core.indexes.datetimes.DatetimeIndex'>
2023-12-27 00:00:00-05:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2023-12-26 00:00:00-05:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
...

If you use the function lambda x: print(x, type(x)) on your actual data, do you actually get that output above?

Right now apply will try row-by-row first, and then fallback to passing the entire Series to the provided function.

Sorry, could you please elaborate on this? How will it know when the method has failed and when to "fallback"? Does it consider the None return as a failure, or is it checking for exceptions?

We check for ValueError, AttributeError, and TypeError. Returning None will not result in using the fallback.

pandas/pandas/core/apply.py

Lines 1477 to 1480 in d8e9529

    
    try:
        result = obj.apply(func, by_row="compat")
    except (ValueError, AttributeError, TypeError):
        result = obj.apply(func, by_row=False)
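The try/except pattern quoted above can be illustrated with a small pure-Python sketch. `apply_with_fallback` is a hypothetical stand-in written for this illustration, not pandas code; it only mirrors the idea of trying element by element first and passing the whole container when one of the three listed exceptions is raised.

```python
def apply_with_fallback(values, func):
    """Hypothetical stand-in for the quoted try/except; not pandas code."""
    try:
        # First attempt: call func once per element.
        return [func(v) for v in values]
    except (ValueError, AttributeError, TypeError):
        # Fallback: call func once with the whole container.
        return func(values)


calls = []


def returns_none(x):
    # Returning None is not an error, so no fallback is triggered.
    calls.append(x)
    return None


def needs_sequence(x):
    # len() raises TypeError for an int, which triggers the fallback.
    return len(x)


result_none = apply_with_fallback([1, 2, 3], returns_none)
result_len = apply_with_fallback([1, 2, 3], needs_sequence)
print(result_none, calls, result_len)
```

In this sketch a function that returns None is called once per element and never sees the whole list, while a function that raises TypeError on a single element ends up being called with the full list.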

rhshadrach · 2024-01-08T21:31:54Z

I was able to reproduce this on 2.0.x using the following reproducer:

import io

import pandas as pd

data = """{"Date":{"0":1703653200000,"1":1703566800000,"2":1703221200000},"Revenue":{"0":3880359,"1":3139100,"2":2849700}}"""
df = pd.read_json(io.StringIO(data))
df["Date"] = pd.to_datetime(df["Date"]).dt.tz_localize(tz='US/Eastern')
df["Date"].apply(lambda x: print(x, type(x)))

The .dt.tz_localize(...) call makes the Series use an ExtensionArray, which is not preserved in the JSON output. The behavior that produced the output in the OP was changed in #52033. It no longer appears on main, contrary to what was previously reported.
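A short sketch of why the JSON file did not reproduce the problem, assuming a two-row frame built by hand (the names `original` and `roundtrip` are illustrative). It writes a tz-aware column to JSON with default settings, reads it back, and prints both dtypes for comparison.

```python
import io

import pandas as pd

# Hypothetical two-row frame with a tz-aware Date column.
original = pd.DataFrame(
    {
        "Date": pd.Series(
            pd.to_datetime(["2023-12-27", "2023-12-26"])
        ).dt.tz_localize("America/New_York"),
        "Revenue": [3880359, 3139100],
    }
)

# Round-trip through JSON text using default settings.
json_text = original.to_json()
roundtrip = pd.read_json(io.StringIO(json_text))

# Compare the dtype before and after the round trip.
print(original["Date"].dtype)
print(roundtrip["Date"].dtype)
```

If the two dtypes differ, a frame reloaded from JSON is not a faithful copy of the in-memory one, which matches the assert_series_equal failure reported earlier in the thread.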

AlanCPSC · 2024-01-09T01:25:41Z

My apologies for any confusion. Thank you very much for your hard work. 🙌

rhshadrach · 2024-01-09T11:56:47Z

Not a problem, thanks for all the information. If you find this is still happening on pandas 2.1 or later, please comment here and we can reopen!

AlanCPSC added the Bug and Needs Triage labels Jan 5, 2024

rhshadrach added the Apply label Jan 6, 2024

rhshadrach added the Closing Candidate label and removed the Needs Triage label Jan 8, 2024

rhshadrach closed this as completed Jan 9, 2024
