
BUG: df.apply(...) method will sometimes return DatetimeIndex on first iteration #56747

Closed
3 tasks done
AlanCPSC opened this issue Jan 5, 2024 · 11 comments
Labels
Apply (Apply, Aggregate, Transform, Map) · Bug · Closing Candidate (May be closeable, needs more eyeballs)

Comments

@AlanCPSC

AlanCPSC commented Jan 5, 2024

Pandas version checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of pandas.
  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

Create a file locally named example.json with the following contents:

{"Date":{"0":1703653200000,"1":1703566800000,"2":1703221200000,"3":1703134800000,"4":1703048400000,"5":1702962000000,"6":1702875600000,"7":1702616400000,"8":1702530000000,"9":1702443600000},"Revenue":{"0":3880359,"1":3139100,"2":2849700,"3":4884800,"4":4032200,"5":4979100,"6":6314700,"7":11503000,"8":8033300,"9":7727900}}

Create a file locally named example.py with the following contents:

import pandas as pd

# Assuming "example.json" is in the same directory as your Python script or notebook
file_path = 'example.json'

# Read DataFrame from JSON file
df = pd.read_json(file_path)

# Display the DataFrame
print(df)

# Recreate the problem
df['Date'].apply(lambda x: print(x, type(x)))
  • Unfortunately, I cannot replicate this error without revealing more sensitive sections of the codebase.

Issue Description

I currently have the following pandas dataframe object:

>>> print(my_df)
                        Date   Revenue
0  2023-12-27 00:00:00-05:00   3880359
1  2023-12-26 00:00:00-05:00   3139100
2  2023-12-22 00:00:00-05:00   2849700
3  2023-12-21 00:00:00-05:00   4884800
4  2023-12-20 00:00:00-05:00   4032200
5  2023-12-19 00:00:00-05:00   4979100
6  2023-12-18 00:00:00-05:00   6314700
7  2023-12-15 00:00:00-05:00  11503000
8  2023-12-14 00:00:00-05:00   8033300
9  2023-12-13 00:00:00-05:00   7727900

I get normal expected results when I loop through Revenue column:

>>> my_df['Revenue'].apply(lambda x: print(x, type(x)))
3880359 <class 'int'>
3139100 <class 'int'>
2849700 <class 'int'>
4884800 <class 'int'>
4032200 <class 'int'>
4979100 <class 'int'>
6314700 <class 'int'>
11503000 <class 'int'>
8033300 <class 'int'>
7727900 <class 'int'>

I get abnormal unexpected results when I loop through Date column:

>>> my_df['Date'].apply(lambda x: print(x, type(x)))
DatetimeIndex(['2023-12-27 00:00:00-05:00', '2023-12-26 00:00:00-05:00', '2023-12-22 00:00:00-05:00', '2023-12-21 00:00:00-05:00', '2023-12-20 00:00:00-05:00', '2023-12-19 00:00:00-05:00', '2023-12-18 00:00:00-05:00', '2023-12-15 00:00:00-05:00', '2023-12-14 00:00:00-05:00', '2023-12-13 00:00:00-05:00'], dtype='datetime64[ns, America/New_York]', freq=None) <class 'pandas.core.indexes.datetimes.DatetimeIndex'>
2023-12-27 00:00:00-05:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2023-12-26 00:00:00-05:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2023-12-22 00:00:00-05:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2023-12-21 00:00:00-05:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2023-12-20 00:00:00-05:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2023-12-19 00:00:00-05:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2023-12-18 00:00:00-05:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2023-12-15 00:00:00-05:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2023-12-14 00:00:00-05:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2023-12-13 00:00:00-05:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>

Why is this happening? Why do I get an index object at first?

Expected Behavior

I should be getting exclusively timestamp objects on iteration:

>>> my_df['Date'].apply(lambda x: print(x, type(x)))
2023-12-27 00:00:00-05:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2023-12-26 00:00:00-05:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2023-12-22 00:00:00-05:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2023-12-21 00:00:00-05:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2023-12-20 00:00:00-05:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2023-12-19 00:00:00-05:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2023-12-18 00:00:00-05:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2023-12-15 00:00:00-05:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2023-12-14 00:00:00-05:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2023-12-13 00:00:00-05:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>

Installed Versions

INSTALLED VERSIONS
------------------
commit           : 0f437949513225922d851e9581723d82120684a6
python           : 3.8.8.final.0
python-bits      : 64
OS               : Darwin
OS-release       : 23.2.0
Version          : Darwin Kernel Version 23.2.0: Wed Nov 15 21:54:10 PST 2023; root:xnu-10002.61.3~2/RELEASE_X86_64
machine          : x86_64
processor        : i386
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 2.0.3
numpy            : 1.22.4
pytz             : 2022.7
dateutil         : 2.8.2
setuptools       : 52.0.0.post20210125
pip              : 21.0.1
Cython           : 0.29.23
pytest           : 6.2.3
hypothesis       : None
sphinx           : 4.0.1
blosc            : None
feather          : None
xlsxwriter       : 1.3.8
lxml.etree       : 4.9.2
html5lib         : 1.1
pymysql          : None
psycopg2         : None
jinja2           : 2.11.3
IPython          : 7.22.0
pandas_datareader: None
bs4              : 4.11.1
bottleneck       : 1.3.2
brotli           : 
fastparquet      : None
fsspec           : 2023.1.0
gcsfs            : None
matplotlib       : 3.3.4
numba            : 0.53.1
numexpr          : 2.7.3
odfpy            : None
openpyxl         : 3.0.7
pandas_gbq       : None
pyarrow          : 11.0.0
pyreadstat       : None
pyxlsb           : None
s3fs             : None
scipy            : 1.6.2
snappy           : None
sqlalchemy       : 1.4.7
tables           : 3.6.1
tabulate         : None
xarray           : None
xlrd             : 2.0.1
zstandard        : None
tzdata           : 2023.3
qtpy             : 1.9.0
pyqt5            : None
@AlanCPSC AlanCPSC added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jan 5, 2024
@AlanCPSC
Author

AlanCPSC commented Jan 5, 2024

There is also a related Stack Overflow post that, unfortunately, remains unresolved here:

Why does pandas.apply return an index on the first iteration instead of the actual element?

Thank you, team, for your work and looking into this strange and potentially widespread bug. 😊

@rhshadrach
Member

Unfortunately, I cannot replicate this error without revealing more sensitive sections of the codebase.

So the output you posted above is not the actual output you get?

In any case, can you try using Series.map instead? Right now apply will try row-by-row first, and then fall back to passing the entire Series to the provided function. However, I'm still confused because your output has this happening in the opposite order.
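
To make the Series.map suggestion concrete, here is a minimal sketch on made-up data. The `dates` Series and its values are assumptions for illustration, not the reporter's actual frame.

```python
import pandas as pd

# Hypothetical tz-aware column, similar in shape to the one in the report.
dates = pd.Series(
    pd.to_datetime(["2023-12-27", "2023-12-26", "2023-12-22"])
).dt.tz_localize("America/New_York")

# Series.map maps each value of the Series through the given function.
days = dates.map(lambda ts: ts.day)
print(days.tolist())
```

Swapping `apply` for `map` in the real code should only require that the function accepts a single value and returns a single value.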

@rhshadrach rhshadrach added the Apply Apply, Aggregate, Transform, Map label Jan 6, 2024
@AlanCPSC
Author

AlanCPSC commented Jan 7, 2024

So the output you posted above is not the actual output you get?

It is the actual output I get while looping through an internally generated dataframe timestamp column. However, I ran into an unexpected error where I sometimes get a DatetimeIndex as the first element. When I save the dataframe to JSON and reinitialize it, I am unable to reproduce this error. So, my assumption is that there might be some internal temporary variable in pandas itself that isn't being flushed correctly or something.

To create a "reproducible" code snippet, I'd need to reveal sensitive parts of the codebase, which I cannot do. Although this isn't particularly helpful, I wanted to highlight the issue anyways, ensuring that the staff is aware of an anomaly occurring in the apply(...) method.

However I'm still confused because your output has this happening in the opposite order.

Not sure if this is helpful, but the dataframe was flipped prior to calling apply(...).

actual_df = actual_df.iloc[::-1]
actual_df = actual_df.reset_index(drop=False)
...
actual_df['new_column'] = actual_df['time_column'].apply(lambda x: print(x, type(x)))

@rhshadrach
Member

When I save the dataframe into a JSON and reinitialize it, I find that I am unable to reproduce this error. So, my assumption is that there might be some internal temporary variable in pandas itself that isn't being flushed correctly or something.

I suspect that saving to JSON is lossy. Can you try the following:

df = pd.read_json(...)
pd.testing.assert_series_equal(actual_df['time_column'], df['Date'], check_names=False)

Also, just to be sure, what is the output of actual_df['time_column'].dtype and is lambda x: print(x, type(x)) the exact function that reproduces the error on the actual data?

@rhshadrach
Member

You've checked that you confirmed this exists on the main branch of pandas, is that accurate?

@AlanCPSC
Author

AlanCPSC commented Jan 8, 2024

I suspect that saving to JSON is lossy. Can you try the following:

I get the following error

AssertionError: Attributes of Series are different

Attribute "dtype" are different
[left]:  datetime64[ns, America/New_York]
[right]: datetime64[ns]

Also, just to be sure, what is the output of actual_df['time_column'].dtype

I get the following output

datetime64[ns, America/New_York]
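
The dtype mismatch above can be sketched on made-up data. This is only an illustration under assumptions: the `naive` and `aware` Series and the `is_tz_aware` helper are hypothetical names, and the naive Series stands in for what came back from the JSON file.

```python
import pandas as pd

# Hypothetical stand-ins: a naive column and a tz-aware version of it.
naive = pd.Series(pd.to_datetime(["2023-12-27", "2023-12-26"]))
aware = naive.dt.tz_localize("America/New_York")


def is_tz_aware(s: pd.Series) -> bool:
    """Return True when the Series has a timezone-aware datetime dtype."""
    return isinstance(s.dtype, pd.DatetimeTZDtype)


# assert_series_equal checks dtypes by default, so this comparison fails.
try:
    pd.testing.assert_series_equal(aware, naive)
    dtypes_match = True
except AssertionError:
    dtypes_match = False

print(is_tz_aware(naive), is_tz_aware(aware), dtypes_match)
```

A check like `is_tz_aware` is a quick way to tell whether a reloaded column still has the same kind of dtype as the original before trying to reproduce a problem with it.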

Is lambda x: print(x, type(x)) the exact function that reproduces the error on the actual data?

No, the actual function is doing something actually useful. It just blows up with a TypeError instead.

You've checked that you confirmed this exists on the main branch of pandas, is that accurate?

Yes. But don't break your neck trying to track it down. I have a feeling this would be one of the more elusive bugs. The point of this ticket was just to bubble up that there IS an anomaly going on and make a record of it.

@AlanCPSC
Author

AlanCPSC commented Jan 8, 2024

Right now apply will try row-by-row first, and then fall back to passing the entire Series to the provided function.

Sorry, could you please elaborate on this? How will it know when the method has failed and when to "fallback"? Does it consider the None return as a failure, or is it checking for exceptions?

@rhshadrach
Member

Is lambda x: print(x, type(x)) the exact function that reproduces the error on the actual data?

No, the actual function is doing something actually useful. It just blows up with a TypeError instead.

I think I was not clear. In the OP you posted the output

DatetimeIndex(['2023-12-27 00:00:00-05:00', '2023-12-26 00:00:00-05:00', '2023-12-22 00:00:00-05:00', '2023-12-21 00:00:00-05:00', '2023-12-20 00:00:00-05:00', '2023-12-19 00:00:00-05:00', '2023-12-18 00:00:00-05:00', '2023-12-15 00:00:00-05:00', '2023-12-14 00:00:00-05:00', '2023-12-13 00:00:00-05:00'], dtype='datetime64[ns, America/New_York]', freq=None) <class 'pandas.core.indexes.datetimes.DatetimeIndex'>
2023-12-27 00:00:00-05:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2023-12-26 00:00:00-05:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
...

If you use the function lambda x: print(x, type(x)) on your actual data, do you actually get that output above?

Right now apply will try row-by-row first, and then fall back to passing the entire Series to the provided function.

Sorry, could you please elaborate on this? How will it know when the method has failed and when to "fallback"? Does it consider the None return as a failure, or is it checking for exceptions?

We check for ValueError, AttributeError, and TypeError. Returning None will not result in using the fallback.

pandas/pandas/core/apply.py

Lines 1477 to 1480 in d8e9529

try:
    result = obj.apply(func, by_row="compat")
except (ValueError, AttributeError, TypeError):
    result = obj.apply(func, by_row=False)
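
The try/except pattern in that snippet can be modeled without pandas. The following is a toy sketch, not the pandas implementation: `apply_with_fallback` is a hypothetical helper that works on a plain list, and it only illustrates that the fallback is triggered by an exception and not by a None return value.

```python
def apply_with_fallback(values, func):
    """Toy model: try element by element, else pass the whole container."""
    try:
        return [func(v) for v in values]
    except (ValueError, AttributeError, TypeError):
        # Only reached when func raised one of the listed exceptions.
        return func(values)


# Works element by element, so no fallback happens.
print(apply_with_fallback([1, 2, 3], lambda x: x + 1))

# Returning None is not a failure, so no fallback happens here either.
print(apply_with_fallback([1, 2], lambda x: None))

# sum(1) raises TypeError, so the whole list is passed to sum instead.
print(apply_with_fallback([1, 2, 3], sum))
```

In this model, a function that raises TypeError on a single element ends up being called again with the whole container, which is one way a callable can see an unexpected argument type.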

@rhshadrach
Member

I was able to reproduce this on 2.0.x using the following reproducer:

import io

import pandas as pd

data = """{"Date":{"0":1703653200000,"1":1703566800000,"2":1703221200000},"Revenue":{"0":3880359,"1":3139100,"2":2849700}}"""
df = pd.read_json(io.StringIO(data))
df["Date"] = pd.to_datetime(df["Date"]).dt.tz_localize(tz='US/Eastern')
df["Date"].apply(lambda x: print(x, type(x)))

The .dt.tz_localize(...) call makes the Series use an ExtensionArray, which is not part of the JSON output. The behavior that produces the output in the OP was changed in #52033. It no longer appears on main, contrary to what was previously reported.
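
One way to check whether an installed pandas version shows this behavior is to record the type of every argument the callable receives. This is a minimal sketch under assumptions: the `seen` list and the "affected" label are illustrative names, and the column is built directly from the epoch values instead of being read from JSON.

```python
import pandas as pd

# Hypothetical tz-aware column built from the epoch-millisecond values above.
dates = pd.Series(
    pd.to_datetime([1703653200000, 1703566800000, 1703221200000], unit="ms")
).dt.tz_localize("America/New_York")

seen = []  # type names of every argument the callable receives
dates.apply(lambda x: seen.append(type(x).__name__))

# Anything other than a Timestamp means the callable got a container.
whole_container_calls = [name for name in seen if name != "Timestamp"]
print(seen)
print("affected" if whole_container_calls else "not affected")
```

On a version with the behavior described in this thread, `seen` would be expected to contain an extra entry such as DatetimeIndex in addition to one Timestamp per row.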

@rhshadrach rhshadrach added Closing Candidate May be closeable, needs more eyeballs and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Jan 8, 2024
@AlanCPSC
Author

AlanCPSC commented Jan 9, 2024

My apologies for any confusion. Thank you very much for your hard work. 🙌

@rhshadrach
Member

Not a problem, thanks for all the information. If you find this is still happening on pandas 2.1 or later, please comment here and we can reopen!
