Skip to content

BUG: fixed Series.dt methods in ArrowDtype class that were returning incorrect values #57355 #58052

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

St0rmie
Copy link
Contributor

@St0rmie St0rmie commented Mar 28, 2024

Fixed a bug where using the Series.dt methods for ArrowDtype would return incorrect time values and added a test to prevent that.

Reproducible example:

start_dates = (
    pd.Series([
        '2016-03-08',
    ], dtype='datetime64[ns]'
).astype('timestamp[ns][pyarrow]'))

end_dates = (
    pd.Series([
        '2017-03-08 23:59:59.999', 
     ], dtype='datetime64[ns]'
).astype('timestamp[ns][pyarrow]'))

duration = (end_dates - start_dates)

print(duration)
print(duration.dt.components)

Running the above example on main branch would give the following output:

0    365 days 23:59:59.999000
dtype: duration[ns][pyarrow]
   days  hours  minutes  seconds  milliseconds  microseconds  nanoseconds
0   365     23       59    86399           999        999000            0

By running the example with this fix, i get the following output:

0    365 days 23:59:59.999000
dtype: duration[ns][pyarrow]
   days  hours  minutes  seconds  milliseconds  microseconds  nanoseconds
0   365     23       59       59           999             0            0

Copy link
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Apr 28, 2024
@St0rmie St0rmie force-pushed the fix-arrowdtype-dt-components-incorrect-values branch 2 times, most recently from c7099e4 to ddf075c Compare May 4, 2024 22:00
@MarcoGorelli MarcoGorelli removed the Stale label May 5, 2024
@@ -2644,7 +2644,11 @@ def _dt_minutes(self) -> Self:
def _dt_seconds(self) -> Self:
return type(self)(
pa.array(
self._to_timedeltaarray().seconds, from_pandas=True, type=pa.int32()
[
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately this is not where we want to fix things - doing a Python loop here would be a significant slow down compared to the non-arrow case.

I think the problem is that the Arrow duration does not have the same storage as the CPython timedelta, yet we use the CPython API to try and extract those components in pandas/_libs/tslibs/timedeltas.pyx

I vaguely recall this being a conversation with the larger team before but can't remember specifics. Maybe @jbrockmendel or @jorisvandenbossche knows

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks like it matches what we get for the non-pyarrow .dt.seconds, so i dont think we'd want to change it. only in .dt.components

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry. I thought that since a loop was already being done for the other properties (hours, minutes and milliseconds) then it would be reasonable to do it for the faulty ones too. However it is indeed a slow down and i will try to find another way to fix this issue. Thank you for your time.

Copy link
Contributor

github-actions bot commented Jun 9, 2024

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Jun 9, 2024
@St0rmie St0rmie force-pushed the fix-arrowdtype-dt-components-incorrect-values branch from ddf075c to 8476683 Compare June 15, 2024 05:24
@St0rmie St0rmie force-pushed the fix-arrowdtype-dt-components-incorrect-values branch from 8476683 to 61379f7 Compare June 15, 2024 11:27
@St0rmie
Copy link
Contributor Author

St0rmie commented Jun 15, 2024

I apologize for the time i took to adress this issue.
I tried to look for better ways to fix the issue and this seemed to be a reasonable option.
In timedeltas we have two types of getters for seconds: timedelta.seconds which would return the hours, minutes and seconds in seconds and timedelta.components.seconds which would return the seconds specifically. While the second one returns the atributes of the class directly, the first one applies an operation to convert hours and minutes to seconds before summing the whole count of seconds.

In pyarrow, the components seem to be generated via the methods that i've changed (_dt_hours, _dt_minutes, _dt_seconds, etc. So, in order to change the return of components i had to change these methods. However, based on @WillAyd suggestion, i managed to find a way to do it without the loops by acessing the components in the timedelta generated which fixes the issue without compromising performance.

On the other hand, taking into consideration @jbrockmendel 's suggestion, i think that since the components are being generated via the methods i've changed the option here would be either changing how components are generated or creating another method that would return the hours, minutes and seconds in seconds just like in timedelta. In my opinion, having timedelta.seconds returning the hours, minutes and seconds in seconds but having timedelta.microseconds returning just the microseconds it's a bit confusing so i think it would be better to rename that method differently (such as time_as_seconds for example) and to create the corresponding method for the pyarrow structure.

I might be wrong or have missed something, so i would be very grateful if you could guide me through it if that's the case.
Thank you for your time and patience.

@mroeschke mroeschke added this to the 3.0 milestone Jun 18, 2024
@mroeschke mroeschke added Timedelta Timedelta data type Arrow pyarrow functionality and removed Stale labels Jun 18, 2024
@mroeschke mroeschke merged commit f9f12de into pandas-dev:main Jun 18, 2024
53 checks passed
@mroeschke
Copy link
Member

Thanks @St0rmie and sorry for the delay here

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Arrow pyarrow functionality Timedelta Timedelta data type
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: the seconds attribute of .dt.components for pyarrow timestamp datatype seem faulty
5 participants