Skip to content

BUG: Add fix for hashing timestamps with folds #44282

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Jan 4, 2022

Conversation

jonwiggins
Copy link
Contributor

@jonwiggins jonwiggins commented Nov 2, 2021

Edit:

#42305 (comment)
This comment identifies the problem as being upstream in datetime's construction of timestamps which fails during some DST changes.

The datetime hash function then being used tries to unset the fold representing the DST change, (https://github.com/python/cpython/blob/main/Lib/datetime.py#L2117). But it does this using the datetime replace function - this change fixes that problem by removing the fold prior to calling the datetime hash function using timestamp's replace function instead.

@jonwiggins jonwiggins closed this Nov 2, 2021
@jonwiggins jonwiggins reopened this Nov 2, 2021
@github-actions
Copy link
Contributor

github-actions bot commented Dec 3, 2021

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Dec 3, 2021

def test_hash_timestamp_with_fold():
# see gh-33931
america_chicago = dateutil.tz.gettz("America/Chicago")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you use pytest.mark.parameterize over the timezone and TImestamp inputs?

@@ -166,3 +166,24 @@ def test_maybe_get_tz_offset_only():

tz = timezones.maybe_get_tz("UTC-02:45")
assert tz == timezone(-timedelta(hours=2, minutes=45))


def test_hash_timestamp_with_fold():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

try tests.scalar.timestamp.test_timestamp, there's another test with "hash" in the name that this can go right next to

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a whatsnew note abou this 1.4., datetime bug fix section.

also we'd like to assure that this isn't impacting perf very much. if you can run the datetime asv's would be great

@jreback jreback added Timezones Timezone data dtype and removed Stale labels Dec 30, 2021
@jonwiggins
Copy link
Contributor Author

@jreback I don't see an asv set datetime specifically, so I tried to run the all of the ones for tslibs, let me know if you want me to run a different set.

-> asv continuous -f 1.1 -E virtualenv upstream/master fold_hash_fix -b ^tslibs
...

     [95123939]       [c6400d66]
     <master>         <fold_hash_fix>
-     1.47±0.02μs      1.34±0.03μs     0.91  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(0, 1000, tzlocal())
-     4.40±0.03μs      3.99±0.09μs     0.91  tslibs.tslib.TimeIntsToPydatetime.time_ints_to_pydatetime('time', 0, <DstTzInfo 'US/Pacific' LMT-1 day, 16:07:00 STD>)
-     1.49±0.03μs      1.35±0.04μs     0.91  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(0, 2011, tzlocal())
-     1.47±0.08μs      1.32±0.02μs     0.90  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(0, 3000, None)
-     1.53±0.07μs      1.37±0.02μs     0.89  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(1, 2000, tzlocal())
-     1.50±0.08μs      1.33±0.02μs     0.88  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(1, 1000, None)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

@jbrockmendel
Copy link
Member

There aren't any asvs that specifically hit this. If you wanted to add one, it would go in benchmarks.tslibs.timestamp. Ideally in TimestampOps, but i dont know if it the ts defined in setup there makes a difference.

(an asv for this is likely overkill)

@jreback jreback added this to the 1.4 milestone Dec 31, 2021
@jreback
Copy link
Contributor

jreback commented Dec 31, 2021

can you add the test here: #40817 as well. ping on green for merge

Add unit test for reindexing with fold
@jonwiggins
Copy link
Contributor Author

@jreback Sounds good, I've added the reindex unit test and mentioned it in whatsnew

@jonwiggins
Copy link
Contributor Author

@jreback Any more thoughts?

@jreback jreback merged commit 490b189 into pandas-dev:master Jan 4, 2022
@jreback
Copy link
Contributor

jreback commented Jan 4, 2022

thanks @jonwiggins very nice!

@jonwiggins jonwiggins deleted the fold_hash_fix branch January 4, 2022 00:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Timezones Timezone data dtype
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: hash of Timestamp on fold=1 create a Segfault
4 participants