Skip to content

BUG: get_indexer_non_unique with np.datetime64("NaT") and np.timedelta64("NaT") #43870

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 28 commits into from
Oct 18, 2021

Conversation

alexhlim
Copy link
Member

@alexhlim alexhlim commented Oct 4, 2021

@alexhlim
Copy link
Member Author

@jbrockmendel any idea why tests might be failing some machines but not others? right now, they're passing on my local machine. any tips on debugging would be appreciated :)

@jreback jreback added the Indexing Related to indexing on series/frames, not to indexes themselves label Oct 10, 2021
@jreback jreback added the Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate label Oct 10, 2021
@jbrockmendel
Copy link
Member

@jbrockmendel any idea why tests might be failing some machines but not others? right now, they're passing on my local machine. any tips on debugging would be appreciated :)

first guess is to check if numpy/numpy#12550 is biting you. i.e. when you construct Index([np_nat_fixture], dtype=object), some numpy versions might be silently casting that np_nat_fixture to an integer.

@alexhlim
Copy link
Member Author

@jbrockmendel any idea why tests might be failing some machines but not others? right now, they're passing on my local machine. any tips on debugging would be appreciated :)

first guess is to check if numpy/numpy#12550 is biting you. i.e. when you construct Index([np_nat_fixture], dtype=object), some numpy versions might be silently casting that np_nat_fixture to an integer.

You were right -- it was because the older version of numpy was down-casting the NaT. I was able to get the CI passing except for Windows (timed-out) -- is this a consequence of performance?

@jbrockmendel
Copy link
Member

I was able to get the CI passing except for Windows (timed-out) -- is this a consequence of performance?

windows timeouts are pretty common on the CI; almost certainly unrelated

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you run asv's which might hit this (conversions) and see if anything is changed

@alexhlim
Copy link
Member Author

@jbrockmendel @jreback I was able to simplify all the na checks into one (float/complex, Decimal, np.datetime64(“NaT”) and np.timedelta64(“NaT”)) - and looks like performance increased as well!

asv continuous -f 1.1 upstream/master HEAD -b ^indexing


       before           after         ratio
     [c4cce9b7]       [da4300cc]
     <master>         <nui-dt64nat-td64nat>
-      97.9±0.4ms       85.7±0.7ms     0.88  indexing.NumericSeriesIndexing.time_loc_array(<class 'pandas.core.indexes.numeric.Float64Index'>, 'nonunique_monotonic_inc')
-      99.9±0.5ms         87.4±2ms     0.87  indexing.NumericSeriesIndexing.time_getitem_array(<class 'pandas.core.indexes.numeric.Int64Index'>, 'nonunique_monotonic_inc')
-     1.41±0.03μs      1.22±0.03μs     0.87  indexing_engines.NumericEngineIndexing.time_get_loc((<class 'pandas._libs.index.Float32Engine'>, <class 'numpy.float32'>), 'monotonic_incr', True, 2000000)
-         102±1ms         83.8±1ms     0.82  indexing.NumericSeriesIndexing.time_getitem_lists(<class 'pandas.core.indexes.numeric.UInt64Index'>, 'nonunique_monotonic_inc')
-        99.3±1ms       80.7±0.7ms     0.81  indexing.NumericSeriesIndexing.time_getitem_array(<class 'pandas.core.indexes.numeric.UInt64Index'>, 'nonunique_monotonic_inc')
-         101±1ms       82.0±0.7ms     0.81  indexing.NumericSeriesIndexing.time_loc_array(<class 'pandas.core.indexes.numeric.Int64Index'>, 'nonunique_monotonic_inc')
-       101±0.9ms       82.2±0.9ms     0.81  indexing.NumericSeriesIndexing.time_loc_array(<class 'pandas.core.indexes.numeric.UInt64Index'>, 'nonunique_monotonic_inc')
-         102±2ms         81.4±1ms     0.80  indexing.NumericSeriesIndexing.time_getitem_lists(<class 'pandas.core.indexes.numeric.Int64Index'>, 'nonunique_monotonic_inc')
-      29.8±0.5ms       22.6±0.8ms     0.76  indexing.DataFrameNumericIndexing.time_loc_dups

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

@jreback jreback added this to the 1.4 milestone Oct 18, 2021
@jreback
Copy link
Contributor

jreback commented Oct 18, 2021

lgtm. @jbrockmendel good here?

@jbrockmendel
Copy link
Member

LGTM really nice work here

@alexhlim
Copy link
Member Author

fixed up some typos in the comments + adding test cases (failing on master, but this PR should resolve). I'll ping on green @jbrockmendel @jreback

@alexhlim
Copy link
Member Author

ping @jreback @jbrockmendel

@jreback jreback merged commit 2d2db8d into pandas-dev:master Oct 18, 2021
@jreback
Copy link
Contributor

jreback commented Oct 18, 2021

thanks @alexhlim really nice!

@alexhlim
Copy link
Member Author

LGTM really nice work here

your suggestion of using sets + is_matching_na really helped me out!!

@alexhlim
Copy link
Member Author

thanks for taking the time to review @jreback @jbrockmendel !

@alexhlim alexhlim deleted the nui-dt64nat-td64nat branch October 18, 2021 19:54
@MarcoGorelli MarcoGorelli mentioned this pull request Dec 8, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Indexing Related to indexing on series/frames, not to indexes themselves Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: get_indexer_non_unique does not handle np.datetime64("NaT") and np.timedelta64("NaT")
3 participants