PERF: improve get_loc on unsorted, non-unique indexes #19539


Merged 2 commits on May 8, 2018

toobaz
Member

@toobaz toobaz commented Feb 5, 2018

asv run:

       before           after         ratio
     [bd4332f4]       [9ac7be34]
-         475±5μs          422±7μs     0.89  multiindex_object.Duplicates.time_remove_unused_levels
-      15.4±0.3ms       13.5±0.2ms     0.88  multiindex_object.GetLoc.time_small_get_loc_warm
-         197±5ns          171±2ns     0.87  index_object.Datetime.time_is_dates_only
-      8.68±0.3μs       6.88±0.1μs     0.79  index_object.Float64IndexMethod.time_get_loc
-          15.4μs       12.0±0.2μs     0.78  index_object.Indexing.time_get_loc_sorted('Float')
-         283±9μs          215±2μs     0.76  multiindex_object.Values.time_datetime_level_values_sliced
-          15.0μs           11.1μs     0.74  index_object.Indexing.time_get_loc('Float')
-     7.38±0.04μs      5.03±0.05μs     0.68  index_object.Indexing.time_get_loc_sorted('Int')
-     7.15±0.04μs      4.58±0.04μs     0.64  index_object.Indexing.time_get_loc('Int')
-       121±0.9ms           1.62ms     0.01  index_object.Indexing.time_get_loc_non_unique_sorted('Float')
-           136ms           1.51ms     0.01  index_object.Indexing.time_get_loc_non_unique('Float')
-         166±1ms      1.29±0.03ms     0.01  index_object.Indexing.time_get_loc_non_unique_sorted('Int')
-       175±0.9ms      1.29±0.01ms     0.01  index_object.Indexing.time_get_loc_non_unique('Int')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
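
For readers unfamiliar with the cases these benchmarks exercise, here is a minimal sketch of what `Index.get_loc` returns for unique, sorted non-unique, and unsorted non-unique indexes. The toy indexes and variable names below are assumptions made for illustration. They are not the data or code used by the asv suite, and the timing line is only a rough local check, not a reproduction of the numbers above.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical toy indexes, chosen only to illustrate the three cases.
unique_idx = pd.Index([10, 20, 30])
sorted_dups = pd.Index([10, 20, 20, 30])
unsorted_dups = pd.Index([20, 10, 20, 30])

loc_unique = unique_idx.get_loc(20)       # a single integer position
loc_sorted = sorted_dups.get_loc(20)      # a slice covering the contiguous run
loc_unsorted = unsorted_dups.get_loc(20)  # a boolean mask, one entry per label

print(loc_unique)
print(loc_sorted)
print(loc_unsorted)

# A larger unsorted, non-unique index: labels 0..999 repeated 100 times.
big_unsorted = pd.Index(np.tile(np.arange(1000), 100))
mask = big_unsorted.get_loc(500)
print(int(mask.sum()))  # number of positions holding the label 500

# Rough timing of the unsorted, non-unique lookup on this machine.
elapsed = timeit.timeit(lambda: big_unsorted.get_loc(500), number=10)
print(f"10 lookups took {elapsed:.4f}s")
```

The unsorted, non-unique case is the one whose timings move the most in the table above, presumably because the result has to describe every matching position rather than one integer or one slice.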

@toobaz toobaz force-pushed the get_loc_dups_unsorted_19478 branch from a383697 to cf36911 Compare February 5, 2018 20:23
@codecov

codecov bot commented Feb 5, 2018

Codecov Report

Merging #19539 into master will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master   #19539   +/-   ##
=======================================
  Coverage   91.81%   91.81%           
=======================================
  Files         153      153           
  Lines       49481    49481           
=======================================
  Hits        45430    45430           
  Misses       4051     4051
Flag Coverage Δ
#multiple 90.21% <ø> (ø) ⬆️
#single 41.85% <ø> (ø) ⬆️

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update bd4332f...8e3d4d0.

@jreback
Contributor

jreback commented Feb 6, 2018

This seems to be failing on Python 2.7 on Windows, with a gazillion warnings:

 C:\projects\pandas\pandas\core\indexes\base.py:1726: DeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
    return getitem(key)
  (the same warning is repeated four more times)

@jreback jreback added the Indexing (Related to indexing on series/frames, not to indexes themselves) and Performance (Memory or execution speed performance) labels Feb 6, 2018
@jreback
Contributor

jreback commented Feb 6, 2018

Also, can you see if this closes other issues as well (there are a couple of references in the OP)?

@toobaz toobaz force-pushed the get_loc_dups_unsorted_19478 branch from cf36911 to da08bb8 Compare May 5, 2018 14:25
@toobaz
Member Author

toobaz commented May 7, 2018

@jreback : fixed (wasn't that hard!), ping.

(I guess it shouldn't harm to include in 0.23.0)

@jreback jreback added this to the 0.23.0 milestone May 7, 2018
@jreback
Contributor

jreback commented May 7, 2018

@toobaz lgtm. Can you just update the asv results at the top with the latest code?

@toobaz
Member Author

toobaz commented May 7, 2018

"Can you just update the asv results at the top with the latest code?"

Done.

@jreback jreback merged commit da96244 into pandas-dev:master May 8, 2018
@jreback
Contributor

jreback commented May 8, 2018

thanks @toobaz

@toobaz toobaz deleted the get_loc_dups_unsorted_19478 branch May 8, 2018 05:43
jreback added a commit to jreback/pandas that referenced this pull request May 9, 2018
jreback added a commit to jreback/pandas that referenced this pull request May 10, 2018
jreback added a commit that referenced this pull request May 10, 2018
topper-123 pushed a commit to topper-123/pandas that referenced this pull request May 13, 2018
Successfully merging this pull request may close these issues.

Int64Index.get_loc() is very slow on unsorted, non-unique index