Skip to content

PERF: Unnecessary hash table with RangeIndex #16685

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
chris-b1 opened this issue Jun 12, 2017 · 5 comments · Fixed by #27119
Closed

PERF: Unnecessary hash table with RangeIndex #16685

chris-b1 opened this issue Jun 12, 2017 · 5 comments · Fixed by #27119
Labels
Indexing Related to indexing on series/frames, not to indexes themselves Performance Memory or execution speed performance
Milestone

Comments

@chris-b1
Copy link
Contributor

Example

def log_memory():
    import os
    import gc
    import psutil
    for i in range(3):
        gc.collect(i)
    process = psutil.Process(os.getpid())
    mem_usage = process.memory_info().rss / float(2 ** 20)
    print("[Memory usage] {:12.1f} MB".format(
        mem_usage
    ))

In [20]: df = pd.DataFrame({'a': np.arange(1000000)})

In [23]: log_memory()
[Memory usage]        132.4 MB

In [24]: df.loc[5, :]
Out[24]: 
a    5
Name: 5, dtype: int32

In [25]: log_memory()
[Memory usage]        172.2 MB

Rather than materializing the hash table, should directly convert labels into positions. Low priority in my opinion, atypical to be using loc with a RangeIndex.

pandas 0.20.2

@chris-b1 chris-b1 added Difficulty Advanced Indexing Related to indexing on series/frames, not to indexes themselves Performance Memory or execution speed performance labels Jun 12, 2017
@chris-b1 chris-b1 added this to the Next Major Release milestone Jun 12, 2017
@jreback
Copy link
Contributor

jreback commented Jun 14, 2017

hmm ,this should not be creating a hash table at all (though it would if it morphed to an int64index)

@chris-b1
Copy link
Contributor Author

The _engine for RangeIndex still points at the Int64Index one, so a hash table will be created if something calls self._engine. So either need to make a simplified RangeIndexEngine or override the indexing methods.

_engine_type = libindex.Int64Engine

@jreback
Copy link
Contributor

jreback commented Jun 14, 2017

I think that _engine is purely for compat on some methods. might be easiest/best to create a pretty dummy RI Engine in cython.

@jorisvandenbossche
Copy link
Member

I think that _engine is purely for compat on some methods

_engine is still used in eg get_loc, so often created.

Related to this, I just stumbled on the fact that currenlty RangeIndex.memory_usage() is inaccurate, as it does not take into account the engine.

And given that the actual engine takes a lot more memory as the values (for int64index it is 5x more memory on an example case 500k range-like int index), a RangeIndex is not much more memory-efficient as a Int64Index (its big selling point I thought?) once the engine is created.

@chris-b1
Copy link
Contributor Author

RangeIndex is not much more memory-efficient as a Int64Index (its big selling point I thought?) once the engine is created.

This is true, but I think it would be relatively rare for the engine to populated when a RangeIndex is being used - typically it is when someone doesn't care about the index at all, so would likely be using iloc if any indexing at all.

@jreback jreback modified the milestones: Contributions Welcome, 0.25.0 Jun 30, 2019
jreback pushed a commit that referenced this issue Jun 30, 2019
…dex (#27119)

* TST: actually test #16877 on numeric index (not just RangeIndex)

* PERF: do not instantiate IndexEngine for standard lookup over RangeIndex

closes #16685
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Indexing Related to indexing on series/frames, not to indexes themselves Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants