Skip to content

BENCH: Index.take benchmark is measuring wrong thing #18000

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
jorisvandenbossche opened this issue Oct 27, 2017 · 5 comments · Fixed by #36656
Closed

BENCH: Index.take benchmark is measuring wrong thing #18000

jorisvandenbossche opened this issue Oct 27, 2017 · 5 comments · Fixed by #36656
Assignees
Labels
Benchmark Performance (ASV) benchmarks good first issue
Milestone

Comments

@jorisvandenbossche
Copy link
Member

The take benchmarks indexing.IndexingMethods.time_take_intindex is benchmarking take on a boolean list, which take interprets as a [0, 1, 0, 1, ...] values, which is a bit silly to benchmark.
We should add some actual benchmark on integers instead.

@sangramga
Copy link

sangramga commented Oct 30, 2017

@jorisvandenbossche I would like to make this issue to be my first contribution. I have looked at the source code and ready to take it up.
Although I have unable to see as to why and how a Boolean list self.indexer is interpreted as integer [0,1,0,1,...] values?
Is it possible to resolve this issue by creating a new indexer list like self.int_indexer for benchmarking? What do you have in mind?

@jorisvandenbossche
Copy link
Member Author

Is it possible to resolve this issue by creating a new indexer list like self.int_indexer for benchmarking? What do you have in mind?

Yes, although I think you can just replace the existing self.indexer with a new list of ints instead of bools (we don't need to keep the old bench of using booleans).

Although I have unable to see as to why and how a Boolean list self.indexer is interpreted as integer [0,1,0,1,...] values?

If you try to run the code (or similar), you will see:

In [84]: s = pd.Series(range(5))

In [85]: s.take([True, False, True, False, True])
Out[85]: 
1    1
0    0
1    1
0    0
1    1
dtype: int64

In [86]: s.loc[[True, False, True, False, True]]
Out[86]: 
0    0
2    2
4    4
dtype: int64

so take interprets the True, False as 0, 1 and not as a mask (like is done in loc)

@sangramga
Copy link

sangramga commented Nov 16, 2017

@jorisvandenbossche I made the following changes in indexing.IndexingMethods.time_take_intindex

Replaced the boolean list indexer with randomized integer list
self.indexer = list(np.random.randint(0, 100000, size=100000))
Does this resolve the issue?

Also where should I add Tests? Should I add/modify any tests?

@hardikpnsp
Copy link
Contributor

It seems like the issue has been open for quite a while. I looked into the take benchmark, It is still using boolean values as indexer. Does not seem like anyone is working on the issue, I will quickly fix that.

@hardikpnsp
Copy link
Contributor

take

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Benchmark Performance (ASV) benchmarks good first issue
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants