DOC: existence docs and benchmarks. #7398
Can you show the rendered version at the top of this PR?
I've included an image capture of the rendered content at the top. I may need to change some of the inline code from using |
l = ['a', 'b', 'c']
l_dict = dict(zip(l, l))

df[df.index.isin(l)]  # isin_list
I would make this several separate blocks (one for each of the functions that you are comparing), to make it flow a bit better.
There are a number of different ways to test for existence using pandas. The
following methods can be used to achieve an existence test. The comments correspond
to the legend in the plots further down.
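As a rough illustration of what "existence test" means here, the following is a minimal sketch, not the code from the PR. It assumes hypothetical data (`df`, `wanted`) and keeps only the rows whose index label appears in a list, once with `Index.isin` and once with `DataFrame.query`.

```python
import pandas as pd

# Hypothetical data: labels a..e as the index, one value column.
df = pd.DataFrame({"value": range(5)}, index=list("abcde"))
wanted = ["a", "c", "x"]  # "x" is deliberately absent from the index

# Boolean mask from Index.isin: True where the index label is in `wanted`.
by_isin = df[df.index.isin(wanted)]

# The same test through DataFrame.query; an unnamed index is addressed
# as `index`, and `@wanted` refers to the local variable.
by_query = df.query("index in @wanted")

print(by_isin)
```

Both expressions select the rows labelled `a` and `c`; the missing label `x` is simply ignored.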
Can you state a bit more precisely what you understand by 'existence', both in general and in practice (the exact operation you want to test in the examples)? As it stands, it is not fully clear to me (thinking of myself as a not so experienced user) what you actually do in the different examples.
@jorisvandenbossche, you're right! For some reason it only works with a
It seems to be comparing both the value of the index and the value of column for the integer case, so for the index comparison
But for columns, it's checking the index value and the column value, so the following works because 1 index has a 3 value in both cases.
That's rather confusing. I wonder if this behavior is intentional? |
Ah, I overlooked those docs you mentioned. And they indeed state that using When using a series to compare equality ( In any case, I am not really familiar with using |
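The exact expressions discussed above did not survive in this thread, so the following is only a related illustration and not a reconstruction of them. It sketches one documented case where a Series is matched on its index as well as its values: `DataFrame.isin` with a Series argument. The data (`df`, `values`) is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3]})   # index 0, 1, 2
values = pd.Series([3, 2, 1])         # index 0, 1, 2

# With a list, each cell is tested for membership in the whole collection.
in_list = df.isin([3, 2, 1])

# With a Series, the Series is aligned on the index first, so the cell in
# row i is compared only against values[i].
in_series = df.isin(values)

print(in_list["A"].tolist())
print(in_series["A"].tolist())
```

Only the row where the index label and the value line up (index 1, value 2) matches in the Series case, which is the kind of result that can look like a bug when a plain membership test was expected.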
I'll definitely add some clarification to the |
@chancyk i can take a look at the query stuff, there are some oddities with |
@chancyk how's this coming? |
@jreback, I've removed the references to |
I will try to give this a last review tomorrow. @chancyk you didn't include the figures in the source, so this will not build on another computer. You have to add them to the commit as well; another option is to include the code to generate them in the file? (as eg here with |
@jorisvandenbossche, the code to generate those plots is included as bench_existence.py but it takes a little while to run. I'll push those plot images into the repository tonight. |
@jorisvandenbossche, the plot images are now included in the doc/source/_static folder. |
df[[x in dct for x in df.index]]

# isin_series, query_in Series, pydict,
# join and isin_list are included in the plots below.
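For readers without the full diff, here is a minimal sketch of what the `pydict` and `join` variants named in the legend might look like. The data and the `lookup` frame are assumptions made for illustration; the PR's actual benchmark code may differ.

```python
import pandas as pd

# Hypothetical data; "x" does not occur in the index.
df = pd.DataFrame({"value": range(5)}, index=list("abcde"))
keys = ["a", "c", "x"]
dct = dict(zip(keys, keys))

# pydict: one dictionary lookup per index label, collected into a mask.
by_dict = df[[x in dct for x in df.index]]

# join: inner-join against a small frame indexed by the keys, then keep
# only the original columns. `present` is a throwaway helper column.
lookup = pd.DataFrame({"present": True}, index=keys)
by_join = df.join(lookup, how="inner")[df.columns]

print(by_dict)
```

Both variants act on the index of `df`, so they answer the question "which index labels exist in `keys`", not "which column values exist in `keys`".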
I think this comment can be removed?
@chancyk I added some more comments on the actual examples you provided. In general I think some of the examples work only 'by accident' because the example data is a bit strange (in the sense that you have exactly the same data in the series/dataframe and in the column/index to compare).
@jorisvandenbossche, thanks for the feedback! You're right, some of those aren't equivalent conceptually. I'll add some clarification about when things are acting on the index, since that can only be taken advantage of in special cases, whereas some of the other examples are more general.
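A small sketch of the distinction, using hypothetical data in which the index and the column deliberately hold different values, so that an index-based test and a column-based test cannot agree by accident:

```python
import pandas as pd

# Default integer index 0..3; the keys live in a regular column.
df = pd.DataFrame({"key": ["c", "a", "b", "d"], "value": [10, 20, 30, 40]})
wanted = ["a", "b"]

# Index-based test: nothing matches, because the index holds 0..3.
on_index = df[df.index.isin(wanted)]

# Column-based test: the general case, works on any column.
on_column = df[df["key"].isin(wanted)]

# The index-based form only applies once the keys are the index.
indexed = df.set_index("key")
via_index = indexed[indexed.index.isin(wanted)]

print(on_column)
```

The index-based variants are only an option when the lookup keys already are (or can cheaply become) the index.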
@chancyk can you refresh this? @jorisvandenbossche, status on this?
@chancyk I think this would be a really valuable addition to the docs, but there are some comments raised above that should be first addressed. Do you have some time for that? |
I should be able to update this in the next couple of days. I got diverted from writing any code for a while there, but I just finished up another example I would like to include, which is collapsing a DataFrame with groupby, including additional columns to the new DataFrame, and then joining it back to the original. |
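The groupby example mentioned here is not part of this thread, so the following is only a guess at its general shape: collapse to one row per group, add derived columns, and join the result back to the original rows. The column names are hypothetical, and named aggregation assumes a reasonably recent pandas.

```python
import pandas as pd

df = pd.DataFrame({
    "group": ["x", "x", "y", "y", "y"],
    "amount": [1, 2, 3, 4, 5],
})

# Collapse to one row per group, with two derived columns.
summary = df.groupby("group").agg(
    total=("amount", "sum"),
    count=("amount", "count"),
)

# Join the per-group columns back onto every original row, matching the
# `group` column of `df` against the index of `summary`.
joined = df.join(summary, on="group")

print(joined)
```

`DataFrame.join(..., on="group")` keeps the original row order and index, which makes it convenient for attaching group-level values to row-level data.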
I'm getting some interesting results when comparing sorted vs random, the On Thu, Sep 4, 2014 at 2:39 AM, Joris Van den Bossche <
|
how's this coming? |
I've rewritten the benchmark routine to use random data with various Each timed function now outputs the exact same DataFrame, which is the On Sun, Sep 14, 2014 at 12:20 PM, jreback [email protected] wrote:
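A minimal sketch of the approach described in this comment: random data, candidates checked to return the exact same DataFrame, then timed. This is not the contents of `bench_existence.py`; the sizes, the function names, and the use of `timeit` are assumptions.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# Random values on a shuffled (unsorted) integer index.
df = pd.DataFrame({"value": rng.random(n)}, index=rng.permutation(n))
keys = rng.choice(n, size=n // 10, replace=False).tolist()
key_set = set(keys)


def isin_list():
    return df[df.index.isin(keys)]


def pyset():
    return df[[x in key_set for x in df.index]]


# Confirm both candidates return the exact same DataFrame before timing.
assert isin_list().equals(pyset())

for func in (isin_list, pyset):
    seconds = min(timeit.repeat(func, number=10, repeat=3))
    print(f"{func.__name__}: {seconds / 10:.6f} s per call")
```

Checking equality first guards against comparing timings of functions that silently compute different things, which was the concern raised earlier in the review.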
|
@chancyk can you post an updated version? |
@chancyk can you update / revisit? |
@chancyk can you revisit this? |
@chancyk can you revisit? |
@chancyk want to update? |
@chancyk I think this would be nice to include. pls reopen if you want to update. |
I've included some documentation on existence-type associations as requested by Jeff in this question over at Stack Overflow.
The content includes a couple of plots generated by the script bench/bench_existence.py but I was not able to find any automation that executed those bench/bench_*.py scripts, so I may be missing a hook somewhere. That script also takes a couple of minutes to run.