
DOC: existence docs and benchmarks. #7398


Closed
wants to merge 4 commits into from

@chancyk

chancyk commented Jun 9, 2014

[image: enhancing_performance]

I've included some documentation on existence-type associations as requested by Jeff in this question over at Stack Overflow.

The content includes a couple of plots generated by the script bench/bench_existence.py but I was not able to find any automation that executed those bench/bench_*.py scripts, so I may be missing a hook somewhere. That script also takes a couple of minutes to run.

@jreback
Contributor

jreback commented Jun 9, 2014

can you show the rendered version at the top of this PR?

@jreback jreback added the Docs label Jun 9, 2014
@jreback jreback added this to the 0.14.1 milestone Jun 9, 2014
@chancyk
Author

chancyk commented Jun 10, 2014

I've included an image capture of the rendered content at the top. I may need to change some of the inline code from using .. ipython:: python so that it's not being executed.

l = ['a', 'b', 'c']
l_dict = dict(zip(l, l))

df[df.index.isin(l)] # isin_list
Contributor

I would make this several separate blocks (for each of the functions that you are comparing), so make it flow a bit better

@chancyk
Author

chancyk commented Jun 11, 2014

Thanks, I've made some changes with the recommendations! Render below:

[image: enhancing_performance_2]


There are a number of different ways to test for existence using pandas. The
following methods can be used to achieve an existence test. The comments correspond
to the legend in the plots further down.
Member

Can you state a bit more what you exactly understand by 'existence'? In general + in practice the exact operation you want to test in the examples. As it is not fully clear to me (thinking of myself as a not so experienced user) what you actually do in the different examples.
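
For readers following the thread, one reading of "existence" here is: keep the rows of a DataFrame whose key (a column value or an index label) also appears in some other collection. The sketch below illustrates that reading with made-up data; the names `wanted_keys` and `wanted_index` are hypothetical and do not come from the PR.

```python
import pandas as pd

# Hypothetical data: each row has a key column and an integer index label.
df = pd.DataFrame(
    {"key": ["a", "b", "c", "d", "e"], "value": [1, 2, 3, 4, 5]},
    index=[10, 11, 12, 13, 14],
)

wanted_keys = ["a", "c", "z"]   # "z" does not exist in df
wanted_index = [11, 14, 99]     # 99 does not exist in df

# Existence test on a column: keep rows whose key appears in wanted_keys.
by_column = df[df["key"].isin(wanted_keys)]

# Existence test on the index: keep rows whose label appears in wanted_index.
by_index = df[df.index.isin(wanted_index)]

print(by_column)
print(by_index)
```

Values that are requested but absent ("z" and 99) are silently ignored, which is what distinguishes this kind of test from a label lookup with `.loc`.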

@chancyk
Author

chancyk commented Jun 11, 2014

@jorisvandenbossche, you're right! == displays some strange behavior. Apparently I was testing one strange use case where it actually works. I tacked it on after seeing this documentation but I guess I misinterpreted it.

For some reason it only works with a Series when the index and the values of the series are the same and both the index and the values are integers (???) ... see these results:

In [114]: s = pd.Series([2,12,4], index=[2,3,4])

In [115]: df
Out[115]:
    a  b  c  d
0   a  a  3  1
1   a  a  3  8
2   b  a  0  3
3   b  a  1  0
4   c  b  3  4
5   c  b  2  5
6   d  b  0  3
7   d  b  0  4
8   e  c  3  2
9   e  c  1  6
10  f  c  0  1
11  f  c  1  5

In [116]: s
Out[116]:
2     2
3    12
4     4
dtype: int64

In [117]: df.query('index == @s')
Out[117]:
   a  b  c  d
2  b  a  0  3
4  c  b  3  4

In [118]: s = pd.Series([2,3,4], index=[2,3,4])

In [119]: df.query('index == @s')
Out[119]:
   a  b  c  d
2  b  a  0  3
3  b  a  1  0
4  c  b  3  4

In [120]: s = pd.Series(['a', 'b', 'c'], index=[2,3,4])

# this throws -> ValueError: Series lengths must match to compare
In [121]: df.query('index == @s')

In the integer case it seems to compare both the index label and the value, so for the index comparison, df.query('index == @s') expects a Series whose index and values coincide, something like:

2    2
3    3
4    4
dtype: int64

For columns, though, it checks both the index label and the column value, so the following works because the row at index 1 has the value 3 in both c and s.


In [178]: s = pd.Series([2,3,4])

In [179]: s
Out[179]:
0    2
1    3
2    4
dtype: int64

In [180]: df.query('c == @s')
Out[180]:
   a  b  c  d
1  a  a  3  8

In [182]: df
Out[182]:
    a  b  c  d
2   b  a  0  3
6   d  b  0  3
7   d  b  0  4
10  f  c  0  1
3   b  a  1  0
9   e  c  1  6
11  f  c  1  5
5   c  b  2  5
0   a  a  3  1
1   a  a  3  8
4   c  b  3  4
8   e  c  3  2

That's rather confusing. I wonder if this behavior is intentional?

@jorisvandenbossche
Member

Ah, I overlooked those docs you mentioned. They do indeed state that using == with a list is equivalent to in (http://pandas.pydata.org/pandas-docs/stable/indexing.html#special-use-of-the-operator-with-list-objects). But in your PR you also use it with a Series, which would not be the same (so df.query('index == @lst') is OK, but df.query('index == @series') does something else, IIUC).

When using a series to compare equality ('col == @series'), the values are aligned (in the case of comparing 2 series (or column with series) this is obvious). But it is not so clear to me what happens when comparing with an index ('index == @series'). Does that behave like a list or like a series? But how to align the index? (using itself as the index to align with?)

In any case, I am not really familiar with using query, but also for that reason I think you should try to explain a little bit more what you are doing in the different cases (for other people like me), or link to the appropriate doc sections about it.
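
The difference between label-aligned equality and membership can be sketched outside of query. The example below is an illustration only: it uses `Series.eq` and `Series.isin` on the `c` column shown earlier in the thread, and it makes no claim about what the query parser does internally.

```python
import pandas as pd

# The "c" column from the DataFrame shown earlier in the thread.
df = pd.DataFrame({"c": [3, 3, 0, 1, 3, 2, 0, 0, 3, 1, 0, 1]})
s = pd.Series([2, 3, 4])  # default index 0, 1, 2

# Label-aligned equality: the value at label i in df["c"] is compared
# with s[i]; labels that are missing from s compare as False.
aligned = df[df["c"].eq(s)]

# Membership: does each value of df["c"] occur anywhere among the values of s?
membership = df[df["c"].isin(s)]

print(aligned.index.tolist())
print(membership.index.tolist())
```

Only the row at label 1 survives the aligned comparison, which matches the `df.query('c == @s')` output above, while the membership test keeps every row whose value is 2, 3 or 4.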

@chancyk
Author

chancyk commented Jun 11, 2014

I'll definitely add some clarification to the query usage. Thanks for double-checking that.

@cpcloud
Member

cpcloud commented Jun 11, 2014

@chancyk i can take a look at the query stuff, there are some oddities with ==, in and unequal length objects (we do this to support HDFStore's ability to compare unequal length objects). You may have discovered a bug :-)

@jreback
Contributor

jreback commented Jun 17, 2014

@chancyk how's this coming?

@chancyk
Author

chancyk commented Jun 19, 2014

@jreback, I've removed the references to query('.. == ..') until that behavior can be verified. I left the query('col == @list') since that's documented elsewhere as having the correct behavior. A new render is included below:

[image: enhancing_performance_3]
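
As a small illustration of the list case that was kept, the sketch below compares `==` against a list, `in`, and `Series.isin` on made-up data. The column names and the variable `lst` are hypothetical; the equivalence of `==` and `in` for lists is the behaviour described in the indexing docs linked earlier in the thread.

```python
import pandas as pd

# Hypothetical data with a repeated key column.
df = pd.DataFrame({"b": list("aabbcc"), "d": [1, 8, 3, 0, 4, 5]})
lst = ["a", "c"]

# Inside query, comparing a column to a list with == is treated as membership.
with_eq = df.query("b == @lst")
with_in = df.query("b in @lst")

# The same selection written with a boolean mask.
with_isin = df[df["b"].isin(lst)]

print(with_eq)
```

All three selections should return the same rows; the Series case discussed above is deliberately left out.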

@jreback
Contributor

jreback commented Jun 22, 2014

@jorisvandenbossche ?

@jorisvandenbossche
Member

I will try to give this a last review tomorrow.

@chancyk you didn't add the figures to the source tree, so the docs will not build on another computer. You either have to add them to the commit, or include the code that generates them in the file (as e.g. here with @savefig).

@chancyk
Author

chancyk commented Jun 23, 2014

@jorisvandenbossche, the code to generate those plots is included as bench_existence.py but it takes a little while to run. I'll push those plot images into the repository tonight.

@chancyk
Author

chancyk commented Jun 24, 2014

@jorisvandenbossche, the plot images are now included in the doc/source/_static folder.

df[[x in dct for x in df.index]]

# isin_series, query_in Series, pydict,
# join and isin_list are included in the plots below.
Member

I think this comment can be removed?

@jorisvandenbossche
Member

@chancyk I added some more comments on the actual examples you provided. In general I think some of the example work only 'by accident' because the example data is a bit strange (in the sense that you have exactly the same data in the series/dataframe and in the column/index to compare)

@chancyk
Author

chancyk commented Jun 26, 2014

@jorisvandenbossche, thanks for the feedback! You're right, some of those aren't conceptually equivalent. I'll add some clarification about when an example acts on the index, since that can only be exploited in special cases, whereas some of the other examples are more general.
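
The index-based versus column-based distinction might look like the sketch below. The frames `left` and `right` are made up for illustration, and the three approaches are assumptions about what "acting on the index" means here rather than code from the PR.

```python
import pandas as pd

# Hypothetical frames that both carry the key in the index.
left = pd.DataFrame({"value": [1, 2, 3, 4]}, index=["a", "b", "c", "d"])
right = pd.DataFrame({"other": [10, 20]}, index=["b", "d"])

# Index-based: only possible because the key is the index of both frames.
via_intersection = left.loc[left.index.intersection(right.index)]
via_join = left.join(right, how="inner")[left.columns]

# Column-based: the more general form, usable when the key is an ordinary column.
left_col = left.rename_axis("key").reset_index()
via_isin = left_col[left_col["key"].isin(right.index)]

print(via_join)
print(via_isin)
```

The index-based forms depend on how the data happens to be laid out, while the `isin` form works for any column.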

@jreback
Contributor

jreback commented Sep 4, 2014

@chancyk can you refresh this

@jorisvandenbossche status on this?

@jorisvandenbossche
Member

@chancyk I think this would be a really valuable addition to the docs, but there are some comments raised above that should be first addressed. Do you have some time for that?

@chancyk
Author

chancyk commented Sep 4, 2014

@jorisvandenbossche @jreback

I should be able to update this in the next couple of days. I got diverted from writing any code for a while there, but I just finished up another example I would like to include, which is collapsing a DataFrame with groupby, including additional columns to the new DataFrame, and then joining it back to the original.
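
The groupby example mentioned here was never posted in the thread, so the following is only a guess at its general shape: collapse with groupby, add a column to the collapsed frame, and join it back to the original rows. All column names are hypothetical.

```python
import pandas as pd

# Hypothetical data: several rows per group.
df = pd.DataFrame(
    {"group": ["x", "x", "y", "y", "y"], "amount": [1.0, 3.0, 2.0, 4.0, 6.0]}
)

# Collapse to one row per group using named aggregation.
summary = df.groupby("group")["amount"].agg(total="sum", mean="mean")

# Add a further column to the collapsed frame (aligned on the group index).
summary["count"] = df.groupby("group").size()

# Join the per-group values back onto every original row.
combined = df.join(summary, on="group")

print(combined)
```

`DataFrame.join` with `on="group"` matches the `group` column of `df` against the index of `summary`, so the row count and order of the original are preserved.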

@chancyk
Author

chancyk commented Sep 7, 2014

I'm getting some interesting results when comparing sorted vs. random data and
when varying the number of overlapping IDs between datasets. I'll post a
visualization soon so it's a little easier to digest. @jorisvandenbossche's
concerns definitely brought up some interesting things.


@jreback
Contributor

jreback commented Sep 14, 2014

how's this coming?

@chancyk
Author

chancyk commented Sep 15, 2014

I've rewritten the benchmark routine to use random data with various
percentages of overlap between the two datasets. I also generated some
plots where one or both sets of data have been pre-sorted. The raw output is
one very cluttered line plot, so I split it into several plots
for legibility. The latest ones can be found here:

http://imgur.com/a/w8Zyh

Each timed function now outputs exactly the same DataFrame: the
matched elements, sorted and with no duplicates. Once I enforced this, the
benefits of the inner join went away. I'm still digesting it myself. It's
best to look at the bench_existence.py script to see what I'm doing. I
haven't updated the Sphinx doc yet.
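
A benchmark of the kind described could be structured roughly as follows. This is not the contents of bench_existence.py, which is not reproduced in the thread; the helpers `make_ids`, `match_isin` and `match_merge` are hypothetical names, and the sizes and overlap fractions are arbitrary.

```python
import timeit

import numpy as np
import pandas as pd


def make_ids(n, overlap, seed=0):
    """Return two shuffled ID arrays of length n sharing a fraction `overlap` of IDs."""
    rng = np.random.default_rng(seed)
    n_shared = round(n * overlap)
    left = np.arange(n)
    # Shared IDs come from the start of `left`; the rest lie outside its range.
    right = np.concatenate([left[:n_shared], np.arange(n, 2 * n - n_shared)])
    return rng.permutation(left), rng.permutation(right)


def normalize(frame):
    """Force every method to return the same shape: unique IDs, sorted."""
    return frame.drop_duplicates("id").sort_values("id").reset_index(drop=True)


def match_isin(df, ids):
    return normalize(df[df["id"].isin(ids)])


def match_merge(df, ids):
    keys = pd.DataFrame({"id": pd.unique(ids)})
    return normalize(df.merge(keys, on="id", how="inner"))


results = {}
for overlap in (0.1, 0.5, 0.9):
    left_ids, right_ids = make_ids(10_000, overlap)
    df = pd.DataFrame({"id": left_ids, "value": np.arange(len(left_ids))})
    # Timings are only comparable if both methods return the same frame.
    assert match_isin(df, right_ids).equals(match_merge(df, right_ids))
    results[overlap] = {
        "isin": timeit.timeit(lambda: match_isin(df, right_ids), number=5),
        "merge": timeit.timeit(lambda: match_merge(df, right_ids), number=5),
    }

for overlap, timings in results.items():
    print(overlap, timings)
```

Checking that both methods return identical frames before timing them mirrors the point made above: once the outputs are forced to match, the cost of deduplicating and sorting is charged to every method.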


@jreback
Contributor

jreback commented Sep 26, 2014

@chancyk can you post an updated version?

@jreback jreback modified the milestones: 0.15.1, 0.15.0 Oct 2, 2014
@jreback
Contributor

jreback commented Jan 18, 2015

@chancyk can you update / revisit?

@jreback
Contributor

jreback commented Mar 5, 2015

@chancyk can you revisit this?

@jreback jreback modified the milestones: 0.16.1, 0.16.0 Mar 5, 2015
@jreback
Contributor

jreback commented Apr 8, 2015

@chancyk can you revisit?

@jreback jreback modified the milestones: 0.17.0, 0.16.1 Apr 28, 2015
@jreback
Contributor

jreback commented Apr 28, 2015

@chancyk want to update?

@jreback
Contributor

jreback commented Jul 12, 2015

@chancyk I think this would be nice to include. pls reopen if you want to update.

@jreback jreback closed this Jul 12, 2015

4 participants