DOC: existence docs and benchmarks. #7398

chancyk · 2014-06-09T03:25:04Z

I've included some documentation on existence-type associations as requested by Jeff in this question over at Stack Overflow.

The content includes a couple of plots generated by the script bench/bench_existence.py but I was not able to find any automation that executed those bench/bench_*.py scripts, so I may be missing a hook somewhere. That script also takes a couple of minutes to run.

jreback · 2014-06-09T11:23:07Z

can you show the rendered version at the top of this PR

chancyk · 2014-06-10T00:18:56Z

I've included an image capture of the rendered content at the top. I may need to change some of the inline code from using .. ipython:: python so that it's not being executed.

jreback · 2014-06-10T12:53:18Z

doc/source/enhancingperf.rst

+    l = ['a', 'b', 'c']
+    l_dict = dict(zip(l, l))
+
+    df[df.index.isin(l)] # isin_list


I would make this several separate blocks (one for each of the functions that you are comparing), to make it flow a bit better
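
As a rough illustration of what separate per-method blocks could look like, here is a minimal sketch. The frame `df` and its contents are made up for this example; only the `isin` expression and the dict-membership comprehension appear in the diff under review.

```python
import pandas as pd

# Hypothetical data: a small frame indexed by string keys.
df = pd.DataFrame({"value": [1, 2, 3, 4]}, index=["a", "b", "x", "c"])

l = ["a", "b", "c"]
l_dict = dict(zip(l, l))

# isin_list: vectorised membership test on the index.
isin_list = df[df.index.isin(l)]

# pydict: element-by-element membership test against the dict's keys.
pydict = df[[x in l_dict for x in df.index]]

print(isin_list)
print(pydict)
```

Both expressions build a boolean mask over the index, so on this toy data they select the same rows; the benchmarks in the PR are about how the cost of building that mask differs.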

chancyk · 2014-06-11T06:43:12Z

Thanks, I've made some changes with the recommendations! Render below:

jorisvandenbossche · 2014-06-11T11:12:31Z

doc/source/enhancingperf.rst

+
+There are a number of different ways to test for existence using pandas. The 
+following methods can be used to achieve an existence test. The comments correspond
+to the legend in the plots further down.


Can you state a bit more precisely what you understand by 'existence'? Both in general and, in practice, the exact operation you want to test in the examples. It is not fully clear to me (thinking of myself as a not so experienced user) what you actually do in the different examples.

chancyk · 2014-06-11T13:46:50Z

@jorisvandenbossche, you're right! == displays some strange behavior. Apparently I was testing one strange use case where it actually works. I tacked it on after seeing this documentation but I guess I misinterpreted it.

For some reason it only works with a Series when the index and the values of the series are the same and both the index and the values are integers (???) ... see these results:

In [114]: s = pd.Series([2,12,4], index=[2,3,4])

In [115]: df
Out[115]:
    a  b  c  d
0   a  a  3  1
1   a  a  3  8
2   b  a  0  3
3   b  a  1  0
4   c  b  3  4
5   c  b  2  5
6   d  b  0  3
7   d  b  0  4
8   e  c  3  2
9   e  c  1  6
10  f  c  0  1
11  f  c  1  5

In [116]: s
Out[116]:
2     2
3    12
4     4
dtype: int64

In [117]: df.query('index == @s')
Out[117]:
   a  b  c  d
2  b  a  0  3
4  c  b  3  4

In [118]: s = pd.Series([2,3,4], index=[2,3,4])

In [119]: df.query('index == @s')
Out[119]:
   a  b  c  d
2  b  a  0  3
3  b  a  1  0
4  c  b  3  4

In [120]: s = pd.Series(['a', 'b', 'c'], index=[2,3,4])

# this throws -> ValueError: Series lengths must match to compare
In [121]: df.query('index == @s')

It seems to be comparing both the value of the index and the value of column for the integer case, so for the index comparison df.query('index == @s') is expecting something like:

2    2
3    3
4    4
dtype: int64

But for columns, it's checking both the index label and the column value, so the following returns a row only because index 1 holds the value 3 in both s and column c.


In [178]: s = pd.Series([2,3,4])

In [179]: s
Out[179]:
0    2
1    3
2    4
dtype: int64

In [180]: df.query('c == @s')
Out[180]:
   a  b  c  d
1  a  a  3  8

In [182]: df
Out[182]:
    a  b  c  d
2   b  a  0  3
6   d  b  0  3
7   d  b  0  4
10  f  c  0  1
3   b  a  1  0
9   e  c  1  6
11  f  c  1  5
5   c  b  2  5
0   a  a  3  1
1   a  a  3  8
4   c  b  3  4
8   e  c  3  2

That's rather confusing. I wonder if this behavior is intentional?

jorisvandenbossche · 2014-06-11T14:01:07Z

Ah, I overlooked those docs you mentioned. And they indeed state that using == is equivalent to in with lists (http://pandas.pydata.org/pandas-docs/stable/indexing.html#special-use-of-the-operator-with-list-objects). But in your PR you also use it with series, which would not be the same then (so df.query('index == @lst') is ok, but df.query('index == @series') does something else IIUC).

When using a series to compare equality ('col == @series'), the values are aligned (in the case of comparing 2 series (or column with series) this is obvious). But it is not so clear to me what happens when comparing with an index ('index == @series'). Does that behave like a list or like a series? But how to align the index? (using itself as the index to align with?)
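
The difference between aligned equality and membership can be sketched in plain pandas, without `query`. This is only an illustration of the distinction described above, not a statement about what `query` does internally; the data is hypothetical, chosen to echo the `c == @s` result earlier in the thread.

```python
import pandas as pd

df = pd.DataFrame({"c": [3, 3, 0, 1, 3]})  # default index 0..4
s = pd.Series([2, 3, 4])                   # default index 0..2

# Label-aligned equality: row i matches only if s has a label i
# and s[i] equals c[i]. Labels missing from s become NaN and never match.
aligned = df["c"] == s.reindex(df.index)

# Membership: row i matches if c[i] appears anywhere in s,
# regardless of labels.
member = df["c"].isin(s)

print(df[aligned])
print(df[member])
```

With these values the aligned comparison keeps only the row labelled 1, while the membership test keeps every row whose `c` value is 2, 3 or 4.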

In any case, I am not really familiar with using query, but also for that reason I think you should try to explain a little bit more what you are doing in the different cases (for other people like me), or link to the appropriate doc sections about it.

chancyk · 2014-06-11T14:19:09Z

I'll definitely add some clarification to the query usage. Thanks for double-checking that.

cpcloud · 2014-06-11T15:51:37Z

@chancyk i can take a look at the query stuff, there are some oddities with ==, in and unequal length objects (we do this to support HDFStore's ability to compare unequal length objects). You may have discovered a bug :-)

jreback · 2014-06-17T12:40:16Z

@chancyk how's this coming?

chancyk · 2014-06-19T06:55:48Z

@jreback, I've removed the references to query('.. == ..') until that behavior can be verified. I left the query('col == @list') since that's documented elsewhere as having the correct behavior. A new render is included below:

jreback · 2014-06-22T12:38:36Z

@jorisvandenbossche ?

jorisvandenbossche · 2014-06-23T13:51:35Z

I will try to give this a last review tomorrow.

@chancyk you didn't include the figures in the source, so on another computer this will not build. So you have to also add them to the commit, or another option is to include the code to generate them in the file? (as eg here with @savefig)

chancyk · 2014-06-23T20:19:23Z

@jorisvandenbossche, the code to generate those plots is included as bench_existence.py but it takes a little while to run. I'll push those plot images into the repository tonight.

chancyk · 2014-06-24T02:12:16Z

@jorisvandenbossche, the plot images are now included in the doc/source/_static folder.

jorisvandenbossche · 2014-06-25T19:32:46Z

doc/source/enhancingperf.rst

+    df[[x in dct for x in df.index]]
+
+    # isin_series, query_in Series, pydict,
+    # join and isin_list are included in the plots below.


I think this comment can be removed?

jorisvandenbossche · 2014-06-25T20:21:20Z

@chancyk I added some more comments on the actual examples you provided. In general I think some of the examples work only 'by accident' because the example data is a bit strange (in the sense that you have exactly the same data in the series/dataframe and in the column/index to compare)

chancyk · 2014-06-26T00:18:11Z

@jorisvandenbossche, thanks for the feedback! You're right, some of those aren't equivalent conceptually. I'll add some clarification to when things are acting on the index, as that can only be taken advantage of in special cases where some of the other examples are more general.

jreback · 2014-09-04T00:33:40Z

@chancyk can you refresh this

@jorisvandenbossche status on this?

jorisvandenbossche · 2014-09-04T07:39:34Z

@chancyk I think this would be a really valuable addition to the docs, but there are some comments raised above that should be first addressed. Do you have some time for that?

chancyk · 2014-09-04T14:38:55Z

@jorisvandenbossche @jreback

I should be able to update this in the next couple of days. I got diverted from writing any code for a while there, but I just finished up another example I would like to include, which is collapsing a DataFrame with groupby, including additional columns to the new DataFrame, and then joining it back to the original.
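
A minimal sketch of the pattern described here (collapse with groupby, add columns to the collapsed frame, join back to the original) might look like the following. The column names and data are hypothetical and are not taken from the PR.

```python
import pandas as pd

# Hypothetical input frame.
df = pd.DataFrame({"key": ["a", "a", "b", "b", "c"],
                   "d": [1, 8, 3, 0, 4]})

# Collapse to one row per key.
summary = df.groupby("key")["d"].agg(["sum", "max"]).add_prefix("d_")

# Add an extra column to the collapsed frame.
summary["d_share"] = summary["d_sum"] / summary["d_sum"].sum()

# Join the per-key summary back onto the original rows.
joined = df.join(summary, on="key")

print(joined)
```

`DataFrame.join` with `on="key"` matches the `key` column of `df` against the index of `summary`, so every original row picks up the values computed for its group.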

chancyk · 2014-09-07T20:55:29Z

I'm getting some interesting results when comparing sorted vs. random data and varying the number of overlapping IDs between datasets. I'll post a visualization soon so it's a little easier to digest. @jorisvandenbossche's concerns definitely brought up some interesting things.


jreback · 2014-09-14T17:20:43Z

how's this coming?

chancyk · 2014-09-15T01:17:41Z

I've rewritten the benchmark routine to use random data with various
percentages of overlap between the two datasets. I also generated some
plots where one or both sets of data have been pre-sorted. It outputs a
giant mess of a line plot which I split into a couple of different plots
for legibility. The latest I've generated can be found here:

http://imgur.com/a/w8Zyh

Each timed function now outputs the exact same DataFrame, which is the
matched elements with no duplicates and sorted. When I enforced this the
benefits of the inner join went away. I'm still digesting it myself. It's
best to look at the bench_existence.py script to see what I'm doing. I
haven't updated the Sphinx doc yet.
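
For readers without access to the script, the setup described above can be approximated as follows. This is a sketch under stated assumptions, not the contents of bench_existence.py: the helper names `make_ids`, `via_isin` and `via_merge` are invented here, and the code assumes a pandas version that provides `sort_values`.

```python
import numpy as np
import pandas as pd

def make_ids(n, overlap, seed=0):
    """Two shuffled id arrays of length n sharing round(n * overlap) ids."""
    rng = np.random.RandomState(seed)
    n_shared = int(round(n * overlap))
    shared = np.arange(n_shared)
    left_only = np.arange(n_shared, n)
    right_only = np.arange(n, 2 * n - n_shared)
    left = rng.permutation(np.concatenate([shared, left_only]))
    right = rng.permutation(np.concatenate([shared, right_only]))
    return left, right

def normalise(frame):
    # Every method must return the same thing: matches, de-duplicated, sorted.
    return frame.drop_duplicates().sort_values("id").reset_index(drop=True)

def via_isin(left, right):
    return normalise(left[left["id"].isin(right["id"])])

def via_merge(left, right):
    keys = right[["id"]].drop_duplicates()
    return normalise(left.merge(keys, on="id", how="inner"))

left_ids, right_ids = make_ids(1000, overlap=0.25)
left = pd.DataFrame({"id": left_ids})
right = pd.DataFrame({"id": right_ids})

a = via_isin(left, right)
b = via_merge(left, right)
print(len(a), a.equals(b))
```

Forcing both paths through the same `normalise` step is what makes the timings comparable; it also adds a sort and de-duplication cost to each method.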


jreback · 2014-09-26T20:38:51Z

@chancyk can you post an updated version?

jreback · 2015-01-18T21:34:33Z

@chancyk can you update / revisit?

jreback · 2015-03-05T23:46:27Z

@chancyk can you revisit this?

jreback · 2015-04-08T14:40:34Z

@chancyk can you revisit?

jreback · 2015-04-28T11:19:02Z

@chancyk want to update?

jreback · 2015-07-12T15:00:43Z

@chancyk I think this would be nice to include. pls reopen if you want to update.

jreback added the Docs label Jun 9, 2014

jreback added this to the 0.14.1 milestone Jun 9, 2014

jreback reviewed Jun 10, 2014

DOC: existence docs and benchmarks.

a7abb0e

jorisvandenbossche reviewed Jun 11, 2014

DOC: existence docs and benchmarks.

cf284ce

DOC: existence docs and benchmarks.

02e207e

jorisvandenbossche reviewed Jun 25, 2014

jreback mentioned this pull request Jun 26, 2014

DOC: closes gh6838. Breakout options.rst from basics.rst #7578

Merged

jorisvandenbossche modified the milestones: 0.15.0, 0.14.1 Jul 3, 2014

jreback modified the milestones: 0.15.0, 0.15.1 Jul 6, 2014

Benchmark using random data.

f9bbee3

jreback modified the milestones: 0.15.1, 0.15.0 Oct 2, 2014

jreback modified the milestones: 0.16.1, 0.16.0 Mar 5, 2015

jreback modified the milestones: 0.17.0, 0.16.1 Apr 28, 2015

jreback closed this Jul 12, 2015
