Skip to content

BUG: union should not try to sort inplace because platform impls differ as to when sorting occurs for objects that cannot be compared #5266

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jun 3, 2014

Conversation

cpcloud
Copy link
Member

@cpcloud cpcloud commented Oct 19, 2013

closes #5039

@ghost ghost assigned cpcloud Oct 19, 2013
@cpcloud
Copy link
Member Author

cpcloud commented Oct 19, 2013

@jtratner Can you try this out? Thanks

@cpcloud
Copy link
Member Author

cpcloud commented Oct 19, 2013

running a vbench ... likely a hit since this will now copy

@cpcloud
Copy link
Member Author

cpcloud commented Oct 19, 2013

Actually, nothing huge:

mask_bools                                   |  14.6003 |  13.2553 |   1.1015 |
reindex_fillna_backfill_float32              |   0.4900 |   0.4440 |   1.1035 |
timeseries_sort_index                        |  21.4117 |  19.3573 |   1.1061 |
join_dataframe_integer_2key                  |   5.2234 |   4.7040 |   1.1104 |
stat_ops_frame_mean_int_axis_0               |   0.6540 |   0.5771 |   1.1333 |
stat_ops_frame_sum_int_axis_0                |   0.4297 |   0.3790 |   1.1338 |
reindex_multiindex                           |   1.6460 |   1.4460 |   1.1383 |
read_store_table_panel                       |  22.6684 |  19.8513 |   1.1419 |
stat_ops_frame_sum_float_axis_1              |   0.7653 |   0.6527 |   1.1725 |
groupby_simple_compress_timing               |  26.6473 |  22.1800 |   1.2014 |
-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------

@jtratner
Copy link
Contributor

Why not just sort on id if it can't sort on anything else? I.e.:

result = result[np.argsort(map(id, arr))]

That way you don't have to copy for an edge case.

@jtratner
Copy link
Contributor

It has to be dtype object already.

@jtratner
Copy link
Contributor

I guess that's not necessarily stable/repeatable tho

@cpcloud
Copy link
Member Author

cpcloud commented Oct 19, 2013

and it will copy on that kind of slice

@jtratner
Copy link
Contributor

Definitely - I was just thinking you would put that in the except clause with TypeErrors so you're generally not copying where possible.

@jtratner
Copy link
Contributor

But sorting by id is bad anyways, since it's not really deterministic between calls with non-interned values.

@jtratner
Copy link
Contributor

But this does appear to fix the issue. 😄

@cpcloud
Copy link
Member Author

cpcloud commented Oct 19, 2013

the whole idea of catching a failed inplace sort is botched

@jtratner
Copy link
Contributor

wait, no never mind. popped up again on a different thing now.

@cpcloud
Copy link
Member Author

cpcloud commented Oct 19, 2013

arg we need mac testing

@cpcloud
Copy link
Member Author

cpcloud commented Oct 19, 2013

can u put up the error

@jtratner
Copy link
Contributor

Is there something in the test suite that's dependent upon randomness or order of instantiation? Weirdly it passes with nosetests, but fails if I do nosetests --with-progressive Tried this on multiple different runs.

FAIL: pandas.computation.tests.test_eval:TestAlignment.test_basic_series_frame_alignment('numexpr', 'python')
  /usr/local/bin/vim +872 pandas/computation/tests/test_eval.py  # check_basic_series_frame_alignment
    testit(r_idx_type, c_idx_type, index_name)
  /usr/local/bin/vim +867 pandas/computation/tests/test_eval.py  # testit
    assert_frame_equal(res, expected)
  /usr/local/bin/vim +484 pandas/util/testing.py  # assert_frame_equal
    assert_index_equal(left.columns, right.columns)
  /usr/local/bin/vim +364 pandas/util/testing.py  # assert_index_equal
    right.dtype))
AssertionError: [index] left [object Index([u'98ןן2הנ50פ', 2000-01-03 00:00:00, 2000-01-04 00:00:00, 2000-01-05 00:00:00, 2000-01-06 00:00:00, 2000-01-07 00:00:00, u'אוז91כיע5צ', u'י09לירג6אש', u'יא0ק0פלר31', u'ם8327יכסצנ', u'ןעץמףרן43ץ', u'קזהאצ5כצון'], dtype='object')], right [Index([u'98ןן2הנ50פ', 2000-01-04 00:00:00, 2000-01-05 00:00:00, 2000-01-06 00:00:00, u'י09לירג6אש', 2000-01-03 00:00:00, u'אוז91כיע5צ', u'יא0ק0פלר31', 2000-01-07 00:00:00, u'ם8327יכסצנ', u'ןעץמףרן43ץ', u'קזהאצ5כצון'], dtype='object') object]

@cpcloud
Copy link
Member Author

cpcloud commented Oct 19, 2013

let me see ...

@jtratner
Copy link
Contributor

Okay, now I have figured it out - it fails if you run it with "-A 'not network and not slow'". must be something to do with test ordering.

@cpcloud
Copy link
Member Author

cpcloud commented Oct 19, 2013

that repr is screwed up too

@cpcloud
Copy link
Member Author

cpcloud commented Oct 19, 2013

is that just a mistype ? or are those two consecutive commas a bug?

@jtratner
Copy link
Contributor

Nope, that's wrong too, it just intermittently fails now - very confusing...

I wonder if you bumped up the size of the objects it's generating - I bet you'd see it on your computer (double or triple them just for testing)

@cpcloud
Copy link
Member Author

cpcloud commented Oct 19, 2013

sweet i now get the runtimewarning

@cpcloud
Copy link
Member Author

cpcloud commented Oct 19, 2013

which of course makes sense

@cpcloud
Copy link
Member Author

cpcloud commented Oct 19, 2013

I can't reproduce your most recent error.

@jtratner
Copy link
Contributor

It only occurs about 60% of the time for me - I finally figured out that it was occuring semi-randomly by running it endlessly:

from subprocess import call
[call(['nosetests', 'pandas.computation.tests.test_eval:TestAlignment.test_basic_series_frame_alignment']) for i in range(30)]

@jtratner
Copy link
Contributor

@cpcloud what is this test even supposed to test? s + df just ends up with a big NaN frame because it checks columns before Index, a portion like this:

            2000-01-11 00:00:00  טרם7חעחש4ק  איט8צ7ךןהד  ה1קרט2ךץמל  \
R0
איט8צ7ךןהד                  NaN         NaN         NaN         NaN
נצ6מנ4םדץע                  NaN         NaN         NaN         NaN
ה1קרט2ךץמל                  NaN         NaN         NaN         NaN
טרם7חעחש4ק                  NaN         NaN         NaN         NaN
יכלפדה2ערם                  NaN         NaN         NaN         NaN
הועפץג78ןר                  NaN         NaN         NaN         NaN
בעזלו2ףננק                  NaN         NaN         NaN         NaN
ענמ3ך8שז1ג                  NaN         NaN         NaN         NaN
אנ0ףפד38כ5                  NaN         NaN         NaN         NaN
כע13רץןהאר                  NaN         NaN         NaN         NaN

@jtratner
Copy link
Contributor

@cpcloud how about this for union/sorting generally:

  1. Catch the TypeError
  2. Warn that the sort order is undefined with unsortable objects
  3. Return whatever resulted.

And then in the test suite, don't bother comparing the result if the index and column types aren't compatible. There's no point in spending a lot of time trying to test undefined behavior and figuring out why some routes produce different, still-undefined behavior.

So I think this would just be checking that column and index are either equal or are an integer/float combo.

@cpcloud
Copy link
Member Author

cpcloud commented Oct 20, 2013

@jtratner

what is this test even supposed to test?

Just that the operation in pd.eval space returns the same result as the operation in plain Python space, nothing more specific than that.

don't bother comparing the result if the index and column types aren't compatible

There's only a "common sense" notion of "index and column types that aren't compatible" since indices are converted to object dtype if they are not the same dtype. The idea for these tests was just to make sure that operations that don't make sense do the same thing no matter how you're evaluating an expression.

I suspect the reason they are happening differently is because Python space uses combine_* and friends (which may join indices differently with the index) while eval does a join on each axis (if the indices differ on a given axis), reindexes each axis by the new index, passes the expression with the underlying numpy arrays to numexpr, then reconstructs the result. The time complexity of this "index alignment" operation is O(# of terms * max(ndim for each term))

This is an actual issue because the result should be the same (it is on Linux AFAICT) but Mac OS seems to do something a bit different with joining as opposed to combining objects with combine_* and friends.

Does that all make sense?

@cpcloud
Copy link
Member Author

cpcloud commented Oct 20, 2013

@jtratner Can you show me the output of the following:

import numpy as np
import pandas as pd
from pandas.util.testing import makeCustomDataframe as mkdf

f = lambda *args, **kwargs: np.random.randn()
df = mkdf(10, 7, data_gen_f=f, r_idx_type='i', c_idx_type='dt')
s = pd.Series(np.random.randn(5), df.index[:5])
x, y = df.align(s, axis=1)
print(y)

@jtratner
Copy link
Contributor

That totally makes sense - - I was just thinking that there isn't really a
'correct' order for these objects since they are unorderable. Plus I
figured if we could avoid a copy in the general case that would be a plus.

Unfortunately, I just had to take my main mac for service so I have to get
my much older (still Intel) mac up to speed first/get the original back,
which I probably can't do for a few days.

Maybe @TomAugspurger can take a look?

@jreback jreback modified the milestones: 0.15.0, 0.14.0 Mar 28, 2014
@jreback
Copy link
Contributor

jreback commented Apr 9, 2014

@cpcloud status on this?

@jreback
Copy link
Contributor

jreback commented May 30, 2014

want to try to fix this?

@cpcloud
Copy link
Member Author

cpcloud commented May 30, 2014

Yep kind of fell by the wayside a bit. Bring on the version issues!!

@cpcloud cpcloud modified the milestones: 0.14.1, 0.15.0 Jun 1, 2014
@cpcloud
Copy link
Member Author

cpcloud commented Jun 1, 2014

just had a duh moment, no need for copying. since we have access to the separate indexes we can copmare the first elements of each with the default sorting comparator (<) and if that doesn't throw we can sort in place otherwise we leave it as is

@cpcloud
Copy link
Member Author

cpcloud commented Jun 2, 2014

YES this finally passed

@cpcloud
Copy link
Member Author

cpcloud commented Jun 2, 2014

@jreback this passing locally can you review pls when u get a chance? thx

cc @jtratner @TomAugspurger

@cpcloud
Copy link
Member Author

cpcloud commented Jun 2, 2014

by locally i mean passes on travis but hasn't been run on master

@jreback
Copy link
Contributor

jreback commented Jun 2, 2014

looks ok

@cpcloud
Copy link
Member Author

cpcloud commented Jun 2, 2014

well travis is randomly failing

@cpcloud
Copy link
Member Author

cpcloud commented Jun 3, 2014

why is this continually failing?

https://travis-ci.org/pydata/pandas/jobs/26625395

it passes every time on cpcloud/pandas branch but keeps not "finding" pandas

…er as to when sorting occurs for objects that cannot be compared
@jreback
Copy link
Contributor

jreback commented Jun 3, 2014

any direct comparioson with int ndarrays? (e.g. need to explicity set int64 for these comparisons as these fail on 32-bit platforms (just checking, you don't seem to have it)

@jreback
Copy link
Contributor

jreback commented Jun 3, 2014

looks ok otherwise

@cpcloud
Copy link
Member Author

cpcloud commented Jun 3, 2014

There's comparison between single element ints and all other types in the case of union of non monotonic indices. Is that what you mean?

@cpcloud
Copy link
Member Author

cpcloud commented Jun 3, 2014

But by that point they are object arrays. Not sure if that matters re casting the int types

@jreback
Copy link
Contributor

jreback commented Jun 3, 2014

nvm looks ok then

cpcloud added a commit that referenced this pull request Jun 3, 2014
BUG: union should not try to sort inplace because platform impls differ as to when sorting occurs for objects that cannot be compared
@cpcloud cpcloud merged commit 1754bb5 into pandas-dev:master Jun 3, 2014
@cpcloud cpcloud deleted the index-fix-sort-union-5039 branch June 3, 2014 13:37
@jreback
Copy link
Contributor

jreback commented Jun 3, 2014

on python 33-32, and 34-64 on windows (I may not have installed on some of the other py3 versions)

looks like c1 was i and c2 was u

Odd that this doesn't break on travis though

======================================================================
ERROR: pandas.computation.tests.test_eval.TestAlignment.test_complex_series_frame_alignment('numexpr', 'python')
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Python34-64\lib\site-packages\nose-1.3.0-py3.4.egg\nose\case.py", line 198, in runTest
    self.test(*self.arg)
  File "c:\Users\Jeff Reback\Documents\GitHub\pandas\build\lib.win-amd64-3.4\pandas\computation\tests\test_eval.py", line 931, in check_complex_series_frame_alignment
    expected = expected2 + df
  File "c:\Users\Jeff Reback\Documents\GitHub\pandas\build\lib.win-amd64-3.4\pandas\core\ops.py", line 761, in f
    return self._combine_frame(other, na_op, fill_value, level)
  File "c:\Users\Jeff Reback\Documents\GitHub\pandas\build\lib.win-amd64-3.4\pandas\core\frame.py", line 2774, in _combine_frame
    this, other = self.align(other, join='outer', level=level, copy=False)
  File "c:\Users\Jeff Reback\Documents\GitHub\pandas\build\lib.win-amd64-3.4\pandas\core\generic.py", line 2942, in align
    fill_axis=fill_axis)
  File "c:\Users\Jeff Reback\Documents\GitHub\pandas\build\lib.win-amd64-3.4\pandas\core\generic.py", line 2969, in _align_frame
    return_indexers=True)
  File "c:\Users\Jeff Reback\Documents\GitHub\pandas\build\lib.win-amd64-3.4\pandas\core\index.py", line 1485, in join
    join_index = self.union(other)
  File "c:\Users\Jeff Reback\Documents\GitHub\pandas\build\lib.win-amd64-3.4\pandas\core\index.py", line 1014, in union
    result.sort()
TypeError: unorderable types: int() > str()

@cpcloud
Copy link
Member Author

cpcloud commented Jun 3, 2014

ah darn ok. that should be caught by the TypeError. strange that sorting them raises but comparison of individual elements doesnt

@cpcloud
Copy link
Member Author

cpcloud commented Jun 3, 2014

i can repro this on linux yay

@cpcloud
Copy link
Member Author

cpcloud commented Jun 3, 2014

i see ... if your indices are already mixed then there's nothing you can do...i'll just return results unsorted if inferred_type == 'mixed' @jreback i think you mentioned something like this earlier

@jreback
Copy link
Contributor

jreback commented Jun 3, 2014

yeh for sure just bail out (leave unsorted)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Dtype Conversions Unexpected or buggy dtype conversions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: Running pandas/computation/test_eval individually generates RuntimeWarnings about tp_compare
4 participants