Skip to content

BUG/PERF: Sort mixed-int in Py3, fix Index.difference #13514

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

pijucha
Copy link
Contributor

@pijucha pijucha commented Jun 26, 2016


  1. Added an internal safe_sort to safely sort mixed-integer
    arrays in Python3.
  2. Changed Index.difference and Index.symmetric_difference
    in order to:
  3. Fixed DataFrame.join which raised in Python3 with mixed-int
    non-unique indexes (issue with sorting mixed-ints, BUG: merging with mixed types objects in py3 when unorderable #12814)
  4. Fixed Index.union returning an empty Index when one of
    arguments was a named empty Index (BUG: Index set operations issues #13432)

Benchmarks (for index_object only):

]$ asv compare bf47 a02f -f 1.5 -s 
Benchmarks that have improved:
    before     after       ratio
  [bf47    ] [a02f    ]
-  155.66ms    10.32ms      0.07  index_object.index_datetime_set_difference.time_index_datetime_difference
-  154.66ms     2.98ms      0.02  index_object.index_datetime_set_difference.time_index_datetime_difference_disjoint
-  391.45ms    10.65ms      0.03  index_object.index_datetime_set_difference.time_index_datetime_symmetric_difference
-  195.55ms    88.95ms      0.45  index_object.index_int64_set_difference.time_index_int64_difference
-  440.93ms   103.65ms      0.24  index_object.index_int64_set_difference.time_index_int64_symmetric_difference
-    8.77ms     4.88ms      0.56  index_object.index_str_set_difference.time_str_symmetric_difference

dropna = not isinstance(self, ABCMultiIndex) and \
not isinstance(other, ABCMultiIndex) and \
self.hasnans and other.hasnans
other = other._get_unique_index(dropna=dropna)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would gladly remove this dropna stuff and break backward compatibility (so that NaN's are treated as any other element). Would it be dangerous?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be ok to remove the nan stuff. does it still work? e.g. prob need some more tests that have nans in one/both sides

Copy link
Contributor Author

@pijucha pijucha Jun 26, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing dropna would make it treat NaN's as any other elements, so the following would be true

pd.Index([1, np.nan]).difference(pd.Index([2, np.nan])) == pd.Index([1])

Similarly for symmetric_difference. It would simplify the code and be consistent with intersection. So, I'm for it.

It would break one test that checks for nan's in symmetric_difference https://github.com/pydata/pandas/blob/master/pandas/tests/indexes/test_base.py#L771
(related to #6444 - sorting of nan's).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that looks fine and more correct. Add an example in whatsnew for this. As an aside there are docs source/indexing.rst which may need updating.

@jreback
Copy link
Contributor

jreback commented Jun 26, 2016

xref #13504

@pijucha
Copy link
Contributor Author

pijucha commented Jun 27, 2016

@jreback Thanks for the comments.

@jreback jreback added Indexing Related to indexing on series/frames, not to indexes themselves Dtype Conversions Unexpected or buggy dtype conversions Compat pandas objects compatability with Numpy or Python functions Performance Memory or execution speed performance labels Jun 27, 2016
@pijucha pijucha force-pushed the setop13432 branch 3 times, most recently from f9c5d7e to 5d376e8 Compare July 3, 2016 05:05
@codecov-io
Copy link

codecov-io commented Jul 3, 2016

Current coverage is 84.39%

Merging #13514 into master will increase coverage by 0.01%

@@             master     #13514   diff @@
==========================================
  Files           142        142          
  Lines         51223      51278    +55   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          43224      43276    +52   
- Misses         7999       8002     +3   
  Partials          0          0          

Powered by Codecov. Last updated by 043879f...3a96089

Parameters
----------
dropna : bool
If True, NaN values are dropped.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I no longer need a dropna parameter, but decided to keep it for now.

@pijucha
Copy link
Contributor Author

pijucha commented Jul 15, 2016

@jreback OK, cleaned some tests. It's green.


Parameters
----------
other : Index or array-like
other: Index or array-like
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these actually should be other : (a space before the :), sorry about that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

@jreback
Copy link
Contributor

jreback commented Jul 15, 2016

minor doc change. ping on green.

@jreback
Copy link
Contributor

jreback commented Jul 15, 2016

some failures on windows. we ultimately test on appveyor.com (you just need a free account), but its very slow. so easiest to spin up a vm if you can.

these are prob because you need to use _ensure_platform_int when you are using an indexer (on windows things like .take and .putmask take a int32 rather than standard int64. in this particular case you actually need to _ensure_int64 (as you actually have platform ints up front).

(pandas3.5) C:\Users\conda\Documents\pandas3.5>nosetests pandas/tests/indexes
..........................................................................................................................................
..........................................................................................................................................
..........................................................................................................................................
.........................................................................S................................................................
......................................................................................................E...................................
................................................E...............................................
======================================================================
ERROR: test_join_non_unique (pandas.tests.indexes.test_numeric.TestInt64Index)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Users\conda\Documents\pandas3.5\pandas\tests\indexes\test_numeric.py", line 755, in test_join_non_unique
    joined, lidx, ridx = left.join(left, return_indexers=True)
  File "C:\Users\conda\Documents\pandas3.5\pandas\indexes\base.py", line 2635, in join
    return_indexers=return_indexers)
  File "C:\Users\conda\Documents\pandas3.5\pandas\indexes\base.py", line 2720, in _join_non_unique
    sort=True)
  File "C:\Users\conda\Documents\pandas3.5\pandas\tools\merge.py", line 909, in _get_join_indexers
    return join_func(lkey, rkey, count, **kwargs)
  File "pandas\src\join.pyx", line 51, in pandas.algos.left_outer_join (pandas\algos.c:33321)
    def left_outer_join(ndarray[int64_t] left, ndarray[int64_t] right,
ValueError: Buffer dtype mismatch, expected 'int64_t' but got 'long'

======================================================================
ERROR: test_join_non_unique (pandas.tests.indexes.test_range.TestRangeIndex)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Users\conda\Documents\pandas3.5\pandas\tests\indexes\test_range.py", line 524, in test_join_non_unique
    res, lidx, ridx = self.index.join(other, return_indexers=True)
  File "C:\Users\conda\Documents\pandas3.5\pandas\indexes\range.py", line 435, in join
    return super(RangeIndex, self).join(other, how, level, return_indexers)
  File "C:\Users\conda\Documents\pandas3.5\pandas\indexes\base.py", line 2642, in join
    return_indexers=return_indexers)
  File "C:\Users\conda\Documents\pandas3.5\pandas\indexes\base.py", line 2720, in _join_non_unique
    sort=True)
  File "C:\Users\conda\Documents\pandas3.5\pandas\tools\merge.py", line 909, in _get_join_indexers
    return join_func(lkey, rkey, count, **kwargs)
  File "pandas\src\join.pyx", line 51, in pandas.algos.left_outer_join (pandas\algos.c:33321)
    def left_outer_join(ndarray[int64_t] left, ndarray[int64_t] right,
ValueError: Buffer dtype mismatch, expected 'int64_t' but got 'long'

----------------------------------------------------------------------
Ran 796 tests in 10.849s

FAILED (SKIP=1, errors=2)

@pijucha
Copy link
Contributor Author

pijucha commented Jul 15, 2016

OK, I think I've found the problem but need some time to test it on windows.

1. Added an internal `safe_sort` to safely sort mixed-integer
arrays in Python3.

2. Changed Index.difference and Index.symmetric_difference
in order to:
- sort mixed-int Indexes (pandas-dev#13432)
- improve performance (pandas-dev#12044)

3. Fixed DataFrame.join which raised in Python3 with mixed-int
non-unique indexes (issue with sorting mixed-ints, pandas-dev#12814)

4. Fixed Index.union returning an empty Index when one of
arguments was a named empty Index (pandas-dev#13432)
new_right = reverse_indexer.take(_ensure_platform_int(right))
np.putmask(new_right, right == -1, -1)
_, new_labels = algos.safe_sort(uniques, labels, na_sentinel=-1)
new_labels = _ensure_int64(new_labels)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fix. safe_sort returns platform int labels but the join function needs int64.

@pijucha
Copy link
Contributor Author

pijucha commented Jul 17, 2016

@jreback The fix was straightforward and it should be ok now.

However, I'm getting these two errors on windows (64bit) - also on the master branch, so they're independent of this PR. Either I messed up the windows setup or it's not being tested thoroughly.

======================================================================
FAIL: test_next (pandas.io.tests.test_common.TestMMapWrapper)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "D:\GitHub\pandas-pijucha\pandas\io\tests\test_common.py", line 141, in t
est_next
    self.assertEqual(next_line, line)
AssertionError: 'a,b,c\r\n' != 'a,b,c\n'
- a,b,c
?      -
+ a,b,c

======================================================================
FAIL: test_alignment_non_pandas (pandas.tests.frame.test_operators.TestDataFrame
Operators)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "D:\GitHub\pandas-pijucha\pandas\tests\frame\test_operators.py", line 120
2, in test_alignment_non_pandas
    Series([1, 2, 3], index=df.index))
  File "D:\GitHub\pandas-pijucha\pandas\util\testing.py", line 1157, in assert_s
eries_equal
    assert_attr_equal('dtype', left, right)
  File "D:\GitHub\pandas-pijucha\pandas\util\testing.py", line 882, in assert_at
tr_equal
    left_attr, right_attr)
  File "D:\GitHub\pandas-pijucha\pandas\util\testing.py", line 1021, in raise_as
sert_detail
    raise AssertionError(msg)
AssertionError: Attributes are different

Attribute "dtype" are different
[left]:  int32
[right]: int64None

@jreback
Copy link
Contributor

jreback commented Jul 17, 2016

2nd is already fixed
make sure u r rebased
first was fixed a while back as well

@pijucha
Copy link
Contributor Author

pijucha commented Jul 17, 2016

Ah, then it's probably ok. I tested 0.18.1+202.g0a70b5f on windows.
(This commit is already rebased to the current master. I'll check on windows in a minute.)

@pijucha
Copy link
Contributor Author

pijucha commented Jul 17, 2016

Yes, everything is all right.
The second error is fixed, and the first one was apparently due to my git autocrlf config.

@jreback
Copy link
Contributor

jreback commented Jul 19, 2016

is #13432 completely closed by this? if not pls make checkboxes in the top (and tick off the things that are done and leave open things that are not).

@jreback jreback closed this in b225cac Jul 19, 2016
@jreback
Copy link
Contributor

jreback commented Jul 19, 2016

thanks @pijucha pretty awesome!

@jreback jreback mentioned this pull request Jul 19, 2016
5 tasks
@pijucha
Copy link
Contributor Author

pijucha commented Jul 19, 2016

@jreback Thanks.

As for #13432, the issues with difference and symmetric_difference are solved. But union and intersection are still open. I'll comment there soon.

@jreback
Copy link
Contributor

jreback commented Jul 19, 2016

thanks!

@pijucha pijucha deleted the setop13432 branch July 19, 2016 04:00
@jreback
Copy link
Contributor

jreback commented Jul 19, 2016

@pijucha this is failing on windows / py2.7 (3 is clean), and only windows: https://ci.appveyor.com/project/jreback/pandas-465/build/1.0.762/job/ghokkwmadwog983y

not sure what is going on

@pijucha
Copy link
Contributor Author

pijucha commented Jul 19, 2016

@jreback
This is strange. It's not windows because I'm getting it now on linux (py2.7, numpy 1.11.0). I guess travis tests mostly older numpies and I missed it too.

According to test_categorical.py, numpy (>= 1.10) should sort mixed int-datetime array. But it doesn't:

In [3]: arr = np.array([1, 2, datetime.now(), 0, 3], dtype='O')

In [4]: np.sort(arr)
/home/users/piotr/workspace/pandas-pijucha/pandas_dev_python2/lib/python2.7/site-packages/numpy/core/fromnumeric.py:825: RuntimeWarning: tp_compare didn't return -1 or -2 for exception
  a.sort(axis, kind, order)
Out[4]: array([1, 2, datetime.datetime(2016, 7, 19, 9, 49, 28, 214675), 0, 3], dtype=object)

In [6]: np.__version__
Out[6]: '1.11.0'

Ipython probably interferes here because in pure python2.7 I'm getting

TypeError: can't compare datetime.datetime to int

In the old code in factorize, there was a list comprehension similar to this:

ordered = [np.sort(np.array([e for e in arr if f(e)], dtype=object))
           for f in [lambda x: True, lambda x: False]]

I haven't yet caught it precisely but it looks as if it sometimes swallowed an exception. (New code in safe_sort is simpler - sorts each of the two arrays separately, but still with np.sort.)

It looks to me that Categorical.from_array(arr, ordered=True) should always raise now. And maybe test_constructor_unsortable from test_categorical.py needs to be rewritten. (I'll check later in the evening if it works.)

@jreback
Copy link
Contributor

jreback commented Jul 19, 2016

ok can u make a new issue about this? (pretty much copy your above comment(

@pijucha
Copy link
Contributor Author

pijucha commented Jul 19, 2016

Sure, i will.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Compat pandas objects compatability with Numpy or Python functions Dtype Conversions Unexpected or buggy dtype conversions Indexing Related to indexing on series/frames, not to indexes themselves Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: merging with mixed types objects in py3 when unorderable Index.difference performance
3 participants