BUG/PERF: Sort mixed-int in Py3, fix Index.difference #13514

pijucha · 2016-06-26T04:33:36Z

fixes some issues from BUG: Index set operations issues #13432
closes Index.difference performance #12044
closes BUG: merging with mixed types objects in py3 when unorderable #12814
tests added / passed
passes git diff upstream/master | flake8 --diff
whatsnew entry

Added an internal safe_sort to safely sort mixed-integer
arrays in Python3.
Changed Index.difference and Index.symmetric_difference
in order to:
- sort mixed-int Indexes (BUG: Index set operations issues #13432)
- improve performance (Index.difference performance #12044)
Fixed DataFrame.join which raised in Python3 with mixed-int
non-unique indexes (issue with sorting mixed-ints, BUG: merging with mixed types objects in py3 when unorderable #12814)
Fixed Index.union returning an empty Index when one of
arguments was a named empty Index (BUG: Index set operations issues #13432)

Benchmarks (for index_object only):

]$ asv compare bf47 a02f -f 1.5 -s 
Benchmarks that have improved:
    before     after       ratio
  [bf47    ] [a02f    ]
-  155.66ms    10.32ms      0.07  index_object.index_datetime_set_difference.time_index_datetime_difference
-  154.66ms     2.98ms      0.02  index_object.index_datetime_set_difference.time_index_datetime_difference_disjoint
-  391.45ms    10.65ms      0.03  index_object.index_datetime_set_difference.time_index_datetime_symmetric_difference
-  195.55ms    88.95ms      0.45  index_object.index_int64_set_difference.time_index_int64_difference
-  440.93ms   103.65ms      0.24  index_object.index_int64_set_difference.time_index_int64_symmetric_difference
-    8.77ms     4.88ms      0.56  index_object.index_str_set_difference.time_str_symmetric_difference

pijucha · 2016-06-26T05:07:22Z

pandas/indexes/base.py

+        dropna = not isinstance(self, ABCMultiIndex) and \
+            not isinstance(other, ABCMultiIndex) and \
+            self.hasnans and other.hasnans
+        other = other._get_unique_index(dropna=dropna)


I would gladly remove this dropna stuff and break backward compatibility (so that NaN's are treated as any other element). Would it be dangerous?

I think it would be ok to remove the nan stuff. does it still work? e.g. prob need some more tests that have nans in one/both sides

Removing dropna would make it treat NaN's as any other elements, so the following would be true

pd.Index([1, np.nan]).difference(pd.Index([2, np.nan])) == pd.Index([1])

Similarly for symmetric_difference. It would simplify the code and be consistent with intersection. So, I'm for it.

It would break one test that checks for nan's in symmetric_difference https://github.com/pydata/pandas/blob/master/pandas/tests/indexes/test_base.py#L771
(related to #6444 - sorting of nan's).

that looks fine and more correct. Add an example in whatsnew for this. As an aside there are docs source/indexing.rst which may need updating.

jreback · 2016-06-26T20:48:05Z

xref #13504

pijucha · 2016-06-27T00:12:07Z

@jreback Thanks for the comments.

codecov-io · 2016-07-03T05:32:14Z

Current coverage is 84.39%

Merging #13514 into master will increase coverage by 0.01%

@@             master     #13514   diff @@
==========================================
  Files           142        142          
  Lines         51223      51278    +55   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          43224      43276    +52   
- Misses         7999       8002     +3   
  Partials          0          0

Powered by Codecov. Last updated by 043879f...3a96089

pijucha · 2016-07-03T05:43:41Z

pandas/indexes/base.py

+        Parameters
+        ----------
+        dropna : bool
+            If True, NaN values are dropped.


Actually, I no longer need a dropna parameter, but decided to keep it for now.

pijucha · 2016-07-15T01:36:04Z

@jreback OK, cleaned some tests. It's green.

jreback · 2016-07-15T02:03:38Z

pandas/indexes/base.py


        Parameters
        ----------
-        other : Index or array-like
+        other: Index or array-like


these actually should be other : (a space before the :), sorry about that.

jreback · 2016-07-15T02:04:48Z

minor doc change. ping on green.

jreback · 2016-07-15T10:19:05Z

some failures on windows. we ultimately test on appveyor.com (you just need a free account), but its very slow. so easiest to spin up a vm if you can.

these are prob because you need to use _ensure_platform_int when you are using an indexer (on windows things like .take and .putmask take a int32 rather than standard int64. in this particular case you actually need to _ensure_int64 (as you actually have platform ints up front).

(pandas3.5) C:\Users\conda\Documents\pandas3.5>nosetests pandas/tests/indexes
..........................................................................................................................................
..........................................................................................................................................
..........................................................................................................................................
.........................................................................S................................................................
......................................................................................................E...................................
................................................E...............................................
======================================================================
ERROR: test_join_non_unique (pandas.tests.indexes.test_numeric.TestInt64Index)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Users\conda\Documents\pandas3.5\pandas\tests\indexes\test_numeric.py", line 755, in test_join_non_unique
    joined, lidx, ridx = left.join(left, return_indexers=True)
  File "C:\Users\conda\Documents\pandas3.5\pandas\indexes\base.py", line 2635, in join
    return_indexers=return_indexers)
  File "C:\Users\conda\Documents\pandas3.5\pandas\indexes\base.py", line 2720, in _join_non_unique
    sort=True)
  File "C:\Users\conda\Documents\pandas3.5\pandas\tools\merge.py", line 909, in _get_join_indexers
    return join_func(lkey, rkey, count, **kwargs)
  File "pandas\src\join.pyx", line 51, in pandas.algos.left_outer_join (pandas\algos.c:33321)
    def left_outer_join(ndarray[int64_t] left, ndarray[int64_t] right,
ValueError: Buffer dtype mismatch, expected 'int64_t' but got 'long'

======================================================================
ERROR: test_join_non_unique (pandas.tests.indexes.test_range.TestRangeIndex)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Users\conda\Documents\pandas3.5\pandas\tests\indexes\test_range.py", line 524, in test_join_non_unique
    res, lidx, ridx = self.index.join(other, return_indexers=True)
  File "C:\Users\conda\Documents\pandas3.5\pandas\indexes\range.py", line 435, in join
    return super(RangeIndex, self).join(other, how, level, return_indexers)
  File "C:\Users\conda\Documents\pandas3.5\pandas\indexes\base.py", line 2642, in join
    return_indexers=return_indexers)
  File "C:\Users\conda\Documents\pandas3.5\pandas\indexes\base.py", line 2720, in _join_non_unique
    sort=True)
  File "C:\Users\conda\Documents\pandas3.5\pandas\tools\merge.py", line 909, in _get_join_indexers
    return join_func(lkey, rkey, count, **kwargs)
  File "pandas\src\join.pyx", line 51, in pandas.algos.left_outer_join (pandas\algos.c:33321)
    def left_outer_join(ndarray[int64_t] left, ndarray[int64_t] right,
ValueError: Buffer dtype mismatch, expected 'int64_t' but got 'long'

----------------------------------------------------------------------
Ran 796 tests in 10.849s

FAILED (SKIP=1, errors=2)

pijucha · 2016-07-15T16:33:43Z

OK, I think I've found the problem but need some time to test it on windows.

1. Added an internal `safe_sort` to safely sort mixed-integer arrays in Python3. 2. Changed Index.difference and Index.symmetric_difference in order to: - sort mixed-int Indexes (pandas-dev#13432) - improve performance (pandas-dev#12044) 3. Fixed DataFrame.join which raised in Python3 with mixed-int non-unique indexes (issue with sorting mixed-ints, pandas-dev#12814) 4. Fixed Index.union returning an empty Index when one of arguments was a named empty Index (pandas-dev#13432)

pijucha · 2016-07-17T17:57:14Z

pandas/tools/merge.py

-    new_right = reverse_indexer.take(_ensure_platform_int(right))
-    np.putmask(new_right, right == -1, -1)
+    _, new_labels = algos.safe_sort(uniques, labels, na_sentinel=-1)
+    new_labels = _ensure_int64(new_labels)


The fix. safe_sort returns platform int labels but the join function needs int64.

pijucha · 2016-07-17T18:09:11Z

@jreback The fix was straightforward and it should be ok now.

However, I'm getting these two errors on windows (64bit) - also on the master branch, so they're independent of this PR. Either I messed up the windows setup or it's not being tested thoroughly.

======================================================================
FAIL: test_next (pandas.io.tests.test_common.TestMMapWrapper)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "D:\GitHub\pandas-pijucha\pandas\io\tests\test_common.py", line 141, in t
est_next
    self.assertEqual(next_line, line)
AssertionError: 'a,b,c\r\n' != 'a,b,c\n'
- a,b,c
?      -
+ a,b,c

======================================================================
FAIL: test_alignment_non_pandas (pandas.tests.frame.test_operators.TestDataFrame
Operators)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "D:\GitHub\pandas-pijucha\pandas\tests\frame\test_operators.py", line 120
2, in test_alignment_non_pandas
    Series([1, 2, 3], index=df.index))
  File "D:\GitHub\pandas-pijucha\pandas\util\testing.py", line 1157, in assert_s
eries_equal
    assert_attr_equal('dtype', left, right)
  File "D:\GitHub\pandas-pijucha\pandas\util\testing.py", line 882, in assert_at
tr_equal
    left_attr, right_attr)
  File "D:\GitHub\pandas-pijucha\pandas\util\testing.py", line 1021, in raise_as
sert_detail
    raise AssertionError(msg)
AssertionError: Attributes are different

Attribute "dtype" are different
[left]:  int32
[right]: int64None

jreback · 2016-07-17T18:11:23Z

2nd is already fixed
make sure u r rebased
first was fixed a while back as well

pijucha · 2016-07-17T18:23:00Z

Ah, then it's probably ok. I tested 0.18.1+202.g0a70b5f on windows.
(This commit is already rebased to the current master. I'll check on windows in a minute.)

pijucha · 2016-07-17T19:29:52Z

Yes, everything is all right.
The second error is fixed, and the first one was apparently due to my git autocrlf config.

jreback · 2016-07-19T01:47:25Z

is #13432 completely closed by this? if not pls make checkboxes in the top (and tick off the things that are done and leave open things that are not).

jreback · 2016-07-19T01:52:41Z

thanks @pijucha pretty awesome!

pijucha · 2016-07-19T02:10:48Z

@jreback Thanks.

As for #13432, the issues with difference and symmetric_difference are solved. But union and intersection are still open. I'll comment there soon.

jreback · 2016-07-19T02:13:25Z

thanks!

jreback · 2016-07-19T10:40:14Z

@pijucha this is failing on windows / py2.7 (3 is clean), and only windows: https://ci.appveyor.com/project/jreback/pandas-465/build/1.0.762/job/ghokkwmadwog983y

not sure what is going on

pijucha · 2016-07-19T14:14:45Z

@jreback
This is strange. It's not windows because I'm getting it now on linux (py2.7, numpy 1.11.0). I guess travis tests mostly older numpies and I missed it too.

According to test_categorical.py, numpy (>= 1.10) should sort mixed int-datetime array. But it doesn't:

In [3]: arr = np.array([1, 2, datetime.now(), 0, 3], dtype='O')

In [4]: np.sort(arr)
/home/users/piotr/workspace/pandas-pijucha/pandas_dev_python2/lib/python2.7/site-packages/numpy/core/fromnumeric.py:825: RuntimeWarning: tp_compare didn't return -1 or -2 for exception
  a.sort(axis, kind, order)
Out[4]: array([1, 2, datetime.datetime(2016, 7, 19, 9, 49, 28, 214675), 0, 3], dtype=object)

In [6]: np.__version__
Out[6]: '1.11.0'

Ipython probably interferes here because in pure python2.7 I'm getting

TypeError: can't compare datetime.datetime to int

In the old code in factorize, there was a list comprehension similar to this:

ordered = [np.sort(np.array([e for e in arr if f(e)], dtype=object))
           for f in [lambda x: True, lambda x: False]]

I haven't yet caught it precisely but it looks as if it sometimes swallowed an exception. (New code in safe_sort is simpler - sorts each of the two arrays separately, but still with np.sort.)

It looks to me that Categorical.from_array(arr, ordered=True) should always raise now. And maybe test_constructor_unsortable from test_categorical.py needs to be rewritten. (I'll check later in the evening if it works.)

jreback · 2016-07-19T14:27:18Z

ok can u make a new issue about this? (pretty much copy your above comment(

pijucha · 2016-07-19T14:50:49Z

Sure, i will.

pijucha reviewed Jun 26, 2016
View reviewed changes

jreback added Indexing Related to indexing on series/frames, not to indexes themselves Dtype Conversions Unexpected or buggy dtype conversions Compat pandas objects compatability with Numpy or Python functions Performance Memory or execution speed performance labels Jun 27, 2016

pijucha force-pushed the setop13432 branch 3 times, most recently from f9c5d7e to 5d376e8 Compare July 3, 2016 05:05

pijucha reviewed Jul 3, 2016
View reviewed changes

pijucha force-pushed the setop13432 branch from 4ac1732 to 4749764 Compare July 15, 2016 00:03

jreback reviewed Jul 15, 2016
View reviewed changes

pijucha force-pushed the setop13432 branch from 4749764 to 3506b6a Compare July 15, 2016 02:40

pijucha force-pushed the setop13432 branch from 3506b6a to 3a96089 Compare July 17, 2016 16:49

pijucha reviewed Jul 17, 2016
View reviewed changes

jreback closed this in b225cac Jul 19, 2016

jreback mentioned this pull request Jul 19, 2016

BUG: Index set operations issues #13432

Closed

5 tasks

pijucha deleted the setop13432 branch July 19, 2016 04:00

This was referenced Jul 20, 2016

TST/BUG in test_categorical.py: test_constructor_unsortable breaks after recent commit #13714

Closed

TST/BUG: test unsortable in safe_sort and Categorical (GH13714) #13733

Closed

jreback mentioned this pull request Sep 9, 2016

TST: Difference in drop_duplicates for numeric Series and DataFrames #14192

Closed

Liam3851 mentioned this pull request Oct 3, 2016

DatetimeIndex union fails in 0.19rc1 when constructed from differences in DatetimeIndexes #14323

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG/PERF: Sort mixed-int in Py3, fix Index.difference #13514

BUG/PERF: Sort mixed-int in Py3, fix Index.difference #13514

pijucha commented Jun 26, 2016

pijucha Jun 26, 2016

jreback Jun 26, 2016

pijucha Jun 26, 2016 •

edited

Loading

jreback Jun 27, 2016

jreback commented Jun 26, 2016

pijucha commented Jun 27, 2016

codecov-io commented Jul 3, 2016 •

edited

Loading

pijucha Jul 3, 2016

pijucha commented Jul 15, 2016

jreback Jul 15, 2016

pijucha Jul 15, 2016

jreback commented Jul 15, 2016

jreback commented Jul 15, 2016

pijucha commented Jul 15, 2016

pijucha Jul 17, 2016

pijucha commented Jul 17, 2016

jreback commented Jul 17, 2016

pijucha commented Jul 17, 2016

pijucha commented Jul 17, 2016

jreback commented Jul 19, 2016

jreback commented Jul 19, 2016

pijucha commented Jul 19, 2016

jreback commented Jul 19, 2016

jreback commented Jul 19, 2016 •

edited

Loading

pijucha commented Jul 19, 2016

jreback commented Jul 19, 2016

pijucha commented Jul 19, 2016

BUG/PERF: Sort mixed-int in Py3, fix Index.difference #13514

BUG/PERF: Sort mixed-int in Py3, fix Index.difference #13514

Conversation

pijucha commented Jun 26, 2016

pijucha Jun 26, 2016

Choose a reason for hiding this comment

jreback Jun 26, 2016

Choose a reason for hiding this comment

pijucha Jun 26, 2016 • edited Loading

Choose a reason for hiding this comment

jreback Jun 27, 2016

Choose a reason for hiding this comment

jreback commented Jun 26, 2016

pijucha commented Jun 27, 2016

codecov-io commented Jul 3, 2016 • edited Loading

Current coverage is 84.39%

pijucha Jul 3, 2016

Choose a reason for hiding this comment

pijucha commented Jul 15, 2016

jreback Jul 15, 2016

Choose a reason for hiding this comment

pijucha Jul 15, 2016

Choose a reason for hiding this comment

jreback commented Jul 15, 2016

jreback commented Jul 15, 2016

pijucha commented Jul 15, 2016

pijucha Jul 17, 2016

Choose a reason for hiding this comment

pijucha commented Jul 17, 2016

jreback commented Jul 17, 2016

pijucha commented Jul 17, 2016

pijucha commented Jul 17, 2016

jreback commented Jul 19, 2016

jreback commented Jul 19, 2016

pijucha commented Jul 19, 2016

jreback commented Jul 19, 2016

jreback commented Jul 19, 2016 • edited Loading

pijucha commented Jul 19, 2016

jreback commented Jul 19, 2016

pijucha commented Jul 19, 2016

pijucha Jun 26, 2016 •

edited

Loading

codecov-io commented Jul 3, 2016 •

edited

Loading

jreback commented Jul 19, 2016 •

edited

Loading