-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG/PERF: Sort mixed-int in Py3, fix Index.difference #13514
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
dropna = not isinstance(self, ABCMultiIndex) and \ | ||
not isinstance(other, ABCMultiIndex) and \ | ||
self.hasnans and other.hasnans | ||
other = other._get_unique_index(dropna=dropna) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would gladly remove this dropna
stuff and break backward compatibility (so that NaN's are treated as any other element). Would it be dangerous?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be ok to remove the nan stuff. does it still work? e.g. prob need some more tests that have nans in one/both sides
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removing dropna
would make it treat NaN's as any other elements, so the following would be true
pd.Index([1, np.nan]).difference(pd.Index([2, np.nan])) == pd.Index([1])
Similarly for symmetric_difference
. It would simplify the code and be consistent with intersection
. So, I'm for it.
It would break one test that checks for nan's in symmetric_difference
https://github.com/pydata/pandas/blob/master/pandas/tests/indexes/test_base.py#L771
(related to #6444 - sorting of nan's).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that looks fine and more correct. Add an example in whatsnew for this. As an aside there are docs source/indexing.rst which may need updating.
xref #13504 |
@jreback Thanks for the comments. |
f9c5d7e
to
5d376e8
Compare
Current coverage is 84.39%@@ master #13514 diff @@
==========================================
Files 142 142
Lines 51223 51278 +55
Methods 0 0
Messages 0 0
Branches 0 0
==========================================
+ Hits 43224 43276 +52
- Misses 7999 8002 +3
Partials 0 0
|
Parameters | ||
---------- | ||
dropna : bool | ||
If True, NaN values are dropped. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I no longer need a dropna
parameter, but decided to keep it for now.
@jreback OK, cleaned some tests. It's green. |
|
||
Parameters | ||
---------- | ||
other : Index or array-like | ||
other: Index or array-like |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these actually should be other :
(a space before the :), sorry about that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
minor doc change. ping on green. |
some failures on windows. we ultimately test on appveyor.com (you just need a free account), but its very slow. so easiest to spin up a vm if you can. these are prob because you need to use
|
OK, I think I've found the problem but need some time to test it on windows. |
1. Added an internal `safe_sort` to safely sort mixed-integer arrays in Python3. 2. Changed Index.difference and Index.symmetric_difference in order to: - sort mixed-int Indexes (pandas-dev#13432) - improve performance (pandas-dev#12044) 3. Fixed DataFrame.join which raised in Python3 with mixed-int non-unique indexes (issue with sorting mixed-ints, pandas-dev#12814) 4. Fixed Index.union returning an empty Index when one of arguments was a named empty Index (pandas-dev#13432)
new_right = reverse_indexer.take(_ensure_platform_int(right)) | ||
np.putmask(new_right, right == -1, -1) | ||
_, new_labels = algos.safe_sort(uniques, labels, na_sentinel=-1) | ||
new_labels = _ensure_int64(new_labels) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The fix. safe_sort
returns platform int labels but the join function needs int64.
@jreback The fix was straightforward and it should be ok now. However, I'm getting these two errors on windows (64bit) - also on the master branch, so they're independent of this PR. Either I messed up the windows setup or it's not being tested thoroughly.
|
2nd is already fixed |
Ah, then it's probably ok. I tested 0.18.1+202.g0a70b5f on windows. |
Yes, everything is all right. |
is #13432 completely closed by this? if not pls make checkboxes in the top (and tick off the things that are done and leave open things that are not). |
thanks @pijucha pretty awesome! |
thanks! |
@pijucha this is failing on windows / py2.7 (3 is clean), and only windows: https://ci.appveyor.com/project/jreback/pandas-465/build/1.0.762/job/ghokkwmadwog983y not sure what is going on |
@jreback According to test_categorical.py, numpy (>= 1.10) should sort mixed int-datetime array. But it doesn't: In [3]: arr = np.array([1, 2, datetime.now(), 0, 3], dtype='O')
In [4]: np.sort(arr)
/home/users/piotr/workspace/pandas-pijucha/pandas_dev_python2/lib/python2.7/site-packages/numpy/core/fromnumeric.py:825: RuntimeWarning: tp_compare didn't return -1 or -2 for exception
a.sort(axis, kind, order)
Out[4]: array([1, 2, datetime.datetime(2016, 7, 19, 9, 49, 28, 214675), 0, 3], dtype=object)
In [6]: np.__version__
Out[6]: '1.11.0' Ipython probably interferes here because in pure python2.7 I'm getting
In the old code in ordered = [np.sort(np.array([e for e in arr if f(e)], dtype=object))
for f in [lambda x: True, lambda x: False]] I haven't yet caught it precisely but it looks as if it sometimes swallowed an exception. (New code in It looks to me that |
ok can u make a new issue about this? (pretty much copy your above comment( |
Sure, i will. |
git diff upstream/master | flake8 --diff
safe_sort
to safely sort mixed-integerarrays in Python3.
in order to:
non-unique indexes (issue with sorting mixed-ints, BUG: merging with mixed types objects in py3 when unorderable #12814)
arguments was a named empty Index (BUG: Index set operations issues #13432)
Benchmarks (for
index_object
only):