-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
PERF: Improve drop_duplicates for bool columns (#12963) #15738
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report
@@ Coverage Diff @@
## master #15738 +/- ##
==========================================
- Coverage 91.02% 91% -0.02%
==========================================
Files 143 143
Lines 49392 49395 +3
==========================================
- Hits 44957 44951 -6
- Misses 4435 4444 +9
Continue to review full report at Codecov.
|
pandas/core/algorithms.py
Outdated
@@ -343,6 +344,8 @@ def factorize(values, sort=False, order=None, na_sentinel=-1, size_hint=None): | |||
vals = values.view(np.int64) | |||
else: | |||
vals = np.asarray(values) | |||
if is_bool_dtype(vals): | |||
vals = vals.astype(int) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you will have to add there dtype = bool
? (or dtype = values.dtype)
Otherwise the uniques will also be 0/1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
better to have this as part of the if/else
instead (the outer one) as its more explicit.
dont use .astype
(which copies), instead
In [3]: np.asarray(Series([True, False, True])).view('uint8')
Out[3]: array([1, 0, 1], dtype=uint8)
and as @jorisvandenbossche suggest, needs dtype=bool
for the reconstruction part of the uniques.
@mroeschke ping when you have made those changes. I suspect we are incorrectly handling bool in other algos as well (do as separate issues), e.g. |
This looks OK to me (w/ @jreback change) Can be a separate issue, but I think bool handling throughout these algos could made a lot faster by skipping the hashing altogether? We could define a That said, may not be worth the added complexity. %%cython
from numpy cimport uint8_t
cimport cython
import numpy as np
@cython.boundscheck(False)
@cython.wraparound(False)
def unique(uint8_t[:] data):
cdef:
bint seen_true = False
bint seen_false = False
Py_ssize_t i, N = len(data)
uint8_t val
with nogil:
for i in range(N):
val = data[i]
if val == 0:
seen_false = True
elif val == 1:
seen_true = True
if seen_true and seen_false:
break
if seen_true and seen_false:
return np.array([True, False])
elif seen_true:
return np.array([True])
elif seen_false:
return np.array([False])
else:
return np.array([], dtype=bool)
In [35]: bools = np.array([False, True] * 100000)
In [36]: %timeit pd.unique(bools.view('uint8')).astype(bool)
1000 loops, best of 3: 1.74 ms per loop
In [37]: %timeit unique(bools.view('uint8'))
100000 loops, best of 3: 3.47 µs per loop |
@chris-b1 It would be pretty easy to do this and maybe not too much extra code. can you hack up the |
hacked up impl def bool_factorize(values):
values = values.view('uint8')
return values, unique(values)
def drop_duplicates_bool(df):
# 0.19.2 imports
from pandas.core.groupby import get_group_index
from pandas.hashtable import duplicated_int64
keep = 'first'
def f(vals):
labels, shape = bool_factorize(vals)
return labels.astype('i8', copy=False), len(shape)
vals = (df[col].values for col in df)
labels, shape = map(list, zip(*map(f, vals)))
ids = get_group_index(labels, shape, sort=False, xnull=False)
duplicated = pd.Series(duplicated_int64(ids, keep), index=df.index)
return df[-duplicated] I get another 2x speedup over casting to int.
|
It doesn't - apparently In [84]: %timeit df.values.astype('uint8', copy=False)
10 loops, best of 3: 34.8 ms per loop
In [85]: %timeit df.values.astype('uint8', copy=True)
10 loops, best of 3: 34.9 ms per loop
In [86]: %timeit df.values.view('uint8')
100000 loops, best of 3: 8.37 µs per loop |
@chris-b1 yeah I erased that as soon as I wrote it :< the uint8 -> int64 for duplicated -> bool so it probably does copy. you might want to do a hacked duplicated_bool and see how much that makes a difference (if too much work don't, though that impl should be much like unique) |
Add whatsnew Add dtype label and reorg logic
this looks good. @chris-b1 going to merge this. But certainly can look at a |
@jreback before you merge, I think this can use some tests |
@jorisvandenbossche hmm, we are already testing drop_duplicates quite thoroughly |
Well, but the comment I made was not catched by any test? |
hmm, that's a good point. The tests I pointed, check the index. and tests in |
@jorisvandenbossche Looks like tests passed even without Nonetheless, good to add more test to |
closes pandas-dev#12963 Author: Matt Roeschke <[email protected]> Closes pandas-dev#15738 from mroeschke/fix_12963 and squashes the following commits: a020c10 [Matt Roeschke] PERF: Improve drop_duplicates for bool columns (pandas-dev#12963)
closes pandas-dev#12963 Author: Matt Roeschke <[email protected]> Closes pandas-dev#15738 from mroeschke/fix_12963 and squashes the following commits: a020c10 [Matt Roeschke] PERF: Improve drop_duplicates for bool columns (pandas-dev#12963)
git diff upstream/master | flake8 --diff
Converts
bool
columns toint
when callingdrop_duplicates()
so the factorization does not go down theobject
path.