PERF: Improve drop_duplicates for bool columns (#12963) #15738

mroeschke · 2017-03-20T02:05:03Z

closes drop_duplicates slow for data frame of boolean columns #12963
tests passed
passes git diff upstream/master | flake8 --diff
whatsnew entry

Converts bool columns to int when calling drop_duplicates() so the factorization does not go down the object path.

      before           after         ratio
     [9ab57dc5]       [4b05fccb]
-     10.2±0.01ms      4.68±0.01ms     0.46  reindex.Duplicates.time_frame_drop_dups_bool

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

codecov · 2017-03-20T02:37:02Z

Codecov Report

Merging #15738 into master will decrease coverage by 0.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #15738      +/-   ##
==========================================
- Coverage   91.02%      91%   -0.02%     
==========================================
  Files         143      143              
  Lines       49392    49395       +3     
==========================================
- Hits        44957    44951       -6     
- Misses       4435     4444       +9

Impacted Files	Coverage Δ
pandas/core/algorithms.py	`94.5% <100%> (+0.02%)`	⬆️
pandas/io/gbq.py	`25% <0%> (-58.34%)`	⬇️
pandas/core/frame.py	`97.86% <0%> (-0.1%)`	⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update bff47f2...a020c10. Read the comment docs.

jorisvandenbossche · 2017-03-20T08:30:30Z

pandas/core/algorithms.py

@@ -343,6 +344,8 @@ def factorize(values, sort=False, order=None, na_sentinel=-1, size_hint=None):
            vals = values.view(np.int64)
    else:
        vals = np.asarray(values)
+        if is_bool_dtype(vals):
+            vals = vals.astype(int)


I think you will have to add there dtype = bool? (or dtype = values.dtype)
Otherwise the uniques will also be 0/1

better to have this as part of the if/else instead (the outer one) as its more explicit.

dont use .astype (which copies), instead

In [3]: np.asarray(Series([True, False, True])).view('uint8') Out[3]: array([1, 0, 1], dtype=uint8)

and as @jorisvandenbossche suggest, needs dtype=bool for the reconstruction part of the uniques.

jreback · 2017-03-20T12:07:32Z

@mroeschke ping when you have made those changes.

I suspect we are incorrectly handling bool in other algos as well (do as separate issues), e.g. value_counts, duplicated, rank

chris-b1 · 2017-03-20T17:52:37Z

This looks OK to me (w/ @jreback change)

Can be a separate issue, but I think bool handling throughout these algos could made a lot faster by skipping the hashing altogether? We could define a BoolHashTable to dispatch to, where the factorization is simply the uint8 view on the data, and the other methods could take various shortcuts. E.g. unique could be something like below.

That said, may not be worth the added complexity.

%%cython
from numpy cimport uint8_t
cimport cython

import numpy as np

@cython.boundscheck(False)
@cython.wraparound(False)
def unique(uint8_t[:] data):
    cdef:
        bint seen_true = False
        bint seen_false = False
        Py_ssize_t i, N = len(data)
        uint8_t val
    with nogil:
        for i in range(N):
            val = data[i]
            if val == 0:
                seen_false = True
            elif val == 1:
                seen_true = True
            if seen_true and seen_false:
                break
    if seen_true and seen_false:
        return np.array([True, False])
    elif seen_true:
        return np.array([True])
    elif seen_false:
        return np.array([False])
    else:
        return np.array([], dtype=bool)

In [35]: bools = np.array([False, True] * 100000)

In [36]: %timeit pd.unique(bools.view('uint8')).astype(bool)
1000 loops, best of 3: 1.74 ms per loop

In [37]: %timeit unique(bools.view('uint8'))
100000 loops, best of 3: 3.47 µs per loop

jreback · 2017-03-20T17:59:54Z

@chris-b1 It would be pretty easy to do this and maybe not too much extra code. can you hack up the .drop_duplicates from the issue and see how substantive that perf difference is? (my naive tests indicates that it should be way faster FYI).

chris-b1 · 2017-03-20T18:33:52Z

hacked up impl

def bool_factorize(values):
    values = values.view('uint8')
    return values, unique(values)

def drop_duplicates_bool(df):
    # 0.19.2 imports
    from pandas.core.groupby import get_group_index
    from pandas.hashtable import duplicated_int64
    keep = 'first'
    def f(vals):
        labels, shape = bool_factorize(vals)
        return labels.astype('i8', copy=False), len(shape)

    vals = (df[col].values for col in df)
    labels, shape = map(list, zip(*map(f, vals)))

    ids = get_group_index(labels, shape, sort=False, xnull=False)
    duplicated = pd.Series(duplicated_int64(ids, keep), index=df.index)
    return df[-duplicated]

I get another 2x speedup over casting to int.

In [67]: %timeit drop_duplicates_bool(df)
1 loop, best of 3: 362 ms per loop

In [68]: %timeit df.astype('uint8').drop_duplicates().astype(bool)
1 loop, best of 3: 868 ms per loop

In [69]: from pandas.util.testing import assert_frame_equal
In [70]: assert_frame_equal(drop_duplicates_bool(df), df.drop_duplicates())

chris-b1 · 2017-03-20T18:45:30Z

It doesn't - apparently numpy doesn't respect copy=False between bool / uint8?

In [84]: %timeit df.values.astype('uint8', copy=False)
10 loops, best of 3: 34.8 ms per loop

In [85]: %timeit df.values.astype('uint8', copy=True)
10 loops, best of 3: 34.9 ms per loop

In [86]: %timeit df.values.view('uint8')
100000 loops, best of 3: 8.37 µs per loop

jreback · 2017-03-20T18:48:26Z

@chris-b1 yeah I erased that as soon as I wrote it :<

the uint8 -> int64 for duplicated -> bool

so it probably does copy.

you might want to do a hacked duplicated_bool and see how much that makes a difference (if too much work don't, though that impl should be much like unique)

Add whatsnew Add dtype label and reorg logic

jreback · 2017-03-20T22:44:29Z

this looks good.

@chris-b1 going to merge this. But certainly can look at a BoolHashTable in a future issue. (pls create one).

jorisvandenbossche · 2017-03-20T22:46:20Z

@jreback before you merge, I think this can use some tests

jreback · 2017-03-20T22:50:01Z

@jorisvandenbossche hmm, we are already testing drop_duplicates quite thoroughly

jreback · 2017-03-20T22:53:58Z

https://github.com/pandas-dev/pandas/blob/master/pandas/tests/test_base.py#L775 tests all dtypes

jorisvandenbossche · 2017-03-20T22:54:38Z

Well, but the comment I made was not catched by any test?

jreback · 2017-03-20T23:02:02Z

Well, but the comment I made was not catched by any test?

hmm, that's a good point. The tests I pointed, check the index.

and tests in pandas/tests/series/test_analytics are pretty minimal. hmm don't even have datetime tests...ok will create an issue.

mroeschke · 2017-03-20T23:16:36Z

@jorisvandenbossche Looks like tests passed even without dtype=bool because duplicated (which calls factorize) only uses the len of the uniques i.e. len([True, False]) == len([1, 0]) https://github.com/pandas-dev/pandas/blob/master/pandas/core/frame.py#L3229

Nonetheless, good to add more test to drop_duplicates() and maybe a factorize test for the bool data in pandas/tests/test_algos.py

closes pandas-dev#12963 Author: Matt Roeschke <[email protected]> Closes pandas-dev#15738 from mroeschke/fix_12963 and squashes the following commits: a020c10 [Matt Roeschke] PERF: Improve drop_duplicates for bool columns (pandas-dev#12963)

jorisvandenbossche added the Performance Memory or execution speed performance label Mar 20, 2017

jorisvandenbossche reviewed Mar 20, 2017

View reviewed changes

jreback added this to the 0.20.0 milestone Mar 20, 2017

PERF: Improve drop_duplicates for bool columns (pandas-dev#12963)

a020c10

Add whatsnew Add dtype label and reorg logic

mroeschke force-pushed the fix_12963 branch from 9334282 to a020c10 Compare March 20, 2017 22:02

jreback closed this in f2e942e Mar 20, 2017

chris-b1 mentioned this pull request Mar 20, 2017

PERF: BoolHashTable #15751

Closed

3 tasks

jreback mentioned this pull request Mar 20, 2017

TST: full dtype tests for .drop_duplicates and .duplicated #15752

Closed

mroeschke deleted the fix_12963 branch December 20, 2017 02:00

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PERF: Improve drop_duplicates for bool columns (#12963) #15738

PERF: Improve drop_duplicates for bool columns (#12963) #15738

mroeschke commented Mar 20, 2017

codecov bot commented Mar 20, 2017 •

edited

Loading

jorisvandenbossche Mar 20, 2017

jreback Mar 20, 2017 •

edited

Loading

jreback commented Mar 20, 2017

chris-b1 commented Mar 20, 2017

jreback commented Mar 20, 2017 •

edited

Loading

chris-b1 commented Mar 20, 2017 •

edited

Loading

chris-b1 commented Mar 20, 2017

jreback commented Mar 20, 2017

jreback commented Mar 20, 2017

jorisvandenbossche commented Mar 20, 2017

jreback commented Mar 20, 2017

jreback commented Mar 20, 2017

jorisvandenbossche commented Mar 20, 2017

jreback commented Mar 20, 2017

mroeschke commented Mar 20, 2017

PERF: Improve drop_duplicates for bool columns (#12963) #15738

PERF: Improve drop_duplicates for bool columns (#12963) #15738

Conversation

mroeschke commented Mar 20, 2017

codecov bot commented Mar 20, 2017 • edited Loading

Codecov Report

jorisvandenbossche Mar 20, 2017

Choose a reason for hiding this comment

jreback Mar 20, 2017 • edited Loading

Choose a reason for hiding this comment

jreback commented Mar 20, 2017

chris-b1 commented Mar 20, 2017

jreback commented Mar 20, 2017 • edited Loading

chris-b1 commented Mar 20, 2017 • edited Loading

chris-b1 commented Mar 20, 2017

jreback commented Mar 20, 2017

jreback commented Mar 20, 2017

jorisvandenbossche commented Mar 20, 2017

jreback commented Mar 20, 2017

jreback commented Mar 20, 2017

jorisvandenbossche commented Mar 20, 2017

jreback commented Mar 20, 2017

mroeschke commented Mar 20, 2017

codecov bot commented Mar 20, 2017 •

edited

Loading

jreback Mar 20, 2017 •

edited

Loading

jreback commented Mar 20, 2017 •

edited

Loading

chris-b1 commented Mar 20, 2017 •

edited

Loading