Skip to content

PERF: Improve drop_duplicates for bool columns (#12963) #15738

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

mroeschke
Copy link
Member

Converts bool columns to int when calling drop_duplicates() so the factorization does not go down the object path.

      before           after         ratio
     [9ab57dc5]       [4b05fccb]
-     10.2±0.01ms      4.68±0.01ms     0.46  reindex.Duplicates.time_frame_drop_dups_bool

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

@codecov
Copy link

codecov bot commented Mar 20, 2017

Codecov Report

Merging #15738 into master will decrease coverage by 0.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #15738      +/-   ##
==========================================
- Coverage   91.02%      91%   -0.02%     
==========================================
  Files         143      143              
  Lines       49392    49395       +3     
==========================================
- Hits        44957    44951       -6     
- Misses       4435     4444       +9
Impacted Files Coverage Δ
pandas/core/algorithms.py 94.5% <100%> (+0.02%) ⬆️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/frame.py 97.86% <0%> (-0.1%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update bff47f2...a020c10. Read the comment docs.

@jorisvandenbossche jorisvandenbossche added the Performance Memory or execution speed performance label Mar 20, 2017
@@ -343,6 +344,8 @@ def factorize(values, sort=False, order=None, na_sentinel=-1, size_hint=None):
vals = values.view(np.int64)
else:
vals = np.asarray(values)
if is_bool_dtype(vals):
vals = vals.astype(int)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you will have to add there dtype = bool? (or dtype = values.dtype)
Otherwise the uniques will also be 0/1

Copy link
Contributor

@jreback jreback Mar 20, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better to have this as part of the if/else instead (the outer one) as its more explicit.

dont use .astype (which copies), instead

In [3]: np.asarray(Series([True, False, True])).view('uint8')
Out[3]: array([1, 0, 1], dtype=uint8)

and as @jorisvandenbossche suggest, needs dtype=bool for the reconstruction part of the uniques.

@jreback jreback added this to the 0.20.0 milestone Mar 20, 2017
@jreback
Copy link
Contributor

jreback commented Mar 20, 2017

@mroeschke ping when you have made those changes.

I suspect we are incorrectly handling bool in other algos as well (do as separate issues), e.g. value_counts, duplicated, rank

@chris-b1
Copy link
Contributor

This looks OK to me (w/ @jreback change)

Can be a separate issue, but I think bool handling throughout these algos could made a lot faster by skipping the hashing altogether? We could define a BoolHashTable to dispatch to, where the factorization is simply the uint8 view on the data, and the other methods could take various shortcuts. E.g. unique could be something like below.

That said, may not be worth the added complexity.

%%cython
from numpy cimport uint8_t
cimport cython

import numpy as np

@cython.boundscheck(False)
@cython.wraparound(False)
def unique(uint8_t[:] data):
    cdef:
        bint seen_true = False
        bint seen_false = False
        Py_ssize_t i, N = len(data)
        uint8_t val
    with nogil:
        for i in range(N):
            val = data[i]
            if val == 0:
                seen_false = True
            elif val == 1:
                seen_true = True
            if seen_true and seen_false:
                break
    if seen_true and seen_false:
        return np.array([True, False])
    elif seen_true:
        return np.array([True])
    elif seen_false:
        return np.array([False])
    else:
        return np.array([], dtype=bool)

In [35]: bools = np.array([False, True] * 100000)

In [36]: %timeit pd.unique(bools.view('uint8')).astype(bool)
1000 loops, best of 3: 1.74 ms per loop

In [37]: %timeit unique(bools.view('uint8'))
100000 loops, best of 3: 3.47 µs per loop

@jreback
Copy link
Contributor

jreback commented Mar 20, 2017

@chris-b1 It would be pretty easy to do this and maybe not too much extra code. can you hack up the .drop_duplicates from the issue and see how substantive that perf difference is? (my naive tests indicates that it should be way faster FYI).

@chris-b1
Copy link
Contributor

chris-b1 commented Mar 20, 2017

hacked up impl

def bool_factorize(values):
    values = values.view('uint8')
    return values, unique(values)

def drop_duplicates_bool(df):
    # 0.19.2 imports
    from pandas.core.groupby import get_group_index
    from pandas.hashtable import duplicated_int64
    keep = 'first'
    def f(vals):
        labels, shape = bool_factorize(vals)
        return labels.astype('i8', copy=False), len(shape)

    vals = (df[col].values for col in df)
    labels, shape = map(list, zip(*map(f, vals)))

    ids = get_group_index(labels, shape, sort=False, xnull=False)
    duplicated = pd.Series(duplicated_int64(ids, keep), index=df.index)
    return df[-duplicated]

I get another 2x speedup over casting to int.

In [67]: %timeit drop_duplicates_bool(df)
1 loop, best of 3: 362 ms per loop

In [68]: %timeit df.astype('uint8').drop_duplicates().astype(bool)
1 loop, best of 3: 868 ms per loop

In [69]: from pandas.util.testing import assert_frame_equal
In [70]: assert_frame_equal(drop_duplicates_bool(df), df.drop_duplicates())

@chris-b1
Copy link
Contributor

It doesn't - apparently numpy doesn't respect copy=False between bool / uint8?

In [84]: %timeit df.values.astype('uint8', copy=False)
10 loops, best of 3: 34.8 ms per loop

In [85]: %timeit df.values.astype('uint8', copy=True)
10 loops, best of 3: 34.9 ms per loop

In [86]: %timeit df.values.view('uint8')
100000 loops, best of 3: 8.37 µs per loop

@jreback
Copy link
Contributor

jreback commented Mar 20, 2017

@chris-b1 yeah I erased that as soon as I wrote it :<

the uint8 -> int64 for duplicated -> bool

so it probably does copy.

you might want to do a hacked duplicated_bool and see how much that makes a difference (if too much work don't, though that impl should be much like unique)

Add whatsnew

Add dtype label and reorg logic
@jreback
Copy link
Contributor

jreback commented Mar 20, 2017

this looks good.

@chris-b1 going to merge this. But certainly can look at a BoolHashTable in a future issue. (pls create one).

@jorisvandenbossche
Copy link
Member

@jreback before you merge, I think this can use some tests

@jreback jreback closed this in f2e942e Mar 20, 2017
@jreback
Copy link
Contributor

jreback commented Mar 20, 2017

@jorisvandenbossche hmm, we are already testing drop_duplicates quite thoroughly

@chris-b1 chris-b1 mentioned this pull request Mar 20, 2017
3 tasks
@jreback
Copy link
Contributor

jreback commented Mar 20, 2017

@jorisvandenbossche
Copy link
Member

Well, but the comment I made was not catched by any test?

@jreback
Copy link
Contributor

jreback commented Mar 20, 2017

Well, but the comment I made was not catched by any test?

hmm, that's a good point. The tests I pointed, check the index.

and tests in pandas/tests/series/test_analytics are pretty minimal. hmm don't even have datetime tests...ok will create an issue.

@mroeschke
Copy link
Member Author

@jorisvandenbossche Looks like tests passed even without dtype=bool because duplicated (which calls factorize) only uses the len of the uniques i.e. len([True, False]) == len([1, 0]) https://github.com/pandas-dev/pandas/blob/master/pandas/core/frame.py#L3229

Nonetheless, good to add more test to drop_duplicates() and maybe a factorize test for the bool data in pandas/tests/test_algos.py

AnkurDedania pushed a commit to AnkurDedania/pandas that referenced this pull request Mar 21, 2017
closes pandas-dev#12963

Author: Matt Roeschke <[email protected]>

Closes pandas-dev#15738 from mroeschke/fix_12963 and squashes the following commits:

a020c10 [Matt Roeschke] PERF: Improve drop_duplicates for bool columns (pandas-dev#12963)
mattip pushed a commit to mattip/pandas that referenced this pull request Apr 3, 2017
closes pandas-dev#12963

Author: Matt Roeschke <[email protected]>

Closes pandas-dev#15738 from mroeschke/fix_12963 and squashes the following commits:

a020c10 [Matt Roeschke] PERF: Improve drop_duplicates for bool columns (pandas-dev#12963)
@mroeschke mroeschke deleted the fix_12963 branch December 20, 2017 02:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

drop_duplicates slow for data frame of boolean columns
4 participants