
PERF: DataFrame.duplicated with subset= for 1 column is slower than Series.duplicated #45236


Closed
3 tasks done
victorlin opened this issue Jan 6, 2022 · 2 comments · Fixed by #45534
Labels: duplicated, Performance

@victorlin

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this issue exists on the latest version of pandas.

  • I have confirmed this issue exists on the master branch of pandas.

Reproducible Example

Set up:

import pandas as pd

df = pd.DataFrame({
    'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
    'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
    'rating': [4, 4, 3.5, 15, 5]
})

The following are multiple ways to check for duplicates of the 'brand' column.

  1. Using subset=['brand'] averages 176 µs:

    %timeit df.duplicated(subset=['brand'])
    # 176 µs ± 2.61 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

  2. Using [] indexing averages 31.2 µs:

    %timeit df['brand'].duplicated()
    # 31.2 µs ± 213 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

  3. Using .loc[:,'brand'] indexing averages 40.6 µs:

    %timeit df.loc[:,'brand'].duplicated()
    # 40.6 µs ± 141 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

(1) is the exact example used in the docs for pandas.DataFrame.duplicated, yet it is the slowest of the three: roughly 5.6 times slower than (2).
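
For anyone reproducing this outside IPython, here is a minimal sketch that first checks the three calls agree and then times them with the standard-library timeit module instead of the %timeit magic. The variable names and the loop count of 1000 are my own choices, and absolute timings will differ by machine and pandas version.

```python
import timeit

import pandas as pd

df = pd.DataFrame({
    'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
    'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
    'rating': [4, 4, 3.5, 15, 5],
})

# The three spellings from the list above.
via_subset = df.duplicated(subset=['brand'])
via_getitem = df['brand'].duplicated()
via_loc = df.loc[:, 'brand'].duplicated()

# Compare values only; the Series names may differ between the calls.
print(via_subset.tolist())
print(via_subset.tolist() == via_getitem.tolist() == via_loc.tolist())

# Rough per-call timings (seconds per call, averaged over `loops` calls).
loops = 1000  # arbitrary choice for this sketch
candidates = {
    "subset=['brand']": lambda: df.duplicated(subset=['brand']),
    "df['brand']": lambda: df['brand'].duplicated(),
    "df.loc[:, 'brand']": lambda: df.loc[:, 'brand'].duplicated(),
}
timings = {
    label: timeit.timeit(fn, number=loops) / loops
    for label, fn in candidates.items()
}
for label, seconds in timings.items():
    print(f"{label:>20}: {seconds * 1e6:.1f} µs per call")
```

All three calls mark the same rows as duplicates, so the difference is purely in how long they take.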

Installed Versions

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-2-3d232a07e144> in <module>
----> 1 pd.show_versions()

/opt/homebrew/Caskroom/miniconda/base/envs/pandas/lib/python3.10/site-packages/pandas/util/_print_versions.py in show_versions(as_json)
    107     """
    108     sys_info = _get_sys_info()
--> 109     deps = _get_dependency_info()
    110 
    111     if as_json:

/opt/homebrew/Caskroom/miniconda/base/envs/pandas/lib/python3.10/site-packages/pandas/util/_print_versions.py in _get_dependency_info()
     86     result: dict[str, JSONSerializable] = {}
     87     for modname in deps:
---> 88         mod = import_optional_dependency(modname, errors="ignore")
     89         result[modname] = get_version(mod) if mod else None
     90     return result

/opt/homebrew/Caskroom/miniconda/base/envs/pandas/lib/python3.10/site-packages/pandas/compat/_optional.py in import_optional_dependency(name, extra, errors, min_version)
    113     )
    114     try:
--> 115         module = importlib.import_module(name)
    116     except ImportError:
    117         if errors == "raise":

/opt/homebrew/Caskroom/miniconda/base/envs/pandas/lib/python3.10/importlib/__init__.py in import_module(name, package)
    124                 break
    125             level += 1
--> 126     return _bootstrap._gcd_import(name[level:], package, level)
    127 
    128 

/opt/homebrew/Caskroom/miniconda/base/envs/pandas/lib/python3.10/importlib/_bootstrap.py in _gcd_import(name, package, level)

/opt/homebrew/Caskroom/miniconda/base/envs/pandas/lib/python3.10/importlib/_bootstrap.py in _find_and_load(name, import_)

/opt/homebrew/Caskroom/miniconda/base/envs/pandas/lib/python3.10/importlib/_bootstrap.py in _find_and_load_unlocked(name, import_)

/opt/homebrew/Caskroom/miniconda/base/envs/pandas/lib/python3.10/importlib/_bootstrap.py in _load_unlocked(spec)

/opt/homebrew/Caskroom/miniconda/base/envs/pandas/lib/python3.10/importlib/_bootstrap_external.py in exec_module(self, module)

/opt/homebrew/Caskroom/miniconda/base/envs/pandas/lib/python3.10/importlib/_bootstrap.py in _call_with_frames_removed(f, *args, **kwds)

/opt/homebrew/Caskroom/miniconda/base/envs/pandas/lib/python3.10/site-packages/setuptools/__init__.py in <module>
      6 import re
      7 
----> 8 import _distutils_hack.override  # noqa: F401
      9 
     10 import distutils.core

/opt/homebrew/Caskroom/miniconda/base/envs/pandas/lib/python3.10/site-packages/_distutils_hack/override.py in <module>
----> 1 __import__('_distutils_hack').do_override()

/opt/homebrew/Caskroom/miniconda/base/envs/pandas/lib/python3.10/site-packages/_distutils_hack/__init__.py in do_override()
     71     if enabled():
     72         warn_distutils_present()
---> 73         ensure_local_distutils()
     74 
     75 

/opt/homebrew/Caskroom/miniconda/base/envs/pandas/lib/python3.10/site-packages/_distutils_hack/__init__.py in ensure_local_distutils()
     59     # check that submodules load as expected
     60     core = importlib.import_module('distutils.core')
---> 61     assert '_distutils' in core.__file__, core.__file__
     62 
     63 

AssertionError: /opt/homebrew/Caskroom/miniconda/base/envs/pandas/lib/python3.10/distutils/core.py

I don't think this is expected 😳. This is how the environment was set up:

mamba create -n pandas pandas ipython -y

In case it helps, here is the output of conda list:

# packages in environment at /opt/homebrew/Caskroom/miniconda/base/envs/pandas:
#
# Name                    Version                   Build  Channel
appnope                   0.1.2           py310h2ec42d9_2    conda-forge
backcall                  0.2.0              pyh9f0ad1d_0    conda-forge
backports                 1.0                        py_2    conda-forge
backports.functools_lru_cache 1.6.4              pyhd8ed1ab_0    conda-forge
bzip2                     1.0.8                h0d85af4_4    conda-forge
ca-certificates           2021.10.8            h033912b_0    conda-forge
decorator                 5.1.0              pyhd8ed1ab_0    conda-forge
ipython                   7.31.0          py310h2ec42d9_0    conda-forge
jedi                      0.18.1          py310h2ec42d9_0    conda-forge
libblas                   3.9.0           12_osx64_openblas    conda-forge
libcblas                  3.9.0           12_osx64_openblas    conda-forge
libcxx                    12.0.1               habf9029_1    conda-forge
libffi                    3.4.2                h0d85af4_5    conda-forge
libgfortran               5.0.0           9_3_0_h6c81a4c_23    conda-forge
libgfortran5              9.3.0               h6c81a4c_23    conda-forge
liblapack                 3.9.0           12_osx64_openblas    conda-forge
libopenblas               0.3.18          openmp_h3351f45_0    conda-forge
libzlib                   1.2.11            h9173be1_1013    conda-forge
llvm-openmp               12.0.1               hda6cdc1_1    conda-forge
matplotlib-inline         0.1.3              pyhd8ed1ab_0    conda-forge
ncurses                   6.2                  h2e338ed_4    conda-forge
numpy                     1.22.0          py310hfbbbacf_0    conda-forge
openssl                   3.0.0                h0d85af4_2    conda-forge
pandas                    1.3.5           py310hdd25497_0    conda-forge
parso                     0.8.3              pyhd8ed1ab_0    conda-forge
pexpect                   4.8.0              pyh9f0ad1d_2    conda-forge
pickleshare               0.7.5                   py_1003    conda-forge
pip                       21.3.1             pyhd8ed1ab_0    conda-forge
prompt-toolkit            3.0.24             pyha770c72_0    conda-forge
ptyprocess                0.7.0              pyhd3deb0d_0    conda-forge
pygments                  2.11.1             pyhd8ed1ab_0    conda-forge
python                    3.10.1          h38b4d05_2_cpython    conda-forge
python-dateutil           2.8.2              pyhd8ed1ab_0    conda-forge
python_abi                3.10                    2_cp310    conda-forge
pytz                      2021.3             pyhd8ed1ab_0    conda-forge
readline                  8.1                  h05e3726_0    conda-forge
setuptools                60.2.0          py310h2ec42d9_0    conda-forge
six                       1.16.0             pyh6c4a22f_0    conda-forge
sqlite                    3.37.0               h23a322b_0    conda-forge
tk                        8.6.11               h5dbffcc_1    conda-forge
traitlets                 5.1.1              pyhd8ed1ab_0    conda-forge
tzdata                    2021e                he74cb21_0    conda-forge
wcwidth                   0.2.5              pyh9f0ad1d_2    conda-forge
wheel                     0.37.1             pyhd8ed1ab_0    conda-forge
xz                        5.2.5                haf1e3a3_1    conda-forge
zlib                      1.2.11            h9173be1_1013    conda-forge

Prior Performance

No response

@victorlin added the Needs Triage and Performance labels on Jan 6, 2022
@jreback
Contributor

jreback commented Jan 7, 2022

pls use a bigger frame for the comparison - this could easily be 1 function call difference
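
One way to tell a fixed per-call overhead apart from a cost that grows with the data is to time both calls at several frame sizes. The sketch below does that; the helper name time_both, the sizes, and the repeat count are illustrative assumptions, not something from pandas or from this thread.

```python
import timeit

import numpy as np
import pandas as pd


def time_both(n_rows, repeats=5):
    """Return best-of-`repeats` timings (seconds) for both spellings.

    Hypothetical helper for this sketch; not part of pandas.
    """
    rng = np.random.default_rng(0)
    df = pd.DataFrame(
        rng.choice(['foo', 'bar', 'baz'], size=(n_rows, 3)),
        columns=['a', 'b', 'c'],
    )
    t_subset = min(timeit.repeat(
        lambda: df.duplicated(subset=['a']), number=1, repeat=repeats))
    t_series = min(timeit.repeat(
        lambda: df['a'].duplicated(), number=1, repeat=repeats))
    return t_subset, t_series


# If the gap were only fixed overhead, the ratio should approach 1 as n grows.
results = {n: time_both(n) for n in (1_000, 10_000, 100_000)}
for n, (t_subset, t_series) in results.items():
    print(f"n={n:>7}: subset={t_subset * 1e3:.3f} ms, "
          f"series={t_series * 1e3:.3f} ms, ratio={t_subset / t_series:.2f}")
```

A ratio that stays well above 1 at the larger sizes would suggest the difference is more than a single extra function call.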

@victorlin
Author

@jreback good point. Here is the same comparison with 1,000,000 rows and 3 columns. subset= is still noticeably slower, by roughly 3x:

import numpy as np

df = pd.DataFrame(np.random.choice(['foo','bar','baz'], size=(1000000,3)), columns=['a','b','c'])

%timeit df.duplicated(subset=['a'])
# 81.4 ms ± 815 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit df['a'].duplicated()
# 25.2 ms ± 103 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit df.loc[:,'a'].duplicated()
# 25.4 ms ± 188 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
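
Until this is addressed in pandas itself, a user-side workaround could route a one-column subset through Series.duplicated. The sketch below assumes exactly that; duplicated_single_fast is a hypothetical helper name of mine, not a pandas API, and it uses a smaller frame than the one above so it runs quickly.

```python
import numpy as np
import pandas as pd


def duplicated_single_fast(frame, subset, keep='first'):
    """Hypothetical workaround, not a pandas API.

    For a one-column subset on a frame with unique column labels, use
    Series.duplicated; otherwise fall back to DataFrame.duplicated.
    """
    subset = list(subset)
    if len(subset) == 1 and frame.columns.is_unique:
        result = frame[subset[0]].duplicated(keep=keep)
        # DataFrame.duplicated returns an unnamed Series, so drop the name.
        result.name = None
        return result
    return frame.duplicated(subset=subset, keep=keep)


rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.choice(['foo', 'bar', 'baz'], size=(100_000, 3)),
    columns=['a', 'b', 'c'],
)

# The helper should agree with DataFrame.duplicated in both branches.
fast_one = duplicated_single_fast(df, ['a'])
fast_two = duplicated_single_fast(df, ['a', 'b'])
print(fast_one.equals(df.duplicated(subset=['a'])))
print(fast_two.equals(df.duplicated(subset=['a', 'b'])))
```

The check on unique column labels matters because frame[label] returns a DataFrame rather than a Series when the label is duplicated.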

@jbrockmendel added the duplicated label on Jan 10, 2022
@mroeschke removed the Needs Triage label on Jan 13, 2022
@jreback added this to the 1.5 milestone on Jan 21, 2022