PERF: asv for select_dtypes #14588

Closed
simonm3 opened this issue Nov 4, 2016 · 6 comments · Fixed by #36839
Labels: Dtype Conversions, good first issue, Performance

simonm3 commented Nov 4, 2016

Why is select_dtypes so slow?

%timeit [col for col in df.columns if np.issubdtype(df[col].dtype, np.number)]
453 microsecs per loop

%timeit df.select_dtypes(include=[np.number])
4.58 secs per loop

Contributor

jreback commented Nov 4, 2016

You would have to show a full example and the output of pd.show_versions(), as the instructions indicate.

Author

simonm3 commented Nov 4, 2016

df = pd.DataFrame(np.random.randn(100000, 4), columns=list('ABCD'))
%timeit [col for col in df.columns if np.issubdtype(df[col].dtype, np.number)]

The slowest run took 13.72 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 35.5 µs per loop

%timeit df.select_dtypes(include=[np.number])

100 loops, best of 3: 3.41 ms per loop

#######################################################

INSTALLED VERSIONS

commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 78 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.19.0
nose: 1.3.7
pip: 8.1.2
setuptools: 28.7.1
Cython: 0.24
numpy: 1.11.2
scipy: 0.17.1
statsmodels: 0.8.0rc1
xarray: None
IPython: 4.2.0
sphinx: 1.3.1
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.7
blosc: None
bottleneck: 1.1.0
tables: 3.2.2
numexpr: 2.6.0
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.2
lxml: 3.6.0
bs4: 4.4.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.13
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.40.0
pandas_datareader: None

@TomAugspurger
Contributor

@simonm3 You might want to take a look at the definition in https://github.com/pandas-dev/pandas/blob/2e77536bdf90ef20fefd4eab751447918e07668f/pandas/core/frame.py and maybe do some profiling to see where the time is spent.

Before you do any profiling / more benchmarking, make sure the results are equivalent. For example, select_dtypes also has an exclude argument that your version doesn't accept.
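
A minimal sketch of both suggestions, assuming the same frame as in the report: first check that the two approaches agree on which columns are numeric, then profile repeated select_dtypes calls with the standard-library cProfile. The loop count and the number of printed rows are arbitrary choices for illustration.

```python
import cProfile
import pstats

import numpy as np
import pandas as pd

# Same frame as in the report above.
df = pd.DataFrame(np.random.randn(100000, 4), columns=list("ABCD"))

# The list comprehension only yields column labels...
numeric_cols = [col for col in df.columns if np.issubdtype(df[col].dtype, np.number)]

# ...whereas select_dtypes returns a new DataFrame holding the data.
selected = df.select_dtypes(include=[np.number])

# Check that both approaches agree on which columns are numeric.
assert list(selected.columns) == numeric_cols

# Profile repeated calls to see where select_dtypes spends its time.
profiler = cProfile.Profile()
profiler.enable()
for _ in range(100):
    df.select_dtypes(include=[np.number])
profiler.disable()

# Show the ten entries with the largest cumulative time.
stats = pstats.Stats(profiler).sort_stats("cumulative")
stats.print_stats(10)
```

The equivalence check covers only the include case; the list comprehension has no counterpart to exclude.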

Contributor

jreback commented Nov 12, 2016

In [7]: %timeit df.select_dtypes(include=[np.number])
1000 loops, best of 3: 1.66 ms per loop

In [8]: %timeit df.select_dtypes(include=[np.float])
1000 loops, best of 3: 1.66 ms per loop

In [9]: %timeit df.select_dtypes(include=['float'])
1000 loops, best of 3: 1.7 ms per loop

In [10]: %timeit df.select_dtypes(include=['object'])
1000 loops, best of 3: 908 µs per loop

In [11]: %timeit df.copy()
1000 loops, best of 3: 585 µs per loop

In [12]: %timeit df.loc[:,df.columns]
1000 loops, best of 3: 893 µs per loop

This is on master (post 0.19.1).
In the end, select_dtypes is basically doing [12] (in this case), so it's about 2x slower than that. Maybe something is going on, but it seems very minor at best.

@simonm3 your comparison is not apt: your list comprehension only collects the column names, whereas select_dtypes selects the actual data itself.
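
A like-for-like comparison might look like the sketch below, where the name-based approach also selects the data. The helper names select_by_names and select_by_dtypes are hypothetical, introduced only for this illustration, and the timings will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(100000, 4), columns=list("ABCD"))


def select_by_names(frame):
    """Hypothetical helper: find numeric column names, then select the data."""
    cols = [c for c in frame.columns if np.issubdtype(frame[c].dtype, np.number)]
    return frame[cols]


def select_by_dtypes(frame):
    """Hypothetical helper: let pandas pick the numeric columns."""
    return frame.select_dtypes(include=[np.number])


# Both helpers should return the same data before any timing is compared.
assert select_by_names(df).equals(select_by_dtypes(df))

for func in (select_by_names, select_by_dtypes):
    seconds = timeit.timeit(lambda: func(df), number=100)
    print(f"{func.__name__}: {seconds / 100 * 1e6:.1f} µs per call")
```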

All of that said, I will repurpose this issue to have some asv benchmarks for this.
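
For readers unfamiliar with asv (airspeed velocity), a benchmark is a class with a setup method and one or more time_* methods. The sketch below is only an illustration of that shape: the class name SelectDtypes, the n_columns parameter, and its values are assumptions, not the benchmark that was eventually merged.

```python
import numpy as np
import pandas as pd


class SelectDtypes:
    # asv runs each time_* method once per value in params.
    params = [10, 100, 1000]
    param_names = ["n_columns"]

    def setup(self, n_columns):
        # Frame construction happens here so it is not part of the timing.
        self.df = pd.DataFrame(np.random.randn(100, n_columns))

    def time_select_dtypes_number(self, n_columns):
        # Every column matches, so the whole frame is selected.
        self.df.select_dtypes(include=[np.number])

    def time_select_dtypes_object(self, n_columns):
        # No column matches, so the result has zero columns.
        self.df.select_dtypes(include=["object"])
```

Varying the column count rather than the row count is a deliberate guess here, on the assumption that the dtype check runs per column.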

jreback added the Difficulty Novice, Dtype Conversions, and Performance labels Nov 12, 2016
jreback added this to the Next Major Release milestone Nov 12, 2016
jreback changed the title from "Why is select_dtypes so slow?" to "PERF: asv for select_dtypes" Nov 12, 2016

@hermidalc

Still extremely slow as of pandas 0.25.3

@avinashpancham
Contributor

take

jreback modified the milestones: Contributions Welcome, 1.2 Oct 5, 2020