
[PERF] Vectorize select_dtypes #28447


Merged: 16 commits merged into pandas-dev:master from 28317-select-dtypes-performance on Sep 20, 2019

Conversation

@datajanko (Contributor) commented Sep 14, 2019

I was able to run this code on 10 million columns in less than a second, which should solve the reported problem sufficiently well.
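For illustration, a minimal sketch of the vectorized masking idea; the helper name select_dtypes_vectorized is made up for this sketch, and it ignores the dtype-hierarchy handling (e.g. include="number") discussed later in the review:

```python
import numpy as np
import pandas as pd

def select_dtypes_vectorized(df, include=(), exclude=()):
    # Compute every column's dtype once and build a single boolean mask,
    # instead of inspecting the columns one by one in a Python-level loop.
    keep = np.full(df.shape[1], True)
    if include:
        keep &= df.dtypes.isin([np.dtype(d) for d in include]).to_numpy()
    if exclude:
        keep &= ~df.dtypes.isin([np.dtype(d) for d in exclude]).to_numpy()
    return df.iloc[:, keep]

df = pd.DataFrame({"a": [1], "b": [1.5], "c": ["x"]})
print(select_dtypes_vectorized(df, include=["int64", "float64"]))  # keeps columns a and b
```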

ASV Benchmark:

```
[ 75.00%] ··· frame_methods.SelectDtypes.time_select_dtypes     ok
[ 75.00%] ··· ======== =============
                  n
              -------- -------------
                  10     1.53±0.2ms
                 100     1.48±0.03ms
                1000     1.50±0.06ms
               10000     1.89±0.04ms
              100000     6.55±0.2ms
              ======== =============

[ 75.00%] · For pandas commit ad9fe5d <master^2> (round 2/2):
[ 75.00%] ·· Building for conda-py3.6-Cython-matplotlib-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt...
[ 75.00%] ·· Benchmarking conda-py3.6-Cython-matplotlib-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[100.00%] ··· frame_methods.SelectDtypes.time_select_dtypes     ok
[100.00%] ··· ======== ============
                  n
              -------- ------------
                  10     2.04±0.2ms
                 100     10.0±0.3ms
                1000     91.2±2ms
               10000     898±50ms
              100000     8.93±0.08s
              ======== ============

       before           after         ratio
     [ad9fe5d7]       [ef1b9a49]
     <master^2>       <28317-select-dtypes-performance>
•      10.0±0.3ms      1.48±0.03ms     0.15  frame_methods.SelectDtypes.time_select_dtypes(100)
•        91.2±2ms      1.50±0.06ms     0.02  frame_methods.SelectDtypes.time_select_dtypes(1000)
•        898±50ms      1.89±0.04ms     0.00  frame_methods.SelectDtypes.time_select_dtypes(10000)
•      8.93±0.08s       6.55±0.2ms     0.00  frame_methods.SelectDtypes.time_select_dtypes(100000)
```
I hope this is sufficient.

@datajanko datajanko changed the title [WIP] 28317 performance dtypes [WIP] [PERF]28317 performance dtypes Sep 14, 2019
@datajanko datajanko changed the title [WIP] [PERF]28317 performance dtypes [PERF]28317 performance dtypes Sep 14, 2019
@datajanko (Contributor, Author)

A review would be nice, as would perhaps a better name for filter_unique_dtypes_on.

@datajanko (Contributor, Author) commented Sep 14, 2019

Just the ASV benchmark with the most recent version of the code:

ASV Benchmark:

```
· Creating environments
· Discovering benchmarks
·· Uninstalling from conda-py3.6-Cython-matplotlib-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt.
·· Building 16554c6 <28317-select-dtypes-performance> for conda-py3.6-Cython-matplotlib-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt............................................
·· Installing 16554c6 <28317-select-dtypes-performance> into conda-py3.6-Cython-matplotlib-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt..
· Running 2 total benchmarks (2 commits * 1 environments * 1 benchmarks)
[  0.00%] · For pandas commit ad9fe5d (round 1/2):
[  0.00%] ·· Building for conda-py3.6-Cython-matplotlib-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt...
[  0.00%] ·· Benchmarking conda-py3.6-Cython-matplotlib-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 25.00%] ··· Running (frame_methods.SelectDtypes.time_select_dtypes--).
[ 25.00%] · For pandas commit 16554c6 <28317-select-dtypes-performance> (round 1/2):
[ 25.00%] ·· Building for conda-py3.6-Cython-matplotlib-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt...
[ 25.00%] ·· Benchmarking conda-py3.6-Cython-matplotlib-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 50.00%] ··· Running (frame_methods.SelectDtypes.time_select_dtypes--).
[ 50.00%] · For pandas commit 16554c6 <28317-select-dtypes-performance> (round 2/2):
[ 50.00%] ·· Benchmarking conda-py3.6-Cython-matplotlib-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 75.00%] ··· frame_methods.SelectDtypes.time_select_dtypes     ok
[ 75.00%] ··· ======== =============
                  n
              -------- -------------
                  10     1.37±0.1ms
                 100     1.41±0.07ms
                1000     1.56±0.3ms
               10000     1.86±0.3ms
              100000     6.43±0.3ms
              ======== =============

[ 75.00%] · For pandas commit ad9fe5d <master^2> (round 2/2):
[ 75.00%] ·· Building for conda-py3.6-Cython-matplotlib-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt...
[ 75.00%] ·· Benchmarking conda-py3.6-Cython-matplotlib-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[100.00%] ··· frame_methods.SelectDtypes.time_select_dtypes     ok
[100.00%] ··· ======== ============
                  n
              -------- ------------
                  10     2.17±0.3ms
                 100     10.3±0.2ms
                1000     95.5±4ms
               10000     919±30ms
              100000     9.46±0.03s
              ======== ============

       before           after         ratio
     [ad9fe5d7]       [16554c61]
     <master^2>       <28317-select-dtypes-performance>
•      2.17±0.3ms       1.37±0.1ms     0.63  frame_methods.SelectDtypes.time_select_dtypes(10)
•      10.3±0.2ms      1.41±0.07ms     0.14  frame_methods.SelectDtypes.time_select_dtypes(100)
•        95.5±4ms       1.56±0.3ms     0.02  frame_methods.SelectDtypes.time_select_dtypes(1000)
•        919±30ms       1.86±0.3ms     0.00  frame_methods.SelectDtypes.time_select_dtypes(10000)
•      9.46±0.03s       6.43±0.3ms     0.00  frame_methods.SelectDtypes.time_select_dtypes(100000)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.
```


@datajanko datajanko changed the title [PERF]28317 performance dtypes [PERF]28317 performance select_dtypes Sep 14, 2019
@datajanko datajanko changed the title [PERF]28317 performance select_dtypes [PERF] Vectorize select_dtypes Sep 14, 2019
@gfyoung gfyoung added Dtype Conversions Unexpected or buggy dtype conversions Performance Memory or execution speed performance labels Sep 15, 2019
@WillAyd WillAyd added this to the 1.0 milestone Sep 16, 2019
@datajanko datajanko requested a review from WillAyd September 16, 2019 20:24
@datajanko (Contributor, Author)

I'm not totally sure about the type annotations, specifically which type I should use for `type`. I think I tackled the other issues as well.

# Hence, we can just shrink the columns we want to keep
keep_these = np.full(self.shape[1], True)

def extract_unique_dtypes_from_dtypes_list(
Member

Hmm is this function even required? Would np.isin not offer similar functionality?

@datajanko (Contributor, Author) commented Sep 17, 2019

The dtypes are hierarchical, i.e. number contains e.g. timedeltas. In the tests we often see include=number and exclude=timedelta. We need to take care of that somehow, and the subclass checks do exactly that.
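As a small illustration of that hierarchy (behaviour shown here reflects NumPy's scalar-type class hierarchy and may vary between NumPy versions):

```python
import numpy as np

# timedelta64 sits under signedinteger/number in NumPy's scalar-type hierarchy,
# so a naive include="number" would also keep timedelta columns; the explicit
# subclass checks are what let exclude=timedelta filter them back out.
print(np.timedelta64.__mro__)
print(issubclass(np.timedelta64, np.number))            # True on the versions checked here
print(issubclass(np.dtype("float64").type, np.number))  # True: float64 is a number as well
```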

@datajanko datajanko force-pushed the 28317-select-dtypes-performance branch from 36973a2 to 38b1667 Compare September 17, 2019 06:39
@TomAugspurger (Contributor)

FYI, some CI checks are failing.

I think this is a generally promising approach. Will take a closer look later.

@datajanko (Contributor, Author)

Sorry, I was at work. CI is only failing due to the type annotation; I will try to figure it out. The last commit has an unfortunate name. I was just fixing a wrong export from pandas._typing.

@datajanko (Contributor, Author)

Hmm, sorry, I don't get the typing annotations here:

```
pandas/core/frame.py:3492: error: Argument 1 to "tuple" has incompatible type "FrozenSet[Union[str, Any, ExtensionDtype]]"; expected "Iterable[Union[type, Tuple[Any, ...]]]"
```

tuple is called with

```python
tuple(dtypes_set)
```

and from the type hints at the beginning:

```python
dtypes_set: FrozenSet[Dtype]
```

Btw, I changed to a list comprehension since I only have one loop now, but of course I could change it back. Any hints on the typing-related stuff?

@WillAyd (Member) commented Sep 17, 2019

Seems weird - I would expect FrozenSet to be a subtype of Iterable in the typing world. Any chance you can find an existing bug report about it?

@jreback (Contributor) left a comment

small comment, otherwise lgtm. ping on green.

@datajanko (Contributor, Author)

To file a bug report for the typing issue (not sure that this is a bug, though), I'll try to create a simpler example that shows it.

Right now I have:

```python
from typing import FrozenSet, Iterable, Any, Union

from pandas._typing import ExtensionDtype, Dtype
import numpy as np

dtypes: FrozenSet[Dtype] = frozenset([np.number, np.float64])

tuple(dtypes)
```

which does not trigger the mypy error.

Or shall we just remove the annotations and keep going?

@WillAyd (Member) commented Sep 18, 2019

I think just add a `# type: ignore` to the line that is failing. It's unfortunate that this fails, but I'd prefer to keep the annotations, as they are at the very least good documentation.
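A tiny, generic illustration of that suggestion (not the pandas source; the DtypeLike alias below is made up to mirror the shape of the real Dtype alias and should trigger a similar mypy complaint):

```python
from typing import FrozenSet, Union

# Stand-in for pandas' Dtype: the set may hold strings as well as classes,
# which is what makes mypy reject tuple(...) as an issubclass classinfo argument.
DtypeLike = Union[str, type]

dtypes_set: FrozenSet[DtypeLike] = frozenset([int, float])

# Works fine at runtime; the ignore is scoped to this one line only.
print(issubclass(bool, tuple(dtypes_set)))  # type: ignore
```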

@datajanko (Contributor, Author) commented Sep 18, 2019

Okay, so I think I created an MVE:

```python
from typing import List

l: List[int] = []
issubclass(object, tuple(l))
issubclass(object, l)
```

Running mypy on this gives:

```
line 4: error: Argument 1 to "tuple" has incompatible type "List[int]"; expected "Iterable[Union[type, Tuple[Any, ...]]]"
line 5: error: Argument 2 to "issubclass" has incompatible type "List[int]"; expected "Union[type, Tuple[Union[type, Tuple[Any, ...]], ...]]"
```

The second call is only there to show what issubclass expects. We get the analogous error message

```
pandas/core/frame.py:3492: error: Argument 1 to "tuple" has incompatible type "FrozenSet[Union[str, Any, ExtensionDtype]]"; expected "Iterable[Union[type, Tuple[Any, ...]]]"
```

with Dtype. Can anybody explain this? Is it a bug? A bare

```python
tuple(l)
```

added above goes unnoticed by mypy.

If not, I'll follow the suggestion from above and just use `# type: ignore`.

@WillAyd (Member) commented Sep 18, 2019

@datajanko pretty confusing error message, but aren't those invocations wrong? Shouldn't the second argument to issubclass be a type and not an instance thereof?

@WillAyd (Member) commented Sep 18, 2019

Thinking through this, I think the definition of Dtype in pandas._typing might be wrong, and maybe that's what is actually producing the error message. Right now it's Union[str, np.dtype, "ExtensionDtype"], effectively allowing an instance of any of those types. But it might need to be Type[Union[str, np.dtype, "ExtensionDtype"]] to require the class instead of the instance (or something to that effect).

Tried that locally on your branch, but that caused some more failures. For now I'd suggest just `# type: ignore`-ing the failing line, and we can open a follow-up issue to address the above.
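Roughly the distinction being described, as a standalone illustration; DtypeInstance and DtypeClass are hypothetical names, not the pandas._typing definitions:

```python
from typing import Type, Union

import numpy as np

# Union[...] annotates *instances* of the listed types ...
DtypeInstance = Union[str, np.dtype]

# ... while Type[Union[...]] annotates the *classes* themselves, which is
# what issubclass() and tuple-of-types arguments actually expect.
DtypeClass = Type[Union[str, np.dtype]]

def takes_instance(x: DtypeInstance) -> None: ...
def takes_class(x: DtypeClass) -> None: ...

takes_instance(np.dtype("int64"))  # an instance of np.dtype
takes_class(np.dtype)              # the class itself
```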

@WillAyd (Member) commented Sep 18, 2019

cc @simonjayhawkins on previous comment

@WillAyd (Member) left a comment

Looks good. May be enough as is; just curious whether there's a way to make it more succinct.

unique_dtypes = self.dtypes.unique()

if include:
    included_dtypes = extract_unique_dtypes_from_dtypes_set(
Member

If the above is true, it would be nice not to duplicate this call in both branches.

@datajanko (Contributor, Author)

see comment below

@datajanko (Contributor, Author) commented Sep 19, 2019

@WillAyd

What works is the following:

```python
unique_dtypes = self.dtypes.unique()
extracted_dtypes = lambda y: list(filter(lambda x: issubclass(x.type, tuple(y)), unique_dtypes))

if include:
    keep_these &= self.dtypes.isin(extracted_dtypes(include))

if exclude:
    keep_these &= ~self.dtypes.isin(extracted_dtypes(exclude))

return self.iloc[:, keep_these.values]
```

Note that extracted_dtypes is a function. Do we prefer to def it? (We should probably call it extract_dtypes.) A possible def-based shape is sketched below.
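If a def is preferred, one possible shape; the name extract_dtypes and the exact signature here are illustrative, not necessarily what gets merged:

```python
from typing import FrozenSet, List

import numpy as np

from pandas._typing import Dtype


def extract_dtypes(dtypes_set: FrozenSet[Dtype], unique_dtypes: np.ndarray) -> List[Dtype]:
    # Keep only the unique column dtypes whose scalar type is a subclass of
    # anything in dtypes_set, e.g.
    #   extract_dtypes(frozenset([np.number]), df.dtypes.unique())
    return [
        unique_dtype
        for unique_dtype in unique_dtypes
        if issubclass(unique_dtype.type, tuple(dtypes_set))  # type: ignore
    ]
```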

@WillAyd (Member) commented Sep 20, 2019

Never mind then - I don't necessarily find that more readable.

@WillAyd (Member) left a comment

lgtm

@datajanko (Contributor, Author)

@jreback ping since green

@jreback jreback merged commit 8c55b6f into pandas-dev:master Sep 20, 2019
@jreback (Contributor) commented Sep 20, 2019

Thanks @datajanko, nice patch!

proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019
proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019
Labels: Dtype Conversions (Unexpected or buggy dtype conversions), Performance (Memory or execution speed performance)
Successfully merging this pull request may close: Improve DataFrame.select_dtypes scaling to wide data frames
5 participants