
[PERF] Vectorize select_dtypes #28447


Merged: 16 commits merged into pandas-dev:master from 28317-select-dtypes-performance on Sep 20, 2019

Conversation

@datajanko (Contributor) commented Sep 14, 2019

I was able to run this code on 10 million columns in less than a second, which should solve the reported problem sufficiently well.
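For illustration, a minimal sketch of the vectorized masking idea; the helper name select_dtypes_vectorized is made up for this sketch, and it ignores the dtype-hierarchy handling (e.g. include="number") discussed later in the review:

```python
import numpy as np
import pandas as pd

def select_dtypes_vectorized(df, include=(), exclude=()):
    # Compute every column's dtype once and build a single boolean mask,
    # instead of inspecting the columns one by one in a Python-level loop.
    keep = np.full(df.shape[1], True)
    if include:
        keep &= df.dtypes.isin([np.dtype(d) for d in include]).to_numpy()
    if exclude:
        keep &= ~df.dtypes.isin([np.dtype(d) for d in exclude]).to_numpy()
    return df.iloc[:, keep]

df = pd.DataFrame({"a": [1], "b": [1.5], "c": ["x"]})
print(select_dtypes_vectorized(df, include=["int64", "float64"]))  # keeps columns a and b
```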

ASV Benchmark:

```
[ 75.00%] ··· frame_methods.SelectDtypes.time_select_dtypes     ok
[ 75.00%] ··· ======== =============
                  n
              -------- -------------
                  10     1.53±0.2ms
                 100     1.48±0.03ms
                1000     1.50±0.06ms
               10000     1.89±0.04ms
              100000     6.55±0.2ms
              ======== =============

[ 75.00%] · For pandas commit ad9fe5d <master^2> (round 2/2):
[ 75.00%] ·· Building for conda-py3.6-Cython-matplotlib-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt...
[ 75.00%] ·· Benchmarking conda-py3.6-Cython-matplotlib-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[100.00%] ··· frame_methods.SelectDtypes.time_select_dtypes     ok
[100.00%] ··· ======== ============
                  n
              -------- ------------
                  10     2.04±0.2ms
                 100     10.0±0.3ms
                1000     91.2±2ms
               10000     898±50ms
              100000     8.93±0.08s
              ======== ============

       before           after         ratio
     [ad9fe5d7]       [ef1b9a49]
     <master^2>       <28317-select-dtypes-performance>
•      10.0±0.3ms      1.48±0.03ms     0.15  frame_methods.SelectDtypes.time_select_dtypes(100)
•        91.2±2ms      1.50±0.06ms     0.02  frame_methods.SelectDtypes.time_select_dtypes(1000)
•        898±50ms      1.89±0.04ms     0.00  frame_methods.SelectDtypes.time_select_dtypes(10000)
•      8.93±0.08s       6.55±0.2ms     0.00  frame_methods.SelectDtypes.time_select_dtypes(100000)
```
I hope this is sufficient.

@datajanko datajanko changed the title [WIP] 28317 performance dtypes [WIP] [PERF]28317 performance dtypes Sep 14, 2019
@datajanko datajanko changed the title [WIP] [PERF]28317 performance dtypes [PERF]28317 performance dtypes Sep 14, 2019
@datajanko (Contributor, Author)

A review would be nice, as would perhaps a better name for filter_unique_dtypes_on.

@datajanko (Contributor, Author) commented Sep 14, 2019

Just the ASV benchmark with the most recent version of the code:

ASV Benchmark:

```
· Creating environments
· Discovering benchmarks
·· Uninstalling from conda-py3.6-Cython-matplotlib-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt.
·· Building 16554c6 <28317-select-dtypes-performance> for conda-py3.6-Cython-matplotlib-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt............................................
·· Installing 16554c6 <28317-select-dtypes-performance> into conda-py3.6-Cython-matplotlib-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt..
· Running 2 total benchmarks (2 commits * 1 environments * 1 benchmarks)
[  0.00%] · For pandas commit ad9fe5d (round 1/2):
[  0.00%] ·· Building for conda-py3.6-Cython-matplotlib-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt...
[  0.00%] ·· Benchmarking conda-py3.6-Cython-matplotlib-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 25.00%] ··· Running (frame_methods.SelectDtypes.time_select_dtypes--).
[ 25.00%] · For pandas commit 16554c6 <28317-select-dtypes-performance> (round 1/2):
[ 25.00%] ·· Building for conda-py3.6-Cython-matplotlib-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt...
[ 25.00%] ·· Benchmarking conda-py3.6-Cython-matplotlib-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 50.00%] ··· Running (frame_methods.SelectDtypes.time_select_dtypes--).
[ 50.00%] · For pandas commit 16554c6 <28317-select-dtypes-performance> (round 2/2):
[ 50.00%] ·· Benchmarking conda-py3.6-Cython-matplotlib-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 75.00%] ··· frame_methods.SelectDtypes.time_select_dtypes     ok
[ 75.00%] ··· ======== =============
                  n
              -------- -------------
                  10     1.37±0.1ms
                 100     1.41±0.07ms
                1000     1.56±0.3ms
               10000     1.86±0.3ms
              100000     6.43±0.3ms
              ======== =============

[ 75.00%] · For pandas commit ad9fe5d <master^2> (round 2/2):
[ 75.00%] ·· Building for conda-py3.6-Cython-matplotlib-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt...
[ 75.00%] ·· Benchmarking conda-py3.6-Cython-matplotlib-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[100.00%] ··· frame_methods.SelectDtypes.time_select_dtypes     ok
[100.00%] ··· ======== ============
                  n
              -------- ------------
                  10     2.17±0.3ms
                 100     10.3±0.2ms
                1000     95.5±4ms
               10000     919±30ms
              100000     9.46±0.03s
              ======== ============

       before           after         ratio
     [ad9fe5d7]       [16554c61]
     <master^2>       <28317-select-dtypes-performance>
•      2.17±0.3ms       1.37±0.1ms     0.63  frame_methods.SelectDtypes.time_select_dtypes(10)
•      10.3±0.2ms      1.41±0.07ms     0.14  frame_methods.SelectDtypes.time_select_dtypes(100)
•        95.5±4ms       1.56±0.3ms     0.02  frame_methods.SelectDtypes.time_select_dtypes(1000)
•        919±30ms       1.86±0.3ms     0.00  frame_methods.SelectDtypes.time_select_dtypes(10000)
•      9.46±0.03s       6.43±0.3ms     0.00  frame_methods.SelectDtypes.time_select_dtypes(100000)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.
```


@datajanko datajanko changed the title [PERF]28317 performance dtypes [PERF]28317 performance select_dtypes Sep 14, 2019
@datajanko datajanko changed the title [PERF]28317 performance select_dtypes [PERF] Vectorize select_dtypes Sep 14, 2019
@gfyoung gfyoung added Dtype Conversions Unexpected or buggy dtype conversions Performance Memory or execution speed performance labels Sep 15, 2019
@WillAyd WillAyd added this to the 1.0 milestone Sep 16, 2019
@datajanko datajanko requested a review from WillAyd September 16, 2019 20:24
@datajanko (Contributor, Author)

I'm not totally sure about the type annotations, specifically which type I should use for `type`. I think I tackled the other issues as well.

# Hence, we can just shrink the columns we want to keep
keep_these = np.full(self.shape[1], True)

def extract_unique_dtypes_from_dtypes_list(
Member

Hmm is this function even required? Would np.isin not offer similar functionality?

@datajanko (Contributor, Author) commented Sep 17, 2019

The dtypes are hierarchical, i.e. number contains e.g. timedeltas. In the tests we often see include=number and exclude=timedelta. We need to take care of that somehow, and the subclass checks do exactly that.
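As a small illustration of that hierarchy (behaviour shown here reflects NumPy's scalar-type class hierarchy and may vary between NumPy versions):

```python
import numpy as np

# timedelta64 sits under signedinteger/number in NumPy's scalar-type hierarchy,
# so a naive include="number" would also keep timedelta columns; the explicit
# subclass checks are what let exclude=timedelta filter them back out.
print(np.timedelta64.__mro__)
print(issubclass(np.timedelta64, np.number))            # True on the versions checked here
print(issubclass(np.dtype("float64").type, np.number))  # True: float64 is a number as well
```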

@datajanko datajanko force-pushed the 28317-select-dtypes-performance branch from 36973a2 to 38b1667 Compare September 17, 2019 06:39
@TomAugspurger (Contributor)

FYI, some CI checks are failing.

I think this is a generally promising approach. Will take a closer look later.

@datajanko (Contributor, Author)

Sorry, I was at work. CI is only failing due to the type annotation; I will try to figure it out. The last commit has an unfortunate name. I was just fixing a wrong export from pandas._typing.

@datajanko (Contributor, Author)

Hmm, sorry, I don't get the typing annotations here:

```
pandas/core/frame.py:3492: error: Argument 1 to "tuple" has incompatible type "FrozenSet[Union[str, Any, ExtensionDtype]]"; expected "Iterable[Union[type, Tuple[Any, ...]]]"
```

tuple is called with

```python
tuple(dtypes_set)
```

and from the type hints at the beginning:

```python
dtypes_set: FrozenSet[Dtype]
```

Btw, I changed to a list comprehension since I only have one loop now, but of course I could change it back. Any hints on the typing-related stuff?

@WillAyd (Member) commented Sep 17, 2019

Seems weird - I would expect FrozenSet to be a subtype of Iterable in the typing world. Any chance you can find an existing bug report about it?

@jreback (Contributor) left a comment

small comment, otherwise lgtm. ping on green.

@datajanko (Contributor, Author)

To file a bug report for the typing issue (not sure that this is a bug, though), I'll try to create a simpler example that shows it.

Right now I have:

```python
from typing import FrozenSet, Iterable, Any, Union

from pandas._typing import ExtensionDtype, Dtype
import numpy as np

dtypes: FrozenSet[Dtype] = frozenset([np.number, np.float64])

tuple(dtypes)
```

which does not trigger the mypy error.

Or shall we just remove the annotations and keep going?

@WillAyd (Member) commented Sep 18, 2019

I think just add a `# type: ignore` to the line that is failing. It's unfortunate that this fails, but I'd prefer to keep the annotations, as they are at the very least good documentation.
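A tiny, generic illustration of that suggestion (not the pandas source; the DtypeLike alias below is made up to mirror the shape of the real Dtype alias and should trigger a similar mypy complaint):

```python
from typing import FrozenSet, Union

# Stand-in for pandas' Dtype: the set may hold strings as well as classes,
# which is what makes mypy reject tuple(...) as an issubclass classinfo argument.
DtypeLike = Union[str, type]

dtypes_set: FrozenSet[DtypeLike] = frozenset([int, float])

# Works fine at runtime; the ignore is scoped to this one line only.
print(issubclass(bool, tuple(dtypes_set)))  # type: ignore
```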

@datajanko (Contributor, Author) commented Sep 18, 2019

Okay, so I think I created an MVE:

```python
from typing import List

l: List[int] = []
issubclass(object, tuple(l))
issubclass(object, l)
```

Running mypy on this gives:

```
line 4: error: Argument 1 to "tuple" has incompatible type "List[int]"; expected "Iterable[Union[type, Tuple[Any, ...]]]"
line 5: error: Argument 2 to "issubclass" has incompatible type "List[int]"; expected "Union[type, Tuple[Union[type, Tuple[Any, ...]], ...]]"
```

The second call is only there to show what issubclass expects. We get the analogous error message

```
pandas/core/frame.py:3492: error: Argument 1 to "tuple" has incompatible type "FrozenSet[Union[str, Any, ExtensionDtype]]"; expected "Iterable[Union[type, Tuple[Any, ...]]]"
```

with Dtype. Can anybody explain this? Is it a bug? A bare

```python
tuple(l)
```

added above goes unnoticed by mypy.

If not, I'll follow the suggestion from above and just use `# type: ignore`.

@WillAyd (Member) commented Sep 18, 2019

@datajanko pretty confusing error message, but aren't those invocations wrong? Shouldn't the second argument to issubclass be a type and not an instance thereof?

@WillAyd (Member) commented Sep 18, 2019

Thinking through this, I think the definition of Dtype in pandas._typing might be wrong, and maybe that's what is actually producing the error message. Right now it's Union[str, np.dtype, "ExtensionDtype"], effectively allowing an instance of any of those types. But it might need to be Type[Union[str, np.dtype, "ExtensionDtype"]] to require the class instead of the instance (or something to that effect).

Tried that locally on your branch, but that caused some more failures. For now I'd suggest just `# type: ignore`-ing the failing line, and we can open a follow-up issue to address the above.
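Roughly the distinction being described, as a standalone illustration; DtypeInstance and DtypeClass are hypothetical names, not the pandas._typing definitions:

```python
from typing import Type, Union

import numpy as np

# Union[...] annotates *instances* of the listed types ...
DtypeInstance = Union[str, np.dtype]

# ... while Type[Union[...]] annotates the *classes* themselves, which is
# what issubclass() and tuple-of-types arguments actually expect.
DtypeClass = Type[Union[str, np.dtype]]

def takes_instance(x: DtypeInstance) -> None: ...
def takes_class(x: DtypeClass) -> None: ...

takes_instance(np.dtype("int64"))  # an instance of np.dtype
takes_class(np.dtype)              # the class itself
```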

@WillAyd (Member) commented Sep 18, 2019

cc @simonjayhawkins on previous comment

@WillAyd (Member) left a comment

Looks good. May be enough as is; just curious whether there's a way to make it more succinct.

unique_dtypes = self.dtypes.unique()

if include:
    included_dtypes = extract_unique_dtypes_from_dtypes_set(
Member

If the above is true, it would be nice not to duplicate this call in both branches.

@datajanko (Contributor, Author)

see comment below

@datajanko (Contributor, Author) commented Sep 19, 2019

@WillAyd

What works is the following:

```python
unique_dtypes = self.dtypes.unique()
extracted_dtypes = lambda y: list(filter(lambda x: issubclass(x.type, tuple(y)), unique_dtypes))

if include:
    keep_these &= self.dtypes.isin(extracted_dtypes(include))

if exclude:
    keep_these &= ~self.dtypes.isin(extracted_dtypes(exclude))

return self.iloc[:, keep_these.values]
```

Note that extracted_dtypes is a function. Do we prefer to def it? (We should probably call it extract_dtypes.) A possible def-based shape is sketched below.
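If a def is preferred, one possible shape; the name extract_dtypes and the exact signature here are illustrative, not necessarily what gets merged:

```python
from typing import FrozenSet, List

import numpy as np

from pandas._typing import Dtype


def extract_dtypes(dtypes_set: FrozenSet[Dtype], unique_dtypes: np.ndarray) -> List[Dtype]:
    # Keep only the unique column dtypes whose scalar type is a subclass of
    # anything in dtypes_set, e.g.
    #   extract_dtypes(frozenset([np.number]), df.dtypes.unique())
    return [
        unique_dtype
        for unique_dtype in unique_dtypes
        if issubclass(unique_dtype.type, tuple(dtypes_set))  # type: ignore
    ]
```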

@WillAyd (Member) commented Sep 20, 2019

Never mind then - I don't necessarily find that more readable.

@WillAyd (Member) left a comment

lgtm

@datajanko (Contributor, Author)

@jreback ping since green

@jreback jreback merged commit 8c55b6f into pandas-dev:master Sep 20, 2019
@jreback (Contributor) commented Sep 20, 2019

Thanks @datajanko, nice patch!

proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019
proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019
Labels: Dtype Conversions (Unexpected or buggy dtype conversions), Performance (Memory or execution speed performance)
Successfully merging this pull request may close: Improve DataFrame.select_dtypes scaling to wide data frames
5 participants