[PERF] Vectorize select_dtypes #28447
Conversation
A review would be nice, as well as maybe a better name for the helper function.
Just the ASV benchmark with the most recent code versions.

ASV Benchmark:

```
· Creating environments
· Discovering benchmarks
·· Uninstalling from conda-py3.6-Cython-matplotlib-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt.
·· Building 16554c6 <28317-select-dtypes-performance> for conda-py3.6-Cython-matplotlib-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt.
·· Installing 16554c6 <28317-select-dtypes-performance> into conda-py3.6-Cython-matplotlib-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt.
· Running 2 total benchmarks (2 commits * 1 environments * 1 benchmarks)
[  0.00%] · For pandas commit ad9fe5d (round 1/2):
[  0.00%] ·· Building for conda-py3.6-Cython-matplotlib-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[  0.00%] ·· Benchmarking conda-py3.6-Cython-matplotlib-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 25.00%] ··· Running (frame_methods.SelectDtypes.time_select_dtypes--)
[ 25.00%] · For pandas commit 16554c6 <28317-select-dtypes-performance> (round 1/2):
[ 25.00%] ·· Building for conda-py3.6-Cython-matplotlib-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 25.00%] ·· Benchmarking conda-py3.6-Cython-matplotlib-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 50.00%] ··· Running (frame_methods.SelectDtypes.time_select_dtypes--)
[ 50.00%] · For pandas commit 16554c6 <28317-select-dtypes-performance> (round 2/2):
[ 50.00%] ·· Benchmarking conda-py3.6-Cython-matplotlib-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 75.00%] ··· frame_methods.SelectDtypes.time_select_dtypes                 ok
[ 75.00%] ··· ======== =============
                  n
              -------- -------------
                  10     1.37±0.1ms
                 100     1.41±0.07ms
                1000     1.56±0.3ms
               10000     1.86±0.3ms
              100000     6.43±0.3ms
              ======== =============
[ 75.00%] · For pandas commit ad9fe5d <master^2> (round 2/2):
SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
```
- adds type hints
- removes now unnecessary code for selecting slices
- uses `iloc` access
- replaces list comprehension by loop
I'm not totally sure about the type annotations, specifically which type I should use for `type`. I think I tackled the other issues as well.
pandas/core/frame.py
Outdated
```python
# Hence, we can just shrink the columns we want to keep
keep_these = np.full(self.shape[1], True)

def extract_unique_dtypes_from_dtypes_list(
```
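For context, here is a rough, hypothetical sketch of the boolean-mask-plus-`iloc` idea the diff hints at; the function `select_by_mask` and its signature are invented for illustration and are not the PR's actual implementation:

```python
import numpy as np
import pandas as pd

def select_by_mask(df: pd.DataFrame, keep_dtypes: tuple) -> pd.DataFrame:
    # Start with every column marked as kept, switch off those whose dtype
    # does not fall under one of the requested (super)types, then select
    # the survivors in a single iloc call.
    keep_these = np.full(df.shape[1], True)
    for i, dtype in enumerate(df.dtypes):
        if not issubclass(dtype.type, keep_dtypes):
            keep_these[i] = False
    return df.iloc[:, keep_these]

df = pd.DataFrame({"a": [1, 2], "b": [1.5, 2.5], "c": ["x", "y"]})
print(select_by_mask(df, (np.integer, np.floating)).columns.tolist())  # ['a', 'b']
```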
Hmm, is this function even required? Would `np.isin` not offer similar functionality?
The dtypes are hierarchical, i.e. `number` contains e.g. timedeltas. In the tests we often see `include=number` and `exclude=timedelta`. Somehow we need to take care of that, and the subclass checks are doing that.
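To make the hierarchy point concrete, a small illustration (the frame and column names here are invented; the subclass relationship is NumPy's documented scalar hierarchy):

```python
import numpy as np
import pandas as pd

# timedelta64 sits under signedinteger in NumPy's scalar-type hierarchy,
# so a plain subclass check against "number" can also match timedeltas.
print(issubclass(np.timedelta64, np.number))  # True

df = pd.DataFrame(
    {
        "ints": np.arange(3),
        "floats": np.random.rand(3),
        "deltas": pd.to_timedelta(np.arange(3), unit="s"),
    }
)

# The pattern seen in the tests: ask for numbers, explicitly drop timedeltas.
subset = df.select_dtypes(include="number", exclude="timedelta")
print("deltas" in subset.columns)  # False
```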
36973a2 to 38b1667
FYI, some CI checks are failing. I think this is a generally promising approach. Will take a closer look later.
Sorry, I was at work. CI is only failing due to the type annotation. Will try to figure it out. The last commit has an unfortunate name; I was just fixing a wrong export.
Hmm, sorry, I don't get the typing annotations here:
and tuple is called with
and from the type hints at the beginning:
Btw, I changed back to a list comprehension since I only have one loop now. But of course I could change it back. Any hints on the typing-related stuff?
Seems weird - I would expect frozenset to be a subtype of Iterable in the typing world. Any chance you can file a bug report about it?
small comment, otherwise lgtm. ping on green.
Regarding filing a bug report for the typing issue: I'm not sure that this is a bug, though, so I'll try to create an easier example that shows the issue. Right now I have:

which does not throw the mypy error. Or shall we just remove the annotations and keep going?
I think just add a `# type: ignore`.
Okay, so I think I created an MVE:

which, using mypy, gives:

The second line is only there to show what `issubclass` expects. We have the analogous error message with `Dtype`. Can anybody explain this? Is this a bug? Adding a line above it goes unnoticed by mypy. If not, I'll follow the suggestion from above and will just use `# type: ignore`.
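Since the snippet itself is not preserved above, here is a hypothetical reconstruction of the kind of minimal example being discussed; `FakeDtype` and `check` are made-up names, and the union is only a simplified stand-in for pandas' `Dtype` alias:

```python
from typing import FrozenSet, Type, Union

# A simplified stand-in for an alias that mixes real types with non-type
# members (pandas' Dtype also allows strings and dtype objects).
FakeDtype = Union[str, Type[int], Type[float]]

def check(dtypes_set: FrozenSet[FakeDtype]) -> bool:
    # mypy flags the issubclass call: tuple(dtypes_set) is typed as a tuple
    # of the whole Union, and issubclass() only accepts a type or a tuple of
    # types as its second argument -- even if the set only ever holds real
    # types at runtime.
    return issubclass(int, tuple(dtypes_set))

# At runtime this is perfectly fine:
print(check(frozenset({int, float})))  # True
```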
@datajanko pretty confusing error message, but aren't those invocations wrong? Should the second argument to `issubclass` be a tuple of types here?
Thinking through this, I think the definition of `Dtype` needs to change. Tried that locally on your branch, but that caused some more failures. For now I'd suggest just `# type: ignore`-ing the failing line, and we can open up a follow-up issue to address the above.
cc @simonjayhawkins on previous comment
Looks good. May be enough as-is; just curious whether there isn't a way to make this more succinct.
```python
unique_dtypes = self.dtypes.unique()

if include:
    included_dtypes = extract_unique_dtypes_from_dtypes_set(
```
If the above is true, it would be nice not to duplicate this call in both branches.
see comment below
What works is the following:
note that `extracted_dtypes` is a function. Do we prefer to do it that way?
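The snippet referred to above is not preserved, but the idea appears to be binding the shared argument once so the extraction call is not spelled out in both branches. A hypothetical sketch (the helper's body is guessed from the discussion, not taken from the merged code):

```python
from functools import partial
from typing import FrozenSet, List

import numpy as np
import pandas as pd

def extract_unique_dtypes_from_dtypes_set(
    dtypes_set: FrozenSet[type], unique_dtypes: np.ndarray
) -> List[np.dtype]:
    # Keep the unique dtypes whose scalar type falls under one of the
    # requested (super)types.
    return [
        unique_dtype
        for unique_dtype in unique_dtypes
        if issubclass(unique_dtype.type, tuple(dtypes_set))
    ]

df = pd.DataFrame({"a": [1, 2], "b": [1.5, 2.5], "c": ["x", "y"]})
unique_dtypes = df.dtypes.unique()

# Bind the shared argument once, then reuse the callable in both branches.
extracted_dtypes = partial(
    extract_unique_dtypes_from_dtypes_set, unique_dtypes=unique_dtypes
)
print(extracted_dtypes(frozenset({np.number})))   # include branch
print(extracted_dtypes(frozenset({np.object_})))  # exclude branch
```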
Nevermind then - I don't necessarily find that more readable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
@jreback ping since green
thanks @datajanko nice patch!
- `black pandas` (passed locally)
- `git diff upstream/master -u -- "*.py" | flake8 --diff` (passed locally)

I was able to run this code for 10 million columns in less than a second, which should solve the mentioned problem sufficiently well.
ASV Benchmark:
```
[ 75.00%] ··· frame_methods.SelectDtypes.time_select_dtypes                 ok
[ 75.00%] ··· ======== =============
                  n
              -------- -------------
                  10     1.53±0.2ms
                 100     1.48±0.03ms
                1000     1.50±0.06ms
               10000     1.89±0.04ms
              100000     6.55±0.2ms
              ======== =============
[ 75.00%] · For pandas commit ad9fe5d <master^2> (round 2/2):
[ 75.00%] ·· Building for conda-py3.6-Cython-matplotlib-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt...
[ 75.00%] ·· Benchmarking conda-py3.6-Cython-matplotlib-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[100.00%] ··· frame_methods.SelectDtypes.time_select_dtypes                 ok
[100.00%] ··· ======== ============
                  n
              -------- ------------
                  10     2.04±0.2ms
                 100     10.0±0.3ms
                1000      91.2±2ms
               10000      898±50ms
              100000     8.93±0.08s
              ======== ============
```
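For readers unfamiliar with ASV, here is a hypothetical sketch of what a benchmark class like `frame_methods.SelectDtypes.time_select_dtypes` could look like; the parameter grid mirrors the `n` values in the tables above, but the setup and selection are assumptions rather than the benchmark actually added in this PR:

```python
import numpy as np
import pandas as pd

class SelectDtypes:
    # asv runs time_select_dtypes once per value of n and reports the
    # per-parameter timings shown in the tables above.
    params = [10, 100, 1_000, 10_000, 100_000]
    param_names = ["n"]

    def setup(self, n):
        # A wide frame: a handful of rows, n numeric columns.
        self.df = pd.DataFrame(np.random.randn(10, n))

    def time_select_dtypes(self, n):
        self.df.select_dtypes(include="number")
```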