-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
ENH: Exclude nuisance columns from result of window functions #27044
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: Exclude nuisance columns from result of window functions #27044
Conversation
Codecov Report
@@ Coverage Diff @@
## master #27044 +/- ##
==========================================
- Coverage 92% 91.99% -0.02%
==========================================
Files 180 180
Lines 50754 50774 +20
==========================================
+ Hits 46698 46708 +10
- Misses 4056 4066 +10
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@
## master #27044 +/- ##
=========================================
Coverage ? 92.03%
=========================================
Files ? 180
Lines ? 50735
Branches ? 0
=========================================
Hits ? 46692
Misses ? 4043
Partials ? 0
Continue to review full report at Codecov.
|
How does this interact with #23002, which would like to apply (certain) operations to non-numeric columns? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think returning an empty series is correct here as it is not consistent with GroupBy ops with a similar limitation. For instance:
>>> pd.Series(['foo']).groupby([0]).mean()
DataError: No numeric types to aggregate
Granted I don't think this is even consistent internally:
>>> pd.Series(['foo']).groupby([0]).rank()
TypeError: 'NoneType' object is not callable
But the empty Series definitely seems wrong to me.
Any objection to the DataError as shown in the above?
@WillAyd DataError seems reasonable. I just checked how >>> pd.DataFrame({"A": ['foo']}).mean()
Series([], dtype: float64) @TomAugspurger this doesn't resolves it. Instead of an error an empty |
Not sure why some test on py35_macos are failing. Will try to reproduce it |
this looks good. can you merge master and ping on green. |
@WillAyd if any other comments. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks nice just need to update whatsnew
@ihsansecer see the failure https://dev.azure.com/pandas-dev/pandas/_build/results?buildId=13461 I think this might be an ordering issue (as the dict is not ordered on 3.5). |
pandas/core/window.py
Outdated
results = [] | ||
for b in blocks: | ||
exclude = [] | ||
for dtype in list(dtypes): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
best just to
for b in bocks_dict.values():
.....
then don't need anything else
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since my solution to unordered dict issue requires deleting nuisance blocks (block of columns with same type) I needed a shallow copy of keys to remove them iteratively
pandas/core/window.py
Outdated
except (TypeError, NotImplementedError): | ||
if isinstance(obj, ABCDataFrame): | ||
exclude.extend(b.columns) | ||
del blocks_dict[dtype] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why are you del here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As you stated the order for dictionary differs in each run. So iteration order for blocks
differs.
So in case of a DataFrame with columns ["A", "B"]
with types [int, str]
when iteration starts with "B" then:
results = [values_of_A]
but blocks = [block_for_B, block_for_A]
. There seems to be a mismatch here (values_of_A and block_for_B are iterated together). So I went with removing block_for_B
which was the first thing came to my mind.
lgtm. @TomAugspurger @jorisvandenbossche if any comments. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
Thanks @ihsansecer ! |
git diff upstream/master -u -- "*.py" | flake8 --diff