Skip to content

ENH: Exclude nuisance columns from result of window functions #27044

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 11 commits into from
Jul 1, 2019

Conversation

ihsansecer
Copy link
Contributor

@gfyoung gfyoung added Dtype Conversions Unexpected or buggy dtype conversions Enhancement labels Jun 26, 2019
@codecov
Copy link

codecov bot commented Jun 26, 2019

Codecov Report

Merging #27044 into master will decrease coverage by 0.01%.
The diff coverage is 61.29%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #27044      +/-   ##
==========================================
- Coverage      92%   91.99%   -0.02%     
==========================================
  Files         180      180              
  Lines       50754    50774      +20     
==========================================
+ Hits        46698    46708      +10     
- Misses       4056     4066      +10
Flag Coverage Δ
#multiple 90.63% <61.29%> (-0.01%) ⬇️
#single 41.81% <3.22%> (-0.11%) ⬇️
Impacted Files Coverage Δ
pandas/core/window.py 96.01% <61.29%> (-0.63%) ⬇️
pandas/io/gbq.py 88.88% <0%> (-11.12%) ⬇️
pandas/core/frame.py 96.89% <0%> (-0.12%) ⬇️
pandas/compat/_optional.py 100% <0%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b4aa1d6...035a697. Read the comment docs.

@codecov
Copy link

codecov bot commented Jun 26, 2019

Codecov Report

❗ No coverage uploaded for pull request base (master@8b48f5c). Click here to learn what that means.
The diff coverage is 65.51%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master   #27044   +/-   ##
=========================================
  Coverage          ?   92.03%           
=========================================
  Files             ?      180           
  Lines             ?    50735           
  Branches          ?        0           
=========================================
  Hits              ?    46692           
  Misses            ?     4043           
  Partials          ?        0
Flag Coverage Δ
#multiple 90.67% <65.51%> (?)
#single 41.86% <6.89%> (?)
Impacted Files Coverage Δ
pandas/core/window.py 96.23% <65.51%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8b48f5c...fd678a5. Read the comment docs.

@TomAugspurger
Copy link
Contributor

How does this interact with #23002, which would like to apply (certain) operations to non-numeric columns?

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think returning an empty series is correct here as it is not consistent with GroupBy ops with a similar limitation. For instance:

>>> pd.Series(['foo']).groupby([0]).mean()
DataError: No numeric types to aggregate

Granted I don't think this is even consistent internally:

>>> pd.Series(['foo']).groupby([0]).rank()
TypeError: 'NoneType' object is not callable

But the empty Series definitely seems wrong to me.

Any objection to the DataError as shown in the above?

@WillAyd WillAyd added the Window rolling, ewma, expanding label Jun 26, 2019
@ihsansecer
Copy link
Contributor Author

ihsansecer commented Jun 26, 2019

@WillAyd DataError seems reasonable. I just checked how DataFrame functions behave and it led me to returning an empty Series like:

>>> pd.DataFrame({"A": ['foo']}).mean()
Series([], dtype: float64)

@TomAugspurger this doesn't resolves it. Instead of an error an empty Series will be returned (as a result of using Rolling.apply with a string column) but it seems to be wrong as mentioned

@ihsansecer
Copy link
Contributor Author

Not sure why some test on py35_macos are failing. Will try to reproduce it

@jreback jreback added this to the 0.25.0 milestone Jun 27, 2019
@jreback
Copy link
Contributor

jreback commented Jun 27, 2019

this looks good. can you merge master and ping on green.

@jreback
Copy link
Contributor

jreback commented Jun 27, 2019

@WillAyd if any other comments.

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks nice just need to update whatsnew

@jreback
Copy link
Contributor

jreback commented Jun 27, 2019

@ihsansecer see the failure https://dev.azure.com/pandas-dev/pandas/_build/results?buildId=13461

I think this might be an ordering issue (as the dict is not ordered on 3.5).

@jreback jreback removed this from the 0.25.0 milestone Jun 27, 2019
results = []
for b in blocks:
exclude = []
for dtype in list(dtypes):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

best just to

for b in bocks_dict.values():
    .....

then don't need anything else

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since my solution to unordered dict issue requires deleting nuisance blocks (block of columns with same type) I needed a shallow copy of keys to remove them iteratively

except (TypeError, NotImplementedError):
if isinstance(obj, ABCDataFrame):
exclude.extend(b.columns)
del blocks_dict[dtype]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are you del here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As you stated the order for dictionary differs in each run. So iteration order for blocks differs.

So in case of a DataFrame with columns ["A", "B"] with types [int, str] when iteration starts with "B" then:

results = [values_of_A] but blocks = [block_for_B, block_for_A]. There seems to be a mismatch here (values_of_A and block_for_B are iterated together). So I went with removing block_for_B which was the first thing came to my mind.

@jreback jreback added this to the 0.25.0 milestone Jun 28, 2019
@jreback
Copy link
Contributor

jreback commented Jun 28, 2019

lgtm. @TomAugspurger @jorisvandenbossche if any comments.

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@WillAyd WillAyd merged commit 355e322 into pandas-dev:master Jul 1, 2019
@WillAyd
Copy link
Member

WillAyd commented Jul 1, 2019

Thanks @ihsansecer !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Dtype Conversions Unexpected or buggy dtype conversions Enhancement Window rolling, ewma, expanding
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: window functions need to exclude nuiscance columns
5 participants