Skip to content

Pivot nans fix #28540

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 7 commits into from
Closed

Conversation

endremborza
Copy link
Contributor

@endremborza endremborza commented Sep 19, 2019

this gets the ball rolling on #18030 as it is very easy to fix, but requires some decisions, mostly for structure and naming

the basic options as I see them:

0 - this PR as is (maybe some name change to this 26 character monster)

1 - Outright changing the dropna=False default behavior I thought would be a bad idea as it breaks backwards compatibility, but this solution is quite clunky.

2 - Change dropna=True behavior to satisfy the test added here, but again that is not backwards compatible, but seems reasonable.

Also, I've noticed that taking out the line

table = table.dropna(how="all", axis=1)

doesn't break any of the tests, so there is that. (this helps 2)

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might not fully understand the referenced issue yet (couldn't copy / paste the example to match what was written) but I don't think adding a new keyword is appropriate here. Option 2 seems more appropriate - the existing behavior is a bug right?

],
names=["measurer", "product"],
)
tm.assert_index_equal(pv_col.columns, m)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you construct the expected output and use tm.assert_frame_equal(result, expected instead? You should see that in a lot of other tests

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did one way

@WillAyd WillAyd added MultiIndex Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Sep 20, 2019
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a non-obvious api and would some discussion as this is just api bloat AFIACT.

@jreback
Copy link
Contributor

jreback commented Sep 20, 2019

why can you not simply used the value in the observed keyword?

@endremborza
Copy link
Contributor Author

Made some changes to move towards solution 2, but this requires a change in default value to observed, and it breaks 2 tests referencing #21133

those two tests basically require what this new one would like to eliminate (keeping records where values are all nan) which was possible with a new kwarg, but without it the combinations from observed/dropna cant work it out without modifying the old test

@endremborza
Copy link
Contributor Author

I'm sorry, is there any way forward with this?

Either past tests need to be modified and backwards compatibility broken, or the API needs to be bloated, or the whole thing abandoned.

Should I modify the tests that drop the nancols now?

@WillAyd
Copy link
Member

WillAyd commented Oct 22, 2019

@endremborza can you help summarize what behavior you are expecting from the various combinations of dropna / observed and how they differ from today? We typically don't just change the default values (note CI going red) so would have to think through options and best approach

@endremborza
Copy link
Contributor Author

simply put, I think observed=True with dropna=False should not drop combinations that are present, but filled with nans.

simplest example I can produce:

df = pd.DataFrame(
    {"number": ["one", "two", "one", "three"], 
     "letter": ["a", "a" ,"b", "b"], 
     "y": [1, 2, np.nan, np.nan],
     "x": [1, 2, 3, 4]}
)

result = df.pivot_table(values="y",
                        columns=["number", "letter"],
                        index="x",
                        observed=True, dropna=False)

here result is

number one three two
letter a b a b a b
x
1 1.0 NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN 2.0 NaN
3 NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN NaN

I suppose it should be

number one three two
letter a b b a
x
1 1.0 NaN NaN NaN
2 NaN NaN NaN 2.0
3 NaN NaN NaN NaN
4 NaN NaN NaN NaN

However, this breaks current tests as I (and the red CI) pointed out

also, I just noticed that if I leave out the x column and the index="x" I get a weird attribute error, that I think should not happen, but that is another issue

@WillAyd
Copy link
Member

WillAyd commented Dec 17, 2019

Thanks for the PR but I think this is stale and has some disagreement over implementation. If interested in pursuing further let's discuss more in issue before coming back to code

@WillAyd WillAyd closed this Dec 17, 2019
@endremborza
Copy link
Contributor Author

@WillAyd I'm sorry but this has been an absolute disappointment. I like pandas and I believe in the idea of actively trying to fix your own issues with open source software, but the indecision and disinterest here has been painful.

I think I've been quite clear with how all this could be handled, but it seems nobody took the time and the effort to evaluate these options.

Calling it stale just claims that there was nothing wrong with the process, or the way this was handled, which is quite off-putting.

@jreback
Copy link
Contributor

jreback commented Dec 18, 2019

@endremborza you never responded to any of my queries above

you need to engage in discussion otherwise we cannot move forwards

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
MultiIndex Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
3 participants