-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: MaskedArray._quantile match non-nullable behavior #46282
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
at first glance it seems that this changes the basic operation. In the styler tests the dataframe is: A B
0 0 1
1 nan None
2 10 2 In the test you have added a highlighted cell in (0,1), and pop it based on other parameters. Under the base "float64" case when On the other hand when axis is So, at first glance this seems to give different, non-intuitive values. I will take a closer look later. |
@@ -181,9 +181,20 @@ def test_highlight_between_inclusive(styler, inclusive, expected): | |||
) | |||
def test_highlight_quantile(styler, kwargs): | |||
expected = { | |||
(0, 1): [("background-color", "yellow")], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(0, 1): [("background-color", "yellow")], |
if styler.data.dtypes["B"] != "Int64": | ||
expected.pop((0, 1)) | ||
else: | ||
if kwargs.get("axis", -1) is None: | ||
expected.pop((0, 1)) | ||
elif kwargs.get("q_left", -1) == 0: | ||
expected.pop((0, 1)) | ||
elif "subset" in kwargs: | ||
expected.pop((0, 1)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if styler.data.dtypes["B"] != "Int64": | |
expected.pop((0, 1)) | |
else: | |
if kwargs.get("axis", -1) is None: | |
expected.pop((0, 1)) | |
elif kwargs.get("q_left", -1) == 0: | |
expected.pop((0, 1)) | |
elif "subset" in kwargs: | |
expected.pop((0, 1)) | |
if kwargs.get("axis", -1) == 0 and styler.data.dtypes["B"] == "Int64": | |
# pd.NA is interpreted as zero value in the quantile calculation meaning | |
# of values {1, <NA>, 2}, {1, 2} are returned within the [0.5, 1] quantile. | |
expected[(0, 1)] = [("background-color", "yellow")] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this fixes the test, although I would not intuitively expect the results noted in the comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[[ the alternative fix is to change the pytest fixture column values from {1, pd.NA, 2} to {1, pd.Na, 3}, since then the value 1 is not caught within the [0.5, 1] quantile where pd.NA = 0, and conveniently all other tests in the module still pass ]]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think the trouble is in the quantile behavior or in the Styler?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, further investigation shows that this is a rounding issue when preserving the dtype in the quantile function. With
>>> df
A B
0 0 1
1 <NA> <NA>
2 10 2
Then calling df.quantile(axis=0, numeric_only=False, q=[0.5, 1], interpolation="linear")
gives under dtype="Float64"
:
A B
0.5 5.0 1.5
1.0 10.0 2.0
dtypes=Float64
Whereas under this PR using dtype="Int64"
the values are:
A B
0.5 5 1
1.0 10 2
dtypes=Int64
The value of 1 is included in the second, probably due to calling int(1.5)
, and so the styler highlighting is changed for cell index (0,1).
Prior to this PR the dtype of column B was upcast from "Int64" to "float64" and the leftmost quantile was calculated as 1.5. Since the interpolation style is linear and that is specifically documented to be fraction, I think it would be worth expanding the docs how dtypes are preserved and can affect calculations, or considering upcasting to "Float64".
It is possible to fix all tests in Styler by changing the fixture's column B to have 4
instead of 2
so that the rounding of the 50% quantile at 2.5
to 2
still excludes cell (0,1).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks @attack68 I think I've got a handle on whats going wrong, and it tentatively looks like its in the quantile code. ill try the fix i have in mind and let you know if that doesnt work out
This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this. |
@jbrockmendel can you merge master |
rebased + green |
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.Along with some refactors to 1) de-duplicate squeeze/unsqueeze logic and 2) avoid densifying Sparse
cc @attack68 the edits to the Styler tests are pure guess-and-check to keep the tests passing. Can you confirm that they make sense? Maybe suggest a more idiomatic way of expressing the conditions in which to pop?