Skip to content

BUG: MaskedArray._quantile match non-nullable behavior #46282

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
May 9, 2022

Conversation

jbrockmendel
Copy link
Member

  • closes #xxxx (Replace xxxx with the Github issue number)
  • Tests added and passed if fixing a bug or adding a new feature
  • All code checks passed.
  • Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

Along with some refactors to 1) de-duplicate squeeze/unsqueeze logic and 2) avoid densifying Sparse

cc @attack68 the edits to the Styler tests are pure guess-and-check to keep the tests passing. Can you confirm that they make sense? Maybe suggest a more idiomatic way of expressing the conditions in which to pop?

@attack68
Copy link
Contributor

attack68 commented Mar 9, 2022

at first glance it seems that this changes the basic operation.

In the styler tests the dataframe is:

     A    B
0    0    1
1  nan None
2   10    2

In the test you have added a highlighted cell in (0,1), and pop it based on other parameters.

Under the base "float64" case when q_left=0.5 and q_right=1, and axis=0 the None value is column B seems to be interpreted as an element (possibly 0) and then {1,2} fall within the quantile, whereas for column A nan is either not interpreted as a value (or 0) and then only {10} falls in the quantile.

On the other hand when axis is None only {2, 10} fall within the quantile, suggesting that nan and None are not evaluated as 0 since otherwise {1} would also be in the quantile.

So, at first glance this seems to give different, non-intuitive values. I will take a closer look later.

@@ -181,9 +181,20 @@ def test_highlight_between_inclusive(styler, inclusive, expected):
)
def test_highlight_quantile(styler, kwargs):
expected = {
(0, 1): [("background-color", "yellow")],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
(0, 1): [("background-color", "yellow")],

Comment on lines 188 to 196
if styler.data.dtypes["B"] != "Int64":
expected.pop((0, 1))
else:
if kwargs.get("axis", -1) is None:
expected.pop((0, 1))
elif kwargs.get("q_left", -1) == 0:
expected.pop((0, 1))
elif "subset" in kwargs:
expected.pop((0, 1))
Copy link
Contributor

@attack68 attack68 Mar 9, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if styler.data.dtypes["B"] != "Int64":
expected.pop((0, 1))
else:
if kwargs.get("axis", -1) is None:
expected.pop((0, 1))
elif kwargs.get("q_left", -1) == 0:
expected.pop((0, 1))
elif "subset" in kwargs:
expected.pop((0, 1))
if kwargs.get("axis", -1) == 0 and styler.data.dtypes["B"] == "Int64":
# pd.NA is interpreted as zero value in the quantile calculation meaning
# of values {1, <NA>, 2}, {1, 2} are returned within the [0.5, 1] quantile.
expected[(0, 1)] = [("background-color", "yellow")]

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this fixes the test, although I would not intuitively expect the results noted in the comment.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[[ the alternative fix is to change the pytest fixture column values from {1, pd.NA, 2} to {1, pd.Na, 3}, since then the value 1 is not caught within the [0.5, 1] quantile where pd.NA = 0, and conveniently all other tests in the module still pass ]]

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think the trouble is in the quantile behavior or in the Styler?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, further investigation shows that this is a rounding issue when preserving the dtype in the quantile function. With

>>> df
      A     B
0     0     1
1  <NA>  <NA>
2    10     2

Then calling df.quantile(axis=0, numeric_only=False, q=[0.5, 1], interpolation="linear") gives under dtype="Float64":

        A    B
0.5   5.0  1.5
1.0  10.0  2.0
dtypes=Float64

Whereas under this PR using dtype="Int64" the values are:

      A  B
0.5   5  1
1.0  10  2
dtypes=Int64

The value of 1 is included in the second, probably due to calling int(1.5), and so the styler highlighting is changed for cell index (0,1).

Prior to this PR the dtype of column B was upcast from "Int64" to "float64" and the leftmost quantile was calculated as 1.5. Since the interpolation style is linear and that is specifically documented to be fraction, I think it would be worth expanding the docs how dtypes are preserved and can affect calculations, or considering upcasting to "Float64".

It is possible to fix all tests in Styler by changing the fixture's column B to have 4 instead of 2 so that the rounding of the 50% quantile at 2.5 to 2 still excludes cell (0,1).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @attack68 I think I've got a handle on whats going wrong, and it tentatively looks like its in the quantile code. ill try the fix i have in mind and let you know if that doesnt work out

@jreback jreback added NA - MaskedArrays Related to pd.NA and nullable extension arrays Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff labels Mar 16, 2022
@github-actions
Copy link
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Apr 16, 2022
@jreback
Copy link
Contributor

jreback commented May 8, 2022

@jbrockmendel can you merge master

@jbrockmendel
Copy link
Member Author

rebased + green

@jreback jreback added this to the 1.5 milestone May 9, 2022
@jreback jreback merged commit 6edb64c into pandas-dev:main May 9, 2022
@jbrockmendel jbrockmendel deleted the ref-quantile-wrap branch May 9, 2022 18:44
yehoshuadimarsky pushed a commit to yehoshuadimarsky/pandas that referenced this pull request Jul 13, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff NA - MaskedArrays Related to pd.NA and nullable extension arrays Stale
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants