BUG: MaskedArray._quantile match non-nullable behavior #46282

jbrockmendel · 2022-03-08T22:53:05Z

closes #xxxx (Replace xxxx with the Github issue number)
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

Along with some refactors to 1) de-duplicate squeeze/unsqueeze logic and 2) avoid densifying Sparse

cc @attack68 the edits to the Styler tests are pure guess-and-check to keep the tests passing. Can you confirm that they make sense? Maybe suggest a more idiomatic way of expressing the conditions in which to pop?

attack68 · 2022-03-09T06:32:04Z

at first glance it seems that this changes the basic operation.

In the styler tests the dataframe is:

     A    B
0    0    1
1  nan None
2   10    2

In the test you have added a highlighted cell in (0,1), and pop it based on other parameters.

Under the base "float64" case when q_left=0.5 and q_right=1, and axis=0 the None value is column B seems to be interpreted as an element (possibly 0) and then {1,2} fall within the quantile, whereas for column A nan is either not interpreted as a value (or 0) and then only {10} falls in the quantile.

On the other hand when axis is None only {2, 10} fall within the quantile, suggesting that nan and None are not evaluated as 0 since otherwise {1} would also be in the quantile.

So, at first glance this seems to give different, non-intuitive values. I will take a closer look later.

attack68 · 2022-03-09T17:13:52Z

pandas/tests/io/formats/style/test_highlight.py

@@ -181,9 +181,20 @@ def test_highlight_between_inclusive(styler, inclusive, expected):
 )
 def test_highlight_quantile(styler, kwargs):
    expected = {
+        (0, 1): [("background-color", "yellow")],


Suggested change

(0, 1): [("background-color", "yellow")],

attack68 · 2022-03-09T17:14:46Z

pandas/tests/io/formats/style/test_highlight.py

+    if styler.data.dtypes["B"] != "Int64":
+        expected.pop((0, 1))
+    else:
+        if kwargs.get("axis", -1) is None:
+            expected.pop((0, 1))
+        elif kwargs.get("q_left", -1) == 0:
+            expected.pop((0, 1))
+        elif "subset" in kwargs:
+            expected.pop((0, 1))


Suggested change

if styler.data.dtypes["B"] != "Int64":

expected.pop((0, 1))

else:

if kwargs.get("axis", -1) is None:

expected.pop((0, 1))

elif kwargs.get("q_left", -1) == 0:

expected.pop((0, 1))

elif "subset" in kwargs:

expected.pop((0, 1))

if kwargs.get("axis", -1) == 0 and styler.data.dtypes["B"] == "Int64":

# pd.NA is interpreted as zero value in the quantile calculation meaning

# of values {1, <NA>, 2}, {1, 2} are returned within the [0.5, 1] quantile.

expected[(0, 1)] = [("background-color", "yellow")]

this fixes the test, although I would not intuitively expect the results noted in the comment.

[[ the alternative fix is to change the pytest fixture column values from {1, pd.NA, 2} to {1, pd.Na, 3}, since then the value 1 is not caught within the [0.5, 1] quantile where pd.NA = 0, and conveniently all other tests in the module still pass ]]

Do you think the trouble is in the quantile behavior or in the Styler?

OK, further investigation shows that this is a rounding issue when preserving the dtype in the quantile function. With

>>> df A B 0 0 1 1 <NA> <NA> 2 10 2

Then calling df.quantile(axis=0, numeric_only=False, q=[0.5, 1], interpolation="linear") gives under dtype="Float64":

A B 0.5 5.0 1.5 1.0 10.0 2.0 dtypes=Float64

Whereas under this PR using dtype="Int64" the values are:

A B 0.5 5 1 1.0 10 2 dtypes=Int64

The value of 1 is included in the second, probably due to calling int(1.5), and so the styler highlighting is changed for cell index (0,1).

Prior to this PR the dtype of column B was upcast from "Int64" to "float64" and the leftmost quantile was calculated as 1.5. Since the interpolation style is linear and that is specifically documented to be fraction, I think it would be worth expanding the docs how dtypes are preserved and can affect calculations, or considering upcasting to "Float64".

It is possible to fix all tests in Styler by changing the fixture's column B to have 4 instead of 2 so that the rounding of the 50% quantile at 2.5 to 2 still excludes cell (0,1).

thanks @attack68 I think I've got a handle on whats going wrong, and it tentatively looks like its in the quantile code. ill try the fix i have in mind and let you know if that doesnt work out

github-actions · 2022-04-16T00:05:55Z

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

jreback · 2022-05-08T22:37:24Z

@jbrockmendel can you merge master

jbrockmendel · 2022-05-09T03:15:07Z

rebased + green

)

BUG: MaskedArray._quantile match non-nullable behavior

de59af4

Merge branch 'main' into ref-quantile-wrap

c4cd9c1

attack68 reviewed Mar 9, 2022

View reviewed changes

jreback added NA - MaskedArrays Related to pd.NA and nullable extension arrays Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff labels Mar 16, 2022

Merge branch 'main' into ref-quantile-wrap

dcc33f2

github-actions bot added the Stale label Apr 16, 2022

jbrockmendel added 3 commits May 2, 2022 13:52

Merge branch 'main' into ref-quantile-wrap

8a6b82a

Merge branch 'main' into ref-quantile-wrap

f80ddc8

fix incorrect dtype

fe2d38e

Merge branch 'main' into ref-quantile-wrap

348a08a

jreback added this to the 1.5 milestone May 9, 2022

jreback merged commit 6edb64c into pandas-dev:main May 9, 2022

jbrockmendel deleted the ref-quantile-wrap branch May 9, 2022 18:44

yehoshuadimarsky pushed a commit to yehoshuadimarsky/pandas that referenced this pull request Jul 13, 2022

BUG: MaskedArray._quantile match non-nullable behavior (pandas-dev#46282

58bba9b

)

attack68 mentioned this pull request Aug 5, 2022

BUG: Resampler.quantile produce inconsistent results when interpolation='nearest' and q=0.5 #47942

Closed

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: MaskedArray._quantile match non-nullable behavior #46282

BUG: MaskedArray._quantile match non-nullable behavior #46282

jbrockmendel commented Mar 8, 2022

attack68 commented Mar 9, 2022

attack68 Mar 9, 2022

attack68 Mar 9, 2022 •

edited

Loading

attack68 Mar 9, 2022

attack68 Mar 9, 2022

jbrockmendel Mar 12, 2022

attack68 Mar 13, 2022

jbrockmendel Mar 16, 2022

github-actions bot commented Apr 16, 2022

jreback commented May 8, 2022

jbrockmendel commented May 9, 2022

BUG: MaskedArray._quantile match non-nullable behavior #46282

BUG: MaskedArray._quantile match non-nullable behavior #46282

Conversation

jbrockmendel commented Mar 8, 2022

attack68 commented Mar 9, 2022

attack68 Mar 9, 2022

Choose a reason for hiding this comment

attack68 Mar 9, 2022 • edited Loading

Choose a reason for hiding this comment

attack68 Mar 9, 2022

Choose a reason for hiding this comment

attack68 Mar 9, 2022

Choose a reason for hiding this comment

jbrockmendel Mar 12, 2022

Choose a reason for hiding this comment

attack68 Mar 13, 2022

Choose a reason for hiding this comment

jbrockmendel Mar 16, 2022

Choose a reason for hiding this comment

github-actions bot commented Apr 16, 2022

jreback commented May 8, 2022

jbrockmendel commented May 9, 2022

attack68 Mar 9, 2022 •

edited

Loading