ENH: recognize Decimal("NaN") in pd.isna #39409

Merged
merged 12 commits into from
Feb 25, 2021

jbrockmendel
Member

@jbrockmendel jbrockmendel commented Jan 26, 2021

Discussion in #23530 seems ambivalent on whether this is desirable, and I don't have a strong opinion on it in general. BUT tm.assert_foo_equal is incorrect with Decimal("NaN") ATM and I'd like to see that fixed.

xref #32206

# GH 31615
if isinstance(nulls_fixture, Decimal):
    mark = pytest.mark.xfail(reason="not implemented")
Member Author

xref #28609 cc @WillAyd

Member

If you wanted to fix this here it would just require adding the same condition in the ujson code that we have for checking floats.

if (npy_isnan(val) || npy_isinf(val)) {

The decimal check is only a few branches below that

Contributor

@jreback jreback left a comment

+1 on this, though I'm worried about the perf cost. Can you run some benchmarks?

if dtype.kind in ["i", "u", "f", "c"]:
    # Numeric
    return obj is not NaT and not isinstance(obj, (np.datetime64, np.timedelta64))

if dtype == np.dtype(object):
Contributor

these should be if/elif

Member Author

sure

@@ -606,15 +607,19 @@ def is_valid_nat_for_dtype(obj, dtype: DtypeObj) -> bool:
     if not lib.is_scalar(obj) or not isna(obj):
         return False
     if dtype.kind == "M":
-        return not isinstance(obj, np.timedelta64)
+        return not isinstance(obj, (np.timedelta64, Decimal))
Contributor

this is really strange that you need to do this

Contributor

can you just test dtype.kind == 'O' first?

Member Author

no bc dtype.kind == "O" includes Period and Interval

@@ -89,6 +91,10 @@ def test_constructor_infer_periodindex(self):
     def test_constructor_infer_nat_dt_like(
         self, pos, klass, dtype, ctor, nulls_fixture, request
     ):
         if isinstance(nulls_fixture, Decimal):
Contributor

maybe should have a nulls_fixture_compatible_datetimelike ?

@jbrockmendel
Member Author

In [4]: %timeit pd.isna(2)
323 ns ± 15.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)  # <-- master
319 ns ± 5.37 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)  # <-- PR

In [5]: %timeit pd.isna(np.nan)
329 ns ± 3.86 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)  # <-- master
330 ns ± 9.32 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)  # <-- PR

In [6]: arr = np.arange(1000).astype(object)
In [7]: arr[500] = Decimal("NAN")

In [8]: %timeit pd.isna(arr)
42.1 µs ± 254 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)  # <-- master
45.3 µs ± 213 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)  # <-- PR

@jreback jreback added the Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate label Jan 27, 2021
     if dtype.kind == "m":
-        return not isinstance(obj, np.datetime64)
+        return not isinstance(obj, (np.datetime64, Decimal))
Contributor

same

@jreback
Contributor

jreback commented Jan 28, 2021

ok looks fine, can you add a whatsnew note (also prob need to update the docs where we mention missing values e.g. isna & user docs). can do as a followon is ok.

cc @jorisvandenbossche

@jorisvandenbossche
Member

What's the rationale for supporting decimals? (we don't have special support for it elsewhere, I think)

So that something like pd.Series([decimal.Decimal("NaN"), decimal.Decimal("2.2")]).isna() works correctly? (now it returns [False, False])

We don't really have such custom support for anything else in object dtype, so while I certainly understand the use case, I am not sure why we should do it for decimal and not for something else? (although maybe decimal is the only relevant case)
Also, assume we would have proper decimal support in the future (eg arrow-backed), then the question is also if we want to keep supporting it like this in object dtype (arrow eg also doesn't support "NaN" for decimal).

@jbrockmendel
Member Author

> we don't have special support for it elsewhere, I think

we recognize it in lib.is_scalar and infer_dtype

> So that something like pd.Series([decimal.Decimal("NaN"), decimal.Decimal("2.2")]).isna() works correctly? (now it returns [False, False])

correct
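
For readers following along, here is a minimal stdlib-only sketch of the per-element check being discussed. The helper name `is_decimal_nan` is hypothetical and this is not the code added by the PR; it only illustrates what "recognizing Decimal("NaN") as missing" could look like for the two-element example above.

```python
from decimal import Decimal


def is_decimal_nan(val):
    """Hypothetical helper: True only when val is a Decimal holding a NaN."""
    # Decimal.is_nan() is True for both quiet ("NaN") and signaling ("sNaN")
    # values, and it avoids comparison operators, which can raise for sNaN.
    return isinstance(val, Decimal) and val.is_nan()


# The same two values as in the Series example from the discussion.
values = [Decimal("NaN"), Decimal("2.2")]
mask = [is_decimal_nan(v) for v in values]
print(mask)  # [True, False]
```

The `isinstance` guard comes first so that non-Decimal objects are rejected before any Decimal-specific method is called.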

@jorisvandenbossche
Member

> we recognize it in infer_dtype

Wondering: is the fact that infer_dtype recognizes it actually used somewhere? (I only see it used in the tests)

The fact that is_scalar recognizes it is a good point, but in principle any Python class that follows the number protocol (eg implements __int__ or __float__) will be recognized as scalar.

I am personally still hesitant about the "is this the future behaviour we want?"

@jreback
Contributor

jreback commented Feb 11, 2021

@jorisvandenbossche are you -1 here? I think this is ok behavior. it's rn a special case, so this seems like a nice cleanup.

@jorisvandenbossche
Member

I don't have a strong opinion about it, but we don't support decimals as a first class citizen (only in object dtype as any other Python class), so I don't really see the value in adding a special case for it in our C code.

(but so, if others want to see this behaviour, I won't block it)

@jreback
Contributor

jreback commented Feb 12, 2021

thanks @jorisvandenbossche yeah to me this improves the UX a bit and doesn't hurt perf so ok with it.

@jbrockmendel
Member Author

whatsnew added + green

@jbrockmendel
Member Author

the only reason i can think of to treat Decimal special is bc it is from the stdlib

@jreback jreback added this to the 1.3 milestone Feb 25, 2021
@jreback jreback merged commit 8ec9e0a into pandas-dev:master Feb 25, 2021
@jreback
Contributor

jreback commented Feb 25, 2021

merging, this makes the logic a bit simpler and agree it's a built-in type, so why not

@Matausi29 Matausi29 left a comment

Done

@jbrockmendel jbrockmendel deleted the enh-isna-decimal branch February 25, 2021 02:06
return (
    val is C_NA
    or is_null_datetimelike(val, inat_is_null=False)
    or is_decimal_na(val)
Member

the list of what is considered null in docstring maybe could be updated.

also for consistency when using pandas.options.mode.use_inf_as_na, what about checknull_old?

Member Author

> the list of what is considered null in docstring maybe could be updated.

just added this to my next "collected misc" branch

> also for consistency when using pandas.options.mode.use_inf_as_na, what about checknull_old?

i guess you're referring to Decimal("inf")? my inclination is to let that sleeping dog lie
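
To make the open question concrete, below is a rough stdlib-only sketch of what an "inf counts as missing" variant could look like for Decimal values. The function name `is_decimal_missing` and its `inf_as_na` flag are hypothetical; this is explicitly not the merged behavior, which leaves Decimal("inf") alone.

```python
from decimal import Decimal


def is_decimal_missing(val, inf_as_na=False):
    """Hypothetical check: NaN is always missing, infinity only on request."""
    if not isinstance(val, Decimal):
        return False
    if val.is_nan():
        return True
    # Only treat +/-Infinity as missing when the caller opts in.
    return inf_as_na and val.is_infinite()


print(is_decimal_missing(Decimal("inf")))                  # False
print(is_decimal_missing(Decimal("inf"), inf_as_na=True))  # True
```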

simonjayhawkins added a commit to simonjayhawkins/pandas that referenced this pull request Feb 25, 2021
Labels
Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support Decimal("NaN") is pandas.isna
6 participants