
BUG/TST: Fix infer_dtype for Period array-likes and general ExtensionArrays #37367

Merged (15 commits) on Feb 12, 2021
Changes from 1 commit
16 changes: 11 additions & 5 deletions pandas/_libs/lib.pyx
@@ -1050,6 +1050,7 @@ _TYPE_MAP = {
     "M": "datetime64",
     "timedelta64[ns]": "timedelta64",
     "m": "timedelta64",
+    "period": "period",
     "interval": "interval",
 }

@@ -1203,9 +1204,14 @@ cdef object _try_infer_map(object dtype):
         object val
         str attr
     for attr in ["name", "kind", "base"]:
-        val = getattr(dtype, attr)
+        val = getattr(dtype, attr, None)
         if val in _TYPE_MAP:
             return _TYPE_MAP[val]
+        # also check base name for parametrized dtypes (eg period[D])
+        if isinstance(val, str):
+            val = val.split("[")[0]

Contributor:
can we explicitly add the Period[D] types to the type_map? (there aren't that many)

Member Author:
Going from the PeriodDtypeCode enum, there are actually quite a few.
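
For a sense of scale (illustrative only, not part of the thread): every anchored frequency yields its own parametrized dtype name, so the set of keys to enumerate explicitly would be large.

```python
import pandas as pd

# a handful of the parametrized names that would each need an explicit _TYPE_MAP entry;
# anchored variants such as "W-MON" or "Q-JAN" multiply the count further
[pd.PeriodDtype(freq).name for freq in ["D", "M", "Q", "A", "W", "H", "T", "S"]]
# ['period[D]', 'period[M]', 'period[Q-DEC]', 'period[A-DEC]', 'period[W-SUN]',
#  'period[H]', 'period[T]', 'period[S]']
```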

Member:
if we were to get that specific, at some point it makes more sense to just return a dtype object

Contributor:
right here why can't we try to convert to an EA type using registry.find()?

Member Author (@jorisvandenbossche, Nov 25, 2020):
Because we already have a dtype here, there is no need to look anything up in the registry. We don't want to infer a type; we have an EA dtype that we want to find in the TYPE_MAP to infer the category returned by infer_dtype, by checking the name, kind, or base attributes of the dtype.
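
For illustration (not part of the diff), these are roughly the attributes _try_infer_map inspects on a PeriodDtype, and why the extra base-name split is needed for parametrized dtypes:

```python
import pandas as pd

dtype = pd.PeriodDtype("D")

# the attributes checked, in order, by _try_infer_map:
dtype.name                     # "period[D]" - parametrized, so not a direct _TYPE_MAP key
dtype.kind                     # "O"
getattr(dtype, "base", None)   # third-party EA dtypes may not define this, hence the None default

# the added split lets the parametrized name hit the new "period" entry:
dtype.name.split("[")[0]       # "period"
```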

Member Author:
Do we want to import those in lib.pyx? (might be fine, e.g. there are already imports from tslibs.period, but at the moment only cimports)

If that's OK, I am fine with changing it. I don't necessarily find it less "hacky" than the current solution, but I just want some solution that is acceptable to all.

Contributor:
do we not already import all of the scalar EA types?
why is this any different?

+1 on using the existing machinery

Member:
> I don't necessarily find it less "hacky" than the current solution, but I just want some solution that is acceptable to all

100% agree on both points

> Do we want to import those in lib.pyx

not ideal, but I don't think it will harm anything ATM

Contributor:
> and please don't say that I should convert the string to a dtype, as you can see in the code a few lines above, we actually start from a dtype object

exactly, you are missing the entire point of the dtype abstraction

you avoid parsing strings in the first place

I will be blocking this until / unless a good solution is done

Member Author:
> do we not already import all of the scalar EA types?

lib.pyx doesn't know anything about EAs. It only imports helper functions like is_period_object.

> why is this any different

different than what?

+            if val in _TYPE_MAP:
+                return _TYPE_MAP[val]
     return None
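
As a quick sanity check of the intended behaviour (illustrative only, mirroring the tests added below in test_inference.py), with this branch one would expect:

```python
import pandas as pd

pidx = pd.period_range("2011-01-01", periods=3, freq="D")

# the parametrized "period[D]" dtype now resolves through _TYPE_MAP
pd.api.types.infer_dtype(pidx, skipna=True)             # expected: "period"
pd.api.types.infer_dtype(pd.Series(pidx), skipna=True)  # expected: "period"
```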


@@ -1324,12 +1330,12 @@ def infer_dtype(value: object, skipna: bool = True) -> str:
         # e.g. categoricals
         dtype = value.dtype
         if not isinstance(dtype, np.dtype):
-            value = _try_infer_map(value.dtype)
-            if value is not None:
-                return value
+            inferred = _try_infer_map(value.dtype)
+            if inferred is not None:
+                return inferred

             # its ndarray-like but we can't handle
-            raise ValueError(f"cannot infer type for {type(value)}")
+            values = np.asarray(value)

Member:
do we have tests that get here?

Member Author:
The base extension test and the decimal test that I added in this PR get here (those were raising errors before)

Member Author:
Note I am not fully sure about this change. Before, it raised an error; now it will typically return "mixed" for random Python objects. Neither seems very useful (in other words, both are an indication that nothing could be inferred). But I found it a bit inconsistent to raise in this case, while if you passed a list of the same objects, we would actually infer.
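
For context (illustrative only, not part of the diff), this is what the inference code already does for a plain sequence of arbitrary objects it cannot classify:

```python
import numpy as np
import pandas as pd


class Thing:
    pass


# a list or object ndarray of unrecognized Python objects infers as "mixed"
pd.api.types.infer_dtype([Thing(), Thing()], skipna=True)                           # "mixed"
pd.api.types.infer_dtype(np.array([Thing(), Thing()], dtype=object), skipna=True)   # "mixed"
```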

Contributor:
we are already doing this below. This does a full conversion of the object; I think this will simply kill performance in some cases, making this routine very unpredictable. Would rather not make this change here.

Contributor:
random Python objects will be marked as 'mixed' in any event without the performance penalty below.

Contributor:
if you are going to remove the exception then you can remove this L1337,1338 entirely (as np.asarray is called on L1341)

Member Author:
> if you are going to remove the exception then you can remove this L1337,1338 entirely (as np.asarray is called on L1341)

I removed the duplicate asarray, but we should still decide whether we want to keep that exception or not.


         # Unwrap Series/Index
         values = np.asarray(value)
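
To illustrate the fallback path for a third-party ExtensionArray whose dtype is not in _TYPE_MAP (a sketch; DecimalArray is the array used in pandas' own extension tests, and its import path is an internal detail):

```python
from decimal import Decimal

import pandas as pd
from pandas.tests.extension.decimal import DecimalArray  # test-only EA

arr = DecimalArray([Decimal("1.0"), Decimal("2.0")])

# DecimalDtype is unknown to _TYPE_MAP, so _try_infer_map returns None and
# infer_dtype falls back to np.asarray + element-wise inference instead of raising
pd.api.types.infer_dtype(arr, skipna=True)   # "decimal" (see the decimal test below)
```
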
13 changes: 13 additions & 0 deletions pandas/tests/dtypes/test_inference.py
@@ -890,6 +890,19 @@ def test_infer_dtype_period(self):
arr = np.array([pd.Period("2011-01", freq="D"), pd.Period("2011-02", freq="M")])
assert lib.infer_dtype(arr, skipna=True) == "period"

@pytest.mark.parametrize("klass", [pd.array, pd.Series, pd.Index])
@pytest.mark.parametrize("skipna", [True, False])
def test_infer_dtype_period_array(self, klass, skipna):
        # https://github.com/pandas-dev/pandas/issues/23553
values = klass(
[
pd.Period("2011-01-01", freq="D"),
pd.Period("2011-01-02", freq="D"),
pd.NaT,
]
)
assert lib.infer_dtype(values, skipna=skipna) == "period"

def test_infer_dtype_period_mixed(self):
arr = np.array(
[pd.Period("2011-01", freq="M"), np.datetime64("nat")], dtype=object
8 changes: 8 additions & 0 deletions pandas/tests/extension/base/dtype.py
@@ -123,3 +123,11 @@ def test_get_common_dtype(self, dtype):
# still testing as good practice to have this working (and it is the
# only case we can test in general)
assert dtype._get_common_dtype([dtype]) == dtype

@pytest.mark.parametrize("skipna", [True, False])
def test_infer_dtype(self, data, data_missing, skipna):
# only testing that this works without raising an error
res = pd.api.types.infer_dtype(data, skipna=skipna)
assert isinstance(res, str)
res = pd.api.types.infer_dtype(data_missing, skipna=skipna)
assert isinstance(res, str)
7 changes: 7 additions & 0 deletions pandas/tests/extension/decimal/test_decimal.py
@@ -116,6 +116,13 @@ class TestDtype(BaseDecimal, base.BaseDtypeTests):
def test_hashable(self, dtype):
pass

@pytest.mark.parametrize("skipna", [True, False])
def test_infer_dtype(self, data, data_missing, skipna):
# here overriding base test to ensure we fall back to inferring the
# array elements for external EAs (which here are recognized as decimals)
assert pd.api.types.infer_dtype(data, skipna=skipna) == "decimal"
assert pd.api.types.infer_dtype(data_missing, skipna=skipna) == "decimal"


class TestInterface(BaseDecimal, base.BaseInterfaceTests):
pass