Skip to content

API: Infer extension types in array #29799

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 16 commits into from
Dec 2, 2019
Merged

Conversation

TomAugspurger
Copy link
Contributor

  • string
  • integer

Closes #29791 (though we'll want to add BooleanArray).

@TomAugspurger TomAugspurger added Dtype Conversions Unexpected or buggy dtype conversions ExtensionArray Extending pandas with custom dtypes or arrays. labels Nov 22, 2019
@TomAugspurger TomAugspurger added this to the 1.0 milestone Nov 22, 2019

s = pd.Series([1, 2, np.nan], dtype="Int64")
s
Currently :meth:`pandas.array` and :meth:`pandas.Series` use different
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is my attempt to explain the inconsistency between pd.array and pd.Series we're introducing. It's not ideal, but I think it's the right behavior for now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Eventually we want to share code between these (and ideally also the Index constructor), right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the issue (#29791), I mentioned for that reason the idea of a keyword to control this preference between old or new dtypes.
(but we can also introduce that at the moment we want to share code if that seems useful then)

@TomAugspurger
Copy link
Contributor Author

CI is passing now.


:meth:`pandas.array` now infers pandas' new extension types in several cases (:issue:`29791`):

1. Sting data (including missing values) now returns a :class:`arrays.StringArray`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sting -> String


elif inferred_dtype == "integer":
return IntegerArray._from_sequence(data, copy=copy)

# TODO(BooleanArray): handle this type
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not necessarily for this PR, but is it viable to handle BooleanArray here now?

@jbrockmendel
Copy link
Member

small comment about a typo, otherwise LGTM

@jorisvandenbossche
Copy link
Member

Can you add BooleanArray as well?

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. I think @jbrockmendel had some comments.

1. Sting data (including missing values) now returns a :class:`arrays.StringArray`.
2. Integer data (including missing values) now returns a :class:`arrays.IntegerArray`.

*pandas 0.25.x*
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

side issue, we are pretty inconsistent on showing the previous version in the whatsnew

@jreback
Copy link
Contributor

jreback commented Nov 26, 2019

note that in #29791 should refer to teaching infer_dtype about extension types. This PR doesn't actually do that, rather you teach pd.array to correctly infer given the results of infer_dtype. I actually think we should push this down and remove the logic from pd.array itself, BUT, I suspect that actually would break things (e.g. we would need to process 'nullable-integer') for example. So even though this is a positive change, I think we need to open an issue about this (or revise this PR).

@jbrockmendel
Copy link
Member

LGTM

a mixture of valid integers and NA will return a floating-point
NumPy array.
If pandas does not infer a dedicated extension type a
:class:`arrays.PandasArray` is returned.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we mention that this can still change in the future? (eg that more types start to get inferred, so basically that you should not rely on the fact of pd.array returning a PandasArray when no dtype is specified)

@jorisvandenbossche
Copy link
Member

#29791 should refer to teaching infer_dtype about extension types. This PR doesn't actually do that, rather you teach pd.array to correctly infer given the results of infer_dtype. I actually think we should push this down and remove the logic from pd.array

@jreback what would be the advantage of doing that? I personally don't really see the benefit of infer_dtype to be able to return both "integer" or "integer-nullable" for the same integer data. In the end, if you want to use the new dtype or a numpy dtype does not necessarily depend on the data, but from the context where it is being used (eg the context of pd.array wanting to return the new dtypes, so pd.array can decide how to interpret "integer", as numpy int or nullable int).

@jreback
Copy link
Contributor

jreback commented Nov 28, 2019

#29791 should refer to teaching infer_dtype about extension types. This PR doesn't actually do that, rather you teach pd.array to correctly infer given the results of infer_dtype. I actually think we should push this down and remove the logic from pd.array

@jreback what would be the advantage of doing that? I personally don't really see the benefit of infer_dtype to be able to return both "integer" or "integer-nullable" for the same integer data. In the end, if you want to use the new dtype or a numpy dtype does not necessarily depend on the data, but from the context where it is being used (eg the context of pd.array wanting to return the new dtypes, so pd.array can decide how to interpret "integer", as numpy int or nullable int).

you are missing the point
this PR is just a temporary work around - infer_dtype is used practically everywhere
this is a very specific case that you are trying to solve

@jorisvandenbossche
Copy link
Member

Jeff, if you think I am missing your point, then try to explain it better (but I rather think we are just disagreeing).
In any case, it's not a discussion for this PR I think, so maybe you can open a new issue for this?

@@ -270,7 +275,7 @@ def array(
return cls._from_sequence(data, dtype=dtype, copy=copy)

if dtype is None:
inferred_dtype = lib.infer_dtype(data, skipna=False)
inferred_dtype = lib.infer_dtype(data, skipna=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so my issue with this PR is that is duplicating a lot of logic that is already held here: https://github.com/pandas-dev/pandas/blob/master/pandas/_libs/lib.pyx#L1948, so now we have 2 places with slightly different ways of doing things.

This routine is slightly more 'high-level', but myabe_convert_objects is way more used internally. So how to reconcile these things?

@jorisvandenbossche @TomAugspurger

Copy link
Contributor Author

@TomAugspurger TomAugspurger Dec 2, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not familiar with maybe_convert_objects, but what's the duplicate logic between the two? At a glance, it seems like maybe_convert_objects is mixing two things

  1. type inference (the sole purpose of lib.infer_dtype)
  2. array construction (which may be pd.array? Or move the core to some _libs method?).

Should we update maybe_infer_objects to use array internally? I don't have a feel for whether that's even possible, do you?

I do see the similarity in purpose though. They're both for taking potentially untyped things and converting them to a typed array.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we update maybe_infer_objects to use array internally? I don't have a feel for whether that's even possible, do you?

Probably not in the short term. Best guess for de-duplication for the discussed functions will look something like:

  • lib.maybe_convert_objects is made to back lib.infer_dtype
    • Note: lib.maybe_convert_objects involves two runtime non-cython imports that I'd really like to find a way to avoid (one will be easy, the other very much not)
  • pd.array calls maybe_convert_objects instead of infer_dtype
    • More generally, many places where we call infer_dtype followed by casting can be replaced to be one-pass instead of two-pass.
  • Series constructor is backed by pd.array
  • Index.__new__ is backed by pd.array (we'd need something like RangeArray first)

@TomAugspurger
Copy link
Contributor Author

Fixed the docstring & merging master. I can open an issue about deduplicating array, maybe_convert_object and infer_dtype. I think that can be a followup though, as this PR really is only changing ~10 LOC, despite the large diff :)

@jreback
Copy link
Contributor

jreback commented Dec 2, 2019

Fixed the docstring & merging master. I can open an issue about deduplicating array, maybe_convert_object and infer_dtype. I think that can be a followup though, as this PR really is only changing ~10 LOC, despite the large diff :)

right not trying to cause an issue :->

just wanting to clarify the apis & implementation we have for internal (maybe_convert_*) and external (pd.array) that need to be cleanly separated and useful / documented etc.

opening an issue would be great.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. ping on green.,

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Dec 2, 2019

All green. #29973 for the duplicate functionality between convert_objects and array.

@jreback
Copy link
Contributor

jreback commented Dec 2, 2019

thanks @TomAugspurger

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Dec 2, 2019 via email

proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019
proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019
@TomAugspurger TomAugspurger deleted the infer-ea branch November 17, 2020 18:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Dtype Conversions Unexpected or buggy dtype conversions ExtensionArray Extending pandas with custom dtypes or arrays.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Have pd.array infer new extension types
4 participants