API: Infer extension types in array #29799

TomAugspurger · 2019-11-22T19:40:57Z

* string
* integer

Closes #29791 (though we'll want to add BooleanArray).


pandas/core/construction.py

TomAugspurger · 2019-11-22T19:41:41Z

doc/source/user_guide/integer_na.rst


-   s = pd.Series([1, 2, np.nan], dtype="Int64")
-   s
+   Currently :meth:`pandas.array` and :meth:`pandas.Series` use different


This is my attempt to explain the inconsistency between pd.array and pd.Series we're introducing. It's not ideal, but I think it's the right behavior for now.

Eventually we want to share code between these (and ideally also the Index constructor), right?

That is why, in the issue (#29791), I suggested a keyword to control the preference between old and new dtypes.
(We could also introduce that keyword later, at the point where we want to share code, if it still seems useful then.)
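For readers following the thread, here is a minimal sketch of the inconsistency being discussed, assuming pandas 1.0 or later. The variable names are illustrative only and are not taken from the PR.

```python
import numpy as np
import pandas as pd

# The same list of integers with one missing value.
data = [1, 2, np.nan]

# pd.array infers the nullable integer extension dtype.
arr = pd.array(data)

# pd.Series keeps the older behaviour and falls back to NumPy float64.
ser = pd.Series(data)

print(arr.dtype)
print(ser.dtype)
```

Passing `dtype="Int64"` to `pd.Series` explicitly is the way to get the nullable dtype from the Series constructor.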

pandas/_libs/lib.pyx

pandas/core/construction.py

TomAugspurger · 2019-11-25T17:57:31Z

CI is passing now.

jbrockmendel · 2019-11-25T22:16:55Z

doc/source/whatsnew/v1.0.0.rst

+
+:meth:`pandas.array` now infers pandas' new extension types in several cases (:issue:`29791`):
+
+1. Sting data (including missing values) now returns a :class:`arrays.StringArray`.


Sting -> String
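As a rough illustration of the whatsnew entry quoted above, and assuming pandas 1.0 or later, string data with a missing value should come back with the string extension dtype rather than as an object array. The variable names are illustrative.

```python
import pandas as pd

# String data including a missing value.
arr = pd.array(["a", "b", None])

# The dtype is pandas' string extension dtype; the missing entry is kept as NA.
print(type(arr).__name__, arr.dtype)
print(arr.isna())
```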

jbrockmendel · 2019-11-25T22:19:14Z

pandas/core/construction.py

+
+        elif inferred_dtype == "integer":
+            return IntegerArray._from_sequence(data, copy=copy)
+
        # TODO(BooleanArray): handle this type


not necessarily for this PR, but is it viable to handle BooleanArray here now?

jbrockmendel · 2019-11-25T22:20:30Z

small comment about a typo, otherwise LGTM

jorisvandenbossche · 2019-11-25T23:36:25Z

Can you add BooleanArray as well?
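A short sketch of what the requested BooleanArray handling looks like from the user's side, assuming a pandas release that includes the "Handle BooleanArray" commit from this PR (1.0 or later). The variable names are illustrative.

```python
import pandas as pd

# Boolean data with a missing value.
arr = pd.array([True, False, None])

# With boolean inference in place, the result uses the nullable "boolean" dtype.
print(type(arr).__name__, arr.dtype)
```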

jreback

lgtm. I think @jbrockmendel had some comments.

jreback · 2019-11-26T00:05:41Z

doc/source/whatsnew/v1.0.0.rst

+1. Sting data (including missing values) now returns a :class:`arrays.StringArray`.
+2. Integer data (including missing values) now returns a :class:`arrays.IntegerArray`.
+
+*pandas 0.25.x*


side issue, we are pretty inconsistent on showing the previous version in the whatsnew

jreback · 2019-11-26T00:10:39Z

Note that #29791 should refer to teaching infer_dtype about extension types. This PR doesn't actually do that; rather, it teaches pd.array to infer correctly given the results of infer_dtype. I actually think we should push this down and remove the logic from pd.array itself, but I suspect that would break things (e.g. we would need to process 'nullable-integer'). So even though this is a positive change, I think we need to open an issue about this (or revise this PR).

jbrockmendel · 2019-11-28T00:08:08Z

LGTM

jorisvandenbossche · 2019-11-28T07:32:14Z

pandas/core/construction.py

-    a mixture of valid integers and NA will return a floating-point
-    NumPy array.
+    If pandas does not infer a dedicated extension type a
+    :class:`arrays.PandasArray` is returned.


Should we mention that this can still change in the future? (eg that more types start to get inferred, so basically that you should not rely on the fact of pd.array returning a PandasArray when no dtype is specified)

jorisvandenbossche · 2019-11-28T07:35:13Z

#29791 should refer to teaching infer_dtype about extension types. This PR doesn't actually do that, rather you teach pd.array to correctly infer given the results of infer_dtype. I actually think we should push this down and remove the logic from pd.array

@jreback what would be the advantage of doing that? I personally don't really see the benefit of infer_dtype being able to return both "integer" and "integer-nullable" for the same integer data. In the end, whether you want to use the new dtype or a numpy dtype does not necessarily depend on the data, but on the context in which it is used (e.g. pd.array wants to return the new dtypes, so pd.array can decide how to interpret "integer": as numpy int or as nullable int).
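The argument above can be sketched roughly as follows. `build_array` and its `prefer_nullable` flag are hypothetical names invented for this illustration; they are not part of pandas or of this PR. The point is only that one inferred label can map to different array types depending on the caller.

```python
import numpy as np
import pandas as pd
from pandas.api.types import infer_dtype


def build_array(data, prefer_nullable):
    """Hypothetical helper: the caller, not infer_dtype, picks the array type."""
    label = infer_dtype(data, skipna=True)
    if label == "integer":
        if prefer_nullable:
            # Context that wants the new extension dtypes (like pd.array).
            return pd.array(data, dtype="Int64")
        # Context that wants the older NumPy-backed behaviour.
        return np.asarray(data, dtype="int64")
    # Anything else is left as an object array in this sketch.
    return np.asarray(data, dtype=object)


data = [1, 2, 3]
nullable = build_array(data, prefer_nullable=True)
plain = build_array(data, prefer_nullable=False)
print(nullable.dtype, plain.dtype)
```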

jreback · 2019-11-28T11:23:05Z

you are missing the point
this PR is just a temporary work around - infer_dtype is used practically everywhere
this is a very specific case that you are trying to solve

jorisvandenbossche · 2019-11-28T12:13:54Z

Jeff, if you think I am missing your point, then try to explain it better (but I rather think we are just disagreeing).
In any case, it's not a discussion for this PR I think, so maybe you can open a new issue for this?

jreback · 2019-11-29T23:16:04Z

pandas/core/construction.py

@@ -270,7 +275,7 @@ def array(
        return cls._from_sequence(data, dtype=dtype, copy=copy)

    if dtype is None:
-        inferred_dtype = lib.infer_dtype(data, skipna=False)
+        inferred_dtype = lib.infer_dtype(data, skipna=True)
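A small sketch of why the `skipna` change in this hunk matters, using the public alias `pandas.api.types.infer_dtype` (the same function as `lib.infer_dtype`). The input mirrors the example in the `infer_dtype` docstring; the variable names are illustrative.

```python
import numpy as np
from pandas.api.types import infer_dtype

values = ["a", np.nan, "b"]

# With skipna=False the missing value counts as a separate type.
strict = infer_dtype(values, skipna=False)

# With skipna=True missing values are ignored during inference.
lenient = infer_dtype(values, skipna=True)

print(strict, lenient)
```

With `skipna=True`, data that contains missing values still produces a specific label, which `pd.array` can then map to an extension type.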


So my issue with this PR is that it duplicates a lot of logic that already lives here: https://github.com/pandas-dev/pandas/blob/master/pandas/_libs/lib.pyx#L1948, so now we have two places with slightly different ways of doing things.

This routine is slightly more 'high-level', but maybe_convert_objects is used far more internally. So how do we reconcile these things?

@jorisvandenbossche @TomAugspurger

I'm not familiar with maybe_convert_objects, but what's the duplicate logic between the two? At a glance, it seems like maybe_convert_objects is mixing two things:

1. type inference (the sole purpose of lib.infer_dtype)
2. array construction (which may be pd.array? Or move the core to some _libs method?).

Should we update maybe_infer_objects to use array internally? I don't have a feel for whether that's even possible, do you?

I do see the similarity in purpose though. They're both for taking potentially untyped things and converting them to a typed array.

Should we update maybe_infer_objects to use array internally? I don't have a feel for whether that's even possible, do you?

Probably not in the short term. Best guess for de-duplication for the discussed functions will look something like:

1. lib.maybe_convert_objects is made to back lib.infer_dtype
   * Note: lib.maybe_convert_objects involves two runtime non-cython imports that I'd really like to find a way to avoid (one will be easy, the other very much not)
2. pd.array calls maybe_convert_objects instead of infer_dtype
   * More generally, many places where we call infer_dtype followed by casting can be replaced to be one-pass instead of two-pass.
3. Series constructor is backed by pd.array
4. Index.__new__ is backed by pd.array (we'd need something like RangeArray first)

TomAugspurger · 2019-12-02T12:23:55Z

Fixed the docstring and merged master. I can open an issue about deduplicating array, maybe_convert_objects, and infer_dtype. I think that can be a follow-up though, as this PR really only changes ~10 LOC, despite the large diff :)

jreback · 2019-12-02T12:42:34Z

right not trying to cause an issue :->

just wanting to clarify the apis & implementation we have for internal (maybe_convert_*) and external (pd.array) that need to be cleanly separated and useful / documented etc.

opening an issue would be great.

jreback

lgtm. ping on green.

TomAugspurger · 2019-12-02T16:39:07Z

All green. #29973 for the duplicate functionality between convert_objects and array.

jreback · 2019-12-02T17:38:53Z

thanks @TomAugspurger

TomAugspurger · 2019-12-02T17:42:38Z

Thanks!


API: Infer extension types in array

3313f23

* string * integer

TomAugspurger added Dtype Conversions Unexpected or buggy dtype conversions ExtensionArray Extending pandas with custom dtypes or arrays. labels Nov 22, 2019

TomAugspurger added this to the 1.0 milestone Nov 22, 2019

jbrockmendel reviewed Nov 22, 2019

TomAugspurger commented Nov 22, 2019

jbrockmendel reviewed Nov 22, 2019

TomAugspurger added 2 commits November 22, 2019 13:51

update docstring

dd02d69

remove mixed-string

5a9c306

TomAugspurger force-pushed the infer-ea branch from f8d4e70 to 5a9c306 Compare November 22, 2019 20:52

TomAugspurger added 8 commits November 25, 2019 06:09

skipna=True

e3ba846

Merge remote-tracking branch 'upstream/master' into infer-ea

8d6f79b

update new test

e055ada

reduce

ad43c3a

32 bit, doc

77c5d3f

update

0f89f47

fix docstring

4e08fd2

reorganize

bddce9b

jbrockmendel reviewed Nov 25, 2019

jreback approved these changes Nov 26, 2019

TomAugspurger mentioned this pull request Nov 26, 2019

Use an enum for infer_dtype return values? #29868

Open

TomAugspurger added 3 commits November 27, 2019 11:53

Merge remote-tracking branch 'upstream/master' into infer-ea

f63e0ef

Handle BooleanArray

372ac06

Merge remote-tracking branch 'upstream/master' into infer-ea

799dcce

jorisvandenbossche approved these changes Nov 28, 2019

jreback requested changes Nov 29, 2019

TomAugspurger added 2 commits December 2, 2019 06:10

Merge remote-tracking branch 'upstream/master' into infer-ea

b6082d1

update docstring

d0f3082

jreback approved these changes Dec 2, 2019

TomAugspurger mentioned this pull request Dec 2, 2019

REF: Deduplicate array, infer_dtype, and maybe_convert_objects #29973

Closed

jreback merged commit 83812e1 into pandas-dev:master Dec 2, 2019

proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019

API: Infer extension types in array (pandas-dev#29799)

feb1e30

proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019

API: Infer extension types in array (pandas-dev#29799)

a80de2b

TomAugspurger deleted the infer-ea branch November 17, 2020 18:07

