-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
API: Infer extension types in array #29799
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
* string * integer
|
||
s = pd.Series([1, 2, np.nan], dtype="Int64") | ||
s | ||
Currently :meth:`pandas.array` and :meth:`pandas.Series` use different |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is my attempt to explain the inconsistency between pd.array
and pd.Series
we're introducing. It's not ideal, but I think it's the right behavior for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Eventually we want to share code between these (and ideally also the Index constructor), right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the issue (#29791), I mentioned for that reason the idea of a keyword to control this preference between old or new dtypes.
(but we can also introduce that at the moment we want to share code if that seems useful then)
f8d4e70
to
5a9c306
Compare
CI is passing now. |
doc/source/whatsnew/v1.0.0.rst
Outdated
|
||
:meth:`pandas.array` now infers pandas' new extension types in several cases (:issue:`29791`): | ||
|
||
1. Sting data (including missing values) now returns a :class:`arrays.StringArray`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sting -> String
pandas/core/construction.py
Outdated
|
||
elif inferred_dtype == "integer": | ||
return IntegerArray._from_sequence(data, copy=copy) | ||
|
||
# TODO(BooleanArray): handle this type |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not necessarily for this PR, but is it viable to handle BooleanArray here now?
small comment about a typo, otherwise LGTM |
Can you add BooleanArray as well? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm. I think @jbrockmendel had some comments.
1. Sting data (including missing values) now returns a :class:`arrays.StringArray`. | ||
2. Integer data (including missing values) now returns a :class:`arrays.IntegerArray`. | ||
|
||
*pandas 0.25.x* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
side issue, we are pretty inconsistent on showing the previous version in the whatsnew
note that in #29791 should refer to teaching |
LGTM |
a mixture of valid integers and NA will return a floating-point | ||
NumPy array. | ||
If pandas does not infer a dedicated extension type a | ||
:class:`arrays.PandasArray` is returned. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we mention that this can still change in the future? (eg that more types start to get inferred, so basically that you should not rely on the fact of pd.array
returning a PandasArray when no dtype is specified)
@jreback what would be the advantage of doing that? I personally don't really see the benefit of |
you are missing the point |
Jeff, if you think I am missing your point, then try to explain it better (but I rather think we are just disagreeing). |
@@ -270,7 +275,7 @@ def array( | |||
return cls._from_sequence(data, dtype=dtype, copy=copy) | |||
|
|||
if dtype is None: | |||
inferred_dtype = lib.infer_dtype(data, skipna=False) | |||
inferred_dtype = lib.infer_dtype(data, skipna=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so my issue with this PR is that is duplicating a lot of logic that is already held here: https://github.com/pandas-dev/pandas/blob/master/pandas/_libs/lib.pyx#L1948, so now we have 2 places with slightly different ways of doing things.
This routine is slightly more 'high-level', but myabe_convert_objects is way more used internally. So how to reconcile these things?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not familiar with maybe_convert_objects
, but what's the duplicate logic between the two? At a glance, it seems like maybe_convert_objects
is mixing two things
- type inference (the sole purpose of
lib.infer_dtype
) - array construction (which may be
pd.array
? Or move the core to some_libs
method?).
Should we update maybe_infer_objects
to use array
internally? I don't have a feel for whether that's even possible, do you?
I do see the similarity in purpose though. They're both for taking potentially untyped things and converting them to a typed array.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we update maybe_infer_objects to use array internally? I don't have a feel for whether that's even possible, do you?
Probably not in the short term. Best guess for de-duplication for the discussed functions will look something like:
lib.maybe_convert_objects
is made to backlib.infer_dtype
- Note:
lib.maybe_convert_objects
involves two runtime non-cython imports that I'd really like to find a way to avoid (one will be easy, the other very much not)
- Note:
pd.array
callsmaybe_convert_objects
instead ofinfer_dtype
- More generally, many places where we call infer_dtype followed by casting can be replaced to be one-pass instead of two-pass.
Series
constructor is backed bypd.array
Index.__new__
is backed bypd.array
(we'd need something likeRangeArray
first)
Fixed the docstring & merging master. I can open an issue about deduplicating |
right not trying to cause an issue :-> just wanting to clarify the apis & implementation we have for internal (maybe_convert_*) and external (pd.array) that need to be cleanly separated and useful / documented etc. opening an issue would be great. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm. ping on green.,
All green. #29973 for the duplicate functionality between |
thanks @TomAugspurger |
Thanks!
…On Mon, Dec 2, 2019 at 11:38 AM Jeff Reback ***@***.***> wrote:
thanks @TomAugspurger <https://github.com/TomAugspurger>
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#29799?email_source=notifications&email_token=AAKAOIU6GHJ3GX4CKEBVFFTQWVB25A5CNFSM4JQVETG2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEFUI4IY#issuecomment-560500259>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAKAOIUPLDBTRWAHMV4JKXTQWVB25ANCNFSM4JQVETGQ>
.
|
Closes #29791 (though we'll want to add BooleanArray).