Have pd.array infer new extension types #29791

Closed
TomAugspurger opened this issue Nov 22, 2019 · 2 comments · Fixed by #29799
Labels
API Design Constructors Series/DataFrame/Index/pd.array Constructors ExtensionArray Extending pandas with custom dtypes or arrays. Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
Milestone

Comments

@TomAugspurger
Contributor

Currently pd.array sometimes requires an explicit dtype=... to return one of our extension arrays (it already infers Period, Datetime, and Interval).

This proposal is to have it infer the extension type for

  • strings -> StringArray
  • boolean -> BooleanArray
  • integer -> IntegerArray

All of these currently return PandasArray.

Concretely, we'll need to teach infer_dtype not to return 'mixed' for a mix of strings / booleans and NA values, similar to how it handles integer-na:

In [27]: lib.infer_dtype([True, None], skipna=False)
Out[27]: 'mixed'

In [28]: lib.infer_dtype(['a', None], skipna=False)
Out[28]: 'mixed'

In [29]: lib.infer_dtype([0, np.nan], skipna=False)
Out[29]: 'integer-na'

and then handle those in array.
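
A quick way to check the proposed behavior, as a sketch only: it assumes a pandas version in which this proposal has landed (the issue is marked as fixed by #29799 and targeted at the 1.0 milestone). The variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Assumes pandas >= 1.0, where pd.array infers the nullable extension dtypes
# instead of falling back to a NumPy-backed object array.
strings = pd.array(["a", None])
booleans = pd.array([True, None])
integers = pd.array([1, 2, np.nan])

# Print the array class and dtype that pd.array picked for each input.
for arr in (strings, booleans, integers):
    print(type(arr).__name__, arr.dtype)
```

Passing an explicit dtype=... still overrides the inference.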

@TomAugspurger TomAugspurger added this to the 1.0 milestone Nov 22, 2019
@TomAugspurger TomAugspurger added API Design ExtensionArray Extending pandas with custom dtypes or arrays. Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate labels Nov 22, 2019
@jorisvandenbossche
Member

> Concretely, we'll need to teach infer_dtype how not to infer mixed for a mix of strings / booleans and NA values, similar to how it handles integer-na

Shouldn't we infer with skipna=True? (Since all of those new dtypes can hold missing values, shouldn't we ignore missing values when inferring?)
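
To make the skipna idea concrete, here is a toy sketch in plain Python. It is not pandas' lib.infer_dtype: the helpers is_missing and infer_kind are hypothetical, and treating only None and float NaN as missing is a simplifying assumption.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing (a simplification)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def infer_kind(values, skipna=True):
    """Hypothetical toy inference; NOT pandas' lib.infer_dtype."""
    if skipna:
        # Drop missing values before looking at the element types.
        values = [v for v in values if not is_missing(v)]
    if not values:
        return "empty"
    # Exact type match, so that True is not counted as an integer.
    for name, typ in (("boolean", bool), ("integer", int), ("string", str)):
        if all(type(v) is typ for v in values):
            return name
    return "mixed"


print(infer_kind([True, None]))                # boolean
print(infer_kind(["a", None]))                 # string
print(infer_kind([0, float("nan")]))           # integer
print(infer_kind([True, None], skipna=False))  # mixed
```

With skipna=True the missing values never reach the type check, so no separate "boolean-na" or "string-na" result would be needed in this toy version.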

Another thought: should we add a keyword for this? (something like use_new_dtypes=True but with a potentially better name?) So if we would like to start using pd.array(..) more internally for array creation, we can still turn it off if needed.

@jbrockmendel jbrockmendel added the Constructors Series/DataFrame/Index/pd.array Constructors label Nov 24, 2019
@batterseapower
Contributor

If I might ask, what was the reasoning behind using the extension dtypes even if the input does not contain nulls? e.g. in 1.1.3 now str(pd.array([True, False]).dtype) == 'boolean', which means .values can't necessarily be used with numpy/numba since it's an extension array which those packages will not understand. Sorry if this has been discussed elsewhere but I couldn't find it: the changelog describes the change but doesn't really explain it.
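
For readers who hit the same interop problem, two possible workarounds are sketched below. This is not a maintainer answer, and it assumes a pandas version with the nullable dtypes (1.0 or later); the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Option 1: request a NumPy dtype explicitly, which bypasses the
# extension-type inference and yields a NumPy-backed array.
explicit = pd.array([True, False], dtype=bool)
explicit_values = np.asarray(explicit)

# Option 2: convert an inferred nullable array to NumPy.
# This works here because the input contains no missing values.
inferred = pd.array([True, False])
as_numpy = inferred.to_numpy(dtype=bool)

print(explicit_values.dtype, as_numpy.dtype)
```

If the data can contain missing values, to_numpy(dtype=bool) needs an na_value to fill them with, so the second option is only a drop-in replacement for null-free input.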


4 participants