DOC: Add info on dtype strings #30590

Dr-Irv · 2019-12-31T22:44:06Z

Problem description

I've been studying the new string, boolean and Intxx dtypes and think it would be worthwhile to add something about the strings that you are allowed to use with extension arrays in specifying the dtypes. It could be an additional column in the dtypes table here:
https://dev.pandas.io/docs/getting_started/basics.html#dtypes

I think the following table is correct:

| Data Type | Array | Possible Strings |
|---|---|---|
| `DatetimeTZDtype` | `DatetimeArray` | `'datetime64[ns, <tz>]'` |
| `CategoricalDtype` | `Categorical` | `'category'` |
| `PeriodDtype` | `PeriodArray` | `'period[<freq>]'`, `'Period[<freq>]'` |
| `SparseDtype` | `SparseArray` | `'Sparse'`, `'Sparse[int]'`, `'Sparse[int32, 0]'`, `'Sparse[int64, 0]'`, `'Sparse[float64, nan]'`, `'Sparse[float32, nan]'` |
| `IntervalDtype` | `IntervalArray` | `'interval'`, `'Interval'`, `'Interval[<np.numeric>]'`, `'Interval[datetime64[ns, <tz>]]'`, `'Interval[timedelta64[<freq>]]'` |
| `Int64Dtype` (and others) | `IntegerArray` | `'Int8'`, `'Int16'`, `'Int32'`, `'Int64'`, `'UInt8'`, `'UInt16'`, `'UInt32'`, `'UInt64'` |
| `StringDtype` | `StringArray` | `'string'` |
| `BooleanDtype` | `BooleanArray` | `'boolean'` |

I also think we may want to make it clear that if you specify a string not in that table, it needs to be a string acceptable as a numpy dtype.
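A minimal sketch of that lookup rule, assuming pandas 1.0 or later: `pd.api.types.pandas_dtype` resolves the strings from the table to extension dtypes and hands anything else to NumPy. The string `"not_a_dtype"` is a made-up name used only to trigger the failure path.

```python
import numpy as np
from pandas.api.types import is_extension_array_dtype, pandas_dtype

# One string per row of the table above; <tz> and <freq> are filled in
# with example values ("UTC" and "D").
extension_aliases = [
    "datetime64[ns, UTC]",
    "category",
    "period[D]",
    "Sparse",
    "interval",
    "Int64",
    "string",
    "boolean",
]

# Each alias should resolve to a pandas extension dtype.
resolved = {alias: pandas_dtype(alias) for alias in extension_aliases}
all_extension = all(is_extension_array_dtype(d) for d in resolved.values())

# A string that is not a pandas alias falls through to NumPy.
numpy_fallback = pandas_dtype("float32")

# A string that neither pandas nor NumPy understands is rejected.
try:
    pandas_dtype("not_a_dtype")  # hypothetical, deliberately invalid
    unknown_rejected = False
except TypeError:
    unknown_rejected = True

for alias, dtype in resolved.items():
    print(f"{alias!r:24} -> {type(dtype).__name__}")
```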

If people like @TomAugspurger and @jorisvandenbossche think this is useful, I'll add a column to that table in the docs (or maybe have to use a separate table because of the length of the last column above).

Also, should we consider allowing 'Boolean' and 'String' and 'Category', i.e. type names with a leading capital letter? We're inconsistent in terms of what case is allowed in different places for the strings representing dtypes (see period/Period and interval/Interval)
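The case inconsistency can be checked directly. This is a sketch, assuming pandas 1.0 or later; `is_accepted` is a helper written for this example, not a pandas function. The result for the capitalized names is printed rather than asserted, since it may depend on the installed pandas and NumPy versions.

```python
from pandas.api.types import pandas_dtype


def is_accepted(alias):
    """Return True if pandas_dtype can parse the given string."""
    try:
        pandas_dtype(alias)
    except (TypeError, ValueError):
        return False
    return True


# Both spellings are accepted for period and interval dtypes.
same_period = pandas_dtype("period[D]") == pandas_dtype("Period[D]")
same_interval = pandas_dtype("interval") == pandas_dtype("Interval")

# Report whether the capitalized forms proposed above are accepted.
capitalized = {a: is_accepted(a) for a in ["Boolean", "String", "Category"]}
print(capitalized)
```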


TomAugspurger · 2019-12-31T23:13:32Z

3rd party libraries can register dtype aliases as well, so we can’t provide a comprehensive list. Though we can document the ones we provide.
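To illustrate why the list cannot be comprehensive, here is a rough sketch of how a third-party library might register its own alias with `pandas.api.extensions.register_extension_dtype`. `DemoDtype` and the alias `"demo_alias"` are hypothetical names invented for this example, and the dtype is deliberately incomplete: it has no backing array.

```python
from pandas.api.extensions import ExtensionDtype, register_extension_dtype
from pandas.api.types import pandas_dtype


@register_extension_dtype
class DemoDtype(ExtensionDtype):
    """Hypothetical third-party dtype, used only to show alias registration."""

    name = "demo_alias"  # the string alias that pandas will recognise
    type = object        # scalar type of the elements

    @classmethod
    def construct_array_type(cls):
        # A real dtype would return its ExtensionArray subclass here.
        raise NotImplementedError("sketch only: no backing array")


# After registration, the alias resolves like any built-in dtype string.
resolved = pandas_dtype("demo_alias")
print(type(resolved).__name__)
```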

Regarding adding Boolean, I’d maybe prefer going the other way: adding integer, integer32, etc., so that we consistently use the lowercase, spelled-out form.

TomAugspurger · 2019-12-31T23:15:05Z

Does Sparse[int, 1] actually work? I thought we didn’t parse the fill value.

Dr-Irv · 2020-01-01T17:17:32Z

> 3rd party libraries can register dtype aliases as well, so we can’t provide a comprehensive list. Though we can document the ones we provide.

Yes, that is my proposal.

> Regarding adding Boolean, I’d maybe prefer going the other way: adding integer, integer32, etc., so that we consistently use the lowercase, spelled-out form.

Only issue there is that int, int32, etc., are the numpy dtypes, so that could create confusion. That's why I suggested using the upper case ones.
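A short sketch of the distinction being described, assuming pandas 1.0 or later: the lowercase string selects the NumPy dtype, while the capitalized string selects the nullable extension dtype, which can hold missing values.

```python
import numpy as np
import pandas as pd

# Lowercase "int64" is the NumPy dtype and cannot represent missing values.
numpy_backed = pd.Series([1, 2], dtype="int64")

# Capitalized "Int64" is the pandas nullable integer extension dtype.
nullable = pd.Series([1, None], dtype="Int64")

print(numpy_backed.dtype, type(numpy_backed.dtype).__name__)
print(nullable.dtype, type(nullable.dtype).__name__)
print(nullable.isna().tolist())
```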

TomAugspurger · 2020-01-01T17:42:32Z

There is also bool -> boolean, str -> string. We can consistently use the spelled out version, or the capitalized version. My preference is for spelled out.


Dr-Irv · 2020-01-02T15:16:03Z

> Does Sparse[int, 1] actually work? I thought we didn’t parse the fill value.

Sparse[int, 1] doesn't work, but I didn't put it in the list. However, that list is incomplete, as you can specify Sparse[<dtype>, <fillvalue>] for valid combinations of <dtype> and <fillvalue>. E.g., Sparse[bool, False] works. So the docstring inside the code at pd.core.arrays.sparse.dtype.SparseDtype.construct_from_string is incorrect.

Seems like @jorisvandenbossche wrote that docstring. Will wait for him to see if the internal docs there should be updated.

Dr-Irv · 2020-01-02T15:18:10Z

> There is also bool -> boolean, str -> string. We can consistently use the spelled out version, or the capitalized version. My preference is for spelled out.

So does that mean we change from Int64, Int32, etc. to integer64 and integer32 in the code? Not sure that is a good idea.

Or allow both but only document the spelled out lowercase versions?

TomAugspurger · 2020-01-02T16:13:48Z

Both, I would think. The one I'm not sure about is UInt. Is uinteger64 weird, since it's a mix of abbreviations and spelled out?

TomAugspurger · 2020-01-02T16:17:41Z

Back to the original issue, adding an alias column to that table is a great idea. We also have https://dev.pandas.io/docs/reference/arrays.html.

We don't allow specifying the fill value for SparseDtype, just the dtype:

In [53]: pd.api.types.pandas_dtype("Sparse[int]")
Out[53]: Sparse[int64, 0]

In [54]: pd.api.types.pandas_dtype("Sparse[int, 1]")
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-54-8824dda6d2f2> in <module>
----> 1 pd.api.types.pandas_dtype("Sparse[int, 1]")

~/sandbox/pandas/pandas/core/dtypes/common.py in pandas_dtype(dtype)
   1870     # raise a consistent TypeError if failed
   1871     try:
-> 1872         npdtype = np.dtype(dtype)
   1873     except SyntaxError:
   1874         # np.dtype uses `eval` which can raise SyntaxError

TypeError: data type "Sparse[int, 1]" not understood

Dr-Irv · 2020-01-02T16:25:49Z

> Back to the original issue, adding an alias column to that table is a great idea. We also have https://dev.pandas.io/docs/reference/arrays.html.

So I will add the docs using the current implementation and postpone making changes with respect to the names.

> We don't allow specifying the fill value for SparseDtype, just the dtype

Yes, but Sparse[int32, 0] and Sparse[int64, 0] do work:

In [14]: pd.api.types.pandas_dtype("Sparse[int32, 0]")
Out[14]: Sparse[int32, 0]

In [15]: pd.api.types.pandas_dtype("Sparse[int64, 0]")
Out[15]: Sparse[int64, 0]
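The behaviour discussed in the last few comments can be summarised in a sketch, assuming pandas 1.0 or later. `try_parse` is a helper written for this example. A fill value that matches the default for the subtype is accepted in the string, a non-default one is rejected, and constructing `pd.SparseDtype` directly is the way to get a non-default fill value.

```python
import pandas as pd
from pandas.api.types import pandas_dtype


def try_parse(alias):
    """Return the parsed dtype, or None if the string is rejected."""
    try:
        return pandas_dtype(alias)
    except TypeError:
        return None


# Subtype only: the default fill value for the subtype is used.
default_fill = try_parse("Sparse[int64]")

# A fill value equal to the default is accepted in the string form.
explicit_default = try_parse("Sparse[int64, 0]")

# A non-default fill value in the string is rejected.
non_default = try_parse("Sparse[int, 1]")

# For a non-default fill value, build the dtype object directly.
custom = pd.SparseDtype("int64", fill_value=1)

print(default_fill, explicit_default, non_default, custom)
```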

TomAugspurger · 2020-01-02T18:10:57Z

Hmm, I wouldn't document that. I don't know if it's deliberate.

Dr-Irv mentioned this issue Jan 2, 2020

DOC: Add strings for dtypes in basic.rst #30628

Merged

5 tasks

jreback added the Docs and Dtype Conversions labels Jan 3, 2020

jreback added this to the 1.0 milestone Jan 3, 2020

datapythonista closed this as completed in #30628 Jan 3, 2020
