DOC: Add info on dtype strings #30590

Closed
Dr-Irv opened this issue Dec 31, 2019 · 10 comments · Fixed by #30628
Labels: Docs, Dtype Conversions (Unexpected or buggy dtype conversions)

Comments

@Dr-Irv
Contributor

Dr-Irv commented Dec 31, 2019

Problem description

I've been studying the new string, boolean and Intxx dtypes and think it would be worthwhile to add something about the strings that you are allowed to use with extension arrays in specifying the dtypes. It could be an additional column in the dtypes table here:
https://dev.pandas.io/docs/getting_started/basics.html#dtypes

I think the following table is correct:

| Data Type | Array | Possible Strings |
|---|---|---|
| DatetimeTZDtype | DatetimeArray | 'datetime64[ns, <tz>]' |
| CategoricalDtype | Categorical | 'category' |
| PeriodDtype | PeriodArray | 'period[<freq>]' or 'Period[<freq>]' |
| SparseDtype | SparseArray | 'Sparse', 'Sparse[int]', 'Sparse[int32, 0]', 'Sparse[int64, 0]', 'Sparse[float64, nan]', 'Sparse[float32, nan]' |
| IntervalDtype | IntervalArray | 'interval', 'Interval', 'Interval[<np.numeric>]', 'Interval[datetime64[ns, <tz>]]', 'Interval[timedelta64[<freq>]]' |
| Int64Dtype (and others) | IntegerArray | 'Int8', 'Int16', 'Int32', 'Int64', 'UInt8', 'UInt16', 'UInt32', 'UInt64' |
| StringDtype | StringArray | 'string' |
| BooleanDtype | BooleanArray | 'boolean' |

I also think we may want to make it clear that if you specify a string not in that table, it needs to be a string acceptable as a numpy dtype.
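To make the proposal concrete, here is a minimal sketch of how a few of the aliases from the table resolve, and of the NumPy fallback. It assumes a pandas version that ships the string, boolean and nullable integer dtypes (1.0 or later). `pd.api.types.pandas_dtype` is the resolver used in the sessions elsewhere in this thread; the string `"not_a_dtype"` is made up for the example.

```python
import numpy as np
import pandas as pd

# Aliases taken from the table; each should resolve to a pandas extension dtype.
aliases = ["category", "Int64", "UInt8", "string", "boolean"]
resolved = {alias: pd.api.types.pandas_dtype(alias) for alias in aliases}
for alias, dtype in resolved.items():
    print(f"{alias!r} -> {type(dtype).__name__}")

# A string that is not a pandas alias falls through to NumPy.
numpy_dtype = pd.api.types.pandas_dtype("float32")
print(numpy_dtype)

# A string that neither pandas nor NumPy understands is rejected.
try:
    pd.api.types.pandas_dtype("not_a_dtype")  # made-up string, for illustration
    rejected = False
except TypeError as err:
    rejected = True
    print("TypeError:", err)

# The same alias strings can be passed wherever a dtype is expected.
s = pd.Series([1, 2, None], dtype="Int64")
print(s.dtype, s.isna().tolist())
```

Note that `"Int64"` (capital I) gives the nullable extension dtype, while `"int64"` would fall through to the NumPy dtype, which is why the case of these strings matters.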

If people like @TomAugspurger and @jorisvandenbossche think this is useful, I'll add a column to that table in the docs (or maybe have to use a separate table because of the length of the last column above).

Also, should we consider allowing 'Boolean' and 'String' and 'Category', i.e. type names with a leading capital letter? We're inconsistent in terms of what case is allowed in different places for the strings representing dtypes (see period/Period and interval/Interval)
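A quick way to see the case inconsistency is to probe both spellings of each name. This is only a sketch: `resolves` is a hypothetical helper written for this example, and the results for the capitalized forms are printed rather than asserted because they may depend on the pandas and NumPy versions installed.

```python
import pandas as pd


def resolves(alias: str) -> bool:
    """Return True if pandas_dtype accepts the string, False if it raises TypeError."""
    try:
        pd.api.types.pandas_dtype(alias)
    except TypeError:
        return False
    return True


# Both spellings of the period alias give the same dtype.
lower_period = pd.api.types.pandas_dtype("period[D]")
upper_period = pd.api.types.pandas_dtype("Period[D]")
print(lower_period == upper_period)

# Compare lowercase and capitalized spellings of the other names.
for alias in ["interval", "Interval", "boolean", "Boolean",
              "string", "String", "category", "Category"]:
    print(f"{alias!r}: {resolves(alias)}")
```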

@TomAugspurger
Contributor

3rd party libraries can register dtype aliases as well, so we can’t provide a comprehensive list. Though we can document the ones we provide.
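For context, registration by a third-party library might look roughly like the sketch below. `HypotheticalDtype` and its alias `"hypothetical"` are invented for this example and do not correspond to any real library. The sketch relies on the default `ExtensionDtype.construct_from_string`, which accepts a string equal to the class's `name`.

```python
import pandas as pd
from pandas.api.extensions import ExtensionDtype, register_extension_dtype


@register_extension_dtype
class HypotheticalDtype(ExtensionDtype):
    """Made-up dtype that exists only to show alias registration."""

    name = "hypothetical"  # the string alias pandas will recognise
    type = object          # scalar type of the elements

    @classmethod
    def construct_array_type(cls):
        # A real extension would return its ExtensionArray subclass here.
        raise NotImplementedError("no array type in this sketch")


# Once registered, the alias resolves like any built-in one.
dtype = pd.api.types.pandas_dtype("hypothetical")
print(type(dtype).__name__)
```

Because any installed package can do this at import time, the set of valid strings is open-ended, which is why only the aliases pandas itself provides can be listed.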

Wrt adding Boolean, I’d maybe prefer going the other way: adding integer, integer32, etc., so that we consistently use the lowercase spelled-out form.

@TomAugspurger
Contributor

Does Sparse[int, 1] actually work? I thought we didn’t parse the fill value.

@Dr-Irv
Contributor Author

Dr-Irv commented Jan 1, 2020

> 3rd party libraries can register dtype aliases as well, so we can’t provide a comprehensive list. Though we can document the ones we provide.

Yes, that is my proposal.

> Wrt adding Boolean, I’d maybe prefer going the other way: adding integer, integer32, etc., so that we consistently use the lowercase spelled-out form.

Only issue there is that int, int32, etc., are the numpy dtypes, so that could create confusion. That's why I suggested using the upper case ones.

@TomAugspurger
Contributor

TomAugspurger commented Jan 1, 2020 via email

@Dr-Irv
Contributor Author

Dr-Irv commented Jan 2, 2020

> Does Sparse[int, 1] actually work? I thought we didn’t parse the fill value.

Sparse[int, 1] doesn't work, but I didn't put it in the list. However, that list is incomplete, as you can specify Sparse[<dtype>, <fillvalue>] for valid combinations of <dtype> and <fillvalue>. E.g., Sparse[bool, False] works. So the docstring inside the code at pd.core.arrays.sparse.dtype.SparseDtype.construct_from_string is incorrect.

Seems like @jorisvandenbossche wrote that docstring. Will wait for him to see if the internal docs there should be updated.

@Dr-Irv
Contributor Author

Dr-Irv commented Jan 2, 2020

> There is also bool -> boolean, str -> string. We can consistently use the spelled out version, or the capitalized version. My preference is for spelled out.

So does that mean we change from Int64, Int32, etc. to integer64 and integer32 in the code? Not sure that is a good idea.

Or allow both but only document the spelled out lowercase versions?

@TomAugspurger
Contributor

Both, I would think. The one I'm not sure about is UInt. Is uinteger64 weird, since it's a mix of abbreviation and spelled-out form?

@TomAugspurger
Contributor

Back to the original issue, adding an alias column to that table is a great idea. We also have https://dev.pandas.io/docs/reference/arrays.html.

We don't allow specifying the fill value for SparseDtype, just the dtype:

In [53]: pd.api.types.pandas_dtype("Sparse[int]")
Out[53]: Sparse[int64, 0]

In [54]: pd.api.types.pandas_dtype("Sparse[int, 1]")
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-54-8824dda6d2f2> in <module>
----> 1 pd.api.types.pandas_dtype("Sparse[int, 1]")

~/sandbox/pandas/pandas/core/dtypes/common.py in pandas_dtype(dtype)
   1870     # raise a consistent TypeError if failed
   1871     try:
-> 1872         npdtype = np.dtype(dtype)
   1873     except SyntaxError:
   1874         # np.dtype uses `eval` which can raise SyntaxError

TypeError: data type "Sparse[int, 1]" not understood

@Dr-Irv
Contributor Author

Dr-Irv commented Jan 2, 2020

> Back to the original issue, adding an alias column to that table is a great idea. We also have https://dev.pandas.io/docs/reference/arrays.html.

So I will add the docs using the current implementation and postpone making changes with respect to the names.

> We don't allow specifying the fill value for SparseDtype, just the dtype

Yes, but Sparse[int32, 0] and Sparse[int64, 0] do work:

In [14]: pd.api.types.pandas_dtype("Sparse[int32, 0]")
Out[14]: Sparse[int32, 0]

In [15]: pd.api.types.pandas_dtype("Sparse[int64, 0]")
Out[15]: Sparse[int64, 0]

@TomAugspurger
Contributor

Hmm, I wouldn't document that. I don't know if it's deliberate.
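For anyone who wants to reproduce the Sparse behaviour discussed in the last few comments, a sketch along these lines may help. It only restates what the sessions above show and makes no claim about whether parsing the default fill value is intended. Using the `pd.SparseDtype` constructor for a non-default fill value is a suggestion of this example, not something proposed in the thread.

```python
import pandas as pd

# A string without a fill value gets the default fill for its subtype.
from_alias = pd.api.types.pandas_dtype("Sparse[int]")
print(from_alias, from_alias.fill_value)

# Spelling out the default fill value parses, as reported above.
explicit_default = pd.api.types.pandas_dtype("Sparse[int64, 0]")
print(explicit_default)

# A non-default fill value in the string is rejected.
try:
    pd.api.types.pandas_dtype("Sparse[int, 1]")
    string_rejected = False
except TypeError:
    string_rejected = True
print(string_rejected)

# The constructor takes the fill value directly, so no string parsing is involved.
custom_fill = pd.SparseDtype("int64", fill_value=1)
print(custom_fill)
```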

@jreback added the Docs and Dtype Conversions (Unexpected or buggy dtype conversions) labels Jan 3, 2020
@jreback added this to the 1.0 milestone Jan 3, 2020