Commit 3a677ad

Skip test for old fastparquet
1 parent 1663941 commit 3a677ad

12 files changed, +245 -66 lines changed

doc/source/advanced.rst

+3 -1
@@ -638,9 +638,11 @@ and allows efficient indexing and storage of an index with a large number of dup
 
 .. ipython:: python
 
+   from pandas.api.types import CategoricalDtype
+
    df = pd.DataFrame({'A': np.arange(6),
                       'B': list('aabbca')})
-   df['B'] = df['B'].astype(pd.api.types.CategoricalDtype(list('cab')))
+   df['B'] = df['B'].astype(CategoricalDtype(list('cab')))
    df
    df.dtypes
    df.B.cat.categories

doc/source/api.rst

+4 -1
@@ -637,7 +637,10 @@ strings and apply several methods to it. These can be accessed like
 Categorical
 ~~~~~~~~~~~
 
-If the Series is of dtype ``category``, ``Series.cat`` can be used to change the the categorical
+.. autoclass:: api.types.CategoricalDtype
+    :members: categories, ordered
+
+If the Series is of dtype ``CategoricalDtype``, ``Series.cat`` can be used to change the categorical
 data. This accessor is similar to the ``Series.dt`` or ``Series.str`` and has the
 following usable methods and properties:

doc/source/categorical.rst

+30 -22
@@ -99,9 +99,11 @@ of :class:`~pd.api.types.CategoricalDtype`.
 
 .. ipython:: python
 
+   from pandas.api.types import CategoricalDtype
+
    s = pd.Series(["a", "b", "c", "a"])
-   cat_type = pd.api.types.CategoricalDtype(categories=["b", "c", "d"],
-                                            ordered=False)
+   cat_type = CategoricalDtype(categories=["b", "c", "d"],
+                               ordered=False)
    s_cat = s.astype(cat_type)
    s_cat
 
@@ -141,33 +143,40 @@ constructor to save the factorize step during normal constructor mode:
    splitter = np.random.choice([0,1], 5, p=[0.5,0.5])
    s = pd.Series(pd.Categorical.from_codes(splitter, categories=["train", "test"]))
 
+.. _categorical.categoricaldtype:
+
 CategoricalDtype
 ----------------
 
 .. versionchanged:: 0.21.0
 
-A categorical's type is fully described by 1.) its categories (an iterable with
-unique values and no missing values), and 2.) its orderedness (a boolean).
+A categorical's type is fully described by
+
+1. its categories: a sequence of unique values and no missing values
+2. its orderedness: a boolean
+
 This information can be stored in a :class:`~pandas.api.types.CategoricalDtype`.
 The ``categories`` argument is optional, which implies that the actual categories
 should be inferred from whatever is present in the data when the
 :class:`pandas.Categorical` is created.
 
 .. ipython:: python
 
-   pd.api.types.CategoricalDtype(['a', 'b', 'c'])
-   pd.api.types.CategoricalDtype(['a', 'b', 'c'], ordered=True)
-   pd.api.types.CategoricalDtype()
+   from pandas.api.types import CategoricalDtype
+
+   CategoricalDtype(['a', 'b', 'c'])
+   CategoricalDtype(['a', 'b', 'c'], ordered=True)
+   CategoricalDtype()
 
 A :class:`~pandas.api.types.CategoricalDtype` can be used in any place pandas
 expects a `dtype`. For example :func:`pandas.read_csv`,
-:func:`pandas.DataFrame.astype`, or the Series constructor.
+:func:`pandas.DataFrame.astype`, or in the Series constructor.
 
-As a convenience, you can use the string `'category'` in place of a
+As a convenience, you can use the string ``'category'`` in place of a
 :class:`~pandas.api.types.CategoricalDtype` when you want the default behavior of
 the categories being unordered, and equal to the set values present in the
-array. On other words, ``dtype='category'`` is equivalent to
-``dtype=pd.api.types.CategoricalDtype()``.
+array. In other words, ``dtype='category'`` is equivalent to
+``dtype=CategoricalDtype()``.
 
 Equality Semantics
 ~~~~~~~~~~~~~~~~~~
@@ -178,19 +187,20 @@ order of the ``categories`` is not considered
 
 .. ipython:: python
 
-   c1 = pd.api.types.CategoricalDtype(['a', 'b', 'c'], ordered=False)
+   c1 = CategoricalDtype(['a', 'b', 'c'], ordered=False)
+
    # Equal, since order is not considered when ordered=False
-   c1 == pd.api.types.CategoricalDtype(['b', 'c', 'a'], ordered=False)
+   c1 == CategoricalDtype(['b', 'c', 'a'], ordered=False)
+
    # Unequal, since the second CategoricalDtype is ordered
-   c1 == pd.api.types.CategoricalDtype(['a', 'b', 'c'], ordered=True)
+   c1 == CategoricalDtype(['a', 'b', 'c'], ordered=True)
 
 All instances of ``CategoricalDtype`` compare equal to the string ``'category'``
 
 .. ipython:: python
 
    c1 == 'category'
 
-
 .. warning::
 
    Since ``dtype='category'`` is essentially ``CategoricalDtype(None, False)``,
@@ -246,9 +256,7 @@ It's also possible to pass in the categories in a specific order:
 
 .. ipython:: python
 
-   s = pd.Series(list('babc')).astype(
-       pd.api.types.CategoricalDtype(list('abcd'))
-   )
+   s = pd.Series(list('babc')).astype(CategoricalDtype(list('abcd')))
    s
 
    # categories
@@ -362,7 +370,7 @@ meaning and certain operations are possible. If the categorical is unordered, ``
    s = pd.Series(pd.Categorical(["a","b","c","a"], ordered=False))
    s.sort_values(inplace=True)
    s = pd.Series(["a","b","c","a"]).astype(
-       pd.api.types.CategoricalDtype(ordered=True)
+       CategoricalDtype(ordered=True)
    )
    s.sort_values(inplace=True)
    s
@@ -464,13 +472,13 @@ categories or a categorical with any list-like object, will raise a TypeError.
 .. ipython:: python
 
    cat = pd.Series([1,2,3]).astype(
-       pd.api.types.CategoricalDtype([3, 2, 1], ordered=True)
+       CategoricalDtype([3, 2, 1], ordered=True)
    )
    cat_base = pd.Series([2,2,2]).astype(
-       pd.api.types.CategoricalDtype([3, 2, 1], ordered=True)
+       CategoricalDtype([3, 2, 1], ordered=True)
    )
    cat_base2 = pd.Series([2,2,2]).astype(
-       pd.api.types.CategoricalDtype(ordered=True)
+       CategoricalDtype(ordered=True)
    )
 
    cat
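
The equality rules documented in the `categorical.rst` hunks can be modelled in a few lines of plain Python. This is a toy sketch, not the pandas implementation: `DtypeModel` is a hypothetical stand-in for `CategoricalDtype`, it requires explicit categories, and it deliberately leaves out the `categories=None` case that the warning in the docs covers.

```python
class DtypeModel:
    """Toy model of CategoricalDtype equality; NOT the pandas class."""

    def __init__(self, categories, ordered=False):
        self.categories = tuple(categories)
        self.ordered = bool(ordered)

    def __eq__(self, other):
        # Every instance compares equal to the string 'category'.
        if isinstance(other, str):
            return other == "category"
        if not isinstance(other, DtypeModel):
            return NotImplemented
        # Differing orderedness is always unequal.
        if self.ordered != other.ordered:
            return False
        if self.ordered:
            # Ordered: the sequence of categories matters.
            return self.categories == other.categories
        # Unordered: only the set of categories matters.
        return set(self.categories) == set(other.categories)


c1 = DtypeModel("abc", ordered=False)
print(c1 == DtypeModel("bca", ordered=False))  # True
print(c1 == DtypeModel("abc", ordered=True))   # False
print(c1 == "category")                        # True
```

The three printed comparisons correspond to the three comparisons in the documentation example for `c1`.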

doc/source/whatsnew/v0.21.0.txt

+2 -2
@@ -10,6 +10,8 @@ users upgrade to this version.
 Highlights include:
 
 - Integration with `Apache Parquet <https://parquet.apache.org/>`__, including a new top-level :func:`read_parquet` and :func:`DataFrame.to_parquet` method, see :ref:`here <io.parquet>`.
+- New user-facing :class:`pandas.api.types.CategoricalDtype` for specifying
+  categoricals independent of the data (:issue:`14711`, :issue:`15078`)
 
 Check the :ref:`API Changes <whatsnew_0210.api_breaking>` and :ref:`deprecations <whatsnew_0210.deprecations>` before updating.

@@ -22,8 +24,6 @@ Check the :ref:`API Changes <whatsnew_0210.api_breaking>` and :ref:`deprecations
 New features
 ~~~~~~~~~~~~
 
-- New user-facing :class:`pandas.api.types.CategoricalDtype` for specifying
-  categoricals independent of the data (:issue:`14711`, :issue:`15078`)
 - Support for `PEP 519 -- Adding a file system path protocol
   <https://www.python.org/dev/peps/pep-0519/>`_ on most readers and writers (:issue:`13823`)
 - Added ``__fspath__`` method to :class:`~pandas.HDFStore`, :class:`~pandas.ExcelFile`,

pandas/core/categorical.py

+52 -14
@@ -202,6 +202,7 @@ class Categorical(PandasObject):
         categorical, read only.
     ordered : boolean
         Whether or not this Categorical is ordered.
+    dtype : CategoricalDtype
 
     Raises
     ------
@@ -248,17 +249,30 @@ class Categorical(PandasObject):
     __array_priority__ = 1000
     _typ = 'categorical'
 
-    def __init__(self, values, categories=None, ordered=False, fastpath=False):
+    def __init__(self, values, categories=None, ordered=None, dtype=None,
+                 fastpath=False):
+
+        if dtype is not None:
+            if categories is not None or ordered is not None:
+                raise ValueError("Cannot specify both `dtype` and `categories`"
+                                 " or `ordered`.")
+            categories = dtype.categories
+            ordered = dtype.ordered
+
+        if ordered is None:
+            ordered = False
 
         if fastpath:
-            self._dtype = CategoricalDtype(categories, ordered)
+            if dtype is None:
+                dtype = CategoricalDtype(categories, ordered)
             self._codes = coerce_indexer_dtype(values, categories)
+            self._dtype = dtype
             return
 
         # sanitize input
         if is_categorical_dtype(values):
 
-            # we are either a Series, CategoricalIndex or CategoricalDtype
+            # we are either a Series, CategoricalIndex
             if isinstance(values, (ABCSeries, ABCCategoricalIndex)):
                 values = values._values
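
The argument handling added at the top of `Categorical.__init__` can be read in isolation. The following is a minimal sketch of that logic only, assuming the `if ordered is None` check sits at the top level of the method (indentation is not recoverable from this page). `FakeDtype` and `resolve_args` are hypothetical names introduced for illustration and do not exist in pandas.

```python
from collections import namedtuple

# Hypothetical stand-in for CategoricalDtype; only the two attributes
# the constructor reads are modelled.
FakeDtype = namedtuple("FakeDtype", ["categories", "ordered"])


def resolve_args(categories=None, ordered=None, dtype=None):
    """Mirror the dtype/categories/ordered checks from the hunk above."""
    if dtype is not None:
        # dtype is mutually exclusive with the two legacy keywords.
        if categories is not None or ordered is not None:
            raise ValueError("Cannot specify both `dtype` and `categories`"
                             " or `ordered`.")
        categories = dtype.categories
        ordered = dtype.ordered
    if ordered is None:
        # None means "not given"; the effective default is still False.
        ordered = False
    return categories, ordered


print(resolve_args(categories=["a", "b"]))
print(resolve_args(dtype=FakeDtype(("x", "y"), True)))
```

Changing the default of `ordered` from `False` to `None` is what lets the check tell an explicit `ordered=False` apart from an omitted argument, so an explicit `ordered=False` combined with `dtype` also raises.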

@@ -308,7 +322,8 @@ def __init__(self, values, categories=None, ordered=False, fastpath=False):
                 raise NotImplementedError("> 1 ndim Categorical are not "
                                           "supported at this time")
 
-            dtype = CategoricalDtype(categories, ordered)
+            if dtype is None or isinstance(dtype, str):
+                dtype = CategoricalDtype(categories, ordered)
 
         else:
             # there were two ways if categories are present
@@ -320,7 +335,9 @@ def __init__(self, values, categories=None, ordered=False, fastpath=False):
 
             # make sure that we always have the same type here, no matter what
             # we get passed in
-            dtype = CategoricalDtype(categories, ordered)
+            if dtype is None or isinstance(dtype, str):
+                dtype = CategoricalDtype(categories, ordered)
+
             codes = _get_codes_for_values(values, dtype.categories)
 
             # TODO: check for old style usage. These warnings should be removes
@@ -496,16 +513,14 @@ def from_codes(cls, codes, categories, ordered=False):
             categorical. If not given, the resulting categorical will be
             unordered.
         """
-        from pandas import Index
-
         try:
             codes = np.asarray(codes, np.int64)
         except:
             raise ValueError(
                 "codes need to be convertible to an arrays of integers")
 
         # have to use the instance, not property
-        categories = cls._dtype._validate_categories(Index(categories))
+        categories = CategoricalDtype._validate_categories(categories)
 
         if len(codes) and (codes.max() >= len(categories) or codes.min() < -1):
             raise ValueError("codes need to be between -1 and "
@@ -558,13 +573,13 @@ def _set_categories(self, categories, fastpath=False):
 
         """
 
-        new = CategoricalDtype(categories, self.ordered, fastpath)
+        new_dtype = CategoricalDtype(categories, self.ordered, fastpath)
         if (not fastpath and self.dtype.categories is not None and
-                len(new.categories) != len(self.dtype.categories)):
+                len(new_dtype.categories) != len(self.dtype.categories)):
             raise ValueError("new categories need to have the same number of "
                              "items than the old categories!")
 
-        self._dtype = new
+        self._dtype = new_dtype
 
     def _codes_for_groupby(self, sort):
         """
@@ -606,6 +621,29 @@ def _codes_for_groupby(self, sort):
 
         return self.reorder_categories(cat.categories)
 
+    def _set_dtype(self, dtype):
+        """Internal method for directly updating the CategoricalDtype
+
+        Parameters
+        ----------
+        dtype : CategoricalDtype
+
+        Notes
+        -----
+        We don't do any validation here. It's assumed that the dtype is
+        a (valid) instance of `CategoricalDtype`.
+        """
+        # We want to convert old codes -> new codes *without* going to values
+        # [b, a, c, a, b, f]  | original dtype: [a, b, c, d]
+        # [1, 0, 2, 0, 1, .]  | original codes
+        # ---------------     | ----------
+        # [b, a, ., a, b, .]  | new dtype: [b, a, e]
+        # [0, 1, ., 1, 0, .]  |
+        mapping = dtype.categories.get_indexer_for(self.categories)
+        codes = mapping[self.codes]
+        codes[self.codes == -1] = -1
+        return type(self)(codes, dtype=dtype, fastpath=True)
+
     def set_ordered(self, value, inplace=False):
         """
         Sets the ordered attribute to the boolean value
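
The recoding step in `_set_dtype` maps old codes to new codes through the categories, without materialising the values. A pure-Python sketch of the same idea follows. The helper name `recode` is hypothetical, and a dictionary lookup stands in for the `get_indexer_for` call used in the diff.

```python
def recode(codes, old_categories, new_categories):
    """Translate integer codes from one category list to another."""
    # Position of every category in the new list.
    new_position = {cat: i for i, cat in enumerate(new_categories)}
    # mapping[old_code] -> new code, or -1 when the category was dropped.
    mapping = [new_position.get(cat, -1) for cat in old_categories]
    # -1 already means "missing" and must stay -1 instead of indexing mapping.
    return [-1 if code == -1 else mapping[code] for code in codes]


# Values b, a, c, a, b, <missing> encoded against categories [a, b, c, d].
old_codes = [1, 0, 2, 0, 1, -1]
new_codes = recode(old_codes, ["a", "b", "c", "d"], ["b", "a", "e"])
print(new_codes)  # [0, 1, -1, 1, 0, -1]
```

The explicit `-1` guard plays the role of `codes[self.codes == -1] = -1` in the diff: indexing with `-1` would otherwise silently pick the last entry of the mapping.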
@@ -619,9 +657,9 @@ def set_ordered(self, value, inplace=False):
            of this categorical with ordered set to the value
         """
         inplace = validate_bool_kwarg(inplace, 'inplace')
-        new = CategoricalDtype(self.categories, ordered=value)
+        new_dtype = CategoricalDtype(self.categories, ordered=value)
         cat = self if inplace else self.copy()
-        cat._dtype = new
+        cat._dtype = new_dtype
         if not inplace:
             return cat

@@ -1222,7 +1260,7 @@ def value_counts(self, dropna=True):
             count = bincount(np.where(mask, code, ncat))
             ix = np.append(ix, -1)
 
-        ix = self._constructor(ix, categories=cat, ordered=obj.ordered,
+        ix = self._constructor(ix, dtype=self.dtype,
                                fastpath=True)
 
         return Series(count, index=CategoricalIndex(ix), dtype='int64')

pandas/core/dtypes/common.py

+23 -4
@@ -692,19 +692,38 @@ def is_dtype_equal(source, target):
         return False
 
 
-def _is_dtype_union_equal(source, target):
+def is_dtype_union_equal(source, target):
     """
-    Check whether two arrays have compatible dtypes to do a unoin.
+    Check whether two arrays have compatible dtypes to do a union.
     numpy types are checked with ``is_dtype_equal``. Extension types are
     checked separately.
+
+    Parameters
+    ----------
+    source : The first dtype to compare
+    target : The second dtype to compare
+
+    Returns
+    -------
+    boolean : Whether or not the two dtypes are equal.
+
+    >>> is_dtype_union_equal("int", int)
+    True
+
+    >>> is_dtype_union_equal(CategoricalDtype(['a', 'b']),
+    ...                      CategoricalDtype(['b', 'c']))
+    True
+
+    >>> is_dtype_union_equal(CategoricalDtype(['a', 'b']),
+    ...                      CategoricalDtype(['b', 'c'], ordered=True))
+    False
     """
     source = _get_dtype(source)
     target = _get_dtype(target)
     if is_categorical_dtype(source) and is_categorical_dtype(target):
         # ordered False for both
         return source.ordered is target.ordered
-    else:
-        return is_dtype_equal(source, target)
+    return is_dtype_equal(source, target)
 
 
 def is_any_int_dtype(arr_or_dtype):
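
The branch structure of `is_dtype_union_equal` is small enough to model without pandas. In this sketch `CatDtype` and `union_compatible` are hypothetical names, and plain `==` stands in for `is_dtype_equal`; it illustrates the rule in the diff and is not the library code.

```python
from collections import namedtuple

# Hypothetical stand-in for CategoricalDtype, used only for this sketch.
CatDtype = namedtuple("CatDtype", ["categories", "ordered"])


def union_compatible(source, target):
    """Model of the two branches in ``is_dtype_union_equal``."""
    if isinstance(source, CatDtype) and isinstance(target, CatDtype):
        # Only orderedness is compared; the categories may differ.
        return source.ordered is target.ordered
    # Plain equality stands in for ``is_dtype_equal`` here.
    return source == target


print(union_compatible(CatDtype(("a", "b"), False), CatDtype(("b", "c"), False)))
print(union_compatible(CatDtype(("a", "b"), False), CatDtype(("b", "c"), True)))
print(union_compatible(int, int))
```

Two categorical dtypes with different categories are treated as union-compatible as long as their orderedness matches, which is the difference from a strict dtype equality check.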
