Commit edb0927

Merge pull request #9622 from jreback/cat_fix
API: deprecate setting of .ordered directly (GH9347, GH9190)
2 parents 8c0658d + 87fec5b commit edb0927

16 files changed

+456
-124
lines changed

doc/source/api.rst

+2
Original file line number | Diff line number | Diff line change
@@ -585,6 +585,8 @@ following usable methods and properties (all available as ``Series.cat.<method_o
585585
Categorical.remove_categories
586586
Categorical.remove_unused_categories
587587
Categorical.set_categories
588+
Categorical.as_ordered
589+
Categorical.as_unordered
588590
Categorical.codes
589591

590592
To create a Series of dtype ``category``, use ``cat = s.astype("category")``.
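
The two accessor methods added to the API listing above can be tried with a short sketch like the following. It assumes pandas is imported as ``pd`` and a release that still provides ``Series.cat.as_ordered`` and ``Series.cat.as_unordered``; the variable names are illustrative only and do not come from the commit.

```python
import pandas as pd

# A plain category Series; no ordering is requested here.
s = pd.Series(["a", "b", "c", "a"]).astype("category")

# as_ordered() returns a new Series; the original is left untouched.
s_ordered = s.cat.as_ordered()

# as_unordered() goes the other way, again returning a new object.
s_unordered = s_ordered.cat.as_unordered()

print(s.cat.ordered, s_ordered.cat.ordered, s_unordered.cat.ordered)
```

Because both methods return a new object, the result has to be assigned back (``s = s.cat.as_ordered()``) if the ordered version is the one you want to keep.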

doc/source/categorical.rst

+29-19
Original file line number | Diff line number | Diff line change
@@ -90,8 +90,6 @@ By using some special functions:
9090
See :ref:`documentation <reshaping.tile.cut>` for :func:`~pandas.cut`.
9191

9292
By passing a :class:`pandas.Categorical` object to a `Series` or assigning it to a `DataFrame`.
93-
This is the only possibility to specify differently ordered categories (or no order at all) at
94-
creation time and the only reason to use :class:`pandas.Categorical` directly:
9593

9694
.. ipython:: python
9795
@@ -103,6 +101,14 @@ creation time and the only reason to use :class:`pandas.Categorical` directly:
103101
df["B"] = raw_cat
104102
df
105103
104+
You can also specify differently ordered categories or make the resulting data ordered, by passing these arguments to ``astype()``:
105+
106+
.. ipython:: python
107+
108+
s = Series(["a","b","c","a"])
109+
s_cat = s.astype("category", categories=["b","c","d"], ordered=False)
110+
s_cat
111+
106112
Categorical data has a specific ``category`` :ref:`dtype <basics.dtypes>`:
107113

108114
.. ipython:: python
@@ -176,10 +182,9 @@ It's also possible to pass in the categories in a specific order:
176182
s.cat.ordered
177183
178184
.. note::
179-
New categorical data is automatically ordered if the passed in values are sortable or a
180-
`categories` argument is supplied. This is a difference to R's `factors`, which are unordered
181-
unless explicitly told to be ordered (``ordered=TRUE``). You can of course overwrite that by
182-
passing in an explicit ``ordered=False``.
185+
186+
New categorical data are NOT automatically ordered. You must explicitly pass ``ordered=True`` to
187+
indicate an ordered ``Categorical``.
183188

184189

185190
Renaming categories
@@ -270,29 +275,37 @@ Sorting and Order
270275

271276
.. _categorical.sort:
272277

278+
.. warning::
279+
280+
The default for construction has changed in v0.16.0 to ``ordered=False``, from the prior implicit ``ordered=True``.
281+
273282
If categorical data is ordered (``s.cat.ordered == True``), then the order of the categories has a
274-
meaning and certain operations are possible. If the categorical is unordered, a `TypeError` is
275-
raised.
283+
meaning and certain operations are possible. If the categorical is unordered, ``.min()/.max()`` will raise a `TypeError`.
276284

277285
.. ipython:: python
278286
279287
s = Series(Categorical(["a","b","c","a"], ordered=False))
280-
try:
281-
s.sort()
282-
except TypeError as e:
283-
print("TypeError: " + str(e))
284-
s = Series(["a","b","c","a"], dtype="category") # ordered per default!
288+
s.sort()
289+
s = Series(["a","b","c","a"]).astype('category', ordered=True)
285290
s.sort()
286291
s
287292
s.min(), s.max()
288293
294+
You can set categorical data to be ordered by using ``as_ordered()`` or unordered by using ``as_unordered()``. These will by
295+
default return a *new* object.
296+
297+
.. ipython:: python
298+
299+
s.cat.as_ordered()
300+
s.cat.as_unordered()
301+
289302
Sorting will use the order defined by categories, not any lexical order present on the data type.
290303
This is even true for strings and numeric data:
291304

292305
.. ipython:: python
293306
294307
s = Series([1,2,3,1], dtype="category")
295-
s.cat.categories = [2,3,1]
308+
s = s.cat.set_categories([2,3,1], ordered=True)
296309
s
297310
s.sort()
298311
s
@@ -310,7 +323,7 @@ necessarily make the sort order the same as the categories order.
310323
.. ipython:: python
311324
312325
s = Series([1,2,3,1], dtype="category")
313-
s = s.cat.reorder_categories([2,3,1])
326+
s = s.cat.reorder_categories([2,3,1], ordered=True)
314327
s
315328
s.sort()
316329
s
@@ -339,7 +352,7 @@ The ordering of the categorical is determined by the ``categories`` of that colu
339352

340353
.. ipython:: python
341354
342-
dfs = DataFrame({'A' : Categorical(list('bbeebbaa'),categories=['e','a','b']),
355+
dfs = DataFrame({'A' : Categorical(list('bbeebbaa'),categories=['e','a','b'],ordered=True),
343356
'B' : [1,2,1,2,2,1,2,1] })
344357
dfs.sort(['A','B'])
345358
@@ -664,9 +677,6 @@ The following differences to R's factor functions can be observed:
664677

665678
* R's `levels` are named `categories`
666679
* R's `levels` are always of type string, while `categories` in pandas can be of any dtype.
667-
* New categorical data is automatically ordered if the passed in values are sortable or a
668-
`categories` argument is supplied. This is a difference to R's `factors`, which are unordered
669-
unless explicitly told to be ordered (``ordered=TRUE``).
670680
* It's not possible to specify labels at creation time. Use ``s.cat.rename_categories(new_labels)``
671681
afterwards.
672682
* In contrast to R's `factor` function, using categorical data as the sole input to create a
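
The behaviour described in the ``categorical.rst`` changes above (unordered by default, ``min``/``max`` only on ordered data, sorting by category order) can be checked with a minimal sketch. It assumes pandas imported as ``pd``. It also uses ``Series.sort_values()``, which does not appear in this commit (the docs here use the older in-place ``s.sort()``), so treat that call as an assumption about later pandas releases.

```python
import pandas as pd

# No `ordered` keyword: the categorical is unordered.
unordered = pd.Categorical(["a", "b", "c", "a"])

# min()/max() are only defined for ordered categoricals.
try:
    unordered.min()
    min_error = None
except TypeError as exc:
    min_error = exc

# Ordering must be requested explicitly; the category order is c < b < a.
ordered = pd.Categorical(["a", "b", "c", "a"],
                         categories=["c", "b", "a"],
                         ordered=True)
smallest, largest = ordered.min(), ordered.max()

# Sorting follows the category order, not the lexical order of the strings.
sorted_values = list(pd.Series(ordered).sort_values())

print(type(min_error).__name__, smallest, largest, sorted_values)
```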

doc/source/release.rst

+1
Original file line number | Diff line number | Diff line change
@@ -59,6 +59,7 @@ Highlights include:
5959
- ``Series.to_coo/from_coo`` methods to interact with ``scipy.sparse``, see :ref:`here <whatsnew_0160.enhancements.sparse>`
6060
- Backwards incompatible change to ``Timedelta`` to conform the ``.seconds`` attribute with ``datetime.timedelta``, see :ref:`here <whatsnew_0160.api_breaking.timedelta>`
6161
- Changes to the ``.loc`` slicing API to conform with the behavior of ``.ix`` see :ref:`here <whatsnew_0160.api_breaking.indexing>`
62+
- Changes to the default for ordering in the ``Categorical`` constructor, see :ref:`here <whatsnew_0160.api_breaking.categorical>`
6263

6364
See the :ref:`v0.16.0 Whatsnew <whatsnew_0160>` overview or the issue tracker on GitHub for an extensive list
6465
of all API changes, enhancements and bugs that have been fixed in 0.16.0.

doc/source/whatsnew/v0.16.0.txt

+129
Original file line number | Diff line number | Diff line change
@@ -13,6 +13,7 @@ users upgrade to this version.
1313
* ``Series.to_coo/from_coo`` methods to interact with ``scipy.sparse``, see :ref:`here <whatsnew_0160.enhancements.sparse>`
1414
* Backwards incompatible change to ``Timedelta`` to conform the ``.seconds`` attribute with ``datetime.timedelta``, see :ref:`here <whatsnew_0160.api_breaking.timedelta>`
1515
* Changes to the ``.loc`` slicing API to conform with the behavior of ``.ix`` see :ref:`here <whatsnew_0160.api_breaking.indexing>`
16+
* Changes to the default for ordering in the ``Categorical`` constructor, see :ref:`here <whatsnew_0160.api_breaking.categorical>`
1617

1718
- Check the :ref:`API Changes <whatsnew_0160.api>` and :ref:`deprecations <whatsnew_0160.deprecations>` before updating
1819

@@ -367,6 +368,134 @@ API Changes
367368
- ``Series.describe`` for categorical data will now give counts and frequencies of 0, not ``NaN``, for unused categories (:issue:`9443`)
368369

369370

371+
Categorical Changes
372+
~~~~~~~~~~~~~~~~~~~
373+
374+
.. _whatsnew_0160.api_breaking.categorical:
375+
376+
In prior versions, ``Categoricals`` that had an unspecified ordering (meaning no ``ordered`` keyword was passed) were defaulted as ``ordered`` Categoricals. Going forward, the ``ordered`` keyword in the ``Categorical`` constructor will default to ``False``. Ordering must now be explicit.
377+
378+
Furthermore, previously you *could* change the ``ordered`` attribute of a Categorical by just setting the attribute, e.g. ``cat.ordered=True``; this is now deprecated and you should use ``cat.as_ordered()`` or ``cat.as_unordered()``. These will by default return a **new** object and not modify the existing object. (:issue:`9347`, :issue:`9190`)
379+
380+
Previous Behavior
381+
382+
.. code-block:: python
383+
384+
In [3]: s = Series([0,1,2], dtype='category')
385+
386+
In [4]: s
387+
Out[4]:
388+
0 0
389+
1 1
390+
2 2
391+
dtype: category
392+
Categories (3, int64): [0 < 1 < 2]
393+
394+
In [5]: s.cat.ordered
395+
Out[5]: True
396+
397+
In [6]: s.cat.ordered = False
398+
399+
In [7]: s
400+
Out[7]:
401+
0 0
402+
1 1
403+
2 2
404+
dtype: category
405+
Categories (3, int64): [0, 1, 2]
406+
407+
New Behavior
408+
409+
.. ipython:: python
410+
411+
s = Series([0,1,2], dtype='category')
412+
s
413+
s.cat.ordered
414+
s = s.cat.as_ordered()
415+
s
416+
s.cat.ordered
417+
418+
# you can set in the constructor of the Categorical
419+
s = Series(Categorical([0,1,2],ordered=True))
420+
s
421+
s.cat.ordered
422+
423+
For ease of creation of series of categorical data, we have added the ability to pass keywords when calling ``.astype()``. These are passed directly to the constructor.
424+
425+
.. ipython:: python
426+
427+
s = Series(["a","b","c","a"]).astype('category',ordered=True)
428+
s
429+
s = Series(["a","b","c","a"]).astype('category',categories=list('abcdef'),ordered=False)
430+
s
431+
432+
Indexing Changes
433+
~~~~~~~~~~~~~~~~
434+
435+
.. _whatsnew_0160.api_breaking.indexing:
436+
437+
The behavior of a small subset of edge cases for using ``.loc`` has changed (:issue:`8613`). Furthermore, we have improved the content of the error messages that are raised:
438+
439+
- slicing with ``.loc`` where the start and/or stop bound is not found in the index is now allowed; this previously would raise a ``KeyError``. This makes the behavior the same as ``.ix`` in this case. This change is only for slicing, not when indexing with a single label.
440+
441+
.. ipython:: python
442+
443+
df = DataFrame(np.random.randn(5,4),
444+
columns=list('ABCD'),
445+
index=date_range('20130101',periods=5))
446+
df
447+
s = Series(range(5),[-2,-1,1,2,3])
448+
s
449+
450+
Previous Behavior
451+
452+
.. code-block:: python
453+
454+
In [4]: df.loc['2013-01-02':'2013-01-10']
455+
KeyError: 'stop bound [2013-01-10] is not in the [index]'
456+
457+
In [6]: s.loc[-10:3]
458+
KeyError: 'start bound [-10] is not the [index]'
459+
460+
New Behavior
461+
462+
.. ipython:: python
463+
464+
df.loc['2013-01-02':'2013-01-10']
465+
s.loc[-10:3]
466+
467+
- allow slicing with float-like values on an integer index for ``.ix``. Previously this was only enabled for ``.loc``:
468+
469+
Previous Behavior
470+
471+
.. code-block:: python
472+
473+
In [8]: s.ix[-1.0:2]
474+
TypeError: the slice start value [-1.0] is not a proper indexer for this index type (Int64Index)
475+
476+
New Behavior
477+
478+
.. ipython:: python
479+
480+
s.ix[-1.0:2]
481+
482+
- provide a useful exception for indexing with an invalid type for that index when using ``.loc``. For example, using ``.loc`` with an integer (or a float) on an index of type ``DatetimeIndex``, ``PeriodIndex`` or ``TimedeltaIndex`` now raises a ``TypeError`` instead of a ``KeyError``.
483+
484+
Previous Behavior
485+
486+
.. code-block:: python
487+
488+
In [4]: df.loc[2:3]
489+
KeyError: 'start bound [2] is not the [index]'
490+
491+
New Behavior
492+
493+
.. code-block:: python
494+
495+
In [4]: df.loc[2:3]
496+
TypeError: Cannot do slice indexing on <class 'pandas.tseries.index.DatetimeIndex'> with <type 'int'> keys
497+
498+
370499
.. _whatsnew_0160.deprecations:
371500

372501
Deprecations
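
The ``.loc`` slicing changes described in the Indexing Changes section above can be illustrated with a small sketch. It assumes pandas imported as ``pd`` and is written against the explicit ``pd.`` namespace rather than the bare ``Series``/``DataFrame`` names used in the docs; the single-column frame is a simplification of the random 5x4 frame in the original example.

```python
import pandas as pd

s = pd.Series(range(5), index=[-2, -1, 1, 2, 3])

# Slice bounds that are not in the index are allowed; the slice is clipped.
sliced = s.loc[-10:3]

df = pd.DataFrame({"A": range(5)},
                  index=pd.date_range("20130101", periods=5))

# The stop bound lies past the end of the index; rows up to the end are returned.
clipped = df.loc["2013-01-02":"2013-01-10"]

# Integer slice bounds on a DatetimeIndex are an invalid key type for .loc.
try:
    df.loc[2:3]
    loc_error = None
except TypeError as exc:
    loc_error = exc

print(len(sliced), len(clipped), type(loc_error).__name__)
```

The relaxed rule applies to slices only; looking up a single missing label such as ``s.loc[-10]`` still raises a ``KeyError``.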
