What's new in 0.23.0 (May 15, 2018)

{{ header }}

.. ipython:: python
   :suppress:

   from pandas import * # noqa F401, F403


This is a major release from 0.22.0 and includes a number of API changes, deprecations, new features, enhancements, and performance improvements along with a large number of bug fixes. We recommend that all users upgrade to this version.

Highlights include:

- JSON read/write round-trippable with orient='table'
- Instantiation from dicts preserves dict insertion order for Python 3.6+
- Method .assign() accepts dependent arguments
- Merging / sorting on a combination of columns and index levels
- Extending pandas with custom types (experimental)
- New observed keyword for excluding unobserved categories in GroupBy
- Changes to make output of DataFrame.apply consistent

Check the :ref:`API Changes <whatsnew_0230.api_breaking>` and :ref:`deprecations <whatsnew_0230.deprecations>` before updating.

Warning

Starting January 1, 2019, pandas feature releases will support Python 3 only. See Dropping Python 2.7 for more.

What's new in v0.23.0

New features

JSON read/write round-trippable with orient='table'

A DataFrame can now be written to and subsequently read back via JSON while preserving metadata by using the orient='table' argument (see :issue:`18912` and :issue:`9146`). Previously, none of the available orient values guaranteed the preservation of dtypes and index names, among other metadata.

In [1]: df = pd.DataFrame({'foo': [1, 2, 3, 4],
   ...:                    'bar': ['a', 'b', 'c', 'd'],
   ...:                    'baz': pd.date_range('2018-01-01', freq='d', periods=4),
   ...:                    'qux': pd.Categorical(['a', 'b', 'c', 'c'])},
   ...:                   index=pd.Index(range(4), name='idx'))

In [2]: df
Out[2]:
     foo bar        baz qux
idx
0      1   a 2018-01-01   a
1      2   b 2018-01-02   b
2      3   c 2018-01-03   c
3      4   d 2018-01-04   c

[4 rows x 4 columns]

In [3]: df.dtypes
Out[3]:
foo             int64
bar            object
baz    datetime64[ns]
qux          category
Length: 4, dtype: object

In [4]: df.to_json('test.json', orient='table')

In [5]: new_df = pd.read_json('test.json', orient='table')

In [6]: new_df
Out[6]:
     foo bar        baz qux
idx
0      1   a 2018-01-01   a
1      2   b 2018-01-02   b
2      3   c 2018-01-03   c
3      4   d 2018-01-04   c

[4 rows x 4 columns]

In [7]: new_df.dtypes
Out[7]:
foo             int64
bar            object
baz    datetime64[ns]
qux          category
Length: 4, dtype: object

Please note that an index named 'index' is not supported with the round-trip format, as that name is used by default when writing JSON to indicate a missing index name.

.. ipython:: python
   :okwarning:

   df.index.name = 'index'

   df.to_json('test.json', orient='table')
   new_df = pd.read_json('test.json', orient='table')
   new_df
   new_df.dtypes

.. ipython:: python
   :suppress:

   import os
   os.remove('test.json')


Method .assign() accepts dependent arguments

:func:`DataFrame.assign` now accepts dependent keyword arguments on Python 3.6 and later (see also PEP 468). Later keyword arguments may now refer to earlier ones if the argument is a callable. See the :ref:`documentation here <dsintro.chained_assignment>` (:issue:`14207`)

.. ipython:: python

    df = pd.DataFrame({'A': [1, 2, 3]})
    df
    df.assign(B=df.A, C=lambda x: x['A'] + x['B'])

Warning

This may subtly change the behavior of your code when you're using .assign() to update an existing column. Previously, callables referring to other variables being updated would get the "old" values.

Previous behavior:

In [2]: df = pd.DataFrame({"A": [1, 2, 3]})

In [3]: df.assign(A=lambda df: df.A + 1, C=lambda df: df.A * -1)
Out[3]:
   A  C
0  2 -1
1  3 -2
2  4 -3

New behavior:

.. ipython:: python

    df.assign(A=df.A + 1, C=lambda df: df.A * -1)

Merging on a combination of columns and index levels

Strings passed to :meth:`DataFrame.merge` as the on, left_on, and right_on parameters may now refer to either column names or index level names. This enables merging DataFrame instances on a combination of index levels and columns without resetting indexes. See the :ref:`Merge on columns and levels <merging.merge_on_columns_and_levels>` documentation section. (:issue:`14355`)

.. ipython:: python

   left_index = pd.Index(['K0', 'K0', 'K1', 'K2'], name='key1')

   left = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                        'B': ['B0', 'B1', 'B2', 'B3'],
                        'key2': ['K0', 'K1', 'K0', 'K1']},
                       index=left_index)

   right_index = pd.Index(['K0', 'K1', 'K2', 'K2'], name='key1')

   right = pd.DataFrame({'C': ['C0', 'C1', 'C2', 'C3'],
                         'D': ['D0', 'D1', 'D2', 'D3'],
                         'key2': ['K0', 'K0', 'K0', 'K1']},
                        index=right_index)

   left.merge(right, on=['key1', 'key2'])

Sorting by a combination of columns and index levels

Strings passed to :meth:`DataFrame.sort_values` as the by parameter may now refer to either column names or index level names. This enables sorting DataFrame instances by a combination of index levels and columns without resetting indexes. See the :ref:`Sorting by Indexes and Values <basics.sort_indexes_and_values>` documentation section. (:issue:`14353`)

.. ipython:: python

   # Build MultiIndex
   idx = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('a', 2),
                                    ('b', 2), ('b', 1), ('b', 1)])
   idx.names = ['first', 'second']

   # Build DataFrame
   df_multi = pd.DataFrame({'A': np.arange(6, 0, -1)},
                           index=idx)
   df_multi

   # Sort by 'second' (index) and 'A' (column)
   df_multi.sort_values(by=['second', 'A'])


Extending pandas with custom types (experimental)

pandas now supports storing array-like objects that aren't necessarily 1-D NumPy arrays as columns in a DataFrame or values in a Series. This allows third-party libraries to implement extensions to NumPy's types, similar to how pandas implemented categoricals, datetimes with timezones, periods, and intervals.

As a demonstration, we'll use cyberpandas, which provides an IPArray type for storing IP addresses.

In [1]: from cyberpandas import IPArray

In [2]: values = IPArray([
   ...:     0,
   ...:     3232235777,
   ...:     42540766452641154071740215577757643572
   ...: ])

IPArray isn't a normal 1-D NumPy array, but because it's a pandas :class:`~pandas.api.extensions.ExtensionArray`, it can be stored properly inside pandas' containers.

In [3]: ser = pd.Series(values)

In [4]: ser
Out[4]:
0                         0.0.0.0
1                     192.168.1.1
2    2001:db8:85a3::8a2e:370:7334
dtype: ip

Notice that the dtype is ip. The missing value semantics of the underlying array are respected:

In [5]: ser.isna()
Out[5]:
0     True
1    False
2    False
dtype: bool

For more, see the :ref:`extension types <extending.extension-types>` documentation. If you build an extension array, publicize it on the ecosystem page.
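
As a rough illustration of what's involved, here is a minimal sketch of a custom dtype. The class name MyDtype is hypothetical, and a real extension must also pair it with an :class:`~pandas.api.extensions.ExtensionArray` subclass that implements the array interface.

.. code-block:: python

   from pandas.api.extensions import ExtensionDtype

   class MyDtype(ExtensionDtype):
       """A hypothetical custom dtype (illustration only)."""
       name = "my_dtype"  # what shows up as Series.dtype
       type = object      # the scalar type of the array's elements

       @classmethod
       def construct_from_string(cls, string):
           # allows construction from the string "my_dtype"
           if string == cls.name:
               return cls()
           raise TypeError("Cannot construct a 'MyDtype' from "
                           "'{}'".format(string))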

New observed keyword for excluding unobserved categories in GroupBy

Grouping by a categorical includes the unobserved categories in the output. When grouping by multiple categorical columns, this means you get the cartesian product of all the categories, including combinations where there are no observations, which can result in a large number of groups. We have added a keyword observed to control this behavior; it defaults to observed=False for backward compatibility. (:issue:`14942`, :issue:`8138`, :issue:`15217`, :issue:`17594`, :issue:`8669`, :issue:`20583`, :issue:`20902`)

.. ipython:: python

   cat1 = pd.Categorical(["a", "a", "b", "b"],
                         categories=["a", "b", "z"], ordered=True)
   cat2 = pd.Categorical(["c", "d", "c", "d"],
                         categories=["c", "d", "y"], ordered=True)
   df = pd.DataFrame({"A": cat1, "B": cat2, "values": [1, 2, 3, 4]})
   df['C'] = ['foo', 'bar'] * 2
   df

To show all values, the previous behavior:

.. ipython:: python

   df.groupby(['A', 'B', 'C'], observed=False).count()


To show only observed values:

.. ipython:: python

   df.groupby(['A', 'B', 'C'], observed=True).count()

For pivoting operations, this behavior is already controlled by the dropna keyword:

.. ipython:: python

   cat1 = pd.Categorical(["a", "a", "b", "b"],
                         categories=["a", "b", "z"], ordered=True)
   cat2 = pd.Categorical(["c", "d", "c", "d"],
                         categories=["c", "d", "y"], ordered=True)
   df = pd.DataFrame({"A": cat1, "B": cat2, "values": [1, 2, 3, 4]})
   df


In [1]: pd.pivot_table(df, values='values', index=['A', 'B'], dropna=True)

Out[1]:
     values
A B
a c     1.0
  d     2.0
b c     3.0
  d     4.0

In [2]: pd.pivot_table(df, values='values', index=['A', 'B'], dropna=False)

Out[2]:
     values
A B
a c     1.0
  d     2.0
  y     NaN
b c     3.0
  d     4.0
  y     NaN
z c     NaN
  d     NaN
  y     NaN

Rolling/Expanding.apply() accepts raw=False to pass a Series to the function

:func:`Series.rolling().apply() <.Rolling.apply>`, :func:`DataFrame.rolling().apply() <.Rolling.apply>`, :func:`Series.expanding().apply() <.Expanding.apply>`, and :func:`DataFrame.expanding().apply() <.Expanding.apply>` have gained a raw=None parameter. This is similar to :func:`DataFrame.apply`. If raw=True, a np.ndarray is sent to the applied function; if raw=False, a Series is passed. The default of None preserves backward compatibility and behaves like True, sending an np.ndarray. In a future version the default will be changed to False, sending a Series. (:issue:`5071`, :issue:`20584`)

.. ipython:: python

   s = pd.Series(np.arange(5), np.arange(5) + 1)
   s

Pass a Series:

.. ipython:: python

   s.rolling(2, min_periods=1).apply(lambda x: x.iloc[-1], raw=False)

Mimic the original behavior of passing a ndarray:

.. ipython:: python

   s.rolling(2, min_periods=1).apply(lambda x: x[-1], raw=True)


DataFrame.interpolate has gained the limit_area kwarg

:meth:`DataFrame.interpolate` has gained a limit_area parameter to allow further control over which NaNs are replaced. Use limit_area='inside' to fill only NaNs surrounded by valid values, or limit_area='outside' to fill only NaNs outside the existing valid values while preserving those inside. (:issue:`16284`) See the :ref:`full documentation here <missing_data.interp_limits>`.

.. ipython:: python

   ser = pd.Series([np.nan, np.nan, 5, np.nan, np.nan,
                    np.nan, 13, np.nan, np.nan])
   ser

Fill one consecutive inside value in both directions

.. ipython:: python

   ser.interpolate(limit_direction='both', limit_area='inside', limit=1)

Fill all consecutive outside values backward

.. ipython:: python

   ser.interpolate(limit_direction='backward', limit_area='outside')

Fill all consecutive outside values in both directions

.. ipython:: python

   ser.interpolate(limit_direction='both', limit_area='outside')

Function get_dummies now supports dtype argument

:func:`get_dummies` now accepts a dtype argument, which specifies a dtype for the new columns. The default remains uint8. (:issue:`18330`)

.. ipython:: python

   df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})
   pd.get_dummies(df, columns=['c']).dtypes
   pd.get_dummies(df, columns=['c'], dtype=bool).dtypes


Timedelta mod method

mod (%) and divmod operations are now defined on Timedelta objects when operating with either timedelta-like or with numeric arguments. See the :ref:`documentation here <timedeltas.mod_divmod>`. (:issue:`19365`)

.. ipython:: python

    td = pd.Timedelta(hours=37)
    td % pd.Timedelta(minutes=45)
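
The divmod operation returns the integer quotient together with the remainder. A minimal sketch continuing the example above (the shown result is what the arithmetic works out to):

.. code-block:: python

   # 37 hours is 49 * 45 minutes plus a 15-minute remainder
   divmod(td, pd.Timedelta(minutes=45))  # (49, Timedelta('0 days 00:15:00'))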

Method .rank() handles inf values when NaN are present

In previous versions, .rank() would assign NaN as the rank of inf elements. Ranks are now calculated properly. (:issue:`6945`)

.. ipython:: python

    s = pd.Series([-np.inf, 0, 1, np.nan, np.inf])
    s

Previous behavior:

In [11]: s.rank()
Out[11]:
0    1.0
1    2.0
2    3.0
3    NaN
4    NaN
dtype: float64

Current behavior:

.. ipython:: python

    s.rank()

Furthermore, previously, ranking inf or -inf values together with NaN values did not distinguish NaN from infinity when using the na_option='top' or na_option='bottom' arguments.

.. ipython:: python

    s = pd.Series([np.nan, np.nan, -np.inf, -np.inf])
    s

Previous behavior:

In [15]: s.rank(na_option='top')
Out[15]:
0    2.5
1    2.5
2    2.5
3    2.5
dtype: float64

Current behavior:

.. ipython:: python

    s.rank(na_option='top')

A number of related ranking bugs were squashed as part of these changes.

Series.str.cat has gained the join kwarg

Previously, :meth:`Series.str.cat` did not -- in contrast to most of pandas -- align :class:`Series` on their index before concatenation (see :issue:`18657`). The method has now gained a keyword join to control the manner of alignment, see examples below and :ref:`here <text.concatenate>`.

In v0.23, join will default to None (meaning no alignment), but this default will change to 'left' in a future version of pandas.

.. ipython:: python
   :okwarning:

    s = pd.Series(['a', 'b', 'c', 'd'])
    t = pd.Series(['b', 'd', 'e', 'c'], index=[1, 3, 4, 2])
    s.str.cat(t)
    s.str.cat(t, join='left', na_rep='-')

Furthermore, :meth:`Series.str.cat` now works for CategoricalIndex as well (previously raised a ValueError; see :issue:`20842`).
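
A minimal illustration (with made-up values) of concatenating against a CategoricalIndex, which previously raised:

.. code-block:: python

   s = pd.Series(['a', 'b', 'c', 'd'])
   # with join=None (the current default), alignment is positional
   s.str.cat(pd.CategoricalIndex(['w', 'x', 'y', 'z']), sep='-')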

DataFrame.astype performs column-wise conversion to Categorical

:meth:`DataFrame.astype` can now perform column-wise conversion to Categorical by supplying the string 'category' or a :class:`~pandas.api.types.CategoricalDtype`. Previously, attempting this would raise a NotImplementedError. See the :ref:`categorical.objectcreation` section of the documentation for more details and examples. (:issue:`12860`, :issue:`18099`)

Supplying the string 'category' performs column-wise conversion, with only labels appearing in a given column set as categories:

.. ipython:: python

    df = pd.DataFrame({'A': list('abca'), 'B': list('bccd')})
    df = df.astype('category')
    df['A'].dtype
    df['B'].dtype


Supplying a CategoricalDtype will make the categories in each column consistent with the supplied dtype:

.. ipython:: python

    from pandas.api.types import CategoricalDtype
    df = pd.DataFrame({'A': list('abca'), 'B': list('bccd')})
    cdt = CategoricalDtype(categories=list('abcd'), ordered=True)
    df = df.astype(cdt)
    df['A'].dtype
    df['B'].dtype


Other enhancements

Backwards incompatible API changes

Dependencies have increased minimum versions

We have updated our minimum supported versions of dependencies (:issue:`15184`). If installed, we now require:

===============  ===============  ========  ==============
Package          Minimum Version  Required  Issue
===============  ===============  ========  ==============
python-dateutil  2.5.0            X         :issue:`15184`
openpyxl         2.4.0                      :issue:`15184`
beautifulsoup4   4.2.1                      :issue:`20082`
setuptools       24.2.0                     :issue:`20698`
===============  ===============  ========  ==============
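
To see which versions you currently have installed, :func:`pandas.show_versions` prints the versions of pandas and its dependencies:

.. code-block:: python

   import pandas as pd

   pd.show_versions()  # prints pandas, Python, and dependency versions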

Instantiation from dicts preserves dict insertion order for Python 3.6+

Until Python 3.6, dicts had no formally defined ordering. For Python 3.6 and later, dicts are ordered by insertion order (see PEP 468). pandas will use the dict's insertion order when creating a Series or DataFrame from a dict on Python 3.6 or higher. (:issue:`19884`)

Previous behavior (and current behavior if on Python < 3.6):

In [16]: pd.Series({'Income': 2000,
   ....:            'Expenses': -1500,
   ....:            'Taxes': -200,
   ....:            'Net result': 300})
Out[16]:
Expenses     -1500
Income        2000
Net result     300
Taxes         -200
dtype: int64

Note the Series above is ordered alphabetically by the index values.

New behavior (for Python >= 3.6):

.. ipython:: python

    pd.Series({'Income': 2000,
               'Expenses': -1500,
               'Taxes': -200,
               'Net result': 300})

Notice that the Series is now ordered by insertion order. This new behavior is used for all relevant pandas types (Series, DataFrame, SparseSeries and SparseDataFrame).

If you wish to retain the old behavior while using Python >= 3.6, you can use .sort_index():

.. ipython:: python

    pd.Series({'Income': 2000,
               'Expenses': -1500,
               'Taxes': -200,
               'Net result': 300}).sort_index()

Deprecate Panel

Panel was deprecated in the 0.20.x release, showing as a DeprecationWarning. Using Panel will now show a FutureWarning. The recommended way to represent 3-D data is with a MultiIndex on a DataFrame via the :meth:`~Panel.to_frame` method, or with the xarray package. pandas provides a :meth:`~Panel.to_xarray` method to automate this conversion (:issue:`13563`, :issue:`18324`).

In [75]: import pandas._testing as tm

In [76]: p = tm.makePanel()

In [77]: p
Out[77]:
<class 'pandas.core.panel.Panel'>
Dimensions: 3 (items) x 3 (major_axis) x 4 (minor_axis)
Items axis: ItemA to ItemC
Major_axis axis: 2000-01-03 00:00:00 to 2000-01-05 00:00:00
Minor_axis axis: A to D

Convert to a MultiIndex DataFrame

In [78]: p.to_frame()
Out[78]:
                     ItemA     ItemB     ItemC
major      minor
2000-01-03 A      0.469112  0.721555  0.404705
           B     -1.135632  0.271860 -1.039268
           C      0.119209  0.276232 -1.344312
           D     -2.104569  0.113648 -0.109050
2000-01-04 A     -0.282863 -0.706771  0.577046
           B      1.212112 -0.424972 -0.370647
           C     -1.044236 -1.087401  0.844885
           D     -0.494929 -1.478427  1.643563
2000-01-05 A     -1.509059 -1.039575 -1.715002
           B     -0.173215  0.567020 -1.157892
           C     -0.861849 -0.673690  1.075770
           D      1.071804  0.524988 -1.469388

[12 rows x 3 columns]

Convert to an xarray DataArray

In [79]: p.to_xarray()
Out[79]:
<xarray.DataArray (items: 3, major_axis: 3, minor_axis: 4)>
array([[[ 0.469112, -1.135632,  0.119209, -2.104569],
        [-0.282863,  1.212112, -1.044236, -0.494929],
        [-1.509059, -0.173215, -0.861849,  1.071804]],

       [[ 0.721555,  0.27186 ,  0.276232,  0.113648],
        [-0.706771, -0.424972, -1.087401, -1.478427],
        [-1.039575,  0.56702 , -0.67369 ,  0.524988]],

       [[ 0.404705, -1.039268, -1.344312, -0.10905 ],
        [ 0.577046, -0.370647,  0.844885,  1.643563],
        [-1.715002, -1.157892,  1.07577 , -1.469388]]])
Coordinates:
  * items       (items) object 'ItemA' 'ItemB' 'ItemC'
  * major_axis  (major_axis) datetime64[ns] 2000-01-03 2000-01-04 2000-01-05
  * minor_axis  (minor_axis) object 'A' 'B' 'C' 'D'

pandas.core.common removals

The following error and warning classes are removed from pandas.core.common (:issue:`13634`, :issue:`19769`):

  • PerformanceWarning
  • UnsupportedFunctionCall
  • UnsortedIndexError
  • AbstractMethodError

These are now available for import from pandas.errors (and have been since 0.19.0).
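
For example, update any remaining imports like so:

.. code-block:: python

   # import from pandas.errors instead of pandas.core.common
   from pandas.errors import (
       AbstractMethodError,
       PerformanceWarning,
       UnsortedIndexError,
       UnsupportedFunctionCall,
   )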

Changes to make output of DataFrame.apply consistent

:func:`DataFrame.apply` was inconsistent when applying an arbitrary user-defined function that returned a list-like with axis=1. Several bugs and inconsistencies are resolved. If the applied function returns a Series, then pandas will return a DataFrame; otherwise a Series will be returned. This includes the case where a list-like (e.g. a tuple or list) is returned (:issue:`16353`, :issue:`17437`, :issue:`17970`, :issue:`17348`, :issue:`17892`, :issue:`18573`, :issue:`17602`, :issue:`18775`, :issue:`18901`, :issue:`18919`).

.. ipython:: python

    df = pd.DataFrame(np.tile(np.arange(3), 6).reshape(6, -1) + 1,
                      columns=['A', 'B', 'C'])
    df

Previous behavior: if the returned shape happened to match the length of the original columns, this would return a DataFrame. If the return shape did not match, a Series with lists was returned.

In [3]: df.apply(lambda x: [1, 2, 3], axis=1)
Out[3]:
   A  B  C
0  1  2  3
1  1  2  3
2  1  2  3
3  1  2  3
4  1  2  3
5  1  2  3

In [4]: df.apply(lambda x: [1, 2], axis=1)
Out[4]:
0    [1, 2]
1    [1, 2]
2    [1, 2]
3    [1, 2]
4    [1, 2]
5    [1, 2]
dtype: object

New behavior: When the applied function returns a list-like, this will now always return a Series.

.. ipython:: python

    df.apply(lambda x: [1, 2, 3], axis=1)
    df.apply(lambda x: [1, 2], axis=1)

To have expanded columns, you can use result_type='expand':

.. ipython:: python

    df.apply(lambda x: [1, 2, 3], axis=1, result_type='expand')

To broadcast the result across the original columns (the old behavior for list-likes of the correct length), you can use result_type='broadcast'. The shape must match the original columns.

.. ipython:: python

    df.apply(lambda x: [1, 2, 3], axis=1, result_type='broadcast')

Returning a Series allows one to control the exact return structure and column names:

.. ipython:: python

    df.apply(lambda x: pd.Series([1, 2, 3], index=['D', 'E', 'F']), axis=1)

Concatenation will no longer sort

In a future version of pandas :func:`pandas.concat` will no longer sort the non-concatenation axis when it is not already aligned. The current behavior is the same as the previous (sorting), but now a warning is issued when sort is not specified and the non-concatenation axis is not aligned (:issue:`4588`).

.. ipython:: python
   :okwarning:

   df1 = pd.DataFrame({"a": [1, 2], "b": [1, 2]}, columns=['b', 'a'])
   df2 = pd.DataFrame({"a": [4, 5]})

   pd.concat([df1, df2])

To keep the previous behavior (sorting) and silence the warning, pass sort=True:

.. ipython:: python

   pd.concat([df1, df2], sort=True)

To accept the future behavior (no sorting), pass sort=False:
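
.. ipython:: python

   pd.concat([df1, df2], sort=False)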

Note that this change also applies to :meth:`DataFrame.append`, which has also received a sort keyword for controlling this behavior.
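
A minimal sketch of the same control on :meth:`DataFrame.append`, reusing df1 and df2 from above:

.. code-block:: python

   df1.append(df2, sort=False)  # do not sort the non-concatenation axis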

Build changes

  • Building pandas for development now requires cython >= 0.24 (:issue:`18613`)
  • Building from source now explicitly requires setuptools in setup.py (:issue:`18113`)
  • Updated conda recipe to be in compliance with conda-build 3.0+ (:issue:`18002`)

Index division by zero fills correctly

Division operations on Index and subclasses will now fill division of positive numbers by zero with np.inf, division of negative numbers by zero with -np.inf, and 0 / 0 with np.nan. This matches existing Series behavior. (:issue:`19322`, :issue:`19347`)

Previous behavior:

In [6]: index = pd.Int64Index([-1, 0, 1])

In [7]: index / 0
Out[7]: Int64Index([0, 0, 0], dtype='int64')

# Previous behavior yielded different results depending on the type of zero in the divisor
In [8]: index / 0.0
Out[8]: Float64Index([-inf, nan, inf], dtype='float64')

In [9]: index = pd.UInt64Index([0, 1])

In [10]: index / np.array([0, 0], dtype=np.uint64)
Out[10]: UInt64Index([0, 0], dtype='uint64')

In [11]: pd.RangeIndex(1, 5) / 0
ZeroDivisionError: integer division or modulo by zero

Current behavior:

.. ipython:: python

   index = pd.Int64Index([-1, 0, 1])
   # division by zero gives -infinity where negative,
   # +infinity where positive, and NaN for 0 / 0
   index / 0

   # The result of division by zero should not depend on
   # whether the zero is int or float
   index / 0.0

   index = pd.UInt64Index([0, 1])
   index / np.array([0, 0], dtype=np.uint64)

   pd.RangeIndex(1, 5) / 0

Extraction of matching patterns from strings

By default, extracting matching patterns from strings with :func:`str.extract` used to return a Series if a single group was being extracted (a DataFrame if more than one group was extracted). As of pandas 0.23.0, :func:`str.extract` always returns a DataFrame unless expand is set to False. Additionally, None was previously an accepted value for the expand parameter (equivalent to False), but it now raises a ValueError. (:issue:`11386`)

Previous behavior:

In [1]: s = pd.Series(['number 10', '12 eggs'])

In [2]: extracted = s.str.extract(r'.*(\d\d).*')

In [3]: extracted
Out[3]:
0    10
1    12
dtype: object

In [4]: type(extracted)
Out[4]:
pandas.core.series.Series

New behavior:

.. ipython:: python

    s = pd.Series(['number 10', '12 eggs'])
    extracted = s.str.extract(r'.*(\d\d).*')
    extracted
    type(extracted)

To restore previous behavior, simply set expand to False:

.. ipython:: python

    s = pd.Series(['number 10', '12 eggs'])
    extracted = s.str.extract(r'.*(\d\d).*', expand=False)
    extracted
    type(extracted)

Default value for the ordered parameter of CategoricalDtype

The default value of the ordered parameter for :class:`~pandas.api.types.CategoricalDtype` has changed from False to None to allow updating of categories without impacting ordered. Behavior should remain consistent for downstream objects, such as :class:`Categorical` (:issue:`18790`)

In previous versions, the default value for the ordered parameter was False. This could potentially lead to the ordered parameter unintentionally being changed from True to False when users attempt to update categories if ordered is not explicitly specified, as it would silently default to False. The new behavior for ordered=None is to retain the existing value of ordered.

New behavior:

In [2]: from pandas.api.types import CategoricalDtype

In [3]: cat = pd.Categorical(list('abcaba'), ordered=True, categories=list('cba'))

In [4]: cat
Out[4]:
[a, b, c, a, b, a]
Categories (3, object): [c < b < a]

In [5]: cdt = CategoricalDtype(categories=list('cbad'))

In [6]: cat.astype(cdt)
Out[6]:
[a, b, c, a, b, a]
Categories (4, object): [c < b < a < d]

Notice in the example above that the converted Categorical has retained ordered=True. Had the default value for ordered remained as False, the converted Categorical would have become unordered, despite ordered=False never being explicitly specified. To change the value of ordered, explicitly pass it to the new dtype, e.g. CategoricalDtype(categories=list('cbad'), ordered=False).
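
As a concrete sketch continuing the example above:

.. code-block:: python

   # explicitly passing ordered overrides the retained value
   cat.astype(CategoricalDtype(categories=list('cbad'), ordered=False))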

Note that the unintentional conversion of ordered discussed above did not arise in previous versions due to separate bugs that prevented astype from doing any type of category to category conversion (:issue:`10696`, :issue:`18593`). These bugs have been fixed in this release, and motivated changing the default value of ordered.

Better pretty-printing of DataFrames in a terminal

Previously, the default value for the maximum number of columns was pd.options.display.max_columns=20. This meant that relatively wide data frames would not fit within the terminal width, and pandas would introduce line breaks to display these 20 columns. This resulted in an output that was relatively difficult to read:

.. image:: ../_static/print_df_old.png

If Python runs in a terminal, the maximum number of columns is now determined automatically so that the printed data frame fits within the current terminal width (pd.options.display.max_columns=0) (:issue:`17023`). If Python runs as a Jupyter kernel (such as the Jupyter QtConsole or a Jupyter notebook, as well as in many IDEs), this value cannot be inferred automatically and is thus set to 20 as in previous versions. In a terminal, this results in a much nicer output:

.. image:: ../_static/print_df_new.png

Note that if you don't like the new default, you can always set this option yourself. To revert to the old setting, you can run this line:

.. code-block:: python

   pd.options.display.max_columns = 20

Datetimelike API changes

Other API changes

Deprecations

Removal of prior version deprecations/changes

Performance improvements

Documentation changes

Thanks to all of the contributors who participated in the pandas Documentation Sprint, which took place on March 10th. We had about 500 participants from over 30 locations across the world. You should notice that many of the :ref:`API docstrings <api>` have greatly improved.

There were too many simultaneous contributions to include a release note for each improvement, but this GitHub search should give you an idea of how many docstrings were improved.

Special thanks to Marc Garcia for organizing the sprint. For more information, read the NumFOCUS blogpost recapping the sprint.

Bug fixes

Categorical

Warning

A class of bugs was introduced in pandas 0.21 with CategoricalDtype that affects the correctness of operations like merge, concat, and indexing when comparing multiple unordered Categorical arrays that have the same categories but in a different order. We highly recommend upgrading or manually aligning your categories before doing these operations.
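
A minimal sketch (with made-up data) of aligning categories by hand before such operations:

.. code-block:: python

   a = pd.Categorical(['a', 'b'], categories=['a', 'b'])
   b = pd.Categorical(['a', 'b'], categories=['b', 'a'])
   # reorder b's categories to match a's before merge/concat/indexing
   b = b.reorder_categories(a.categories.tolist())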

Datetimelike

Timedelta

Timezones

Offsets

Numeric

Strings

Indexing

MultiIndex

IO

Plotting

GroupBy/resample/rolling

Sparse

Reshaping

Other

Contributors

.. contributors:: v0.22.0..v0.23.0