What's new in 0.25.0 (April XX, 2019)

Warning

Starting with the 0.25.x series of releases, pandas only supports Python 3.5.3 and higher. See :ref:`install.dropping-27` for more details.

Warning

The minimum supported Python version will be bumped to 3.6 in a future release.

Warning

Panel has been fully removed. For N-D labeled data structures, please use xarray.

Warning

:func:`read_pickle` and :func:`read_msgpack` are only guaranteed backwards compatible back to pandas version 0.20.3 (:issue:`27082`)

{{ header }}

These are the changes in pandas 0.25.0. See :ref:`release` for a full changelog including other versions of pandas.

Enhancements

Groupby aggregation with relabeling

Pandas has added special groupby behavior, known as "named aggregation", for naming the output columns when applying multiple aggregation functions to specific columns (:issue:`18366`, :issue:`26512`).

.. ipython:: python

   animals = pd.DataFrame({'kind': ['cat', 'dog', 'cat', 'dog'],
                           'height': [9.1, 6.0, 9.5, 34.0],
                           'weight': [7.9, 7.5, 9.9, 198.0]})
   animals
   animals.groupby("kind").agg(
       min_height=pd.NamedAgg(column='height', aggfunc='min'),
       max_height=pd.NamedAgg(column='height', aggfunc='max'),
       average_weight=pd.NamedAgg(column='weight', aggfunc=np.mean),
   )

Pass the desired column names as the ``**kwargs`` to ``.agg``. The values of ``**kwargs`` should be tuples where the first element is the column selection, and the second element is the aggregation function to apply. Pandas provides the ``pandas.NamedAgg`` namedtuple to make it clearer what the arguments to the function are, but plain tuples are accepted as well.

.. ipython:: python

   animals.groupby("kind").agg(
       min_height=('height', 'min'),
       max_height=('height', 'max'),
       average_weight=('weight', np.mean),
   )

Named aggregation is the recommended replacement for the deprecated "dict-of-dicts" approach to naming the output of column-specific aggregations (:ref:`whatsnew_0200.api_breaking.deprecate_group_agg_dict`).

A similar approach is now available for Series groupby objects as well. Because there's no need for column selection, the values can just be the functions to apply.

.. ipython:: python

   animals.groupby("kind").height.agg(
       min_height="min",
       max_height="max",
   )


This type of aggregation is the recommended alternative to the deprecated behavior when passing a dict to a Series groupby aggregation (:ref:`whatsnew_0200.api_breaking.deprecate_group_agg_dict`).

See :ref:`groupby.aggregate.named` for more.
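
Output names that are not valid Python identifiers (for example, names containing spaces) cannot be written as literal keyword arguments. A minimal sketch of one workaround, assuming the same ``animals`` data as in the examples above: build a dict mapping output names to ``(column, aggfunc)`` tuples and unpack it into ``.agg``. The variable name ``spec`` is purely illustrative.

.. code-block:: python

   import pandas as pd

   animals = pd.DataFrame({'kind': ['cat', 'dog', 'cat', 'dog'],
                           'height': [9.1, 6.0, 9.5, 34.0],
                           'weight': [7.9, 7.5, 9.9, 198.0]})

   # Keys are the output column names; values are (column, aggfunc) tuples.
   spec = {'min height': ('height', 'min'),
           'max height': ('height', 'max')}

   # Unpacking the dict passes each key as a keyword argument to .agg.
   result = animals.groupby('kind').agg(**spec)
   print(result)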

Groupby Aggregation with multiple lambdas

You can now provide multiple lambda functions to a list-like aggregation in :meth:`pandas.core.groupby.GroupBy.agg` (:issue:`26430`).

.. ipython:: python

   animals.groupby('kind').height.agg([
       lambda x: x.iloc[0], lambda x: x.iloc[-1]
   ])

   animals.groupby('kind').agg([
       lambda x: x.iloc[0] - x.iloc[1],
       lambda x: x.iloc[0] + x.iloc[1]
   ])

Previously, these raised a SpecificationError.

Better repr for MultiIndex

Printing of :class:`MultiIndex` instances now shows tuples of each row and ensures that the tuple items are vertically aligned, so it's now easier to understand the structure of the MultiIndex (:issue:`13480`).

The repr now looks like this:

.. ipython:: python

   pd.MultiIndex.from_product([['a', 'abc'], range(500)])

Previously, outputting a :class:`MultiIndex` printed all the levels and codes of the MultiIndex, which was visually unappealing and made the output more difficult to navigate. For example (limiting the range to 5):

.. code-block:: ipython

   In [1]: pd.MultiIndex.from_product([['a', 'abc'], range(5)])
   Out[1]: MultiIndex(levels=[['a', 'abc'], [0, 1, 2, 3, 4]],
      ...:            codes=[[0, 0, 0, 0, 0, 1, 1, 1, 1, 1], [0, 1, 2, 3, 4, 0, 1, 2, 3, 4]])

In the new repr, all values are shown if the number of rows is smaller than :attr:`options.display.max_seq_items` (default: 100 items). Horizontally, the output is truncated if it is wider than :attr:`options.display.width` (default: 80 characters).
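
The row limit can be adjusted temporarily with :func:`option_context`. The following is a small sketch, not part of the original notes; the limit of 10 is an arbitrary value chosen for illustration.

.. code-block:: python

   import pandas as pd

   mi = pd.MultiIndex.from_product([['a', 'abc'], range(500)])

   # Temporarily lower the row limit; the repr then elides the middle
   # rows with "..." instead of printing all 1000 tuples.
   with pd.option_context('display.max_seq_items', 10):
       short_repr = repr(mi)

   print(short_repr)

Outside the ``with`` block the option reverts to its previous value.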

Other enhancements

Backwards incompatible API changes

Indexing with date strings with UTC offsets

Indexing a :class:`DataFrame` or :class:`Series` with a :class:`DatetimeIndex` with a date string with a UTC offset would previously ignore the UTC offset. Now, the UTC offset is respected in indexing. (:issue:`24076`, :issue:`16785`)

.. ipython:: python

    df = pd.DataFrame([0], index=pd.DatetimeIndex(['2019-01-01'], tz='US/Pacific'))
    df

Previous behavior:

.. code-block:: ipython

   In [3]: df['2019-01-01 00:00:00+04:00':'2019-01-01 01:00:00+04:00']
   Out[3]:
                              0
   2019-01-01 00:00:00-08:00  0

New behavior:

.. ipython:: python

    df['2019-01-01 12:00:00+04:00':'2019-01-01 13:00:00+04:00']
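
The new slice matches because ``12:00`` at ``+04:00`` is the same instant as midnight in ``US/Pacific``. A short sketch that makes the conversion explicit (the variable names ``bound`` and ``local`` are illustrative only):

.. code-block:: python

   import pandas as pd

   # The slice bound carries a +04:00 offset ...
   bound = pd.Timestamp('2019-01-01 12:00:00+04:00')

   # ... which is the same instant as midnight in the index's time zone.
   local = bound.tz_convert('US/Pacific')
   print(local)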


MultiIndex constructed from levels and codes

Constructing a :class:`MultiIndex` with NaN levels or with code values < -1 was previously allowed. Now, construction with code values < -1 is not allowed, and the codes corresponding to NaN levels are reassigned to -1. (:issue:`19387`)

Previous behavior:

.. code-block:: ipython

   In [1]: pd.MultiIndex(levels=[[np.nan, None, pd.NaT, 128, 2]],
      ...:               codes=[[0, -1, 1, 2, 3, 4]])
      ...:
   Out[1]: MultiIndex(levels=[[nan, None, NaT, 128, 2]],
                      codes=[[0, -1, 1, 2, 3, 4]])

   In [2]: pd.MultiIndex(levels=[[1, 2]], codes=[[0, -2]])
   Out[2]: MultiIndex(levels=[[1, 2]],
                      codes=[[0, -2]])

New behavior:

.. ipython:: python
    :okexcept:

    pd.MultiIndex(levels=[[np.nan, None, pd.NaT, 128, 2]],
                  codes=[[0, -1, 1, 2, 3, 4]])
    pd.MultiIndex(levels=[[1, 2]], codes=[[0, -2]])


Groupby.apply on DataFrame evaluates first group only once

The implementation of :meth:`DataFrameGroupBy.apply() <pandas.core.groupby.DataFrameGroupBy.apply>` previously evaluated the supplied function consistently twice on the first group to infer if it is safe to use a fast code path. Particularly for functions with side effects, this was an undesired behavior and may have led to surprises. (:issue:`2936`, :issue:`2656`, :issue:`7739`, :issue:`10519`, :issue:`12155`, :issue:`20084`, :issue:`21417`)

Now every group is evaluated only a single time.

.. ipython:: python

    df = pd.DataFrame({"a": ["x", "y"], "b": [1, 2]})
    df

    def func(group):
        print(group.name)
        return group

Previous behavior:

.. code-block:: ipython

   In [3]: df.groupby('a').apply(func)
   x
   x
   y
   Out[3]:
      a  b
   0  x  1
   1  y  2

New behavior:

.. ipython:: python

    df.groupby("a").apply(func)
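
For functions with side effects, the change is easiest to see by recording each call. The sketch below assumes pandas 0.25 or later; ``record`` and ``calls`` are hypothetical names used only for this illustration.

.. code-block:: python

   import pandas as pd

   df = pd.DataFrame({"a": ["x", "y"], "b": [1, 2]})
   calls = []

   def record(group):
       # Side effect: remember which rows each call received.
       calls.append(list(group["b"]))
       return group

   df.groupby("a").apply(record)

   # One entry per group; the first group is no longer seen twice.
   print(calls)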


Concatenating sparse values

When passed DataFrames whose values are sparse, :func:`concat` will now return a :class:`Series` or :class:`DataFrame` with sparse values, rather than a :class:`SparseDataFrame` (:issue:`25702`).

.. ipython:: python

   df = pd.DataFrame({"A": pd.SparseArray([0, 1])})

Previous behavior:

.. code-block:: ipython

   In [2]: type(pd.concat([df, df]))
   pandas.core.sparse.frame.SparseDataFrame

New behavior:

.. ipython:: python

   type(pd.concat([df, df]))


This now matches the existing behavior of :func:`concat` on Series with sparse values. :func:`concat` will continue to return a SparseDataFrame when all the values are instances of SparseDataFrame.

This change also affects routines using :func:`concat` internally, like :func:`get_dummies`, which now returns a :class:`DataFrame` in all cases (previously a SparseDataFrame was returned if all the columns were dummy encoded, and a :class:`DataFrame` otherwise).

Providing any SparseSeries or SparseDataFrame to :func:`concat` will cause a SparseSeries or SparseDataFrame to be returned, as before.
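
As a brief sketch of the :func:`get_dummies` case mentioned above (the input data is made up for illustration), requesting sparse output now yields a regular :class:`DataFrame` whose columns have a sparse dtype:

.. code-block:: python

   import pandas as pd

   dummies = pd.get_dummies(pd.Series(['a', 'b', 'a']), sparse=True)

   # A regular DataFrame whose columns hold sparse values.
   print(type(dummies))
   print(dummies.dtypes)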

The .str-accessor performs stricter type checks

Due to the lack of more fine-grained dtypes, :attr:`Series.str` so far only checked whether the data was of object dtype. :attr:`Series.str` will now infer the dtype of the data within the Series; in particular, 'bytes'-only data will raise an exception (except for :meth:`Series.str.decode`, :meth:`Series.str.get`, :meth:`Series.str.len`, :meth:`Series.str.slice`), see :issue:`23163`, :issue:`23011`, :issue:`23551`.

Previous behavior:

.. code-block:: ipython

   In [1]: s = pd.Series(np.array(['a', 'ba', 'cba'], 'S'), dtype=object)

   In [2]: s
   Out[2]:
   0      b'a'
   1     b'ba'
   2    b'cba'
   dtype: object

   In [3]: s.str.startswith(b'a')
   Out[3]:
   0     True
   1    False
   2    False
   dtype: bool

New behavior:

.. ipython:: python
    :okexcept:

    s = pd.Series(np.array(['a', 'ba', 'cba'], 'S'), dtype=object)
    s
    s.str.startswith(b'a')
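
Because :meth:`Series.str.decode` remains available on bytes data, one possible migration path is to decode first and then use the string methods. This is a sketch only; the ASCII encoding is an assumption about the data.

.. code-block:: python

   import numpy as np
   import pandas as pd

   s = pd.Series(np.array(['a', 'ba', 'cba'], 'S'), dtype=object)

   # Decode the bytes to str first (ASCII is assumed here),
   # then use the .str accessor as usual.
   decoded = s.str.decode('ascii')
   print(decoded.str.startswith('a'))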

Categorical dtypes are preserved during groupby

Previously, columns that were categorical, but not the groupby key(s), would be converted to object dtype during groupby operations. Pandas will now preserve these dtypes. (:issue:`18502`)

.. ipython:: python

   df = pd.DataFrame(
       {'payload': [-1, -2, -1, -2],
        'col': pd.Categorical(["foo", "bar", "bar", "qux"], ordered=True)})
   df
   df.dtypes

Previous behavior:

.. code-block:: ipython

   In [5]: df.groupby('payload').first().col.dtype
   Out[5]: dtype('O')

New behavior:

.. ipython:: python

   df.groupby('payload').first().col.dtype


Incompatible Index type unions

When performing :func:`Index.union` operations between objects of incompatible dtypes, the result will be a base :class:`Index` of dtype object. This behavior holds true for unions between :class:`Index` objects that previously would have been prohibited. The dtype of empty :class:`Index` objects will now be evaluated before performing union operations rather than simply returning the other :class:`Index` object. :func:`Index.union` can now be considered commutative, such that A.union(B) == B.union(A) (:issue:`23525`).

Previous behavior:

.. code-block:: ipython

   In [1]: pd.period_range('19910905', periods=2).union(pd.Int64Index([1, 2, 3]))
   ...
   ValueError: can only call with other PeriodIndex-ed objects

   In [2]: pd.Index([], dtype=object).union(pd.Index([1, 2, 3]))
   Out[2]: Int64Index([1, 2, 3], dtype='int64')

New behavior:

.. ipython:: python

    pd.period_range('19910905', periods=2).union(pd.Int64Index([1, 2, 3]))
    pd.Index([], dtype=object).union(pd.Index([1, 2, 3]))

Note that integer- and floating-dtype indexes are considered "compatible". The integer values are coerced to floating point, which may result in loss of precision. See :ref:`indexing.set_ops` for more.
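
The loss of precision can be seen with an integer that has no exact 64-bit floating point representation. A minimal sketch; the value ``2 ** 53 + 1`` is chosen for illustration because it is the smallest positive integer that a 64-bit float cannot represent exactly.

.. code-block:: python

   import numpy as np
   import pandas as pd

   big = 2 ** 53 + 1  # not exactly representable as a 64-bit float

   result = pd.Index([big]).union(pd.Index([0.5]))

   # The integer is coerced to float, so the odd value is rounded.
   print(result.dtype)
   print(int(result.max()) == big)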

DataFrame groupby ffill/bfill no longer return group labels

The methods ffill, bfill, pad and backfill of :class:`DataFrameGroupBy <pandas.core.groupby.DataFrameGroupBy>` previously included the group labels in the return value, which was inconsistent with other groupby transforms. Now only the filled values are returned. (:issue:`21521`)

.. ipython:: python

    df = pd.DataFrame({"a": ["x", "y"], "b": [1, 2]})
    df

Previous behavior:

.. code-block:: ipython

   In [3]: df.groupby("a").ffill()
   Out[3]:
      a  b
   0  x  1
   1  y  2

New behavior:

.. ipython:: python

    df.groupby("a").ffill()

DataFrame describe on an empty categorical / object column will return top and freq

When calling :meth:`DataFrame.describe` with an empty categorical / object column, the 'top' and 'freq' rows were previously omitted, which was inconsistent with the output for non-empty columns. Now the 'top' and 'freq' rows will always be included, with :attr:`numpy.nan` as their value in the case of an empty :class:`DataFrame` (:issue:`26397`).

.. ipython:: python

   df = pd.DataFrame({"empty_col": pd.Categorical([])})
   df

Previous behavior:

.. code-block:: ipython

   In [3]: df.describe()
   Out[3]:
           empty_col
   count           0
   unique          0

New behavior:

.. ipython:: python

    df.describe()

__str__ methods now call __repr__ rather than vice versa

Pandas has until now mostly defined string representations in a Pandas object's ``__str__``/``__unicode__``/``__bytes__`` methods, and called ``__str__`` from the ``__repr__`` method if a specific ``__repr__`` method was not found. This is not needed for Python 3. In Pandas 0.25, the string representations of Pandas objects are now generally defined in ``__repr__``, and calls to ``__str__`` are in general passed on to ``__repr__`` if a specific ``__str__`` method doesn't exist, as is standard for Python. This change is backward compatible for direct usage of Pandas, but if you subclass Pandas objects and give your subclasses specific ``__str__``/``__repr__`` methods, you may have to adjust them (:issue:`26495`).
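
The delegation pattern can be sketched in plain Python. The classes below are hypothetical stand-ins, not pandas classes; they only illustrate what a subclass can expect when the parent defines its text in ``__repr__`` and leaves ``__str__`` to Python's default fallback.

.. code-block:: python

   class ReprDefined:
       """Hypothetical stand-in for an object whose text lives in __repr__."""

       def __repr__(self):
           return "ReprDefined()"
       # No __str__ here: Python's default __str__ falls back to __repr__.


   class OnlyRepr(ReprDefined):
       # Overriding __repr__ alone changes both repr() and str().
       def __repr__(self):
           return "OnlyRepr()"


   class OnlyStr(ReprDefined):
       # Overriding __str__ alone leaves repr() as defined by the parent.
       def __str__(self):
           return "custom str"


   print(str(OnlyRepr()), repr(OnlyRepr()))
   print(str(OnlyStr()), repr(OnlyStr()))

A subclass that previously relied on its ``__str__`` also being used by ``repr()`` would need to define ``__repr__`` instead.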

Increased minimum versions for dependencies

Due to dropping support for Python 2.7, a number of optional dependencies have updated minimum versions (:issue:`25725`, :issue:`24942`, :issue:`25752`). Independently, some minimum supported versions of dependencies were updated (:issue:`23519`, :issue:`25554`). If installed, we now require:

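+-----------------+-----------------+----------+
| Package         | Minimum Version | Required |
+=================+=================+==========+
| numpy           | 1.13.3          |    X     |
+-----------------+-----------------+----------+
| pytz            | 2015.4          |    X     |
+-----------------+-----------------+----------+
| python-dateutil | 2.6.1           |    X     |
+-----------------+-----------------+----------+
| bottleneck      | 1.2.1           |          |
+-----------------+-----------------+----------+
| numexpr         | 2.6.2           |          |
+-----------------+-----------------+----------+
| pytest (dev)    | 4.0.2           |          |
+-----------------+-----------------+----------+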

For optional libraries the general recommendation is to use the latest version. The following table lists the lowest version per library that is currently being tested throughout the development of pandas. Optional libraries below the lowest tested version may still work, but are not considered supported.

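+-----------------+-----------------+
| Package         | Minimum Version |
+=================+=================+
| beautifulsoup4  | 4.6.0           |
+-----------------+-----------------+
| fastparquet     | 0.2.1           |
+-----------------+-----------------+
| gcsfs           | 0.2.2           |
+-----------------+-----------------+
| lxml            | 3.8.0           |
+-----------------+-----------------+
| matplotlib      | 2.2.2           |
+-----------------+-----------------+
| openpyxl        | 2.4.8           |
+-----------------+-----------------+
| pyarrow         | 0.9.0           |
+-----------------+-----------------+
| pymysql         | 0.7.1           |
+-----------------+-----------------+
| pytables        | 3.4.2           |
+-----------------+-----------------+
| scipy           | 0.19.0          |
+-----------------+-----------------+
| sqlalchemy      | 1.1.4           |
+-----------------+-----------------+
| xarray          | 0.8.2           |
+-----------------+-----------------+
| xlrd            | 1.1.0           |
+-----------------+-----------------+
| xlsxwriter      | 0.9.8           |
+-----------------+-----------------+
| xlwt            | 1.2.0           |
+-----------------+-----------------+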

See :ref:`install.dependencies` and :ref:`install.optional_dependencies` for more.
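
When checking an environment against these minimums, compare versions component-wise rather than as plain strings, since ``'1.9.0' > '1.13.3'`` lexicographically. The following is a minimal sketch using a few entries from the required-dependency table; ``MINIMUM_VERSIONS``, ``parse`` and ``meets_minimum`` are hypothetical helper names, and only purely numeric dotted versions are handled.

.. code-block:: python

   # Minimum versions copied from the required-dependency table (a subset).
   MINIMUM_VERSIONS = {
       "numpy": "1.13.3",
       "pytz": "2015.4",
       "python-dateutil": "2.6.1",
   }


   def parse(version):
       """Turn a dotted numeric version such as '1.13.3' into a tuple of ints."""
       return tuple(int(part) for part in version.split("."))


   def meets_minimum(package, installed):
       """Return True if ``installed`` is at least the listed minimum."""
       return parse(installed) >= parse(MINIMUM_VERSIONS[package])


   print(meets_minimum("numpy", "1.9.0"))
   print(meets_minimum("numpy", "1.16.4"))

Version strings with pre-release or local suffixes would need a fuller parser than this sketch provides.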

Other API changes

Deprecations

Sparse subclasses

The SparseSeries and SparseDataFrame subclasses are deprecated. Their functionality is better provided by a Series or DataFrame with sparse values.

Previous way

.. ipython:: python
   :okwarning:

   df = pd.SparseDataFrame({"A": [0, 0, 1, 2]})
   df.dtypes

New way

.. ipython:: python

   df = pd.DataFrame({"A": pd.SparseArray([0, 0, 1, 2])})
   df.dtypes

The memory usage of the two approaches is identical. See :ref:`sparse.migration` for more (:issue:`19239`).

msgpack format

The msgpack format is deprecated as of 0.25 and will be removed in a future version. It is recommended to use pyarrow for on-the-wire transmission of pandas objects. (:issue:`27084`)

Other deprecations

Removal of prior version deprecations/changes

Performance improvements

Bug fixes

Categorical

Datetimelike

Timedelta

Timezones

Numeric

Conversion

Strings

Interval

Indexing

Missing

MultiIndex

I/O

Plotting

Groupby/resample/rolling

Reshaping

Sparse

Build Changes

ExtensionArray

Other

Contributors

.. contributors:: v0.24.x..HEAD