What's new in 1.4.0 (??)

These are the changes in pandas 1.4.0. See :ref:`release` for a full changelog including other versions of pandas.

{{ header }}

Enhancements

Improved warning messages

Previously, warning messages may have pointed to lines within the pandas library. Running the script setting_with_copy_warning.py

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})
df[:2].loc[:, 'a'] = 5

with pandas 1.3 resulted in:

.../site-packages/pandas/core/indexing.py:1951: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.

This made it difficult to determine where the warning was being generated from. Now pandas will inspect the call stack, reporting the first line outside of the pandas library that gave rise to the warning. The output of the above script is now:

setting_with_copy_warning.py:4: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.

More flexible numeric dtypes for indexes

Until now, it has only been possible to create numeric indexes with int64/float64/uint64 dtypes. It is now possible to create an index of any numpy int/uint/float dtype using the new :class:`NumericIndex` index type (:issue:`41153`):

.. ipython:: python

    pd.NumericIndex([1, 2, 3], dtype="int8")
    pd.NumericIndex([1, 2, 3], dtype="uint32")
    pd.NumericIndex([1, 2, 3], dtype="float32")

In order to maintain backwards compatibility, calls to the base :class:`Index` will currently return :class:`Int64Index`, :class:`UInt64Index` and :class:`Float64Index`, where relevant. For example, the code below returns an Int64Index with dtype int64:

In [1]: pd.Index([1, 2, 3], dtype="int8")
Int64Index([1, 2, 3], dtype='int64')

but will in a future version return a :class:`NumericIndex` with dtype int8.

More generally, all operations that until now have returned :class:`Int64Index`, :class:`UInt64Index` or :class:`Float64Index` currently continue to do so. This means that, in order to use :class:`NumericIndex` in the current version, you have to call NumericIndex explicitly. For example, the Series below will have an Int64Index:

In [2]: ser = pd.Series([1, 2, 3], index=[1, 2, 3])
In [3]: ser.index
Int64Index([1, 2, 3], dtype='int64')

Instead, if you want to use a NumericIndex, you should do:

.. ipython:: python

    idx = pd.NumericIndex([1, 2, 3], dtype="int8")
    ser = pd.Series([1, 2, 3], index=idx)
    ser.index

In a future version of pandas, :class:`NumericIndex` will become the default numeric index type; :class:`Int64Index`, :class:`UInt64Index` and :class:`Float64Index` are therefore deprecated and will be removed in the future. See :ref:`here <whatsnew_140.deprecations.int64_uint64_float64index>` for more.

See :ref:`here <advanced.numericindex>` for more about :class:`NumericIndex`.

Styler

:class:`.Styler` has been further developed in 1.4.0. The following general enhancements have been made:

Additionally, there are enhancements specific to the HTML rendering:

There are also some LaTeX specific enhancements:

  • :meth:`.Styler.to_latex` introduces the keyword argument ``environment``, which also allows a specific "longtable" entry through a separate jinja2 template (:issue:`41866`); a brief sketch follows this list.
  • Naive sparsification is now possible for LaTeX without the necessity of including the multirow package (:issue:`43369`)
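
As a brief, hedged sketch of the new ``environment`` keyword (the DataFrame below is purely illustrative):

.. code-block:: python

    import pandas as pd

    df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
    # environment="longtable" selects the dedicated longtable jinja2 template
    # instead of the default tabular environment.
    latex = df.style.to_latex(environment="longtable")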

Multithreaded CSV reading with a new CSV Engine based on pyarrow

:func:`pandas.read_csv` now accepts engine="pyarrow" (requires at least pyarrow 1.0.1) as an argument, allowing for faster csv parsing on multicore machines with pyarrow installed. See the :doc:`I/O docs </user_guide/io>` for more info. (:issue:`23697`, :issue:`43706`)
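
A minimal sketch of selecting the new engine (the file name below is illustrative; pyarrow >= 1.0.1 must be installed):

.. code-block:: python

    import pandas as pd

    # engine="pyarrow" hands parsing off to the multithreaded pyarrow CSV reader;
    # the default engine remains the C parser.
    df = pd.read_csv("large_dataset.csv", engine="pyarrow")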

Rank function for rolling and expanding windows

Added rank function to :class:`Rolling` and :class:`Expanding`. The new function supports the method, ascending, and pct flags of :meth:`DataFrame.rank`. The method argument supports min, max, and average ranking methods. Example:

.. ipython:: python

    s = pd.Series([1, 4, 2, 3, 5, 3])
    s.rolling(3).rank()

    s.rolling(3).rank(method="max")

Groupby positional indexing

It is now possible to specify positional ranges relative to the ends of each group.

Negative arguments for :meth:`.GroupBy.head` and :meth:`.GroupBy.tail` now work correctly and result in ranges relative to the end and start of each group, respectively. Previously, negative arguments returned empty frames.

.. ipython:: python

    df = pd.DataFrame([["g", "g0"], ["g", "g1"], ["g", "g2"], ["g", "g3"],
                       ["h", "h0"], ["h", "h1"]], columns=["A", "B"])
    df.groupby("A").head(-1)


:meth:`.GroupBy.nth` now accepts a slice or list of integers and slices.

.. ipython:: python

    df.groupby("A").nth(slice(1, -1))
    df.groupby("A").nth([slice(None, 1), slice(-1, None)])

:meth:`.GroupBy.nth` now accepts index notation.

.. ipython:: python

    df.groupby("A").nth[1, -1]
    df.groupby("A").nth[1:-1]
    df.groupby("A").nth[:1, -1:]

DataFrame.from_dict and DataFrame.to_dict have new 'tight' option

A new 'tight' dictionary format that preserves :class:`MultiIndex` entries and names is now available with the :meth:`DataFrame.from_dict` and :meth:`DataFrame.to_dict` methods and can be used with the standard json library to produce a tight representation of :class:`DataFrame` objects (:issue:`4889`).

.. ipython:: python

    df = pd.DataFrame.from_records(
        [[1, 3], [2, 4]],
        index=pd.MultiIndex.from_tuples([("a", "b"), ("a", "c")],
                                        names=["n1", "n2"]),
        columns=pd.MultiIndex.from_tuples([("x", 1), ("y", 2)],
                                          names=["z1", "z2"]),
    )
    df
    df.to_dict(orient='tight')
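
The same 'tight' dictionary can be passed back to :meth:`DataFrame.from_dict` to rebuild the frame, including the :class:`MultiIndex` names. A minimal sketch, continuing the example above:

.. ipython:: python

    pd.DataFrame.from_dict(df.to_dict(orient='tight'), orient='tight')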

Other enhancements

Notable bug fixes

These are bug fixes that might have notable behavior changes.

Inconsistent date string parsing

The dayfirst option of :func:`to_datetime` isn't strict, and this can lead to surprising behaviour:

.. ipython:: python
    :okwarning:

    pd.to_datetime(["31-12-2021"], dayfirst=False)

Now, a warning will be raised if a date string cannot be parsed in accordance with the given dayfirst value when the value is a delimited date string (e.g. 31-12-2012).

Ignoring dtypes in concat with empty or all-NA columns

When using :func:`concat` to concatenate two or more :class:`DataFrame` objects, if one of the DataFrames was empty or had all-NA values, its dtype was sometimes ignored when finding the concatenated dtype. These are now consistently not ignored (:issue:`43507`).

.. ipython:: python

    df1 = pd.DataFrame({"bar": [pd.Timestamp("2013-01-01")]}, index=range(1))
    df2 = pd.DataFrame({"bar": np.nan}, index=range(1, 2))
    res = pd.concat([df1, df2])

Previously, the float-dtype in df2 would be ignored so the result dtype would be datetime64[ns]. As a result, the np.nan would be cast to NaT.

Previous behavior:

In [4]: res
Out[4]:
         bar
0 2013-01-01
1        NaT

Now the float-dtype is respected. Since the common dtype for these DataFrames is object, the np.nan is retained.

New behavior:

.. ipython:: python

    res

Null-values are no longer coerced to NaN-value in value_counts and mode

:meth:`Series.value_counts` and :meth:`Series.mode` no longer coerce None, NaT and other null-values to a NaN-value for np.object-dtype. This behavior is now consistent with unique, isin and others (:issue:`42688`).

.. ipython:: python

    s = pd.Series([True, None, pd.NaT, None, pd.NaT, None])
    res = s.value_counts(dropna=False)

Previously, all null-values were replaced by a NaN-value.

Previous behavior:

In [3]: res
Out[3]:
NaN     5
True    1
dtype: int64

Now null-values are no longer mangled.

New behavior:

.. ipython:: python

    res

mangle_dupe_cols in read_csv no longer renames unique columns conflicting with target names

:func:`read_csv` no longer renames unique columns that conflict with the target names of duplicated columns. Already existing columns are skipped, i.e. the next available index is used for the target column name (:issue:`14704`).

.. ipython:: python

    import io

    data = "a,a,a.1\n1,2,3"
    res = pd.read_csv(io.StringIO(data))

Previously, the second column was called a.1, while the third column was also renamed to a.1.1.

Previous behavior:

In [3]: res
Out[3]:
    a  a.1  a.1.1
0   1    2      3

Now the renaming checks whether a.1 already exists when changing the name of the second column and skips this index. The second column is instead renamed to a.2.

New behavior:

.. ipython:: python

    res

unstack and pivot_table no longer raise ValueError for results that would exceed int32 limit

Previously :meth:`DataFrame.pivot_table` and :meth:`DataFrame.unstack` would raise a ValueError if the operation could produce a result with more than 2**31 - 1 elements. This operation now raises a :class:`errors.PerformanceWarning` instead (:issue:`26314`).

Previous behavior:

In [3]: df = pd.DataFrame({"ind1": np.arange(2 ** 16), "ind2": np.arange(2 ** 16), "count": 0})
In [4]: df.pivot_table(index="ind1", columns="ind2", values="count", aggfunc="count")
ValueError: Unstacked DataFrame is too big, causing int32 overflow

New behavior:

In [4]: df.pivot_table(index="ind1", columns="ind2", values="count", aggfunc="count")
PerformanceWarning: The following operation may generate 4294967296 cells in the resulting pandas object.

Backwards incompatible API changes

Increased minimum versions for dependencies

Some minimum supported versions of dependencies were updated. If installed, we now require:

===============  ===============  ========  =======
Package          Minimum Version  Required  Changed
===============  ===============  ========  =======
numpy            1.18.5           X         X
pytz             2020.1           X         X
python-dateutil  2.8.1            X         X
bottleneck       1.3.1                      X
numexpr          2.7.1                      X
pytest (dev)     6.0
mypy (dev)       0.930                      X
===============  ===============  ========  =======

For optional libraries the general recommendation is to use the latest version. The following table lists the lowest version per library that is currently being tested throughout the development of pandas. Optional libraries below the lowest tested version may still work, but are not considered supported.

==============  ===============  =======
Package         Minimum Version  Changed
==============  ===============  =======
beautifulsoup4  4.8.2            X
fastparquet     0.4.0
fsspec          0.7.4
gcsfs           0.6.0
lxml            4.5.0            X
matplotlib      3.3.2            X
numba           0.50.1           X
openpyxl        3.0.2            X
pyarrow         1.0.1            X
pymysql         0.10.1           X
pytables        3.6.1            X
s3fs             0.4.0
scipy           1.4.1            X
sqlalchemy      1.4.0            X
tabulate        0.8.7
xarray          0.15.1           X
xlrd            2.0.1            X
xlsxwriter      1.2.2            X
xlwt            1.3.0
pandas-gbq      0.14.0           X
==============  ===============  =======

See :ref:`install.dependencies` and :ref:`install.optional_dependencies` for more.

Other API changes

  • :meth:`Index.get_indexer_for` no longer accepts keyword arguments (other than 'target'); in the past these would be silently ignored if the index was not unique (:issue:`42310`)

  • The position of the min_rows argument in :meth:`DataFrame.to_string` has changed, following a change in the docstring (:issue:`44304`)

  • Reduction operations for :class:`DataFrame` or :class:`Series` now raise a ValueError when None is passed for skipna (:issue:`44178`)

  • :func:`read_csv` and :func:`read_html` no longer raise an error when one of the header rows consists only of Unnamed: columns (:issue:`13054`)

  • Changed the name attribute of several holidays in USFederalHolidayCalendar to match official federal holiday names, specifically (see the sketch after this list):

    • "New Year's Day" gains the possessive apostrophe
    • "Presidents Day" becomes "Washington's Birthday"
    • "Martin Luther King Jr. Day" is now "Birthday of Martin Luther King, Jr."
    • "July 4th" is now "Independence Day"
    • "Thanksgiving" is now "Thanksgiving Day"
    • "Christmas" is now "Christmas Day"
    • Added "Juneteenth National Independence Day"

Deprecations

Deprecated Int64Index, UInt64Index & Float64Index

:class:`Int64Index`, :class:`UInt64Index` and :class:`Float64Index` have been deprecated in favor of the new :class:`NumericIndex` and will be removed in pandas 2.0 (:issue:`43028`).

Currently, in order to maintain backward compatibility, calls to :class:`Index` will continue to return :class:`Int64Index`, :class:`UInt64Index` and :class:`Float64Index` when given numeric data, but in the future, a :class:`NumericIndex` will be returned.

Current behavior:

In [1]: pd.Index([1, 2, 3], dtype="int32")
Out [1]: Int64Index([1, 2, 3], dtype='int64')
In [1]: pd.Index([1, 2, 3], dtype="uint64")
Out [1]: UInt64Index([1, 2, 3], dtype='uint64')

Future behavior:

In [3]: pd.Index([1, 2, 3], dtype="int32")
Out[3]: NumericIndex([1, 2, 3], dtype='int32')
In [4]: pd.Index([1, 2, 3], dtype="uint64")
Out[4]: NumericIndex([1, 2, 3], dtype='uint64')

Deprecated Frame.append and Series.append

:meth:`DataFrame.append` and :meth:`Series.append` have been deprecated and will be removed in pandas 2.0. Use :func:`pandas.concat` instead (:issue:`35407`).

Deprecated syntax

In [1]: pd.Series([1, 2]).append(pd.Series([3, 4]))
Out [1]:
<stdin>:1: FutureWarning: The series.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
0    1
1    2
0    3
1    4
dtype: int64

In [2]: df1 = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
In [3]: df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))
In [4]: df1.append(df2)
Out [4]:
<stdin>:1: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
   A  B
0  1  2
1  3  4
0  5  6
1  7  8

Recommended syntax

.. ipython:: python

    pd.concat([pd.Series([1, 2]), pd.Series([3, 4])])

    df1 = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
    df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))
    pd.concat([df1, df2])


Other Deprecations

Performance improvements

Bug fixes

Categorical

Datetimelike

Timedelta

Timezones

Numeric

Conversion

Strings

  • Fixed bug in checking for string[pyarrow] dtype incorrectly raising an ImportError when pyarrow is not installed (:issue:`44276`)

Interval

Indexing

Missing

MultiIndex

I/O

Period

Plotting

Groupby/resample/rolling

Reshaping

Sparse

ExtensionArray

Styler

Other

Contributors