Skip to content

Latest commit

 

History

History
505 lines (399 loc) · 30.4 KB

v2.2.0.rst

File metadata and controls

505 lines (399 loc) · 30.4 KB

What's new in 2.2.0 (Month XX, 2024)

These are the changes in pandas 2.2.0. See :ref:`release` for a full changelog including other versions of pandas.

{{ header }}

Enhancements

Calamine engine for :func:`read_excel`

The calamine engine was added to :func:`read_excel`. It uses python-calamine, which provides Python bindings for the Rust library calamine. This engine supports Excel files (.xlsx, .xlsm, .xls, .xlsb) and OpenDocument spreadsheets (.ods) (:issue:`50395`).

There are two advantages of this engine:

  1. Calamine is often faster than other engines, some benchmarks show results up to 5x faster than 'openpyxl', 20x - 'odf', 4x - 'pyxlsb', and 1.5x - 'xlrd'. But, 'openpyxl' and 'pyxlsb' are faster in reading a few rows from large files because of lazy iteration over rows.
  2. Calamine supports the recognition of datetime in .xlsb files, unlike 'pyxlsb' which is the only other engine in pandas that can read .xlsb files.
pd.read_excel("path_to_file.xlsb", engine="calamine")

For more, see :ref:`io.calamine` in the user guide on IO tools.

Series.struct accessor to with PyArrow structured data

The Series.struct accessor provides attributes and methods for processing data with struct[pyarrow] dtype Series. For example, :meth:`Series.struct.explode` converts PyArrow structured data to a pandas DataFrame. (:issue:`54938`)

.. ipython:: python

    import pyarrow as pa
    series = pd.Series(
        [
            {"project": "pandas", "version": "2.2.0"},
            {"project": "numpy", "version": "1.25.2"},
            {"project": "pyarrow", "version": "13.0.0"},
        ],
        dtype=pd.ArrowDtype(
            pa.struct([
                ("project", pa.string()),
                ("version", pa.string()),
            ])
        ),
    )
    series.struct.explode()

Series.list accessor for PyArrow list data

The Series.list accessor provides attributes and methods for processing data with list[pyarrow] dtype Series. For example, :meth:`Series.list.__getitem__` allows indexing pyarrow lists in a Series. (:issue:`55323`)

.. ipython:: python

    import pyarrow as pa
    series = pd.Series(
        [
            [1, 2, 3],
            [4, 5],
            [6],
        ],
        dtype=pd.ArrowDtype(
            pa.list_(pa.int64())
        ),
    )
    series.list[0]

Other enhancements

Notable bug fixes

These are bug fixes that might have notable behavior changes.

:func:`merge` and :meth:`DataFrame.join` now consistently follow documented sort behavior

In previous versions of pandas, :func:`merge` and :meth:`DataFrame.join` did not always return a result that followed the documented sort behavior. pandas now follows the documented sort behavior in merge and join operations (:issue:`54611`).

As documented, sort=True sorts the join keys lexicographically in the resulting :class:`DataFrame`. With sort=False, the order of the join keys depends on the join type (how keyword):

  • how="left": preserve the order of the left keys
  • how="right": preserve the order of the right keys
  • how="inner": preserve the order of the left keys
  • how="outer": sort keys lexicographically

One example with changing behavior is inner joins with non-unique left join keys and sort=False:

.. ipython:: python

    left = pd.DataFrame({"a": [1, 2, 1]})
    right = pd.DataFrame({"a": [1, 2]})
    result = pd.merge(left, right, how="inner", on="a", sort=False)

Old Behavior

In [5]: result
Out[5]:
   a
0  1
1  1
2  2

New Behavior

.. ipython:: python

    result

:func:`merge` and :meth:`DataFrame.join` no longer reorder levels when levels differ

In previous versions of pandas, :func:`merge` and :meth:`DataFrame.join` would reorder index levels when joining on two indexes with different levels (:issue:`34133`).

.. ipython:: python

    left = pd.DataFrame({"left": 1}, index=pd.MultiIndex.from_tuples([("x", 1), ("x", 2)], names=["A", "B"]))
    right = pd.DataFrame({"right": 2}, index=pd.MultiIndex.from_tuples([(1, 1), (2, 2)], names=["B", "C"]))
    result = left.join(right)

Old Behavior

In [5]: result
Out[5]:
       left  right
B A C
1 x 1     1      2
2 x 2     1      2

New Behavior

.. ipython:: python

    result

Backwards incompatible API changes

Increased minimum versions for dependencies

Some minimum supported versions of dependencies were updated. If installed, we now require:

Package Minimum Version Required Changed
    X X

For optional libraries the general recommendation is to use the latest version. The following table lists the lowest version per library that is currently being tested throughout the development of pandas. Optional libraries below the lowest tested version may still work, but are not considered supported.

Package Minimum Version Changed
    X

See :ref:`install.dependencies` and :ref:`install.optional_dependencies` for more.

Other API changes

Deprecations

Deprecate aliases M, SM, BM, CBM, Q, BQ, Y, and BY in favour of ME, SME, BME, CBME, QE, BQE, YE, and BYE for offsets

Deprecated the following frequency aliases (:issue:`9586`):

  • M (month end) has been renamed ME for offsets
  • SM (semi month end) has been renamed SME for offsets
  • BM (business month end) has been renamed BME for offsets
  • CBM (custom business month end) has been renamed CBME for offsets
  • Q (quarter end) has been renamed QE for offsets
  • BQ (business quarter end) has been renamed BQE for offsets
  • Y (year end) has been renamed YE for offsets
  • BY (business year end) has been renamed BYE for offsets

For example:

Previous behavior:

In [8]: pd.date_range('2020-01-01', periods=3, freq='Q-NOV')
Out[8]:
DatetimeIndex(['2020-02-29', '2020-05-31', '2020-08-31'],
              dtype='datetime64[ns]', freq='Q-NOV')

Future behavior:

.. ipython:: python

    pd.date_range('2020-01-01', periods=3, freq='QE-NOV')

Other Deprecations

Performance improvements

Bug fixes

Categorical

Datetimelike

Timedelta

Timezones

Numeric

Conversion

Strings

Interval

Indexing

Missing

MultiIndex

I/O

Period

Plotting

Groupby/resample/rolling

Reshaping

Sparse

ExtensionArray

Styler

Other

Contributors