What's new in 2.0.0 (??)
-------------------------

These are the changes in pandas 2.0.0. See :ref:`release` for a full changelog including other versions of pandas.

{{ header }}

Enhancements
~~~~~~~~~~~~

Installing optional dependencies with pip extras
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

When installing pandas using pip, sets of optional dependencies can also be installed by specifying extras.

.. code-block:: bash

    pip install "pandas[performance, aws]>=2.0.0"

The available extras, found in the :ref:`installation guide<install.dependencies>`, are [all, performance, computation, timezone, fss, aws, gcp, excel, parquet, feather, hdf5, spss, postgresql, mysql, sql-other, html, xml, plot, output_formatting, clipboard, compression, test] (:issue:`39164`).
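As a quick check after installing (a minimal sketch, not part of the extras mechanism itself), :func:`pandas.show_versions` reports which optional dependencies were actually picked up:

.. code-block:: python

    import pandas as pd

    # prints the pandas version plus the detected versions of optional
    # dependencies (pyarrow, numba, s3fs, ...), or None if not installed
    pd.show_versions()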

Configuration option, ``mode.dtype_backend``, to return pyarrow-backed dtypes
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The ``use_nullable_dtypes`` keyword argument has been expanded to more I/O functions, such as :func:`read_csv`, to enable automatic conversion to nullable dtypes (:issue:`36712`).

Additionally, a new global configuration, ``mode.dtype_backend``, can be used in conjunction with ``use_nullable_dtypes=True`` in these functions to select the nullable-dtypes implementation; a number of conversion methods also honor the ``mode.dtype_backend`` option.

By default, ``mode.dtype_backend`` is set to ``"pandas"`` to return the existing, numpy-backed nullable dtypes, but it can also be set to ``"pyarrow"`` to return pyarrow-backed, nullable :class:`ArrowDtype` (:issue:`48957`, :issue:`49997`).

.. ipython:: python

    import io
    data = io.StringIO("""a,b,c,d,e,f,g,h,i
        1,2.5,True,a,,,,,
        3,4.5,False,b,6,7.5,True,a,
    """)
    with pd.option_context("mode.dtype_backend", "pandas"):
        df = pd.read_csv(data, use_nullable_dtypes=True)
    df.dtypes

    data.seek(0)
    with pd.option_context("mode.dtype_backend", "pyarrow"):
        df_pyarrow = pd.read_csv(data, use_nullable_dtypes=True, engine="pyarrow")
    df_pyarrow.dtypes
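Pyarrow-backed dtypes can also be requested directly via :class:`ArrowDtype`, independent of the option (a minimal sketch, assuming pyarrow is installed):

.. code-block:: python

    import pandas as pd
    import pyarrow as pa

    # missing values are stored natively by pyarrow; no cast to float occurs
    ser = pd.Series([1, 2, None], dtype=pd.ArrowDtype(pa.int64()))
    ser.dtype  # int64[pyarrow]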

Copy-on-Write improvements
^^^^^^^^^^^^^^^^^^^^^^^^^^

Copy-on-Write can be enabled through

.. code-block:: python

    pd.set_option("mode.copy_on_write", True)
    pd.options.mode.copy_on_write = True

Alternatively, copy-on-write can be enabled locally through:

.. code-block:: python

    with pd.option_context("mode.copy_on_write", True):
        ...
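To illustrate the effect (a minimal sketch, not a full description of the Copy-on-Write rules): with the mode enabled, an object derived from another behaves like a copy, so modifying it no longer mutates the parent:

.. code-block:: python

    import pandas as pd

    with pd.option_context("mode.copy_on_write", True):
        df = pd.DataFrame({"a": [1, 2, 3]})
        ser = df["a"]      # behaves as a lazy copy under Copy-on-Write
        ser.iloc[0] = 100  # triggers a copy; df is left untouched
        print(df["a"].iloc[0])  # 1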

Other enhancements
^^^^^^^^^^^^^^^^^^

Notable bug fixes
~~~~~~~~~~~~~~~~~

These are bug fixes that might have notable behavior changes.

:meth:`.GroupBy.cumsum` and :meth:`.GroupBy.cumprod` overflow instead of lossy casting to float
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In previous versions, we cast to float when applying ``cumsum`` and ``cumprod``, which led to incorrect results even if the result could be held by the int64 dtype. Additionally, the aggregation now overflows consistently with numpy and the regular :meth:`DataFrame.cumprod` and :meth:`DataFrame.cumsum` methods when the limit of int64 is reached (:issue:`37493`).

Old Behavior

.. code-block:: ipython

    In [1]: df = pd.DataFrame({"key": ["b"] * 7, "value": 625})
    In [2]: df.groupby("key")["value"].cumprod()[5]
    Out[2]: 5.960464477539062e+16

We return incorrect results with the 6th value.

New Behavior

.. ipython:: python

    df = pd.DataFrame({"key": ["b"] * 7, "value": 625})
    df.groupby("key")["value"].cumprod()

We overflow with the 7th value, but the 6th value is still correct.
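If overflow is a concern and float precision is acceptable, you can opt back into floating-point accumulation explicitly (a minimal sketch; cast before grouping):

.. code-block:: python

    # values are accumulated as float64, matching the pre-2.0 behavior
    df["value"].astype("float64").groupby(df["key"]).cumprod()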

:meth:`.DataFrameGroupBy.nth` and :meth:`.SeriesGroupBy.nth` now behave as filtrations
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In previous versions of pandas, :meth:`.DataFrameGroupBy.nth` and :meth:`.SeriesGroupBy.nth` acted as if they were aggregations. However, for most inputs ``n``, they may return either zero or multiple rows per group. This means that they are filtrations, similar to e.g. :meth:`.DataFrameGroupBy.head`. pandas now treats them as filtrations (:issue:`13666`).

.. ipython:: python

    df = pd.DataFrame({"a": [1, 1, 2, 1, 2], "b": [np.nan, 2.0, 3.0, 4.0, 5.0]})
    gb = df.groupby("a")

Old Behavior

.. code-block:: ipython

    In [5]: gb.nth(n=1)
    Out[5]:
         b
    a
    1  2.0
    2  5.0

New Behavior

.. ipython:: python

    gb.nth(n=1)

In particular, the index of the result is derived from the input by selecting the appropriate rows. Also, when ``n`` is larger than the group, no rows are returned instead of ``NaN``:

Old Behavior

.. code-block:: ipython

    In [5]: gb.nth(n=3, dropna="any")
    Out[5]:
         b
    a
    1  NaN
    2  NaN

New Behavior

.. ipython:: python

    gb.nth(n=3, dropna="any")
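If you relied on the old aggregation-like shape, the group key can simply be moved back into the index (a minimal sketch):

.. code-block:: python

    # restore the group key as the index, as nth returned it pre-2.0
    gb.nth(n=1).set_index("a")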

Backwards incompatible API changes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Construction with datetime64 or timedelta64 dtype with unsupported resolution
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In past versions, when constructing a :class:`Series` or :class:`DataFrame` and passing a "datetime64" or "timedelta64" dtype with unsupported resolution (i.e. anything other than "ns"), pandas would silently replace the given dtype with its nanosecond analogue:

Previous behavior:

.. code-block:: ipython

    In [5]: pd.Series(["2016-01-01"], dtype="datetime64[s]")
    Out[5]:
    0   2016-01-01
    dtype: datetime64[ns]

    In [6]: pd.Series(["2016-01-01"], dtype="datetime64[D]")
    Out[6]:
    0   2016-01-01
    dtype: datetime64[ns]

In pandas 2.0 we support resolutions "s", "ms", "us", and "ns". When passing a supported dtype (e.g. "datetime64[s]"), the result now has exactly the requested dtype:

New behavior:

.. ipython:: python

   pd.Series(["2016-01-01"], dtype="datetime64[s]")

With an un-supported dtype, pandas now raises instead of silently swapping in a supported dtype:

New behavior:

.. ipython:: python
   :okexcept:

   pd.Series(["2016-01-01"], dtype="datetime64[D]")

Disallow astype conversion to non-supported datetime64/timedelta64 dtypes
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In previous versions, converting a :class:`Series` or :class:`DataFrame` from datetime64[ns] to a different datetime64[X] dtype would return with datetime64[ns] dtype instead of the requested dtype. In pandas 2.0, support is added for "datetime64[s]", "datetime64[ms]", and "datetime64[us]" dtypes, so converting to those dtypes gives exactly the requested dtype:

.. ipython:: python

   idx = pd.date_range("2016-01-01", periods=3)
   ser = pd.Series(idx)

Previous behavior:

.. code-block:: ipython

    In [4]: ser.astype("datetime64[s]")
    Out[4]:
    0   2016-01-01
    1   2016-01-02
    2   2016-01-03
    dtype: datetime64[ns]

With the new behavior, we get exactly the requested dtype:

New behavior:

.. ipython:: python

   ser.astype("datetime64[s]")

For non-supported resolutions, e.g. "datetime64[D]", we raise instead of silently ignoring the requested dtype:

New behavior:

.. ipython:: python
   :okexcept:

   ser.astype("datetime64[D]")

For conversion from ``timedelta64[ns]`` dtype, the old behavior converted to a floating point format.

.. ipython:: python

   idx = pd.timedelta_range("1 Day", periods=3)
   ser = pd.Series(idx)

Previous behavior:

.. code-block:: ipython

    In [7]: ser.astype("timedelta64[s]")
    Out[7]:
    0     86400.0
    1    172800.0
    2    259200.0
    dtype: float64

    In [8]: ser.astype("timedelta64[D]")
    Out[8]:
    0    1.0
    1    2.0
    2    3.0
    dtype: float64

The new behavior, as for datetime64, either gives exactly the requested dtype or raises:

New behavior:

.. ipython:: python
   :okexcept:

   ser.astype("timedelta64[s]")
   ser.astype("timedelta64[D]")

UTC and fixed-offset timezones default to standard-library tzinfo objects
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In previous versions, the default ``tzinfo`` object used to represent UTC was ``pytz.UTC``. In pandas 2.0, we default to ``datetime.timezone.utc`` instead. Similarly, for timezones representing fixed UTC offsets, we use ``datetime.timezone`` objects instead of ``pytz.FixedOffset`` objects (:issue:`34916`).

Previous behavior:

.. code-block:: ipython

    In [2]: ts = pd.Timestamp("2016-01-01", tz="UTC")
    In [3]: type(ts.tzinfo)
    Out[3]: pytz.UTC

    In [4]: ts2 = pd.Timestamp("2016-01-01 04:05:06-07:00")
    In [5]: type(ts2.tzinfo)
    Out[5]: pytz._FixedOffset

New behavior:

.. ipython:: python

   ts = pd.Timestamp("2016-01-01", tz="UTC")
   type(ts.tzinfo)

   ts2 = pd.Timestamp("2016-01-01 04:05:06-07:00")
   type(ts2.tzinfo)

For timezones that are neither UTC nor fixed offsets, e.g. "US/Pacific", we continue to default to pytz objects.
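If downstream code checks for pytz objects specifically, you can keep them by passing the tzinfo explicitly (a minimal sketch, assuming pytz is installed):

.. code-block:: python

    import pytz

    # an explicitly passed tzinfo object is preserved as-is
    ts = pd.Timestamp("2016-01-01", tz=pytz.UTC)
    type(ts.tzinfo)  # pytz.UTC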

Empty DataFrames/Series will now default to have a RangeIndex
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Before, constructing an empty :class:`Series` or :class:`DataFrame` (where ``data`` is ``None`` or an empty list-like argument) without specifying the axes (``index=None``, ``columns=None``) would return the axes as an empty :class:`Index` with object dtype.

Now, the axes return an empty :class:`RangeIndex`.

Previous behavior:

.. code-block:: ipython

    In [8]: pd.Series().index
    Out[8]:
    Index([], dtype='object')

    In [9]: pd.DataFrame().axes
    Out[9]:
    [Index([], dtype='object'), Index([], dtype='object')]

New behavior:

.. ipython:: python

   pd.Series().index
   pd.DataFrame().axes
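Code that depended on the old object-dtype axes can pass them explicitly (a minimal sketch):

.. code-block:: python

    # an explicit index overrides the new RangeIndex default
    pd.Series(index=pd.Index([], dtype="object"), dtype="float64").index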

Increased minimum versions for dependencies
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Some minimum supported versions of dependencies were updated. If installed, we now require:

+--------------------+-----------------+----------+---------+
| Package            | Minimum Version | Required | Changed |
+====================+=================+==========+=========+
| mypy (dev)         | 0.991           |          | X       |
+--------------------+-----------------+----------+---------+
| pytest (dev)       | 7.0.0           |          | X       |
+--------------------+-----------------+----------+---------+
| pytest-xdist (dev) | 2.2.0           |          | X       |
+--------------------+-----------------+----------+---------+
| hypothesis (dev)   | 6.34.2          |          | X       |
+--------------------+-----------------+----------+---------+
| python-dateutil    | 2.8.2           | X        | X       |
+--------------------+-----------------+----------+---------+

For optional libraries the general recommendation is to use the latest version. The following table lists the lowest version per library that is currently being tested throughout the development of pandas. Optional libraries below the lowest tested version may still work, but are not considered supported.

+-----------------+-----------------+---------+
| Package         | Minimum Version | Changed |
+=================+=================+=========+
| pyarrow         | 6.0.0           | X       |
+-----------------+-----------------+---------+
| matplotlib      | 3.6.1           | X       |
+-----------------+-----------------+---------+
| fastparquet     | 0.6.3           | X       |
+-----------------+-----------------+---------+
| xarray          | 0.21.0          | X       |
+-----------------+-----------------+---------+

See :ref:`install.dependencies` and :ref:`install.optional_dependencies` for more.

Datetimes are now parsed with a consistent format
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In the past, :func:`to_datetime` guessed the format for each element independently. This was appropriate for some cases where elements had mixed date formats; however, it would regularly cause problems when users expected a consistent format but the function would switch formats between elements. As of version 2.0.0, parsing will use a consistent format, determined by the first non-NA value (unless the user specifies a format, in which case that is used).

Old behavior:

.. code-block:: ipython

    In [1]: ser = pd.Series(['13-01-2000', '12-01-2000'])
    In [2]: pd.to_datetime(ser)
    Out[2]:
    0   2000-01-13
    1   2000-12-01
    dtype: datetime64[ns]

New behavior:

.. ipython:: python
  :okwarning:

   ser = pd.Series(['13-01-2000', '12-01-2000'])
   pd.to_datetime(ser)

Note that this affects :func:`read_csv` as well.

If you still need to parse dates with inconsistent formats, you'll need to apply :func:`to_datetime` to each element individually, e.g.

.. code-block:: python

    ser = pd.Series(['13-01-2000', '12 January 2000'])
    ser.apply(pd.to_datetime)
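Alternatively, if you know the format up front, passing it explicitly avoids the guessing entirely (a minimal sketch):

.. code-block:: python

    ser = pd.Series(['13-01-2000', '12-01-2000'])
    pd.to_datetime(ser, format="%d-%m-%Y")  # both elements parsed day-first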

Other API changes
~~~~~~~~~~~~~~~~~

Deprecations
~~~~~~~~~~~~

Removal of prior version deprecations/changes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Performance improvements
~~~~~~~~~~~~~~~~~~~~~~~~

Bug fixes
~~~~~~~~~

Categorical
^^^^^^^^^^^

Datetimelike
^^^^^^^^^^^^

Timedelta
^^^^^^^^^

Timezones
^^^^^^^^^

Numeric
^^^^^^^

Conversion
^^^^^^^^^^

Strings
^^^^^^^

Interval
^^^^^^^^

Indexing
^^^^^^^^

Missing
^^^^^^^

MultiIndex
^^^^^^^^^^

I/O
^^^

Period
^^^^^^

Plotting
^^^^^^^^

Groupby/resample/rolling
^^^^^^^^^^^^^^^^^^^^^^^^

Reshaping
^^^^^^^^^

Sparse
^^^^^^

ExtensionArray
^^^^^^^^^^^^^^

Styler
^^^^^^

Metadata
^^^^^^^^

Other
^^^^^

Contributors
~~~~~~~~~~~~