What's new in 1.1.0 (??)

These are the changes in pandas 1.1.0. See :ref:`release` for a full changelog including other versions of pandas.

{{ header }}

Enhancements

Nonmonotonic PeriodIndex Partial String Slicing

:class:`PeriodIndex` now supports partial string slicing for non-monotonic indexes, mirroring :class:`DatetimeIndex` behavior (:issue:`31096`)

For example:

.. ipython:: python

   dti = pd.date_range("2014-01-01", periods=30, freq="30D")
   pi = dti.to_period("D")
   ser_monotonic = pd.Series(np.arange(30), index=pi)
   shuffler = list(range(0, 30, 2)) + list(range(1, 31, 2))
   ser = ser_monotonic[shuffler]
   ser

.. ipython:: python

   ser["2014"]
   ser.loc["May 2015"]


Allow NA in groupby key

With :ref:`groupby <groupby.dropna>`, we've added a dropna keyword to :meth:`DataFrame.groupby` and :meth:`Series.groupby` to allow NA values in group keys. Users can set dropna to False if they want to include NA values in the group keys. dropna defaults to True to keep backwards compatibility (:issue:`3729`)

.. ipython:: python

    df_list = [[1, 2, 3], [1, None, 4], [2, 1, 3], [1, 2, 2]]
    df_dropna = pd.DataFrame(df_list, columns=["a", "b", "c"])

    df_dropna

.. ipython:: python

    # Default `dropna` is set to True, which will exclude NaNs in keys
    df_dropna.groupby(by=["b"], dropna=True).sum()

    # In order to allow NaN in keys, set `dropna` to False
    df_dropna.groupby(by=["b"], dropna=False).sum()

The default setting of the dropna argument is True, which means NA values are not included in group keys.

.. versionadded:: 1.1.0
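
The dropna keyword is available on :meth:`Series.groupby` as well. A minimal sketch (the series below is illustrative and not part of the original notes):

.. ipython:: python

    # Group by an index level that contains a missing label
    ser_na = pd.Series([1, 2, 3, 3], index=["a", "a", "b", np.nan])
    ser_na.groupby(level=0).sum()
    ser_na.groupby(level=0, dropna=False).sum()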


Sorting with keys

We've added a key argument to the DataFrame and Series sorting methods, including :meth:`DataFrame.sort_values`, :meth:`DataFrame.sort_index`, :meth:`Series.sort_values`, and :meth:`Series.sort_index`. The key can be any callable that is applied to each column used for sorting, before sorting is performed (:issue:`27237`). See :ref:`sort_values with keys <basics.sort_value_key>` and :ref:`sort_index with keys <basics.sort_index_key>` for more information.

.. ipython:: python

   s = pd.Series(['C', 'a', 'B'])
   s

.. ipython:: python

   s.sort_values()


Note how this is sorted with capital letters first. If we apply the :meth:`Series.str.lower` method, we get

.. ipython:: python

   s.sort_values(key=lambda x: x.str.lower())


When applied to a DataFrame, the key is applied per column to all columns, or to a subset if by is specified, e.g.

.. ipython:: python

   df = pd.DataFrame({'a': ['C', 'C', 'a', 'a', 'B', 'B'],
                      'b': [1, 2, 3, 4, 5, 6]})
   df

.. ipython:: python

   df.sort_values(by=['a'], key=lambda col: col.str.lower())


For more details, see examples and documentation in :meth:`DataFrame.sort_values`, :meth:`Series.sort_values`, and :meth:`~DataFrame.sort_index`.
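
The key also applies when sorting by the index. A minimal sketch (not one of the original examples) using :meth:`Series.sort_index`:

.. ipython:: python

    # Sort string index labels case-insensitively by lowercasing them first
    s_keyed = pd.Series([1, 2, 3], index=['C', 'a', 'B'])
    s_keyed.sort_index()
    s_keyed.sort_index(key=lambda idx: idx.str.lower())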

Fold argument support in Timestamp constructor

:class:`Timestamp` now supports the keyword-only fold argument according to PEP 495, similar to the parent datetime.datetime class. It supports both accepting fold as an initialization argument and inferring fold from other constructor arguments (:issue:`25057`, :issue:`31338`). Support is limited to dateutil timezones as pytz doesn't support fold.

For example:

.. ipython:: python

    ts = pd.Timestamp("2019-10-27 01:30:00+00:00")
    ts.fold

.. ipython:: python

    ts = pd.Timestamp(year=2019, month=10, day=27, hour=1, minute=30,
                      tz="dateutil/Europe/London", fold=1)
    ts

For more on working with fold, see :ref:`Fold subsection <timeseries.fold>` in the user guide.

Parsing timezone-aware format with different timezones in to_datetime

:func:`to_datetime` now supports parsing formats containing timezone names (%Z) and UTC offsets (%z) from different timezones and then converting them to UTC by setting utc=True. This returns a :class:`DatetimeIndex` with a UTC timezone, as opposed to an :class:`Index` with object dtype when utc=True is not set (:issue:`32792`).

For example:

.. ipython:: python

    tz_strs = ["2010-01-01 12:00:00 +0100", "2010-01-01 12:00:00 -0100",
               "2010-01-01 12:00:00 +0300", "2010-01-01 12:00:00 +0400"]
    pd.to_datetime(tz_strs, format='%Y-%m-%d %H:%M:%S %z', utc=True)
    pd.to_datetime(tz_strs, format='%Y-%m-%d %H:%M:%S %z')

Grouper and resample now support the arguments origin and offset

:class:`Grouper` and :meth:`DataFrame.resample` now support the arguments origin and offset. These let the user control the timestamp on which to adjust the grouping (:issue:`31809`).

The bins of the grouping are adjusted based on the beginning of the day of the time series starting point. This works well with frequencies that are multiples of a day (like 30D) or that divide a day evenly (like 90s or 1min), but it can create inconsistencies with frequencies that do not meet this criterion. To change this behavior you can now specify a fixed timestamp with the argument origin.

Two arguments are now deprecated (more information in the documentation of :meth:`DataFrame.resample`; a rough migration sketch follows this list):

  • base should be replaced by offset.
  • loffset should be replaced by directly adding an offset to the index of the DataFrame after it has been resampled.
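
As a rough migration sketch (not part of the original notes; the data and names below are illustrative), the deprecated arguments map to the new API approximately as follows:

.. ipython:: python

    rng_dep = pd.date_range('2000-10-01 23:30:00', periods=10, freq='7min')
    ts_dep = pd.Series(np.arange(10), index=rng_dep)

    # base=2 (expressed in the unit of the frequency) roughly becomes offset='2min'
    ts_dep.resample('17min', offset='2min').sum()

    # loffset='5min' roughly becomes shifting the resampled index afterwards
    resampled = ts_dep.resample('17min').sum()
    resampled.index = resampled.index + pd.Timedelta('5min')
    resampled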

Small example of the use of origin:

.. ipython:: python

    start, end = '2000-10-01 23:30:00', '2000-10-02 00:30:00'
    middle = '2000-10-02 00:00:00'
    rng = pd.date_range(start, end, freq='7min')
    ts = pd.Series(np.arange(len(rng)) * 3, index=rng)
    ts

Resample with the default behavior 'start_day' (origin is 2000-10-01 00:00:00):

.. ipython:: python

    ts.resample('17min').sum()
    ts.resample('17min', origin='start_day').sum()

Resample using a fixed origin:

.. ipython:: python

    ts.resample('17min', origin='epoch').sum()
    ts.resample('17min', origin='2000-01-01').sum()

If needed, you can adjust the bins with the argument offset (a Timedelta) that will be added to the default origin.
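
For instance, a brief sketch (not one of the original examples) on the same series:

.. ipython:: python

    # Shift the default 'start_day' origin so that bins start at 23:30
    ts.resample('17min', offset='23h30min').sum()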

For a full example, see: :ref:`timeseries.adjust-the-start-of-the-bins`.

Other enhancements

Increased minimum versions for dependencies

Some minimum supported versions of dependencies were updated (:issue:`33718`, :issue:`29766`, :issue:`29723`, pytables >= 3.4.3). If installed, we now require:

=============== =============== ======== =======
Package         Minimum Version Required Changed
=============== =============== ======== =======
numpy           1.15.4          X        X
pytz            2015.4          X
python-dateutil 2.7.3           X        X
bottleneck      1.2.1
numexpr         2.6.2
pytest (dev)    4.0.2
=============== =============== ======== =======

For optional libraries the general recommendation is to use the latest version. The following table lists the lowest version per library that is currently being tested throughout the development of pandas. Optional libraries below the lowest tested version may still work, but are not considered supported.

============== =============== =======
Package        Minimum Version Changed
============== =============== =======
beautifulsoup4 4.6.0
fastparquet    0.3.2
gcsfs          0.2.2
lxml           3.8.0
matplotlib     2.2.2
numba          0.46.0
openpyxl       2.5.7
pyarrow        0.13.0
pymysql        0.7.1
pytables       3.4.3           X
s3fs           0.3.0
scipy          1.2.0           X
sqlalchemy     1.1.4
xarray         0.8.2
xlrd           1.1.0
xlsxwriter     0.9.8
xlwt           1.2.0
============== =============== =======

See :ref:`install.dependencies` and :ref:`install.optional_dependencies` for more.

Development Changes

  • The minimum version of Cython is now the most recent bug-fix version (0.29.16) (:issue:`33334`).

Other API changes

Backwards incompatible API changes

MultiIndex.get_indexer interprets method argument differently

This restores the behavior of :meth:`MultiIndex.get_indexer` with method='backfill' or method='pad' to the behavior before pandas 0.23.0. In particular, MultiIndexes are treated as a list of tuples and padding or backfilling is done with respect to the ordering of these lists of tuples (:issue:`29896`).

As an example of this, given:

.. ipython:: python

        df = pd.DataFrame({
            'a': [0, 0, 0, 0],
            'b': [0, 2, 3, 4],
            'c': ['A', 'B', 'C', 'D'],
        }).set_index(['a', 'b'])
        mi_2 = pd.MultiIndex.from_product([[0], [-1, 0, 1, 3, 4, 5]])

The differences in reindexing df with mi_2 and using method='backfill' can be seen here:

pandas >= 0.23, < 1.1.0:

.. code-block:: ipython

    In [1]: df.reindex(mi_2, method='backfill')
    Out[1]:
          c
    0 -1  A
       0  A
       1  D
       3  A
       4  A
       5  C

pandas < 0.23, >= 1.1.0:

.. ipython:: python

        df.reindex(mi_2, method='backfill')

And the differences in reindexing df with mi_2 and using method='pad' can be seen here:

pandas >= 0.23, < 1.1.0:

.. code-block:: ipython

    In [1]: df.reindex(mi_2, method='pad')
    Out[1]:
            c
    0 -1  NaN
       0  NaN
       1    D
       3  NaN
       4    A
       5    C

pandas < 0.23, >= 1.1.0:

.. ipython:: python

        df.reindex(mi_2, method='pad')

Failed Label-Based Lookups Always Raise KeyError

Label lookups series[key], series.loc[key] and frame.loc[key] used to raise either KeyError or TypeError depending on the type of key and the type of :class:`Index`. These now consistently raise KeyError (:issue:`31867`)

.. ipython:: python

    ser1 = pd.Series(range(3), index=[0, 1, 2])
    ser2 = pd.Series(range(3), index=pd.date_range("2020-02-01", periods=3))

Previous behavior:

.. code-block:: ipython

    In [3]: ser1[1.5]
    ...
    TypeError: cannot do label indexing on Int64Index with these indexers [1.5] of type float

    In [4]: ser1["foo"]
    ...
    KeyError: 'foo'

    In [5]: ser1.loc[1.5]
    ...
    TypeError: cannot do label indexing on Int64Index with these indexers [1.5] of type float

    In [6]: ser1.loc["foo"]
    ...
    KeyError: 'foo'

    In [7]: ser2.loc[1]
    ...
    TypeError: cannot do label indexing on DatetimeIndex with these indexers [1] of type int

    In [8]: ser2.loc[pd.Timestamp(0)]
    ...
    KeyError: Timestamp('1970-01-01 00:00:00')

New behavior:

.. code-block:: ipython

    In [3]: ser1[1.5]
    ...
    KeyError: 1.5

    In [4]: ser1["foo"]
    ...
    KeyError: 'foo'

    In [5]: ser1.loc[1.5]
    ...
    KeyError: 1.5

    In [6]: ser1.loc["foo"]
    ...
    KeyError: 'foo'

    In [7]: ser2.loc[1]
    ...
    KeyError: 1

    In [8]: ser2.loc[pd.Timestamp(0)]
    ...
    KeyError: Timestamp('1970-01-01 00:00:00')
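
As an illustrative note (not from the original changelog), code that previously caught TypeError for these lookups should now catch KeyError instead:

.. ipython:: python

    # ser1 is defined above; a missing float label now raises KeyError
    try:
        ser1.loc[1.5]
    except KeyError as err:
        print(f"caught {err!r}")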

Failed Integer Lookups on MultiIndex Raise KeyError

Indexing with integers on a :class:`MultiIndex` that has an integer-dtype first level incorrectly failed to raise KeyError when one or more of those integer keys was not present in the first level of the index (:issue:`33539`)

.. ipython:: python

    idx = pd.Index(range(4))
    dti = pd.date_range("2000-01-03", periods=3)
    mi = pd.MultiIndex.from_product([idx, dti])
    ser = pd.Series(range(len(mi)), index=mi)

Previous behavior:

.. code-block:: ipython

    In [5]: ser[[5]]
    Out[5]: Series([], dtype: int64)

New behavior:

.. code-block:: ipython

    In [5]: ser[[5]]
    ...
    KeyError: '[5] not in index'

:meth:`DataFrame.merge` preserves right frame's row order

:meth:`DataFrame.merge` now preserves the right frame's row order when performing a right merge (:issue:`27453`)

.. ipython:: python

    left_df = pd.DataFrame({'animal': ['dog', 'pig'], 'max_speed': [40, 11]})
    right_df = pd.DataFrame({'animal': ['quetzal', 'pig'], 'max_speed': [80, 11]})
    left_df
    right_df

Previous behavior:

.. code-block:: python

    >>> left_df.merge(right_df, on=['animal', 'max_speed'], how="right")
        animal  max_speed
    0      pig         11
    1  quetzal         80

New behavior:

.. ipython:: python

    left_df.merge(right_df, on=['animal', 'max_speed'], how="right")

Assignment to multiple columns of a DataFrame when some columns do not exist

Assignment to multiple columns of a :class:`DataFrame` when some of the columns do not exist would previously assign the values to the last column. Now, new columns are constructed with the right values (:issue:`13658`).

.. ipython:: python

   df = pd.DataFrame({'a': [0, 1, 2], 'b': [3, 4, 5]})
   df

Previous behavior:

.. code-block:: ipython

    In [3]: df[['a', 'c']] = 1
    In [4]: df
    Out[4]:
       a  b
    0  1  1
    1  1  1
    2  1  1

New behavior:

.. ipython:: python

   df[['a', 'c']] = 1
   df

Deprecations

Performance improvements

Bug fixes

Categorical

Datetimelike

Timedelta

Timezones

Numeric

Conversion

Strings

Interval

Indexing

Missing

MultiIndex

.. ipython:: python

        df = pd.DataFrame(np.arange(4),
                          index=[["a", "a", "b", "b"], [1, 2, 1, 2]])
        # Rows are now ordered as the requested keys
        df.loc[(['b', 'a'], [2, 1]), :]

.. ipython:: python

        left = pd.MultiIndex.from_arrays([["b", "a"], [2, 1]])
        right = pd.MultiIndex.from_arrays([["a", "b", "c"], [1, 2, 3]])
        # Common elements are now guaranteed to be ordered by the left side
        left.intersection(right, sort=False)

  • Bug when joining two MultiIndexes without specifying a level, with different columns, where the return_indexers parameter was ignored (:issue:`34074`); see the sketch below.
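
A hedged illustration of the fixed behavior (the index values below are made up for the example):

.. ipython:: python

    mi_left = pd.MultiIndex.from_arrays([["a", "b"], [1, 2]])
    mi_right = pd.MultiIndex.from_arrays([["b", "c"], [2, 3]])
    # return_indexers=True is no longer ignored; indexers are returned as well
    mi_left.join(mi_right, how="outer", return_indexers=True)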

I/O

Plotting

Groupby/resample/rolling

Reshaping

Sparse

ExtensionArray

Other

Contributors