
What's new in 1.2.0 (??)
-------------------------

These are the changes in pandas 1.2.0. See :ref:`release` for a full changelog including other versions of pandas.

{{ header }}

Enhancements
~~~~~~~~~~~~

Optionally disallow duplicate labels
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

:class:`Series` and :class:`DataFrame` can now be created with the ``allows_duplicate_labels=False`` flag to control whether the index or columns can contain duplicate labels (:issue:`28394`). This can be used to prevent the accidental introduction of duplicate labels, which can affect downstream operations.

By default, duplicates continue to be allowed:

.. ipython:: python

   pd.Series([1, 2], index=['a', 'a'])

.. ipython:: python
   :okexcept:

   pd.Series([1, 2], index=['a', 'a']).set_flags(allows_duplicate_labels=False)

pandas will propagate the ``allows_duplicate_labels`` property through many operations.

.. ipython:: python
   :okexcept:

   a = (
       pd.Series([1, 2], index=['a', 'b'])
         .set_flags(allows_duplicate_labels=False)
   )
   a
   # An operation introducing duplicates
   a.reindex(['a', 'b', 'a'])

.. warning::

   This is an experimental feature. Currently, many methods fail to propagate
   the ``allows_duplicate_labels`` value. In future versions it is expected that
   every method taking or returning one or more DataFrame or Series objects
   will propagate ``allows_duplicate_labels``.

See :ref:`duplicates` for more.

The ``allows_duplicate_labels`` flag is stored in the new :attr:`DataFrame.flags` attribute. This stores global attributes that apply to the pandas object. This differs from :attr:`DataFrame.attrs`, which stores information that applies to the dataset.
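
As a short sketch of the distinction (the ``"source"`` key is an arbitrary example, not a reserved name):

.. ipython:: python

   df = pd.DataFrame({"a": [1, 2]})
   # global behaviour flags for the pandas object
   df.flags.allows_duplicate_labels
   # free-form metadata describing the dataset itself
   df.attrs["source"] = "example.csv"
   df.attrs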

Passing arguments to fsspec backends
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Many read/write functions have acquired the optional ``storage_options`` argument, to pass a dictionary of parameters to the storage backend. This allows, for example, passing credentials to S3 and GCS storage. The details of what parameters can be passed to which backends can be found in the documentation of the individual storage backends (detailed from the fsspec docs for builtin implementations and linked to external ones). See Section :ref:`io.remote`.

:issue:`35655` added fsspec support (including ``storage_options``) for reading Excel files.
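
As a sketch, ``storage_options`` is forwarded unchanged to the fsspec backend; the bucket and key below are hypothetical:

.. code-block:: python

   import pandas as pd

   df = pd.read_csv(
       "s3://my-bucket/data.csv",       # hypothetical S3 path
       storage_options={"anon": True},  # forwarded to s3fs
   )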

Support for binary file handles in ``to_csv``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

:meth:`to_csv` supports file handles in binary mode (:issue:`19827` and :issue:`35058`) with ``encoding`` (:issue:`13068` and :issue:`23854`) and ``compression`` (:issue:`22555`). ``mode`` has to contain a ``b`` for binary handles to be supported.

For example:

.. ipython:: python

   import io

   data = pd.DataFrame([0, 1, 2])
   buffer = io.BytesIO()
   data.to_csv(buffer, mode="w+b", encoding="utf-8", compression="gzip")
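
To verify the round trip, the buffer can be rewound and read back (a small usage sketch):

.. ipython:: python

   buffer.seek(0)
   pd.read_csv(buffer, compression="gzip", index_col=0)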

Support for short caption and table position in ``to_latex``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

:meth:`DataFrame.to_latex` now allows one to specify a floating table position (:issue:`35281`) and a short caption (:issue:`36267`).

The new keyword ``position`` sets the table's floating position in the LaTeX output.

.. ipython:: python

   data = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
   table = data.to_latex(position='ht')
   print(table)

Usage of the keyword ``caption`` is extended. Besides taking a single string as an argument, one can optionally provide a tuple ``(full_caption, short_caption)`` to add a short caption macro.

.. ipython:: python

   data = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
   table = data.to_latex(caption=('the full long caption', 'short caption'))
   print(table)

Change in default floating precision for ``read_csv`` and ``read_table``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

For the C parsing engine, the functions :func:`read_csv` and :func:`read_table` previously defaulted to a parser that could read floating point numbers slightly incorrectly with respect to the last bit in precision. The option ``float_precision="high"`` has always been available to avoid this issue. Beginning with this version, the default is now the more accurate parser: ``float_precision=None`` corresponds to the high precision parser, and the new option ``float_precision="legacy"`` selects the legacy parser. The change to the higher precision parser by default should have no impact on performance (:issue:`17154`).
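
A minimal sketch of selecting the parser explicitly (the CSV content is illustrative):

.. ipython:: python

   import io

   buf = io.StringIO("a\n0.3066101993807095471566981359501369297504425048828125")
   pd.read_csv(buf)  # the default now uses the high precision parser
   buf.seek(0)
   pd.read_csv(buf, float_precision="legacy")  # opt back into the legacy parser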

Experimental nullable data types for float data
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

We've added :class:`Float32Dtype` / :class:`Float64Dtype` and :class:`~arrays.FloatingArray`, an extension data type dedicated to floating point data that can hold the ``pd.NA`` missing value indicator (:issue:`32265`, :issue:`34307`).

While the default float data type already supports missing values using ``np.nan``, this new data type uses ``pd.NA`` (and its corresponding behaviour) as the missing value indicator, in line with the already existing nullable :ref:`integer <integer_na>` and :ref:`boolean <boolean>` data types.

One example where the behaviour of ``np.nan`` and ``pd.NA`` differs is comparison operations:

.. ipython:: python

  # the default numpy float64 dtype
  s1 = pd.Series([1.5, None])
  s1
  s1 > 1

.. ipython:: python

  # the new nullable float64 dtype
  s2 = pd.Series([1.5, None], dtype="Float64")
  s2
  s2 > 1

See the :ref:`missing_data.NA` doc section for more details on the behaviour when using the ``pd.NA`` missing value indicator.

As shown above, the dtype can be specified using the ``"Float64"`` or ``"Float32"`` string (capitalized to distinguish it from the default ``"float64"`` data type). Alternatively, you can also use the dtype object:

.. ipython:: python

   pd.Series([1.5, None], dtype=pd.Float32Dtype())

.. warning::

   Experimental: the new floating data types are currently experimental, and
   their behaviour or API may still change without warning. In particular, the
   behaviour regarding NaN (distinct from NA missing values) is subject to
   change.

Index/column name preservation when aggregating
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

When aggregating using :func:`concat` or the :class:`DataFrame` constructor, pandas will attempt to preserve index (and column) names whenever possible (:issue:`35847`). In the case where all inputs share a common name, this name will be assigned to the result. When the input names do not all agree, the result will be unnamed. Here is an example where the index name is preserved:

.. ipython:: python

    idx = pd.Index(range(5), name='abc')
    ser = pd.Series(range(5, 10), index=idx)
    pd.concat({'x': ser[1:], 'y': ser[:-1]}, axis=1)

The same is true for :class:`MultiIndex`, but the logic is applied separately on a level-by-level basis.
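
For instance (a sketch of the level-by-level behaviour, not taken from the release note itself):

.. ipython:: python

    midx = pd.MultiIndex.from_product([['a', 'b'], [1, 2]],
                                      names=['outer', 'inner'])
    mser = pd.Series(range(4), index=midx)
    pd.concat({'x': mser[1:], 'y': mser[:-1]}, axis=1)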

Other enhancements
^^^^^^^^^^^^^^^^^^

Increased minimum version for Python
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

pandas 1.2.0 supports Python 3.7.1 and higher (:issue:`35214`).

Increased minimum versions for dependencies
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Some minimum supported versions of dependencies were updated (:issue:`35214`). If installed, we now require:

+-----------------+-----------------+----------+---------+
| Package         | Minimum Version | Required | Changed |
+=================+=================+==========+=========+
| numpy           | 1.16.5          | X        | X       |
+-----------------+-----------------+----------+---------+
| pytz            | 2017.3          | X        | X       |
+-----------------+-----------------+----------+---------+
| python-dateutil | 2.7.3           | X        |         |
+-----------------+-----------------+----------+---------+
| bottleneck      | 1.2.1           |          |         |
+-----------------+-----------------+----------+---------+
| numexpr         | 2.6.8           |          | X       |
+-----------------+-----------------+----------+---------+
| pytest (dev)    | 5.0.1           |          | X       |
+-----------------+-----------------+----------+---------+
| mypy (dev)      | 0.782           |          | X       |
+-----------------+-----------------+----------+---------+

For optional libraries the general recommendation is to use the latest version. The following table lists the lowest version per library that is currently being tested throughout the development of pandas. Optional libraries below the lowest tested version may still work, but are not considered supported.

+----------------+-----------------+---------+
| Package        | Minimum Version | Changed |
+================+=================+=========+
| beautifulsoup4 | 4.6.0           |         |
+----------------+-----------------+---------+
| fastparquet    | 0.3.2           |         |
+----------------+-----------------+---------+
| fsspec         | 0.7.4           |         |
+----------------+-----------------+---------+
| gcsfs          | 0.6.0           |         |
+----------------+-----------------+---------+
| lxml           | 4.3.0           | X       |
+----------------+-----------------+---------+
| matplotlib     | 2.2.3           | X       |
+----------------+-----------------+---------+
| numba          | 0.46.0          |         |
+----------------+-----------------+---------+
| openpyxl       | 2.6.0           | X       |
+----------------+-----------------+---------+
| pyarrow        | 0.15.0          | X       |
+----------------+-----------------+---------+
| pymysql        | 0.7.11          | X       |
+----------------+-----------------+---------+
| pytables       | 3.5.1           | X       |
+----------------+-----------------+---------+
| s3fs           | 0.4.0           |         |
+----------------+-----------------+---------+
| scipy          | 1.2.0           |         |
+----------------+-----------------+---------+
| sqlalchemy     | 1.2.8           | X       |
+----------------+-----------------+---------+
| xarray         | 0.12.0          | X       |
+----------------+-----------------+---------+
| xlrd           | 1.2.0           | X       |
+----------------+-----------------+---------+
| xlsxwriter     | 1.0.2           | X       |
+----------------+-----------------+---------+
| xlwt           | 1.3.0           | X       |
+----------------+-----------------+---------+
| pandas-gbq     | 0.12.0          |         |
+----------------+-----------------+---------+

See :ref:`install.dependencies` and :ref:`install.optional_dependencies` for more.

Other API changes
~~~~~~~~~~~~~~~~~

Deprecations
~~~~~~~~~~~~

Performance improvements
~~~~~~~~~~~~~~~~~~~~~~~~

Bug fixes
~~~~~~~~~

Categorical
^^^^^^^^^^^

Datetimelike
^^^^^^^^^^^^

Timedelta
^^^^^^^^^

Timezones
^^^^^^^^^

Numeric
^^^^^^^

Conversion
^^^^^^^^^^

Strings
^^^^^^^

Interval
^^^^^^^^

Indexing
^^^^^^^^

Missing
^^^^^^^

MultiIndex
^^^^^^^^^^

I/O
^^^

Plotting
^^^^^^^^

Groupby/resample/rolling
^^^^^^^^^^^^^^^^^^^^^^^^

Reshaping
^^^^^^^^^

Sparse
^^^^^^

ExtensionArray
^^^^^^^^^^^^^^

Other
^^^^^

Contributors
~~~~~~~~~~~~