
.. _whatsnew_0170:

v0.17.0 (???)
-------------

This is a major release from 0.16.2 and includes a small number of API changes, several new features,
enhancements, and performance improvements along with a large number of bug fixes. We recommend that all
users upgrade to this version.

.. warning::

   pandas >= 0.17.0 will no longer support compatibility with Python version 3.2 (:issue:`9118`)

Highlights include:

  - Release the Global Interpreter Lock (GIL) on some cython operations, see :ref:`here <whatsnew_0170.gil>`
  - The default for ``to_datetime`` will now be to ``raise`` when presented with unparseable formats;
    previously this would return the original input, see :ref:`here <whatsnew_0170.api_breaking.to_datetime>`
  - The default for ``dropna`` in ``HDFStore`` has changed to ``False``, to store by default all rows even
    if they are all ``NaN``, see :ref:`here <whatsnew_0170.api_breaking.hdf_dropna>`
  - Support for ``Series.dt.strftime`` to generate formatted strings for datetime-likes, see :ref:`here <whatsnew_0170.strftime>`
  - Development installed versions of pandas will now have ``PEP440`` compliant version strings (:issue:`9518`)

Check the :ref:`API Changes <whatsnew_0170.api>` and :ref:`deprecations <whatsnew_0170.deprecations>` before updating.

.. contents:: What's new in v0.17.0
    :local:
    :backlinks: none

.. _whatsnew_0170.enhancements:

New features
~~~~~~~~~~~~

- SQL io functions now accept a SQLAlchemy connectable. (:issue:`7877`)
- Enable writing complex values to HDF stores when using table format (:issue:`10447`)
- Enable reading gzip compressed files via URL, either by explicitly setting the compression parameter or by inferring from the presence of the HTTP Content-Encoding header in the response (:issue:`8685`)


.. _whatsnew_0170.gil:

Releasing the GIL
^^^^^^^^^^^^^^^^^

We are releasing the global-interpreter-lock (GIL) on some cython operations.
This will allow other threads to run simultaneously during computation, potentially allowing performance improvements
from multi-threading. Notably ``groupby`` and some indexing operations benefit from this. (:issue:`8882`)

For example, the groupby expression in the following code will have the GIL released during the factorization step (``df.groupby('key')``)
as well as during the ``.sum()`` operation.

.. code-block:: python

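   import numpy as np
   from pandas import DataFrame

   N = 1000000   # must be an integer to be usable as a size
   ngroups = 10  # number of distinct keys, chosen for illustration
   df = DataFrame({'key' : np.random.randint(0,ngroups,size=N),
                   'data' : np.random.randn(N) })
   df.groupby('key')['data'].sum()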

Releasing the GIL could benefit an application that uses threads for user interactions (e.g. QT_) or performs multi-threaded computations. A nice example of a library that can handle these types of computation-in-parallel is the dask_ library.
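
As a rough illustration of the multi-threaded case, the sketch below runs the same ``groupby`` on several frames from a thread pool. The helper ``group_sum``, the frame sizes, and the worker count are made up for this example, and it assumes Python 3's ``concurrent.futures``. Whether threads give a measurable speedup depends on the workload and the hardware.

.. code-block:: python

   from concurrent.futures import ThreadPoolExecutor

   import numpy as np
   import pandas as pd

   def group_sum(frame):
       # The factorization and the sum are the steps described above
       # as running with the GIL released.
       return frame.groupby('key')['data'].sum()

   rng = np.random.RandomState(0)
   frames = [pd.DataFrame({'key': rng.randint(0, 10, size=100000),
                           'data': rng.randn(100000)})
             for _ in range(4)]

   # Each worker thread handles one frame.
   with ThreadPoolExecutor(max_workers=4) as pool:
       threaded = list(pool.map(group_sum, frames))

   # Same computation without threads, for comparison.
   serial = [group_sum(frame) for frame in frames]

Both lists hold the same per-key sums; only the scheduling differs.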

.. _dask: https://dask.readthedocs.org/en/latest/
.. _QT: https://wiki.python.org/moin/PyQt

.. _whatsnew_0170.strftime:

Support strftime for Datetimelikes
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

We are now supporting a ``Series.dt.strftime`` method for datetime-likes to generate a formatted string (:issue:`10110`). Examples:

  .. ipython:: python

     # DatetimeIndex
     s = pd.Series(pd.date_range('20130101', periods=4))
     s
     s.dt.strftime('%Y/%m/%d')

  .. ipython:: python

     # PeriodIndex
     s = pd.Series(pd.period_range('20130101', periods=4))
     s
     s.dt.strftime('%Y/%m/%d')

The string format follows the Python standard library; details can be found `here <https://docs.python.org/2/library/datetime.html#strftime-and-strptime-behavior>`_.

.. _whatsnew_0170.enhancements.other:

Other enhancements
^^^^^^^^^^^^^^^^^^

- ``read_sql`` and ``to_sql`` can accept a database URI as the ``con`` parameter (:issue:`10214`)

- Enable ``read_hdf`` to be used without specifying a key when the HDF file contains a single dataset (:issue:`10443`)

- ``DatetimeIndex`` can be instantiated using strings containing ``NaT`` (:issue:`7599`)
- The string parsing of ``to_datetime``, ``Timestamp`` and ``DatetimeIndex`` has been made consistent. (:issue:`7599`)

  Prior to v0.17.0, ``Timestamp`` and ``to_datetime`` could parse a year-only datetime string incorrectly using today's date, whereas ``DatetimeIndex``
  used the beginning of the year. ``Timestamp`` and ``to_datetime`` could also raise ``ValueError`` for some datetime strings that ``DatetimeIndex``
  can parse, such as a quarterly string.

  Previous Behavior

  .. code-block:: python

     In [1]: Timestamp('2012Q2')
     Traceback
        ...
     ValueError: Unable to parse 2012Q2

     # Results in today's date.
     In [2]: Timestamp('2014')
     Out [2]: 2014-08-12 00:00:00

  v0.17.0 can parse them as below. It works on ``DatetimeIndex`` also.

  New Behavior

  .. ipython:: python

     Timestamp('2012Q2')
     Timestamp('2014')
     DatetimeIndex(['2012Q2', '2014'])

  .. note:: If you want to perform calculations based on today's date, use ``Timestamp.now()`` and ``pandas.tseries.offsets``.

  .. ipython:: python

     import pandas.tseries.offsets as offsets
     Timestamp.now()
     Timestamp.now() + offsets.DateOffset(years=1)

- ``to_datetime`` can now accept ``yearfirst`` keyword (:issue:`7599`)

- ``.as_blocks`` will now take a ``copy`` optional argument to return a copy of the data, default is to copy (no change in behavior from prior versions) (:issue:`9607`)

- ``regex`` argument to ``DataFrame.filter`` now handles numeric column names instead of raising ``ValueError`` (:issue:`10384`).
- ``pd.read_stata`` will now read Stata 118 type files. (:issue:`9882`)

- ``pd.merge`` will now allow duplicate column names if they are not merged upon (:issue:`10639`).

- ``pd.pivot`` will now allow passing index as ``None`` (:issue:`3962`).
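
The ``DataFrame.filter`` change listed above can be sketched as follows. The frame and the pattern are invented for illustration, and the pattern is assumed to be matched against the string form of each column label.

.. code-block:: python

   import pandas as pd

   df = pd.DataFrame([[1, 2, 3, 4]], columns=[1, 10, 2, 'a'])

   # Keep the columns whose label, read as a string, starts with "1".
   selected = df.filter(regex='^1')
   labels = list(selected.columns)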

.. _whatsnew_0170.api:

.. _whatsnew_0170.api_breaking:

Backwards incompatible API changes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. _whatsnew_0170.api_breaking.to_datetime:

Changes to to_datetime and to_timedelta
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The default for ``pd.to_datetime`` error handling has changed to ``errors='raise'``. In prior versions it was ``errors='ignore'``.
This means that invalid parsing will raise rather than return the original input as in previous versions.
Furthermore, the ``coerce`` argument has been deprecated in favor of ``errors='coerce'``. (:issue:`10636`)

Previous Behavior:

.. code-block:: python

   In [2]: pd.to_datetime(['2009-07-31', 'asd'])
   Out[2]: array(['2009-07-31', 'asd'], dtype=object)

New Behavior:

.. code-block:: python

   In [3]: pd.to_datetime(['2009-07-31', 'asd'])
   ValueError: Unknown string format

Of course you can coerce this as well.

.. ipython:: python

   to_datetime(['2009-07-31', 'asd'], errors='coerce')

To keep the previous behaviour, you can use ``errors='ignore'``:

.. ipython:: python

   to_datetime(['2009-07-31', 'asd'], errors='ignore')

Furthermore, ``pd.to_timedelta`` has gained a similar API, of ``errors='raise'|'ignore'|'coerce'``, and the ``coerce`` keyword
has been deprecated in favor of ``errors='coerce'``.
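
A minimal sketch of the ``pd.to_timedelta`` error handling, using made-up input in which ``'asd'`` stands in for any unparseable value:

.. code-block:: python

   import pandas as pd

   values = ['1 days', 'asd']

   # errors='coerce' turns unparseable entries into NaT.
   coerced = pd.to_timedelta(values, errors='coerce')

   # The default, errors='raise', raises ValueError instead.
   try:
       pd.to_timedelta(values)
       raised = False
   except ValueError:
       raised = True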

.. _whatsnew_0170.api_breaking.convert_objects:

Changes to convert_objects
^^^^^^^^^^^^^^^^^^^^^^^^^^

``DataFrame.convert_objects`` keyword arguments have been shortened. (:issue:`10265`)

  =====================   =============
  Old                     New
  =====================   =============
  ``convert_dates``       ``datetime``
  ``convert_numeric``     ``numeric``
  ``convert_timedelta``   ``timedelta``
  =====================   =============

Coercing types with ``DataFrame.convert_objects`` is now implemented using the
keyword argument ``coerce=True``.  Previously types were coerced by setting a
keyword argument to ``'coerce'`` instead of ``True``, as in ``convert_dates='coerce'``.

.. ipython:: python

   df = pd.DataFrame({'i': ['1','2'],
                      'f': ['apple', '4.2'],
                      's': ['apple','banana']})
   df

The old usage of ``DataFrame.convert_objects`` used ``'coerce'`` along with the
type.

.. code-block:: python

   In [2]: df.convert_objects(convert_numeric='coerce')

Now the ``coerce`` keyword must be explicitly used.

.. ipython:: python

   df.convert_objects(numeric=True, coerce=True)

In earlier versions of pandas, ``DataFrame.convert_objects`` would not coerce
numeric types when there were no values convertible to a numeric type; it returned
the original DataFrame with no conversion. This behavior has changed so that
all non-number-like strings are converted to ``NaN``.

.. code-block:: python

   In [1]: df = pd.DataFrame({'s': ['a','b']})
   In [2]: df.convert_objects(convert_numeric='coerce')
   Out[2]:
          s
       0  a
       1  b

.. ipython:: python

   df = pd.DataFrame({'s': ['a','b']})
   df.convert_objects(numeric=True, coerce=True)

In earlier versions of pandas, the default behavior was to try and convert
datetimes and timestamps. The new default is for ``DataFrame.convert_objects``
to do nothing, and so it is necessary to pass at least one conversion target
in the method call.

Changes to Index Comparisons
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The equality operator on ``Index`` should behave similarly to ``Series`` (:issue:`9947`, :issue:`10637`)

Starting in v0.17.0, comparing ``Index`` objects of different lengths will raise
a ``ValueError``. This is to be consistent with the behavior of ``Series``.

Previous behavior:

.. code-block:: python

   In [2]: pd.Index([1, 2, 3]) == pd.Index([1, 4, 5])
   Out[2]: array([ True, False, False], dtype=bool)

   In [3]: pd.Index([1, 2, 3]) == pd.Index([2])
   Out[3]: array([False,  True, False], dtype=bool)

   In [4]: pd.Index([1, 2, 3]) == pd.Index([1, 2])
   Out[4]: False

   In [5]: pd.Series([1, 2, 3]) == pd.Series([1, 4, 5])
   Out[5]:
   0     True
   1    False
   2    False
   dtype: bool

   In [6]: pd.Series([1, 2, 3]) == pd.Series([2])
   ValueError: Series lengths must match to compare

   In [7]: pd.Series([1, 2, 3]) == pd.Series([1, 2])
   ValueError: Series lengths must match to compare

New behavior:

.. code-block:: python

   In [8]: pd.Index([1, 2, 3]) == pd.Index([1, 4, 5])
   Out[8]: array([ True, False, False], dtype=bool)

   In [9]: pd.Index([1, 2, 3]) == pd.Index([2])
   ValueError: Lengths must match to compare

   In [10]: pd.Index([1, 2, 3]) == pd.Index([1, 2])
   ValueError: Lengths must match to compare

   In [11]: pd.Series([1, 2, 3]) == pd.Series([1, 4, 5])
   Out[11]:
   0     True
   1    False
   2    False
   dtype: bool

   In [12]: pd.Series([1, 2, 3]) == pd.Series([2])
   ValueError: Series lengths must match to compare

   In [13]: pd.Series([1, 2, 3]) == pd.Series([1, 2])
   ValueError: Series lengths must match to compare

Note that this is different from the ``numpy`` behavior where a comparison can
be broadcast:

.. ipython:: python

   np.array([1, 2, 3]) == np.array([1])

or it can return False if broadcasting cannot be done:

.. ipython:: python

   np.array([1, 2, 3]) == np.array([1, 2])
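
When the lengths may differ and a single ``True``/``False`` answer is wanted, ``Index.equals`` is one way to avoid the exception. A short sketch with made-up indexes:

.. code-block:: python

   import pandas as pd

   left = pd.Index([1, 2, 3])
   right = pd.Index([1, 2])

   # Element-wise comparison raises for mismatched lengths.
   try:
       left == right
       raised = False
   except ValueError:
       raised = True

   # Index.equals compares the objects as a whole and does not raise.
   same = left.equals(right)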

Changes to Boolean Comparisons vs. None
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Boolean comparisons of a ``Series`` vs ``None`` will now be equivalent to comparing with ``np.nan``, rather than raise ``TypeError``. xref (:issue:`1079`).

.. ipython:: python

   s = Series(range(3))
   s.iloc[1] = None
   s

Previous behavior:

.. code-block:: python

   In [5]: s==None
   TypeError: Could not compare <type 'NoneType'> type with Series

New behavior:

.. ipython:: python

   s==None

Usually you simply want to know which values are null.

.. ipython:: python

   s.isnull()

.. warning::

   You generally will want to use ``isnull/notnull`` for these types of comparisons, as ``isnull/notnull`` tells you which elements are null. One has to be
   mindful that ``nan's`` don't compare equal, but ``None's`` do. Note that pandas/numpy uses the fact that ``np.nan != np.nan``, and treats ``None`` like ``np.nan``.

   .. ipython:: python

      None == None
      np.nan == np.nan

.. _whatsnew_0170.api_breaking.hdf_dropna:

HDFStore dropna behavior
^^^^^^^^^^^^^^^^^^^^^^^^

The default behavior for HDFStore write functions with ``format='table'`` is now to keep rows that are all missing. Previously, the behavior was to drop rows that were all missing except for the index. The previous behavior can be replicated using the ``dropna=True`` option. (:issue:`9382`)

Previously:

.. ipython:: python

   df_with_missing = pd.DataFrame({'col1':[0, np.nan, 2],
                                   'col2':[1, np.nan, np.nan]})

   df_with_missing


.. code-block:: python

   In [28]:
   df_with_missing.to_hdf('file.h5', 'df_with_missing', format='table', mode='w')

   pd.read_hdf('file.h5', 'df_with_missing')

   Out [28]:
         col1  col2
     0     0     1
     2     2   NaN


New behavior:

.. ipython:: python
   :suppress:

   import os

.. ipython:: python

   df_with_missing.to_hdf('file.h5', 'df_with_missing', format = 'table', mode='w')

   pd.read_hdf('file.h5', 'df_with_missing')

.. ipython:: python
   :suppress:

   os.remove('file.h5')

See :ref:`documentation <io.hdf5>` for more details.

Changes to ``display.precision`` option
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The ``display.precision`` option has been clarified to refer to decimal places (:issue:`10451`).

Earlier versions of pandas would format floating point numbers to have one less decimal place than the value in
``display.precision``.

.. code-block:: python

  In [1]: pd.set_option('display.precision', 2)

  In [2]: pd.DataFrame({'x': [123.456789]})
  Out[2]:
         x
  0  123.5

Interpreting precision as "significant figures" did work for scientific notation, but that same interpretation
did not work for values with standard formatting. It was also out of step with how numpy handles formatting.

Going forward the value of ``display.precision`` will directly control the number of places after the decimal, for
regular formatting as well as scientific notation, similar to how numpy's ``precision`` print option works.

.. ipython:: python

  pd.set_option('display.precision', 2)
  pd.DataFrame({'x': [123.456789]})

To preserve output behavior with prior versions the default value of ``display.precision`` has been reduced to ``6``
from ``7``.
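
To try the new interpretation without changing the global setting, ``pd.option_context`` can scope the option to a block. A small sketch, with a made-up frame:

.. code-block:: python

   import pandas as pd

   df = pd.DataFrame({'x': [123.456789]})

   # Inside the block, two places after the decimal point are shown.
   with pd.option_context('display.precision', 2):
       text = repr(df)

The setting reverts to its previous value once the ``with`` block exits.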

.. ipython:: python
  :suppress:

  pd.set_option('display.precision', 6)


.. _whatsnew_0170.api_breaking.other:

Other API Changes
^^^^^^^^^^^^^^^^^

- Line and kde plot with ``subplots=True`` now uses default colors, not all black. Specify ``color='k'`` to draw all lines in black (:issue:`9894`)
- Enable writing Excel files in :ref:`memory <io.excel_writing_buffer>` using StringIO/BytesIO (:issue:`7074`)
- Enable serialization of lists and dicts to strings in ExcelWriter (:issue:`8188`)
- Allow passing ``kwargs`` to the interpolation methods (:issue:`10378`).
- Serialize metadata properties of subclasses of pandas objects (:issue:`10553`).
- ``Categorical.unique`` now returns a new ``Categorical`` whose ``categories`` and ``codes`` are unique, rather than returning ``np.array`` (:issue:`10508`)

   - unordered category: values and categories are sorted by appearance order.
   - ordered category: values are sorted by appearance order, categories keep their existing order.

   .. ipython :: python

      cat = pd.Categorical(['C', 'A', 'B', 'C'], categories=['A', 'B', 'C'], ordered=True)
      cat
      cat.unique()

      cat = pd.Categorical(['C', 'A', 'B', 'C'], categories=['A', 'B', 'C'])
      cat
      cat.unique()

- ``groupby`` using ``Categorical`` follows the same rule as ``Categorical.unique`` described above  (:issue:`10508`)
- ``NaT``'s methods now either raise ``ValueError``, or return ``np.nan`` or ``NaT`` (:issue:`9513`)

   ===============================     ===============================================================
   Behavior                            Methods
   ===============================     ===============================================================
   ``return np.nan``                   ``weekday``, ``isoweekday``
   ``return NaT``                      ``date``, ``now``, ``replace``, ``to_datetime``, ``today``
   ``return np.datetime64('NaT')``     ``to_datetime64`` (unchanged)
   ``raise ValueError``                All other public methods (names not beginning with underscores)
   ===============================     ===============================================================

.. _whatsnew_0170.deprecations:

Deprecations
^^^^^^^^^^^^

.. note:: These indexing functions have been deprecated in the documentation since 0.11.0.

- For ``Series`` the following indexing functions are deprecated (:issue:`10177`).

  =====================  =================================
  Deprecated Function    Replacement
  =====================  =================================
  ``.irow(i)``           ``.iloc[i]`` or ``.iat[i]``
  ``.iget(i)``           ``.iloc[i]``
  ``.iget_value(i)``     ``.iloc[i]`` or ``.iat[i]``
  =====================  =================================

- For ``DataFrame`` the following indexing functions are deprecated (:issue:`10177`).

  =====================  =================================
  Deprecated Function    Replacement
  =====================  =================================
  ``.irow(i)``           ``.iloc[i]``
  ``.iget_value(i, j)``  ``.iloc[i, j]`` or ``.iat[i, j]``
  ``.icol(j)``           ``.iloc[:, j]``
  =====================  =================================

- ``Categorical.name`` was deprecated to make ``Categorical`` more ``numpy.ndarray`` like. Use ``Series(cat, name="whatever")`` instead (:issue:`10482`).
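
The replacements in the tables above can be sketched as follows, using a small made-up ``DataFrame`` and ``Series``:

.. code-block:: python

   import pandas as pd

   df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]}, columns=['a', 'b'])
   s = pd.Series([10, 20, 30])

   first_row = df.iloc[0]        # instead of df.irow(0)
   second_col = df.iloc[:, 1]    # instead of df.icol(1)
   cell = df.iat[0, 1]           # instead of df.iget_value(0, 1)
   item = s.iloc[1]              # instead of s.iget(1)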

.. _whatsnew_0170.prior_deprecations:

Removal of prior version deprecations/changes
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

- Remove use of some deprecated numpy comparison operations, mainly in tests. (:issue:`10569`)


.. _whatsnew_0170.performance:

Performance Improvements
~~~~~~~~~~~~~~~~~~~~~~~~
- Added vbench benchmarks for alternative ExcelWriter engines and reading Excel files (:issue:`7171`)

- 4x improvement in ``timedelta`` string parsing (:issue:`6755`, :issue:`10426`)
- 8x improvement in ``timedelta64`` and ``datetime64`` ops (:issue:`6755`)
- Significantly improved performance of indexing ``MultiIndex`` with slicers (:issue:`10287`)
- Improved performance of ``Series.isin`` for datetimelike/integer Series (:issue:`10287`)
- 20x improvement in ``concat`` of Categoricals when categories are identical (:issue:`10587`)
- Improved performance of ``to_datetime`` when specified format string is ISO8601 (:issue:`10178`)


.. _whatsnew_0170.bug_fixes:

Bug Fixes
~~~~~~~~~

- Bug in ``DataFrame.apply`` when function returns categorical series. (:issue:`9573`)
- Bug in ``to_datetime`` with invalid dates and formats supplied (:issue:`10154`)
- Bug in ``Index.drop_duplicates`` dropping name(s) (:issue:`10115`)
- Bug in ``pd.Series`` when setting a value on an empty ``Series`` whose index has a frequency. (:issue:`10193`)
- Bug in ``DataFrame.plot`` raises ``ValueError`` when color name is specified by multiple characters (:issue:`10387`)
- Bug in ``Index`` construction with a mixed list of tuples (:issue:`10697`)
- Bug in ``DataFrame.reset_index`` when index contains ``NaT``. (:issue:`10388`)
- Bug in ``ExcelReader`` when worksheet is empty (:issue:`6403`)
- Bug in ``Table.select_column`` where name is not preserved (:issue:`10392`)
- Bug in ``offsets.generate_range`` where ``start`` and ``end`` have finer precision than ``offset`` (:issue:`9907`)
- Bug in ``pd.rolling_*`` where ``Series.name`` would be lost in the output (:issue:`10565`)


- Bug in ``DataFrame.interpolate`` with ``axis=1`` and ``inplace=True`` (:issue:`10395`)
- Bug in ``io.sql.get_schema`` when specifying multiple columns as primary
  key (:issue:`10385`).

- Bug in ``groupby(sort=False)`` with datetime-like ``Categorical`` raises ``ValueError`` (:issue:`10505`)

- Bug in ``test_categorical`` on big-endian builds (:issue:`10425`)
- Bug in ``Series.shift`` and ``DataFrame.shift`` not supporting categorical data (:issue:`9416`)
- Bug in ``Series.map`` using categorical ``Series`` raises ``AttributeError`` (:issue:`10324`)
- Bug in ``MultiIndex.get_level_values`` including ``Categorical`` raises ``AttributeError`` (:issue:`10460`)
- Bug in ``pd.get_dummies`` with ``sparse=True`` not returning ``SparseDataFrame`` (:issue:`10531`)
- Bug in ``Index`` subtypes (such as ``PeriodIndex``) not returning their own type for ``.drop`` and ``.insert`` methods (:issue:`10620`)
- Bug in ``algos.outer_join_indexer`` when ``right`` array is empty (:issue:`10618`)

- Bug in ``filter`` (regression from 0.16.0) and ``transform`` when grouping on multiple keys, one of which is datetime-like (:issue:`10114`)


- Bug that caused segfault when resampling an empty Series (:issue:`10228`)
- Bug in ``DatetimeIndex.value_counts`` and ``PeriodIndex.value_counts`` where the name was reset on the result but retained in the result's ``Index``. (:issue:`10150`)
- Bug in ``pd.eval`` using ``numexpr`` engine coerces 1 element numpy array to scalar (:issue:`10546`)
- Bug in ``pd.concat`` with ``axis=0`` when column is of dtype ``category`` (:issue:`10177`)
- Bug in ``read_msgpack`` where input type is not always checked (:issue:`10369`, :issue:`10630`)
- Bug in ``pd.read_csv`` with kwargs ``index_col=False``, ``index_col=['a', 'b']`` or ``dtype``
  (:issue:`10413`, :issue:`10467`, :issue:`10577`)
- Bug in ``Series.from_csv`` with ``header`` kwarg not setting the ``Series.name`` or the ``Series.index.name`` (:issue:`10483`)
- Bug in ``groupby.var`` which caused variance to be inaccurate for small float values (:issue:`10448`)
- Bug in ``Series.plot(kind='hist')`` Y Label not informative (:issue:`10485`)
- Bug in ``read_csv`` when using a converter which generates a ``uint8`` type (:issue:`9266`)

- Bug causes memory leak in time-series line and area plot (:issue:`9003`)


- Bug in line and kde plot cannot accept multiple colors when ``subplots=True`` (:issue:`9894`)


- Reading "famafrench" data via ``DataReader`` results in an HTTP 404 error because the website URL has changed (:issue:`10591`).
- Bug in ``read_msgpack`` where DataFrame to decode has duplicate column names (:issue:`9618`)
- Bug in ``io.common.get_filepath_or_buffer`` which caused reading of valid S3 files to fail if the bucket also contained keys for which the user does not have read permission (:issue:`10604`)
- Bug in vectorised setting of timestamp columns with python ``datetime.date`` and numpy ``datetime64`` (:issue:`10408`, :issue:`10412`)

- Bug in ``pd.DataFrame`` when constructing an empty DataFrame with a string dtype (:issue:`9428`)
- Bug in ``pd.unique`` for arrays with the ``datetime64`` or ``timedelta64`` dtype that meant an array with object dtype was returned instead of the original dtype (:issue:`9431`)