.. _whatsnew_0160:
v0.16.0 (February ??, 2015)
---------------------------
This is a major release from 0.15.2 and includes a small number of API changes, several new features,
enhancements, and performance improvements along with a large number of bug fixes. We recommend that all
users upgrade to this version.
Highlights include:
- Check the :ref:`API Changes <whatsnew_0160.api>` and :ref:`deprecations <whatsnew_0160.deprecations>` before updating
- :ref:`Other Enhancements <whatsnew_0160.enhancements>`
- :ref:`Performance Improvements <whatsnew_0160.performance>`
- :ref:`Bug Fixes <whatsnew_0160.bug_fixes>`
.. _whatsnew_0160.enhancements:

New features
~~~~~~~~~~~~
- Reindex now supports ``method='nearest'`` for frames or series with a monotonic increasing or decreasing index (:issue:`9258`):
.. ipython:: python
df = pd.DataFrame({'x': range(5)})
df.reindex([0.2, 1.8, 3.5], method='nearest')
This method is also exposed by the lower level ``Index.get_indexer`` and ``Index.get_loc`` methods.
- Paths beginning with ~ will now be expanded to begin with the user's home directory (:issue:`9066`)
- Added time interval selection in ``get_data_yahoo`` (:issue:`9071`)
- Added ``Series.str.slice_replace()``, which previously raised ``NotImplementedError`` (:issue:`8888`)
- Added ``Timestamp.to_datetime64()`` to complement ``Timedelta.to_timedelta64()`` (:issue:`9255`)
- ``tseries.frequencies.to_offset()`` now accepts ``Timedelta`` as input (:issue:`9064`)
- Added a ``lag`` parameter to the ``Series`` autocorrelation method, defaulting to the lag-1 autocorrelation (:issue:`9192`)
- ``Timedelta`` will now accept ``nanoseconds`` keyword in constructor (:issue:`9273`)
- SQL code now safely escapes table and column names (:issue:`8986`)
- Added auto-complete for ``Series.str.<tab>``, ``Series.dt.<tab>`` and ``Series.cat.<tab>`` (:issue:`9322`)
- Added ``StringMethods.isalnum()``, ``isalpha()``, ``isdigit()``, ``isspace()``, ``islower()``,
  ``isupper()``, ``istitle()`` which behave the same as the standard ``str`` methods (:issue:`9282`)
- Added ``StringMethods.find()`` and ``rfind()`` which behave the same as the standard ``str`` methods (:issue:`9386`)
- ``Index.get_indexer`` now supports ``method='pad'`` and ``method='backfill'`` for any target array, not just monotonic targets. These methods also work for monotonic decreasing as well as monotonic increasing indexes (:issue:`9258`).
- ``Index.asof`` now works on all index types (:issue:`9258`).
- Added ``StringMethods.isnumeric`` and ``isdecimal`` which behave the same as the standard ``str`` methods (:issue:`9439`)
- The ``read_excel()`` function's :ref:`sheetname <io.specifying_sheets>` argument now accepts a list and ``None``, to get multiple or all sheets respectively. If more than one sheet is specified, a dictionary is returned. (:issue:`9450`)

  .. code-block:: python

     # Returns the 1st and 4th sheet, as a dictionary of DataFrames.
     pd.read_excel('path_to_file.xls', sheetname=['Sheet1', 3])
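  Passing ``sheetname=None`` reads every sheet; a minimal sketch (the file path here is only illustrative):

  .. code-block:: python

     # All sheets are returned as a dict of DataFrames keyed by sheet name.
     all_sheets = pd.read_excel('path_to_file.xls', sheetname=None)
     list(all_sheets.keys())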
- Added a ``verbose`` argument to ``read_excel()``, defaulting to ``False``. Set to ``True`` to print sheet names as they are parsed. (:issue:`9450`)
- Added ``StringMethods.ljust()`` and ``rjust()`` which behave the same as the standard ``str`` methods (:issue:`9352`)
- ``StringMethods.pad()`` and ``center()`` now accept a ``fillchar`` option to specify the filling character (:issue:`9352`)
- Added ``StringMethods.zfill()`` which behaves the same as the standard ``str`` method (:issue:`9387`); see the short sketch after this list
- Added ``days_in_month`` (compatibility alias ``daysinmonth``) property to ``Timestamp``, ``DatetimeIndex``, ``Period``, ``PeriodIndex``, and ``Series.dt`` (:issue:`9572`)
- Added a ``decimal`` option to ``to_csv`` to specify a decimal separator other than ``'.'`` (:issue:`781`)
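As mentioned in the list above, here is a minimal sketch of a few of the new string methods and the ``days_in_month`` property (the example values are made up for illustration):

.. code-block:: python

   s = pd.Series(['a1', 'b2', None])
   s.str.isalnum()                        # True / True / NaN, like str.isalnum
   s.str.pad(5, fillchar='-')             # pad with a custom fill character
   pd.Series(['1', '22']).str.zfill(3)    # ['001', '022']
   pd.Timestamp('2015-02-14').days_in_month   # 28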
.. _whatsnew_0160.enhancements.assign:

DataFrame Assign
~~~~~~~~~~~~~~~~
Inspired by `dplyr's
<http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html#mutate>`__ ``mutate`` verb, DataFrame has a new
:meth:`~pandas.DataFrame.assign` method.
The function signature for ``assign`` is simply ``**kwargs``. The keys
are the column names for the new fields, and the values are either a value
to be inserted (for example, a ``Series`` or NumPy array), or a function
of one argument to be called on the ``DataFrame``. The new values are inserted,
and the entire DataFrame (with all original and new columns) is returned.
.. ipython :: python
iris = read_csv('data/iris.data')
iris.head()
iris.assign(sepal_ratio=iris['SepalWidth'] / iris['SepalLength']).head()
Above was an example of inserting a precomputed value. We can also pass in
a function to be evaluated.
.. ipython :: python
iris.assign(sepal_ratio = lambda x: (x['SepalWidth'] /
x['SepalLength'])).head()
The power of ``assign`` comes when used in chains of operations. For example,
we can limit the DataFrame to just those with a Sepal Length greater than 5,
calculate the ratio, and plot
.. ipython:: python
(iris.query('SepalLength > 5')
.assign(SepalRatio = lambda x: x.SepalWidth / x.SepalLength,
PetalRatio = lambda x: x.PetalWidth / x.PetalLength)
.plot(kind='scatter', x='SepalRatio', y='PetalRatio'))
.. image:: _static/whatsnew_assign.png
:scale: 50 %
See the :ref:`documentation <dsintro.chained_assignment>` for more. (:issue:`9229`)
.. _whatsnew_0160.enhancements.sparse:

Interaction with scipy.sparse
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Added :meth:`SparseSeries.to_coo` and :meth:`SparseSeries.from_coo` methods (:issue:`8048`) for converting to and from ``scipy.sparse.coo_matrix`` instances (see :ref:`here <sparse.scipysparse>`). For example, given a ``SparseSeries`` with a ``MultiIndex``, we can convert it to a ``scipy.sparse.coo_matrix`` by specifying the row and column labels as index levels:
.. ipython:: python
from numpy import nan
s = Series([3.0, nan, 1.0, 3.0, nan, nan])
s.index = MultiIndex.from_tuples([(1, 2, 'a', 0),
(1, 2, 'a', 1),
(1, 1, 'b', 0),
(1, 1, 'b', 1),
(2, 1, 'b', 0),
(2, 1, 'b', 1)],
names=['A', 'B', 'C', 'D'])
s
# SparseSeries
ss = s.to_sparse()
ss
A, rows, columns = ss.to_coo(row_levels=['A', 'B'],
column_levels=['C', 'D'],
sort_labels=False)
A
A.todense()
rows
columns
The ``from_coo`` method is a convenience method for creating a ``SparseSeries``
from a ``scipy.sparse.coo_matrix``:
.. ipython:: python
from scipy import sparse
A = sparse.coo_matrix(([3.0, 1.0, 2.0], ([1, 0, 0], [0, 2, 3])),
shape=(3, 4))
A
A.todense()
ss = SparseSeries.from_coo(A)
ss
.. _whatsnew_0160.api:
.. _whatsnew_0160.api_breaking:
Backwards incompatible API changes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. _whatsnew_0160.api_breaking.timedelta:
- In v0.15.0 a new scalar type ``Timedelta`` was introduced, that is a
sub-class of ``datetime.timedelta``.
As mentioned :ref:`here <whatsnew_0150.timedeltaindex>`, that release included a notice of an API
change w.r.t. the ``.seconds`` accessor. The intent was to provide a
user-friendly set of accessors that give the 'natural' value for that unit,
e.g. if you had a ``Timedelta('1 day, 10:11:12')``, then ``.seconds`` would
return 12. However, this is at odds with the definition of
``datetime.timedelta``, which defines ``.seconds`` as
``10 * 3600 + 11 * 60 + 12 == 36672``.
So in v0.16.0, we are restoring the API to match that of
``datetime.timedelta``. The component values remain available
through the ``.components`` accessor. This affects the ``.seconds`` and
``.microseconds`` accessors, and removes the ``.hours``, ``.minutes``, and
``.milliseconds`` accessors. These changes affect ``TimedeltaIndex``
and the Series ``.dt`` accessor as well (see the short ``.dt`` sketch after the examples below). (:issue:`9185`, :issue:`9139`)
Previous Behavior
.. code-block:: python
In [2]: t = pd.Timedelta('1 day, 10:11:12.100123')
In [3]: t.days
Out[3]: 1
In [4]: t.seconds
Out[4]: 12
In [5]: t.microseconds
Out[5]: 123
New Behavior
.. ipython:: python
t = pd.Timedelta('1 day, 10:11:12.100123')
t.days
t.seconds
t.microseconds
Using ``.components`` allows the full component access
.. ipython:: python
t.components
t.components.seconds
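As noted above, the same change applies to the Series ``.dt`` accessor; a minimal sketch (the timedelta value is made up for illustration):

.. code-block:: python

   s = pd.Series(pd.to_timedelta(['1 day 10:11:12']))
   s.dt.seconds              # 36672, matching datetime.timedelta
   s.dt.components.seconds   # the 'natural' seconds component, 12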
- ``Index.duplicated`` now returns ``np.array(dtype=bool)`` rather than ``Index(dtype=object)`` containing ``bool`` values. (:issue:`8875`)
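A minimal sketch of the new return type (illustrative values):

.. code-block:: python

   idx = pd.Index(['a', 'b', 'a'])
   idx.duplicated()          # array([False, False,  True], dtype=bool)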
- ``DataFrame.to_json`` now returns accurate type serialisation for each column for frames of mixed dtype (:issue:`9037`)
Previously data was coerced to a common dtype before serialisation, which for
example resulted in integers being serialised to floats:
.. code-block:: python
In [2]: pd.DataFrame({'i': [1,2], 'f': [3.0, 4.2]}).to_json()
Out[2]: '{"f":{"0":3.0,"1":4.2},"i":{"0":1.0,"1":2.0}}'
Now each column is serialised using its correct dtype:
.. code-block:: python
In [2]: pd.DataFrame({'i': [1,2], 'f': [3.0, 4.2]}).to_json()
Out[2]: '{"f":{"0":3.0,"1":4.2},"i":{"0":1,"1":2}}'
- The ``.summary`` method of ``DatetimeIndex``, ``PeriodIndex`` and ``TimedeltaIndex`` now outputs the same format. (:issue:`9116`)
- ``TimedeltaIndex.freqstr`` now outputs the same string format as ``DatetimeIndex``. (:issue:`9116`)
- Bar and horizontal bar plots no longer add a dashed line along the info axis. The prior style can be achieved with matplotlib's ``axhline`` or ``axvline`` methods (:issue:`9088`).
- ``Series`` now supports bitwise operations for integral types (:issue:`9016`)
  Previously, even if the input dtypes were integral, the output dtype was coerced to ``bool``.
.. code-block:: python
In [2]: pd.Series([0,1,2,3], list('abcd')) | pd.Series([4,4,4,4], list('abcd'))
Out[2]:
a True
b True
c True
d True
dtype: bool
Now if the input dtypes are integral, the output dtype is also integral and the output
values are the result of the bitwise operation.
.. code-block:: python
In [2]: pd.Series([0,1,2,3], list('abcd')) | pd.Series([4,4,4,4], list('abcd'))
Out[2]:
a 4
b 5
c 6
d 7
dtype: int64
- During division involving a ``Series`` or ``DataFrame``, ``0/0`` and ``0//0`` now give ``np.nan`` instead of ``np.inf``. (:issue:`9144`, :issue:`8445`)
Previous Behavior
.. code-block:: python
In [2]: p = pd.Series([0, 1])
In [3]: p / 0
Out[3]:
0 inf
1 inf
dtype: float64
In [4]: p // 0
Out[4]:
0 inf
1 inf
dtype: float64
New Behavior
.. ipython:: python
p = pd.Series([0, 1])
p / 0
p // 0
.. _whatsnew_0160.api_breaking.indexing:

Indexing Changes
~~~~~~~~~~~~~~~~

The behavior of a small subset of edge cases for using ``.loc`` has changed (:issue:`8613`). Furthermore we have improved the content of the error messages that are raised:
- slicing with ``.loc`` where the start and/or stop bound is not found in the index is now allowed; this previously would raise a ``KeyError``. This makes the behavior the same as ``.ix`` in this case. This change is only for slicing, not when indexing with a single label.
.. ipython:: python
df = DataFrame(np.random.randn(5,4), columns=list('ABCD'), index=date_range('20130101',periods=5))
df
s = Series(range(5),[-2,-1,1,2,3])
s
Previous Behavior
.. code-block:: python
In [4]: df.loc['2013-01-02':'2013-01-10']
KeyError: 'stop bound [2013-01-10] is not in the [index]'
In [6]: s.loc[-10:3]
KeyError: 'start bound [-10] is not the [index]'
New Behavior
.. ipython:: python
df.loc['2013-01-02':'2013-01-10']
s.loc[-10:3]
- Allow slicing with float-like values on an integer index for ``.ix``. Previously this was only enabled for ``.loc``:

  Previous Behavior

  .. code-block:: python

     In [8]: s.ix[-1.0:2]
     TypeError: the slice start value [-1.0] is not a proper indexer for this index type (Int64Index)

  New Behavior

  .. code-block:: python

     In [8]: s.ix[-1.0:2]
     Out[8]:
     -1    1
      1    2
      2    3
     dtype: int64
- Provide a useful exception for indexing with an invalid type for that index when using ``.loc``, for example trying to slice an index of type ``DatetimeIndex``, ``PeriodIndex`` or ``TimedeltaIndex`` with an integer (or a float).
Previous Behavior
.. code-block:: python
In [4]: df.loc[2:3]
KeyError: 'start bound [2] is not the [index]'
New Behavior
.. code-block:: python
In [4]: df.loc[2:3]
TypeError: Cannot do slice indexing on <class 'pandas.tseries.index.DatetimeIndex'> with <type 'int'> keys
.. _whatsnew_0160.api_breaking.construction:

DataFrame Construction Changes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. _whatsnew_0160.deprecations:
Deprecations
~~~~~~~~~~~~
- The ``rplot`` trellis plotting interface is deprecated and will be removed
in a future version. We refer to external packages like
`seaborn <http://stanford.edu/~mwaskom/software/seaborn/>`_ for similar
but more refined functionality (:issue:`3445`).
The documentation includes some examples of how to convert your existing code
from ``rplot`` to seaborn: see the :ref:`rplot docs <rplot>`.
.. _whatsnew_0160.prior_deprecations:
Removal of prior version deprecations/changes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- The ``rows`` and ``cols`` keyword arguments of ``DataFrame.pivot_table`` and ``crosstab`` were removed in favor
  of ``index`` and ``columns`` (:issue:`6581`); see the sketch after this list
- The ``cols`` keyword argument of ``DataFrame.to_excel`` and ``DataFrame.to_csv`` was removed in favor of ``columns`` (:issue:`6581`)
- Removed ``convert_dummies`` in favor of ``get_dummies`` (:issue:`6581`)
- Removed ``value_range`` in favor of ``describe`` (:issue:`6581`)
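As referenced in the list above, a minimal sketch of the replacement ``index``/``columns`` keywords (the data here is made up for illustration):

.. code-block:: python

   df = pd.DataFrame({'a': ['x', 'x', 'y'], 'b': ['p', 'q', 'p'], 'v': [1, 2, 3]})
   # use index=/columns= instead of the removed rows=/cols=
   pd.pivot_table(df, values='v', index='a', columns='b')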
.. _whatsnew_0160.performance:

Performance
~~~~~~~~~~~

- Fixed a performance regression for ``.loc`` indexing with an array or list-like (:issue:`9126`).
- 30x performance improvement in ``DataFrame.to_json`` for mixed-dtype frames (:issue:`9037`)
- Performance improvements in ``MultiIndex.duplicated`` by working with labels instead of values (:issue:`9125`)
- Improved the speed of ``nunique`` by calling ``unique`` instead of ``value_counts`` (:issue:`9129`, :issue:`7771`)
- Performance improvement of up to 10x in ``DataFrame.count`` and ``DataFrame.dropna`` by taking advantage of homogeneous/heterogeneous dtypes appropriately (:issue:`9136`)
- Performance improvement of up to 20x in ``DataFrame.count`` when using a ``MultiIndex`` and the ``level`` keyword argument (:issue:`9163`)
- Performance and memory usage improvements in ``merge`` when key space exceeds ``int64`` bounds (:issue:`9151`)
- Performance improvements in multi-key ``groupby`` (:issue:`9429`)
- Performance improvements in ``MultiIndex.sortlevel`` (:issue:`9445`)
- Performance and memory usage improvements in ``DataFrame.duplicated`` (:issue:`9398`)
- Cythonized ``Period`` (:issue:`9440`)
.. _whatsnew_0160.bug_fixes:

Bug Fixes
~~~~~~~~~
- Changed ``.to_html`` to remove leading/trailing spaces in table body (:issue:`4987`)
- Fixed issue using ``read_csv`` on s3 with Python 3.
- Fixed compatibility issue in ``DatetimeIndex`` affecting architectures where ``numpy.int_`` defaults to ``numpy.int32`` (:issue:`8943`)
- Bug in Panel indexing with an object-like (:issue:`9140`)
- Bug where the index of the returned ``Series.dt.components`` was reset to the default index (:issue:`9247`)
- Bug in ``Categorical.__getitem__/__setitem__`` with listlike input getting incorrect results from indexer coercion (:issue:`9469`)
- Bug in partial setting with a DatetimeIndex (:issue:`9478`)
- Bug in groupby for integer and datetime64 columns when applying an aggregator that caused the values to be
  changed when the numbers were sufficiently large (:issue:`9311`, :issue:`6620`)
- Fixed bug in ``to_sql`` when mapping a ``Timestamp`` object column (datetime
  column with timezone info) to the appropriate SQLAlchemy type (:issue:`9085`).
- Fixed bug in ``to_sql`` ``dtype`` argument not accepting an instantiated
SQLAlchemy type (:issue:`9083`).
- Bug in ``.loc`` partial setting with a ``np.datetime64`` (:issue:`9516`)
- Bug where incorrect dtypes were inferred on datetime-like-looking ``Series`` and on ``.xs`` slices (:issue:`9477`)
- Items in ``Categorical.unique()`` (and ``s.unique()`` if ``s`` is of dtype ``category``) now appear in the order in which they are originally found, not in sorted order (:issue:`9331`). This is now consistent with the behavior for other dtypes in pandas.
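A minimal sketch of the appearance-order behavior described above (illustrative values):

.. code-block:: python

   s = pd.Series(list('babc'), dtype='category')
   s.unique()   # 'b', 'a', 'c' -- order of appearance, no longer sorted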
- Fixed bug on big endian platforms which produced incorrect results in ``StataReader`` (:issue:`8688`).
- Bug in ``MultiIndex.has_duplicates`` where a ``MultiIndex`` with many levels caused an indexer overflow (:issue:`9075`, :issue:`5873`)
- Bug in ``pivot`` and ``unstack`` where ``nan`` values would break index alignment (:issue:`4862`, :issue:`7401`, :issue:`7403`, :issue:`7405`, :issue:`7466`, :issue:`9497`)
- Bug in left ``join`` on multi-index with ``sort=True`` or null values (:issue:`9210`).
- Bug in ``MultiIndex`` where inserting new keys would fail (:issue:`9250`).
- Bug in ``groupby`` when key space exceeds ``int64`` bounds (:issue:`9096`).
- Bug in ``unstack`` with ``TimedeltaIndex`` or ``DatetimeIndex`` and nulls (:issue:`9491`).
- Bug in ``rank`` where comparing floats with tolerance caused inconsistent behaviour (:issue:`8365`).
- Fixed character encoding bug in ``read_stata`` and ``StataReader`` when loading data from a URL (:issue:`9231`).
- Looking up a partial string label with ``DatetimeIndex.asof`` now includes values that match the string, even if they are after the start of the partial string label (:issue:`9258`). Old behavior:
.. ipython:: python
:verbatim:
In [4]: pd.to_datetime(['2000-01-31', '2000-02-28']).asof('2000-02')
Out[4]: Timestamp('2000-01-31 00:00:00')
Fixed behavior:
.. ipython:: python
pd.to_datetime(['2000-01-31', '2000-02-28']).asof('2000-02')
To reproduce the old behavior, simply add more precision to the label (e.g., use ``2000-02-01`` instead of ``2000-02``).
- Bug where adding ``offsets.Nano`` to other offsets raised ``TypeError`` (:issue:`9284`)
- Bug in ``DatetimeIndex`` iteration, related to (:issue:`8890`), fixed in (:issue:`9100`)
- Bug in binary operator method (e.g. ``.mul()``) alignment with integer levels (:issue:`9463`).
- Bug where boxplot, scatter and hexbin plots may show an unnecessary warning (:issue:`8877`)
- Bug where subplots with the ``layout`` keyword may show an unnecessary warning (:issue:`9464`)
- Bug in using grouper functions that need passed-through arguments (e.g. ``axis``) when using a wrapped function (e.g. ``fillna``) (:issue:`9221`)
- ``DataFrame`` now properly supports simultaneous ``copy`` and ``dtype`` arguments in constructor (:issue:`9099`)
- Bug in ``read_csv`` when using ``skiprows`` on a file with CR line endings with the C engine. (:issue:`9079`)
- ``isnull`` now detects ``NaT`` in ``PeriodIndex`` (:issue:`9129`)
- Bug in groupby ``.nth()`` with a multiple column groupby (:issue:`8979`)
- Bug in ``DataFrame.where`` and ``Series.where`` incorrectly coercing numerics to string (:issue:`9280`)
- Bug in ``DataFrame.where`` and ``Series.where`` raising ``ValueError`` when a string list-like is passed. (:issue:`9280`)
- Accessing ``Series.str`` methods on non-string values now raises ``TypeError`` instead of producing incorrect results (:issue:`9184`)
- Bug in ``DatetimeIndex.__contains__`` when index has duplicates and is not monotonic increasing (:issue:`9512`)
- Fixed division by zero error for ``Series.kurt()`` when all values are equal (:issue:`9197`)
- Fixed issue in the ``xlsxwriter`` engine where it added a default 'General' format to cells if no other format was applied. This prevented other row or column formatting from being applied. (:issue:`9167`)
- Fixed issue with ``index_col=False`` when ``usecols`` is also specified in ``read_csv``. (:issue:`9082`)
- Bug where ``wide_to_long`` would modify the input stubnames list (:issue:`9204`)
- Bug in ``to_sql`` not storing float64 values using double precision. (:issue:`9009`)
- ``SparseSeries`` and ``SparsePanel`` now accept zero argument constructors (same as their non-sparse counterparts) (:issue:`9272`).
- Regression in merging Categoricals and object dtypes (:issue:`9426`)
- Bug in ``read_csv`` with buffer overflows with certain malformed input files (:issue:`9205`)
- Bug in groupby MultiIndex with missing pair (:issue:`9049`, :issue:`9344`)
- Fixed bug in ``Series.groupby`` where grouping on ``MultiIndex`` levels would ignore the sort argument (:issue:`9444`)
- Fixed bug in ``DataFrame.groupby`` where ``sort=False`` was ignored for ``Categorical`` columns. (:issue:`8868`)
- Fixed bug with reading CSV files from Amazon S3 on Python 3 raising a ``TypeError`` (:issue:`9452`)
- Bug in the Google BigQuery reader where the 'jobComplete' key may be present but False in the query results (:issue:`8728`)
- Fixed bug with the ``DataFrame`` constructor when passed a ``Series`` with a name and the ``columns`` keyword argument.