.. _whatsnew_0170:
v0.17.0 (September ??, 2015)
----------------------------
This is a major release from 0.16.2 and includes a small number of API changes, several new features,
enhancements, and performance improvements along with a large number of bug fixes. We recommend that all
users upgrade to this version.
.. warning::
pandas >= 0.17.0 will no longer support compatibility with Python version 3.2 (:issue:`9118`)
.. warning::
The ``pandas.io.data`` package is deprecated and will be replaced by the
`pandas-datareader package <https://github.com/pydata/pandas-datareader>`_.
This will allow the data modules to be updated independently of your pandas
installation. The API for ``pandas-datareader v0.1.1`` is exactly the same
as in ``pandas v0.17.0`` (:issue:`8961`, :issue:`10861`).
After installing pandas-datareader, you can easily change your imports:
.. code-block:: python
from pandas.io import data, wb
becomes
.. code-block:: python
from pandas_datareader import data, wb
Highlights include:
- Release the Global Interpreter Lock (GIL) on some cython operations, see :ref:`here <whatsnew_0170.gil>`
- Plotting methods are now available as attributes of the ``.plot`` accessor, see :ref:`here <whatsnew_0170.plot>`
- The sorting API has been revamped to remove some long-time inconsistencies, see :ref:`here <whatsnew_0170.api_breaking.sorting>`
- Support for a ``datetime64[ns]`` with timezones as a first-class dtype, see :ref:`here <whatsnew_0170.tz>`
- The default for ``to_datetime`` will now be to ``raise`` when presented with unparseable formats;
previously this would return the original input. Also, date parse
functions now return consistent results. See :ref:`here <whatsnew_0170.api_breaking.to_datetime>`
- The default for ``dropna`` in ``HDFStore`` has changed to ``False``, to store by default all rows even
if they are all ``NaN``, see :ref:`here <whatsnew_0170.api_breaking.hdf_dropna>`
- Datetime accessor (``dt``) now supports ``Series.dt.strftime`` to generate formatted strings for datetime-likes, and ``Series.dt.total_seconds`` to generate each duration of the timedelta in seconds. See :ref:`here <whatsnew_0170.strftime>`
- ``Period`` and ``PeriodIndex`` can handle multiplied freq like ``3D``, which corresponds to a 3-day span. See :ref:`here <whatsnew_0170.periodfreq>`
- Development installed versions of pandas will now have ``PEP440`` compliant version strings (:issue:`9518`)
- Development support for benchmarking with the `Air Speed Velocity library <https://github.com/spacetelescope/asv/>`_ (:issue:`8316`)
- Support for reading SAS xport files, see :ref:`here <whatsnew_0170.enhancements.sas_xport>`
- Documentation comparing SAS to *pandas*, see :ref:`here <compare_with_sas>`
- Removal of the automatic TimeSeries broadcasting, deprecated since 0.8.0, see :ref:`here <whatsnew_0170.prior_deprecations>`
- Compatibility with Python 3.5 (:issue:`11097`)
Check the :ref:`API Changes <whatsnew_0170.api>` and :ref:`deprecations <whatsnew_0170.deprecations>` before updating.
.. contents:: What's new in v0.17.0
:local:
:backlinks: none
.. _whatsnew_0170.enhancements:
New features
~~~~~~~~~~~~
.. _whatsnew_0170.tz:
Datetime with TZ
^^^^^^^^^^^^^^^^
We are adding an implementation that natively supports datetime with timezones. A ``Series`` or a ``DataFrame`` column previously
*could* be assigned a datetime with timezones, and would work as an ``object`` dtype. This had performance issues with a large
number of rows. See the :ref:`docs <timeseries.timezone_series>` for more details. (:issue:`8260`, :issue:`10763`, :issue:`11034`).
The new implementation allows for having a single-timezone across all rows, with operations in a performant manner.
.. ipython:: python
df = DataFrame({'A' : date_range('20130101',periods=3),
'B' : date_range('20130101',periods=3,tz='US/Eastern'),
'C' : date_range('20130101',periods=3,tz='CET')})
df
df.dtypes
.. ipython:: python
df.B
df.B.dt.tz_localize(None)
This uses a new-dtype representation as well, that is very similar in look-and-feel to its numpy cousin ``datetime64[ns]``
.. ipython:: python
df['B'].dtype
type(df['B'].dtype)
.. note::
There is a slightly different string repr for the underlying ``DatetimeIndex`` as a result of the dtype changes, but
functionally these are the same.
Previous Behavior:
.. code-block:: python
In [1]: pd.date_range('20130101',periods=3,tz='US/Eastern')
Out[1]: DatetimeIndex(['2013-01-01 00:00:00-05:00', '2013-01-02 00:00:00-05:00',
'2013-01-03 00:00:00-05:00'],
dtype='datetime64[ns]', freq='D', tz='US/Eastern')
In [2]: pd.date_range('20130101',periods=3,tz='US/Eastern').dtype
Out[2]: dtype('<M8[ns]')
New Behavior:
.. ipython:: python
pd.date_range('20130101',periods=3,tz='US/Eastern')
pd.date_range('20130101',periods=3,tz='US/Eastern').dtype
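The tz-aware dtype described in this section can be checked with a short sketch. It assumes a pandas version that exposes ``pd.DatetimeTZDtype`` at the top level (an assumption, since the import location has moved between releases); the variable names are illustrative only.

```python
import pandas as pd

# A tz-aware column keeps a dedicated dtype instead of falling back to object
s = pd.Series(pd.date_range('20130101', periods=3, tz='US/Eastern'))

# Assumption: DatetimeTZDtype is importable from the top-level namespace
is_tz_aware = isinstance(s.dtype, pd.DatetimeTZDtype)
tz_name = str(s.dt.tz)

# Convert to another timezone, or drop the timezone to get naive local times
as_utc = s.dt.tz_convert('UTC')
naive = s.dt.tz_localize(None)
```

Because the timezone lives in the dtype, conversions such as ``tz_convert`` operate on the whole column at once rather than element by element.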
.. _whatsnew_0170.gil:
Releasing the GIL
^^^^^^^^^^^^^^^^^
We are releasing the global-interpreter-lock (GIL) on some cython operations.
This will allow other threads to run simultaneously during computation, potentially allowing performance improvements
from multi-threading. Notably ``groupby``, ``nsmallest`` and some indexing operations benefit from this. (:issue:`8882`)
For example the groupby expression in the following code will have the GIL released during the factorization step, e.g. ``df.groupby('key')``
as well as the ``.sum()`` operation.
.. code-block:: python
N = 1000000
ngroups = 10
df = DataFrame({'key' : np.random.randint(0,ngroups,size=N),
'data' : np.random.randn(N) })
df.groupby('key')['data'].sum()
Releasing the GIL could benefit an application that uses threads for user interactions (e.g. QT_), or that performs multi-threaded computations. A nice example of a library that can handle these types of computation-in-parallel is the dask_ library.
.. _dask: https://dask.readthedocs.org/en/latest/
.. _QT: https://wiki.python.org/moin/PyQt
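As a rough illustration of the kind of multi-threaded use this enables, the sketch below runs the same groupby on several slices of a frame in a thread pool and combines the partial sums. The helper ``group_sum`` and the four-way split are hypothetical choices for this example; whether any speedup is observed depends on the data size, the number of cores, and the pandas version.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import pandas as pd

N = 200000
ngroups = 10
rng = np.random.RandomState(0)
df = pd.DataFrame({'key': rng.randint(0, ngroups, size=N),
                   'data': rng.randn(N)})


def group_sum(frame):
    # Hypothetical helper: factorize on 'key' and sum 'data' per group
    return frame.groupby('key')['data'].sum()


# Split the frame into four interleaved slices and process them in threads
chunks = [df.iloc[i::4] for i in range(4)]
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(group_sum, chunks))

# Combine the per-slice sums into one result per key
combined = pd.concat(partials).groupby(level=0).sum()
expected = group_sum(df)
```

The combined result matches the single-threaded computation up to floating point rounding, since the partial sums are added in a different order.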
.. _whatsnew_0170.plot:
Plot submethods
^^^^^^^^^^^^^^^
The Series and DataFrame ``.plot()`` method allows for customizing :ref:`plot types<visualization.other>` by supplying the ``kind`` keyword argument. Unfortunately, many of these kinds of plots use different required and optional keyword arguments, which makes it difficult to discover what any given plot kind uses out of the dozens of possible arguments.
To alleviate this issue, we have added a new, optional plotting interface, which exposes each kind of plot as a method of the ``.plot`` attribute. Instead of writing ``series.plot(kind=<kind>, ...)``, you can now also use ``series.plot.<kind>(...)``:
.. ipython::
:verbatim:
In [13]: df = pd.DataFrame(np.random.rand(10, 2), columns=['a', 'b'])
In [14]: df.plot.bar()
.. image:: _static/whatsnew_plot_submethods.png
As a result of this change, these methods are now all discoverable via tab-completion:
.. ipython::
:verbatim:
In [15]: df.plot.<TAB>
df.plot.area df.plot.barh df.plot.density df.plot.hist df.plot.line df.plot.scatter
df.plot.bar df.plot.box df.plot.hexbin df.plot.kde df.plot.pie
Each method signature only includes relevant arguments. Currently, these are limited to required arguments, but in the future these will include optional arguments, as well. For an overview, see the new :ref:`api.dataframe.plotting` API documentation.
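Outside an interactive shell, the same discovery can be done programmatically. The sketch below only inspects the ``.plot`` accessor and does not draw anything; it assumes that the public attributes of the accessor are the plot kinds, which holds for the versions we are aware of but is not a documented guarantee.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(10, 2), columns=['a', 'b'])

# List the public names exposed on the accessor (no figure is created here)
plot_kinds = sorted(name for name in dir(df.plot)
                    if not name.startswith('_'))
```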
.. _whatsnew_0170.strftime:
Additional methods for ``dt`` accessor
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
strftime
""""""""
We are now supporting a ``Series.dt.strftime`` method for datetime-likes to generate a formatted string (:issue:`10110`). Examples:
.. ipython:: python
# DatetimeIndex
s = pd.Series(pd.date_range('20130101', periods=4))
s
s.dt.strftime('%Y/%m/%d')
.. ipython:: python
# PeriodIndex
s = pd.Series(pd.period_range('20130101', periods=4))
s
s.dt.strftime('%Y/%m/%d')
The string format follows the Python standard library; details can be found `here <https://docs.python.org/2/library/datetime.html#strftime-and-strptime-behavior>`_
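A minimal sketch of the formatting call with a different format string, shown for both datetime and period data. The format strings are arbitrary examples.

```python
import pandas as pd

# Datetime values formatted with date and time components
stamps = pd.Series(pd.date_range('20130101', periods=4))
stamp_labels = stamps.dt.strftime('%Y-%m-%d %H:%M')

# Period values accept the same directives
periods = pd.Series(pd.period_range('20130101', periods=4, freq='D'))
period_labels = periods.dt.strftime('%Y/%m/%d')
```

In both cases the result is a ``Series`` of strings with the same index as the input.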
total_seconds
"""""""""""""
A ``pd.Series`` of type ``timedelta64`` has a new method ``.dt.total_seconds()`` that returns the duration of each timedelta in seconds (:issue:`10817`)
.. ipython:: python
# TimedeltaIndex
s = pd.Series(pd.timedelta_range('1 minutes', periods=4))
s
s.dt.total_seconds()
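For reference, the values in this example work out as follows: the range starts at one minute and steps by one day, so each element is 86400 seconds larger than the previous one. The ``freq='D'`` argument is passed explicitly here to make that step visible.

```python
import pandas as pd

s = pd.Series(pd.timedelta_range('1 minutes', periods=4, freq='D'))

# 60 seconds for the first element, then 86400 more for each following day
seconds = s.dt.total_seconds()
```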
.. _whatsnew_0170.periodfreq:
Period Frequency Enhancement
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
``Period``, ``PeriodIndex`` and ``period_range`` can now accept multiplied freq. Also, ``Period.freq`` and ``PeriodIndex.freq`` are now stored as a ``DateOffset`` instance like ``DatetimeIndex``, not as ``str`` (:issue:`7811`)
A multiplied freq represents a span of the corresponding length. The example below creates a period of 3 days. Addition and subtraction will shift the period by its span.
.. ipython:: python
p = pd.Period('2015-08-01', freq='3D')
p
p + 1
p - 2
p.to_timestamp()
p.to_timestamp(how='E')
You can use multiplied freq in ``PeriodIndex`` and ``period_range``.
.. ipython:: python
idx = pd.period_range('2015-08-01', periods=4, freq='2D')
idx
idx + 1
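The shifting arithmetic can be spelled out with start times. This is a sketch of the behaviour described above; it deliberately avoids the end-of-period timestamp, whose exact value has differed between pandas versions.

```python
import pandas as pd

p = pd.Period('2015-08-01', freq='3D')

# Adding 1 moves the period forward by one full span (3 days)
next_period = p + 1
# Subtracting 2 moves it back by two spans (6 days)
earlier_period = p - 2

# The same rule applies element-wise to a PeriodIndex
idx = pd.period_range('2015-08-01', periods=4, freq='2D')
shifted = idx + 1
```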
.. _whatsnew_0170.enhancements.sas_xport:
Support for SAS XPORT files
^^^^^^^^^^^^^^^^^^^^^^^^^^^
:meth:`~pandas.io.read_sas` provides support for reading *SAS XPORT* format files. (:issue:`4052`).
.. code-block:: python
df = pd.read_sas('sas_xport.xpt')
It is also possible to obtain an iterator and read an XPORT file
incrementally.
.. code-block:: python
for df in pd.read_sas('sas_xport.xpt', chunksize=10000):
    do_something(df)
See the :ref:`docs <io.sas>` for more details.
.. _whatsnew_0170.matheval:
Support for Math Functions in .eval()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
:meth:`~pandas.eval` now supports calling math functions (:issue:`4893`)
.. code-block:: python
df = pd.DataFrame({'a': np.random.randn(10)})
df.eval("b = sin(a)")
The supported math functions are `sin`, `cos`, `exp`, `log`, `expm1`, `log1p`,
`sqrt`, `sinh`, `cosh`, `tanh`, `arcsin`, `arccos`, `arctan`, `arccosh`,
`arcsinh`, `arctanh`, `abs` and `arctan2`.
These functions map to the intrinsics for the NumExpr engine. For the Python
engine, they are mapped to NumPy calls.
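A small sketch that evaluates an expression without assignment and compares it against the equivalent NumPy calls. The column values and the expression are arbitrary; the engine used depends on whether NumExpr is installed.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.linspace(0.0, 1.0, 5)})

# Math functions can be combined inside a single expression
result = df.eval("sin(a) + sqrt(a)")

# The same computation written directly with NumPy
expected = np.sin(df['a']) + np.sqrt(df['a'])
```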
Changes to Excel with ``MultiIndex``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
In version 0.16.2 a ``DataFrame`` with ``MultiIndex`` columns could not be written to Excel via ``to_excel``.
That functionality has been added (:issue:`10564`), along with updating ``read_excel`` so that the data can
be read back with no loss of information by specifying which columns/rows make up the ``MultiIndex``
in the ``header`` and ``index_col`` parameters (:issue:`4679`)
See the :ref:`documentation <io.excel>` for more details.
.. ipython:: python
df = pd.DataFrame([[1,2,3,4], [5,6,7,8]],
columns = pd.MultiIndex.from_product([['foo','bar'],['a','b']],
names = ['col1', 'col2']),
index = pd.MultiIndex.from_product([['j'], ['l', 'k']],
names = ['i1', 'i2']))
df
df.to_excel('test.xlsx')
df = pd.read_excel('test.xlsx', header=[0,1], index_col=[0,1])
df
.. ipython:: python
:suppress:
import os
os.remove('test.xlsx')
Previously, it was necessary to specify the ``has_index_names`` argument in ``read_excel``
if the serialized data had index names. For version 0.17 the output format of ``to_excel``
has been changed to make this keyword unnecessary - the change is shown below.
**Old**
.. image:: _static/old-excel-index.png
**New**
.. image:: _static/new-excel-index.png
.. warning::
Excel files saved in version 0.16.2 or prior that had index names can still be read in,
but the ``has_index_names`` argument must be set to ``True``.
.. _whatsnew_0170.gbq:
Google BigQuery Enhancements
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- Added ability to automatically create a table using the :func:`pandas.io.gbq.to_gbq` function if destination table does not exist. (:issue:`8325`).
- Added ability to replace an existing table and schema when calling the :func:`pandas.io.gbq.to_gbq` function via the ``if_exists`` argument. See the :ref:`docs <io.bigquery>` for more details (:issue:`8325`).
- Added the following functions to the gbq module: :func:`pandas.io.gbq.table_exists`, :func:`pandas.io.gbq.create_table`, and :func:`pandas.io.gbq.delete_table`. See the :ref:`docs <io.bigquery>` for more details (:issue:`8325`).
- ``InvalidColumnOrder`` and ``InvalidPageToken`` in the gbq module will raise ``ValueError`` instead of ``IOError``.
.. _whatsnew_0170.enhancements.other:
Other enhancements
^^^^^^^^^^^^^^^^^^
- ``merge`` now accepts the argument ``indicator`` which adds a Categorical-type column (by default called ``_merge``) to the output object that takes on the following values (:issue:`8790`):
=================================== ================
Observation Origin                  ``_merge`` value
=================================== ================
Merge key only in ``'left'`` frame  ``left_only``
Merge key only in ``'right'`` frame ``right_only``
Merge key in both frames            ``both``
=================================== ================
.. ipython:: python
df1 = pd.DataFrame({'col1':[0,1], 'col_left':['a','b']})
df2 = pd.DataFrame({'col1':[1,2,2],'col_right':[2,2,2]})
pd.merge(df1, df2, on='col1', how='outer', indicator=True)
For more, see the :ref:`updated docs <merging.indicator>`
- ``pd.merge`` will now allow duplicate column names if they are not merged upon (:issue:`10639`).
- ``pd.pivot`` will now allow passing index as ``None`` (:issue:`3962`).
- ``concat`` will now use existing Series names if provided (:issue:`10698`).
.. ipython:: python
foo = pd.Series([1,2], name='foo')
bar = pd.Series([1,2])
baz = pd.Series([4,5])
Previous Behavior:
.. code-block:: python
In [1]: pd.concat([foo, bar, baz], 1)
Out[1]:
0 1 2
0 1 1 4
1 2 2 5
New Behavior:
.. ipython:: python
pd.concat([foo, bar, baz], 1)
- ``DataFrame`` has gained the ``nlargest`` and ``nsmallest`` methods (:issue:`10393`)
- Add a ``limit_direction`` keyword argument that works with ``limit`` to enable ``interpolate`` to fill ``NaN`` values forward, backward, or both (:issue:`9218`, :issue:`10420`, :issue:`11115`)
.. ipython:: python
ser = pd.Series([np.nan, np.nan, 5, np.nan, np.nan, np.nan, 13])
ser.interpolate(limit=1, limit_direction='both')
- Round DataFrame to variable number of decimal places (:issue:`10568`).
.. ipython :: python
df = pd.DataFrame(np.random.random([3, 3]), columns=['A', 'B', 'C'],
index=['first', 'second', 'third'])
df
df.round(2)
df.round({'A': 0, 'C': 2})
- ``drop_duplicates`` and ``duplicated`` now accept ``keep`` keyword to target first, last, and all duplicates. ``take_last`` keyword is deprecated, see :ref:`deprecations <whatsnew_0170.deprecations>` (:issue:`6511`, :issue:`8505`)
.. ipython :: python
s = pd.Series(['A', 'B', 'C', 'A', 'B', 'D'])
s.drop_duplicates()
s.drop_duplicates(keep='last')
s.drop_duplicates(keep=False)
- Reindex now has a ``tolerance`` argument that allows for finer control of :ref:`basics.limits_on_reindex_fill` (:issue:`10411`):
.. ipython:: python
df = pd.DataFrame({'x': range(5),
't': pd.date_range('2000-01-01', periods=5)})
df.reindex([0.1, 1.9, 3.5],
method='nearest',
tolerance=0.2)
When used on a ``DatetimeIndex``, ``TimedeltaIndex`` or ``PeriodIndex``, ``tolerance`` will be coerced into a ``Timedelta`` if possible. This allows you to specify tolerance with a string:
.. ipython:: python
df = df.set_index('t')
df.reindex(pd.to_datetime(['1999-12-31']),
method='nearest',
tolerance='1 day')
``tolerance`` is also exposed by the lower level ``Index.get_indexer`` and ``Index.get_loc`` methods.
- Added functionality to use the ``base`` argument when resampling a ``TimedeltaIndex`` (:issue:`10530`)
- ``DatetimeIndex`` can be instantiated using strings containing ``NaT`` (:issue:`7599`)
- ``to_datetime`` can now accept ``yearfirst`` keyword (:issue:`7599`)
- ``pandas.tseries.offsets`` larger than the ``Day`` offset can now be used with ``Series`` for addition/subtraction (:issue:`10699`). See the :ref:`Documentation <timeseries.offsetseries>` for more details.
- ``pd.Timedelta.total_seconds()`` now returns Timedelta duration to ns precision (previously microsecond precision) (:issue:`10939`)
- ``PeriodIndex`` now supports arithmetic with ``np.ndarray`` (:issue:`10638`)
- Support pickling of ``Period`` objects (:issue:`10439`)
- ``.as_blocks`` will now take a ``copy`` optional argument to return a copy of the data, default is to copy (no change in behavior from prior versions), (:issue:`9607`)
- ``regex`` argument to ``DataFrame.filter`` now handles numeric column names instead of raising ``ValueError`` (:issue:`10384`).
- Enable reading gzip compressed files via URL, either by explicitly setting the compression parameter or by inferring from the presence of the HTTP Content-Encoding header in the response (:issue:`8685`)
- Enable writing Excel files in :ref:`memory <io.excel_writing_buffer>` using StringIO/BytesIO (:issue:`7074`)
- Enable serialization of lists and dicts to strings in ``ExcelWriter`` (:issue:`8188`)
- SQL io functions now accept a SQLAlchemy connectable. (:issue:`7877`)
- ``pd.read_sql`` and ``to_sql`` can accept database URI as ``con`` parameter (:issue:`10214`)
- ``read_sql_table`` will now allow reading from views (:issue:`10750`).
- Enable writing complex values to HDF stores when using table format (:issue:`10447`)
- Enable ``pd.read_hdf`` to be used without specifying a key when the HDF file contains a single dataset (:issue:`10443`)
- ``pd.read_stata`` will now read Stata 118 type files. (:issue:`9882`)
- ``msgpack`` submodule has been updated to 0.4.6 with backward compatibility (:issue:`10581`)
- ``DataFrame.to_dict`` now accepts the *index* option in ``orient`` keyword argument (:issue:`10844`).
- ``DataFrame.apply`` will return a Series of dicts if the passed function returns a dict and ``reduce=True`` (:issue:`8735`).
- Allow passing `kwargs` to the interpolation methods (:issue:`10378`).
- Improved error message when concatenating an empty iterable of dataframes (:issue:`9157`)
- ``pd.read_csv`` can now read bz2-compressed files incrementally, and the C parser can read bz2-compressed files from AWS S3 (:issue:`11070`, :issue:`11072`).
- In ``pd.read_csv``, recognize "s3n://" and "s3a://" URLs as designating S3 file storage (:issue:`11070`, :issue:`11071`).
- Read CSV files from AWS S3 incrementally, instead of first downloading the entire file. (Full file download still required for compressed files in Python 2.) (:issue:`11070`, :issue:`11073`)
- ``pd.read_csv`` is now able to infer compression type for files read from AWS S3 storage (:issue:`11070`, :issue:`11074`).
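A few of the enhancements listed above can be combined in one short sketch: using the ``indicator`` column as an anti-join filter, dropping every value that is duplicated with ``keep=False``, and limiting a nearest-neighbour ``reindex`` with ``tolerance``. The data reuses the small examples from the list; the variable names are illustrative.

```python
import pandas as pd

# merge(indicator=True): keep rows whose key appears only in the left frame
df1 = pd.DataFrame({'col1': [0, 1], 'col_left': ['a', 'b']})
df2 = pd.DataFrame({'col1': [1, 2, 2], 'col_right': [2, 2, 2]})
merged = pd.merge(df1, df2, on='col1', how='outer', indicator=True)
left_only = merged.loc[merged['_merge'] == 'left_only', ['col1', 'col_left']]

# drop_duplicates(keep=False): remove every value that occurs more than once
s = pd.Series(['A', 'B', 'C', 'A', 'B', 'D'])
seen_once = s.drop_duplicates(keep=False)

# reindex(tolerance=...): labels farther than 0.2 from any index value get NaN
df = pd.DataFrame({'x': range(5)})
nearest = df.reindex([0.1, 1.9, 3.5], method='nearest', tolerance=0.2)
```

Here ``3.5`` is 0.5 away from both neighbouring labels, so it falls outside the tolerance and its row is filled with ``NaN``.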
.. _whatsnew_0170.api:
.. _whatsnew_0170.api_breaking:
Backwards incompatible API changes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. _whatsnew_0170.api_breaking.sorting:
Changes to sorting API
^^^^^^^^^^^^^^^^^^^^^^
The sorting API has had some longtime inconsistencies. (:issue:`9816`, :issue:`8239`).
Here is a summary of the API **PRIOR** to 0.17.0:
- ``Series.sort`` is **INPLACE** while ``DataFrame.sort`` returns a new object.
- ``Series.order`` returns a new object
- It was possible to use ``Series/DataFrame.sort_index`` to sort by **values** by passing the ``by`` keyword.
- ``Series/DataFrame.sortlevel`` worked only on a ``MultiIndex`` for sorting by index.
To address these issues, we have revamped the API:
- We have introduced a new method, :meth:`DataFrame.sort_values`, which is the merger of ``DataFrame.sort()``, ``Series.sort()``,
and ``Series.order()``, to handle sorting of **values**.
- The existing methods ``Series.sort()``, ``Series.order()``, and ``DataFrame.sort()`` have been deprecated and will be removed in a
future version of pandas.
- The ``by`` argument of ``DataFrame.sort_index()`` has been deprecated and will be removed in a future version of pandas.
- The existing method ``.sort_index()`` will gain the ``level`` keyword to enable level sorting.
We now have two distinct and non-overlapping methods of sorting. A ``*`` marks items that
will show a ``FutureWarning``.
To sort by the **values**:
================================== ====================================
Previous                           Replacement
================================== ====================================
\* ``Series.order()``              ``Series.sort_values()``
\* ``Series.sort()``               ``Series.sort_values(inplace=True)``
\* ``DataFrame.sort(columns=...)`` ``DataFrame.sort_values(by=...)``
================================== ====================================
To sort by the **index**:
================================== ====================================
Previous                           Replacement
================================== ====================================
``Series.sort_index()``            ``Series.sort_index()``
``Series.sortlevel(level=...)``    ``Series.sort_index(level=...)``
``DataFrame.sort_index()``         ``DataFrame.sort_index()``
``DataFrame.sortlevel(level=...)`` ``DataFrame.sort_index(level=...)``
\* ``DataFrame.sort()``            ``DataFrame.sort_index()``
================================== ====================================
We have also deprecated and changed similar methods in two Series-like classes, ``Index`` and ``Categorical``.
================================== ====================================
Previous                           Replacement
================================== ====================================
\* ``Index.order()``               ``Index.sort_values()``
\* ``Categorical.order()``         ``Categorical.sort_values()``
================================== ====================================
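The replacements in the tables above can be exercised with a small sketch. The data is made up for illustration; the comments note which older call each line stands in for.

```python
import pandas as pd

s = pd.Series([3, 1, 2], index=['b', 'c', 'a'])

by_value = s.sort_values()           # replaces Series.order()
in_place = s.copy()
in_place.sort_values(inplace=True)   # replaces Series.sort()
by_label = s.sort_index()            # unchanged: sorts by the index

df = pd.DataFrame(
    {'x': [2, 1, 3]},
    index=pd.MultiIndex.from_tuples([('b', 1), ('a', 2), ('a', 1)],
                                    names=['outer', 'inner']))

by_column = df.sort_values(by='x')        # replaces DataFrame.sort(columns='x')
by_level = df.sort_index(level='outer')   # replaces DataFrame.sortlevel(level='outer')
```

Sorting by values and sorting by index are now separate calls, so the intent of each line is visible from the method name alone.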
.. _whatsnew_0170.api_breaking.to_datetime:
Changes to to_datetime and to_timedelta
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Error handling
""""""""""""""
The default for ``pd.to_datetime`` error handling has changed to ``errors='raise'``.
In prior versions it was ``errors='ignore'``. Furthermore, the ``coerce`` argument
has been deprecated in favor of ``errors='coerce'``. This means that invalid parsing
will raise rather than return the original input as in previous versions. (:issue:`10636`)
Previous Behavior:
.. code-block:: python
In [2]: pd.to_datetime(['2009-07-31', 'asd'])
Out[2]: array(['2009-07-31', 'asd'], dtype=object)
New Behavior:
.. code-block:: python
In [3]: pd.to_datetime(['2009-07-31', 'asd'])
ValueError: Unknown string format
Of course, you can coerce this as well.
.. ipython:: python
to_datetime(['2009-07-31', 'asd'], errors='coerce')
To keep the previous behavior, you can use ``errors='ignore'``:
.. ipython:: python
to_datetime(['2009-07-31', 'asd'], errors='ignore')
Furthermore, ``pd.to_timedelta`` has gained a similar API, of ``errors='raise'|'ignore'|'coerce'``, and the ``coerce`` keyword
has been deprecated in favor of ``errors='coerce'``.
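The two most common error modes can be sketched side by side. The input list is the one used above; the ``'bogus'`` timedelta string is a made-up invalid value for illustration.

```python
import pandas as pd

raw = ['2009-07-31', 'asd']

# Default (errors='raise'): unparseable input raises ValueError
try:
    pd.to_datetime(raw)
    raised = False
except ValueError:
    raised = True

# errors='coerce': unparseable entries become NaT instead
coerced = pd.to_datetime(raw, errors='coerce')

# to_timedelta follows the same convention
deltas = pd.to_timedelta(['1 day', 'bogus'], errors='coerce')
```

Coercion keeps the result as a proper datetime or timedelta index, so missing values can be located afterwards with ``isnull``.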
Consistent Parsing
""""""""""""""""""
The string parsing of ``to_datetime``, ``Timestamp`` and ``DatetimeIndex`` has
been made consistent. (:issue:`7599`)
Prior to v0.17.0, ``Timestamp`` and ``to_datetime`` could parse a year-only datetime string incorrectly using today's date, whereas ``DatetimeIndex``
used the beginning of the year. ``Timestamp`` and ``to_datetime`` could also raise ``ValueError`` for some types of datetime string that ``DatetimeIndex``
can parse, such as a quarterly string.
Previous Behavior:
.. code-block:: python
In [1]: Timestamp('2012Q2')
Traceback
...
ValueError: Unable to parse 2012Q2
# Results in today's date.
In [2]: Timestamp('2014')
Out[2]: 2014-08-12 00:00:00
v0.17.0 can parse them as below. It works on ``DatetimeIndex`` also.
New Behavior:
.. ipython:: python
Timestamp('2012Q2')
Timestamp('2014')
DatetimeIndex(['2012Q2', '2014'])
.. note::
If you want to perform calculations based on today's date, use ``Timestamp.now()`` and ``pandas.tseries.offsets``.
.. ipython:: python
import pandas.tseries.offsets as offsets
Timestamp.now()
Timestamp.now() + offsets.DateOffset(years=1)
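To make the new parsing results concrete, the sketch below checks the two strings from the example against explicit dates. It assumes quarter strings resolve to the first day of the quarter, as described above.

```python
import pandas as pd

# A quarterly string resolves to the start of that quarter
quarter = pd.Timestamp('2012Q2')

# A year-only string resolves to the start of the year, not today's date
year = pd.Timestamp('2014')

# Calculations relative to today should start from Timestamp.now()
next_year = pd.Timestamp.now() + pd.DateOffset(years=1)
```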
.. _whatsnew_0170.api_breaking.convert_objects:
Changes to convert_objects
^^^^^^^^^^^^^^^^^^^^^^^^^^
``DataFrame.convert_objects`` keyword arguments have been shortened. (:issue:`10265`)
===================== =============
Previous              Replacement
===================== =============
``convert_dates``     ``datetime``
``convert_numeric``   ``numeric``
``convert_timedelta`` ``timedelta``
===================== =============
Coercing types with ``DataFrame.convert_objects`` is now implemented using the
keyword argument ``coerce=True``. Previously types were coerced by setting a
keyword argument to ``'coerce'`` instead of ``True``, as in ``convert_dates='coerce'``.
.. ipython:: python
df = pd.DataFrame({'i': ['1','2'],
'f': ['apple', '4.2'],
's': ['apple','banana']})
df
The old usage of ``DataFrame.convert_objects`` used ``'coerce'`` along with the
type.
.. code-block:: python
In [2]: df.convert_objects(convert_numeric='coerce')
Now the ``coerce`` keyword must be explicitly used.
.. ipython:: python
df.convert_objects(numeric=True, coerce=True)
In earlier versions of pandas, ``DataFrame.convert_objects`` would not coerce
numeric types when there were no values convertible to a numeric type. This returns
the original DataFrame with no conversion.
.. code-block:: python
In [1]: df = pd.DataFrame({'s': ['a','b']})
In [2]: df.convert_objects(convert_numeric='coerce')
Out[2]:
s
0 a
1 b
The new behavior will convert all non-number-like strings to ``NaN``,
when ``coerce=True`` is passed explicitly.
.. ipython:: python
df = pd.DataFrame({'s': ['a','b']})
df.convert_objects(numeric=True, coerce=True)
In earlier versions of pandas, the default behavior was to try and convert
datetimes and timestamps. The new default is for ``DataFrame.convert_objects``
to do nothing, and so it is necessary to pass at least one conversion target
in the method call.
Changes to Index Comparisons
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The equality operator on ``Index`` should behave similarly to ``Series`` (:issue:`9947`, :issue:`10637`)
Starting in v0.17.0, comparing ``Index`` objects of different lengths will raise
a ``ValueError``. This is to be consistent with the behavior of ``Series``.
Previous Behavior:
.. code-block:: python
In [2]: pd.Index([1, 2, 3]) == pd.Index([1, 4, 5])
Out[2]: array([ True, False, False], dtype=bool)
In [3]: pd.Index([1, 2, 3]) == pd.Index([2])
Out[3]: array([False, True, False], dtype=bool)
In [4]: pd.Index([1, 2, 3]) == pd.Index([1, 2])
Out[4]: False
New Behavior:
.. code-block:: python
In [8]: pd.Index([1, 2, 3]) == pd.Index([1, 4, 5])
Out[8]: array([ True, False, False], dtype=bool)
In [9]: pd.Index([1, 2, 3]) == pd.Index([2])
ValueError: Lengths must match to compare
In [10]: pd.Index([1, 2, 3]) == pd.Index([1, 2])
ValueError: Lengths must match to compare
Note that this is different from the ``numpy`` behavior where a comparison can
be broadcast:
.. ipython:: python
np.array([1, 2, 3]) == np.array([1])
or it can return False if broadcasting cannot be done:
.. ipython:: python
np.array([1, 2, 3]) == np.array([1, 2])
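Code that previously relied on a mismatched comparison returning ``False`` can be adapted along these lines. This is a sketch: ``Index.equals`` is used here as one way to get a single boolean for the whole object, which is a suggestion rather than something prescribed by this change.

```python
import pandas as pd

left = pd.Index([1, 2, 3])
right = pd.Index([1, 2])

# Element-wise comparison of different lengths now raises
try:
    left == right
    message = None
except ValueError as exc:
    message = str(exc)

# A single True/False answer for the whole index
same = left.equals(right)

# Equal lengths still compare element-wise
elementwise = left == pd.Index([1, 4, 5])
```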
Changes to Boolean Comparisons vs. None
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Boolean comparisons of a ``Series`` vs ``None`` will now be equivalent to comparing with ``np.nan``, rather than raise ``TypeError``. (:issue:`1079`).
.. ipython:: python
s = Series(range(3))
s.iloc[1] = None
s
Previous Behavior:
.. code-block:: python
In [5]: s==None
TypeError: Could not compare <type 'NoneType'> type with Series
New Behavior:
.. ipython:: python
s==None
Usually you simply want to know which values are null.
.. ipython:: python
s.isnull()
.. warning::
You generally will want to use ``isnull/notnull`` for these types of comparisons, as ``isnull/notnull`` tells you which elements are null. One has to be
mindful that ``NaN`` values don't compare equal, but ``None`` values do. Note that pandas/numpy uses the fact that ``np.nan != np.nan``, and treats ``None`` like ``np.nan``.
.. ipython:: python
None == None
np.nan == np.nan
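The difference between the two checks can be seen in a short sketch. The series is built directly with a missing value here rather than by assignment, which is a simplification for the example.

```python
import pandas as pd

s = pd.Series([0.0, None, 2.0])

# Comparing with None behaves like comparing with NaN: never equal
equals_none = (s == None).tolist()  # noqa: E711

# isnull reports which elements are actually missing
missing = s.isnull().tolist()
```

The comparison returns ``False`` everywhere, including at the missing position, which is why ``isnull`` is the appropriate test for missing data.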
.. _whatsnew_0170.api_breaking.hdf_dropna:
HDFStore dropna behavior
^^^^^^^^^^^^^^^^^^^^^^^^
The default behavior for HDFStore write functions with ``format='table'`` is now to keep rows that are all missing. Previously, the behavior was to drop rows that were all missing save the index. The previous behavior can be replicated using the ``dropna=True`` option. (:issue:`9382`)
Previous Behavior:
.. ipython:: python
df_with_missing = pd.DataFrame({'col1':[0, np.nan, 2],
'col2':[1, np.nan, np.nan]})
df_with_missing
.. code-block:: python
In [28]:
df_with_missing.to_hdf('file.h5',
'df_with_missing',
format='table',
mode='w')
pd.read_hdf('file.h5', 'df_with_missing')
Out [28]:
col1 col2
0 0 1
2 2 NaN
New Behavior:
.. ipython:: python
:suppress:
import os
.. ipython:: python
df_with_missing.to_hdf('file.h5',
'df_with_missing',
format='table',
mode='w')
pd.read_hdf('file.h5', 'df_with_missing')
.. ipython:: python
:suppress:
os.remove('file.h5')
See :ref:`documentation <io.hdf5>` for more details.
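To preview which rows would be affected without writing a file, an in-memory analogue may help. This sketch does not touch ``HDFStore`` at all; it assumes, based on the description above, that ``DataFrame.dropna(how='all')`` selects the same rows that the ``dropna=True`` option would keep:

```python
import numpy as np
import pandas as pd

df_with_missing = pd.DataFrame({'col1': [0, np.nan, 2],
                                'col2': [1, np.nan, np.nan]})

# Rows in which every column is missing (row 1 here)
all_missing = df_with_missing.isnull().all(axis=1)

# In-memory stand-in for the old default: drop rows that are entirely missing
kept = df_with_missing.dropna(how='all')

print(all_missing.tolist())
print(kept.index.tolist())
```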
.. _whatsnew_0170.api_breaking.display_precision:
Changes to ``display.precision`` option
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The ``display.precision`` option has been clarified to refer to decimal places (:issue:`10451`).
Earlier versions of pandas would format floating point numbers to have one less decimal place than the value in
``display.precision``.
.. code-block:: python
In [1]: pd.set_option('display.precision', 2)
In [2]: pd.DataFrame({'x': [123.456789]})
Out[2]:
x
0 123.5
Interpreting precision as "significant figures" worked for scientific notation, but that same interpretation
did not work for values with standard formatting. It was also out of step with how numpy handles formatting.
Going forward the value of ``display.precision`` will directly control the number of places after the decimal, for
regular formatting as well as scientific notation, similar to how numpy's ``precision`` print option works.
.. ipython:: python
pd.set_option('display.precision', 2)
pd.DataFrame({'x': [123.456789]})
To preserve output behavior with prior versions the default value of ``display.precision`` has been reduced to ``6``
from ``7``.
.. ipython:: python
:suppress:
pd.set_option('display.precision', 6)
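When the precision should change only for one piece of output, ``pd.option_context`` avoids having to reset the option by hand. A minimal sketch, with an arbitrary example frame:

```python
import pandas as pd

df = pd.DataFrame({'x': [123.456789]})

# option_context restores the previous setting when the block exits
with pd.option_context('display.precision', 2):
    rendered = repr(df)

print(rendered)  # the value is shown with two places after the decimal
```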
.. _whatsnew_0170.api_breaking.categorical_unique:
Changes to ``Categorical.unique``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
``Categorical.unique`` now returns a new ``Categorical`` with ``categories`` and ``codes`` that are unique, rather than returning an ``np.array`` (:issue:`10508`).
- unordered category: values and categories are sorted by appearance order.
- ordered category: values are sorted by appearance order, categories keep existing order.
.. ipython :: python
cat = pd.Categorical(['C', 'A', 'B', 'C'],
categories=['A', 'B', 'C'],
ordered=True)
cat
cat.unique()
cat = pd.Categorical(['C', 'A', 'B', 'C'],
categories=['A', 'B', 'C'])
cat
cat.unique()
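Code that relied on the old ``np.array`` return type can check for the new type and convert explicitly. The sketch below only inspects the unique values, which appear in order of first appearance; the order of the resulting ``categories`` follows the rules above and may differ in other pandas versions:

```python
import numpy as np
import pandas as pd

cat = pd.Categorical(['C', 'A', 'B', 'C'],
                     categories=['A', 'B', 'C'],
                     ordered=True)

uniq = cat.unique()

# The result is a Categorical, not a plain ndarray
is_categorical = isinstance(uniq, pd.Categorical)

# Values keep their order of first appearance
values = list(uniq)

# Convert explicitly if an ndarray is still required
as_array = np.asarray(uniq)

print(is_categorical, values)
```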
.. _whatsnew_0170.api_breaking.other:
Other API Changes
^^^^^^^^^^^^^^^^^
- Line and kde plots with ``subplots=True`` now use default colors, not all black. Specify ``color='k'`` to draw all lines in black (:issue:`9894`)
- Calling the ``.value_counts()`` method on a Series with ``categorical`` dtype now returns a Series with a ``CategoricalIndex`` (:issue:`10704`)
- The metadata properties of subclasses of pandas objects will now be serialized (:issue:`10553`).
- ``groupby`` using ``Categorical`` follows the same rule as ``Categorical.unique`` described above (:issue:`10508`)
- Constructing a ``DataFrame`` with an array of ``complex64`` dtype previously meant the corresponding column
was automatically promoted to the ``complex128`` dtype. Pandas will now preserve the itemsize of the input for complex data (:issue:`10952`)
- Some numeric reduction operators would raise ``ValueError``, rather than ``TypeError``, on object types that include strings and numbers (:issue:`11131`)
- ``NaT``'s methods now either raise ``ValueError``, or return ``np.nan`` or ``NaT`` (:issue:`9513`)
=============================== ===============================================================
Behavior Methods
=============================== ===============================================================
return ``np.nan`` ``weekday``, ``isoweekday``
return ``NaT`` ``date``, ``now``, ``replace``, ``to_datetime``, ``today``
return ``np.datetime64('NaT')`` ``to_datetime64`` (unchanged)
raise ``ValueError`` All other public methods (names not beginning with underscores)
=============================== ===============================================================
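A few rows of the ``NaT`` table can be checked directly. This is a small sketch covering only the methods that return a value; it does not exercise the ``ValueError`` row:

```python
import numpy as np
import pandas as pd

weekday = pd.NaT.weekday()        # returns np.nan
as_date = pd.NaT.date()           # returns NaT
as_dt64 = pd.NaT.to_datetime64()  # returns np.datetime64('NaT')

print(np.isnan(weekday), as_date is pd.NaT, np.isnat(as_dt64))
```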
.. _whatsnew_0170.deprecations:
Deprecations
^^^^^^^^^^^^
- For ``Series`` the following indexing functions are deprecated (:issue:`10177`).
===================== =================================
Deprecated Function Replacement
===================== =================================
``.irow(i)`` ``.iloc[i]`` or ``.iat[i]``
``.iget(i)`` ``.iloc[i]`` or ``.iat[i]``
``.iget_value(i)`` ``.iloc[i]`` or ``.iat[i]``
===================== =================================
- For ``DataFrame`` the following indexing functions are deprecated (:issue:`10177`).
===================== =================================
Deprecated Function Replacement
===================== =================================
``.irow(i)`` ``.iloc[i]``
``.iget_value(i, j)`` ``.iloc[i, j]`` or ``.iat[i, j]``
``.icol(j)`` ``.iloc[:, j]``
===================== =================================
.. note:: These indexing functions have been deprecated in the documentation since 0.11.0.
- ``Categorical.name`` was deprecated to make ``Categorical`` more ``numpy.ndarray`` like. Use ``Series(cat, name="whatever")`` instead (:issue:`10482`).
- Setting missing values (NaN) in a ``Categorical``'s ``categories`` will issue a warning (:issue:`10748`). You can still have missing values in the ``values``.
- ``drop_duplicates`` and ``duplicated``'s ``take_last`` keyword was deprecated in favor of ``keep``. (:issue:`6511`, :issue:`8505`)
- ``Series.nsmallest`` and ``nlargest``'s ``take_last`` keyword was deprecated in favor of ``keep``. (:issue:`10792`)
- ``DataFrame.combineAdd`` and ``DataFrame.combineMult`` are deprecated. They
can easily be replaced by using the ``add`` and ``mul`` methods:
``DataFrame.add(other, fill_value=0)`` and ``DataFrame.mul(other, fill_value=1.)``
(:issue:`10735`).
- ``TimeSeries`` deprecated in favor of ``Series`` (note that this has been an alias since 0.13.0) (:issue:`10890`)
- ``SparsePanel`` deprecated and will be removed in a future version (:issue:``)
- ``Series.is_time_series`` deprecated in favor of ``Series.index.is_all_dates`` (:issue:`11135`)
- Legacy offsets (like ``'A@JAN'``) listed :ref:`here <timeseries.legacyaliases>` are deprecated (note that these have been aliases since 0.8.0) (:issue:`10878`)
- ``WidePanel`` deprecated in favor of ``Panel``, ``LongPanel`` in favor of ``DataFrame`` (note these have been aliases since < 0.11.0), (:issue:`10892`)
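Several of the replacements listed above can be sketched together. The frames and variable names here are illustrative only; each comment names the deprecated spelling that the line stands in for:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 2], 'B': [10.0, 20.0, 20.0]})

row = df.iloc[0]      # instead of df.irow(0)
col = df.iloc[:, 1]   # instead of df.icol(1)
value = df.iat[1, 0]  # instead of df.iget_value(1, 0)

# keep='last' replaces take_last=True
deduped = df.drop_duplicates(keep='last')

# add with fill_value=0 replaces combineAdd; column B is missing from
# `other`, so it is treated as 0 there
other = pd.DataFrame({'A': [1, 1, 1]})
total = df.add(other, fill_value=0)

print(value)
print(deduped.index.tolist())
print(total)
```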
.. _whatsnew_0170.prior_deprecations:
Removal of prior version deprecations/changes
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- Removal of ``na_last`` parameters from ``Series.order()`` and ``Series.sort()``, in favor of ``na_position``, xref (:issue:`5231`)
- Removal of ``percentile_width`` from ``.describe()``, in favor of ``percentiles``. (:issue:`7088`)
- Removal of ``colSpace`` parameter from ``DataFrame.to_string()``, in favor of ``col_space``, circa 0.8.0 version.
- Removal of automatic time-series broadcasting (:issue:`2304`)
.. ipython :: python
np.random.seed(1234)
df = DataFrame(np.random.randn(5,2),columns=list('AB'),index=date_range('20130101',periods=5))
df
Previously
.. code-block:: python
In [3]: df + df.A
FutureWarning: TimeSeries broadcasting along DataFrame index by default is deprecated.
Please use DataFrame.<op> to explicitly broadcast arithmetic operations along the index
Out[3]:
A B
2013-01-01 0.942870 -0.719541
2013-01-02 2.865414 1.120055
2013-01-03 -1.441177 0.166574