Commit 9fd8d13

Merge branch 'master' into refactor/csvs
2 parents 9df1d82 + 9ad6363

124 files changed (+4831, -688 lines)

ci/deps/azure-37-locale_slow.yaml (+1 -1)

@@ -18,7 +18,7 @@ dependencies:
 - lxml
 - matplotlib=3.0.0
 - numpy=1.16.*
-- openpyxl=2.5.7
+- openpyxl=2.6.0
 - python-dateutil
 - python-blosc
 - pytz=2017.3

ci/deps/azure-37-minimum_versions.yaml (+1 -1)

@@ -19,7 +19,7 @@ dependencies:
 - numba=0.46.0
 - numexpr=2.6.8
 - numpy=1.16.5
-- openpyxl=2.5.7
+- openpyxl=2.6.0
 - pytables=3.4.4
 - python-dateutil=2.7.3
 - pytz=2017.3

doc/source/development/contributing_docstring.rst (+5 -5)

@@ -32,18 +32,18 @@ The next example gives an idea of what a docstring looks like:
 Parameters
 ----------
 num1 : int
-    First number to add
+    First number to add.
 num2 : int
-    Second number to add
+    Second number to add.

 Returns
 -------
 int
-    The sum of `num1` and `num2`
+    The sum of `num1` and `num2`.

 See Also
 --------
-subtract : Subtract one integer from another
+subtract : Subtract one integer from another.

 Examples
 --------

@@ -998,4 +998,4 @@ mapping function names to docstrings. Wherever possible, we prefer using

 See ``pandas.core.generic.NDFrame.fillna`` for an example template, and
 ``pandas.core.series.Series.fillna`` and ``pandas.core.generic.frame.fillna``
-for the filled versions.
+for the filled versions.

doc/source/getting_started/install.rst (+1 -1)

@@ -274,7 +274,7 @@ html5lib 1.0.1 HTML parser for read_html (see :ref
 lxml 4.3.0 HTML parser for read_html (see :ref:`note <optional_html>`)
 matplotlib 2.2.3 Visualization
 numba 0.46.0 Alternative execution engine for rolling operations
-openpyxl 2.5.7 Reading / writing for xlsx files
+openpyxl 2.6.0 Reading / writing for xlsx files
 pandas-gbq 0.12.0 Google Big Query access
 psycopg2 2.7 PostgreSQL engine for sqlalchemy
 pyarrow 0.15.0 Parquet, ORC, and feather reading / writing
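For orientation, a minimal sketch of the xlsx round trip that the ``openpyxl`` row refers to (not part of the diff; the file name is illustrative and openpyxl must be installed at or above the listed minimum version):

.. code-block:: python

   import pandas as pd

   df = pd.DataFrame({"A": [1, 2, 3]})

   # openpyxl is the engine used here to write the .xlsx file
   df.to_excel("example.xlsx", engine="openpyxl", index=False)
   roundtrip = pd.read_excel("example.xlsx")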

doc/source/reference/frame.rst (+16)

@@ -37,6 +37,7 @@ Attributes and underlying data
 DataFrame.shape
 DataFrame.memory_usage
 DataFrame.empty
+DataFrame.set_flags

 Conversion
 ~~~~~~~~~~

@@ -276,6 +277,21 @@ Time Series-related
 DataFrame.tz_convert
 DataFrame.tz_localize

+.. _api.frame.flags:
+
+Flags
+~~~~~
+
+Flags refer to attributes of the pandas object. Properties of the dataset (like
+the date it was recorded, the URL it was accessed from, etc.) should be stored
+in :attr:`DataFrame.attrs`.
+
+.. autosummary::
+   :toctree: api/
+
+   Flags
+
 .. _api.frame.metadata:

 Metadata
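For context, a minimal sketch of the distinction the added paragraph draws between ``attrs`` and ``flags`` (not part of the diff; the ``source_url`` key is purely illustrative, and ``flags``/``set_flags`` assume pandas 1.2+):

.. code-block:: python

   import pandas as pd

   df = pd.DataFrame({"A": [1, 2]})

   # Properties of the *dataset* (origin, recording date, ...) belong in .attrs
   df.attrs["source_url"] = "https://example.com/data.csv"

   # Flags describe the pandas object itself, not the data it holds
   df.flags.allows_duplicate_labels      # True by default
   new = df.set_flags(allows_duplicate_labels=False)  # returns a new DataFrame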

doc/source/reference/general_utility_functions.rst (+1)

@@ -37,6 +37,7 @@ Exceptions and warnings

 errors.AccessorRegistrationWarning
 errors.DtypeWarning
+errors.DuplicateLabelError
 errors.EmptyDataError
 errors.InvalidIndexError
 errors.MergeError

doc/source/reference/series.rst (+15)

@@ -39,6 +39,8 @@ Attributes
 Series.empty
 Series.dtypes
 Series.name
+Series.flags
+Series.set_flags

 Conversion
 ----------

@@ -527,6 +529,19 @@ Sparse-dtype specific methods and attributes are provided under the
 Series.sparse.from_coo
 Series.sparse.to_coo

+.. _api.series.flags:
+
+Flags
+~~~~~
+
+Flags refer to attributes of the pandas object. Properties of the dataset (like
+the date it was recorded, the URL it was accessed from, etc.) should be stored
+in :attr:`Series.attrs`.
+
+.. autosummary::
+   :toctree: api/
+
+   Flags

 .. _api.series.metadata:

doc/source/user_guide/computation.rst (+3)

@@ -361,6 +361,9 @@ compute the mean absolute deviation on a rolling basis:
 @savefig rolling_apply_ex.png
 s.rolling(window=60).apply(mad, raw=True).plot(style='k')

+Using the Numba engine
+~~~~~~~~~~~~~~~~~~~~~~
+
 .. versionadded:: 1.0

 Additionally, :meth:`~Rolling.apply` can leverage `Numba <https://numba.pydata.org/>`__
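To make the new section heading concrete, a minimal sketch of the Numba engine in :meth:`~Rolling.apply` (not part of the diff; it requires the optional ``numba`` dependency, and the ``mad`` function and window size are illustrative):

.. code-block:: python

   import numpy as np
   import pandas as pd

   def mad(x):
       # mean absolute deviation of one window; plain NumPy so Numba can compile it
       return np.fabs(x - x.mean()).mean()

   s = pd.Series(np.random.randn(1000))

   # engine="numba" JIT-compiles ``mad``; raw=True passes each window as an ndarray
   result = s.rolling(window=60).apply(mad, raw=True, engine="numba")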

doc/source/user_guide/duplicates.rst (new file, +210)

@@ -0,0 +1,210 @@
.. _duplicates:

****************
Duplicate Labels
****************

:class:`Index` objects are not required to be unique; you can have duplicate row
or column labels. This may be a bit confusing at first. If you're familiar with
SQL, you know that row labels are similar to a primary key on a table, and you
would never want duplicates in a SQL table. But one of pandas' roles is to clean
messy, real-world data before it goes to some downstream system. And real-world
data has duplicates, even in fields that are supposed to be unique.

This section describes how duplicate labels change the behavior of certain
operations, how to prevent duplicates from arising during operations, and how to
detect them if they do.

.. ipython:: python

   import pandas as pd
   import numpy as np

Consequences of Duplicate Labels
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Some pandas methods (:meth:`Series.reindex` for example) just don't work with
duplicates present. The output can't be determined, and so pandas raises.

.. ipython:: python
   :okexcept:

   s1 = pd.Series([0, 1, 2], index=['a', 'b', 'b'])
   s1.reindex(['a', 'b', 'c'])

Other methods, like indexing, can give very surprising results. Typically
indexing with a scalar will *reduce dimensionality*. Slicing a ``DataFrame``
with a scalar will return a ``Series``. Slicing a ``Series`` with a scalar will
return a scalar. But with duplicates, this isn't the case.

.. ipython:: python

   df1 = pd.DataFrame([[0, 1, 2], [3, 4, 5]], columns=['A', 'A', 'B'])
   df1

We have duplicates in the columns. If we slice ``'B'``, we get back a ``Series``:

.. ipython:: python

   df1['B']  # a Series

But slicing ``'A'`` returns a ``DataFrame``:

.. ipython:: python

   df1['A']  # a DataFrame

This applies to row labels as well:

.. ipython:: python

   df2 = pd.DataFrame({"A": [0, 1, 2]}, index=['a', 'a', 'b'])
   df2
   df2.loc['b', 'A']  # a scalar
   df2.loc['a', 'A']  # a Series

Duplicate Label Detection
~~~~~~~~~~~~~~~~~~~~~~~~~

You can check whether an :class:`Index` (storing the row or column labels) is
unique with :attr:`Index.is_unique`:

.. ipython:: python

   df2
   df2.index.is_unique
   df2.columns.is_unique

.. note::

   Checking whether an index is unique is somewhat expensive for large datasets.
   Pandas does cache this result, so re-checking on the same index is very fast.

:meth:`Index.duplicated` will return a boolean ndarray indicating whether a
label is repeated.

.. ipython:: python

   df2.index.duplicated()

Which can be used as a boolean filter to drop duplicate rows.

.. ipython:: python

   df2.loc[~df2.index.duplicated(), :]

If you need additional logic to handle duplicate labels, rather than just
dropping the repeats, using :meth:`~DataFrame.groupby` on the index is a common
trick. For example, we'll resolve duplicates by taking the average of all rows
with the same label.

.. ipython:: python

   df2.groupby(level=0).mean()

.. _duplicates.disallow:

Disallowing Duplicate Labels
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. versionadded:: 1.2.0

As noted above, handling duplicates is an important feature when reading in raw
data. That said, you may want to avoid introducing duplicates as part of a data
processing pipeline (from methods like :meth:`pandas.concat`,
:meth:`~DataFrame.rename`, etc.). Both :class:`Series` and :class:`DataFrame`
can *disallow* duplicate labels by calling ``.set_flags(allows_duplicate_labels=False)``
(the default is to allow them). If there are duplicate labels, an exception
will be raised.

.. ipython:: python
   :okexcept:

   pd.Series(
       [0, 1, 2],
       index=['a', 'b', 'b']
   ).set_flags(allows_duplicate_labels=False)

This applies to both row and column labels for a :class:`DataFrame`:

.. ipython:: python
   :okexcept:

   pd.DataFrame(
       [[0, 1, 2], [3, 4, 5]], columns=["A", "B", "C"],
   ).set_flags(allows_duplicate_labels=False)

This attribute can be checked or set with :attr:`~DataFrame.flags.allows_duplicate_labels`,
which indicates whether that object can have duplicate labels.

.. ipython:: python

   df = (
       pd.DataFrame({"A": [0, 1, 2, 3]},
                    index=['x', 'y', 'X', 'Y'])
       .set_flags(allows_duplicate_labels=False)
   )
   df
   df.flags.allows_duplicate_labels

:meth:`DataFrame.set_flags` can be used to return a new ``DataFrame`` with attributes
like ``allows_duplicate_labels`` set to some value.

.. ipython:: python

   df2 = df.set_flags(allows_duplicate_labels=True)
   df2.flags.allows_duplicate_labels

The new ``DataFrame`` returned is a view on the same data as the old ``DataFrame``.
Or the property can just be set directly on the same object:

.. ipython:: python

   df2.flags.allows_duplicate_labels = False
   df2.flags.allows_duplicate_labels

When processing raw, messy data you might initially read in the messy data
(which potentially has duplicate labels), deduplicate, and then disallow duplicates
going forward, to ensure that your data pipeline doesn't introduce duplicates.

.. code-block:: python

   >>> raw = pd.read_csv("...")
   >>> deduplicated = raw.groupby(level=0).first()  # remove duplicates
   >>> deduplicated.flags.allows_duplicate_labels = False  # disallow going forward

Setting ``allows_duplicate_labels=False`` on a ``Series`` or ``DataFrame`` with duplicate
labels, or performing an operation that introduces duplicate labels on a ``Series`` or
``DataFrame`` that disallows duplicates, will raise an
:class:`errors.DuplicateLabelError`.

.. ipython:: python
   :okexcept:

   df.rename(str.upper)

This error message contains the labels that are duplicated, and the numeric positions
of all the duplicates (including the "original") in the ``Series`` or ``DataFrame``.

Duplicate Label Propagation
^^^^^^^^^^^^^^^^^^^^^^^^^^^

In general, disallowing duplicates is "sticky". It's preserved through
operations.

.. ipython:: python
   :okexcept:

   s1 = pd.Series(0, index=['a', 'b']).set_flags(allows_duplicate_labels=False)
   s1
   s1.head().rename({"a": "b"})

.. warning::

   This is an experimental feature. Currently, many methods fail to
   propagate the ``allows_duplicate_labels`` value. In future versions
   it is expected that every method taking or returning one or more
   DataFrame or Series objects will propagate ``allows_duplicate_labels``.
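Complementing the examples above, a minimal sketch (not part of the diff) of handling the error programmatically; it assumes pandas 1.2+, where ``pandas.errors.DuplicateLabelError`` and ``set_flags`` are available:

.. code-block:: python

   import pandas as pd

   s = pd.Series([0, 1, 2], index=["a", "b", "b"])
   try:
       s.set_flags(allows_duplicate_labels=False)
   except pd.errors.DuplicateLabelError as err:
       # the message lists the duplicated labels and their positions
       print(err)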

doc/source/user_guide/enhancingperf.rst (+7)

@@ -373,6 +373,13 @@ nicer interface by passing/returning pandas objects.

 In this example, using Numba was faster than Cython.

+Numba as an argument
+~~~~~~~~~~~~~~~~~~~~
+
+Additionally, we can leverage the power of `Numba <https://numba.pydata.org/>`__
+by calling it as an argument in :meth:`~Rolling.apply`. See :ref:`Computation tools
+<stats.rolling_apply>` for an extensive example.
+
 Vectorize
 ~~~~~~~~~

doc/source/user_guide/index.rst (+1)

@@ -33,6 +33,7 @@ Further information on any specific method can be obtained in the
 reshaping
 text
 missing_data
+duplicates
 categorical
 integer_na
 boolean

doc/source/whatsnew/v1.1.2.rst (+12 -2)

@@ -18,9 +18,10 @@ Fixed regressions
 - Fix regression in updating a column inplace (e.g. using ``df['col'].fillna(.., inplace=True)``) (:issue:`35731`)
 - Performance regression for :meth:`RangeIndex.format` (:issue:`35712`)
 - Regression in :meth:`DataFrame.replace` where a ``TypeError`` would be raised when attempting to replace elements of type :class:`Interval` (:issue:`35931`)
+- Fix regression in pickle roundtrip of the ``closed`` attribute of :class:`IntervalIndex` (:issue:`35658`)
+- Fixed regression in :meth:`DataFrameGroupBy.agg` where a ``ValueError: buffer source array is read-only`` would be raised when the underlying array is read-only (:issue:`36014`)
 -

-
 .. ---------------------------------------------------------------------------

 .. _whatsnew_112.bug_fixes:

@@ -30,8 +31,17 @@ Bug fixes
 - Bug in :meth:`DataFrame.eval` with ``object`` dtype column binary operations (:issue:`35794`)
 - Bug in :class:`Series` constructor raising a ``TypeError`` when constructing sparse datetime64 dtypes (:issue:`35762`)
 - Bug in :meth:`DataFrame.apply` with ``result_type="reduce"`` returning with incorrect index (:issue:`35683`)
-- Bug in :meth:`DateTimeIndex.format` and :meth:`PeriodIndex.format` with ``name=True`` setting the first item to ``"None"`` where it should bw ``""`` (:issue:`35712`)
+- Bug in :meth:`DateTimeIndex.format` and :meth:`PeriodIndex.format` with ``name=True`` setting the first item to ``"None"`` where it should be ``""`` (:issue:`35712`)
 - Bug in :meth:`Float64Index.__contains__` incorrectly raising ``TypeError`` instead of returning ``False`` (:issue:`35788`)
+- Bug in :class:`DataFrame` indexing returning an incorrect :class:`Series` in some cases when the series has been altered and a cache not invalidated (:issue:`33675`)
+
+.. ---------------------------------------------------------------------------
+
+.. _whatsnew_112.other:
+
+Other
+~~~~~
+- :meth:`factorize` now supports ``na_sentinel=None`` to include NaN in the uniques of the values, and removes the ``dropna`` keyword that was unintentionally exposed in the public API of :meth:`factorize` in version 1.1 (:issue:`35667`)

 .. ---------------------------------------------------------------------------
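To illustrate the ``factorize`` entry above, a minimal sketch (not part of the diff; the sample values are made up) of the ``na_sentinel=None`` behavior:

.. code-block:: python

   import numpy as np
   import pandas as pd

   values = np.array(["a", np.nan, "b", "a"], dtype=object)

   # Default: NaN is marked with the sentinel code -1 and left out of `uniques`
   codes, uniques = pd.factorize(values)

   # na_sentinel=None: NaN gets its own code and appears in `uniques`
   codes, uniques = pd.factorize(values, na_sentinel=None)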
