
Commit f9ff20b

Merge branch 'master' of https://github.com/pandas-dev/pandas into bool_ops3
2 parents f19a596 + 52559f5

106 files changed: +3215 / -2341 lines


README.md (+4, -1)

@@ -216,13 +216,16 @@ Further, general questions and discussions can also take place on the [pydata ma
 ## Discussion and Development
 Most development discussion is taking place on github in this repo. Further, the [pandas-dev mailing list](https://mail.python.org/mailman/listinfo/pandas-dev) can also be used for specialized discussions or design issues, and a [Gitter channel](https://gitter.im/pydata/pandas) is available for quick development related questions.

-## Contributing to pandas
+## Contributing to pandas [![Open Source Helpers](https://www.codetriage.com/pandas-dev/pandas/badges/users.svg)](https://www.codetriage.com/pandas-dev/pandas)
+
 All contributions, bug reports, bug fixes, documentation improvements, enhancements and ideas are welcome.

 A detailed overview on how to contribute can be found in the **[contributing guide.](https://pandas.pydata.org/pandas-docs/stable/contributing.html)**

 If you are simply looking to start working with the pandas codebase, navigate to the [GitHub “issues” tab](https://github.com/pandas-dev/pandas/issues) and start looking through interesting issues. There are a number of issues listed under [Docs](https://github.com/pandas-dev/pandas/issues?labels=Docs&sort=updated&state=open) and [Difficulty Novice](https://github.com/pandas-dev/pandas/issues?q=is%3Aopen+is%3Aissue+label%3A%22Difficulty+Novice%22) where you could start out.

+You can also triage issues which may include reproducing bug reports, or asking for vital information such as version numbers or reproduction instructions. If you would like to start triaging issues, one easy way to get started is to [subscribe to pandas on CodeTriage](https://www.codetriage.com/pandas-dev/pandas).
+
 Or maybe through using pandas you have an idea of your own or are looking for something in the documentation and thinking ‘this can be improved’...you can do something about it!

 Feel free to ask questions on the [mailing list](https://groups.google.com/forum/?fromgroups#!forum/pydata) or on [Gitter](https://gitter.im/pydata/pandas).

asv_bench/benchmarks/groupby.py (+19, -7)

@@ -11,6 +11,13 @@
 from .pandas_vb_common import setup  # noqa


+method_blacklist = {
+    'object': {'median', 'prod', 'sem', 'cumsum', 'sum', 'cummin', 'mean',
+               'max', 'skew', 'cumprod', 'cummax', 'rank', 'pct_change', 'min',
+               'var', 'mad', 'describe', 'std'}
+}
+
+
 class ApplyDictReturn(object):
     goal_time = 0.2

@@ -153,6 +160,7 @@ def time_frame_nth_any(self, df):
     def time_frame_nth(self, df):
         df.groupby(0).nth(0)

+
     def time_series_nth_any(self, df):
         df[1].groupby(df[0]).nth(0, dropna='any')

@@ -369,23 +377,27 @@ class GroupByMethods(object):
     goal_time = 0.2

     param_names = ['dtype', 'method']
-    params = [['int', 'float'],
-              ['all', 'any', 'count', 'cumcount', 'cummax', 'cummin',
-               'cumprod', 'cumsum', 'describe', 'first', 'head', 'last', 'mad',
-               'max', 'min', 'median', 'mean', 'nunique', 'pct_change', 'prod',
-               'rank', 'sem', 'shift', 'size', 'skew', 'std', 'sum', 'tail',
-               'unique', 'value_counts', 'var']]
+    params = [['int', 'float', 'object'],
+              ['all', 'any', 'bfill', 'count', 'cumcount', 'cummax', 'cummin',
+               'cumprod', 'cumsum', 'describe', 'ffill', 'first', 'head',
+               'last', 'mad', 'max', 'min', 'median', 'mean', 'nunique',
+               'pct_change', 'prod', 'rank', 'sem', 'shift', 'size', 'skew',
+               'std', 'sum', 'tail', 'unique', 'value_counts', 'var']]

     def setup(self, dtype, method):
+        if method in method_blacklist.get(dtype, {}):
+            raise NotImplementedError  # skip benchmark
         ngroups = 1000
         size = ngroups * 2
         rng = np.arange(ngroups)
         values = rng.take(np.random.randint(0, ngroups, size=size))
         if dtype == 'int':
             key = np.random.randint(0, size, size=size)
-        else:
+        elif dtype == 'float':
             key = np.concatenate([np.random.random(ngroups) * 0.1,
                                   np.random.random(ngroups) * 10.0])
+        elif dtype == 'object':
+            key = ['foo'] * size

         df = DataFrame({'values': values, 'key': key})
         self.df_groupby_method = getattr(df.groupby('key')['values'], method)
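
The ``method_blacklist`` lookup works together with airspeed velocity's convention that a ``NotImplementedError`` raised in ``setup`` skips that parameter combination rather than failing it, so unsupported object-dtype methods are simply excluded from the run. A minimal standalone sketch of the same lookup pattern (illustrative only, independent of asv):

    # Sketch of the skip pattern used above; asv interprets NotImplementedError
    # raised in setup() as "skip this parameter combination".
    method_blacklist = {
        'object': {'median', 'prod', 'sem', 'cumsum', 'sum', 'cummin', 'mean',
                   'max', 'skew', 'cumprod', 'cummax', 'rank', 'pct_change',
                   'min', 'var', 'mad', 'describe', 'std'},
    }

    def should_skip(dtype, method):
        # .get() falls back to an empty set for dtypes without an entry,
        # so the membership test never raises.
        return method in method_blacklist.get(dtype, set())

    assert should_skip('object', 'mean')
    assert not should_skip('int', 'mean')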

ci/lint.sh (+9)

@@ -111,6 +111,15 @@ if [ "$LINT" ]; then
         RET=1
     fi

+    # Check for the following code in the extension array base tests
+    # tm.assert_frame_equal
+    # tm.assert_series_equal
+    grep -r -E --include '*.py' --exclude base.py 'tm.assert_(series|frame)_equal' pandas/tests/extension/base
+
+    if [ $? = "0" ]; then
+        RET=1
+    fi
+
     echo "Check for invalid testing DONE"

     # Check for imports from pandas.core.common instead
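
Note the inverted logic: ``grep`` exits with status 0 when it finds a match, so a hit on ``tm.assert_(series|frame)_equal`` under ``pandas/tests/extension/base`` marks the lint run as failed. The intent, presumably, is that the extension base tests route assertions through overridable class attributes instead of calling ``tm`` directly; a hypothetical Python sketch of that indirection (class names are illustrative, not the exact pandas test classes):

    import pandas.util.testing as tm

    class BaseExtensionTests(object):
        # Assertions exposed as class attributes so third-party ExtensionArray
        # test suites can override how equality is checked.
        assert_series_equal = staticmethod(tm.assert_series_equal)
        assert_frame_equal = staticmethod(tm.assert_frame_equal)

    class MyArrayTests(BaseExtensionTests):
        def check_roundtrip(self, result, expected):
            # Preferred style: go through the class attribute, not tm directly.
            self.assert_series_equal(result, expected)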

ci/requirements-2.7.sh (+1, -1)

@@ -4,4 +4,4 @@ source activate pandas

 echo "install 27"

-conda install -n pandas -c conda-forge feather-format pyarrow=0.4.1 fastparquet
+conda install -n pandas -c conda-forge feather-format pyarrow=0.4.1 jemalloc=4.5.0.post fastparquet

ci/requirements-3.6_DOC.run (+1, -1)

@@ -1,7 +1,7 @@
 ipython
 ipykernel
 ipywidgets
-sphinx=1.5*
+sphinx
 nbconvert
 nbformat
 notebook

ci/requirements_dev.txt (+1, -1)

@@ -7,4 +7,4 @@ pytest>=3.1
 python-dateutil>=2.5.0
 pytz
 setuptools>=3.3
-sphinx=1.5*
+sphinx

doc/source/api.rst (+12, -8)

@@ -6,19 +6,18 @@ API Reference
 *************

 This page gives an overview of all public pandas objects, functions and
-methods. In general, all classes and functions exposed in the top-level
-``pandas.*`` namespace are regarded as public.
+methods. All classes and functions exposed in ``pandas.*`` namespace are public.

-Further some of the subpackages are public, including ``pandas.errors``,
-``pandas.plotting``, and ``pandas.testing``. Certain functions in the
-``pandas.io`` and ``pandas.tseries`` submodules are public as well (those
-mentioned in the documentation). Further, the ``pandas.api.types`` subpackage
-holds some public functions related to data types in pandas.
+Some subpackages are public which include ``pandas.errors``,
+``pandas.plotting``, and ``pandas.testing``. Public functions in
+``pandas.io`` and ``pandas.tseries`` submodules are mentioned in
+the documentation. ``pandas.api.types`` subpackage holds some
+public functions related to data types in pandas.


 .. warning::

-   The ``pandas.core``, ``pandas.compat``, and ``pandas.util`` top-level modules are considered to be PRIVATE. Stability of functionality in those modules in not guaranteed.
+   The ``pandas.core``, ``pandas.compat``, and ``pandas.util`` top-level modules are PRIVATE. Stable functionality in such modules is not guaranteed.


 .. _api.functions:

@@ -2180,8 +2179,12 @@ Computations / Descriptive Stats
 .. autosummary::
    :toctree: generated/

+   GroupBy.all
+   GroupBy.any
+   GroupBy.bfill
    GroupBy.count
    GroupBy.cumcount
+   GroupBy.ffill
    GroupBy.first
    GroupBy.head
    GroupBy.last

@@ -2193,6 +2196,7 @@ Computations / Descriptive Stats
    GroupBy.nth
    GroupBy.ohlc
    GroupBy.prod
+   GroupBy.rank
    GroupBy.size
    GroupBy.sem
    GroupBy.std
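
All of the entries added to the table are existing ``GroupBy`` methods; a quick illustration with made-up data (assuming a pandas version that includes them):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'key': ['a', 'a', 'b', 'b'],
                       'val': [1.0, np.nan, 3.0, 4.0]})
    grouped = df.groupby('key')['val']

    grouped.any()    # per-group reduction: any truthy value?
    grouped.all()    # per-group reduction: all values truthy?
    grouped.ffill()  # forward-fill missing values within each group
    grouped.bfill()  # backward-fill missing values within each group
    grouped.rank()   # rank values within each group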

doc/source/basics.rst (+3, -3)

@@ -746,7 +746,7 @@ What if the function you wish to apply takes its data as, say, the second argume
 In this case, provide ``pipe`` with a tuple of ``(callable, data_keyword)``.
 ``.pipe`` will route the ``DataFrame`` to the argument specified in the tuple.

-For example, we can fit a regression using statsmodels. Their API expects a formula first and a ``DataFrame`` as the second argument, ``data``. We pass in the function, keyword pair ``(sm.poisson, 'data')`` to ``pipe``:
+For example, we can fit a regression using statsmodels. Their API expects a formula first and a ``DataFrame`` as the second argument, ``data``. We pass in the function, keyword pair ``(sm.ols, 'data')`` to ``pipe``:

 .. ipython:: python

@@ -756,7 +756,7 @@ For example, we can fit a regression using statsmodels. Their API expects a form

    (bb.query('h > 0')
     .assign(ln_h = lambda df: np.log(df.h))
-    .pipe((sm.poisson, 'data'), 'hr ~ ln_h + year + g + C(lg)')
+    .pipe((sm.ols, 'data'), 'hr ~ ln_h + year + g + C(lg)')
     .fit()
     .summary()
    )

@@ -2312,4 +2312,4 @@ All NumPy dtypes are subclasses of ``numpy.generic``:
 .. note::

    Pandas also defines the types ``category``, and ``datetime64[ns, tz]``, which are not integrated into the normal
-   NumPy hierarchy and wont show up with the above function.
+   NumPy hierarchy and won't show up with the above function.

doc/source/categorical.rst (+84, -14)

@@ -46,9 +46,14 @@ The categorical data type is useful in the following cases:

 See also the :ref:`API docs on categoricals<api.categorical>`.

+.. _categorical.objectcreation:
+
 Object Creation
 ---------------

+Series Creation
+~~~~~~~~~~~~~~~
+
 Categorical ``Series`` or columns in a ``DataFrame`` can be created in several ways:

 By specifying ``dtype="category"`` when constructing a ``Series``:

@@ -77,7 +82,7 @@ discrete bins. See the :ref:`example on tiling <reshaping.tile.cut>` in the docs
    df['group'] = pd.cut(df.value, range(0, 105, 10), right=False, labels=labels)
    df.head(10)

-By passing a :class:`pandas.Categorical` object to a `Series` or assigning it to a `DataFrame`.
+By passing a :class:`pandas.Categorical` object to a ``Series`` or assigning it to a ``DataFrame``.

 .. ipython:: python

@@ -89,6 +94,55 @@ By passing a :class:`pandas.Categorical` object to a `Series` or assigning it to
    df["B"] = raw_cat
    df

+Categorical data has a specific ``category`` :ref:`dtype <basics.dtypes>`:
+
+.. ipython:: python
+
+   df.dtypes
+
+DataFrame Creation
+~~~~~~~~~~~~~~~~~~
+
+Similar to the previous section where a single column was converted to categorical, all columns in a
+``DataFrame`` can be batch converted to categorical either during or after construction.
+
+This can be done during construction by specifying ``dtype="category"`` in the ``DataFrame`` constructor:
+
+.. ipython:: python
+
+   df = pd.DataFrame({'A': list('abca'), 'B': list('bccd')}, dtype="category")
+   df.dtypes
+
+Note that the categories present in each column differ; the conversion is done column by column, so
+only labels present in a given column are categories:
+
+.. ipython:: python
+
+   df['A']
+   df['B']
+
+
+.. versionadded:: 0.23.0
+
+Analogously, all columns in an existing ``DataFrame`` can be batch converted using :meth:`DataFrame.astype`:
+
+.. ipython:: python
+
+   df = pd.DataFrame({'A': list('abca'), 'B': list('bccd')})
+   df_cat = df.astype('category')
+   df_cat.dtypes
+
+This conversion is likewise done column by column:
+
+.. ipython:: python
+
+   df_cat['A']
+   df_cat['B']
+
+
+Controlling Behavior
+~~~~~~~~~~~~~~~~~~~~
+
 In the examples above where we passed ``dtype='category'``, we used the default
 behavior:

@@ -108,21 +162,36 @@ of :class:`~pandas.api.types.CategoricalDtype`.
    s_cat = s.astype(cat_type)
    s_cat

-Categorical data has a specific ``category`` :ref:`dtype <basics.dtypes>`:
+Similarly, a ``CategoricalDtype`` can be used with a ``DataFrame`` to ensure that categories
+are consistent among all columns.

 .. ipython:: python

-   df.dtypes
+   df = pd.DataFrame({'A': list('abca'), 'B': list('bccd')})
+   cat_type = CategoricalDtype(categories=list('abcd'),
+                               ordered=True)
+   df_cat = df.astype(cat_type)
+   df_cat['A']
+   df_cat['B']

 .. note::

-   In contrast to R's `factor` function, categorical data is not converting input values to
-   strings and categories will end up the same data type as the original values.
+   To perform table-wise conversion, where all labels in the entire ``DataFrame`` are used as
+   categories for each column, the ``categories`` parameter can be determined programmatically by
+   ``categories = pd.unique(df.values.ravel())``.

-.. note::
+If you already have ``codes`` and ``categories``, you can use the
+:func:`~pandas.Categorical.from_codes` constructor to save the factorize step
+during normal constructor mode:

-   In contrast to R's `factor` function, there is currently no way to assign/change labels at
-   creation time. Use `categories` to change the categories after creation time.
+.. ipython:: python
+
+   splitter = np.random.choice([0,1], 5, p=[0.5,0.5])
+   s = pd.Series(pd.Categorical.from_codes(splitter, categories=["train", "test"]))
+
+
+Regaining Original Data
+~~~~~~~~~~~~~~~~~~~~~~~

 To get back to the original ``Series`` or NumPy array, use
 ``Series.astype(original_dtype)`` or ``np.asarray(categorical)``:

@@ -136,14 +205,15 @@ To get back to the original ``Series`` or NumPy array, use
    s2.astype(str)
    np.asarray(s2)

-If you already have `codes` and `categories`, you can use the
-:func:`~pandas.Categorical.from_codes` constructor to save the factorize step
-during normal constructor mode:
+.. note::

-.. ipython:: python
+   In contrast to R's `factor` function, categorical data is not converting input values to
+   strings; categories will end up the same data type as the original values.

-   splitter = np.random.choice([0,1], 5, p=[0.5,0.5])
-   s = pd.Series(pd.Categorical.from_codes(splitter, categories=["train", "test"]))
+.. note::
+
+   In contrast to R's `factor` function, there is currently no way to assign/change labels at
+   creation time. Use `categories` to change the categories after creation time.

 .. _categorical.categoricaldtype:
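
The table-wise conversion mentioned in the new note can be spelled out in a few lines; a sketch under the same assumptions as the surrounding docs (variable names are illustrative):

    import pandas as pd
    from pandas.api.types import CategoricalDtype

    df = pd.DataFrame({'A': list('abca'), 'B': list('bccd')})

    # Use every label appearing anywhere in the frame as the shared categories.
    categories = pd.unique(df.values.ravel())
    shared_dtype = CategoricalDtype(categories=categories)

    df_cat = df.astype(shared_dtype)
    df_cat['A'].cat.categories  # identical categories for every column
    df_cat['B'].cat.categories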

doc/source/dsintro.rst (+14, -23)

@@ -364,6 +364,19 @@ and returns a DataFrame. It operates like the ``DataFrame`` constructor except
 for the ``orient`` parameter which is ``'columns'`` by default, but which can be
 set to ``'index'`` in order to use the dict keys as row labels.

+
+.. ipython:: python
+
+   pd.DataFrame.from_dict(dict([('A', [1, 2, 3]), ('B', [4, 5, 6])]))
+
+If you pass ``orient='index'``, the keys will be the row labels. In this
+case, you can also pass the desired column names:
+
+.. ipython:: python
+
+   pd.DataFrame.from_dict(dict([('A', [1, 2, 3]), ('B', [4, 5, 6])]),
+                          orient='index', columns=['one', 'two', 'three'])
+
 .. _basics.dataframe.from_records:

 **DataFrame.from_records**

@@ -378,28 +391,6 @@ dtype. For example:
    data
    pd.DataFrame.from_records(data, index='C')

-.. _basics.dataframe.from_items:
-
-**DataFrame.from_items**
-
-``DataFrame.from_items`` works analogously to the form of the ``dict``
-constructor that takes a sequence of ``(key, value)`` pairs, where the keys are
-column (or row, in the case of ``orient='index'``) names, and the value are the
-column values (or row values). This can be useful for constructing a DataFrame
-with the columns in a particular order without having to pass an explicit list
-of columns:
-
-.. ipython:: python
-
-   pd.DataFrame.from_items([('A', [1, 2, 3]), ('B', [4, 5, 6])])
-
-If you pass ``orient='index'``, the keys will be the row labels. But in this
-case you must also pass the desired column names:
-
-.. ipython:: python
-
-   pd.DataFrame.from_items([('A', [1, 2, 3]), ('B', [4, 5, 6])],
-                           orient='index', columns=['one', 'two', 'three'])

 Column selection, addition, deletion
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

@@ -539,7 +530,7 @@ To write code compatible with all versions of Python, split the assignment in tw
 you'll need to take care when passing ``assign`` expressions that

 * Updating an existing column
-* Refering to the newly updated column in the same ``assign``
+* Referring to the newly updated column in the same ``assign``

 For example, we'll update column "A" and then refer to it when creating "B".
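
The removed ``DataFrame.from_items`` examples have a natural replacement in the ``from_dict`` examples added above; a hedged sketch (assuming a pandas version where ``from_dict`` accepts ``columns``, as the new docs show) of keeping columns in a particular order with an ordered mapping:

    from collections import OrderedDict
    import pandas as pd

    # An ordered mapping reproduces the "columns in a particular order"
    # behaviour that the removed from_items examples demonstrated.
    data = OrderedDict([('A', [1, 2, 3]), ('B', [4, 5, 6])])

    pd.DataFrame.from_dict(data)
    pd.DataFrame.from_dict(data, orient='index', columns=['one', 'two', 'three'])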
