
Commit 4540878

Merge branch 'master' into issue-19342

2 parents: 2b96919 + 92dbc78

30 files changed: +888, -500 lines

asv_bench/benchmarks/groupby.py (+5, -5)

@@ -370,11 +370,11 @@ class GroupByMethods(object):

     param_names = ['dtype', 'method']
     params = [['int', 'float'],
-              ['all', 'any', 'count', 'cumcount', 'cummax', 'cummin',
-               'cumprod', 'cumsum', 'describe', 'first', 'head', 'last', 'mad',
-               'max', 'min', 'median', 'mean', 'nunique', 'pct_change', 'prod',
-               'rank', 'sem', 'shift', 'size', 'skew', 'std', 'sum', 'tail',
-               'unique', 'value_counts', 'var']]
+              ['all', 'any', 'bfill', 'count', 'cumcount', 'cummax', 'cummin',
+               'cumprod', 'cumsum', 'describe', 'ffill', 'first', 'head',
+               'last', 'mad', 'max', 'min', 'median', 'mean', 'nunique',
+               'pct_change', 'prod', 'rank', 'sem', 'shift', 'size', 'skew',
+               'std', 'sum', 'tail', 'unique', 'value_counts', 'var']]

     def setup(self, dtype, method):
         ngroups = 1000
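The benchmark above parameterizes over method names (now including ``bfill`` and ``ffill``) and dispatches to each by name. A minimal pure-Python sketch of that dispatch pattern — ``DummyGroupBy`` and ``run_benchmark`` are hypothetical stand-ins, not pandas or asv API:

```python
# Hypothetical stand-in for the object being benchmarked; the real
# suite calls these methods on a pandas GroupBy built in setup().
class DummyGroupBy:
    def ffill(self):
        return "filled forward"

    def bfill(self):
        return "filled backward"


def run_benchmark(obj, method):
    # asv-style dispatch: resolve the method by its parameter name,
    # then call it (the call is the timed portion in a real benchmark)
    func = getattr(obj, method)
    return func()


results = {m: run_benchmark(DummyGroupBy(), m) for m in ["ffill", "bfill"]}
```

This is why adding a string to ``params`` is enough to benchmark a new method: no per-method timing function is needed.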

ci/requirements-2.7.sh (+1, -1)

@@ -4,4 +4,4 @@ source activate pandas

 echo "install 27"

-conda install -n pandas -c conda-forge feather-format pyarrow=0.4.1 fastparquet
+conda install -n pandas -c conda-forge feather-format pyarrow=0.4.1 jemalloc=4.5.0.post fastparquet

doc/source/tutorials.rst (+11, -9)

@@ -26,32 +26,34 @@ repository <http://github.com/jvns/pandas-cookbook>`_. To run the examples in th
 clone the GitHub repository and get IPython Notebook running.
 See `How to use this cookbook <https://github.com/jvns/pandas-cookbook#how-to-use-this-cookbook>`_.

-- `A quick tour of the IPython Notebook: <http://nbviewer.ipython.org/github/jvns/pandas-cookbook/blob/v0.2/cookbook/A%20quick%20tour%20of%20IPython%20Notebook.ipynb>`_
+- `A quick tour of the IPython Notebook: <http://nbviewer.jupyter.org/github/jvns/pandas-cookbook/blob/v0.2/cookbook/A%20quick%20tour%20of%20IPython%20Notebook.ipynb>`_
   Shows off IPython's awesome tab completion and magic functions.
-- `Chapter 1: <http://nbviewer.ipython.org/github/jvns/pandas-cookbook/blob/v0.2/cookbook/Chapter%201%20-%20Reading%20from%20a%20CSV.ipynb>`_
+- `Chapter 1: <http://nbviewer.jupyter.org/github/jvns/pandas-cookbook/blob/v0.2/cookbook/Chapter%201%20-%20Reading%20from%20a%20CSV.ipynb>`_
   Reading your data into pandas is pretty much the easiest thing. Even
   when the encoding is wrong!
-- `Chapter 2: <http://nbviewer.ipython.org/github/jvns/pandas-cookbook/blob/v0.2/cookbook/Chapter%202%20-%20Selecting%20data%20&%20finding%20the%20most%20common%20complaint%20type.ipynb>`_
+- `Chapter 2: <http://nbviewer.jupyter.org/github/jvns/pandas-cookbook/blob/v0.2/cookbook/Chapter%202%20-%20Selecting%20data%20%26%20finding%20the%20most%20common%20complaint%20type.ipynb>`_
   It's not totally obvious how to select data from a pandas dataframe.
   Here we explain the basics (how to take slices and get columns)
-- `Chapter 3: <http://nbviewer.ipython.org/github/jvns/pandas-cookbook/blob/v0.2/cookbook/Chapter%203%20-%20Which%20borough%20has%20the%20most%20noise%20complaints%3F%20%28or%2C%20more%20selecting%20data%29.ipynb>`_
+- `Chapter 3: <http://nbviewer.jupyter.org/github/jvns/pandas-cookbook/blob/v0.2/cookbook/Chapter%203%20-%20Which%20borough%20has%20the%20most%20noise%20complaints%20%28or%2C%20more%20selecting%20data%29.ipynb>`_
   Here we get into serious slicing and dicing and learn how to filter
   dataframes in complicated ways, really fast.
-- `Chapter 4: <http://nbviewer.ipython.org/github/jvns/pandas-cookbook/blob/v0.2/cookbook/Chapter%204%20-%20Find%20out%20on%20which%20weekday%20people%20bike%20the%20most%20with%20groupby%20and%20aggregate.ipynb>`_
+- `Chapter 4: <http://nbviewer.jupyter.org/github/jvns/pandas-cookbook/blob/v0.2/cookbook/Chapter%204%20-%20Find%20out%20on%20which%20weekday%20people%20bike%20the%20most%20with%20groupby%20and%20aggregate.ipynb>`_
   Groupby/aggregate is seriously my favorite thing about pandas
   and I use it all the time. You should probably read this.
-- `Chapter 5: <http://nbviewer.ipython.org/github/jvns/pandas-cookbook/blob/v0.2/cookbook/Chapter%205%20-%20Combining%20dataframes%20and%20scraping%20Canadian%20weather%20data.ipynb>`_
+- `Chapter 5: <http://nbviewer.jupyter.org/github/jvns/pandas-cookbook/blob/v0.2/cookbook/Chapter%205%20-%20Combining%20dataframes%20and%20scraping%20Canadian%20weather%20data.ipynb>`_
   Here you get to find out if it's cold in Montreal in the winter
   (spoiler: yes). Web scraping with pandas is fun! Here we combine dataframes.
-- `Chapter 6: <http://nbviewer.ipython.org/github/jvns/pandas-cookbook/blob/v0.2/cookbook/Chapter%206%20-%20String%20operations%21%20Which%20month%20was%20the%20snowiest%3F.ipynb>`_
+- `Chapter 6: <http://nbviewer.jupyter.org/github/jvns/pandas-cookbook/blob/v0.2/cookbook/Chapter%206%20-%20String%20Operations-%20Which%20month%20was%20the%20snowiest.ipynb>`_
   Strings with pandas are great. It has all these vectorized string
   operations and they're the best. We will turn a bunch of strings
   containing "Snow" into vectors of numbers in a trice.
-- `Chapter 7: <http://nbviewer.ipython.org/github/jvns/pandas-cookbook/blob/v0.2/cookbook/Chapter%207%20-%20Cleaning%20up%20messy%20data.ipynb>`_
+- `Chapter 7: <http://nbviewer.jupyter.org/github/jvns/pandas-cookbook/blob/v0.2/cookbook/Chapter%207%20-%20Cleaning%20up%20messy%20data.ipynb>`_
   Cleaning up messy data is never a joy, but with pandas it's easier.
-- `Chapter 8: <http://nbviewer.ipython.org/github/jvns/pandas-cookbook/blob/v0.2/cookbook/Chapter%208%20-%20How%20to%20deal%20with%20timestamps.ipynb>`_
+- `Chapter 8: <http://nbviewer.jupyter.org/github/jvns/pandas-cookbook/blob/v0.2/cookbook/Chapter%208%20-%20How%20to%20deal%20with%20timestamps.ipynb>`_
   Parsing Unix timestamps is confusing at first but it turns out
   to be really easy.
+- `Chapter 9: <http://nbviewer.jupyter.org/github/jvns/pandas-cookbook/blob/v0.2/cookbook/Chapter%209%20-%20Loading%20data%20from%20SQL%20databases.ipynb>`_
+  Reading data from SQL databases.


 Lessons for new pandas users
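Besides moving to the jupyter.org host, the URL changes above replace raw reserved characters (such as ``&``) with their percent-encoded forms (``%26``). The standard library can demonstrate that encoding — a small illustration, not part of the commit:

```python
from urllib.parse import quote, unquote

# '&' is a reserved character in URLs; with an empty safe set,
# quote() percent-encodes it as %26 and spaces as %20
encoded = quote("Selecting data & finding", safe="")

# unquote() reverses the encoding, so the old and new URL spellings
# in the diff above name the same notebook
decoded = unquote("Selecting%20data%20%26%20finding")
```

An unencoded ``&`` can be misread as a query-string separator by some servers, which is presumably why the links were tightened.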

doc/source/whatsnew/v0.23.0.txt (+14, -2)

@@ -214,13 +214,22 @@ Please note that the string `index` is not supported with the round trip format,
    :okwarning:

    df.index.name = 'index'
+
    df.to_json('test.json', orient='table')
    new_df = pd.read_json('test.json', orient='table')
    new_df
-   print(new_df.index.name)
+   new_df.dtypes
+
+.. ipython:: python
+   :suppress:
+
+   import os
+   os.remove('test.json')
+

 .. _whatsnew_0230.enhancements.assign_dependent:

+
 ``.assign()`` accepts dependent arguments
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

@@ -689,6 +698,7 @@ Performance Improvements
 - Improved performance of pairwise ``.rolling()`` and ``.expanding()`` with ``.cov()`` and ``.corr()`` operations (:issue:`17917`)
 - Improved performance of :func:`DataFrameGroupBy.rank` (:issue:`15779`)
 - Improved performance of variable ``.rolling()`` on ``.min()`` and ``.max()`` (:issue:`19521`)
+- Improved performance of ``GroupBy.ffill`` and ``GroupBy.bfill`` (:issue:`11296`)

 .. _whatsnew_0230.docs:

@@ -755,7 +765,7 @@ Datetimelike
 - Bug in :func:`Timestamp.floor` :func:`DatetimeIndex.floor` where time stamps far in the future and past were not rounded correctly (:issue:`19206`)
 - Bug in :func:`to_datetime` where passing an out-of-bounds datetime with ``errors='coerce'`` and ``utc=True`` would raise ``OutOfBoundsDatetime`` instead of parsing to ``NaT`` (:issue:`19612`)
 - Bug in :class:`DatetimeIndex` and :class:`TimedeltaIndex` addition and subtraction where name of the returned object was not always set consistently. (:issue:`19744`)
--
+- Bug in :class:`DatetimeIndex` and :class:`TimedeltaIndex` addition and subtraction where operations with numpy arrays raised ``TypeError`` (:issue:`19847`)

 Timedelta
 ^^^^^^^^^

@@ -807,6 +817,7 @@ Numeric
 - Bug in :class:`Index` constructor with ``dtype='uint64'`` where int-like floats were not coerced to :class:`UInt64Index` (:issue:`18400`)
 - Bug in :class:`DataFrame` flex arithmetic (e.g. ``df.add(other, fill_value=foo)``) with a ``fill_value`` other than ``None`` failed to raise ``NotImplementedError`` in corner cases where either the frame or ``other`` has length zero (:issue:`19522`)
 - Multiplication and division of numeric-dtyped :class:`Index` objects with timedelta-like scalars returns ``TimedeltaIndex`` instead of raising ``TypeError`` (:issue:`19333`)
+- Bug where ``NaN`` was returned instead of 0 by :func:`Series.pct_change` and :func:`DataFrame.pct_change` when a ``fill_method`` other than ``None`` is provided (:issue:`19873`)


 Indexing

@@ -907,6 +918,7 @@ Reshaping
 - Comparisons between :class:`Series` and :class:`Index` would return a ``Series`` with an incorrect name, ignoring the ``Index``'s name attribute (:issue:`19582`)
 - Bug in :func:`qcut` where datetime and timedelta data with ``NaT`` present raised a ``ValueError`` (:issue:`19768`)
 - Bug in :class:`Series` constructor with ``Categorical`` where a ``ValueError`` is not raised when an index of different length is given (:issue:`19342`)
+- Bug in :func:`DataFrame.iterrows`, which would infer strings not compliant with `ISO 8601 <https://en.wikipedia.org/wiki/ISO_8601>`_ as datetimes (:issue:`19671`)

 Other
 ^^^^^
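One of the fixes noted above concerns ``pct_change`` with a fill method: once missing values have been padded, a value compared against its filled predecessor should yield a change of 0, not ``NaN``. A pure-Python sketch of those semantics — ``pct_change_padded`` is an illustrative helper, not the pandas implementation:

```python
def pct_change_padded(values):
    """Forward-fill None entries, then compute (current / previous) - 1.

    Mirrors pct_change(fill_method='pad') semantics: after a gap is
    filled, the change over the filled span is 0.0 rather than missing.
    """
    # forward-fill pass: replace None with the last seen value
    filled, last = [], None
    for v in values:
        if v is None:
            v = last
        filled.append(v)
        last = v

    # percent-change pass over the filled series
    out, prev = [], None
    for v in filled:
        if prev is None or v is None:
            out.append(None)  # no prior value to compare against
        else:
            out.append(v / prev - 1)
        prev = v
    return out
```

For ``[4, None, 6]`` the padded series is ``[4, 4, 6]``, so the changes are ``[None, 0.0, 0.5]``; before the fix the middle entry leaked through as missing.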

pandas/_libs/groupby.pyx (+216)

@@ -94,5 +94,221 @@ cdef inline float64_t kth_smallest_c(float64_t* a,
     return a[k]


+@cython.boundscheck(False)
+@cython.wraparound(False)
+def group_median_float64(ndarray[float64_t, ndim=2] out,
+                         ndarray[int64_t] counts,
+                         ndarray[float64_t, ndim=2] values,
+                         ndarray[int64_t] labels,
+                         Py_ssize_t min_count=-1):
+    """
+    Only aggregates on axis=0
+    """
+    cdef:
+        Py_ssize_t i, j, N, K, ngroups, size
+        ndarray[int64_t] _counts
+        ndarray data
+        float64_t* ptr
+
+    assert min_count == -1, "'min_count' only used in add and prod"
+
+    ngroups = len(counts)
+    N, K = (<object> values).shape
+
+    indexer, _counts = groupsort_indexer(labels, ngroups)
+    counts[:] = _counts[1:]
+
+    data = np.empty((K, N), dtype=np.float64)
+    ptr = <float64_t*> data.data
+
+    take_2d_axis1_float64_float64(values.T, indexer, out=data)
+
+    with nogil:
+
+        for i in range(K):
+            # exclude NA group
+            ptr += _counts[0]
+            for j in range(ngroups):
+                size = _counts[j + 1]
+                out[j, i] = median_linear(ptr, size)
+                ptr += size
+
+
+@cython.boundscheck(False)
+@cython.wraparound(False)
+def group_cumprod_float64(float64_t[:, :] out,
+                          float64_t[:, :] values,
+                          int64_t[:] labels,
+                          bint is_datetimelike):
+    """
+    Only transforms on axis=0
+    """
+    cdef:
+        Py_ssize_t i, j, N, K, size
+        float64_t val
+        float64_t[:, :] accum
+        int64_t lab
+
+    N, K = (<object> values).shape
+    accum = np.ones_like(values)
+
+    with nogil:
+        for i in range(N):
+            lab = labels[i]
+
+            if lab < 0:
+                continue
+            for j in range(K):
+                val = values[i, j]
+                if val == val:
+                    accum[lab, j] *= val
+                    out[i, j] = accum[lab, j]
+
+
+@cython.boundscheck(False)
+@cython.wraparound(False)
+def group_cumsum(numeric[:, :] out,
+                 numeric[:, :] values,
+                 int64_t[:] labels,
+                 is_datetimelike):
+    """
+    Only transforms on axis=0
+    """
+    cdef:
+        Py_ssize_t i, j, N, K, size
+        numeric val
+        numeric[:, :] accum
+        int64_t lab
+
+    N, K = (<object> values).shape
+    accum = np.zeros_like(values)
+
+    with nogil:
+        for i in range(N):
+            lab = labels[i]
+
+            if lab < 0:
+                continue
+            for j in range(K):
+                val = values[i, j]
+
+                if numeric == float32_t or numeric == float64_t:
+                    if val == val:
+                        accum[lab, j] += val
+                        out[i, j] = accum[lab, j]
+                else:
+                    accum[lab, j] += val
+                    out[i, j] = accum[lab, j]
+
+
+@cython.boundscheck(False)
+@cython.wraparound(False)
+def group_shift_indexer(ndarray[int64_t] out, ndarray[int64_t] labels,
+                        int ngroups, int periods):
+    cdef:
+        Py_ssize_t N, i, j, ii
+        int offset, sign
+        int64_t lab, idxer, idxer_slot
+        int64_t[:] label_seen = np.zeros(ngroups, dtype=np.int64)
+        int64_t[:, :] label_indexer
+
+    N, = (<object> labels).shape
+
+    if periods < 0:
+        periods = -periods
+        offset = N - 1
+        sign = -1
+    elif periods > 0:
+        offset = 0
+        sign = 1
+
+    if periods == 0:
+        with nogil:
+            for i in range(N):
+                out[i] = i
+    else:
+        # array of each previous indexer seen
+        label_indexer = np.zeros((ngroups, periods), dtype=np.int64)
+        with nogil:
+            for i in range(N):
+                # reverse iterator if shifting backwards
+                ii = offset + sign * i
+                lab = labels[ii]
+
+                # Skip null keys
+                if lab == -1:
+                    out[ii] = -1
+                    continue
+
+                label_seen[lab] += 1
+
+                idxer_slot = label_seen[lab] % periods
+                idxer = label_indexer[lab, idxer_slot]
+
+                if label_seen[lab] > periods:
+                    out[ii] = idxer
+                else:
+                    out[ii] = -1
+
+                label_indexer[lab, idxer_slot] = ii
+
+
+@cython.wraparound(False)
+@cython.boundscheck(False)
+def group_fillna_indexer(ndarray[int64_t] out, ndarray[int64_t] labels,
+                         ndarray[uint8_t] mask, object direction,
+                         int64_t limit):
+    """Indexes how to fill values forwards or backwards within a group
+
+    Parameters
+    ----------
+    out : array of int64_t values which this method will write its results to
+        Missing values will be written to with a value of -1
+    labels : array containing unique label for each group, with its ordering
+        matching up to the corresponding record in `values`
+    mask : array of uint8_t values where a 1 indicates a missing value
+    direction : {'ffill', 'bfill'}
+        Direction for fill to be applied (forwards or backwards, respectively)
+    limit : Consecutive values to fill before stopping, or -1 for no limit
+
+    Notes
+    -----
+    This method modifies the `out` parameter rather than returning an object
+    """
+    cdef:
+        Py_ssize_t i, N
+        ndarray[int64_t] sorted_labels
+        int64_t idx, curr_fill_idx=-1, filled_vals=0
+
+    N = len(out)
+
+    # Make sure all arrays are the same size
+    assert N == len(labels) == len(mask)
+
+    sorted_labels = np.argsort(labels).astype(np.int64, copy=False)
+    if direction == 'bfill':
+        sorted_labels = sorted_labels[::-1]
+
+    with nogil:
+        for i in range(N):
+            idx = sorted_labels[i]
+            if mask[idx] == 1:  # is missing
+                # Stop filling once we've hit the limit
+                if filled_vals >= limit and limit != -1:
+                    curr_fill_idx = -1
+                filled_vals += 1
+            else:  # reset items when not missing
+                filled_vals = 0
+                curr_fill_idx = idx
+
+            out[idx] = curr_fill_idx
+
+            # If we move to the next group, reset
+            # the fill_idx and counter
+            if i == N - 1 or labels[idx] != labels[sorted_labels[i + 1]]:
+                curr_fill_idx = -1
+                filled_vals = 0
+
+
 # generated from template
 include "groupby_helper.pxi"
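The logic of ``group_fillna_indexer`` above can be traced in plain Python. This sketch (``group_fillna_indexer_py`` is an illustrative name, not pandas API) makes the same single pass over a stable sort of the labels, producing for each position the index to fill from, with -1 meaning "leave missing":

```python
def group_fillna_indexer_py(labels, mask, direction="ffill", limit=-1):
    # out[i] is the index whose value position i should take; -1 = stay NA
    n = len(labels)
    out = [-1] * n

    # stable sort keeps within-group order; reverse it for backfill
    order = sorted(range(n), key=lambda i: labels[i])
    if direction == "bfill":
        order = order[::-1]

    curr_fill_idx, filled_vals = -1, 0
    for i, idx in enumerate(order):
        if mask[idx]:  # missing value
            if limit != -1 and filled_vals >= limit:
                curr_fill_idx = -1  # limit exhausted: stop filling
            filled_vals += 1
        else:  # valid value: remember it and reset the limit counter
            filled_vals = 0
            curr_fill_idx = idx
        out[idx] = curr_fill_idx

        # reset state when the next sorted position starts a new group
        if i == n - 1 or labels[idx] != labels[order[i + 1]]:
            curr_fill_idx = -1
            filled_vals = 0
    return out
```

Computing only an indexer, and doing it in one pass over pre-sorted labels, is what makes the ``GroupBy.ffill``/``GroupBy.bfill`` speedup in this commit possible: the actual value movement becomes a single vectorized take.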
