Commit e19bf6f
Merge remote-tracking branch 'upstream/master' into to_html-to_string
* upstream/master:
  BUG: Casting tz-aware DatetimeIndex to object-dtype ndarray/Index (pandas-dev#23524)
  BUG: Delegate more of Excel parsing to CSV (pandas-dev#23544)
  API: DataFrame.__getitem__ returns Series for sparse column (pandas-dev#23561)
  CLN: use float64_t consistently instead of double, double_t (pandas-dev#23583)
  DOC: Fix Order of parameters in docstrings (pandas-dev#23611)
  TST: Unskip some Categorical Tests (pandas-dev#23613)
  TST: Fix integer ops comparison test (pandas-dev#23619)
  DOC: Fixes to docstring to add validation to CI (pandas-dev#23560)
  DOC: Remove incorrect periods at the end of parameter types (pandas-dev#23600)
  MAINT: tm.assert_raises_regex --> pytest.raises (pandas-dev#23592)
  DOC: Updating Series.resample and DataFrame.resample docstrings (pandas-dev#23197)
2 parents b87dc8c + 58a59bd commit e19bf6f

273 files changed: +3327 -3124 lines changed

ci/code_checks.sh (+1 -1)

@@ -151,7 +151,7 @@ if [[ -z "$CHECK" || "$CHECK" == "doctests" ]]; then
 
     MSG='Doctests generic.py' ; echo $MSG
     pytest -q --doctest-modules pandas/core/generic.py \
-        -k"-_set_axis_name -_xs -describe -droplevel -groupby -interpolate -pct_change -pipe -reindex -reindex_axis -resample -to_json -transpose -values -xs"
+        -k"-_set_axis_name -_xs -describe -droplevel -groupby -interpolate -pct_change -pipe -reindex -reindex_axis -to_json -transpose -values -xs"
     RET=$(($RET + $?)) ; echo $MSG "DONE"
 
     MSG='Doctests top-level reshaping functions' ; echo $MSG

doc/source/io.rst (+28 -1)

@@ -2861,7 +2861,13 @@ to be parsed.
 
     read_excel('path_to_file.xls', 'Sheet1', usecols=2)
 
-If `usecols` is a list of integers, then it is assumed to be the file column
+You can also specify a comma-delimited set of Excel columns and ranges as a string:
+
+.. code-block:: python
+
+    read_excel('path_to_file.xls', 'Sheet1', usecols='A,C:E')
+
+If ``usecols`` is a list of integers, then it is assumed to be the file column
 indices to be parsed.
 
 .. code-block:: python
@@ -2870,6 +2876,27 @@ indices to be parsed.
 
 Element order is ignored, so ``usecols=[0, 1]`` is the same as ``[1, 0]``.
 
+.. versionadded:: 0.24
+
+If ``usecols`` is a list of strings, it is assumed that each string corresponds
+to a column name provided either by the user in ``names`` or inferred from the
+document header row(s). Those strings define which columns will be parsed:
+
+.. code-block:: python
+
+    read_excel('path_to_file.xls', 'Sheet1', usecols=['foo', 'bar'])
+
+Element order is ignored, so ``usecols=['baz', 'joe']`` is the same as ``['joe', 'baz']``.
+
+.. versionadded:: 0.24
+
+If ``usecols`` is callable, the callable function will be evaluated against
+the column names, returning names where the callable function evaluates to ``True``.
+
+.. code-block:: python
+
+    read_excel('path_to_file.xls', 'Sheet1', usecols=lambda x: x.isalpha())
+
 Parsing Dates
 +++++++++++++
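The string form of ``usecols`` shown above maps Excel-style column letters and ranges to 0-based positions. A minimal pure-Python sketch of that mapping (the helper name is hypothetical; pandas performs this translation internally):

```python
def excel_cols_to_indices(spec):
    """Translate an Excel column spec like 'A,C:E' into 0-based indices.

    Illustrative sketch only; not pandas' actual implementation.
    """
    def col_to_idx(col):
        # Excel letters are base-26 with no zero digit: A=1, ..., Z=26, AA=27
        idx = 0
        for ch in col.strip().upper():
            idx = idx * 26 + (ord(ch) - ord('A') + 1)
        return idx - 1  # convert to 0-based

    indices = []
    for part in spec.split(','):
        if ':' in part:
            start, end = part.split(':')
            indices.extend(range(col_to_idx(start), col_to_idx(end) + 1))
        else:
            indices.append(col_to_idx(part))
    return indices
```

For example, ``'A,C:E'`` selects the first column plus the third through fifth.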

doc/source/whatsnew/v0.24.0.txt (+8)

@@ -239,6 +239,7 @@ Other Enhancements
 - Added :meth:`Interval.overlaps`, :meth:`IntervalArray.overlaps`, and :meth:`IntervalIndex.overlaps` for determining overlaps between interval-like objects (:issue:`21998`)
 - :func:`~DataFrame.to_parquet` now supports writing a ``DataFrame`` as a directory of parquet files partitioned by a subset of the columns when ``engine = 'pyarrow'`` (:issue:`23283`)
 - :meth:`Timestamp.tz_localize`, :meth:`DatetimeIndex.tz_localize`, and :meth:`Series.tz_localize` have gained the ``nonexistent`` argument for alternative handling of nonexistent times. See :ref:`timeseries.timezone_nonexsistent` (:issue:`8917`)
+- :meth:`read_excel()` now accepts ``usecols`` as a list of column names or callable (:issue:`18273`)
 
 .. _whatsnew_0240.api_breaking:

@@ -563,6 +564,7 @@ changes were made:
 - The result of concatenating a mix of sparse and dense Series is a Series with sparse values, rather than a ``SparseSeries``.
 - ``SparseDataFrame.combine`` and ``DataFrame.combine_first`` no longer supports combining a sparse column with a dense column while preserving the sparse subtype. The result will be an object-dtype SparseArray.
 - Setting :attr:`SparseArray.fill_value` to a fill value with a different dtype is now allowed.
+- ``DataFrame[column]`` is now a :class:`Series` with sparse values, rather than a :class:`SparseSeries`, when slicing a single column with sparse values (:issue:`23559`).
 
 Some new warnings are issued for operations that require or are likely to materialize a large dense array:

@@ -1128,6 +1130,9 @@ Datetimelike
 - Bug in :class:`PeriodIndex` with attribute ``freq.n`` greater than 1 where adding a :class:`DateOffset` object would return incorrect results (:issue:`23215`)
 - Bug in :class:`Series` that interpreted string indices as lists of characters when setting datetimelike values (:issue:`23451`)
 - Bug in :class:`Timestamp` constructor which would drop the frequency of an input :class:`Timestamp` (:issue:`22311`)
+- Bug in :class:`DatetimeIndex` where calling ``np.array(dtindex, dtype=object)`` would incorrectly return an array of ``long`` objects (:issue:`23524`)
+- Bug in :class:`Index` where passing a timezone-aware :class:`DatetimeIndex` and `dtype=object` would incorrectly raise a ``ValueError`` (:issue:`23524`)
+- Bug in :class:`Index` where calling ``np.array(dtindex, dtype=object)`` on a timezone-naive :class:`DatetimeIndex` would return an array of ``datetime`` objects instead of :class:`Timestamp` objects, potentially losing nanosecond portions of the timestamps (:issue:`23524`)
 
 Timedelta
 ^^^^^^^^^

@@ -1174,6 +1179,7 @@ Offsets
 - Bug in :class:`FY5253` where date offsets could incorrectly raise an ``AssertionError`` in arithmetic operatons (:issue:`14774`)
 - Bug in :class:`DateOffset` where keyword arguments ``week`` and ``milliseconds`` were accepted and ignored. Passing these will now raise ``ValueError`` (:issue:`19398`)
 - Bug in adding :class:`DateOffset` with :class:`DataFrame` or :class:`PeriodIndex` incorrectly raising ``TypeError`` (:issue:`23215`)
+- Bug in comparing :class:`DateOffset` objects with non-DateOffset objects, particularly strings, raising ``ValueError`` instead of returning ``False`` for equality checks and ``True`` for not-equal checks (:issue:`23524`)
 
 Numeric
 ^^^^^^^

@@ -1301,6 +1307,8 @@ Notice how we now instead output ``np.nan`` itself instead of a stringified form
 - Bug in :meth:`HDFStore.append` when appending a :class:`DataFrame` with an empty string column and ``min_itemsize`` < 8 (:issue:`12242`)
 - Bug in :meth:`read_csv()` in which :class:`MultiIndex` index names were being improperly handled in the cases when they were not provided (:issue:`23484`)
 - Bug in :meth:`read_html()` in which the error message was not displaying the valid flavors when an invalid one was provided (:issue:`23549`)
+- Bug in :meth:`read_excel()` in which ``index_col=None`` was not being respected and parsing index columns anyway (:issue:`20480`)
+- Bug in :meth:`read_excel()` in which ``usecols`` was not being validated for proper column names when passed in as a string (:issue:`20480`)
 
 Plotting
 ^^^^^^^^

pandas/_libs/algos.pxd (-3)

@@ -1,9 +1,6 @@
 from util cimport numeric
 
 
-cpdef numeric kth_smallest(numeric[:] a, Py_ssize_t k) nogil
-
-
 cdef inline Py_ssize_t swap(numeric *a, numeric *b) nogil:
     cdef:
         numeric t
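The ``kth_smallest`` declaration removed here points at a classic in-place selection routine (quickselect-style partitioning, as in Wirth's algorithm). For orientation, a pure-Python sketch of the same partitioning logic; a sketch, not the Cython source:

```python
def kth_smallest(a, k):
    """Return the k-th smallest element of `a` (0-based), partially
    sorting the list in place via repeated partitioning."""
    l, m = 0, len(a) - 1
    while l < m:
        x = a[k]          # pivot
        i, j = l, m
        while True:
            # move inward past elements already on the correct side
            while a[i] < x:
                i += 1
            while x < a[j]:
                j -= 1
            if i <= j:
                a[i], a[j] = a[j], a[i]
                i += 1
                j -= 1
            if i > j:
                break
        # shrink the search window to the half containing index k
        if j < k:
            l = i
        if k < i:
            m = j
    return a[k]
```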

pandas/_libs/algos.pyx (+8 -10)

@@ -15,8 +15,7 @@ from numpy cimport (ndarray,
                     NPY_FLOAT32, NPY_FLOAT64,
                     NPY_OBJECT,
                     int8_t, int16_t, int32_t, int64_t, uint8_t, uint16_t,
-                    uint32_t, uint64_t, float32_t, float64_t,
-                    double_t)
+                    uint32_t, uint64_t, float32_t, float64_t)
 cnp.import_array()
 

@@ -32,10 +31,9 @@ import missing
 
 cdef float64_t FP_ERR = 1e-13
 
-cdef double NaN = <double>np.NaN
-cdef double nan = NaN
+cdef float64_t NaN = <float64_t>np.NaN
 
-cdef int64_t iNaT = get_nat()
+cdef int64_t NPY_NAT = get_nat()
 
 tiebreakers = {
     'average': TIEBREAK_AVERAGE,

@@ -199,7 +197,7 @@ def groupsort_indexer(ndarray[int64_t] index, Py_ssize_t ngroups):
 
 @cython.boundscheck(False)
 @cython.wraparound(False)
-cpdef numeric kth_smallest(numeric[:] a, Py_ssize_t k) nogil:
+def kth_smallest(numeric[:] a, Py_ssize_t k) -> numeric:
    cdef:
        Py_ssize_t i, j, l, m, n = a.shape[0]
        numeric x

@@ -812,23 +810,23 @@ def is_monotonic(ndarray[algos_t, ndim=1] arr, bint timelike):
     n = len(arr)
 
     if n == 1:
-        if arr[0] != arr[0] or (timelike and <int64_t>arr[0] == iNaT):
+        if arr[0] != arr[0] or (timelike and <int64_t>arr[0] == NPY_NAT):
             # single value is NaN
             return False, False, True
         else:
             return True, True, True
     elif n < 2:
         return True, True, True
 
-    if timelike and <int64_t>arr[0] == iNaT:
+    if timelike and <int64_t>arr[0] == NPY_NAT:
         return False, False, True
 
     if algos_t is not object:
         with nogil:
             prev = arr[0]
             for i in range(1, n):
                 cur = arr[i]
-                if timelike and <int64_t>cur == iNaT:
+                if timelike and <int64_t>cur == NPY_NAT:
                     is_monotonic_inc = 0
                     is_monotonic_dec = 0
                     break

@@ -853,7 +851,7 @@ def is_monotonic(ndarray[algos_t, ndim=1] arr, bint timelike):
         prev = arr[0]
         for i in range(1, n):
             cur = arr[i]
-            if timelike and <int64_t>cur == iNaT:
+            if timelike and <int64_t>cur == NPY_NAT:
                 is_monotonic_inc = 0
                 is_monotonic_dec = 0
                 break
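The ``NPY_NAT`` checks above encode one rule: any NaT value in a timelike array breaks monotonicity in both directions. A simplified pure-Python sketch of that rule (the real routine also tracks uniqueness as a third return value and has NaN fast paths, both omitted here):

```python
NAT = -2**63  # the int64 sentinel pandas uses for NaT


def is_monotonic(arr, timelike):
    """Return (increasing, decreasing); any NaT forces (False, False)."""
    if timelike and arr and arr[0] == NAT:
        return False, False
    inc = dec = True
    for prev, cur in zip(arr, arr[1:]):
        if timelike and cur == NAT:
            # NaT is unordered, so the array is neither inc nor dec
            return False, False
        if cur > prev:
            dec = False
        elif cur < prev:
            inc = False
    return inc, dec
```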

pandas/_libs/algos_common_helper.pxi.in (+2 -2)

@@ -84,9 +84,9 @@ def put2d_{{name}}_{{dest_name}}(ndarray[{{c_type}}, ndim=2, cast=True] values,
 
 {{endfor}}
 
-#----------------------------------------------------------------------
+# ----------------------------------------------------------------------
 # ensure_dtype
-#----------------------------------------------------------------------
+# ----------------------------------------------------------------------
 
 cdef int PLATFORM_INT = (<ndarray>np.arange(0, dtype=np.intp)).descr.type_num

pandas/_libs/algos_rank_helper.pxi.in (+5 -5)

@@ -74,9 +74,9 @@ def rank_1d_{{dtype}}(object in_arr, ties_method='average',
     {{elif dtype == 'float64'}}
         mask = np.isnan(values)
     {{elif dtype == 'int64'}}
-        mask = values == iNaT
+        mask = values == NPY_NAT
 
-        # create copy in case of iNaT
+        # create copy in case of NPY_NAT
         # values are mutated inplace
         if mask.any():
             values = values.copy()

@@ -149,7 +149,7 @@ def rank_1d_{{dtype}}(object in_arr, ties_method='average',
         {{if dtype != 'uint64'}}
             isnan = sorted_mask[i]
             if isnan and keep_na:
-                ranks[argsorted[i]] = nan
+                ranks[argsorted[i]] = NaN
                 continue
         {{endif}}

@@ -257,7 +257,7 @@ def rank_2d_{{dtype}}(object in_arr, axis=0, ties_method='average',
     {{elif dtype == 'float64'}}
         mask = np.isnan(values)
     {{elif dtype == 'int64'}}
-        mask = values == iNaT
+        mask = values == NPY_NAT
     {{endif}}
 
     np.putmask(values, mask, nan_value)

@@ -317,7 +317,7 @@ def rank_2d_{{dtype}}(object in_arr, axis=0, ties_method='average',
     {{else}}
         if (val == nan_value) and keep_na:
     {{endif}}
-            ranks[i, argsorted[i, j]] = nan
+            ranks[i, argsorted[i, j]] = NaN
 
     {{if dtype == 'object'}}
             infs += 1
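The ``keep_na`` branches touched here assign ``NaN`` as the rank of missing entries instead of ranking them alongside real values. A simplified pure-Python sketch of that behavior (ties resolved by sort order, i.e. the 'first' method; the templated dtypes and 2-D variant are omitted):

```python
import math


def rank_1d(values, keep_na=True):
    """Rank floats 1..n; missing entries get NaN when keep_na is True,
    otherwise they sort last and receive the trailing ranks."""
    out = [math.nan] * len(values)
    order = sorted((v, i) for i, v in enumerate(values) if not math.isnan(v))
    for rank, (_, i) in enumerate(order, start=1):
        out[i] = float(rank)
    if not keep_na:
        # NaNs participate: they rank after every non-missing value
        nan_idx = [i for i, v in enumerate(values) if math.isnan(v)]
        for rank, i in enumerate(nan_idx, start=len(order) + 1):
            out[i] = float(rank)
    return out
```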

pandas/_libs/algos_take_helper.pxi.in (+2 -2)

@@ -4,9 +4,9 @@ Template for each `dtype` helper function for take
 WARNING: DO NOT edit .pxi FILE directly, .pxi is generated from .pxi.in
 """
 
-#----------------------------------------------------------------------
+# ----------------------------------------------------------------------
 # take_1d, take_2d
-#----------------------------------------------------------------------
+# ----------------------------------------------------------------------
 
 {{py:

pandas/_libs/groupby.pyx (+16 -18)

@@ -1,14 +1,13 @@
 # -*- coding: utf-8 -*-
 
-cimport cython
-from cython cimport Py_ssize_t
+import cython
+from cython import Py_ssize_t
 
 from libc.stdlib cimport malloc, free
 
 import numpy as np
 cimport numpy as cnp
 from numpy cimport (ndarray,
-                    double_t,
                     int8_t, int16_t, int32_t, int64_t, uint8_t, uint16_t,
                     uint32_t, uint64_t, float32_t, float64_t)
 cnp.import_array()

@@ -20,10 +19,9 @@ from algos cimport (swap, TiebreakEnumType, TIEBREAK_AVERAGE, TIEBREAK_MIN,
                     TIEBREAK_MAX, TIEBREAK_FIRST, TIEBREAK_DENSE)
 from algos import take_2d_axis1_float64_float64, groupsort_indexer, tiebreakers
 
-cdef int64_t iNaT = get_nat()
+cdef int64_t NPY_NAT = get_nat()
 
-cdef double NaN = <double>np.NaN
-cdef double nan = NaN
+cdef float64_t NaN = <float64_t>np.NaN
 
 
 cdef inline float64_t median_linear(float64_t* a, int n) nogil:

@@ -67,13 +65,13 @@ cdef inline float64_t median_linear(float64_t* a, int n) nogil:
     return result
 
 
-# TODO: Is this redundant with algos.kth_smallest?
+# TODO: Is this redundant with algos.kth_smallest
 cdef inline float64_t kth_smallest_c(float64_t* a,
                                      Py_ssize_t k,
                                      Py_ssize_t n) nogil:
     cdef:
         Py_ssize_t i, j, l, m
-        double_t x, t
+        float64_t x, t
 
     l = 0
     m = n - 1

@@ -109,7 +107,7 @@ def group_median_float64(ndarray[float64_t, ndim=2] out,
     cdef:
         Py_ssize_t i, j, N, K, ngroups, size
         ndarray[int64_t] _counts
-        ndarray data
+        ndarray[float64_t, ndim=2] data
         float64_t* ptr
 
     assert min_count == -1, "'min_count' only used in add and prod"

@@ -139,8 +137,8 @@
 @cython.boundscheck(False)
 @cython.wraparound(False)
 def group_cumprod_float64(float64_t[:, :] out,
-                          float64_t[:, :] values,
-                          int64_t[:] labels,
+                          const float64_t[:, :] values,
+                          const int64_t[:] labels,
                           bint is_datetimelike,
                           bint skipna=True):
     """

@@ -177,7 +175,7 @@
 @cython.wraparound(False)
 def group_cumsum(numeric[:, :] out,
                  numeric[:, :] values,
-                 int64_t[:] labels,
+                 const int64_t[:] labels,
                  is_datetimelike,
                  bint skipna=True):
     """

@@ -217,7 +215,7 @@
 
 @cython.boundscheck(False)
 @cython.wraparound(False)
-def group_shift_indexer(ndarray[int64_t] out, ndarray[int64_t] labels,
+def group_shift_indexer(int64_t[:] out, const int64_t[:] labels,
                         int ngroups, int periods):
     cdef:
         Py_ssize_t N, i, j, ii

@@ -291,7 +289,7 @@ def group_fillna_indexer(ndarray[int64_t] out, ndarray[int64_t] labels,
     """
     cdef:
         Py_ssize_t i, N
-        ndarray[int64_t] sorted_labels
+        int64_t[:] sorted_labels
        int64_t idx, curr_fill_idx=-1, filled_vals=0
 
     N = len(out)

@@ -327,10 +325,10 @@ def group_fillna_indexer(ndarray[int64_t] out, ndarray[int64_t] labels,
 
 @cython.boundscheck(False)
 @cython.wraparound(False)
-def group_any_all(ndarray[uint8_t] out,
-                  ndarray[int64_t] labels,
-                  ndarray[uint8_t] values,
-                  ndarray[uint8_t] mask,
+def group_any_all(uint8_t[:] out,
+                  const int64_t[:] labels,
+                  const uint8_t[:] values,
+                  const uint8_t[:] mask,
                   object val_test,
                   bint skipna):
     """Aggregated boolean values to show truthfulness of group elements
