
Commit 960ddb5

Merge branch 'pandas-dev:master' into perf-readcsv
2 parents: 96ddaef + 6599834


67 files changed: +813 / -427 lines

.github/workflows/pre-commit.yml (-2)

@@ -13,8 +13,6 @@ jobs:
     concurrency:
       group: ${{ github.ref }}-pre-commit
       cancel-in-progress: ${{github.event_name == 'pull_request'}}
-    env:
-      SKIP: pyright
     steps:
     - uses: actions/checkout@v2
     - uses: actions/setup-python@v2

.pre-commit-config.yaml (+1)

@@ -89,6 +89,7 @@ repos:
         language: node
         pass_filenames: false
         types: [python]
+        stages: [manual]
         # note: keep version in sync with .github/workflows/ci.yml
         additional_dependencies: ['[email protected]']
 -   repo: local

ci/deps/actions-38-db.yaml (+1, -1)

@@ -16,7 +16,7 @@ dependencies:
   - botocore>=1.11
   - dask
   - fastparquet>=0.4.0
-  - fsspec>=0.7.4, <2021.6.0
+  - fsspec>=0.7.4
   - gcsfs>=0.6.0
   - geopandas
   - html5lib

ci/deps/actions-38-slow.yaml (+1, -1)

@@ -13,7 +13,7 @@ dependencies:

   # pandas dependencies
   - beautifulsoup4
-  - fsspec>=0.7.4, <2021.6.0
+  - fsspec>=0.7.4
   - html5lib
   - lxml
   - matplotlib

ci/deps/actions-39-slow.yaml (+1, -1)

@@ -15,7 +15,7 @@ dependencies:
   # pandas dependencies
   - beautifulsoup4
   - bottleneck
-  - fsspec>=0.8.0, <2021.6.0
+  - fsspec>=0.8.0
   - gcsfs
   - html5lib
   - jinja2

ci/deps/actions-39.yaml (+1, -1)

@@ -14,7 +14,7 @@ dependencies:
   # pandas dependencies
   - beautifulsoup4
   - bottleneck
-  - fsspec>=0.8.0, <2021.6.0
+  - fsspec>=0.8.0
   - gcsfs
   - html5lib
   - jinja2

ci/deps/azure-windows-38.yaml (+1, -1)

@@ -17,7 +17,7 @@ dependencies:
   - bottleneck
   - fastparquet>=0.4.0
   - flask
-  - fsspec>=0.8.0, <2021.6.0
+  - fsspec>=0.8.0
   - matplotlib=3.3.2
   - moto>=1.3.14
   - numba

ci/deps/azure-windows-39.yaml (+1, -1)

@@ -15,7 +15,7 @@ dependencies:
   # pandas dependencies
   - beautifulsoup4
   - bottleneck
-  - fsspec>=0.8.0, <2021.6.0
+  - fsspec>=0.8.0
   - gcsfs
   - html5lib
   - jinja2

doc/source/development/contributing_codebase.rst (+3, -1)

@@ -402,10 +402,12 @@ pandas uses `mypy <http://mypy-lang.org>`_ and `pyright <https://github.com/micr
     mypy pandas

     # let pre-commit setup and run pyright
-    pre-commit run --all-files pyright
+    pre-commit run --hook-stage manual --all-files pyright

     # or if pyright is installed (requires node.js)
     pyright

+A recent version of ``numpy`` (>=1.21.0) is required for type validation.
+
 .. _contributing.ci:

 Testing with continuous integration

doc/source/whatsnew/v1.3.4.rst (+2)

@@ -22,6 +22,8 @@ Fixed regressions
 - Fixed regression in :meth:`Series.cat.categories` setter failing to update the categories on the ``Series`` (:issue:`43334`)
 - Fixed regression in :meth:`pandas.read_csv` raising ``UnicodeDecodeError`` exception when ``memory_map=True`` (:issue:`43540`)
 - Fixed regression in :meth:`Series.aggregate` attempting to pass ``args`` and ``kwargs`` multiple times to the user supplied ``func`` in certain cases (:issue:`43357`)
+- Fixed regression when iterating over a :class:`DataFrame.groupby.rolling` object causing the resulting DataFrames to have an incorrect index if the input groupings were not sorted (:issue:`43386`)
+- Fixed regression in :meth:`DataFrame.groupby.rolling.cov` and :meth:`DataFrame.groupby.rolling.corr` computing incorrect results if the input groupings were not sorted (:issue:`43386`)

 .. ---------------------------------------------------------------------------

doc/source/whatsnew/v1.4.0.rst (+7, -1)

@@ -126,7 +126,8 @@ Other enhancements
 - Attempting to write into a file in missing parent directory with :meth:`DataFrame.to_csv`, :meth:`DataFrame.to_html`, :meth:`DataFrame.to_excel`, :meth:`DataFrame.to_feather`, :meth:`DataFrame.to_parquet`, :meth:`DataFrame.to_stata`, :meth:`DataFrame.to_json`, :meth:`DataFrame.to_pickle`, and :meth:`DataFrame.to_xml` now explicitly mentions missing parent directory, the same is true for :class:`Series` counterparts (:issue:`24306`)
 - :meth:`IntegerArray.all` , :meth:`IntegerArray.any`, :meth:`FloatingArray.any`, and :meth:`FloatingArray.all` use Kleene logic (:issue:`41967`)
 - Added support for nullable boolean and integer types in :meth:`DataFrame.to_stata`, :class:`~pandas.io.stata.StataWriter`, :class:`~pandas.io.stata.StataWriter117`, and :class:`~pandas.io.stata.StataWriterUTF8` (:issue:`40855`)
--
+- :meth:`DataFrame.__pos__`, :meth:`DataFrame.__neg__` now retain ``ExtensionDtype`` dtypes (:issue:`43883`)
+

 .. ---------------------------------------------------------------------------

@@ -336,6 +337,7 @@ Other Deprecations
 - Deprecated the 'include_start' and 'include_end' arguments in :meth:`DataFrame.between_time`; in a future version passing 'include_start' or 'include_end' will raise (:issue:`40245`)
 - Deprecated the ``squeeze`` argument to :meth:`read_csv`, :meth:`read_table`, and :meth:`read_excel`. Users should squeeze the DataFrame afterwards with ``.squeeze("columns")`` instead. (:issue:`43242`)
 - Deprecated the ``index`` argument to :class:`SparseArray` construction (:issue:`23089`)
+- Deprecated the ``closed`` argument in :meth:`date_range` and :meth:`bdate_range` in favor of ``inclusive`` argument; In a future version passing ``closed`` will raise (:issue:`40245`)
 - Deprecated :meth:`.Rolling.validate`, :meth:`.Expanding.validate`, and :meth:`.ExponentialMovingWindow.validate` (:issue:`43665`)
 - Deprecated silent dropping of columns that raised a ``TypeError`` in :class:`Series.transform` and :class:`DataFrame.transform` when used with a dictionary (:issue:`43740`)
 - Deprecated silent dropping of columns that raised a ``TypeError``, ``DataError``, and some cases of ``ValueError`` in :meth:`Series.aggregate`, :meth:`DataFrame.aggregate`, :meth:`Series.groupby.aggregate`, and :meth:`DataFrame.groupby.aggregate` when used with a list (:issue:`43740`)

@@ -386,6 +388,7 @@ Datetimelike
 - Bug in :class:`DataFrame` constructor unnecessarily copying non-datetimelike 2D object arrays (:issue:`39272`)
 - Bug in :func:`to_datetime` with ``format`` and ``pandas.NA`` was raising ``ValueError`` (:issue:`42957`)
 - :func:`to_datetime` would silently swap ``MM/DD/YYYY`` and ``DD/MM/YYYY`` formats if the given ``dayfirst`` option could not be respected - now, a warning is raised in the case of delimited date strings (e.g. ``31-12-2012``) (:issue:`12585`)
+- Bug in :meth:`date_range` and :meth:`bdate_range` do not return right bound when ``start`` = ``end`` and set is closed on one side (:issue:`43394`)
 -

 Timedelta

@@ -464,6 +467,8 @@ I/O
 - Bug in unpickling a :class:`Index` with object dtype incorrectly inferring numeric dtypes (:issue:`43188`)
 - Bug in :func:`read_csv` where reading multi-header input with unequal lengths incorrectly raising uncontrolled ``IndexError`` (:issue:`43102`)
 - Bug in :func:`read_csv`, changed exception class when expecting a file path name or file-like object from ``OSError`` to ``TypeError`` (:issue:`43366`)
+- Bug in :func:`read_json` not handling non-numpy dtypes correctly (especially ``category``) (:issue:`21892`, :issue:`33205`)
+- Bug in :func:`json_normalize` where multi-character ``sep`` parameter is incorrectly prefixed to every key (:issue:`43831`)
 - Bug in :func:`read_csv` with :code:`float_precision="round_trip"` which did not skip initial/trailing whitespace (:issue:`43713`)
 -

@@ -509,6 +514,7 @@ Sparse
 - Bug in :meth:`DataFrame.sparse.to_coo` raising ``AttributeError`` when column names are not unique (:issue:`29564`)
 - Bug in :meth:`SparseArray.max` and :meth:`SparseArray.min` raising ``ValueError`` for arrays with 0 non-null elements (:issue:`43527`)
 - Bug in :meth:`DataFrame.sparse.to_coo` silently converting non-zero fill values to zero (:issue:`24817`)
+- Bug in :class:`SparseArray` comparison methods with an array-like operand of mismatched length raising ``AssertionError`` or unclear ``ValueError`` depending on the input (:issue:`43863`)
 -

 ExtensionArray
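
For the ``closed`` -> ``inclusive`` deprecation listed under "Other Deprecations" above, a minimal usage sketch (assuming pandas 1.4 or later; the dates below are arbitrary examples):

    import pandas as pd

    # New spelling: `inclusive` accepts "both", "neither", "left" or "right"
    # where the deprecated `closed` argument was previously used.
    idx = pd.date_range("2021-01-01", "2021-01-05", inclusive="left")
    print(idx)
    # DatetimeIndex(['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04'],
    #               dtype='datetime64[ns]', freq='D')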

environment.yml (+1, -1)

@@ -106,7 +106,7 @@ dependencies:
   - pytables>=3.6.1 # pandas.read_hdf, DataFrame.to_hdf
   - s3fs>=0.4.0 # file IO when using 's3://...' path
   - aiobotocore
-  - fsspec>=0.7.4, <2021.6.0 # for generic remote file operations
+  - fsspec>=0.7.4 # for generic remote file operations
   - gcsfs>=0.6.0 # file IO when using 'gcs://...' path
   - sqlalchemy # pandas.read_sql, DataFrame.to_sql
   - xarray<0.19 # DataFrame.to_xarray

pandas/_libs/algos_common_helper.pxi.in (+2, -4)

@@ -8,18 +8,16 @@ WARNING: DO NOT edit .pxi FILE directly, .pxi is generated from .pxi.in
 # ensure_dtype
 # ----------------------------------------------------------------------

-cdef int PLATFORM_INT = (<ndarray>np.arange(0, dtype=np.intp)).descr.type_num
-

 def ensure_platform_int(object arr):
     # GH3033, GH1392
     # platform int is the size of the int pointer, e.g. np.intp
     if util.is_array(arr):
-        if (<ndarray>arr).descr.type_num == PLATFORM_INT:
+        if (<ndarray>arr).descr.type_num == cnp.NPY_INTP:
             return arr
         else:
             # equiv: arr.astype(np.intp)
-            return cnp.PyArray_Cast(<ndarray>arr, PLATFORM_INT)
+            return cnp.PyArray_Cast(<ndarray>arr, cnp.NPY_INTP)
     else:
         return np.array(arr, dtype=np.intp)
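
The hunk above only swaps a module-level ``PLATFORM_INT`` constant for the equivalent ``cnp.NPY_INTP`` enum value. A pure-Python sketch of what ``ensure_platform_int`` does (an approximation for illustration, not the compiled helper):

    import numpy as np

    def ensure_platform_int_sketch(arr):
        """Return `arr` as a platform-int (np.intp) ndarray, avoiding a
        copy when it already has that dtype."""
        if isinstance(arr, np.ndarray):
            if arr.dtype == np.intp:
                return arr
            # roughly what cnp.PyArray_Cast(<ndarray>arr, cnp.NPY_INTP) does
            return arr.astype(np.intp)
        return np.array(arr, dtype=np.intp)

    print(ensure_platform_int_sketch([1, 2, 3]).dtype)  # intp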

pandas/_libs/algos_take_helper.pxi.in (+4, -4)

@@ -103,7 +103,7 @@ def take_2d_axis0_{{name}}_{{dest}}(const {{c_type_in}}[:, :] values,
 {{else}}
 def take_2d_axis0_{{name}}_{{dest}}(ndarray[{{c_type_in}}, ndim=2] values,
 {{endif}}
-                                    ndarray[intp_t] indexer,
+                                    ndarray[intp_t, ndim=1] indexer,
                                     {{c_type_out}}[:, :] out,
                                     fill_value=np.nan):
     cdef:

@@ -158,7 +158,7 @@ def take_2d_axis1_{{name}}_{{dest}}(const {{c_type_in}}[:, :] values,
 {{else}}
 def take_2d_axis1_{{name}}_{{dest}}(ndarray[{{c_type_in}}, ndim=2] values,
 {{endif}}
-                                    ndarray[intp_t] indexer,
+                                    ndarray[intp_t, ndim=1] indexer,
                                     {{c_type_out}}[:, :] out,
                                     fill_value=np.nan):

@@ -195,8 +195,8 @@ def take_2d_multi_{{name}}_{{dest}}(ndarray[{{c_type_in}}, ndim=2] values,
                                     fill_value=np.nan):
     cdef:
         Py_ssize_t i, j, k, n, idx
-        ndarray[intp_t] idx0 = indexer[0]
-        ndarray[intp_t] idx1 = indexer[1]
+        ndarray[intp_t, ndim=1] idx0 = indexer[0]
+        ndarray[intp_t, ndim=1] idx1 = indexer[1]
         {{c_type_out}} fv

     n = len(idx0)
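
The changes above only tighten the declared ``indexer`` type to a one-dimensional buffer. As a rough NumPy illustration of what the generated ``take_2d_axis0_*`` functions do (gather rows by position into a preallocated ``out`` buffer, writing ``fill_value`` where the indexer is ``-1``; the sample values here are made up):

    import numpy as np

    values = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
    indexer = np.array([2, -1, 0], dtype=np.intp)  # -1 marks a missing row
    out = np.empty((len(indexer), values.shape[1]), dtype=np.float64)
    fill_value = np.nan

    for i, idx in enumerate(indexer):
        out[i] = fill_value if idx == -1 else values[idx]

    print(out)
    # [[ 5.  6.]
    #  [nan nan]
    #  [ 1.  2.]]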

pandas/_libs/index.pyx (+8, -5)

@@ -116,12 +116,14 @@ cdef class IndexEngine:
     cdef:
         bint unique, monotonic_inc, monotonic_dec
         bint need_monotonic_check, need_unique_check
+        object _np_type

     def __init__(self, ndarray values):
         self.values = values

         self.over_size_threshold = len(values) >= _SIZE_CUTOFF
         self.clear_mapping()
+        self._np_type = values.dtype.type

     def __contains__(self, val: object) -> bool:
         # We assume before we get here:

@@ -168,13 +170,13 @@ cdef class IndexEngine:
         See ObjectEngine._searchsorted_left.__doc__.
         """
         # Caller is responsible for ensuring _check_type has already been called
-        loc = self.values.searchsorted(val, side="left")
+        loc = self.values.searchsorted(self._np_type(val), side="left")
         return loc

     cdef inline _get_loc_duplicates(self, object val):
         # -> Py_ssize_t | slice | ndarray[bool]
         cdef:
-            Py_ssize_t diff
+            Py_ssize_t diff, left, right

         if self.is_monotonic_increasing:
             values = self.values

@@ -318,8 +320,8 @@ cdef class IndexEngine:
         set stargets, remaining_stargets
         dict d = {}
         object val
-        int count = 0, count_missing = 0
-        Py_ssize_t i, j, n, n_t, n_alloc
+        Py_ssize_t count = 0, count_missing = 0
+        Py_ssize_t i, j, n, n_t, n_alloc, start, end
         bint d_has_nan = False, stargets_has_nan = False, need_nan_check = True

     values = self.values

@@ -481,7 +483,8 @@ cdef class DatetimeEngine(Int64Engine):
     # with either a Timestamp or NaT (Timedelta or NaT for TimedeltaEngine)

     cdef:
-        int64_t loc
+        Py_ssize_t loc
+
     if is_definitely_invalid_key(val):
         raise TypeError(f"'{val}' is an invalid key")
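
The ``_searchsorted_left`` change above converts the looked-up value to the index's own NumPy scalar type (stored as ``_np_type``) before calling ``searchsorted``. A minimal NumPy sketch of the idea (illustrative values, not pandas internals):

    import numpy as np

    values = np.arange(1_000_000, dtype=np.int64)
    np_type = values.dtype.type  # what the engine stores as _np_type

    # Converting the scalar to the array's own dtype up front spares NumPy
    # from resolving mixed scalar/array types on every lookup.
    loc = values.searchsorted(np_type(123_456), side="left")
    print(loc)  # 123456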

pandas/_libs/internals.pyx (+23, -9)

@@ -227,7 +227,7 @@ cdef class BlockPlacement:
         cdef:
             slice nv, s = self._ensure_has_slice()
             Py_ssize_t other_int, start, stop, step, l
-            ndarray newarr
+            ndarray[intp_t, ndim=1] newarr

         if s is not None:
             # see if we are either all-above or all-below, each of which

@@ -260,7 +260,7 @@ cdef class BlockPlacement:
         cdef:
             slice slc = self._ensure_has_slice()
             slice new_slice
-            ndarray new_placement
+            ndarray[intp_t, ndim=1] new_placement

         if slc is not None and slc.step == 1:
             new_slc = slice(slc.start * factor, slc.stop * factor, 1)

@@ -345,7 +345,9 @@ cpdef Py_ssize_t slice_len(slice slc, Py_ssize_t objlen=PY_SSIZE_T_MAX) except -
     return length


-cdef slice_get_indices_ex(slice slc, Py_ssize_t objlen=PY_SSIZE_T_MAX):
+cdef (Py_ssize_t, Py_ssize_t, Py_ssize_t, Py_ssize_t) slice_get_indices_ex(
+    slice slc, Py_ssize_t objlen=PY_SSIZE_T_MAX
+):
     """
     Get (start, stop, step, length) tuple for a slice.

@@ -460,9 +462,11 @@ def get_blkno_indexers(
     # blockno handling.
     cdef:
         int64_t cur_blkno
-        Py_ssize_t i, start, stop, n, diff, tot_len
+        Py_ssize_t i, start, stop, n, diff
+        cnp.npy_intp tot_len
         int64_t blkno
         object group_dict = defaultdict(list)
+        ndarray[int64_t, ndim=1] arr

     n = blknos.shape[0]
     result = list()

@@ -495,7 +499,8 @@ def get_blkno_indexers(
             result.append((blkno, slice(slices[0][0], slices[0][1])))
         else:
             tot_len = sum(stop - start for start, stop in slices)
-            arr = np.empty(tot_len, dtype=np.int64)
+            # equiv np.empty(tot_len, dtype=np.int64)
+            arr = cnp.PyArray_EMPTY(1, &tot_len, cnp.NPY_INT64, 0)

             i = 0
             for start, stop in slices:

@@ -526,16 +531,21 @@ def get_blkno_placements(blknos, group: bool = True):
         yield blkno, BlockPlacement(indexer)


+@cython.boundscheck(False)
+@cython.wraparound(False)
 cpdef update_blklocs_and_blknos(
-    ndarray[intp_t] blklocs, ndarray[intp_t] blknos, Py_ssize_t loc, intp_t nblocks
+    ndarray[intp_t, ndim=1] blklocs,
+    ndarray[intp_t, ndim=1] blknos,
+    Py_ssize_t loc,
+    intp_t nblocks,
 ):
     """
     Update blklocs and blknos when a new column is inserted at 'loc'.
     """
     cdef:
         Py_ssize_t i
         cnp.npy_intp length = len(blklocs) + 1
-        ndarray[intp_t] new_blklocs, new_blknos
+        ndarray[intp_t, ndim=1] new_blklocs, new_blknos

     # equiv: new_blklocs = np.empty(length, dtype=np.intp)
     new_blklocs = cnp.PyArray_EMPTY(1, &length, cnp.NPY_INTP, 0)

@@ -693,7 +703,7 @@ cdef class BlockManager:
             cnp.npy_intp length = self.shape[0]
             SharedBlock blk
             BlockPlacement bp
-            ndarray[intp_t] new_blknos, new_blklocs
+            ndarray[intp_t, ndim=1] new_blknos, new_blklocs

         # equiv: np.empty(length, dtype=np.intp)
         new_blknos = cnp.PyArray_EMPTY(1, &length, cnp.NPY_INTP, 0)

@@ -711,7 +721,11 @@ cdef class BlockManager:
             new_blknos[j] = blkno
             new_blklocs[j] = i

-        for blkno in new_blknos:
+        for i in range(length):
+            # faster than `for blkno in new_blknos`
+            # https://github.com/cython/cython/issues/4393
+            blkno = new_blknos[i]
+
             # If there are any -1s remaining, this indicates that our mgr_locs
             # are invalid.
             if blkno == -1:
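
The last hunk replaces ``for blkno in new_blknos`` with an explicit ``for i in range(length)`` loop because, per the linked Cython issue, iterating directly over a typed ndarray can be slower than indexed access. A small pure-Python/NumPy analogue of the indexed pattern (illustration only; the real code is compiled Cython):

    import numpy as np

    new_blknos = np.array([0, 1, -1, 2], dtype=np.intp)
    length = len(new_blknos)

    for i in range(length):
        blkno = new_blknos[i]  # indexed access instead of `for blkno in new_blknos`
        if blkno == -1:
            # per the comment in the diff, a remaining -1 means the
            # mgr_locs are invalid
            print(f"invalid placement at position {i}")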
