Commit b276196

Merge branch 'pandas-dev:main' into raise-on-parse-int-overflow
2 parents: 5896e01 + d1d9b7f

File tree

22 files changed: +325 −67 lines

Dockerfile (+1 −1)

@@ -1,4 +1,4 @@
-FROM quay.io/condaforge/mambaforge
+FROM quay.io/condaforge/mambaforge:4.13.0-1
 
 # if you forked pandas, you can pass in your own GitHub username to use your fork
 # i.e. gh_username=myname

asv_bench/benchmarks/groupby.py (+2)

@@ -5,6 +5,7 @@
 import numpy as np
 
 from pandas import (
+    NA,
     Categorical,
     DataFrame,
     Index,

@@ -592,6 +593,7 @@ def setup(self, dtype, method):
             columns=list("abcdefghij"),
             dtype=dtype,
         )
+        df.loc[list(range(1, N, 5)), list("abcdefghij")] = NA
         df["key"] = np.random.randint(0, 100, size=N)
         self.df = df

doc/source/user_guide/io.rst (+36)

@@ -3174,6 +3174,42 @@ But assigning *any* temporary name to correct URI allows parsing by nodes.
 However, if XPath does not reference node names such as default, ``/*``, then
 ``namespaces`` is not required.
 
+.. note::
+
+   Since ``xpath`` identifies the parent of content to be parsed, only immediate
+   descendants, which include child nodes and current attributes, are parsed.
+   Therefore, ``read_xml`` will not parse the text of grandchildren or other
+   descendants and will not parse attributes of any descendant. To retrieve
+   lower level content, adjust the xpath to the lower level. For example,
+
+   .. ipython:: python
+       :okwarning:
+
+       xml = """
+       <data>
+         <row>
+           <shape sides="4">square</shape>
+           <degrees>360</degrees>
+         </row>
+         <row>
+           <shape sides="0">circle</shape>
+           <degrees>360</degrees>
+         </row>
+         <row>
+           <shape sides="3">triangle</shape>
+           <degrees>180</degrees>
+         </row>
+       </data>"""
+
+       df = pd.read_xml(xml, xpath="./row")
+       df
+
+   shows the attribute ``sides`` on the ``shape`` element was not parsed as
+   expected, since this attribute resides on the child of the ``row`` element
+   and not on the ``row`` element itself. In other words, the ``sides``
+   attribute is a grandchild-level descendant of the ``row`` element, while
+   the ``xpath`` targets the ``row`` element, which covers only its children
+   and attributes.
+
 With `lxml`_ as parser, you can flatten nested XML documents with an XSLT
 script which also can be string/file/URL types. As background, `XSLT`_ is
 a special-purpose language written in a special XML file that can transform
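To make the note above concrete, pointing ``xpath`` one level lower exposes the ``sides`` attribute. A minimal sketch, reusing the same ``xml`` document as the note and assuming the stdlib-backed ``etree`` parser (so ``lxml`` is not required):

```python
import pandas as pd

xml = """
<data>
  <row>
    <shape sides="4">square</shape>
    <degrees>360</degrees>
  </row>
  <row>
    <shape sides="0">circle</shape>
    <degrees>360</degrees>
  </row>
  <row>
    <shape sides="3">triangle</shape>
    <degrees>180</degrees>
  </row>
</data>"""

# Targeting the child element directly makes its own attributes
# (here, "sides") visible to the parser.
df = pd.read_xml(xml, xpath=".//shape", parser="etree")
print(df)
```

Each matched ``shape`` element should now contribute its ``sides`` attribute as a column, which the higher-level ``xpath="./row"`` query could not reach.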

doc/source/whatsnew/v1.5.0.rst (+2)

@@ -1011,6 +1011,8 @@ Time Zones
 Numeric
 ^^^^^^^
 - Bug in operations with array-likes with ``dtype="boolean"`` and :attr:`NA` incorrectly altering the array in-place (:issue:`45421`)
+- Bug in arithmetic operations with nullable types without :attr:`NA` values not matching the same operation with non-nullable types (:issue:`48223`)
+- Bug in ``floordiv`` when dividing by ``IntegerDtype`` ``0`` would return ``0`` instead of ``inf`` (:issue:`48223`)
 - Bug in division, ``pow`` and ``mod`` operations on array-likes with ``dtype="boolean"`` not being like their ``np.bool_`` counterparts (:issue:`46063`)
 - Bug in multiplying a :class:`Series` with ``IntegerDtype`` or ``FloatingDtype`` by an array-like with ``timedelta64[ns]`` dtype incorrectly raising (:issue:`45622`)
 - Bug in :meth:`mean` where the optional dependency ``bottleneck`` causes precision loss linear in the length of the array. ``bottleneck`` has been disabled for :meth:`mean` improving the loss to log-linear but may result in a performance decrease. (:issue:`42878`)

doc/source/whatsnew/v1.6.0.rst (+2 −1)

@@ -100,6 +100,7 @@ Deprecations
 
 Performance improvements
 ~~~~~~~~~~~~~~~~~~~~~~~~
+- Performance improvement in :meth:`.GroupBy.median` for nullable dtypes (:issue:`37493`)
 - Performance improvement in :meth:`.GroupBy.mean` and :meth:`.GroupBy.var` for extension array dtypes (:issue:`37493`)
 - Performance improvement for :meth:`MultiIndex.unique` (:issue:`48335`)
 -

@@ -154,7 +155,7 @@ Indexing
 ^^^^^^^^
 - Bug in :meth:`DataFrame.reindex` filling with wrong values when indexing columns and index for ``uint`` dtypes (:issue:`48184`)
 - Bug in :meth:`DataFrame.reindex` casting dtype to ``object`` when :class:`DataFrame` has single extension array column when re-indexing ``columns`` and ``index`` (:issue:`48190`)
--
+- Bug in :func:`~DataFrame.describe` when formatting percentiles in the resulting index showed more decimals than needed (:issue:`46362`)
 
 Missing
 ^^^^^^^

pandas/_libs/groupby.pyi (+2)

@@ -10,6 +10,8 @@ def group_median_float64(
     values: np.ndarray,  # ndarray[float64_t, ndim=2]
     labels: npt.NDArray[np.int64],
     min_count: int = ...,  # Py_ssize_t
+    mask: np.ndarray | None = ...,
+    result_mask: np.ndarray | None = ...,
 ) -> None: ...
 def group_cumprod_float64(
     out: np.ndarray,  # float64_t[:, ::1]

pandas/_libs/groupby.pyx (+95 −19)

@@ -41,6 +41,7 @@ from pandas._libs.algos import (
     ensure_platform_int,
     groupsort_indexer,
     rank_1d,
+    take_2d_axis1_bool_bool,
     take_2d_axis1_float64_float64,
 )

@@ -64,11 +65,48 @@ cdef enum InterpolationEnumType:
     INTERPOLATION_MIDPOINT
 
 
-cdef inline float64_t median_linear(float64_t* a, int n) nogil:
+cdef inline float64_t median_linear_mask(float64_t* a, int n, uint8_t* mask) nogil:
     cdef:
         int i, j, na_count = 0
+        float64_t* tmp
         float64_t result
+
+    if n == 0:
+        return NaN
+
+    # count NAs
+    for i in range(n):
+        if mask[i]:
+            na_count += 1
+
+    if na_count:
+        if na_count == n:
+            return NaN
+
+        tmp = <float64_t*>malloc((n - na_count) * sizeof(float64_t))
+
+        j = 0
+        for i in range(n):
+            if not mask[i]:
+                tmp[j] = a[i]
+                j += 1
+
+        a = tmp
+        n -= na_count
+
+    result = calc_median_linear(a, n, na_count)
+
+    if na_count:
+        free(a)
+
+    return result
+
+
+cdef inline float64_t median_linear(float64_t* a, int n) nogil:
+    cdef:
+        int i, j, na_count = 0
         float64_t* tmp
+        float64_t result
 
     if n == 0:
         return NaN

@@ -93,18 +131,34 @@ cdef inline float64_t median_linear(float64_t* a, int n) nogil:
         a = tmp
         n -= na_count
 
+    result = calc_median_linear(a, n, na_count)
+
+    if na_count:
+        free(a)
+
+    return result
+
+
+cdef inline float64_t calc_median_linear(float64_t* a, int n, int na_count) nogil:
+    cdef:
+        float64_t result
+
     if n % 2:
         result = kth_smallest_c(a, n // 2, n)
     else:
         result = (kth_smallest_c(a, n // 2, n) +
                   kth_smallest_c(a, n // 2 - 1, n)) / 2
 
-    if na_count:
-        free(a)
-
     return result
 
 
+ctypedef fused int64float_t:
+    int64_t
+    uint64_t
+    float32_t
+    float64_t
+
+
 @cython.boundscheck(False)
 @cython.wraparound(False)
 def group_median_float64(

@@ -113,6 +167,8 @@ def group_median_float64(
     ndarray[float64_t, ndim=2] values,
     ndarray[intp_t] labels,
     Py_ssize_t min_count=-1,
+    const uint8_t[:, :] mask=None,
+    uint8_t[:, ::1] result_mask=None,
 ) -> None:
     """
     Only aggregates on axis=0

@@ -121,8 +177,12 @@ def group_median_float64(
         Py_ssize_t i, j, N, K, ngroups, size
         ndarray[intp_t] _counts
         ndarray[float64_t, ndim=2] data
+        ndarray[uint8_t, ndim=2] data_mask
         ndarray[intp_t] indexer
         float64_t* ptr
+        uint8_t* ptr_mask
+        float64_t result
+        bint uses_mask = mask is not None
 
     assert min_count == -1, "'min_count' only used in sum and prod"

@@ -137,15 +197,38 @@ def group_median_float64(
 
     take_2d_axis1_float64_float64(values.T, indexer, out=data)
 
-    with nogil:
+    if uses_mask:
+        data_mask = np.empty((K, N), dtype=np.uint8)
+        ptr_mask = <uint8_t *>cnp.PyArray_DATA(data_mask)
+
+        take_2d_axis1_bool_bool(mask.T, indexer, out=data_mask, fill_value=1)
 
-        for i in range(K):
-            # exclude NA group
-            ptr += _counts[0]
-            for j in range(ngroups):
-                size = _counts[j + 1]
-                out[j, i] = median_linear(ptr, size)
-                ptr += size
+        with nogil:
+
+            for i in range(K):
+                # exclude NA group
+                ptr += _counts[0]
+                ptr_mask += _counts[0]
+
+                for j in range(ngroups):
+                    size = _counts[j + 1]
+                    result = median_linear_mask(ptr, size, ptr_mask)
+                    out[j, i] = result
+
+                    if result != result:
+                        result_mask[j, i] = 1
+                    ptr += size
+                    ptr_mask += size
+
+    else:
+        with nogil:
+            for i in range(K):
+                # exclude NA group
+                ptr += _counts[0]
+                for j in range(ngroups):
+                    size = _counts[j + 1]
+                    out[j, i] = median_linear(ptr, size)
+                    ptr += size

@@ -206,13 +289,6 @@ def group_cumprod_float64(
             accum[lab, j] = NaN
 
 
-ctypedef fused int64float_t:
-    int64_t
-    uint64_t
-    float32_t
-    float64_t
-
-
 @cython.boundscheck(False)
 @cython.wraparound(False)
 def group_cumsum(
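The user-visible effect of the masked kernel above is that ``GroupBy.median`` on a nullable dtype now skips ``pd.NA`` positions via the mask instead of round-tripping through a dense float array. A quick sketch:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "key": [1, 1, 1, 2, 2],
        "val": pd.array([1, pd.NA, 3, 4, pd.NA], dtype="Int64"),
    }
)

# NA positions are excluded by the mask, so group 1's median is
# taken over [1, 3] and group 2's over [4].
result = df.groupby("key")["val"].median()
print(result)
```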

pandas/core/algorithms.py (+23 −5)

@@ -14,6 +14,7 @@
     Sequence,
     cast,
     final,
+    overload,
 )
 import warnings

@@ -101,6 +102,7 @@
     Categorical,
     DataFrame,
     Index,
+    MultiIndex,
     Series,
 )
 from pandas.core.arrays import (

@@ -1780,7 +1782,7 @@ def safe_sort(
     na_sentinel: int = -1,
     assume_unique: bool = False,
     verify: bool = True,
-) -> np.ndarray | tuple[np.ndarray, np.ndarray]:
+) -> np.ndarray | MultiIndex | tuple[np.ndarray | MultiIndex, np.ndarray]:
     """
     Sort ``values`` and reorder corresponding ``codes``.

@@ -1809,7 +1811,7 @@ def safe_sort(
 
     Returns
     -------
-    ordered : ndarray
+    ordered : ndarray or MultiIndex
         Sorted ``values``
     new_codes : ndarray
         Reordered ``codes``; returned when ``codes`` is not None.

@@ -1827,6 +1829,7 @@ def safe_sort(
         raise TypeError(
             "Only list-like objects are allowed to be passed to safe_sort as values"
         )
+    original_values = values
 
     if not isinstance(values, (np.ndarray, ABCExtensionArray)):
         # don't convert to string types

@@ -1838,6 +1841,7 @@ def safe_sort(
         values = np.asarray(values, dtype=dtype)  # type: ignore[arg-type]
 
     sorter = None
+    ordered: np.ndarray | MultiIndex
 
     if (
         not is_extension_array_dtype(values)

@@ -1853,7 +1857,7 @@ def safe_sort(
         # which would work, but which fails for special case of 1d arrays
         # with tuples.
         if values.size and isinstance(values[0], tuple):
-            ordered = _sort_tuples(values)
+            ordered = _sort_tuples(values, original_values)
         else:
             ordered = _sort_mixed(values)

@@ -1915,19 +1919,33 @@ def _sort_mixed(values) -> np.ndarray:
     )
 
 
-def _sort_tuples(values: np.ndarray) -> np.ndarray:
+@overload
+def _sort_tuples(values: np.ndarray, original_values: np.ndarray) -> np.ndarray:
+    ...
+
+
+@overload
+def _sort_tuples(values: np.ndarray, original_values: MultiIndex) -> MultiIndex:
+    ...
+
+
+def _sort_tuples(
+    values: np.ndarray, original_values: np.ndarray | MultiIndex
+) -> np.ndarray | MultiIndex:
     """
     Convert array of tuples (1d) to array or array (2d).
     We need to keep the columns separately as they contain different types and
     nans (can't use `np.sort` as it may fail when str and nan are mixed in a
     column as types cannot be compared).
+    We have to apply the indexer to the original values to keep the dtypes in
+    case of MultiIndexes
     """
     from pandas.core.internals.construction import to_arrays
     from pandas.core.sorting import lexsort_indexer
 
     arrays, _ = to_arrays(values, None)
     indexer = lexsort_indexer(arrays, orders=True)
-    return values[indexer]
+    return original_values[indexer]
 
 
 def union_with_duplicates(lvals: ArrayLike, rvals: ArrayLike) -> ArrayLike:

pandas/core/dtypes/common.py (+3)

@@ -280,6 +280,9 @@ def is_categorical(arr) -> bool:
     """
     Check whether an array-like is a Categorical instance.
 
+    .. deprecated:: 1.1.0
+       Use ``is_categorical_dtype`` instead.
+
     Parameters
     ----------
     arr : array-like
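The replacement the deprecation note points to, sketched below. ``is_categorical_dtype`` accepts Series, arrays, or dtype objects; note it has itself been deprecated in later pandas versions in favor of an ``isinstance`` check against ``pd.CategoricalDtype``:

```python
import pandas as pd
from pandas.api.types import is_categorical_dtype

ser = pd.Series(["a", "b", "a"], dtype="category")

# Works on the object or on its dtype.
flag = is_categorical_dtype(ser)

# Equivalent modern spelling (no deprecation warning):
flag_modern = isinstance(ser.dtype, pd.CategoricalDtype)
print(flag, flag_modern)
```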

pandas/core/frame.py (+1 −1)

@@ -9862,7 +9862,7 @@ def join(
         values given, the `other` DataFrame must have a MultiIndex. Can
         pass an array as the join key if it is not already contained in
         the calling DataFrame. Like an Excel VLOOKUP operation.
-    how : {'left', 'right', 'outer', 'inner'}, default 'left'
+    how : {'left', 'right', 'outer', 'inner', 'cross'}, default 'left'
         How to handle the operation of the two objects.
 
         * left: use calling frame's index (or column if on is specified)
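The newly documented ``'cross'`` option forms the cartesian product of both frames' rows, ignoring the indexes. A sketch:

```python
import pandas as pd

left = pd.DataFrame({"color": ["red", "blue"]})
right = pd.DataFrame({"size": ["S", "M", "L"]})

# Every row of `left` is paired with every row of `right`,
# preserving the order of the left frame.
out = left.join(right, how="cross")
print(out)
```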
