
Commit a23f101

Author: MarcoGorelli (committed)
Merge remote-tracking branch 'upstream/main' into deprecate-date-parser
2 parents 39cd663 + 10c51ba, commit a23f101

Note: large commits have some content hidden by default, so a few changed lines below are not shown.

59 files changed: +650, -432 lines

.pre-commit-config.yaml (+1, -1)

@@ -135,7 +135,7 @@ repos:
       types: [python]
       stages: [manual]
       additional_dependencies: &pyright_dependencies
-(changed line hidden in this view)
+(changed line hidden in this view)
     - id: pyright_reportGeneralTypeIssues
       # note: assumes python env is setup and activated
       name: pyright reportGeneralTypeIssues

ci/code_checks.sh (-2)

@@ -578,13 +578,11 @@ if [[ -z "$CHECK" || "$CHECK" == "docstrings" ]]; then

     MSG='Partially validate docstrings (EX02)' ; echo $MSG
     $BASE_DIR/scripts/validate_docstrings.py --format=actions --errors=EX02 --ignore_functions \
-        pandas.DataFrame.copy \
         pandas.DataFrame.plot.line \
         pandas.DataFrame.std \
         pandas.DataFrame.var \
         pandas.Index.factorize \
         pandas.Period.strftime \
-        pandas.Series.copy \
         pandas.Series.factorize \
         pandas.Series.floordiv \
         pandas.Series.plot.line \

doc/source/development/internals.rst (+23, -23)

@@ -15,24 +15,24 @@ Indexing
 In pandas there are a few objects implemented which can serve as valid
 containers for the axis labels:

-* ``Index``: the generic "ordered set" object, an ndarray of object dtype
+* :class:`Index`: the generic "ordered set" object, an ndarray of object dtype
   assuming nothing about its contents. The labels must be hashable (and
   likely immutable) and unique. Populates a dict of label to location in
   Cython to do ``O(1)`` lookups.
 * ``Int64Index``: a version of ``Index`` highly optimized for 64-bit integer
   data, such as time stamps
 * ``Float64Index``: a version of ``Index`` highly optimized for 64-bit float data
-* ``MultiIndex``: the standard hierarchical index object
-* ``DatetimeIndex``: An Index object with ``Timestamp`` boxed elements (impl are the int64 values)
-* ``TimedeltaIndex``: An Index object with ``Timedelta`` boxed elements (impl are the in64 values)
-* ``PeriodIndex``: An Index object with Period elements
+* :class:`MultiIndex`: the standard hierarchical index object
+* :class:`DatetimeIndex`: An Index object with :class:`Timestamp` boxed elements (impl are the int64 values)
+* :class:`TimedeltaIndex`: An Index object with :class:`Timedelta` boxed elements (impl are the in64 values)
+* :class:`PeriodIndex`: An Index object with Period elements

 There are functions that make the creation of a regular index easy:

-* ``date_range``: fixed frequency date range generated from a time rule or
+* :func:`date_range`: fixed frequency date range generated from a time rule or
   DateOffset. An ndarray of Python datetime objects
-* ``period_range``: fixed frequency date range generated from a time rule or
-  DateOffset. An ndarray of ``Period`` objects, representing timespans
+* :func:`period_range`: fixed frequency date range generated from a time rule or
+  DateOffset. An ndarray of :class:`Period` objects, representing timespans

 The motivation for having an ``Index`` class in the first place was to enable
 different implementations of indexing. This means that it's possible for you,
@@ -43,28 +43,28 @@ From an internal implementation point of view, the relevant methods that an
 ``Index`` must define are one or more of the following (depending on how
 incompatible the new object internals are with the ``Index`` functions):

-* ``get_loc``: returns an "indexer" (an integer, or in some cases a
+* :meth:`~Index.get_loc`: returns an "indexer" (an integer, or in some cases a
   slice object) for a label
-* ``slice_locs``: returns the "range" to slice between two labels
-* ``get_indexer``: Computes the indexing vector for reindexing / data
+* :meth:`~Index.slice_locs`: returns the "range" to slice between two labels
+* :meth:`~Index.get_indexer`: Computes the indexing vector for reindexing / data
   alignment purposes. See the source / docstrings for more on this
-* ``get_indexer_non_unique``: Computes the indexing vector for reindexing / data
+* :meth:`~Index.get_indexer_non_unique`: Computes the indexing vector for reindexing / data
   alignment purposes when the index is non-unique. See the source / docstrings
   for more on this
-* ``reindex``: Does any pre-conversion of the input index then calls
+* :meth:`~Index.reindex`: Does any pre-conversion of the input index then calls
   ``get_indexer``
-* ``union``, ``intersection``: computes the union or intersection of two
+* :meth:`~Index.union`, :meth:`~Index.intersection`: computes the union or intersection of two
   Index objects
-* ``insert``: Inserts a new label into an Index, yielding a new object
-* ``delete``: Delete a label, yielding a new object
-* ``drop``: Deletes a set of labels
-* ``take``: Analogous to ndarray.take
+* :meth:`~Index.insert`: Inserts a new label into an Index, yielding a new object
+* :meth:`~Index.delete`: Delete a label, yielding a new object
+* :meth:`~Index.drop`: Deletes a set of labels
+* :meth:`~Index.take`: Analogous to ndarray.take

 MultiIndex
 ~~~~~~~~~~

-Internally, the ``MultiIndex`` consists of a few things: the **levels**, the
-integer **codes** (until version 0.24 named *labels*), and the level **names**:
+Internally, the :class:`MultiIndex` consists of a few things: the **levels**, the
+integer **codes**, and the level **names**:

 .. ipython:: python

@@ -80,13 +80,13 @@ You can probably guess that the codes determine which unique element is
 identified with that location at each layer of the index. It's important to
 note that sortedness is determined **solely** from the integer codes and does
 not check (or care) whether the levels themselves are sorted. Fortunately, the
-constructors ``from_tuples`` and ``from_arrays`` ensure that this is true, but
-if you compute the levels and codes yourself, please be careful.
+constructors :meth:`~MultiIndex.from_tuples` and :meth:`~MultiIndex.from_arrays` ensure
+that this is true, but if you compute the levels and codes yourself, please be careful.

 Values
 ~~~~~~

-pandas extends NumPy's type system with custom types, like ``Categorical`` or
+pandas extends NumPy's type system with custom types, like :class:`Categorical` or
 datetimes with a timezone, so we have multiple notions of "values". For 1-D
 containers (``Index`` classes and ``Series``) we have the following convention:

doc/source/user_guide/indexing.rst (+1, -1)

@@ -231,7 +231,7 @@ You can also assign a ``dict`` to a row of a ``DataFrame``:

 You can use attribute access to modify an existing element of a Series or column of a DataFrame, but be careful;
 if you try to use attribute access to create a new column, it creates a new attribute rather than a
-new column. In 0.21.0 and later, this will raise a ``UserWarning``:
+new column and this will raise a ``UserWarning``:

 .. code-block:: ipython
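
The reworded sentence is about attribute assignment silently creating an attribute instead of a column; a small sketch of the difference (the column names are illustrative):

    import pandas as pd

    df = pd.DataFrame({"one": [1.0, 2.0, 3.0]})

    # Attribute assignment to a new name does not create a column;
    # pandas emits a UserWarning and the frame is left unchanged.
    df.two = [4, 5, 6]
    print(df.columns.tolist())  # ['one']

    # Item assignment is the supported way to add a column.
    df["two"] = [4, 5, 6]
    print(df.columns.tolist())  # ['one', 'two']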

doc/source/whatsnew/v2.0.0.rst (+34)

@@ -334,6 +334,36 @@ a supported dtype:

     pd.Series(["2016-01-01"], dtype="datetime64[D]")

+.. _whatsnew_200.api_breaking.value_counts:
+
+Value counts sets the resulting name to ``count``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+In past versions, when running :meth:`Series.value_counts`, the result would inherit
+the original object's name, and the result index would be nameless. This would cause
+confusion when resetting the index, and the column names would not correspond with the
+column values.
+Now, the result name will be ``'count'`` (or ``'proportion'`` if ``normalize=True`` was passed),
+and the index will be named after the original object (:issue:`49497`).
+
+*Previous behavior*:
+
+.. code-block:: ipython
+
+    In [8]: pd.Series(['quetzal', 'quetzal', 'elk'], name='animal').value_counts()
+
+    Out[2]:
+    quetzal    2
+    elk        1
+    Name: animal, dtype: int64
+
+*New behavior*:
+
+.. ipython:: python
+
+    pd.Series(['quetzal', 'quetzal', 'elk'], name='animal').value_counts()
+
+Likewise for other ``value_counts`` methods (for example, :meth:`DataFrame.value_counts`).
+
 .. _whatsnew_200.api_breaking.astype_to_unsupported_datetimelike:

 Disallow astype conversion to non-supported datetime64/timedelta64 dtypes
@@ -636,6 +666,7 @@ Other API changes

 Deprecations
 ~~~~~~~~~~~~
+- Deprecated parsing datetime strings with system-local timezone to ``tzlocal``, pass a ``tz`` keyword or explicitly call ``tz_localize`` instead (:issue:`50791`)
 - Deprecated argument ``infer_datetime_format`` in :func:`to_datetime` and :func:`read_csv`, as a strict version of it is now the default (:issue:`48621`)
 - Deprecated behavior of :func:`to_datetime` with ``unit`` when parsing strings, in a future version these will be parsed as datetimes (matching unit-less behavior) instead of cast to floats. To retain the old behavior, cast strings to numeric types before calling :func:`to_datetime` (:issue:`50735`)
 - Deprecated :func:`pandas.io.sql.execute` (:issue:`50185`)
@@ -950,6 +981,8 @@ Performance improvements
 - Performance improvement in :meth:`.SeriesGroupBy.value_counts` with categorical dtype (:issue:`46202`)
 - Fixed a reference leak in :func:`read_hdf` (:issue:`37441`)
 - Fixed a memory leak in :meth:`DataFrame.to_json` and :meth:`Series.to_json` when serializing datetimes and timedeltas (:issue:`40443`)
+- Decreased memory usage in many :class:`DataFrameGroupBy` methods (:issue:`51090`)
+-

 .. ---------------------------------------------------------------------------
 .. _whatsnew_200.bug_fixes:
@@ -1028,6 +1061,7 @@ Conversion
 - Bug where any :class:`ExtensionDtype` subclass with ``kind="M"`` would be interpreted as a timezone type (:issue:`34986`)
 - Bug in :class:`.arrays.ArrowExtensionArray` that would raise ``NotImplementedError`` when passed a sequence of strings or binary (:issue:`49172`)
 - Bug in :meth:`Series.astype` raising ``pyarrow.ArrowInvalid`` when converting from a non-pyarrow string dtype to a pyarrow numeric type (:issue:`50430`)
+- Bug in :meth:`DataFrame.astype` modifying input array inplace when converting to ``string`` and ``copy=False`` (:issue:`51073`)
 - Bug in :meth:`Series.to_numpy` converting to NumPy array before applying ``na_value`` (:issue:`48951`)
 - Bug in :meth:`DataFrame.astype` not copying data when converting to pyarrow dtype (:issue:`50984`)
 - Bug in :func:`to_datetime` was not respecting ``exact`` argument when ``format`` was an ISO8601 format (:issue:`12649`)
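
The motivation about resetting the index can be made concrete; a short sketch of how the new naming interacts with ``reset_index`` (output abbreviated):

    import pandas as pd

    animals = pd.Series(["quetzal", "quetzal", "elk"], name="animal")

    # The result is named 'count' and its index is named 'animal',
    # so reset_index yields self-describing columns.
    print(animals.value_counts().reset_index())
    #     animal  count
    # 0  quetzal      2
    # 1      elk      1

    # With normalize=True the values are named 'proportion' instead.
    print(animals.value_counts(normalize=True).name)  # 'proportion'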

pandas/_libs/internals.pyi (+3, -1)

@@ -44,7 +44,9 @@ class BlockPlacement:
     @property
     def is_slice_like(self) -> bool: ...
     @overload
-    def __getitem__(self, loc: slice | Sequence[int]) -> BlockPlacement: ...
+    def __getitem__(
+        self, loc: slice | Sequence[int] | npt.NDArray[np.intp]
+    ) -> BlockPlacement: ...
     @overload
     def __getitem__(self, loc: int) -> int: ...
     def __iter__(self) -> Iterator[int]: ...

pandas/_libs/lib.pyx (+7)

@@ -739,6 +739,7 @@ cpdef ndarray[object] ensure_string_array(
     """
     cdef:
         Py_ssize_t i = 0, n = len(arr)
+        bint already_copied = True

     if hasattr(arr, "to_numpy"):

@@ -757,6 +758,8 @@ cpdef ndarray[object] ensure_string_array(

     if copy and result is arr:
         result = result.copy()
+    elif not copy and result is arr:
+        already_copied = False

     if issubclass(arr.dtype.type, np.str_):
         # short-circuit, all elements are str
@@ -768,6 +771,10 @@ cpdef ndarray[object] ensure_string_array(
         if isinstance(val, str):
             continue

+        elif not already_copied:
+            result = result.copy()
+            already_copied = True
+
         if not checknull(val):
             if not util.is_float_object(val):
                 # f"{val}" is faster than str(val)

pandas/_libs/tslibs/parsing.pyx (+12)

@@ -686,6 +686,18 @@ cdef datetime dateutil_parse(
         ret = ret + relativedelta.relativedelta(weekday=res.weekday)
     if not ignoretz:
         if res.tzname and res.tzname in time.tzname:
+            # GH#50791
+            if res.tzname != "UTC":
+                # If the system is localized in UTC (as many CI runs are)
+                # we get tzlocal, once the deprecation is enforced will get
+                # timezone.utc, not raise.
+                warnings.warn(
+                    "Parsing '{res.tzname}' as tzlocal (dependent on system timezone) "
+                    "is deprecated and will raise in a future version. Pass the 'tz' "
+                    "keyword or call tz_localize after construction instead",
+                    FutureWarning,
+                    stacklevel=find_stack_level()
+                )
             ret = ret.replace(tzinfo=_dateutil_tzlocal())
         elif res.tzoffset == 0:
             ret = ret.replace(tzinfo=_dateutil_tzutc())
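
For users, the deprecation means a datetime string should not rely on the system timezone being attached implicitly via tzlocal; the warning text suggests passing ``tz`` or calling ``tz_localize``. A hedged sketch of those alternatives (the timezone name is illustrative):

    import pandas as pd

    # Parse as naive, then localize explicitly ...
    ts = pd.Timestamp("2023-01-15 10:30").tz_localize("Europe/London")

    # ... or pass tz up front where the API accepts it.
    idx = pd.date_range("2023-01-15", periods=3, freq="D", tz="Europe/London")

    print(ts.tz, idx.tz)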

pandas/_testing/contexts.py (+8, -1)

@@ -11,6 +11,11 @@
 )
 import uuid

+from pandas._typing import (
+    BaseBuffer,
+    CompressionOptions,
+    FilePath,
+)
 from pandas.compat import PYPY
 from pandas.errors import ChainedAssignmentError

@@ -20,7 +25,9 @@


 @contextmanager
-def decompress_file(path, compression) -> Generator[IO[bytes], None, None]:
+def decompress_file(
+    path: FilePath | BaseBuffer, compression: CompressionOptions
+) -> Generator[IO[bytes], None, None]:
     """
     Open a compressed file and return a file object.
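
``decompress_file`` is an internal testing helper; under the assumption that a gzip-compressed file exists at the (illustrative) path below, the newly annotated signature would be used like this in test code:

    from pandas._testing.contexts import decompress_file

    # Path and compression value are illustrative only.
    with decompress_file("data/sample.csv.gz", compression="gzip") as fh:
        raw = fh.read()
    print(raw[:80])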

pandas/core/algorithms.py (+11, -4)

@@ -847,7 +847,8 @@ def value_counts(
         Series,
     )

-    name = getattr(values, "name", None)
+    index_name = getattr(values, "name", None)
+    name = "proportion" if normalize else "count"

     if bins is not None:
         from pandas.core.reshape.tile import cut
@@ -860,6 +861,7 @@ def value_counts(

         # count, remove nulls (from the index), and but the bins
         result = ii.value_counts(dropna=dropna)
+        result.name = name
         result = result[result.index.notna()]
         result.index = result.index.astype("interval")
         result = result.sort_index()
@@ -878,14 +880,18 @@ def value_counts(
         # handle Categorical and sparse,
         result = Series(values)._values.value_counts(dropna=dropna)
         result.name = name
+        result.index.name = index_name
         counts = result._values

     elif isinstance(values, ABCMultiIndex):
         # GH49558
         levels = list(range(values.nlevels))
-        result = Series(index=values).groupby(level=levels, dropna=dropna).size()
-        # TODO: allow index names to remain (see discussion in GH49497)
-        result.index.names = [None] * values.nlevels
+        result = (
+            Series(index=values, name=name)
+            .groupby(level=levels, dropna=dropna)
+            .size()
+        )
+        result.index.names = values.names
         counts = result._values

     else:
@@ -899,6 +905,7 @@ def value_counts(
         idx = Index(keys)
         if idx.dtype == bool and keys.dtype == object:
             idx = idx.astype(object)
+        idx.name = index_name

         result = Series(counts, index=idx, name=name)

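
The ``ABCMultiIndex`` branch now keeps the original level names instead of blanking them, and the result Series is named ``'count'``. A small sketch of the visible effect (the data is illustrative):

    import pandas as pd

    mi = pd.MultiIndex.from_arrays(
        [["quetzal", "quetzal", "elk"], [2, 2, 4]], names=["animal", "legs"]
    )

    counts = mi.value_counts()
    print(counts.name)         # 'count'
    print(counts.index.names)  # ['animal', 'legs'] rather than [None, None]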

pandas/core/arrays/arrow/array.py (+8, -3)

@@ -934,7 +934,7 @@ def value_counts(self, dropna: bool = True) -> Series:

         index = Index(type(self)(values))

-        return Series(counts, index=index).astype("Int64")
+        return Series(counts, index=index, name="count").astype("Int64")

     @classmethod
     def _concat_same_type(
@@ -1255,7 +1255,7 @@ def _quantile(
         pa_dtype = self._data.type

         data = self._data
-        if pa.types.is_temporal(pa_dtype) and interpolation in ["lower", "higher"]:
+        if pa.types.is_temporal(pa_dtype):
             # https://github.com/apache/arrow/issues/33769 in these cases
             # we can cast to ints and back
             nbits = pa_dtype.bit_width
@@ -1266,7 +1266,12 @@ def _quantile(

         result = pc.quantile(data, q=qs, interpolation=interpolation)

-        if pa.types.is_temporal(pa_dtype) and interpolation in ["lower", "higher"]:
+        if pa.types.is_temporal(pa_dtype):
+            nbits = pa_dtype.bit_width
+            if nbits == 32:
+                result = result.cast(pa.int32())
+            else:
+                result = result.cast(pa.int64())
             result = result.cast(pa_dtype)

         return type(self)(result)
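
The quantile change routes temporal Arrow data through an integer cast on the way in and casts the float result back through an integer of matching width on the way out, for every interpolation mode rather than only "lower"/"higher". A rough NumPy sketch of that cast-and-back idea (illustrative only, not the pyarrow code path):

    import numpy as np

    ts = np.array(["2023-01-01", "2023-01-02", "2023-01-05"], dtype="datetime64[ns]")

    as_int = ts.view("int64")     # nanoseconds since the epoch
    q = np.quantile(as_int, 0.5)  # quantile computed on integers, returned as float
    median_ts = np.datetime64(int(round(q)), "ns")  # back through int to datetime64

    print(median_ts)  # 2023-01-02T00:00:00.000000000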

pandas/core/arrays/categorical.py (+11, -2)

@@ -1499,7 +1499,7 @@ def value_counts(self, dropna: bool = True) -> Series:
         ix = coerce_indexer_dtype(ix, self.dtype.categories)
         ix = self._from_backing_data(ix)

-        return Series(count, index=CategoricalIndex(ix), dtype="int64")
+        return Series(count, index=CategoricalIndex(ix), dtype="int64", name="count")

     # error: Argument 2 of "_empty" is incompatible with supertype
     # "NDArrayBackedExtensionArray"; supertype defines the argument type as
@@ -2284,7 +2284,16 @@ def _replace(self, *, to_replace, value, inplace: bool = False):
         ser = ser.replace(to_replace=to_replace, value=value)

         all_values = Index(ser)
-        new_categories = Index(ser.drop_duplicates(keep="first"))
+
+        # GH51016: maintain order of existing categories
+        idxr = cat.categories.get_indexer_for(all_values)
+        locs = np.arange(len(ser))
+        locs = np.where(idxr == -1, locs, idxr)
+        locs = locs.argsort()
+
+        new_categories = ser.take(locs)
+        new_categories = new_categories.drop_duplicates(keep="first")
+        new_categories = Index(new_categories)
         new_codes = recode_for_categories(
             cat._codes, all_values, new_categories, copy=False
         )
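
The new ordering logic in ``_replace`` can be followed on a toy input: values that already exist as categories keep their old position, so the surviving categories retain their original relative order (GH 51016). A standalone walkthrough with illustrative values:

    import numpy as np
    import pandas as pd

    # Replacing category 1 with 3 when the existing categories are [1, 2, 3]:
    categories = pd.Index([1, 2, 3])   # existing category order
    replaced = pd.Series([3, 2, 3])    # category values after the replacement
    all_values = pd.Index(replaced)

    idxr = categories.get_indexer_for(all_values)  # [2, 1, 2]; -1 marks brand-new values
    locs = np.arange(len(replaced))
    locs = np.where(idxr == -1, locs, idxr)        # reuse the old position where one exists
    order = locs.argsort()

    new_categories = pd.Index(replaced.take(order).drop_duplicates(keep="first"))
    print(new_categories)  # Index([2, 3], dtype='int64'): existing order kept, not [3, 2]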

pandas/core/arrays/masked.py (+2, -2)

@@ -996,7 +996,7 @@ def value_counts(self, dropna: bool = True) -> Series:
         )

         if dropna:
-            res = Series(value_counts, index=keys)
+            res = Series(value_counts, index=keys, name="count")
             res.index = res.index.astype(self.dtype)
             res = res.astype("Int64")
             return res
@@ -1012,7 +1012,7 @@ def value_counts(self, dropna: bool = True) -> Series:
         mask = np.zeros(len(counts), dtype="bool")
         counts_array = IntegerArray(counts, mask)

-        return Series(counts_array, index=index)
+        return Series(counts_array, index=index, name="count")

     @doc(ExtensionArray.equals)
     def equals(self, other) -> bool:
