Commit 1bfde5a

MarcoGorelli committed
Merge remote-tracking branch 'upstream/main' into minimal-reqs
2 parents 891ba6e + 1bb128e commit 1bfde5a

89 files changed: +1581 -1120 lines changed

ci/code_checks.sh

-8
@@ -597,21 +597,13 @@ if [[ -z "$CHECK" || "$CHECK" == "docstrings" ]]; then
  pandas.api.types.is_datetime64_dtype \
  pandas.api.types.is_datetime64_ns_dtype \
  pandas.api.types.is_datetime64tz_dtype \
- pandas.api.types.is_dict_like \
- pandas.api.types.is_file_like \
  pandas.api.types.is_float_dtype \
- pandas.api.types.is_hashable \
  pandas.api.types.is_int64_dtype \
  pandas.api.types.is_integer_dtype \
  pandas.api.types.is_interval_dtype \
- pandas.api.types.is_iterator \
- pandas.api.types.is_list_like \
- pandas.api.types.is_named_tuple \
  pandas.api.types.is_numeric_dtype \
  pandas.api.types.is_object_dtype \
  pandas.api.types.is_period_dtype \
- pandas.api.types.is_re \
- pandas.api.types.is_re_compilable \
  pandas.api.types.is_signed_integer_dtype \
  pandas.api.types.is_sparse \
  pandas.api.types.is_string_dtype \

ci/deps/actions-38-minimum_versions.yaml

+1-1
@@ -43,7 +43,7 @@ dependencies:
  - openpyxl=3.0.7
  - pandas-gbq=0.15.0
  - psycopg2=2.8.6
- - pyarrow=6.0.0
+ - pyarrow=7.0.0
  - pymysql=1.0.2
  - pyreadstat=1.1.2
  - pytables=3.6.1
+41
@@ -0,0 +1,41 @@
+ .. _copy_on_write:
+
+ {{ header }}
+
+ *************
+ Copy on write
+ *************
+
+ Copy on Write is a mechanism to simplify the indexing API and improve
+ performance through avoiding copies if possible.
+ CoW means that any DataFrame or Series derived from another in any way always
+ behaves as a copy.
+
+ Reference tracking
+ ------------------
+
+ To be able to determine, if we have to make a copy when writing into a DataFrame,
+ we have to be aware, if the values are shared with another DataFrame. pandas
+ keeps track of all ``Blocks`` that share values with another block internally to
+ be able to tell when a copy needs to be triggered. The reference tracking
+ mechanism is implemented on the Block level.
+
+ We use a custom reference tracker object, ``BlockValuesRefs``, that keeps
+ track of every block, whose values share memory with each other. The reference
+ is held through a weak-reference. Every two blocks that share some memory should
+ point to the same ``BlockValuesRefs`` object. If one block goes out of
+ scope, the reference to this block dies. As a consequence, the reference tracker
+ object always knows how many blocks are alive and share memory.
+
+ Whenever a :class:`DataFrame` or :class:`Series` object is sharing data with another
+ object, it is required that each of those objects have its own BlockManager and Block
+ objects. Thus, in other words, one Block instance (that is held by a DataFrame, not
+ necessarily for intermediate objects) should always be uniquely used for only
+ a single DataFrame/Series object. For example, when you want to use the same
+ Block for another object, you can create a shallow copy of the Block instance
+ with ``block.copy(deep=False)`` (which will create a new Block instance with
+ the same underlying values and which will correctly set up the references).
+
+ We can ask the reference tracking object if there is another block alive that shares
+ data with us before writing into the values. We can trigger a copy before
+ writing if there is in fact another block alive.
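
Note on the reference-tracking text added in this file: the mechanism can be illustrated with a small pure-Python sketch. This is not the pandas implementation. ``ToyBlock`` and ``ToyValuesRefs`` are hypothetical stand-ins. Only the names ``referenced_blocks``, ``add_reference`` and ``has_reference`` mirror the ``BlockValuesRefs`` stub added in this commit. Treating "shared" as "more than one live block in the tracker", and returning a fresh block from a write, are assumptions made for the illustration.

```python
import gc
import weakref


class ToyValuesRefs:
    """Hypothetical tracker: weak references to every block sharing one buffer."""

    def __init__(self, blk):
        self.referenced_blocks = [weakref.ref(blk)]

    def add_reference(self, blk):
        self.referenced_blocks.append(weakref.ref(blk))

    def has_reference(self):
        # Drop weak references whose block has already been collected.
        self.referenced_blocks = [
            ref for ref in self.referenced_blocks if ref() is not None
        ]
        # Assumption of this sketch: "shared" means more than one live block.
        return len(self.referenced_blocks) > 1


class ToyBlock:
    """Hypothetical block: a list of values plus the tracker for that list."""

    def __init__(self, values, refs=None):
        self.values = values
        if refs is None:
            refs = ToyValuesRefs(self)   # fresh buffer, fresh tracker
        else:
            refs.add_reference(self)     # shared buffer, shared tracker
        self.refs = refs

    def copy(self, deep=True):
        if deep:
            return ToyBlock(list(self.values))
        # Shallow copy: new block object, same values, same tracker.
        return ToyBlock(self.values, refs=self.refs)

    def setitem(self, index, value):
        # Copy before writing only if another live block shares the values.
        target = self.copy(deep=True) if self.refs.has_reference() else self
        target.values[index] = value
        return target


parent = ToyBlock([1, 2, 3])
child = parent.copy(deep=False)          # shares values with parent
shared_before = parent.refs.has_reference()

child = child.setitem(0, 100)            # copies first; the shallow block is dropped
gc.collect()                             # make sure the dropped block is collected
shared_after = parent.refs.has_reference()

same_block = parent.setitem(1, -1)       # nothing is shared any more: in-place write
```

In this sketch the write through ``child`` leaves ``parent.values`` untouched, and once the shallow block is gone the tracker reports no sharing, so the later write to ``parent`` happens in place without a copy.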

doc/source/development/index.rst

+1
@@ -18,6 +18,7 @@ Development
  contributing_codebase
  maintaining
  internals
+ copy_on_write
  debugging_extensions
  extending
  developer

doc/source/getting_started/install.rst

+1-1
@@ -441,7 +441,7 @@ PyTables 3.6.1 hdf5 HDF5-based reading
  blosc 1.21.0 hdf5 Compression for HDF5; only available on ``conda``
  zlib hdf5 Compression for HDF5
  fastparquet 0.6.3 - Parquet reading / writing (pyarrow is default)
- pyarrow 6.0.0 parquet, feather Parquet, ORC, and feather reading / writing
+ pyarrow 7.0.0 parquet, feather Parquet, ORC, and feather reading / writing
  pyreadstat 1.1.2 spss SPSS files (.sav) reading
  odfpy 1.4.1 excel Open document format (.odf, .ods, .odt) reading / writing
  ========================= ================== ================ =============================================================

doc/source/reference/arrays.rst

+1
@@ -653,6 +653,7 @@ Data type introspection
  .. autosummary::
  :toctree: api/

+ api.types.is_any_real_numeric_dtype
  api.types.is_bool_dtype
  api.types.is_categorical_dtype
  api.types.is_complex_dtype

doc/source/reference/groupby.rst

+2
@@ -97,6 +97,7 @@ Function application
  DataFrameGroupBy.quantile
  DataFrameGroupBy.rank
  DataFrameGroupBy.resample
+ DataFrameGroupBy.rolling
  DataFrameGroupBy.sample
  DataFrameGroupBy.sem
  DataFrameGroupBy.shift
@@ -152,6 +153,7 @@ Function application
  SeriesGroupBy.quantile
  SeriesGroupBy.rank
  SeriesGroupBy.resample
+ SeriesGroupBy.rolling
  SeriesGroupBy.sample
  SeriesGroupBy.sem
  SeriesGroupBy.shift

doc/source/whatsnew/v2.0.0.rst

+15-10
@@ -65,15 +65,14 @@ Below is a possibly non-exhaustive list of changes:

  1. Instantiating using a numpy numeric array now follows the dtype of the numpy array.
  Previously, all indexes created from numpy numeric arrays were forced to 64-bit. Now,
- the index dtype follows the dtype of the numpy array. For example, it would for all
- signed integer arrays previously return an index with ``int64`` dtype, but will now
- reuse the dtype of the supplied numpy array. So ``Index(np.array([1, 2, 3]))`` will be ``int32`` on 32-bit systems.
+ for example, ``Index(np.array([1, 2, 3]))`` will be ``int32`` on 32-bit systems, where
+ it previously would have been ``int64``` even on 32-bit systems.
  Instantiating :class:`Index` using a list of numbers will still return 64bit dtypes,
  e.g. ``Index([1, 2, 3])`` will have a ``int64`` dtype, which is the same as previously.
- 2. The various numeric datetime attributes of :class:`DateTimeIndex` (:attr:`~Date_TimeIndex.day`,
- :attr:`~DateTimeIndex.month`, :attr:`~DateTimeIndex.year` etc.) were previously in of
+ 2. The various numeric datetime attributes of :class:`DatetimeIndex` (:attr:`~DatetimeIndex.day`,
+ :attr:`~DatetimeIndex.month`, :attr:`~DatetimeIndex.year` etc.) were previously in of
  dtype ``int64``, while they were ``int32`` for :class:`DatetimeArray`. They are now
- ``int32`` on ``DateTimeIndex`` also:
+ ``int32`` on ``DatetimeIndex`` also:

  .. ipython:: python

@@ -92,7 +91,7 @@ Below is a possibly non-exhaustive list of changes:
  ([3.0, 1.0, 2.0], ([1, 0, 0], [0, 2, 3])), shape=(3, 4)
  )
  ser = pd.Series.sparse.from_coo(A)
- ser.index.dtype
+ ser.index.dtypes

  4. :class:`Index` cannot be instantiated using a float16 dtype. Previously instantiating
  an :class:`Index` using dtype ``float16`` resulted in a :class:`Float64Index` with a
@@ -224,6 +223,7 @@ Copy-on-Write improvements
  - :meth:`DataFrame.to_period` / :meth:`Series.to_period`
  - :meth:`DataFrame.truncate`
  - :meth:`DataFrame.tz_convert` / :meth:`Series.tz_localize`
+ - :meth:`DataFrame.infer_objects` / :meth:`Series.infer_objects`
  - :func:`concat`

  These methods return views when Copy-on-Write is enabled, which provides a significant
@@ -264,6 +264,7 @@ Alternatively, copy on write can be enabled locally through:

  Other enhancements
  ^^^^^^^^^^^^^^^^^^
+ - Added support for ``dt`` accessor methods when using :class:`ArrowDtype` with a ``pyarrow.timestamp`` type (:issue:`50954`)
  - :func:`read_sas` now supports using ``encoding='infer'`` to correctly read and use the encoding specified by the sas file. (:issue:`48048`)
  - :meth:`.DataFrameGroupBy.quantile`, :meth:`.SeriesGroupBy.quantile` and :meth:`.DataFrameGroupBy.std` now preserve nullable dtypes instead of casting to numpy dtypes (:issue:`37493`)
  - :meth:`Series.add_suffix`, :meth:`DataFrame.add_suffix`, :meth:`Series.add_prefix` and :meth:`DataFrame.add_prefix` support an ``axis`` argument. If ``axis`` is set, the default behaviour of which axis to consider can be overwritten (:issue:`47819`)
@@ -651,7 +652,7 @@ If installed, we now require:
  +-------------------+-----------------+----------+---------+
  | Package | Minimum Version | Required | Changed |
  +===================+=================+==========+=========+
- | mypy (dev) | 0.991 | | X |
+ | mypy (dev) | 1.0 | | X |
  +-------------------+-----------------+----------+---------+
  | pytest (dev) | 7.0.0 | | X |
  +-------------------+-----------------+----------+---------+
@@ -669,7 +670,7 @@ Optional libraries below the lowest tested version may still work, but are not c
  +-----------------+-----------------+---------+
  | Package | Minimum Version | Changed |
  +=================+=================+=========+
- | pyarrow | 6.0.0 | X |
+ | pyarrow | 7.0.0 | X |
  +-----------------+-----------------+---------+
  | matplotlib | 3.6.1 | X |
  +-----------------+-----------------+---------+
@@ -761,6 +762,7 @@ Other API changes
  - The levels of the index of the :class:`Series` returned from ``Series.sparse.from_coo`` now always have dtype ``int32``. Previously they had dtype ``int64`` (:issue:`50926`)
  - :func:`to_datetime` with ``unit`` of either "Y" or "M" will now raise if a sequence contains a non-round ``float`` value, matching the ``Timestamp`` behavior (:issue:`50301`)
  - The methods :meth:`Series.round`, :meth:`DataFrame.__invert__`, :meth:`Series.__invert__`, :meth:`DataFrame.swapaxes`, :meth:`DataFrame.first`, :meth:`DataFrame.last`, :meth:`Series.first`, :meth:`Series.last` and :meth:`DataFrame.align` will now always return new objects (:issue:`51032`)
+ - Added :func:`pandas.api.types.is_any_real_numeric_dtype` to check for real numeric dtypes (:issue:`51152`)

  .. ---------------------------------------------------------------------------
  .. _whatsnew_200.deprecations:
@@ -775,7 +777,7 @@ Deprecations
  - :meth:`Index.is_integer` has been deprecated. Use :func:`pandas.api.types.is_integer_dtype` instead (:issue:`50042`)
  - :meth:`Index.is_floating` has been deprecated. Use :func:`pandas.api.types.is_float_dtype` instead (:issue:`50042`)
  - :meth:`Index.holds_integer` has been deprecated. Use :func:`pandas.api.types.infer_dtype` instead (:issue:`50243`)
- - :meth:`Index.is_numeric` has been deprecated. Use :func:`pandas.api.types.is_numeric_dtype` instead (:issue:`50042`)
+ - :meth:`Index.is_numeric` has been deprecated. Use :func:`pandas.api.types.is_any_real_numeric_dtype` instead (:issue:`50042`,:issue:`51152`)
  - :meth:`Index.is_categorical` has been deprecated. Use :func:`pandas.api.types.is_categorical_dtype` instead (:issue:`50042`)
  - :meth:`Index.is_object` has been deprecated. Use :func:`pandas.api.types.is_object_dtype` instead (:issue:`50042`)
  - :meth:`Index.is_interval` has been deprecated. Use :func:`pandas.api.types.is_intterval_dtype` instead (:issue:`50042`)
@@ -1136,6 +1138,7 @@ Datetimelike
  - Bug in :meth:`Series.interpolate` and :meth:`DataFrame.interpolate` with datetime or timedelta dtypes incorrectly raising ``ValueError`` (:issue:`11312`)
  - Bug in :func:`to_datetime` was not returning input with ``errors='ignore'`` when input was out-of-bounds (:issue:`50587`)
  - Bug in :func:`DataFrame.from_records` when given a :class:`DataFrame` input with timezone-aware datetime64 columns incorrectly dropping the timezone-awareness (:issue:`51162`)
+ - Bug in :func:`to_datetime` was raising ``decimal.InvalidOperation`` when parsing date strings with ``errors='coerce'`` (:issue:`51084`)
  -

  Timedelta
@@ -1200,6 +1203,7 @@ Indexing
  - Bug in :meth:`Series.loc` raising error for out of bounds end of slice indexer (:issue:`50161`)
  - Bug in :meth:`DataFrame.loc` raising ``ValueError`` with ``bool`` indexer and :class:`MultiIndex` (:issue:`47687`)
  - Bug in :meth:`DataFrame.loc` raising ``IndexError`` when setting values for a pyarrow-backed column with a non-scalar indexer (:issue:`50085`)
+ - Bug in :meth:`DataFrame.loc` modifying object when setting incompatible value with an empty indexer (:issue:`45981`)
  - Bug in :meth:`DataFrame.__setitem__` raising ``ValueError`` when right hand side is :class:`DataFrame` with :class:`MultiIndex` columns (:issue:`49121`)
  - Bug in :meth:`DataFrame.reindex` casting dtype to ``object`` when :class:`DataFrame` has single extension array column when re-indexing ``columns`` and ``index`` (:issue:`48190`)
  - Bug in :meth:`DataFrame.iloc` raising ``IndexError`` when indexer is a :class:`Series` with numeric extension array dtype (:issue:`49521`)
@@ -1324,6 +1328,7 @@ ExtensionArray
  - Bug in :meth:`api.types.is_numeric_dtype` where a custom :class:`ExtensionDtype` would not return ``True`` if ``_is_numeric`` returned ``True`` (:issue:`50563`)
  - Bug in :meth:`api.types.is_integer_dtype`, :meth:`api.types.is_unsigned_integer_dtype`, :meth:`api.types.is_signed_integer_dtype`, :meth:`api.types.is_float_dtype` where a custom :class:`ExtensionDtype` would not return ``True`` if ``kind`` returned the corresponding NumPy type (:issue:`50667`)
  - Bug in :class:`Series` constructor unnecessarily overflowing for nullable unsigned integer dtypes (:issue:`38798`, :issue:`25880`)
+ - Bug in setting non-string value into ``StringArray`` raising ``ValueError`` instead of ``TypeError`` (:issue:`49632`)

  Styler
  ^^^^^^

environment.yml

+1-1
@@ -79,7 +79,7 @@ dependencies:
  - cpplint
  - flake8=6.0.0
  - isort>=5.2.1 # check that imports are in the right order
- - mypy=0.991
+ - mypy=1.0
  - pre-commit>=2.15.0
  - pyupgrade
  - ruff=0.0.215

pandas/_libs/groupby.pyi

+1
@@ -55,6 +55,7 @@ def group_any_all(
  mask: np.ndarray, # const uint8_t[::1]
  val_test: Literal["any", "all"],
  skipna: bool,
+ nullable: bool,
  ) -> None: ...
  def group_sum(
  out: np.ndarray, # complexfloatingintuint_t[:, ::1]

pandas/_libs/internals.pyi

+13-1
@@ -4,6 +4,7 @@ from typing import (
  final,
  overload,
  )
+ import weakref

  import numpy as np

@@ -59,8 +60,13 @@ class SharedBlock:
  _mgr_locs: BlockPlacement
  ndim: int
  values: ArrayLike
+ refs: BlockValuesRefs
  def __init__(
- self, values: ArrayLike, placement: BlockPlacement, ndim: int
+ self,
+ values: ArrayLike,
+ placement: BlockPlacement,
+ ndim: int,
+ refs: BlockValuesRefs | None = ...,
  ) -> None: ...

  class NumpyBlock(SharedBlock):
@@ -87,3 +93,9 @@ class BlockManager:
  ) -> None: ...
  def get_slice(self: T, slobj: slice, axis: int = ...) -> T: ...
  def _rebuild_blknos_and_blklocs(self) -> None: ...
+
+ class BlockValuesRefs:
+ referenced_blocks: list[weakref.ref]
+ def __init__(self, blk: SharedBlock) -> None: ...
+ def add_reference(self, blk: SharedBlock) -> None: ...
+ def has_reference(self) -> bool: ...
