
ENH: Handle numpy-like arrays with is_list_like #39830


Closed
wants to merge 30 commits into from
10f1bde
PERF: is_list_like
jbrockmendel Feb 16, 2021
b8856be
typo fixup
jbrockmendel Feb 16, 2021
60e1008
isort fixup
jbrockmendel Feb 16, 2021
dda6588
isort fixup
jbrockmendel Feb 16, 2021
f36b11a
CLN: Better method of determining read-only status of openpyxl worksh…
rhshadrach Feb 15, 2021
b758123
TYP: tidy comments for # type: ignore (#39794)
simonjayhawkins Feb 15, 2021
ecf44fd
TYP: np.ndarray does not yet accept type parameters (#39792)
simonjayhawkins Feb 15, 2021
b190231
TYP: fix mypy errors in pandas/core/arraylike.py (#39104)
ivanovmg Feb 15, 2021
d7705e8
typing refactor (#39812)
attack68 Feb 15, 2021
8f48f22
[ArrayManager] Indexing - implement iset (#39734)
jorisvandenbossche Feb 15, 2021
fd1c797
DEP: bump min version of openpyxl to 3.0.0 #39603 (#39702)
fangchenli Feb 15, 2021
cc6597e
REF: Dispatch TimedeltaBlock.fillna to TimedeltaArray (#39811)
jbrockmendel Feb 15, 2021
b5b970f
REF: put Block replace methods together (#39810)
jbrockmendel Feb 15, 2021
f16e5fb
move validate_rst_title_capitalization to pre-commit (#39779)
MarcoGorelli Feb 15, 2021
c433dc3
API: transform behaves differently with 'ffill' on DataFrameGroupBy a…
ftrihardjo Feb 15, 2021
15ec5e3
TST: fixturize indexing intervalindex tests (#39803)
jbrockmendel Feb 15, 2021
d819f65
DOC: skip evaluation of code in v0.8.0 release notes (#39801)
afeld Feb 15, 2021
a1173dc
Regression in to_excel when setting duplicate column names (#39800)
phofl Feb 15, 2021
b77d3fa
DOC: Add reference to Text Extensions for Pandas project (#39783)
frreiss Feb 15, 2021
7238b95
REF: share DTBlock/TDBlock _maybe_coerce_values (#39815)
jbrockmendel Feb 15, 2021
5cbafe4
CI: upload coverage report to Codecov (#39822)
fangchenli Feb 15, 2021
f4de260
DOC: Ban mutation in UDF methods (#39762)
rhshadrach Feb 15, 2021
5092a07
CLN: remove redundant openpyxl type conversions (#39782)
ahawryluk Feb 15, 2021
6646f7c
TST/REF: split/collect large tests (#39789)
jbrockmendel Feb 15, 2021
46fb34c
CLN: Remove "how" return from agg (#39786)
rhshadrach Feb 15, 2021
126f406
DEPR: casting date to dt64 in maybe_promote (#39767)
jbrockmendel Feb 15, 2021
8a726f0
TST: split large tests (#39768)
jbrockmendel Feb 15, 2021
4ea0473
BUG: incorrectly accepting datetime64(nat) for dt64tz (#39769)
jbrockmendel Feb 16, 2021
6c13843
Update is_list_like
znicholls Feb 16, 2021
015e1e8
Add and pass tests
znicholls Feb 17, 2021
6 changes: 6 additions & 0 deletions .github/workflows/ci.yml
Original file line number Diff line number Diff line change
Expand Up @@ -153,3 +153,9 @@ jobs:
run: |
source activate pandas-dev
pytest pandas/tests/frame/methods --array-manager

# indexing iset related (temporary since other tests don't pass yet)
pytest pandas/tests/frame/indexing/test_indexing.py::TestDataFrameIndexing::test_setitem_multi_index --array-manager
pytest pandas/tests/frame/indexing/test_setitem.py::TestDataFrameSetItem::test_setitem_listlike_indexer_duplicate_columns --array-manager
pytest pandas/tests/indexing/multiindex/test_setitem.py::TestMultiIndexSetItem::test_astype_assignment_with_dups --array-manager
pytest pandas/tests/indexing/multiindex/test_setitem.py::TestMultiIndexSetItem::test_frame_setitem_multi_column --array-manager
8 changes: 8 additions & 0 deletions .github/workflows/database.yml
Expand Up @@ -170,3 +170,11 @@ jobs:

- name: Print skipped tests
run: python ci/print_skipped.py

- name: Upload coverage to Codecov
uses: codecov/codecov-action@v1
with:
files: /tmp/test_coverage.xml
flags: unittests
name: codecov-pandas
fail_ci_if_error: true
6 changes: 6 additions & 0 deletions .pre-commit-config.yaml
Expand Up @@ -180,6 +180,12 @@ repos:
language: pygrep
types: [python]
files: ^pandas/tests/
- id: title-capitalization
name: Validate correct capitalization among titles in documentation
entry: python scripts/validate_rst_title_capitalization.py
language: python
types: [rst]
files: ^doc/source/(development|reference)/
- repo: https://github.com/asottile/yesqa
rev: v1.2.2
hooks:
Expand Down
42 changes: 42 additions & 0 deletions asv_bench/benchmarks/libs.py
@@ -0,0 +1,42 @@
"""
Benchmarks for code in pandas/_libs, excluding pandas/_libs/tslibs,
which has its own directory
"""
import numpy as np

from pandas._libs.lib import (
    is_list_like,
    is_scalar,
)

from pandas import (
    NA,
    NaT,
)

# TODO: share with something in pd._testing?
scalars = [
    0,
    1.0,
    1 + 2j,
    True,
    "foo",
    b"bar",
    None,
    np.datetime64(123, "ns"),
    np.timedelta64(123, "ns"),
    NaT,
    NA,
]
zero_dims = [np.array("123")]
listlikes = [np.array([1, 2, 3]), {0: "foo"}, {1, 2, 3}, [1, 2, 3], (1, 2, 3)]


class ScalarListLike:
    params = scalars + zero_dims + listlikes

    def time_is_list_like(self, param):
        is_list_like(param)

    def time_is_scalar(self, param):
        is_scalar(param)
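For readers unfamiliar with asv-style parametrized benchmarks, the following is a rough sketch of how a class with `params` and `time_*` methods ends up being timed once per parameter. It is not asv itself: `run_benchmark` is a hypothetical mini-runner built on `timeit`, and `is_list_like_py` is a pure-Python stand-in (an assumption) for the Cython `is_list_like` so that the snippet runs without pandas.

```python
import timeit
from collections import abc


def is_list_like_py(obj):
    # Pure-Python stand-in (an assumption) for pandas._libs.lib.is_list_like.
    return isinstance(obj, abc.Iterable) and not isinstance(obj, (str, bytes))


class ScalarListLikeSketch:
    # Same shape as the asv class above: one timing per entry in `params`.
    params = [0, 1.0, "foo", b"bar", None, {0: "foo"}, {1, 2, 3}, [1, 2, 3], (1, 2, 3)]

    def time_is_list_like(self, param):
        is_list_like_py(param)


def run_benchmark(bench_cls, number=10_000):
    # Hypothetical mini-runner: asv discovers `time_*` methods and calls
    # each one per parameter; this loop only imitates that behaviour.
    bench = bench_cls()
    results = {}
    for param in bench_cls.params:
        elapsed = timeit.timeit(lambda: bench.time_is_list_like(param), number=number)
        results[repr(param)] = elapsed / number
    return results


timings = run_benchmark(ScalarListLikeSketch)
for label, seconds in timings.items():
    print(f"{label:>15}: {seconds * 1e9:8.1f} ns per call")
```

The absolute numbers depend on the machine; the point is only that scalars, zero-dimensional arrays, and list-likes are each measured separately.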
4 changes: 0 additions & 4 deletions ci/code_checks.sh
Expand Up @@ -233,10 +233,6 @@ if [[ -z "$CHECK" || "$CHECK" == "docstrings" ]]; then
$BASE_DIR/scripts/validate_docstrings.py --format=actions --errors=GL03,GL04,GL05,GL06,GL07,GL09,GL10,SS02,SS04,SS05,PR03,PR04,PR05,PR10,EX04,RT01,RT04,RT05,SA02,SA03
RET=$(($RET + $?)) ; echo $MSG "DONE"

MSG='Validate correct capitalization among titles in documentation' ; echo $MSG
$BASE_DIR/scripts/validate_rst_title_capitalization.py $BASE_DIR/doc/source/development $BASE_DIR/doc/source/reference
RET=$(($RET + $?)) ; echo $MSG "DONE"

fi

### TYPING ###
Expand Down
2 changes: 1 addition & 1 deletion ci/deps/azure-37-locale_slow.yaml
Expand Up @@ -18,7 +18,7 @@ dependencies:
- lxml
- matplotlib=3.0.0
- numpy=1.16.*
- openpyxl=2.6.0
- openpyxl=3.0.0
- python-dateutil
- python-blosc
- pytz=2017.3
Expand Down
2 changes: 1 addition & 1 deletion ci/deps/azure-37-minimum_versions.yaml
Expand Up @@ -19,7 +19,7 @@ dependencies:
- numba=0.46.0
- numexpr=2.6.8
- numpy=1.16.5
- openpyxl=2.6.0
- openpyxl=3.0.0
- pytables=3.5.1
- python-dateutil=2.7.3
- pytz=2017.3
Expand Down
8 changes: 8 additions & 0 deletions doc/source/ecosystem.rst
Expand Up @@ -476,6 +476,14 @@ storing numeric arrays with units. These arrays can be stored inside pandas'
Series and DataFrame. Operations between Series and DataFrame columns which
use pint's extension array are then units aware.

`Text Extensions for Pandas`_
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

``Text Extensions for Pandas <https://ibm.biz/text-extensions-for-pandas>``
provides extension types to cover common data structures for representing natural language
data, plus library integrations that convert the outputs of popular natural language
processing libraries into Pandas DataFrames.

.. _ecosystem.accessors:

Accessors
Expand Down
2 changes: 1 addition & 1 deletion doc/source/getting_started/install.rst
Expand Up @@ -274,7 +274,7 @@ html5lib 1.0.1 HTML parser for read_html (see :ref
lxml 4.3.0 HTML parser for read_html (see :ref:`note <optional_html>`)
matplotlib 2.2.3 Visualization
numba 0.46.0 Alternative execution engine for rolling operations
openpyxl 2.6.0 Reading / writing for xlsx files
openpyxl 3.0.0 Reading / writing for xlsx files
pandas-gbq 0.12.0 Google Big Query access
psycopg2 2.7 PostgreSQL engine for sqlalchemy
pyarrow 0.15.0 Parquet, ORC, and feather reading / writing
Expand Down
69 changes: 69 additions & 0 deletions doc/source/user_guide/gotchas.rst
Expand Up @@ -178,6 +178,75 @@ To test for membership in the values, use the method :meth:`~pandas.Series.isin`
For ``DataFrames``, likewise, ``in`` applies to the column axis,
testing for membership in the list of column names.

.. _udf-mutation:

Mutating with User Defined Function (UDF) methods
-------------------------------------------------

It is a general rule in programming that one should not mutate a container
while it is being iterated over. Mutation will invalidate the iterator,
causing unexpected behavior. Consider the example:

.. ipython:: python

   values = [0, 1, 2, 3, 4, 5]
   n_removed = 0
   for k, value in enumerate(values):
       idx = k - n_removed
       if value % 2 == 1:
           del values[idx]
           n_removed += 1
       else:
           values[idx] = value + 1
   values

One probably would have expected that the result would be ``[1, 3, 5]``.
When using a pandas method that takes a UDF, internally pandas is often
iterating over the
``DataFrame`` or other pandas object. Therefore, if the UDF mutates (changes)
the ``DataFrame``, unexpected behavior can arise.

Here is a similar example with :meth:`DataFrame.apply`:

.. ipython:: python

   def f(s):
       s.pop("a")
       return s

   df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
   try:
       df.apply(f, axis="columns")
   except Exception as err:
       print(repr(err))

To resolve this issue, one can make a copy so that the mutation does
not apply to the container being iterated over.

.. ipython:: python

   values = [0, 1, 2, 3, 4, 5]
   n_removed = 0
   for k, value in enumerate(values.copy()):
       idx = k - n_removed
       if value % 2 == 1:
           del values[idx]
           n_removed += 1
       else:
           values[idx] = value + 1
   values

.. ipython:: python

   def f(s):
       s = s.copy()
       s.pop("a")
       return s

   df = pd.DataFrame({"a": [1, 2, 3], 'b': [4, 5, 6]})
   df.apply(f, axis="columns")


``NaN``, Integer ``NA`` values and ``NA`` type promotions
---------------------------------------------------------

Expand Down
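As a side note to the list example in the `gotchas.rst` addition, one way to sidestep the problem entirely is to build a new list rather than edit the one being iterated. This is only an illustrative sketch, not part of the PR: it keeps the even entries and adds one to each, which is what the loop in the documentation appears to be aiming for.

```python
values = [0, 1, 2, 3, 4, 5]

# Build a new list instead of editing `values` while looping over it:
# keep the even entries and add one to each.
result = [value + 1 for value in values if value % 2 == 0]

print(result)  # [1, 3, 5]
print(values)  # the input list is left unchanged
```

The same idea carries over to UDFs: returning a new object instead of mutating the argument avoids invalidating whatever pandas is iterating over internally.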
2 changes: 1 addition & 1 deletion doc/source/whatsnew/v0.8.0.rst
Expand Up @@ -176,7 +176,7 @@ New plotting methods
Vytautas Jancauskas, the 2012 GSOC participant, has added many new plot
types. For example, ``'kde'`` is a new option:

.. ipython:: python
.. code-block:: python

s = pd.Series(
np.concatenate((np.random.randn(1000), np.random.randn(1000) * 0.5 + 3))
Expand Down
2 changes: 1 addition & 1 deletion doc/source/whatsnew/v1.2.3.rst
Expand Up @@ -15,7 +15,7 @@ including other versions of pandas.
Fixed regressions
~~~~~~~~~~~~~~~~~

-
- Fixed regression in :func:`pandas.to_excel` raising ``KeyError`` when giving duplicate columns with ``columns`` attribute (:issue:`39695`)
-

.. ---------------------------------------------------------------------------
Expand Down
6 changes: 4 additions & 2 deletions doc/source/whatsnew/v1.3.0.rst
Expand Up @@ -186,7 +186,7 @@ Optional libraries below the lowest tested version may still work, but are not c
+-----------------+-----------------+---------+
| numba | 0.46.0 | |
+-----------------+-----------------+---------+
| openpyxl | 2.6.0 | |
| openpyxl | 3.0.0 | X |
+-----------------+-----------------+---------+
| pyarrow | 0.15.0 | |
+-----------------+-----------------+---------+
Expand Down Expand Up @@ -239,7 +239,7 @@ Deprecations
- Deprecated :attr:`Rolling.is_datetimelike` (:issue:`38963`)
- Deprecated :meth:`core.window.ewm.ExponentialMovingWindow.vol` (:issue:`39220`)
- Using ``.astype`` to convert between ``datetime64[ns]`` dtype and :class:`DatetimeTZDtype` is deprecated and will raise in a future version, use ``obj.tz_localize`` or ``obj.dt.tz_localize`` instead (:issue:`38622`)
-
- Deprecated casting ``datetime.date`` objects to ``datetime64`` when used as ``fill_value`` in :meth:`DataFrame.unstack`, :meth:`DataFrame.shift`, :meth:`Series.shift`, and :meth:`DataFrame.reindex`, pass ``pd.Timestamp(dateobj)`` instead (:issue:`39767`)

.. ---------------------------------------------------------------------------

Expand Down Expand Up @@ -346,7 +346,9 @@ Indexing
- Bug in setting ``timedelta64`` or ``datetime64`` values into numeric :class:`Series` failing to cast to object dtype (:issue:`39086`, issue:`39619`)
- Bug in setting :class:`Interval` values into a :class:`Series` or :class:`DataFrame` with mismatched :class:`IntervalDtype` incorrectly casting the new values to the existing dtype (:issue:`39120`)
- Bug in setting ``datetime64`` values into a :class:`Series` with integer-dtype incorrect casting the datetime64 values to integers (:issue:`39266`)
- Bug in setting ``np.datetime64("NaT")`` into a :class:`Series` with :class:`Datetime64TZDtype` incorrectly treating the timezone-naive value as timezone-aware (:issue:`39769`)
- Bug in :meth:`Index.get_loc` not raising ``KeyError`` when method is specified for ``NaN`` value when ``NaN`` is not in :class:`Index` (:issue:`39382`)
- Bug in :meth:`DatetimeIndex.insert` when inserting ``np.datetime64("NaT")`` into a timezone-aware index incorrectly treating the timezone-naive value as timezone-aware (:issue:`39769`)
- Bug in incorrectly raising in :meth:`Index.insert`, when setting a new column that cannot be held in the existing ``frame.columns``, or in :meth:`Series.reset_index` or :meth:`DataFrame.reset_index` instead of casting to a compatible dtype (:issue:`39068`)
- Bug in :meth:`RangeIndex.append` where a single object of length 1 was concatenated incorrectly (:issue:`39401`)
- Bug in setting ``numpy.timedelta64`` values into an object-dtype :class:`Series` using a boolean indexer (:issue:`39488`)
Expand Down
8 changes: 6 additions & 2 deletions pandas/_libs/lib.pyx
Expand Up @@ -1044,11 +1044,15 @@ def is_list_like(obj: object, allow_sets: bool = True) -> bool:

 cdef inline bint c_is_list_like(object obj, bint allow_sets) except -1:
     return (
-        isinstance(obj, abc.Iterable)
+        # equiv: `isinstance(obj, abc.Iterable)`
+        hasattr(obj, "__iter__") and not isinstance(obj, type)
         # we do not count strings/unicode/bytes as list-like
         and not isinstance(obj, (str, bytes))
         # exclude zero-dimensional numpy arrays, effectively scalars
-        and not (util.is_array(obj) and obj.ndim == 0)
+        and not cnp.PyArray_IsZeroDim(obj)
+        # extra check for numpy-like objects which aren't captured by
+        # the above
+        and not (hasattr(obj, "ndim") and obj.ndim == 0)
         # exclude sets if allow_sets is False
         and not (allow_sets is False and isinstance(obj, abc.Set))
     )
Expand Down
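To make the effect of the `lib.pyx` change easier to follow, here is a pure-Python approximation of the new predicate. It is a sketch under assumptions: the real code is Cython and additionally calls `cnp.PyArray_IsZeroDim`, which is omitted here, and `NumpyLikeSketch` is a hypothetical class standing in for any array wrapper that exposes `__iter__` and `ndim` without being an `ndarray`.

```python
from collections import abc


def is_list_like_sketch(obj, allow_sets=True):
    # Pure-Python approximation of the predicate in the diff; the real
    # implementation is Cython and also calls cnp.PyArray_IsZeroDim.
    return (
        hasattr(obj, "__iter__")
        and not isinstance(obj, type)
        and not isinstance(obj, (str, bytes))
        # duck-typed zero-dimensional check for numpy-like objects
        and not (hasattr(obj, "ndim") and obj.ndim == 0)
        and not (allow_sets is False and isinstance(obj, abc.Set))
    )


class NumpyLikeSketch:
    # Hypothetical array wrapper: defines __iter__ and ndim, not an ndarray.
    def __init__(self, values, ndim):
        self._values = values
        self.ndim = ndim

    def __iter__(self):
        return iter(self._values)


print(is_list_like_sketch(NumpyLikeSketch([1, 2, 3], ndim=1)))  # True
print(is_list_like_sketch(NumpyLikeSketch([5], ndim=0)))        # False
print(is_list_like_sketch("foo"))                               # False
print(is_list_like_sketch({1, 2, 3}, allow_sets=False))         # False
print(is_list_like_sketch(list))                                # False
```

The last call shows why the `not isinstance(obj, type)` clause is there: a class such as `list` has an `__iter__` attribute, but the class object itself is not iterable.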
2 changes: 1 addition & 1 deletion pandas/_testing/__init__.py
Expand Up @@ -559,7 +559,7 @@ def makeCustomIndex(
"p": makePeriodIndex,
}.get(idx_type)
if idx_func:
# pandas\_testing.py:2120: error: Cannot call function of unknown type
# error: Cannot call function of unknown type
idx = idx_func(nentries) # type: ignore[operator]
# but we need to fill in the name
if names:
Expand Down
8 changes: 3 additions & 5 deletions pandas/_testing/_io.py
Expand Up @@ -82,9 +82,8 @@ def dec(f):
is_decorating = not kwargs and len(args) == 1 and callable(args[0])
if is_decorating:
f = args[0]
# pandas\_testing.py:2331: error: Incompatible types in assignment
# (expression has type "List[<nothing>]", variable has type
# "Tuple[Any, ...]")
# error: Incompatible types in assignment (expression has type
# "List[<nothing>]", variable has type "Tuple[Any, ...]")
args = [] # type: ignore[assignment]
return dec(f)
else:
Expand Down Expand Up @@ -205,8 +204,7 @@ def wrapper(*args, **kwargs):
except Exception as err:
errno = getattr(err, "errno", None)
if not errno and hasattr(errno, "reason"):
# pandas\_testing.py:2521: error: "Exception" has no attribute
# "reason"
# error: "Exception" has no attribute "reason"
errno = getattr(err.reason, "errno", None) # type: ignore[attr-defined]

if errno in skip_errnos:
Expand Down
2 changes: 1 addition & 1 deletion pandas/compat/_optional.py
Expand Up @@ -17,7 +17,7 @@
"matplotlib": "2.2.3",
"numexpr": "2.6.8",
"odfpy": "1.3.0",
"openpyxl": "2.6.0",
"openpyxl": "3.0.0",
"pandas_gbq": "0.12.0",
"pyarrow": "0.15.0",
"pytest": "5.0.1",
Expand Down
8 changes: 8 additions & 0 deletions pandas/conftest.py
Expand Up @@ -1565,6 +1565,14 @@ def indexer_si(request):
return request.param


@pytest.fixture(params=[tm.setitem, tm.loc])
def indexer_sl(request):
    """
    Parametrize over __setitem__, loc.__setitem__
    """
    return request.param


@pytest.fixture
def using_array_manager(request):
"""
Expand Down
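The `indexer_sl` fixture lets one test body exercise both `obj[key] = value` and `obj.loc[key] = value`. The sketch below imitates that pattern without pytest or pandas. It assumes that `tm.setitem` returns the object itself and `tm.loc` returns its `.loc` accessor; `ToySeries` and `LocAccessorSketch` are hypothetical stand-ins, not pandas classes.

```python
class LocAccessorSketch:
    # Hypothetical stand-in for the `.loc` accessor of a Series.
    def __init__(self, data):
        self._data = data

    def __setitem__(self, key, value):
        self._data[key] = value


class ToySeries:
    # Hypothetical container supporting obj[key] = v and obj.loc[key] = v.
    def __init__(self, data):
        self.data = dict(data)
        self.loc = LocAccessorSketch(self.data)

    def __setitem__(self, key, value):
        self.data[key] = value


def setitem(obj):
    # Assumed behaviour of tm.setitem: index the object directly.
    return obj


def loc(obj):
    # Assumed behaviour of tm.loc: go through the .loc accessor.
    return obj.loc


results = []
for indexer_sl in (setitem, loc):  # what the fixture's params iterate over
    ser = ToySeries({"a": 1, "b": 2})
    indexer_sl(ser)["a"] = 10      # same test body for both access paths
    results.append(ser.data)

print(results)
```

In a real test, pytest would run the function once per fixture parameter instead of the explicit `for` loop used here.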
2 changes: 1 addition & 1 deletion pandas/core/algorithms.py
Expand Up @@ -2208,7 +2208,7 @@ def _sort_mixed(values):
return np.concatenate([nums, np.asarray(strs, dtype=object)])


def _sort_tuples(values: np.ndarray[tuple]):
def _sort_tuples(values: np.ndarray):
"""
Convert array of tuples (1d) to array or array (2d).
We need to keep the columns separately as they contain different types and
Expand Down
22 changes: 8 additions & 14 deletions pandas/core/apply.py
Expand Up @@ -147,18 +147,14 @@ def index(self) -> Index:
def apply(self) -> FrameOrSeriesUnion:
pass

-    def agg(self) -> Tuple[Optional[FrameOrSeriesUnion], Optional[bool]]:
+    def agg(self) -> Optional[FrameOrSeriesUnion]:
         """
         Provide an implementation for the aggregators.

         Returns
         -------
-        tuple of result, how.
-
-        Notes
-        -----
-        how can be a string describe the required post-processing, or
-        None if not required.
+        Result of aggregation, or None if agg cannot be performed by
+        this method.
         """
         obj = self.obj
         arg = self.f
Expand All @@ -171,23 +167,21 @@ def agg(self) -> Tuple[Optional[FrameOrSeriesUnion], Optional[bool]]:

         result = self.maybe_apply_str()
         if result is not None:
-            return result, None
+            return result

         if is_dict_like(arg):
-            return self.agg_dict_like(_axis), True
+            return self.agg_dict_like(_axis)
         elif is_list_like(arg):
             # we require a list, but not a 'str'
-            return self.agg_list_like(_axis=_axis), None
-        else:
-            result = None
+            return self.agg_list_like(_axis=_axis)

         if callable(arg):
             f = obj._get_cython_func(arg)
             if f and not args and not kwargs:
-                return getattr(obj, f)(), None
+                return getattr(obj, f)()

         # caller can react
-        return result, True
+        return None

def agg_list_like(self, _axis: int) -> FrameOrSeriesUnion:
"""
Expand Down
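The `agg` change in `pandas/core/apply.py` replaces a `(result, how)` tuple with a plain `Optional` result, where `None` means "this method could not handle the argument". A toy illustration of that calling convention follows. The names `agg_sketch` and `aggregate_sketch` and the fallback logic are hypothetical and do not mirror the actual pandas call sites.

```python
from typing import Callable, Optional


def agg_sketch(arg) -> Optional[list]:
    # Toy analogue of the refactored method: return the result directly,
    # or None when this code path cannot handle `arg`.
    if isinstance(arg, dict):
        return sorted(arg)
    if isinstance(arg, (list, tuple)):
        return list(arg)
    return None


def aggregate_sketch(arg, fallback: Callable):
    # Hypothetical caller: one None check replaces unpacking (result, how).
    result = agg_sketch(arg)
    if result is None:
        return fallback(arg)
    return result


print(aggregate_sketch(["sum", "mean"], fallback=lambda a: [a]))
print(aggregate_sketch({"b": "sum", "a": "mean"}, fallback=lambda a: [a]))
print(aggregate_sketch("sum", fallback=lambda a: [a]))
```

One consequence of this convention is that `None` can no longer be a legitimate aggregation result on this code path, which seems acceptable here since the diff documents `None` as the "cannot be performed" signal.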