ENH: Implement StringArray.min / max #33351

dsaxton · 2020-04-07T02:03:12Z

tests added / passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

Using the new masked reductions from @jorisvandenbossche to implement these for StringArray. Part of #31746 but doesn't close because we're not adding sum here.

jorisvandenbossche

Thanks for looking at this!

pandas/core/arrays/string_.py

jorisvandenbossche · 2020-04-07T06:39:43Z

pandas/core/arrays/string_.py

        raise TypeError(f"Cannot perform reduction '{name}' with string dtype")

+    def min(self, axis=None, out=None, keepdims=False, skipna=True):


Is it possible to move all of axis, out and keepdims into **kwargs? (We will need to test this with numpy.)

This should probably also validate those keywords, if we want to accept them (and we should also test this if we are adding them)

Updated, assuming by test with numpy you mean the validation functions in /compat/numpy/function?

Yes, indeed, those validation functions can be used to check additional keywords.

In addition, the tests should cover this (that is, check that np.min(a) works, because that is the only reason we would add those keywords).
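To make the point about keywords concrete: for an object that is not an ndarray, np.min(a) ends up calling a.min(axis=..., out=...), so the method has to tolerate those arguments. The following is a minimal sketch in plain Python, not the pandas code. `ToyArray`, `NUMPY_DEFAULTS` and `validate_numpy_kwargs` are hypothetical names used only for illustration; the real validators live in pandas/compat/numpy/function.py and may behave differently.

```python
# Hypothetical stand-in for a numpy-compat validator: accept numpy's
# extra keywords, but only when they are left at their default values.
NUMPY_DEFAULTS = {"axis": None, "out": None, "keepdims": False}


def validate_numpy_kwargs(fname, kwargs):
    for key, value in kwargs.items():
        if key not in NUMPY_DEFAULTS:
            raise TypeError(f"{fname}() got an unexpected keyword argument '{key}'")
        if value is not NUMPY_DEFAULTS[key]:
            raise ValueError(f"the '{key}' parameter is not supported by {fname}()")


class ToyArray:
    """Toy container (not a pandas class); None marks a missing value."""

    def __init__(self, values):
        self._values = list(values)

    def min(self, skipna=True, **kwargs):
        # numpy-only keywords arrive through **kwargs and are validated.
        validate_numpy_kwargs("min", kwargs)
        present = [v for v in self._values if v is not None]
        if not present or (not skipna and len(present) != len(self._values)):
            return None
        return min(present)


arr = ToyArray(["b", "a", None])
# Roughly the call numpy makes on a non-ndarray object for np.min(arr).
print(arr.min(axis=None, out=None))  # a
```

Keeping skipna as the only named parameter keeps the pandas-facing signature small, while the numpy-only keywords are accepted solely so that the numpy entry point does not fail.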

dsaxton · 2020-04-07T22:04:52Z

@jorisvandenbossche What do you think of implementing these reductions at the PandasArray or BaseMaskedArray level (I figure they should probably be used at least for IntegerArray anyways, but if so then might as well make them inheritable by other masked arrays, e.g., BooleanArray)?

pandas/core/arrays/string_.py

jorisvandenbossche · 2020-04-08T10:01:43Z

What do you think of implementing these reductions at the PandasArray or BaseMaskedArray level (I figure they should probably be used at least for IntegerArray anyways, but if so then might as well make them inheritable by other masked arrays, e.g., BooleanArray)?

Yeah, we should indeed decide on this more generally (for all our internal EAs, so string/boolean/int, I think the older ones already have those methods). And also more broadly for all/most reductions (so also mean, sum, etc). But that can be discussed / done separately from this PR though. E.g., one question is whether we actually add this to the EA interface (so the base class EA), or only to our own EAs.
Do you want to open an issue for this?
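As a rough illustration of the idea under discussion (reductions defined once on a shared masked base class and inherited by the concrete arrays), here is a minimal sketch in plain Python. All class names (`ToyMaskedArray`, `ToyIntegerArray`, `ToyStringArray`) are hypothetical, `None` stands in for a missing-value result, and the skipna handling is an assumption about the intended semantics rather than a copy of the pandas implementation.

```python
class ToyMaskedArray:
    """Values plus a parallel boolean mask (True means missing)."""

    def __init__(self, values, mask):
        self._data = list(values)
        self._mask = list(mask)

    def _masked_reduce(self, func, skipna):
        # With skipna=False, any missing entry makes the result missing.
        if not skipna and any(self._mask):
            return None
        kept = [v for v, m in zip(self._data, self._mask) if not m]
        # An array with no valid entries also reduces to missing.
        return func(kept) if kept else None

    def min(self, skipna=True):
        return self._masked_reduce(min, skipna)

    def max(self, skipna=True):
        return self._masked_reduce(max, skipna)


class ToyIntegerArray(ToyMaskedArray):
    pass  # inherits min/max unchanged


class ToyStringArray(ToyMaskedArray):
    pass  # inherits min/max unchanged


ints = ToyIntegerArray([3, 0, 7], [False, True, False])
strs = ToyStringArray(["b", "", "a"], [False, True, False])
print(ints.min(), ints.max())  # 3 7
print(strs.min(), strs.max())  # a b
print(ints.max(skipna=False))  # None
```

The masked positions hold arbitrary filler values (0 and "" here); only the mask decides whether an entry takes part in the reduction.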

jreback · 2020-04-08T15:37:59Z

pandas/tests/extension/base/reduce.py

@@ -25,6 +25,13 @@ class BaseNoReduceTests(BaseReduceTests):

    @pytest.mark.parametrize("skipna", [True, False])
    def test_reduce_series_numeric(self, data, all_numeric_reductions, skipna):
+        if isinstance(data, pd.arrays.StringArray) and all_numeric_reductions in [


hmm, @TomAugspurger, @jbrockmendel is this the pattern we are using here for skips like this?

In the indexes tests, when we have a "this is tested in this other place" comment, we usually return/pass instead of calling pytest.skip.

It's not a perfect system, but I think of it as a way of distinguishing between "this test is skipped but would be nice to enable" vs "nothing to see here"

Returning None rather than skipping also keeps the test output cleaner, since we print skipped tests.

We typically refine this for a specific type by overriding this test to have custom behaviour in extension/test_strings.py

ok on the test names? @jorisvandenbossche

@dsaxton can you move this special case for string to test_strings.py? (There you can override this test method to do the correct thing for the string dtype.)
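The override pattern suggested in this thread can be sketched without pytest as follows. This is an illustrative toy, not the pandas test suite: `ToyStringArray`, its `reduce` method and `StringNoReduceTests` are hypothetical names, and the `op_name` argument stands in for the `all_numeric_reductions` fixture.

```python
class ToyStringArray:
    """Toy array that supports min/max but no other reductions."""

    def __init__(self, values):
        self._values = list(values)

    def reduce(self, name):
        if name == "min":
            return min(self._values)
        if name == "max":
            return max(self._values)
        raise TypeError(f"Cannot perform reduction '{name}' with string dtype")


class BaseNoReduceTests:
    """Generic suite: every reduction is expected to raise TypeError."""

    def test_reduce_series_numeric(self, data, op_name):
        try:
            data.reduce(op_name)
        except TypeError:
            return
        raise AssertionError(f"{op_name} did not raise")


class StringNoReduceTests(BaseNoReduceTests):
    """Type-specific suite, in the spirit of extension/test_strings.py."""

    def test_reduce_series_numeric(self, data, op_name):
        if op_name in ("min", "max"):
            # Covered by dedicated tests elsewhere; returning (rather than
            # skipping) keeps skip reports out of the test output.
            return None
        super().test_reduce_series_numeric(data, op_name)


data = ToyStringArray(["a", "b"])
suite = StringNoReduceTests()
for op in ("sum", "min", "max"):
    suite.test_reduce_series_numeric(data, op)  # passes silently for all three
```

The generic base class stays free of type checks, and the string-specific behaviour lives next to the other string tests.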

jreback

looks pretty good.

pandas/core/arrays/string_.py

dsaxton · 2020-04-08T18:51:34Z

Yeah, we should indeed decide on this more generally (for all our internal EAs, so string/boolean/int, I think the older ones already have those methods). And also more broadly for all/most reductions (so also mean, sum, etc). But that can be discussed / done separately from this PR though.

After annotating the StringArray methods mypy complains about a mismatch between the signatures with PandasArray. I think we could just replace the parent class implementations in this PR and that should still work / be a bit cleaner? (I'll make that update for now and can revert if there are drawbacks.)

pandas/core/arrays/numpy_.py

pandas/tests/arrays/string_/test_string.py

simonjayhawkins · 2020-04-24T13:46:04Z

i'm seeing a regression here...

>>> import numpy as np
>>> import pandas as pd
>>>
>>> pd.__version__
'1.1.0.dev0+1273.gfd6296309'
>>>
>>> arr = pd.array([np.inf, 100])
>>>
>>> np.max(arr)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<__array_function__ internals>", line 5, in amax
  File "C:\Users\simon\Anaconda3\envs\pandas-dev\lib\site-packages\numpy\core\fromnumeric.py", line 2667, in amax
    return _wrapreduction(a, np.maximum, 'max', axis, None, out,
  File "C:\Users\simon\Anaconda3\envs\pandas-dev\lib\site-packages\numpy\core\fromnumeric.py", line 88, in _wrapreduction
    return reduction(axis=axis, out=out, **passkwargs)
  File "C:\Users\simon\pandas\pandas\core\arrays\numpy_.py", line 362, in max
    nv.validate_max((), kwargs)
  File "C:\Users\simon\pandas\pandas\compat\numpy\function.py", line 70, in __call__
    validate_args_and_kwargs(
  File "C:\Users\simon\pandas\pandas\util\_validators.py", line 205, in validate_args_and_kwargs
    validate_kwargs(fname, kwargs, compat_args)
  File "C:\Users\simon\pandas\pandas\util\_validators.py", line 148, in validate_kwargs
    _check_for_invalid_keys(fname, kwargs, compat_args)
  File "C:\Users\simon\pandas\pandas\util\_validators.py", line 122, in _check_for_invalid_keys
    raise TypeError(f"{fname}() got an unexpected keyword argument '{bad_arg}'")
TypeError: max() got an unexpected keyword argument 'axis'
>>>

on master

>>> import numpy as np
>>> import pandas as pd
>>>
>>> pd.__version__
'1.1.0.dev0+1347.g428791c5e'
>>>
>>> arr = pd.array([np.inf, 100])
>>>
>>> np.max(arr)
inf
>>>

dsaxton · 2020-04-24T15:33:17Z

i'm seeing a regression here...

I think this is caused by validating all the kwargs. Any idea what the best approach is here, @jorisvandenbossche? It looks like only out and keepdims were validated before.

jorisvandenbossche · 2020-04-24T15:53:16Z

Yeah, that's because the validation function is not handling axis properly.
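The failure mode can be reproduced in isolation. The sketch below is a simplified, hypothetical stand-in for the key check that appears in the traceback (`check_kwargs` and the two dictionaries are illustrative names, not pandas API); it only shows why a table of accepted keywords that lacks axis rejects the call numpy makes, and it does not claim to be the fix applied in this PR.

```python
def check_kwargs(fname, kwargs, compat_args):
    """Reject any keyword that is not listed in compat_args."""
    for key in kwargs:
        if key not in compat_args:
            raise TypeError(f"{fname}() got an unexpected keyword argument '{key}'")


# What numpy passes along for np.max(arr), per the traceback above.
numpy_call_kwargs = {"axis": None, "out": None}

# Accepted keywords and their defaults, without and with axis.
without_axis = {"out": None, "keepdims": False}
with_axis = {"axis": None, "out": None, "keepdims": False}

try:
    check_kwargs("max", numpy_call_kwargs, without_axis)
except TypeError as err:
    print(err)  # max() got an unexpected keyword argument 'axis'

check_kwargs("max", numpy_call_kwargs, with_axis)  # no error
```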

jorisvandenbossche · 2020-04-24T15:55:07Z

pandas/tests/arrays/string_/test_string.py

+
+@pytest.mark.parametrize("method", ["min", "max"])
+def test_min_max_numpy(method):
+    arr = pd.Series(["a", "b", "c", None], dtype="string")


We should also test the actual array here (and not only a Series wrapping it); maybe parametrize over pd.array and pd.Series as the constructor.

jreback

lgtm. @jorisvandenbossche

jorisvandenbossche · 2020-04-25T08:02:00Z

Thanks @dsaxton !

dsaxton added 2 commits April 6, 2020 20:53

ENH: Implement StringArray.min / max

a470f02

Skip

a50a2c4

jorisvandenbossche reviewed Apr 7, 2020

dsaxton added 4 commits April 7, 2020 08:11

Merge remote-tracking branch 'upstream/master' into string-min-max

4e09c50

Update

6e8f0c5

Add numpy tests

34f8d5f

Merge remote-tracking branch 'upstream/master' into string-min-max

08ce4d5

jorisvandenbossche reviewed Apr 8, 2020

dsaxton added 2 commits April 8, 2020 09:18

No kwargs

fa13cec

Merge remote-tracking branch 'upstream/master' into string-min-max

64e6d19

jreback reviewed Apr 8, 2020

jreback requested changes Apr 8, 2020

jreback added the Enhancement, ExtensionArray, and Strings labels Apr 8, 2020

jreback added this to the 1.1 milestone Apr 8, 2020

jbrockmendel reviewed Apr 8, 2020

Type

d804e4b

dsaxton added 2 commits April 8, 2020 13:56

Move to PandasArray

5cc91e0

Return None

0c98b08

TomAugspurger reviewed Apr 10, 2020

jreback reviewed Apr 10, 2020

dsaxton added 3 commits April 10, 2020 13:30

Merge remote-tracking branch 'upstream/master' into string-min-max

df6ba29

Lint

3613f65

Merge remote-tracking branch 'upstream/master' into string-min-max

fd62963

jorisvandenbossche mentioned this pull request Apr 24, 2020

ENH: Implement IntegerArray.sum #33538

Merged

5 tasks

Merge remote-tracking branch 'upstream/master' into string-min-max

9083273

dsaxton added 2 commits April 24, 2020 10:23

Move test

208835c

Merge remote-tracking branch 'upstream/master' into string-min-max

77144d0

jorisvandenbossche reviewed Apr 24, 2020

dsaxton added 2 commits April 24, 2020 12:13

Parametrize test

9197125

Edit validator

3fcd200

jreback approved these changes Apr 24, 2020

jorisvandenbossche approved these changes Apr 25, 2020

jorisvandenbossche merged commit f49269f into pandas-dev:master Apr 25, 2020

dsaxton deleted the string-min-max branch April 25, 2020 13:31

rhshadrach pushed a commit to rhshadrach/pandas that referenced this pull request May 10, 2020

ENH: Implement StringArray.min / max (pandas-dev#33351)

30de091
