BUG: .transform(ngroup) for axis=1 grouper #50761

iofall · 2023-01-15T18:13:14Z

Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

iofall · 2023-01-15T18:14:51Z

pandas/core/groupby/groupby.py

@@ -3369,7 +3369,7 @@ def ngroup(self, ascending: bool = True):
        dtype: int64
        """
        with self._group_selection_context():
-            index = self._selected_obj.index
+            index = self._selected_obj._get_axis(self.axis)


Get index based on self.axis instead of getting the default axis=0 index.

iofall · 2023-01-15T18:21:31Z

About the tests, they do pass for me when I run:

pytest pandas\tests\groupby -n 4 -m "not slow and not network and not db and not single_cpu" -r sxX

But get a failure for:

pytest pandas\tests\groupby -n 4

It fails for test_value_counts.py with the message:

FAILED pandas/tests/groupby/test_value_counts.py::test_series_groupby_value_counts[False-False-False-False-False-['1st', '2nd']-1000-203] - KeyError: <class 'numpy.intc'>
768 failed, 17098 passed, 402 skipped, 226 xfailed

I am on Windows 10.

rhshadrach · 2023-01-16T21:37:53Z

Changes look good - can you post a stack trace for the failing test? You should just be able to run

pytest pandas/tests/groupby/test_value_counts.py::test_series_groupby_value_counts[False-False-False-False-False-['1st', '2nd']-1000-203]

You may need to switch the orientation of the slashes.

Also - can you see if you get the same failure on main and 1.5.x?

iofall · 2023-01-21T11:54:42Z

pytest pandas/tests/groupby/test_value_counts.py::test_series_groupby_value_counts[False-False-False-False-False-['1st', '2nd']-1000-203]

Sorry for the late reply. The above command needs to be enclosed in "" as there is a space in between. Here is the output:

================================================================================= test session starts =================================================================================
platform win32 -- Python 3.8.15, pytest-7.2.0, pluggy-1.0.0 -- C:\Users\Me\mambaforge\envs\pandas-dev\python.exe
cachedir: .pytest_cache
hypothesis profile 'ci' -> deadline=None, suppress_health_check=[HealthCheck.too_slow], database=DirectoryBasedExampleDatabase('C:\\Users\\Me\\Desktop\\pandas\\.hypothesis\\examples') 
rootdir: C:\Users\Me\Desktop\pandas, configfile: pyproject.toml
plugins: anyio-3.6.2, hypothesis-6.59.0, asyncio-0.20.2, cov-4.0.0, cython-0.2.0, xdist-3.0.2
asyncio: mode=strict
collected 1 item

pandas/tests/groupby/test_value_counts.py::test_series_groupby_value_counts[False-False-False-False-False-['1st', '2nd']-1000-203] FAILED

====================================================================================== FAILURES ======================================================================================= 
_______________________________________________ test_series_groupby_value_counts[False-False-False-False-False-['1st', '2nd']-1000-203] _______________________________________________

df =     1st        2nd  3rd
0     d 2015-08-31   17
1     d 2015-08-26   18
2     c 2015-08-27    1
3     b 2015-08-29   1...96   b 2015-08-28    9
997   c 2015-08-26   15
998   a 2015-08-27   16
999   c 2015-08-29    2

[1000 rows x 3 columns]
keys = ['1st', '2nd'], bins = array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18, 20]), n = 1000, m = 20, isort = False, normalize = False, sort = False, ascending = False, dropna = False  

    @pytest.mark.slow
    @pytest.mark.parametrize("df, keys, bins, n, m", binned, ids=ids)
    @pytest.mark.parametrize("isort", [True, False])
    @pytest.mark.parametrize("normalize", [True, False])
    @pytest.mark.parametrize("sort", [True, False])
    @pytest.mark.parametrize("ascending", [True, False])
    @pytest.mark.parametrize("dropna", [True, False])
    def test_series_groupby_value_counts(
        df, keys, bins, n, m, isort, normalize, sort, ascending, dropna
    ):
        def rebuild_index(df):
            arr = list(map(df.index.get_level_values, range(df.index.nlevels)))
            df.index = MultiIndex.from_arrays(arr, names=df.index.names)
            return df

        kwargs = {
            "normalize": normalize,
            "sort": sort,
            "ascending": ascending,
            "dropna": dropna,
            "bins": bins,
        }

        gr = df.groupby(keys, sort=isort)
>       left = gr["3rd"].value_counts(**kwargs)

pandas\tests\groupby\test_value_counts.py:109:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
pandas\core\groupby\generic.py:771: in value_counts
    _, idx = get_join_indexers(left, right, sort=False, how="left")
pandas\core\reshape\merge.py:1653: in get_join_indexers
    zipped = zip(*mapped)
pandas\core\reshape\merge.py:1650: in <genexpr>
    _factorize_keys(left_keys[n], right_keys[n], sort=sort, how=how)
pandas\core\reshape\merge.py:2373: in _factorize_keys
    klass, lk, rk = _convert_arrays_and_get_rizer_klass(lk, rk)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

lk = array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1,
       2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4,..., 0, 1, 2, 3,
       4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5,
       6, 7, 8, 9], dtype=int32)
rk = array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2,
       3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 4, 5, 6,...0, 2, 4, 5, 7, 8, 1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4,
       5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8], dtype=int32)

    def _convert_arrays_and_get_rizer_klass(
        lk: ArrayLike, rk: ArrayLike
    ) -> tuple[type[libhashtable.Factorizer], ArrayLike, ArrayLike]:
        klass: type[libhashtable.Factorizer]
        if is_numeric_dtype(lk.dtype):
            if not is_dtype_equal(lk, rk):
                dtype = find_common_type([lk.dtype, rk.dtype])
                if isinstance(dtype, ExtensionDtype):
                    cls = dtype.construct_array_type()
                    if not isinstance(lk, ExtensionArray):
                        lk = cls._from_sequence(lk, dtype=dtype, copy=False)
                    else:
                        lk = lk.astype(dtype)

                    if not isinstance(rk, ExtensionArray):
                        rk = cls._from_sequence(rk, dtype=dtype, copy=False)
                    else:
                        rk = rk.astype(dtype)
                else:
                    lk = lk.astype(dtype)
                    rk = rk.astype(dtype)
            if isinstance(lk, BaseMaskedArray):
                #  Invalid index type "type" for "Dict[Type[object], Type[Factorizer]]";
                #  expected type "Type[object]"
                klass = _factorizers[lk.dtype.type]  # type: ignore[index]
            else:
>               klass = _factorizers[lk.dtype.type]
E               KeyError: <class 'numpy.intc'>

pandas\core\reshape\merge.py:2440: KeyError
------------------------------------------------------------ generated xml file: C:\Users\Me\Desktop\pandas\test-data.xml ------------------------------------------------------------- 
================================================================================ slowest 30 durations ================================================================================= 
0.01s call     pandas/tests/groupby/test_value_counts.py::test_series_groupby_value_counts[False-False-False-False-False-['1st', '2nd']-1000-203]
0.00s setup    pandas/tests/groupby/test_value_counts.py::test_series_groupby_value_counts[False-False-False-False-False-['1st', '2nd']-1000-203]
0.00s teardown pandas/tests/groupby/test_value_counts.py::test_series_groupby_value_counts[False-False-False-False-False-['1st', '2nd']-1000-203]
=============================================================================== short test summary info =============================================================================== 
FAILED pandas/tests/groupby/test_value_counts.py::test_series_groupby_value_counts[False-False-False-False-False-['1st', '2nd']-1000-203] - KeyError: <class 'numpy.intc'>
================================================================================== 1 failed in 0.72s ==================================================================================

It does not fail on v1.5.2 and v.1.5.3, but does fail for main.

rhshadrach · 2023-01-21T12:56:04Z

Thanks, this is almost certainly due to #50548. I've opened #50920 to track.

Edit: Actually, this is failing in the SeriesGroupBy specific implementation where bins is not None. I'm now doubting if it's #50548.

rhshadrach

This looks great - just needs a line in the whatsnew for 2.0 under the Bugfix - Groupby section.

@iofall - I merged main here to make sure recent changes in groupby didn't impact this. An unrelated failure due to the latest release of fsspec popped up. Merging main should fix everything(ish).

rhshadrach · 2023-01-24T21:44:45Z

Thanks @iofall - nice work!

Fix .transform(ngroup) for axis=1 grouper

b356c6a

iofall commented Jan 15, 2023

View reviewed changes

iofall changed the title ~~Fix .transform(ngroup) for axis=1 grouper~~ BUG: .transform(ngroup) for axis=1 grouper Jan 15, 2023

mroeschke requested a review from rhshadrach January 16, 2023 18:51

mroeschke added Groupby Apply Apply, Aggregate, Transform, Map labels Jan 17, 2023

rhshadrach mentioned this pull request Jan 21, 2023

REGR: SeriesGroupBy.value_counts failure on Windows #50920

Closed

Merge branch 'main' into ngroup-axis-1

bebcc6e

rhshadrach requested changes Jan 22, 2023

View reviewed changes

Add whatsnew for transform axis-1 ngroup bug fix

d7c8a58

iofall force-pushed the ngroup-axis-1 branch from c416592 to d7c8a58 Compare January 23, 2023 18:29

Fix merge conflict

3f3be30

rhshadrach approved these changes Jan 24, 2023

View reviewed changes

rhshadrach added this to the 2.0 milestone Jan 24, 2023

rhshadrach mentioned this pull request Jan 24, 2023

BUG: Some str reducers fail in Groupby.transform with axis=1 #45986

Closed

7 tasks

rhshadrach merged commit 33f4f7b into pandas-dev:main Jan 24, 2023

pooja-subramaniam pushed a commit to pooja-subramaniam/pandas that referenced this pull request Jan 25, 2023

BUG: .transform(ngroup) for axis=1 grouper (pandas-dev#50761)

c0a9447

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

BUG: .transform(ngroup) for axis=1 grouper #50761

BUG: .transform(ngroup) for axis=1 grouper #50761

Uh oh!

iofall commented Jan 15, 2023 •

edited by rhshadrach

Loading

Uh oh!

iofall Jan 15, 2023

Uh oh!

iofall commented Jan 15, 2023

Uh oh!

rhshadrach commented Jan 16, 2023 •

edited

Loading

Uh oh!

iofall commented Jan 21, 2023

Uh oh!

rhshadrach commented Jan 21, 2023 •

edited

Loading

Uh oh!

rhshadrach left a comment

Uh oh!

rhshadrach commented Jan 24, 2023

Uh oh!

Uh oh!

Uh oh!

BUG: .transform(ngroup) for axis=1 grouper #50761

BUG: .transform(ngroup) for axis=1 grouper #50761

Uh oh!

Conversation

iofall commented Jan 15, 2023 • edited by rhshadrach Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

iofall Jan 15, 2023

Choose a reason for hiding this comment

Uh oh!

iofall commented Jan 15, 2023

Uh oh!

rhshadrach commented Jan 16, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

iofall commented Jan 21, 2023

Uh oh!

rhshadrach commented Jan 21, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

rhshadrach left a comment

Choose a reason for hiding this comment

Uh oh!

rhshadrach commented Jan 24, 2023

Uh oh!

Uh oh!

iofall commented Jan 15, 2023 •

edited by rhshadrach

Loading

rhshadrach commented Jan 16, 2023 •

edited

Loading

rhshadrach commented Jan 21, 2023 •

edited

Loading