Skip to content

BUG: .transform(ngroup) for axis=1 grouper #50761

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Jan 24, 2023

Conversation

iofall
Copy link
Contributor

@iofall iofall commented Jan 15, 2023

Part of #45986

@@ -3369,7 +3369,7 @@ def ngroup(self, ascending: bool = True):
dtype: int64
"""
with self._group_selection_context():
index = self._selected_obj.index
index = self._selected_obj._get_axis(self.axis)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Get index based on self.axis instead of getting the default axis=0 index.

@iofall iofall changed the title Fix .transform(ngroup) for axis=1 grouper BUG: .transform(ngroup) for axis=1 grouper Jan 15, 2023
@iofall
Copy link
Contributor Author

iofall commented Jan 15, 2023

About the tests, they do pass for me when I run:

pytest pandas\tests\groupby -n 4 -m "not slow and not network and not db and not single_cpu" -r sxX

But get a failure for:

pytest pandas\tests\groupby -n 4

It fails for test_value_counts.py with the message:

FAILED pandas/tests/groupby/test_value_counts.py::test_series_groupby_value_counts[False-False-False-False-False-['1st', '2nd']-1000-203] - KeyError: <class 'numpy.intc'>
768 failed, 17098 passed, 402 skipped, 226 xfailed

I am on Windows 10.

@mroeschke mroeschke requested a review from rhshadrach January 16, 2023 18:51
@rhshadrach
Copy link
Member

rhshadrach commented Jan 16, 2023

Changes look good - can you post a stack trace for the failing test? You should just be able to run

pytest pandas/tests/groupby/test_value_counts.py::test_series_groupby_value_counts[False-False-False-False-False-['1st', '2nd']-1000-203]

You may need to switch the orientation of the slashes.

Also - can you see if you get the same failure on main and 1.5.x?

@mroeschke mroeschke added Groupby Apply Apply, Aggregate, Transform, Map labels Jan 17, 2023
@iofall
Copy link
Contributor Author

iofall commented Jan 21, 2023

pytest pandas/tests/groupby/test_value_counts.py::test_series_groupby_value_counts[False-False-False-False-False-['1st', '2nd']-1000-203]

Sorry for the late reply. The above command needs to be enclosed in "" as there is a space in between. Here is the output:

================================================================================= test session starts =================================================================================
platform win32 -- Python 3.8.15, pytest-7.2.0, pluggy-1.0.0 -- C:\Users\Me\mambaforge\envs\pandas-dev\python.exe
cachedir: .pytest_cache
hypothesis profile 'ci' -> deadline=None, suppress_health_check=[HealthCheck.too_slow], database=DirectoryBasedExampleDatabase('C:\\Users\\Me\\Desktop\\pandas\\.hypothesis\\examples') 
rootdir: C:\Users\Me\Desktop\pandas, configfile: pyproject.toml
plugins: anyio-3.6.2, hypothesis-6.59.0, asyncio-0.20.2, cov-4.0.0, cython-0.2.0, xdist-3.0.2
asyncio: mode=strict
collected 1 item

pandas/tests/groupby/test_value_counts.py::test_series_groupby_value_counts[False-False-False-False-False-['1st', '2nd']-1000-203] FAILED

====================================================================================== FAILURES ======================================================================================= 
_______________________________________________ test_series_groupby_value_counts[False-False-False-False-False-['1st', '2nd']-1000-203] _______________________________________________

df =     1st        2nd  3rd
0     d 2015-08-31   17
1     d 2015-08-26   18
2     c 2015-08-27    1
3     b 2015-08-29   1...96   b 2015-08-28    9
997   c 2015-08-26   15
998   a 2015-08-27   16
999   c 2015-08-29    2

[1000 rows x 3 columns]
keys = ['1st', '2nd'], bins = array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18, 20]), n = 1000, m = 20, isort = False, normalize = False, sort = False, ascending = False, dropna = False  

    @pytest.mark.slow
    @pytest.mark.parametrize("df, keys, bins, n, m", binned, ids=ids)
    @pytest.mark.parametrize("isort", [True, False])
    @pytest.mark.parametrize("normalize", [True, False])
    @pytest.mark.parametrize("sort", [True, False])
    @pytest.mark.parametrize("ascending", [True, False])
    @pytest.mark.parametrize("dropna", [True, False])
    def test_series_groupby_value_counts(
        df, keys, bins, n, m, isort, normalize, sort, ascending, dropna
    ):
        def rebuild_index(df):
            arr = list(map(df.index.get_level_values, range(df.index.nlevels)))
            df.index = MultiIndex.from_arrays(arr, names=df.index.names)
            return df

        kwargs = {
            "normalize": normalize,
            "sort": sort,
            "ascending": ascending,
            "dropna": dropna,
            "bins": bins,
        }

        gr = df.groupby(keys, sort=isort)
>       left = gr["3rd"].value_counts(**kwargs)

pandas\tests\groupby\test_value_counts.py:109:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
pandas\core\groupby\generic.py:771: in value_counts
    _, idx = get_join_indexers(left, right, sort=False, how="left")
pandas\core\reshape\merge.py:1653: in get_join_indexers
    zipped = zip(*mapped)
pandas\core\reshape\merge.py:1650: in <genexpr>
    _factorize_keys(left_keys[n], right_keys[n], sort=sort, how=how)
pandas\core\reshape\merge.py:2373: in _factorize_keys
    klass, lk, rk = _convert_arrays_and_get_rizer_klass(lk, rk)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

lk = array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1,
       2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4,..., 0, 1, 2, 3,
       4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5,
       6, 7, 8, 9], dtype=int32)
rk = array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2,
       3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 4, 5, 6,...0, 2, 4, 5, 7, 8, 1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4,
       5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8], dtype=int32)

    def _convert_arrays_and_get_rizer_klass(
        lk: ArrayLike, rk: ArrayLike
    ) -> tuple[type[libhashtable.Factorizer], ArrayLike, ArrayLike]:
        klass: type[libhashtable.Factorizer]
        if is_numeric_dtype(lk.dtype):
            if not is_dtype_equal(lk, rk):
                dtype = find_common_type([lk.dtype, rk.dtype])
                if isinstance(dtype, ExtensionDtype):
                    cls = dtype.construct_array_type()
                    if not isinstance(lk, ExtensionArray):
                        lk = cls._from_sequence(lk, dtype=dtype, copy=False)
                    else:
                        lk = lk.astype(dtype)

                    if not isinstance(rk, ExtensionArray):
                        rk = cls._from_sequence(rk, dtype=dtype, copy=False)
                    else:
                        rk = rk.astype(dtype)
                else:
                    lk = lk.astype(dtype)
                    rk = rk.astype(dtype)
            if isinstance(lk, BaseMaskedArray):
                #  Invalid index type "type" for "Dict[Type[object], Type[Factorizer]]";
                #  expected type "Type[object]"
                klass = _factorizers[lk.dtype.type]  # type: ignore[index]
            else:
>               klass = _factorizers[lk.dtype.type]
E               KeyError: <class 'numpy.intc'>

pandas\core\reshape\merge.py:2440: KeyError
------------------------------------------------------------ generated xml file: C:\Users\Me\Desktop\pandas\test-data.xml ------------------------------------------------------------- 
================================================================================ slowest 30 durations ================================================================================= 
0.01s call     pandas/tests/groupby/test_value_counts.py::test_series_groupby_value_counts[False-False-False-False-False-['1st', '2nd']-1000-203]
0.00s setup    pandas/tests/groupby/test_value_counts.py::test_series_groupby_value_counts[False-False-False-False-False-['1st', '2nd']-1000-203]
0.00s teardown pandas/tests/groupby/test_value_counts.py::test_series_groupby_value_counts[False-False-False-False-False-['1st', '2nd']-1000-203]
=============================================================================== short test summary info =============================================================================== 
FAILED pandas/tests/groupby/test_value_counts.py::test_series_groupby_value_counts[False-False-False-False-False-['1st', '2nd']-1000-203] - KeyError: <class 'numpy.intc'>
================================================================================== 1 failed in 0.72s ================================================================================== 

It does not fail on v1.5.2 and v.1.5.3, but does fail for main.

@rhshadrach
Copy link
Member

rhshadrach commented Jan 21, 2023

Thanks, this is almost certainly due to #50548. I've opened #50920 to track.

Edit: Actually, this is failing in the SeriesGroupBy specific implementation where bins is not None. I'm now doubting if it's #50548.

Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great - just needs a line in the whatsnew for 2.0 under the Bugfix - Groupby section.

@iofall - I merged main here to make sure recent changes in groupby didn't impact this. An unrelated failure due to the latest release of fsspec popped up. Merging main should fix everything(ish).

@rhshadrach rhshadrach added this to the 2.0 milestone Jan 24, 2023
@rhshadrach rhshadrach merged commit 33f4f7b into pandas-dev:main Jan 24, 2023
@rhshadrach
Copy link
Member

Thanks @iofall - nice work!

pooja-subramaniam pushed a commit to pooja-subramaniam/pandas that referenced this pull request Jan 25, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Apply Apply, Aggregate, Transform, Map Groupby
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants