
BUG: DataFrameGroupBy.value_counts when grouper has a frequency #47286


Closed
3 tasks done
LucasG0 opened this issue Jun 8, 2022 · 3 comments
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Bug Groupby

LucasG0 (Contributor) commented on Jun 8, 2022

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

>>> import pandas as pd
>>> df = pd.DataFrame(
...     {
...         "Timestamp": [pd.Timestamp(i) for i in range(3)],
...         "Food": ["apple", "apple", "banana"],
...     }
... )

>>> dfg = df.groupby(pd.Grouper(freq="1D", key="Timestamp"))
>>> dfg.value_counts()

../../core/groupby/generic.py:1800: in value_counts
    result_series = cast(Series, gb.size())
../../core/groupby/groupby.py:2323: in size
    result = self.grouper.size()
../../core/groupby/ops.py:881: in size
    ids, _, ngroups = self.group_info
pandas/_libs/properties.pyx:36: in pandas._libs.properties.CachedProperty.__get__
    ???
../../core/groupby/ops.py:915: in group_info
    comp_ids, obs_group_ids = self._get_compressed_codes()
../../core/groupby/ops.py:941: in _get_compressed_codes
    group_index = get_group_index(self.codes, self.shape, sort=True, xnull=True)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

labels = [array([0]), array([0, 1, 2]), array([0, 0, 1])], shape = (1, 3, 2)
sort = True, xnull = True

    def get_group_index(
        labels, shape: Shape, sort: bool, xnull: bool
    ) -> npt.NDArray[np.int64]:
        """
        For the particular label_list, gets the offsets into the hypothetical list
        representing the totally ordered cartesian product of all possible label
        combinations, *as long as* this space fits within int64 bounds;
        otherwise, though group indices identify unique combinations of
        labels, they cannot be deconstructed.
        - If `sort`, rank of returned ids preserve lexical ranks of labels.
          i.e. returned id's can be used to do lexical sort on labels;
        - If `xnull` nulls (-1 labels) are passed through.
    
        Parameters
        ----------
        labels : sequence of arrays
            Integers identifying levels at each location
        shape : tuple[int, ...]
            Number of unique levels at each location
        sort : bool
            If the ranks of returned ids should match lexical ranks of labels
        xnull : bool
            If true nulls are excluded. i.e. -1 values in the labels are
            passed through.
    
        Returns
        -------
        An array of type int64 where two elements are equal if their corresponding
        labels are equal at all location.
    
        Notes
        -----
        The length of `labels` and `shape` must be identical.
        """
    
        def _int64_cut_off(shape) -> int:
            acc = 1
            for i, mul in enumerate(shape):
                acc *= int(mul)
                if not acc < lib.i8max:
                    return i
            return len(shape)
    
        def maybe_lift(lab, size) -> tuple[np.ndarray, int]:
            # promote nan values (assigned -1 label in lab array)
            # so that all output values are non-negative
            return (lab + 1, size + 1) if (lab == -1).any() else (lab, size)
    
        labels = [ensure_int64(x) for x in labels]
        lshape = list(shape)
        if not xnull:
            for i, (lab, size) in enumerate(zip(labels, shape)):
                lab, size = maybe_lift(lab, size)
                labels[i] = lab
                lshape[i] = size
    
        labels = list(labels)
    
        # Iteratively process all the labels in chunks sized so less
        # than lib.i8max unique int ids will be required for each chunk
        while True:
            # how many levels can be done without overflow:
            nlev = _int64_cut_off(lshape)
    
            # compute flat ids for the first `nlev` levels
            stride = np.prod(lshape[1:nlev], dtype="i8")
            out = stride * labels[0].astype("i8", subok=False, copy=False)
    
            for i in range(1, nlev):
                if lshape[i] == 0:
                    stride = np.int64(0)
                else:
                    stride //= lshape[i]
>               out += labels[i] * stride
E               ValueError: non-broadcastable output operand with shape (1,) doesn't match the broadcast shape (3,)

../../core/sorting.py:182: ValueError

Issue Description

DataFrameGroupBy.value_counts fails when the groupby uses a Grouper with a freq, while the equivalent SeriesGroupBy.value_counts works. There is already a test covering the SeriesGroupBy implementation, named test_series_groupby_value_counts_with_grouper.

Expected Behavior

In this case, the dataframe has only one column apart from the grouping key, so it should return the same result as the SeriesGroupBy implementation:

>>> dfg["Food"].value_counts()
Timestamp   Food  
1970-01-01  apple     2
            banana    1
Name: Food, dtype: int64

This difference between Series and DataFrame behaviors comes from the fact that they currently have two very different implementations. The refactor of these two implementations into a single one might be done in #46940.
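Until the two implementations are unified, selecting the single value column first routes through the working SeriesGroupBy path. A minimal workaround sketch:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Timestamp": [pd.Timestamp(i) for i in range(3)],
        "Food": ["apple", "apple", "banana"],
    }
)
dfg = df.groupby(pd.Grouper(freq="1D", key="Timestamp"))

# Selecting the column first dispatches to SeriesGroupBy.value_counts,
# which handles a freq-based Grouper correctly.
counts = dfg["Food"].value_counts()
print(counts)
```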

Installed Versions

INSTALLED VERSIONS

commit : 997f84b
python : 3.8.10.final.0
python-bits : 64
OS : Linux
OS-release : 5.13.0-44-generic
Version : #49~20.04.1-Ubuntu SMP Wed May 18 18:44:28 UTC 2022
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : fr_FR.UTF-8
LOCALE : fr_FR.UTF-8

pandas : 1.1.0.dev0+8026.g997f84bd8f.dirty
numpy : 1.22.3
pytz : 2022.1
dateutil : 2.8.2
setuptools : 57.5.0
pip : 20.0.2
Cython : 0.29.30
pytest : 7.1.2
hypothesis : 6.46.2
sphinx : 4.5.0
blosc : 1.10.6
feather : None
xlsxwriter : 3.0.3
lxml.etree : 4.8.0
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.3.0
pandas_datareader: None
bs4 : 4.11.1
bottleneck : 1.3.4
brotli : None
fastparquet : 0.8.1
fsspec : 2022.3.0
gcsfs : 2022.3.0
matplotlib : 3.5.2
numba : None
numexpr : 2.8.1
odfpy : None
openpyxl : 3.0.9
pandas_gbq : None
pyarrow : 8.0.0
pyreadstat : 1.1.6
pyxlsb : None
s3fs : 2022.3.0
scipy : 1.8.0
snappy :
sqlalchemy : 1.4.36
tables : 3.7.0
tabulate : 0.8.9
xarray : 0.18.2
xlrd : 2.0.1
xlwt : 1.3.0
zstandard : None

@LucasG0 LucasG0 added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jun 8, 2022
simonjayhawkins (Member) commented:

Thanks @LucasG0 for the report and investigation.

DataFrameGroupBy.value_counts was added in pandas 1.4 (#44267).

This difference between Series and DataFrame behaviors comes from the fact that they currently have two very different implementations. The refactor of these two implementations into a single one might be done in #46940.

It would be ideal to backport a fix to 1.4.x, but to do this we would need to restrict changes to the bug fix only.

@simonjayhawkins simonjayhawkins added Groupby and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Jun 9, 2022
@simonjayhawkins simonjayhawkins added this to the 1.4.3 milestone Jun 9, 2022
@simonjayhawkins simonjayhawkins added the Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff label Jun 11, 2022
@simonjayhawkins simonjayhawkins modified the milestones: 1.4.3, 1.4.4 Jun 22, 2022
mroeschke (Member) commented:

The core issue arises when filtering to determine which columns should be included in the value counts:

grouping.name for grouping in self.grouper.groupings if grouping.in_axis

Groupers with frequencies (and, I would suspect, resample as well) always set in_axis=False:

ping = grouper.Grouping(lev, lev, in_axis=False, level=None)

There appears to be no easy way to set in_axis=True, and doing so may have further ramifications downstream. So I think the more sensible fix, as suggested above, is to combine the Series implementation with the DataFrame implementation; that is out of scope for a point release, so I am removing the milestone.
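To illustrate the effect of that filter (using hypothetical stand-in objects, not pandas internals), a freq-based grouping with in_axis=False is silently excluded from the candidate columns:

```python
from dataclasses import dataclass

# Stand-ins for pandas' internal Grouping objects, for illustration only.
@dataclass
class FakeGrouping:
    name: str
    in_axis: bool

# A freq-based grouper carries in_axis=False; a column grouping carries True.
groupings = [FakeGrouping("Timestamp", False), FakeGrouping("Food", True)]

# The quoted filter keeps only in-axis groupings, dropping "Timestamp":
kept = [g.name for g in groupings if g.in_axis]
print(kept)  # ['Food']
```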

@mroeschke mroeschke removed this from the 1.4.4 milestone Aug 22, 2022
@simonjayhawkins simonjayhawkins added this to the Contributions Welcome milestone Aug 23, 2022
@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
rhshadrach (Member) commented:

I now get the expected result on main. This was fixed by #50507.
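For reference, a quick check on a pandas version that includes the fix from #50507 (pandas 2.0 per the milestone below):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Timestamp": [pd.Timestamp(i) for i in range(3)],
        "Food": ["apple", "apple", "banana"],
    }
)
dfg = df.groupby(pd.Grouper(freq="1D", key="Timestamp"))

# With the fix, the DataFrame path no longer raises and matches the
# SeriesGroupBy result.
result = dfg.value_counts()
print(result)
```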

@rhshadrach rhshadrach added this to the 2.0 milestone Jan 8, 2023