
BUG: DataFrameGroupBy.value_counts when grouper has a frequency #47286


Closed
3 tasks done
LucasG0 opened this issue Jun 8, 2022 · 3 comments
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Bug Groupby

LucasG0 (Contributor) commented on Jun 8, 2022

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

>>> import pandas as pd
>>> df = pd.DataFrame(
...     {
...         "Timestamp": [pd.Timestamp(i) for i in range(3)],
...         "Food": ["apple", "apple", "banana"],
...     }
... )

>>> dfg = df.groupby(pd.Grouper(freq="1D", key="Timestamp"))
>>> dfg.value_counts()

../../core/groupby/generic.py:1800: in value_counts
    result_series = cast(Series, gb.size())
../../core/groupby/groupby.py:2323: in size
    result = self.grouper.size()
../../core/groupby/ops.py:881: in size
    ids, _, ngroups = self.group_info
pandas/_libs/properties.pyx:36: in pandas._libs.properties.CachedProperty.__get__
    ???
../../core/groupby/ops.py:915: in group_info
    comp_ids, obs_group_ids = self._get_compressed_codes()
../../core/groupby/ops.py:941: in _get_compressed_codes
    group_index = get_group_index(self.codes, self.shape, sort=True, xnull=True)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

labels = [array([0]), array([0, 1, 2]), array([0, 0, 1])], shape = (1, 3, 2)
sort = True, xnull = True

    def get_group_index(
        labels, shape: Shape, sort: bool, xnull: bool
    ) -> npt.NDArray[np.int64]:
        """
        For the particular label_list, gets the offsets into the hypothetical list
        representing the totally ordered cartesian product of all possible label
        combinations, *as long as* this space fits within int64 bounds;
        otherwise, though group indices identify unique combinations of
        labels, they cannot be deconstructed.
        - If `sort`, rank of returned ids preserve lexical ranks of labels.
          i.e. returned id's can be used to do lexical sort on labels;
        - If `xnull` nulls (-1 labels) are passed through.
    
        Parameters
        ----------
        labels : sequence of arrays
            Integers identifying levels at each location
        shape : tuple[int, ...]
            Number of unique levels at each location
        sort : bool
            If the ranks of returned ids should match lexical ranks of labels
        xnull : bool
            If true nulls are excluded. i.e. -1 values in the labels are
            passed through.
    
        Returns
        -------
        An array of type int64 where two elements are equal if their corresponding
        labels are equal at all location.
    
        Notes
        -----
        The length of `labels` and `shape` must be identical.
        """
    
        def _int64_cut_off(shape) -> int:
            acc = 1
            for i, mul in enumerate(shape):
                acc *= int(mul)
                if not acc < lib.i8max:
                    return i
            return len(shape)
    
        def maybe_lift(lab, size) -> tuple[np.ndarray, int]:
            # promote nan values (assigned -1 label in lab array)
            # so that all output values are non-negative
            return (lab + 1, size + 1) if (lab == -1).any() else (lab, size)
    
        labels = [ensure_int64(x) for x in labels]
        lshape = list(shape)
        if not xnull:
            for i, (lab, size) in enumerate(zip(labels, shape)):
                lab, size = maybe_lift(lab, size)
                labels[i] = lab
                lshape[i] = size
    
        labels = list(labels)
    
        # Iteratively process all the labels in chunks sized so less
        # than lib.i8max unique int ids will be required for each chunk
        while True:
            # how many levels can be done without overflow:
            nlev = _int64_cut_off(lshape)
    
            # compute flat ids for the first `nlev` levels
            stride = np.prod(lshape[1:nlev], dtype="i8")
            out = stride * labels[0].astype("i8", subok=False, copy=False)
    
            for i in range(1, nlev):
                if lshape[i] == 0:
                    stride = np.int64(0)
                else:
                    stride //= lshape[i]
>               out += labels[i] * stride
E               ValueError: non-broadcastable output operand with shape (1,) doesn't match the broadcast shape (3,)

../../core/sorting.py:182: ValueError

Issue Description

DataFrameGroupBy.value_counts fails when the groupby uses a Grouper with a freq, while the equivalent SeriesGroupBy.value_counts works. There is already a test covering the SeriesGroupBy implementation, named test_series_groupby_value_counts_with_grouper.

Expected Behavior

In this case, the dataframe has only one column apart from the grouping key, so it should return the same result as the SeriesGroupBy implementation:

>>> dfg["Food"].value_counts()
Timestamp   Food  
1970-01-01  apple     2
            banana    1
Name: Food, dtype: int64

This difference between Series and DataFrame behaviors comes from the fact that they currently have two very different implementations. The refactor of these two implementations into a single one might be done in #46940.
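Until the two implementations are unified, selecting the single value column first routes through the working SeriesGroupBy path. A minimal workaround sketch:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Timestamp": [pd.Timestamp(i) for i in range(3)],
        "Food": ["apple", "apple", "banana"],
    }
)
dfg = df.groupby(pd.Grouper(freq="1D", key="Timestamp"))

# Selecting the column first dispatches to SeriesGroupBy.value_counts,
# which handles a freq-based Grouper correctly.
counts = dfg["Food"].value_counts()
print(counts)
```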

Installed Versions

INSTALLED VERSIONS

commit : 997f84b
python : 3.8.10.final.0
python-bits : 64
OS : Linux
OS-release : 5.13.0-44-generic
Version : #49~20.04.1-Ubuntu SMP Wed May 18 18:44:28 UTC 2022
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : fr_FR.UTF-8
LOCALE : fr_FR.UTF-8

pandas : 1.1.0.dev0+8026.g997f84bd8f.dirty
numpy : 1.22.3
pytz : 2022.1
dateutil : 2.8.2
setuptools : 57.5.0
pip : 20.0.2
Cython : 0.29.30
pytest : 7.1.2
hypothesis : 6.46.2
sphinx : 4.5.0
blosc : 1.10.6
feather : None
xlsxwriter : 3.0.3
lxml.etree : 4.8.0
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.3.0
pandas_datareader: None
bs4 : 4.11.1
bottleneck : 1.3.4
brotli : None
fastparquet : 0.8.1
fsspec : 2022.3.0
gcsfs : 2022.3.0
matplotlib : 3.5.2
numba : None
numexpr : 2.8.1
odfpy : None
openpyxl : 3.0.9
pandas_gbq : None
pyarrow : 8.0.0
pyreadstat : 1.1.6
pyxlsb : None
s3fs : 2022.3.0
scipy : 1.8.0
snappy :
sqlalchemy : 1.4.36
tables : 3.7.0
tabulate : 0.8.9
xarray : 0.18.2
xlrd : 2.0.1
xlwt : 1.3.0
zstandard : None

@LucasG0 LucasG0 added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jun 8, 2022
simonjayhawkins (Member) commented:

Thanks @LucasG0 for the report and investigation.

DataFrameGroupBy.value_counts was added in pandas 1.4 (#44267).

This difference between Series and DataFrame behaviors comes from the fact that they currently have two very different implementations. The refactor of these two implementations into a single one might be done in #46940.

It would be ideal to backport a fix to 1.4.x, but to do this we would need to restrict changes to the bug fix only.

@simonjayhawkins simonjayhawkins added Groupby and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Jun 9, 2022
@simonjayhawkins simonjayhawkins added this to the 1.4.3 milestone Jun 9, 2022
@simonjayhawkins simonjayhawkins added the Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff label Jun 11, 2022
@simonjayhawkins simonjayhawkins modified the milestones: 1.4.3, 1.4.4 Jun 22, 2022
mroeschke (Member) commented:

The core issue arises when filtering to determine which columns should be included in the value counts:

grouping.name for grouping in self.grouper.groupings if grouping.in_axis

Groupers with frequencies (and, I would suspect, resample as well) always set in_axis=False:

ping = grouper.Grouping(lev, lev, in_axis=False, level=None)

There appears to be no easy way to set in_axis=True, and doing so may have further ramifications downstream. So I think the more sensible fix, as suggested above, is to combine the Series implementation with the DataFrame implementation; that is out of scope for a point release, so I am removing the milestone.
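To illustrate the effect of that filter (using hypothetical stand-in objects, not pandas internals), a freq-based grouping with in_axis=False is silently excluded from the candidate columns:

```python
from dataclasses import dataclass

# Stand-ins for pandas' internal Grouping objects, for illustration only.
@dataclass
class FakeGrouping:
    name: str
    in_axis: bool

# A freq-based grouper carries in_axis=False; a column grouping carries True.
groupings = [FakeGrouping("Timestamp", False), FakeGrouping("Food", True)]

# The quoted filter keeps only in-axis groupings, dropping "Timestamp":
kept = [g.name for g in groupings if g.in_axis]
print(kept)  # ['Food']
```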

@mroeschke mroeschke removed this from the 1.4.4 milestone Aug 22, 2022
@simonjayhawkins simonjayhawkins added this to the Contributions Welcome milestone Aug 23, 2022
@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
rhshadrach (Member) commented:

I now get the expected result on main. This was fixed by #50507.
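For reference, a quick check on a pandas version that includes the fix from #50507 (pandas 2.0 per the milestone below):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Timestamp": [pd.Timestamp(i) for i in range(3)],
        "Food": ["apple", "apple", "banana"],
    }
)
dfg = df.groupby(pd.Grouper(freq="1D", key="Timestamp"))

# With the fix, the DataFrame path no longer raises and matches the
# SeriesGroupBy result.
result = dfg.value_counts()
print(result)
```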

@rhshadrach rhshadrach added this to the 2.0 milestone Jan 8, 2023