I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
>>> import pandas as pd
>>> from pandas import DataFrame, Grouper
>>> df = DataFrame(
...     {
...         "Timestamp": [pd.Timestamp(i) for i in range(3)],
...         "Food": ["apple", "apple", "banana"],
...     }
... )
>>> dfg = df.groupby(Grouper(freq="1D", key="Timestamp"))
>>> dfg.value_counts()
../../core/groupby/generic.py:1800: in value_counts
    result_series = cast(Series, gb.size())
../../core/groupby/groupby.py:2323: in size
    result = self.grouper.size()
../../core/groupby/ops.py:881: in size
    ids, _, ngroups = self.group_info
pandas/_libs/properties.pyx:36: in pandas._libs.properties.CachedProperty.__get__
    ???
../../core/groupby/ops.py:915: in group_info
    comp_ids, obs_group_ids = self._get_compressed_codes()
../../core/groupby/ops.py:941: in _get_compressed_codes
    group_index = get_group_index(self.codes, self.shape, sort=True, xnull=True)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

labels = [array([0]), array([0, 1, 2]), array([0, 0, 1])], shape = (1, 3, 2)
sort = True, xnull = True

    def get_group_index(
        labels, shape: Shape, sort: bool, xnull: bool
    ) -> npt.NDArray[np.int64]:
        """
        For the particular label_list, gets the offsets into the hypothetical list
        representing the totally ordered cartesian product of all possible label
        combinations, *as long as* this space fits within int64 bounds;
        otherwise, though group indices identify unique combinations of
        labels, they cannot be deconstructed.
        - If `sort`, rank of returned ids preserve lexical ranks of labels.
          i.e. returned id's can be used to do lexical sort on labels;
        - If `xnull` nulls (-1 labels) are passed through.

        Parameters
        ----------
        labels : sequence of arrays
            Integers identifying levels at each location
        shape : tuple[int, ...]
            Number of unique levels at each location
        sort : bool
            If the ranks of returned ids should match lexical ranks of labels
        xnull : bool
            If true nulls are excluded. i.e. -1 values in the labels are
            passed through.

        Returns
        -------
        An array of type int64 where two elements are equal if their corresponding
        labels are equal at all location.

        Notes
        -----
        The length of `labels` and `shape` must be identical.
        """

        def _int64_cut_off(shape) -> int:
            acc = 1
            for i, mul in enumerate(shape):
                acc *= int(mul)
                if not acc < lib.i8max:
                    return i
            return len(shape)

        def maybe_lift(lab, size) -> tuple[np.ndarray, int]:
            # promote nan values (assigned -1 label in lab array)
            # so that all output values are non-negative
            return (lab + 1, size + 1) if (lab == -1).any() else (lab, size)

        labels = [ensure_int64(x) for x in labels]
        lshape = list(shape)
        if not xnull:
            for i, (lab, size) in enumerate(zip(labels, shape)):
                lab, size = maybe_lift(lab, size)
                labels[i] = lab
                lshape[i] = size

        labels = list(labels)

        # Iteratively process all the labels in chunks sized so less
        # than lib.i8max unique int ids will be required for each chunk
        while True:
            # how many levels can be done without overflow:
            nlev = _int64_cut_off(lshape)

            # compute flat ids for the first `nlev` levels
            stride = np.prod(lshape[1:nlev], dtype="i8")
            out = stride * labels[0].astype("i8", subok=False, copy=False)

            for i in range(1, nlev):
                if lshape[i] == 0:
                    stride = np.int64(0)
                else:
                    stride //= lshape[i]
>               out += labels[i] * stride
E               ValueError: non-broadcastable output operand with shape (1,) doesn't match the broadcast shape (3,)

../../core/sorting.py:182: ValueError
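For what it is worth, the final ValueError can be reproduced with NumPy alone. The sketch below is not pandas code; it only replays the arithmetic at the start of the failing loop, using the `labels` and `shape` values printed in the traceback. The variable names mirror `get_group_index`, but the snippet itself is an illustration written for this report.

```python
import numpy as np

# Values copied from the traceback: one code array per grouping level.
labels = [np.array([0]), np.array([0, 1, 2]), np.array([0, 0, 1])]
lshape = [1, 3, 2]

# Same arithmetic as the first steps of the loop in get_group_index.
stride = np.prod(lshape[1:], dtype="i8")  # 3 * 2 = 6
out = stride * labels[0].astype("i8")     # shape (1,)

stride //= lshape[1]                      # 6 // 3 = 2
try:
    # labels[1] has shape (3,), so the in-place add cannot write into `out`.
    out += labels[1] * stride
except ValueError as exc:
    message = str(exc)
    print(message)
```

The first code array has length 1 while the other two have length 3 (one entry per row), so the in-place addition cannot broadcast into `out`. Presumably the length-1 array comes from the Grouper level, but that is an inference from the traceback rather than something confirmed here.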
Issue Description
DataFrameGroupBy.value_counts fails with a Grouper with a freq, while it works for a SeriesGroupBy. There is already a test for the SeriesGroupBy implementation named test_series_groupby_value_counts_with_grouper.
Expected Behavior
In this case, the dataframe has only one column besides the grouping key, so it should return a result similar to that of the SeriesGroupBy implementation.
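For reference, here is a sketch of the SeriesGroupBy call that already works on the same data. Selecting the "Food" column from the grouped frame is an assumption about what the comparable call looks like, and the printed output is omitted because its exact formatting depends on the pandas version.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Timestamp": [pd.Timestamp(i) for i in range(3)],
        "Food": ["apple", "apple", "banana"],
    }
)

# Selecting a single column yields a SeriesGroupBy, whose value_counts
# accepts a Grouper with a freq.
sgb = df.groupby(pd.Grouper(freq="1D", key="Timestamp"))["Food"]
result = sgb.value_counts()
print(result)
```

All three timestamps fall on the same day, so the result should hold one daily bin with a count per distinct value of "Food".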
This difference between Series and DataFrame behaviors comes from the fact that they currently have two very different implementations. The refactor of these two implementations into a single one might be done in #46940.
Installed Versions
INSTALLED VERSIONS
commit : 997f84b
python : 3.8.10.final.0
python-bits : 64
OS : Linux
OS-release : 5.13.0-44-generic
Version : #49~20.04.1-Ubuntu SMP Wed May 18 18:44:28 UTC 2022
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : fr_FR.UTF-8
LOCALE : fr_FR.UTF-8
DataFrameGroupBy.value_counts was added in pandas 1.4 (#44267).
This difference between Series and DataFrame behaviors comes from the fact that they currently have two very different implementations. The refactor of these two implementations into a single one might be done in #46940.
It would be ideal to backport a fix to 1.4.x, but doing so would require restricting the changes to the bug fix only.
ping = grouper.Grouping(lev, lev, in_axis=False, level=None)
There appears to be no easy way to set in_axis=True, and doing so may have further ramifications downstream. I think the more sensible fix, as suggested above, is to somehow combine the Series implementation with the DataFrame implementation. That is out of scope for a point release, so I am removing the milestone.
Installed Versions
INSTALLED VERSIONS
commit : 997f84b
python : 3.8.10.final.0
python-bits : 64
OS : Linux
OS-release : 5.13.0-44-generic
Version : #49~20.04.1-Ubuntu SMP Wed May 18 18:44:28 UTC 2022
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : fr_FR.UTF-8
LOCALE : fr_FR.UTF-8
pandas : 1.1.0.dev0+8026.g997f84bd8f.dirty
numpy : 1.22.3
pytz : 2022.1
dateutil : 2.8.2
setuptools : 57.5.0
pip : 20.0.2
Cython : 0.29.30
pytest : 7.1.2
hypothesis : 6.46.2
sphinx : 4.5.0
blosc : 1.10.6
feather : None
xlsxwriter : 3.0.3
lxml.etree : 4.8.0
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.3.0
pandas_datareader: None
bs4 : 4.11.1
bottleneck : 1.3.4
brotli : None
fastparquet : 0.8.1
fsspec : 2022.3.0
gcsfs : 2022.3.0
matplotlib : 3.5.2
numba : None
numexpr : 2.8.1
odfpy : None
openpyxl : 3.0.9
pandas_gbq : None
pyarrow : 8.0.0
pyreadstat : 1.1.6
pyxlsb : None
s3fs : 2022.3.0
scipy : 1.8.0
snappy :
sqlalchemy : 1.4.36
tables : 3.7.0
tabulate : 0.8.9
xarray : 0.18.2
xlrd : 2.0.1
xlwt : 1.3.0
zstandard : None