-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: Rolling count aggregation produces unexpected results #34466
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
@brandon-b-miller Thanks for reporting this issue. When your window size is an even number, with However, you would get what you expected with >>> import pandas as pd
>>> pd.__version__
'1.1.0.dev0+1712.g1cad9e52e'
>>> s = pd.Series([1, 2, 3, 4])
>>> s
0 1
1 2
2 3
3 4
# You can see the window does not covers the element below even with center=True
>>> s.rolling(2, min_periods=2, center=True).sum()
0 NaN
1 3.0
2 5.0
3 7.0
dtype: float64
# With window=3, the window covers from the element above to the element below.
>>> s.rolling(3, min_periods=3, center=True).sum()
0 NaN
1 6.0
2 9.0
3 NaN
dtype: float64
# This is what you expected I think
>>> s.rolling(3, min_periods=3, center=True).count()
0 NaN
1 3.0
2 3.0
3 NaN
dtype: float64 |
Sorry I think I misunderstood you earlier. I thought you were referring to the behavior of Just so we are on the same page, and correct me if I'm wrong, you are concerned why the below >>> pd.__version__
'1.1.0.dev0+1712.g1cad9e52e'
>>> s = pd.Series([1,1,1,None])
>>> s
0 1.0
1 1.0
2 1.0
3 NaN
dtype: float64
# The last element is NaN
>>> s.rolling(2, min_periods=2).sum()
0 NaN
1 2.0
2 2.0
3 NaN
dtype: float64
# The last element is not NaN
>>> s.rolling(2, min_periods=2).count()
0 NaN
1 2.0
2 2.0
3 1.0
dtype: float64
>>> I think we actually discussed about this in #30923. When dealing with If say we change that, and >>> s = pd.Series([1, 1, 1, 1, 1, np.nan, 1, 1, 1, 1, 1])
# This is current behavior
>>> s.rolling(5, min_periods=5).count()
0 NaN
1 NaN
2 NaN
3 NaN
4 5.0
5 4.0
6 4.0
7 4.0
8 4.0
9 4.0
10 5.0
dtype: float64
# This is hypothetical.
# Now we don't get any useful information. The whole thing would be either 5 or NaN,
# and not doing the counting it's supposed to
>>> s.rolling(5, min_periods=5).count()
0 NaN
1 NaN
2 NaN
3 NaN
4 5.0
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
10 5.0
dtype: float64 So I believe the current behavior should be the "correct" / intended behavior. I also mentioned about this in #31302. |
@fujiaxiang thank you so much for responding. I'm still not sure I understand the output from the example I gave. To be clear, I'm only talking about
As you say, since this is an even number each window considers the element immediately before the anchor and the anchor itself. So these are the pairs of elements:
So I should expect
But Instead I get
|
Apologies - I just read your response a bit more carefully. My understanding of that PR was that the change was to allow |
@brandon-b-miller, thanks for pointing this PR out to me. @fujiaxiang, thanks for working on #30923. The following comment from that discussion is particularly illuminating:
Should this not apply to offset windows as well? Consider the following example: import pandas as pd
import numpy as np
series = pd.Series(
[1,2,4, 4, np.nan, 9],
index=[
pd.Timestamp("20200101 09:00:00"),
pd.Timestamp("20200101 09:00:01"),
pd.Timestamp("20200101 09:00:02"),
pd.Timestamp("20200101 09:00:04"),
pd.Timestamp("20200101 09:00:07"),
pd.Timestamp("20200101 09:00:08"),
])
series.rolling("2s", min_periods=1).count() I would have expected this to return the following: Could I please request for clarification on this behaviour? |
@mythrocks Thanks for pointing this out. Yes I see the behavior of @brandon-b-miller Regarding the behavior on I see your point on by setting |
I could be wrong. So let's hear more from the pandas core team! @pandas-dev/pandas-core
A brief summary of earlier arguments: One way or another, we need to also address another inconsistency pointed out by @mythrocks . The outcome of this discussion should also determine how to resolve that. |
I'm in favor of consistency with other methods. I understand that
... but counting non-NA values, is as easy as setting |
Why not just use
|
Thank you, @toobaz, @toobaz, @fujiaxiang. Perhaps there might be a disconnect in my understanding of
I took this to mean the minimum window-width at which the aggregation ( From this discussion, |
Having a value means not-missing, e.g, not null. |
Thanks, @bashtage. I'm going to try reconcile these knobs against those that SQL provides. I might have a followup shortly. |
Sorry for the late reply. Please correct me if I'm mistaken in any of the following:
With indexed_series = pd.Series(
# [1,2,4, 4, np.nan, 9],
[np.nan, np.nan, 7, np.nan, np.nan, np.nan],
index=[
pd.Timestamp("20200101 09:00:00"),
pd.Timestamp("20200101 09:00:01"),
pd.Timestamp("20200101 09:00:02"),
pd.Timestamp("20200101 09:00:04"),
pd.Timestamp("20200101 09:00:07"),
pd.Timestamp("20200101 09:00:08"),
]) Regardless of whether I use series.rolling("2s", min_periods=1).count()
Out[32]:
2020-01-01 09:00:00 1.0
2020-01-01 09:00:01 2.0
2020-01-01 09:00:02 2.0
2020-01-01 09:00:04 1.0
2020-01-01 09:00:07 NaN
2020-01-01 09:00:08 1.0
dtype: float64
series.rolling("2s", min_periods=0).count()
Out[33]:
2020-01-01 09:00:00 1.0
2020-01-01 09:00:01 2.0
2020-01-01 09:00:02 2.0
2020-01-01 09:00:04 1.0
2020-01-01 09:00:07 NaN
2020-01-01 09:00:08 1.0
dtype: float64 The inconsistency with the non-indexed series is disconcerting. If indexed and non-indexed series would behave similarly, all of it would be congruent with what one would expect from, say, SQL-like systems as well. |
@fujiaxiang, do I understand correctly that the following behaviour is the result of #30923? # Case 1: Regular, un-indexed series.
# NaN output in count() depends only on window-width.
# a) If window_width < min_periods: output == NaN
# b) If window_width >= min_periods, output == Number of non-null rows
# c) Is there a way to COUNT_VALID vs COUNT_ALL? No, not using min_periods.
# This is @fujiaxiang's code in action.
series = pd.Series([np.nan, np.nan, 7, np.nan, np.nan, np.nan])
series.rolling(2, min_periods=2).count() # => [ NaN, 0, 1, 1, 0, 0 ]
series.rolling(2, min_periods=1).count() # => [ 0, 0, 1, 1, 0, 0 ]
series.rolling(2, min_periods=0).count() # => [ 0, 0, 1, 1, 0, 0 ] All of this is quite similar to SQL. For instance, the same query in SparkSQL would be as follows: scala> spark.createDataFrame( spark.sparkContext.parallelize(Seq(Row(null), Row(null), Row(7), Row(null), Row(null), Row(null))), StructType(Array(StructField("i", IntegerType, nullable=true)))).createOrReplaceTempView("mytable")
scala> spark.sql( " select count(i) over (rows between 1 preceding and current row) as count_i from all_nulls_but_1 ").show
20/10/02 23:27:31 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
+-------+
|count_i|
+-------+
| 0|
| 0|
| 1|
| 1|
| 0|
| 0|
+-------+ An SQL windowed-count query is equivalent to Pandas', but with |
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.
Problem description
Apologies if this is expected behavior and not a bug. I noticed that prior to version 1.0,
rolling.count
ignored themin_periods
parameter, as discussed in #30923. However I'm having trouble understanding this output. Shouldn't the last element of the output beNaN
, given that the final window includes the last and second to last elements, of which only one is valid?Expected Output
Output of
pd.show_versions()
INSTALLED VERSIONS
commit : 70d7c04
python : 3.8.3.final.0
python-bits : 64
OS : Linux
OS-release : 4.15.0-76-generic
Version : #86-Ubuntu SMP Fri Jan 17 17:24:28 UTC 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.1.0.dev0+1690.g70d7c04ff
numpy : 1.18.1
pytz : 2020.1
dateutil : 2.8.1
pip : 20.0.2
setuptools : 46.4.0.post20200518
Cython : 0.29.17
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None
The text was updated successfully, but these errors were encountered: