
BUG: Rolling count aggregation produces unexpected results #34466


Open · 3 tasks done
brandon-b-miller opened this issue May 29, 2020 · 14 comments
Labels: Bug, Window (rolling, ewma, expanding)

Comments

@brandon-b-miller

brandon-b-miller commented May 29, 2020

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


>>> pd.__version__
'1.1.0.dev0+1690.g70d7c04ff'
>>> pd.Series([1,1,1,None]).rolling(2, min_periods=2, center=True).count()
0    NaN
1    2.0
2    2.0
3    1.0
dtype: float64

Problem description

Apologies if this is expected behavior and not a bug. I noticed that prior to version 1.0, rolling.count ignored the min_periods parameter, as discussed in #30923. However, I'm having trouble understanding this output. Shouldn't the last element of the output be NaN, given that the final window includes the last and second-to-last elements, of which only one is valid?

Expected Output

0    NaN
1    2.0
2    2.0
3    NaN
dtype: float64

Output of pd.show_versions()

INSTALLED VERSIONS

commit : 70d7c04
python : 3.8.3.final.0
python-bits : 64
OS : Linux
OS-release : 4.15.0-76-generic
Version : #86-Ubuntu SMP Fri Jan 17 17:24:28 UTC 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.1.0.dev0+1690.g70d7c04ff
numpy : 1.18.1
pytz : 2020.1
dateutil : 2.8.1
pip : 20.0.2
setuptools : 46.4.0.post20200518
Cython : 0.29.17
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

brandon-b-miller added the Bug and Needs Triage (Issue that has not been reviewed by a pandas team member) labels on May 29, 2020
@fujiaxiang
Member

@brandon-b-miller Thanks for reporting this issue.

When the window size is an even number, center=True actually includes one more element above the anchor than below it. In this case, with a window size of 2, setting center=True does not change anything: the window is still the element above the anchor plus the anchor itself. This is consistent across other functions too.

However, you would get what you expected with window=3, as this would include the anchor, the element above, and the element below.

>>> import pandas as pd
>>> pd.__version__
'1.1.0.dev0+1712.g1cad9e52e'

>>> s = pd.Series([1, 2, 3, 4])
>>> s
0    1
1    2
2    3
3    4

# You can see the window does not cover the element below, even with center=True
>>> s.rolling(2, min_periods=2, center=True).sum()
0    NaN
1    3.0
2    5.0
3    7.0
dtype: float64

# With window=3, the window covers from the element above to the element below.
>>> s.rolling(3, min_periods=3, center=True).sum()
0    NaN
1    6.0
2    9.0
3    NaN
dtype: float64

# This is what you expected I think
>>> s.rolling(3, min_periods=3, center=True).count()
0    NaN
1    3.0
2    3.0
3    NaN
dtype: float64

@fujiaxiang
Member

Sorry, I think I misunderstood you earlier. I thought you were referring to the behavior of center=True.

Just so we are on the same page (and correct me if I'm wrong), you are asking why the .sum() and .count() calls below yield different results, which has nothing to do with center=True:

>>> pd.__version__
'1.1.0.dev0+1712.g1cad9e52e'
>>> s = pd.Series([1,1,1,None])
>>> s
0    1.0
1    1.0
2    1.0
3    NaN
dtype: float64

# The last element is NaN
>>> s.rolling(2, min_periods=2).sum()
0    NaN
1    2.0
2    2.0
3    NaN
dtype: float64

# The last element is not NaN
>>> s.rolling(2, min_periods=2).count()
0    NaN
1    2.0
2    2.0
3    1.0
dtype: float64
>>>

I think we actually discussed this in #30923. When dealing with rolling.count(), min_periods only concerns the current window size, and not the number of valid (i.e. not NaN) values. This is because the exact purpose of .count() is to count the number of valid values.

If, say, we changed that so min_periods also took NaN into account, we wouldn't get much useful information when min_periods equals the window size (a very common scenario):

>>> s = pd.Series([1, 1, 1, 1, 1, np.nan, 1, 1, 1, 1, 1])

# This is current behavior
>>> s.rolling(5, min_periods=5).count()
0     NaN
1     NaN
2     NaN
3     NaN
4     5.0
5     4.0
6     4.0
7     4.0
8     4.0
9     4.0
10    5.0
dtype: float64

# This is hypothetical.
# Now we don't get any useful information: every value would be either 5 or NaN,
# instead of the counting that .count() is supposed to do
>>> s.rolling(5, min_periods=5).count()
0     NaN
1     NaN
2     NaN
3     NaN
4     5.0
5     NaN
6     NaN
7     NaN
8     NaN
9     NaN
10    5.0
dtype: float64

So I believe the current behavior should be the "correct" / intended behavior. I also mentioned this in #31302.

@jreback @WillAyd Any thoughts?

@brandon-b-miller
Author

brandon-b-miller commented May 30, 2020

@fujiaxiang thank you so much for responding. I'm still not sure I understand the output from the example I gave. To be clear, I'm only talking about count; the other aggregations seem OK.

pd.Series([1,1,1,None])
0    1.0
1    1.0
2    1.0
3    NaN
dtype: float64

As you say, since the window size is an even number, each window considers the element immediately before the anchor and the anchor itself. So these are the pairs of elements:

[NaN, 1] -> NaN (this is the first row so there isn't enough data, with min_periods==2)
[1,1] -> 2
[1,1] -> 2
[1, NaN] -> NaN (only one valid element, so it should be nan because min_periods==2)

So I should expect

0    NaN
1    2.0
2    2.0
3    NaN

But instead I get

0    NaN
1    2.0
2    2.0
3    1.0
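
To make the expectation concrete, here is a minimal sketch (plain Python, not pandas internals) that enumerates the same windows by hand and applies min_periods to the count of valid values:

>>> import numpy as np
>>> import pandas as pd
>>> s = pd.Series([1, 1, 1, None])
>>> window, min_periods = 2, 2
>>> out = []
>>> for i in range(len(s)):
...     vals = s.iloc[max(i - window + 1, 0): i + 1]  # the trailing pairs listed above
...     valid = int(vals.notna().sum())
...     out.append(float(valid) if valid >= min_periods else np.nan)
...
>>> pd.Series(out)  # matches the expected output above
0    NaN
1    2.0
2    2.0
3    NaN
dtype: float64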

@brandon-b-miller
Author

brandon-b-miller commented May 30, 2020

Apologies - I just read your response a bit more carefully. My understanding of that PR was that the change was to allow min_periods to consider NaN. I see why the scenario you provide leads to results that are not so useful, but I would say that is actually the result I expect if I explicitly set min_periods==window_size.

@mythrocks

So I believe the current behavior should be the "correct" / intended behavior. I also mentioned this in #31302.

@brandon-b-miller, thanks for pointing this PR out to me.

@fujiaxiang, thanks for working on #30923. The following comment from that discussion is particularly illuminating:

... the min_periods argument for rolling.count only takes effect on the window size, and doesn't care how many NA values are in the window. If we change its default value to the same as window, same as other APIs, we would only see NAs in the first few rows (which would be obvious to user that this is caused by min_periods), and the rest are consistent.

Should this not apply to offset windows as well? Consider the following example:

import pandas as pd
import numpy as np

series = pd.Series(
    [1, 2, 4, 4, np.nan, 9],
    index=[
        pd.Timestamp("20200101 09:00:00"),
        pd.Timestamp("20200101 09:00:01"),
        pd.Timestamp("20200101 09:00:02"),
        pd.Timestamp("20200101 09:00:04"),
        pd.Timestamp("20200101 09:00:07"),
        pd.Timestamp("20200101 09:00:08"),
    ])
series.rolling("2s", min_periods=1).count()

I would have expected this to return the following:
[1.0, 2.0, 2.0, 1.0, 0.0, 1.0]
But Pandas 1.1.2 returns the following instead:
[1.0, 2.0, 2.0, 1.0, NaN, 1.0]

Could I please request clarification on this behaviour?

@fujiaxiang
Member

@mythrocks Thanks for pointing this out. Yes, I see the behavior of rolling.count is not consistent here.

@brandon-b-miller Regarding the behavior of min_periods in rolling.count, I have explained my original intention in my earlier comment. I do agree with you that this behavior is not completely consistent with other methods such as rolling.sum, and I think we both agree there is a trade-off between usability (the returned results should be useful) and consistency.

I see your point that by setting min_periods==window_size explicitly, you expect the result to give NaN wherever there are fewer non-null values in the window. However, with the changes in #36649, min_periods will be set equal to window_size by default. I believe this tips the scale a little more in favor of the original intention.

@fujiaxiang
Member

fujiaxiang commented Sep 30, 2020

I could be wrong, so let's hear more from the pandas core team! @pandas-dev/pandas-core
The core question here is: should we change the following behavior?

the min_periods argument for rolling.count only takes effect on the window size, and doesn't care how many NA values are in the window

A brief summary of earlier arguments:
Keeping this behavior would give more useful results for users, especially considering that the count method is meant for counting non-NA values. However, this behavior is not consistent with similar methods such as rolling.mean. Refer to earlier discussions for more details.

One way or another, we also need to address the other inconsistency pointed out by @mythrocks. The outcome of this discussion should determine how to resolve that as well.

@toobaz
Member

toobaz commented Sep 30, 2020

I'm in favor of consistency with other methods. I understand that

Keeping this behavior would give more useful results for users, especially considering count method is meant for counting non-NA values.

... but counting non-NA values is as easy as setting min_periods=0, right? Vice versa, counting the number of non-NA values whenever there are at least a minimum number of non-NA values is the goal of min_periods, and it is more annoying to do otherwise.
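
As a concrete illustration (using the series from the earlier hypothetical), the unconditional count stays one keyword away:

>>> import numpy as np
>>> import pandas as pd
>>> s = pd.Series([1, 1, 1, 1, 1, np.nan, 1, 1, 1, 1, 1])
>>> # min_periods=0: count non-NA values per window, unconditionally
>>> s.rolling(5, min_periods=0).count()
0     1.0
1     2.0
2     3.0
3     4.0
4     5.0
5     4.0
6     4.0
7     4.0
8     4.0
9     4.0
10    5.0
dtype: float64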

@bashtage
Contributor

Why not just use min_periods=0? To me it seems like this alternative is already supported, just by using a different min_periods. This is presumably because the intention is different. Changing min_periods implements the pseudocode

return cnt if (cnt := block.notnull().sum()) >= min_periods else np.nan

@mythrocks

Thank you, @toobaz, @fujiaxiang. Perhaps there might be a disconnect in my understanding of min_periods. From the docs:

min_periods : int, default None
Minimum number of observations in window required to have a value (otherwise result is NA). For a window that is specified by an offset, min_periods will default to 1. Otherwise, min_periods will default to the size of the window.

I took this to mean the minimum window width at which the aggregation (count() here) might be performed, regardless of whether the values are NaN. Based on the comments in #30923, @fujiaxiang might have shared that interpretation.

From this discussion, NaN values do not count toward min_periods?

@bashtage
Contributor

Minimum number of observations in window required to have a value

Having a value means non-missing, i.e., not null.
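
A quick sketch of what that means in practice, with a small series:

>>> import numpy as np
>>> import pandas as pd
>>> s = pd.Series([1.0, np.nan, 3.0])
>>> # no length-2 window contains 2 non-NA values, so every result is NaN
>>> s.rolling(2, min_periods=2).sum()
0   NaN
1   NaN
2   NaN
dtype: float64
>>> s.rolling(2, min_periods=1).sum()
0    1.0
1    1.0
2    3.0
dtype: float64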

@mythrocks

Thanks, @bashtage. I'm going to try to reconcile these knobs against those that SQL provides. I might have a followup shortly.

@mythrocks

mythrocks commented Oct 3, 2020

Sorry for the late reply. Please correct me if I'm mistaken in any of the following:

... just use min_periods=0? To me it seems like this alternative is already supported, just by using a different min_periods.

With pandas 1.1.2, consider the series from my initial comment:

series = pd.Series(
    [1, 2, 4, 4, np.nan, 9],
    # [np.nan, np.nan, 7, np.nan, np.nan, np.nan],
    index=[
        pd.Timestamp("20200101 09:00:00"),
        pd.Timestamp("20200101 09:00:01"),
        pd.Timestamp("20200101 09:00:02"),
        pd.Timestamp("20200101 09:00:04"),
        pd.Timestamp("20200101 09:00:07"),
        pd.Timestamp("20200101 09:00:08"),
    ])

Regardless of whether I use min_periods=1 or min_periods=0, I seem to get the same result:

series.rolling("2s", min_periods=1).count()
Out[32]: 
2020-01-01 09:00:00    1.0
2020-01-01 09:00:01    2.0
2020-01-01 09:00:02    2.0
2020-01-01 09:00:04    1.0
2020-01-01 09:00:07    NaN
2020-01-01 09:00:08    1.0
dtype: float64

series.rolling("2s", min_periods=0).count()
Out[33]: 
2020-01-01 09:00:00    1.0
2020-01-01 09:00:01    2.0
2020-01-01 09:00:02    2.0
2020-01-01 09:00:04    1.0
2020-01-01 09:00:07    NaN
2020-01-01 09:00:08    1.0
dtype: float64

The inconsistency with the non-indexed series is disconcerting. If indexed and non-indexed series behaved similarly, all of it would be congruent with what one would expect from, say, SQL-like systems as well.

@mythrocks

mythrocks commented Oct 3, 2020

@fujiaxiang, do I understand correctly that the following behaviour is the result of #30923?

# Case 1: Regular, un-indexed series.
# NaN output in count() depends only on window-width.
#  a) If window_width <  min_periods: output == NaN
#  b) If window_width >= min_periods, output == Number of non-null rows
#  c) Is there a way to COUNT_VALID vs COUNT_ALL? No, not using min_periods.
# This is @fujiaxiang's code in action.
series = pd.Series([np.nan, np.nan, 7, np.nan, np.nan, np.nan])
series.rolling(2, min_periods=2).count()    # => [ NaN, 0, 1, 1, 0, 0 ]
series.rolling(2, min_periods=1).count()    # => [   0, 0, 1, 1, 0, 0 ]
series.rolling(2, min_periods=0).count()    # => [   0, 0, 1, 1, 0, 0 ]

All of this is quite similar to SQL. For instance, the same query in SparkSQL would be as follows:

scala> spark.createDataFrame( spark.sparkContext.parallelize(Seq(Row(null), Row(null), Row(7), Row(null), Row(null), Row(null))), StructType(Array(StructField("i", IntegerType, nullable=true)))).createOrReplaceTempView("all_nulls_but_1")

scala> spark.sql( " select count(i) over (rows between 1 preceding and current row) as count_i from all_nulls_but_1 ").show
20/10/02 23:27:31 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
+-------+
|count_i|
+-------+
|      0|
|      0|
|      1|
|      1|
|      0|
|      0|
+-------+

An SQL windowed-count query is equivalent to pandas' count with min_periods=0. So, your change in #30923 is quite welcome. :]
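
For comparison, the pandas equivalent of that SparkSQL window (a sketch using the same all-NaN-but-one data as the Case 1 snippet above):

>>> import numpy as np
>>> import pandas as pd
>>> series = pd.Series([np.nan, np.nan, 7, np.nan, np.nan, np.nan])
>>> # "rows between 1 preceding and current row" maps to window=2 with min_periods=0
>>> series.rolling(2, min_periods=0).count()
0    0.0
1    0.0
2    1.0
3    1.0
4    0.0
5    0.0
dtype: float64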
