Skip to content

Empty product not equal to 1 #7889

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
anderslundstedt opened this issue Jul 31, 2014 · 30 comments
Closed

Empty product not equal to 1 #7889

anderslundstedt opened this issue Jul 31, 2014 · 30 comments
Assignees
Labels
Bug Numeric Operations Arithmetic, Comparison, and Logical operations

Comments

@anderslundstedt
Copy link

In [1]: import pandas as pd
In [2]: pd.Series().product()
Out[2]: nan

I excepted:

Out[2]: 1

http://en.wikipedia.org/wiki/Empty_product

@jreback
Copy link
Contributor

jreback commented Jul 31, 2014

hmm, that is different from numpy....

want to do a pull-request?

@jreback jreback added this to the 0.15.0 milestone Jul 31, 2014
@ischwabacher
Copy link
Contributor

xref #7869

If you want to fix this, the problem is in the bottleneck_switch decorator in nanops.py; otherwise, you can wait for me to get to it. That decorator's constructor takes a zero_value argument for an operation to use when the array it's operating on is empty, then uses 0 instead in any case. We should rename the parameter to empty_value.

@anderslundstedt
Copy link
Author

I do not feel comfortable fixing this myself, so I will wait. Thanks for the prompt responses!

@anderslundstedt
Copy link
Author

Ooops. Accidentally closed. Now open again I hope. (I am not very experienced with Github.)

@cpcloud
Copy link
Member

cpcloud commented Aug 4, 2014

I can take this.

@cpcloud cpcloud self-assigned this Aug 4, 2014
@cpcloud
Copy link
Member

cpcloud commented Aug 4, 2014

@anderslundstedt @jreback Fixing this to be consistent with numpy breaks tested behavior of TimeGrouper aggregations. See tseries/tests/test_resample.py:TestTimeGrouper.test_aggregate_with_nat for the test that breaks. Is this an API that should be broken?

@jreback
Copy link
Contributor

jreback commented Aug 4, 2014

I would just fix that test ('prod') as its 'wrong' too

@cpcloud
Copy link
Member

cpcloud commented Aug 4, 2014

i did that in the pr

@cpcloud
Copy link
Member

cpcloud commented Aug 4, 2014

whoops forgot to xref this issue

@jorisvandenbossche
Copy link
Member

@cpcloud Can you give an example of a resample that would break?

Eg, would the NaN in the following example become 1?

In [3]: s1 = pd.Series(np.arange(5), index=pd.date_range('2012-01-01', periods=5))
In [4]: s2 = pd.Series(np.arange(5), index=pd.date_range('2012-01-10', periods=5))
In [5]: s = pd.concat([s1, s2])

In [10]: s.resample('2D', how='prod')
Out[10]:
2012-01-01     0
2012-01-03     6
2012-01-05     4
2012-01-07   NaN
2012-01-09     0
2012-01-11     2
2012-01-13    12
Freq: 2D, dtype: float64

@ischwabacher
Copy link
Contributor

Do we care that type(pd.Series([], dtype='f64').product()) is still int?

@cpcloud
Copy link
Member

cpcloud commented Aug 4, 2014

@jorisvandenbossche no, if the group key is nan (or nat) then the value at that key will be 1 instead of nan or nat. Using your example s:

In [44]: s
Out[44]:
2012-01-01    0
2012-01-02    1
2012-01-03    2
2012-01-04    3
2012-01-05    4
2012-01-10    0
2012-01-11    1
2012-01-12    2
2012-01-13    3
2012-01-14    4
dtype: int64

In [45]: df = s.reset_index(name='col').rename(columns={'index': 'date'})

In [46]: df.loc[2, 'date'] = pd.NaT

In [47]: df
Out[47]:
        date  col
0 2012-01-01    0
1 2012-01-02    1
2        NaT    2
3 2012-01-04    3
4 2012-01-05    4
5 2012-01-10    0
6 2012-01-11    1
7 2012-01-12    2
8 2012-01-13    3
9 2012-01-14    4

In [48]: df.groupby(pd.TimeGrouper(key='date', freq='D')).prod()
Out[48]:
            col
date
2012-01-01    0
2012-01-02    1
2012-01-03    1
2012-01-04    3
2012-01-05    4
...         ...
2012-01-10    0
2012-01-11    1
2012-01-12    2
2012-01-13    3
2012-01-14    4

[14 rows x 1 columns]

@jorisvandenbossche
Copy link
Member

@cpcloud Hmm, I don't fully get that. What is the difference? When the third row is set to NaT in your example, this then actually means that the 2012-01-03 row is removed, and so is missing, just like the others (2012-01-06 to 2012-01-09). So if that date gives 1 in the result, the others should also?

WIth 0.14.1, your example gives:

In [8]: df.groupby(pd.TimeGrouper(key='date', freq='D')).prod()
Out[8]:
            col
date
2012-01-01    0
2012-01-02    1
2012-01-03  NaN
2012-01-04    3
2012-01-05    4
2012-01-06  NaN
2012-01-07  NaN
2012-01-08  NaN
2012-01-09  NaN
2012-01-10    0
2012-01-11    1
2012-01-12    2
2012-01-13    3
2012-01-14    4

So the NaN in the third row becomes 1 according to your example? But not the other NaN values? That seems inconsistent to me? But if they all become 1, that also seems like a rather large API break, no?

@cpcloud
Copy link
Member

cpcloud commented Aug 5, 2014

Those will be one as well I just have my display max rows set to 10. I'll post the full example in a bit

@cpcloud
Copy link
Member

cpcloud commented Aug 5, 2014

Yes it's a breaking change, I'm not sure if it's the way to go as I'm not sure if this brings any benefits other that consistency with numpy and Wikipedia. OTOH I would've expected more breakage by changing the zero value default in nanops

@jorisvandenbossche
Copy link
Member

In [23]: np.array([]).prod()
Out[23]: 1.0

In [24]: np.array([np.nan]).prod()
Out[24]: nan

So would it in some way be possible to regard this example above as a product of NaN (which stays NaN) instead of an empty product?
(warning, very naive view without looking at the actual code is following: that you follow a reasoning like first 'resample' and add the missing dates with NaN values, and only then groupby to reduce with the product operator)

@anderslundstedt
Copy link
Author

@cpcloud It is more than "consistency with Wikipedia", it is matter of consistency with any mathematical identity involving a possibly empty product.

@cpcloud
Copy link
Member

cpcloud commented Aug 5, 2014

It was a tongue in cheek comment. I don't like introducing "valid" values where previously there were nans. This seems like it would break a ton of existing code. @anderslundstedt what operations are you doing that allowed you to find this inconsistency?

@anderslundstedt
Copy link
Author

Well, I want to take the product of a possibly empty series. The context is that I want to compute historical stock prices adjusted for dividend etc., when there has been no dividend after price time this involves an empty product.

@cpcloud
Copy link
Member

cpcloud commented Aug 5, 2014

Can you show some code and data in an example? Thanks.

@anderslundstedt
Copy link
Author

I do not see the point and I am not able to share the code I work with. My initial post basically sums up the problem.

It is not difficult to work around, so it is not a problem (for me) if you do not change the behaviour. Just though it would be nice with empty products equal to one, but I understand that you may not want to break existing code.

@jorisvandenbossche
Copy link
Member

What I wanted to say with my comment above: the fact that in a resample operation with product, missing dates end up 1 or NaN is more an "implementation detail" of resample than the
insurmountable consequence of the mathematical identity of an empty product.
As you can also see the missing dates in resample as a product of NaN, which is NaN.

So it would maybe be possible to change the result of Series([]).product() to 1 while keeping the same result in resample/TimeGrouper (giving NaNs)? (again, without knowing the code)

Note: in R, an empty product also gives 1 by definition.

@cpcloud
Copy link
Member

cpcloud commented Aug 5, 2014

That's easy to do can just special case the empty product and leave the resample code as is.

@anderslundstedt
Copy link
Author

Yes, I meant empty products of series that are empty in the strongest sense (neither values nor missing values). I have no opinion how one best would treat products of a series with only missing values.

@jreback
Copy link
Contributor

jreback commented Aug 5, 2014

as an aide, you can always do: s.fillna(1).product() to guarantee that you can do some sort of multiplication

@anderslundstedt
Copy link
Author

@jreback I do not know if you meant that it would solve the original case, which it does not:

In [1]: import pandas as pd
In [2]: pd.Series().fillna(1).product()
Out[2]: nan

@jreback
Copy link
Contributor

jreback commented Aug 5, 2014

@anderslundstedt hmm, empty series is a special case I guess then, ok @cpcloud why don't we just fix the scalar case. What does that do to the resample case? (obvious the test case has to change)

@cpcloud
Copy link
Member

cpcloud commented Aug 8, 2014

this is far less trivial than i thought, mostly bc arith methods are added after the class is defined so i can't override it properly. i guess i'll just have to monkey patch it. i have a big refactor of this arith stuff that defines them on ndframe abstractly so that a subclass can override them for specific cases if necessary

@jreback
Copy link
Contributor

jreback commented Sep 9, 2014

@cpcloud status?

@jreback jreback modified the milestones: 0.15.1, 0.15.0 Sep 14, 2014
@jreback jreback modified the milestones: Next Major Release, 0.16.0 Mar 2, 2015
@TomAugspurger
Copy link
Contributor

This is being handled in #18678

@TomAugspurger TomAugspurger modified the milestones: Next Major Release, No action Dec 13, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Numeric Operations Arithmetic, Comparison, and Logical operations
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants