Variance is not calculated correctly in some cases + inconsistent definition #10242
Comments
Thanks for the report. The difference is caused by the default value of ddof:

# numpy
dat['x'].values.var()
# 2.2448979591836733
dat['x'].values.var(ddof=True)
# 2.6190476190476186

# pandas
dat['x'].var()
# 2.6190476190476191
dat['x'].var(ddof=False)
# 2.2448979591836733

It is mentioned in the API reference but easy to miss... it should be listed in the Parameters section. A PR is appreciated. |
Ah, I missed the latter part. I agree it should be as precise as numpy. Marked as a bug also. |
Yeah, the latter part is the more concerning one. The first was more of a remark, but it should definitely be added to the parameters list. It might be a good idea to also make it explicit in the docs that numpy assumes a different ddof, and maybe add the example I gave as a caution to users. |
I'm trying to trace this back and found the following:

import numpy as np
import pandas as pd

dat = pd.DataFrame({'x': [9.0692124699, 9.0692124692, 9.0692124702, 9.0692124686,
                          9.0692124687, 9.0692124707, 9.0692124679, 9.0692124685,
                          9.0692124698, 9.0692124719, 9.0692124698, 9.0692124692,
                          9.0692124689, 9.0692124673, 9.0692124707, 9.0692124714,
                          9.0692124714, 9.0692124734, 9.0692124719, 9.0692124710,
                          9.0692124694, 9.0692124705, 9.0692124713, 9.0692124717]})
values = dat['x'].values

# Formula used internally by pandas to calculate the variance
# (ignoring the final division by N - 1):
print(np.sum(values**2) - np.sum(values)**2 / len(values))

# Normal variance formula (once again ignoring the final division by N - 1):
print(np.sum((values - np.mean(values))**2))

As far as I can tell these should be exactly equivalent, but for some reason they are not. |
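The two expressions are algebraically identical, but in floating point the first one subtracts two nearly equal numbers of order sum(x)^2, so almost all significant digits cancel. A small illustration of the effect with made-up data (not the original series):

import numpy as np

x = 1e8 + np.array([0.0, 0.1, 0.2, 0.3])    # tiny spread on top of a large offset

# Sum-of-squares form: both terms are about 4e16, and their true difference (0.05)
# is far smaller than the rounding error of either term, so the result is unreliable.
print(np.sum(x**2) - np.sum(x)**2 / len(x))

# Two-pass form: deviations are taken before squaring, so nothing cancels.
print(np.sum((x - np.mean(x))**2))          # very close to 0.05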
Do you have bottleneck installed? |
@jreback It works fine if bottleneck is installed:

# with bottleneck 1.0.0
dat['x'].var()
# 2.158098183211887e-18
dat['x'].var(ddof=False)
# 2.0681774255780584e-18

As @NickLH pointed out, the issue looks to be caused by the squared-sum formula:

# after fixing squared-sum to calculate mean first
# numpy
dat['x'].values.var()
# 2.0681774255780584e-18
dat['x'].values.var(ddof=True)
# 2.158098183211887e-18
# pandas
dat['x'].var()
# 2.158098183211887e-18
dat['x'].var(ddof=False)
# 2.0681774255780584e-18 |
No bottleneck... here's the full output of my system:

INSTALLED VERSIONS
------------------
commit: None
python: 3.4.0.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-52-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_CA.UTF-8
pandas: 0.16.1
nose: 1.3.1
Cython: None
numpy: 1.9.2
scipy: 0.15.1
statsmodels: None
IPython: 1.2.1
sphinx: None
patsy: None
dateutil: 2.4.2
pytz: 2015.4
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.3.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999
httplib2: 0.8
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None

I think it's a problem with the algorithm that's being used. According to this page on the wiki, http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance, the naive sum-of-squares formula can lose essentially all of its precision to cancellation when the two terms are nearly equal. The method I used that seems to work is the two-pass algorithm mentioned in the wiki article, which is apparently numerically stable except in the case of large N. The good formula (again according to the wiki) that is numerically stable even for large N is the compensated variant:

def compensated_variance(data):
    # first pass: compute the mean
    n = 0
    sum1 = 0
    for x in data:
        n = n + 1
        sum1 = sum1 + x
    mean = sum1 / n
    # second pass: sum of squared deviations, plus a correction term (sum3)
    # that compensates for round-off error in the computed mean
    sum2 = 0
    sum3 = 0
    for x in data:
        sum2 = sum2 + (x - mean)**2
        sum3 = sum3 + (x - mean)
    variance = (sum2 - sum3**2 / n) / (n - 1)
    return variance
print(dat['x'].values.var(ddof=1))
print(compensated_variance(dat['x']))
However, the 'correct' value (calculated with Mathematica) is: |
Look at how roll_var was implemented - it uses Welford's method - IIRC there is a comment there that var should be changed. |
Implementing it with Welford's method (or the modified one from Knuth that is on that wiki page) would probably slow down the computation, but it is probably worth the small performance drop, since we should be trying to get rid of weird round-off errors such as the ones I found. I'm quite busy at the moment, so I cannot work on a patch, but if nobody has taken this on by the time my workload reduces, I'd be happy to contribute. In the meantime, I suggest that people just use the numpy version: dat['x'].values.var(ddof=1). That's what I'll be doing at least... |
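For reference, a minimal sketch of Welford's one-pass algorithm - just an illustration of the technique under discussion here, not the pandas implementation:

def welford_variance(data, ddof=1):
    # One pass: maintain a running mean and the running sum of squared
    # deviations from that mean (usually called M2).
    n = 0
    mean = 0.0
    m2 = 0.0
    for x in data:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)   # uses the freshly updated mean
    return m2 / (n - ddof)

print(welford_variance(dat['x'].values))   # should be close to the two-pass result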
this would have to be benchmarked, but this is very much an edge case (difference of small numbers). As far as the |
I took the liberty of opening a small PR to implement Welford's method in |
std has the same issue |
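std is derived from the same variance computation (it is essentially its square root), so it inherits the same numerical error; a quick check of that relationship, assuming the dat defined above:

import numpy as np
print(np.isclose(dat['x'].std(), np.sqrt(dat['x'].var())))                             # True
print(np.isclose(dat['x'].values.std(ddof=1), np.sqrt(dat['x'].values.var(ddof=1))))   # True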
closed by #10679 |
The var method acts inconsistently when used on the Series vs the numpy array returned from values:
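The code snippet from the original report is not preserved in this dump; here is a minimal illustration of the inconsistency with made-up data (not the reporter's original series):

import pandas as pd

s = pd.Series([1.0, 2.0, 4.0, 7.0])   # illustrative data only
print(s.var())          # 7.0   -> pandas defaults to the unbiased estimate (ddof=1)
print(s.values.var())   # 5.25  -> numpy defaults to the biased estimate (ddof=0)
print(s.var(ddof=0))    # 5.25  -> they agree once ddof is passed explicitly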
This is due to numpy using ddof=0 as its default (which calculates the biased variance), whereas pandas calculates the unbiased variance by default. These two should be consistent, so either pandas should adapt or numpy should (maybe it should be numpy, since we should calculate the unbiased variance by default?).
The other problem is that pandas does not calculate the variance of this DataFrame properly.
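The DataFrame from the original report is not reproduced at this point in the dump; presumably it is the dat defined earlier in the thread (24 nearly identical values around 9.0692). A reconstruction of the reported behaviour under that assumption, on pandas 0.16.1 without bottleneck:

dat['x'].var()
# 0.0                     (reported result - clearly wrong)
dat['x'].values.var(ddof=1)
# 2.158098183211887e-18   (numpy's result, the expected order of magnitude)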
Inconsistent definition aside, the variance should clearly not be zero when calculated from pandas.
Here is the system information:
INSTALLED VERSIONS:
commit: None
python: 3.4.0.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-52-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_CA.UTF-8
pandas: 0.16.1
numpy: 1.9.2