Inconsistency in some calculations in pd.DataFrames with large number of rows #11346

Closed
rubennj opened this issue Oct 16, 2015 · 12 comments
Labels: Numeric Operations (Arithmetic, Comparison, and Logical operations), Windows (Windows OS)

@rubennj

rubennj commented Oct 16, 2015

Strange behaviour that appears only with a large number of rows.

import pandas as pd
import numpy as np

np.random.seed(0)
data = pd.DataFrame(np.random.rand(500000, 4))

def sum_products(data):
    data_np = sum(data.values[:,0] * data.values[:,1] * data.values[:,2] * data.values[:,3])
    data_pd = sum(data.ix[:,0] * data.ix[:,1] * data.ix[:,2] * data.ix[:,3])

    return [data_np, data_pd]

calls = []
for _ in range(25):
    test = sum_products(data)
    calls.append(test)

print('[Numpy, Pandas]', calls )

Out[6]:
[Numpy, Pandas]
[[31234.084575876517, 31234.084575876517],
 [31234.084575876517, 31234.084575876517],
 [31234.084575876517, 31234.084575876517],
 [31234.084575876517, 0.0],
 [31234.084575876517, 31234.084575876517],
 [31234.084575876517, 31234.084575876517],
 [31234.084575876517, 31234.084575876517],
 [31234.084575876517, 31234.084575876517],
 [31234.084575876517, 0.0],
 [31234.084575876517, 31234.084575876517],
 [31234.084575876517, 31234.084575876517],
 [31234.084575876517, 31234.084575876517],
 [31234.084575876517, 31234.084575876517],
 [31234.084575876517, 31234.084575876517],
 [31234.084575876517, 31234.084575876517],
 [31234.084575876517, 0.0],
 [31234.084575876517, 31234.084575876517],
 [31234.084575876517, 31234.084575876517],
 [31234.084575876517, 31234.084575876517],
 [31234.084575876517, 31234.084575876517],
 [31234.084575876517, 31234.084575876517],
 [31234.084575876517, 31234.084575876517],
 [31234.084575876517, 31234.084575876517],
 [31234.084575876517, 31234.084575876517],
 [31234.084575876517, 31234.084575876517]]

The bug appears at random, but with this combination (4 columns, 500k rows, and 25 repetitions) I see it at least once in almost every run.

I have tried several pandas versions (0.13-0.17) with Python 3.4 and the bug is still there.
HOWEVER, with Python 2.7 it works perfectly (at least over several runs).

INSTALLED VERSIONS
------------------
commit: None
python: 3.4.3.final.0
python-bits: 64
OS: Windows
OS-release: 8
machine: AMD64
processor: Intel64 Family 6 Model 42 Stepping 7, GenuineIntel
byteorder: little
LC_ALL: None
LANG: es_ES

pandas: 0.17.0
nose: 1.3.7
pip: 7.1.2
setuptools: 18.4
Cython: 0.23.3
numpy: 1.10.1
scipy: 0.16.0
statsmodels: None
IPython: 4.0.0
sphinx: 1.2.3
patsy: 0.4.0
dateutil: 2.4.1
pytz: 2015.6
blosc: None
bottleneck: None
tables: None
numexpr: 2.4.4
matplotlib: 1.4.3
openpyxl: 2.0.2
xlrd: 0.9.4
xlwt: None
xlsxwriter: 0.7.5
lxml: 3.4.4
bs4: 4.3.2
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.8
pymysql: None
psycopg2: None
@jreback
Contributor

jreback commented Oct 16, 2015

why would you use sum and not .sum()? the former is a python call, while the latter is a numpy/pandas call.
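
The distinction can be illustrated with a minimal sketch. It assumes a current pandas/NumPy install and a small random Series (the names `series`, `py_total` and `np_total` are made up for this example). The built-in `sum` iterates over the elements in Python, while `.sum()` runs the reduction inside NumPy/pandas; the two totals should agree up to floating-point rounding.

```python
import numpy as np
import pandas as pd

# Small illustrative input; the size is arbitrary.
np.random.seed(0)
series = pd.Series(np.random.rand(1000))

# Built-in sum: a Python-level loop that adds one element at a time.
py_total = sum(series)

# Method sum: the reduction is carried out by pandas/NumPy in compiled code.
np_total = series.sum()

# The totals may differ in the last digits because the additions happen
# in a different order, so compare with a tolerance rather than ==.
print(np.isclose(py_total, np_total))
```

Either form should give a usable total; the method form is typically much faster on large inputs.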

@jreback
Contributor

jreback commented Oct 16, 2015

In [3]: pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.4.0.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 70 Stepping 1, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.17.0
nose: None
pip: 7.1.2
setuptools: 18.4
Cython: None
numpy: 1.10.1
scipy: None
statsmodels: None
IPython: 4.0.0
sphinx: None
patsy: None
dateutil: 2.4.2
pytz: 2015.6
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None

In [4]: %cpate
ERROR: Line magic function `%cpate` not found.

In [5]: %cpaste
Pasting code; enter '--' alone on the line to stop or use Ctrl-D.
:import pandas as pd
:import numpy as np
:
:np.random.seed(0)
:data = pd.DataFrame(np.random.rand(500000, 4))
:
:def sum_products(data):
:    data_np = sum(data.values[:,0] * data.values[:,1] * data.values[:,2] * data.values[:,3])
:    data_pd = sum(data.ix[:,0] * data.ix[:,1] * data.ix[:,2] * data.ix[:,3])
:
:    return [data_np, data_pd]
:
:calls = []
:for _ in range(25):
:    test = sum_products(data)
:    calls.append(test)
:
:print('[Numpy, Pandas]', calls )
:--
[Numpy, Pandas] [[31234.08457587703, 31234.08457587703], [31234.08457587703, 31234.08457587703], [31234.08457587703, 31234.08457587703], [31234.08457587703, 31234.08457587703], [31234.08457587703, 31234.08457587703],
 [31234.08457587703, 31234.08457587703], [31234.08457587703, 31234.08457587703], [31234.08457587703, 31234.08457587703], [31234.08457587703, 31234.08457587703], [31234.08457587703, 31234.08457587703],
 [31234.08457587703, 31234.08457587703], [31234.08457587703, 31234.08457587703], [31234.08457587703, 31234.08457587703], [31234.08457587703, 31234.08457587703], [31234.08457587703, 31234.08457587703],
 [31234.08457587703, 31234.08457587703], [31234.08457587703, 31234.08457587703], [31234.08457587703, 31234.08457587703], [31234.08457587703, 31234.08457587703], [31234.08457587703, 31234.08457587703],
 [31234.08457587703, 31234.08457587703], [31234.08457587703, 31234.08457587703], [31234.08457587703, 31234.08457587703], [31234.08457587703, 31234.08457587703], [31234.08457587703, 31234.08457587703]]

only difference is you have a locale setting (e.g. es_ES)

@rubennj
Author

rubennj commented Oct 16, 2015

I use them interchangeably.
Anyway, using .sum() gives the same error.

def sum_products(data):
    data_np = (data.values[:,0] * data.values[:,1] * data.values[:,2] * data.values[:,3]).sum()
    data_pd = (data.ix[:,0] * data.ix[:,1] * data.ix[:,2] * data.ix[:,3]).sum()

    return [data_np, data_pd]

@jreback
Contributor

jreback commented Oct 16, 2015

can't repro (see above). if you eventually figure it out pls reopen.

@jreback jreback closed this as completed Oct 16, 2015
@jreback jreback added the Can't Repro, Numeric Operations (Arithmetic, Comparison, and Logical operations), and Windows (Windows OS) labels Oct 16, 2015
@jreback
Contributor

jreback commented Oct 16, 2015

how did you install? I assume you didn't compile this yourself (and instead used a wheel/conda)

@rubennj
Author

rubennj commented Oct 16, 2015

Yes, I have installed everything using conda. I have also tried on 2 different PCs (both win64) with the same results.

PS: Have you tried at least a few executions? Sometimes it doesn't show up in the first few tries.

@jreback
Contributor

jreback commented Oct 16, 2015

yep.

try a clean conda environment. if you can repro then show the steps

@rubennj
Author

rubennj commented Oct 16, 2015

OK, using a new environment with only pandas + dependencies it works well. Thanks!!!

This issue was not at all easy to spot. I fear for my previous work... Could it be a conda issue?

@jreback
Contributor

jreback commented Oct 16, 2015

I would say that you have multiple installations of anaconda/miniconda. If you don't have a particular library installed, it will be picked up from the other installation (if you have both on the path).

so clean up your path.
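
One way to check which installation each library is actually loaded from is sketched below. This is only a diagnostic sketch, not something posted in the thread: the helper name `report_import_locations` is hypothetical, and the list of packages to inspect is an assumption based on the libraries discussed here.

```python
import importlib
import sys


def report_import_locations(names):
    """Map each module name to (version, file path), or None if it cannot be imported."""
    locations = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            locations[name] = None
            continue
        locations[name] = (
            getattr(module, "__version__", "unknown"),
            getattr(module, "__file__", "built-in"),
        )
    return locations


# The interpreter in use and the packages relevant to this thread.
print("interpreter:", sys.executable)
for name, info in report_import_locations(["pandas", "numpy", "numexpr"]).items():
    print(name, "->", info)
```

If a reported path points outside the active environment, that package is coming from another installation on the path.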

@rubennj
Author

rubennj commented Mar 1, 2016

It seems that this issue is linked to #12023 and therefore to numexpr=2.4.4 (#12489).
In this case it appeared without calling .query(), but numexpr=2.4.4 was installed.

@jreback
Contributor

jreback commented Mar 1, 2016

just try upgrading numexpr

lots of calculations defer to numexpr
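
To test whether numexpr is involved, the comparison from the original report can be rerun with numexpr switched off. The sketch below is an adaptation, not the reporter's exact code: it assumes a recent pandas, so it uses `.iloc` and `.to_numpy()` in place of `.ix` and `.values`, and it relies on the `compute.use_numexpr` option. The repetition count of 5 is arbitrary.

```python
import numpy as np
import pandas as pd


def sum_products(frame):
    """Return the sum of row-wise products computed via NumPy and via pandas."""
    values = frame.to_numpy()
    np_total = (values[:, 0] * values[:, 1] * values[:, 2] * values[:, 3]).sum()
    pd_total = (
        frame.iloc[:, 0] * frame.iloc[:, 1] * frame.iloc[:, 2] * frame.iloc[:, 3]
    ).sum()
    return np_total, pd_total


np.random.seed(0)
data = pd.DataFrame(np.random.rand(500000, 4))

# Temporarily tell pandas not to hand arithmetic to numexpr,
# even if numexpr is installed.
with pd.option_context("compute.use_numexpr", False):
    results = [sum_products(data) for _ in range(5)]

# Collect any repetition where the two totals disagree beyond rounding.
mismatches = [pair for pair in results if not np.isclose(pair[0], pair[1])]
print("mismatches with numexpr disabled:", len(mismatches))
```

If zeros show up only when the option is left enabled, that points at the installed numexpr rather than at pandas itself.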

@rubennj
Author

rubennj commented Mar 1, 2016

That's it, with numexpr=2.5 it works well. Just confirming it for the record.
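
A quick check of the installed numexpr version might look like the sketch below. The helper name `numexpr_status` and the `AFFECTED` set are hypothetical; 2.4.4 is the only version this thread names as problematic, so nothing is assumed about other releases.

```python
from importlib import metadata

# Only version reported as problematic in this thread (assumption: others are fine).
AFFECTED = {"2.4.4"}


def numexpr_status(affected=AFFECTED):
    """Describe the installed numexpr version, flagging the one named in the thread."""
    try:
        installed = metadata.version("numexpr")
    except metadata.PackageNotFoundError:
        return "numexpr is not installed"
    if installed in affected:
        return "numexpr {} is installed; consider upgrading".format(installed)
    return "numexpr {} is installed".format(installed)


print(numexpr_status())
```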
