multiplication of long Series returns garbled results #12556

Closed
mobiusdumpling opened this issue Mar 7, 2016 · 1 comment
Labels
Compat pandas objects compatibility with Numpy or Python functions

Comments

@mobiusdumpling

Code Sample, a copy-pastable example if possible

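import numpy as np
import pandas as pd

num_tries = 1000
length = 100000
for i in range(num_tries):
    # Both columns hold values in [20000000, 20100000), so every product is about 4e14.
    df = pd.DataFrame({"A" : 100000 * (200 + np.random.random(length)),
                       "B" : 100000 * (200 + np.random.random(length))})
    df['A*B'] = df.A * df.B
    # A product below 100 is impossible for these inputs, so it signals corruption.
    if (df['A*B'] < 100).any():
        print("found the bug in iteration %d" % i)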

Expected Output

Empty (no output)

Actual Output Example:

found the bug in iteration 7
found the bug in iteration 36
found the bug in iteration 103
found the bug in iteration 113
found the bug in iteration 120
....

Commentary

The output is different on each run.

The problem does not occur if we set length = 10000. Thus, the problem only occurs for long vectors.

The problem also does not occur if we convert the pandas columns to numpy arrays before multiplying. That is, replacing the multiplication line with
df['A*B'] = np.array(df.A) * np.array(df.B)
makes the problem go away.
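
The comparison described above can be turned into a direct check. The following is a minimal sketch, not part of the original report: the helper name `products_disagree` and the fixed seed are assumptions made for illustration. It multiplies the same two columns once as pandas Series and once as plain numpy arrays, then reports whether the results differ.

```python
import numpy as np
import pandas as pd


def products_disagree(length=100000, seed=0):
    """Return True if Series multiplication and ndarray multiplication differ.

    Hypothetical helper for illustration; a seeded RandomState keeps runs repeatable.
    """
    rng = np.random.RandomState(seed)
    df = pd.DataFrame({
        "A": 100000 * (200 + rng.random_sample(length)),
        "B": 100000 * (200 + rng.random_sample(length)),
    })
    # Series * Series: pandas may hand large operands to numexpr when it is installed.
    via_pandas = np.asarray(df.A * df.B)
    # ndarray * ndarray: evaluated by numpy alone.
    via_numpy = np.asarray(df.A) * np.asarray(df.B)
    return not np.allclose(via_pandas, via_numpy)


if __name__ == "__main__":
    print("mismatch found:", products_disagree())
```

On a healthy installation this should print `mismatch found: False`. Because the reported failure was intermittent, a single call that returns False does not prove the installation is unaffected; calling the helper in a loop with different seeds mirrors the original reproduction more closely.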

I suspect this is a CPU-related problem, and that some users will not be able to reproduce it. I'm running on a 64-bit Intel i7-3610QM. It is entirely possible that my CPU is malfunctioning, but it's still odd that the problem is so prevalent, and that it gets solved when using numpy multiplication instead of pandas multiplication.

More info

Here is an example of the head of the data-frame from a buggy iteration:

                  A                B            A*B
0   20061284.685632  20098766.753960  6.698204e-316
1   20049742.754308  20038548.079325  8.741955e-316
2   20002832.802216  20045228.660298   2.000283e+07
3   20009535.502458  20098483.298851   2.000954e+07
4   20091479.138939  20031304.664403   2.009148e+07
5   20053896.596382  20014109.767089   2.005390e+07
6   20019252.981156  20074785.383487   2.001925e+07
7   20073563.758979  20007609.514515   2.007356e+07
8   20092312.956176  20007082.949870   2.009231e+07
9   20070058.044947  20049647.829544   2.007006e+07
10  20010300.783713  20011630.430404   2.001030e+07
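
The table can be read more closely with a few lines of plain Python. This sketch is an illustration added for this write-up, and the `classify` helper and its labels are hypothetical names. It takes rows 0, 2 and 3 exactly as printed above and checks each reported product against two patterns: a value below the smallest normal double, and a value that matches column A to the printed precision.

```python
import sys

# (A, B, reported A*B) copied from rows 0, 2 and 3 of the table above.
rows = [
    (20061284.685632, 20098766.753960, 6.698204e-316),
    (20002832.802216, 20045228.660298, 2.000283e+07),
    (20009535.502458, 20098483.298851, 2.000954e+07),
]


def classify(a, b, reported):
    """Describe how a reported product relates to its inputs."""
    if 0 < abs(reported) < sys.float_info.min:
        return "subnormal"    # below the smallest normal double (about 2.2e-308)
    if "%.6e" % a == "%.6e" % reported:
        return "copy of A"    # equals A to the seven significant digits printed
    if abs(reported - a * b) <= 1e-6 * abs(a * b):
        return "correct"      # agrees with the true product, roughly 4e14
    return "other"


labels = [classify(*row) for row in rows]
print(labels)  # prints ['subnormal', 'copy of A', 'copy of A']
```

None of the three rows comes out as "correct". Rows 2 through 10 all follow the same "copy of A" pattern when compared at the printed precision, which suggests the result buffer was not being filled with products at all.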

output of pd.show_versions()

commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Windows
OS-release: 8.1
machine: AMD64
processor: Intel64 Family 6 Model 58 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.17.1
nose: 1.3.7
pip: 8.0.3
setuptools: 20.1.1
Cython: 0.23.2
numpy: 1.10.4
scipy: 0.17.0
statsmodels: 0.6.1
IPython: 4.0.1
sphinx: 1.2.3
patsy: 0.4.0
dateutil: 2.4.1
pytz: 2015.6
blosc: None
bottleneck: None
tables: 3.2.0
numexpr: 2.4.4
matplotlib: 1.4.3
openpyxl: 2.0.2
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.7.3
lxml: 3.4.4
bs4: 4.3.2
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.8
pymysql: None
psycopg2: None
Jinja2: None

@jreback
Contributor

jreback commented Mar 7, 2016

This is an issue in numexpr==2.4.4; see the blacklisting in the upcoming pandas 0.18.0 here

Simply upgrade to numexpr>2.4.4 and this will be fixed.
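
To find out whether an environment has the release named above, the installed version can be read from package metadata. This is a minimal sketch under stated assumptions: it uses `importlib.metadata`, which needs Python 3.8 or later (the original report ran on Python 2.7), and the helper name `numexpr_status` and its return labels are invented for illustration. It tests only for the exact release 2.4.4 and makes no claim about other versions.

```python
from importlib import metadata

AFFECTED_NUMEXPR = "2.4.4"  # the release named in the comment above


def numexpr_status(installed=None):
    """Return 'missing', 'affected' or 'other' for a numexpr version string.

    Hypothetical helper: when no version is passed, the installed one is looked up.
    """
    if installed is None:
        try:
            installed = metadata.version("numexpr")
        except metadata.PackageNotFoundError:
            return "missing"
    # 'other' only means "not 2.4.4"; it is not a guarantee of correctness.
    return "affected" if installed.strip() == AFFECTED_NUMEXPR else "other"


if __name__ == "__main__":
    print("numexpr status:", numexpr_status())
```

If the status is "affected", upgrading numexpr is the fix given above. Where upgrading is not possible, recent pandas releases also expose the option `compute.use_numexpr`, which can be set to False with `pd.set_option` to turn off numexpr acceleration; whether that option is available in a given older release should be checked against its documentation.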

jreback closed this as completed Mar 7, 2016
jreback added the Compat pandas objects compatibility with Numpy or Python functions label Mar 7, 2016