multiplication of long Series returns garbled results #12556

mobiusdumpling · 2016-03-07T18:26:47Z

Code Sample, a copy-pastable example if possible

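import numpy as np
import pandas as pd

num_tries = 1000
length = 100000  # the bug only shows up for long columns (see Commentary)
for i in range(num_tries):
    # Both columns lie in [2.0e7, 2.01e7), so every product should be about 4e14.
    df = pd.DataFrame({"A" : 100000 * (200 + np.random.random(length)),
                       "B" : 100000 * (200 + np.random.random(length))})
    df['A*B'] = df.A * df.B
    # Any product below 100 is therefore garbage.
    if (df['A*B'] < 100).any():
        print("found the bug in iteration %d" % i)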

Expected Output

Empty

Actual Output Example:

found the bug in iteration 7
found the bug in iteration 36
found the bug in iteration 103
found the bug in iteration 113
found the bug in iteration 120
....

Commentary

The output is different on each run.

The problem does not occur if we set length = 10000. Thus, the problem only occurs for long vectors.

The problem also does not occur if we convert the pandas columns to numpy arrays first. That is, if we replace the multiplication line with
df['A*B'] = np.array(df.A) * np.array(df.B)
then the problem does not occur.
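
One way to check that the difference really lies in the pandas multiplication path is to compare the two products element by element, instead of relying on the `< 100` threshold. The following is a minimal sketch of that idea; `count_mismatches` and its `seed` parameter are hypothetical names introduced here for illustration and are not part of the original report.

```python
import numpy as np
import pandas as pd


def count_mismatches(length, seed=0):
    """Count positions where the pandas product differs from the NumPy product."""
    rng = np.random.RandomState(seed)  # fixed seed so a given run is repeatable
    df = pd.DataFrame({
        "A": 100000 * (200 + rng.random_sample(length)),
        "B": 100000 * (200 + rng.random_sample(length)),
    })
    via_pandas = (df.A * df.B).values        # Series multiplication
    via_numpy = df.A.values * df.B.values    # plain ndarray multiplication
    return int((via_pandas != via_numpy).sum())


if __name__ == "__main__":
    for length in (10000, 100000):
        print(length, count_mismatches(length))
```

On a healthy installation both counts should be zero. A non-zero count for the longer column only would match the behaviour described above.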

I suspect this is a CPU-related problem, and that some users will not be able to reproduce it. I'm running on a 64-bit Intel i7-3610QM. It is entirely possible that my CPU is malfunctioning, but it's still odd that the problem is so prevalent, and that it gets solved when using numpy multiplication instead of pandas multiplication.

More info

Here is an example of the head of the data frame from a buggy iteration:

                  A                B            A*B
0   20061284.685632  20098766.753960  6.698204e-316
1   20049742.754308  20038548.079325  8.741955e-316
2   20002832.802216  20045228.660298   2.000283e+07
3   20009535.502458  20098483.298851   2.000954e+07
4   20091479.138939  20031304.664403   2.009148e+07
5   20053896.596382  20014109.767089   2.005390e+07
6   20019252.981156  20074785.383487   2.001925e+07
7   20073563.758979  20007609.514515   2.007356e+07
8   20092312.956176  20007082.949870   2.009231e+07
9   20070058.044947  20049647.829544   2.007006e+07
10  20010300.783713  20011630.430404   2.001030e+07

output of `pd.show_versions()`

commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Windows
OS-release: 8.1
machine: AMD64
processor: Intel64 Family 6 Model 58 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.17.1
nose: 1.3.7
pip: 8.0.3
setuptools: 20.1.1
Cython: 0.23.2
numpy: 1.10.4
scipy: 0.17.0
statsmodels: 0.6.1
IPython: 4.0.1
sphinx: 1.2.3
patsy: 0.4.0
dateutil: 2.4.1
pytz: 2015.6
blosc: None
bottleneck: None
tables: 3.2.0
numexpr: 2.4.4
matplotlib: 1.4.3
openpyxl: 2.0.2
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.7.3
lxml: 3.4.4
bs4: 4.3.2
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.8
pymysql: None
psycopg2: None
Jinja2: None

jreback · 2016-03-07T18:38:03Z

This is an issue in numexpr==2.4.4; that version is blacklisted in the upcoming pandas 0.18.0.

Simply upgrade to numexpr > 2.4.4 and this will be fixed.
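
A quick way to tell whether an environment has the affected release is to inspect the installed numexpr version. This is only a sketch: `numexpr_is_affected` is a hypothetical helper, not a pandas or numexpr function, and it assumes a plain dotted version string such as `2.4.4`.

```python
def numexpr_is_affected(version):
    """Return True if `version` names the release called out in this thread (2.4.4)."""
    # Keep the first three dotted components and ignore any non-numeric ones.
    parts = tuple(int(p) for p in version.split(".")[:3] if p.isdigit())
    return parts == (2, 4, 4)


try:
    import numexpr
except ImportError:
    # numexpr is an optional dependency; without it pandas uses plain NumPy.
    print("numexpr is not installed")
else:
    if numexpr_is_affected(numexpr.__version__):
        print("numexpr 2.4.4 detected; upgrade to a later release")
    else:
        print("numexpr %s is not the release named in this issue" % numexpr.__version__)
```

If upgrading has to wait, recent pandas versions also expose a `compute.use_numexpr` option (`pd.set_option("compute.use_numexpr", False)`). Whether that option behaves the same way in 0.17.1 is an assumption worth checking before relying on it.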

jreback closed this as completed Mar 7, 2016

jreback added the Compat pandas objects compatability with Numpy or Python functions label Mar 7, 2016
