Issue with Zero's Broadcasting down a column in Heterogeneous data columns ... #13758

mmcky · 2016-07-22T19:31:27Z

I had a recent use case which produced some unexpected output and I have isolated the problem down to a much simpler case to replicate the behavior.

Please see simplified code example below.

Code Sample, a copy-pastable example if possible

import pandas as pd
import numpy as np
data = {
    'One' : pd.Series(['A', 1.2, np.nan]),
    'Two' : pd.Series([1.4, 3.2, 4.5])
}
df = pd.DataFrame(data)

Summing across the first column only (which is redundant in this simplified case). [Note: In my actual use case I was doing groupby operations where some columns were being summed and others were single columns indexed by a level of a MultiIndex. The issue was that some of the columns ended up being full of 0's.]

df[['One']].sum(axis=1)

produces an unexpected outcome:

0    0.0
1    0.0
2    0.0
dtype: float64

I think this might be a bug: it silently produces a valid-looking column of 0's with dtype=float64.

Expected Output

0      A
1    1.2
2    NaN
Name: One, dtype: object

output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 3.16.0-38-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.1
nose: 1.3.7
pip: 8.1.2
setuptools: 23.0.0
Cython: 0.24
numpy: 1.11.1
scipy: 0.17.1
statsmodels: 0.6.1
xarray: None
IPython: 4.2.0
sphinx: 1.4.1
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: 1.1.0
tables: 3.2.2
numexpr: 2.6.0
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.2
lxml: 3.6.0
bs4: 4.4.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.13
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.40.0
pandas_datareader: None


jreback · 2016-07-22T19:46:12Z

you are expecting the sum of objects and floats to actually work, but it doesn't; pandas catches the error and ignores the column. So these results are all as expected

you are simply getting back the Two column.

In [32]: df
Out[32]: 
   One  Two
0    A  1.4
1  1.2  3.2
2  NaN  4.5

In [37]: df.dtypes
Out[37]: 
One     object
Two    float64
dtype: object

In [33]: df.sum(axis=1)
Out[33]: 
0    1.4
1    3.2
2    4.5
dtype: float64

operations between object and float columns don't work

In [43]: df['One'].values
Out[43]: array(['A', 1.2, nan], dtype=object)

In [44]: df['Two'].values
Out[44]: array([ 1.4,  3.2,  4.5])

In [45]: df['One'].values + df['Two'].values
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-45-391c55893f92> in <module>()
----> 1 df['One'].values + df['Two'].values

TypeError: cannot concatenate 'str' and 'float' objects

you probably want to use pd.to_numeric to make the invalid values nan.

In [53]: df2 = df.apply(lambda x: pd.to_numeric(x, errors='coerce'))

In [54]: df2
Out[54]: 
   One  Two
0  NaN  1.4
1  1.2  3.2
2  NaN  4.5

In [55]: df2.dtypes
Out[55]: 
One    float64
Two    float64
dtype: object

In [56]: df2.sum(axis=1)
Out[56]: 
0    1.4
1    4.4
2    4.5
dtype: float64
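
Because `errors='coerce'` turns every unparseable value into NaN without any warning, it can help to look at which entries were affected before summing. The following is a minimal sketch, assuming the same `df` as in the example above; `coerced_mask` is a hypothetical helper name used for illustration and is not part of pandas.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'One': pd.Series(['A', 1.2, np.nan]),
    'Two': pd.Series([1.4, 3.2, 4.5]),
})


def coerced_mask(series):
    """Flag values that were present originally but became NaN after coercion.

    Hypothetical helper for illustration only, not a pandas API.
    """
    numeric = pd.to_numeric(series, errors='coerce')
    # True only where the original value was not missing but the coerced one is.
    return series.notna() & numeric.isna()


# Boolean frame: True where to_numeric had to discard a non-numeric value.
lost = df.apply(coerced_mask)
print(lost)
```

Only the `'A'` entry in column `One` is flagged; the pre-existing NaN is not, since nothing was lost there.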

mmcky · 2016-07-22T20:15:25Z

Thanks for your response @jreback. I see your reasoning with .to_numeric() -- good call.

But focusing on the main case.

Using the first example

import pandas as pd
import numpy as np
data = {
    'One' : pd.Series(['A', 1.2, np.nan]),
    'Two' : pd.Series([1.4, 3.2, 4.5])
}
df = pd.DataFrame(data)

If a column meets the condition above then it gets replaced with a column of 0's, removing valid data and turning it into a numeric column.

df[['One']].sum(axis=1)

will produce a column of 0 values replacing both np.nan values and all other valid data. This is very hard to detect because it replaces the entire column with 0 and changes it to dtype=float64. I think this is dangerous default behavior.
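
One possible way to guard against this in user code is to refuse to sum non-numeric columns rather than rely on the default handling. This is only a sketch under that assumption: `checked_row_sum` is a hypothetical helper, not a pandas function, and it relies only on `pandas.api.types.is_numeric_dtype`.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype


def checked_row_sum(frame):
    """Row-wise sum that raises instead of silently handling non-numeric columns.

    Hypothetical helper for illustration only, not a pandas API.
    """
    bad = [col for col, dtype in frame.dtypes.items()
           if not is_numeric_dtype(dtype)]
    if bad:
        raise TypeError(
            "non-numeric columns in row-wise sum: {}".format(bad))
    return frame.sum(axis=1)


df = pd.DataFrame({
    'One': pd.Series(['A', 1.2, np.nan]),
    'Two': pd.Series([1.4, 3.2, 4.5]),
})

# The float column sums normally.
print(checked_row_sum(df[['Two']]))

# The object column is rejected loudly instead of being altered.
try:
    checked_row_sum(df[['One']])
except TypeError as exc:
    print(exc)
```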

jreback · 2016-07-22T20:24:02Z

not sure where that's coming from, I think something (bottleneck, numpy) is trying to be helpful here.

ok i'll reopen, but pls edit down the top just to include that example. It is too much otherwise.

jreback · 2016-07-22T20:24:47Z

and pls don't include images at all, just copy-paste text (format using markdown). the more helpful to readers (and the shorter), the more likely someone will look at this.

mmcky · 2016-07-22T20:28:33Z

Thanks @jreback. Pandas is great! Original question has been edited.

mroeschke · 2019-11-02T01:38:22Z

Looks to be fixed on master. Could use a test.

In [15]: import pandas as pd
    ...: import numpy as np
    ...: data = {
    ...:     'One' : pd.Series(['A', 1.2, np.nan]),
    ...:     'Two' : pd.Series([1.4, 3.2, 4.5])
    ...: }
    ...: df = pd.DataFrame(data)

In [16]: df[['One']].sum(axis=1)
    ...:
Out[16]:
0      A
1    1.2
2      0
dtype: object

In [17]: pd.__version__
Out[17]: '0.26.0.dev0+734.g0de99558b'
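
A regression test for this could look roughly like the sketch below. It is an illustration only and not necessarily the test that was eventually added upstream; the test name is hypothetical. It deliberately asserts only the property reported in this issue (the column must not silently become all-zero float64) and tolerates a `TypeError`, since the exact result for mixed object data may differ between pandas versions.

```python
import numpy as np
import pandas as pd


def test_sum_single_object_column_not_zeroed():
    # Hypothetical test name; mirrors the example from the issue.
    df = pd.DataFrame({
        'One': pd.Series(['A', 1.2, np.nan]),
        'Two': pd.Series([1.4, 3.2, 4.5]),
    })
    try:
        result = df[['One']].sum(axis=1)
    except TypeError:
        # Raising on mixed objects is loud, so it is not the silent
        # zeroing described in this issue.
        return
    # The reported bug: every row became 0.0 with dtype float64.
    all_zero_float = result.dtype == np.float64 and bool((result == 0).all())
    assert not all_zero_float


test_sum_single_object_column_not_zeroed()
```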

jreback closed this as completed Jul 22, 2016

jreback added the Dtype Conversions and Usage Question labels Jul 22, 2016

jreback reopened this Jul 22, 2016

jorisvandenbossche added the Bug and Algos labels and removed the Dtype Conversions and Usage Question labels Jul 23, 2016

jorisvandenbossche added this to the Next Major Release milestone Jul 23, 2016

mroeschke added the good first issue and Needs Tests labels and removed the Algos and Bug labels Nov 2, 2019

mroeschke mentioned this issue Jan 7, 2020

TST: Add tests for fixed issues #30769

Merged

8 tasks

simonjayhawkins modified the milestones: Contributions Welcome, 1.0 Jan 7, 2020

mroeschke closed this as completed in #30769 Jan 7, 2020
