Issue with Zero's Broadcasting down a column in Heterogeneous data columns ... #13758

Closed
mmcky opened this issue Jul 22, 2016 · 6 comments · Fixed by #30769
Labels
good first issue Needs Tests Unit test(s) needed to prevent regressions

@mmcky
Contributor

mmcky commented Jul 22, 2016

I had a recent use case which produced some unexpected output and I have isolated the problem down to a much simpler case to replicate the behavior.

Please see simplified code example below.

Code Sample, a copy-pastable example if possible

import pandas as pd
import numpy as np
data = {
    'One' : pd.Series(['A', 1.2, np.nan]),
    'Two' : pd.Series([1.4, 3.2, 4.5])
}
df = pd.DataFrame(data)

Summing across only the first column (redundant here, since there is a single column). [Note: in my case I was doing groupby operations where some columns were being summed and others were single columns indexed by a level of a MultiIndex. The issue was that some of the columns ended up full of 0s.]

df[['One']].sum(axis=1)

produces an unexpected outcome:

0    0.0
1    0.0
2    0.0
dtype: float64

I think this might be a bug: it produces a valid-looking column of 0s with dtype=float64.

Expected Output

0      A
1    1.2
2    NaN
Name: One, dtype: object

output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 3.16.0-38-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.1
nose: 1.3.7
pip: 8.1.2
setuptools: 23.0.0
Cython: 0.24
numpy: 1.11.1
scipy: 0.17.1
statsmodels: 0.6.1
xarray: None
IPython: 4.2.0
sphinx: 1.4.1
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: 1.1.0
tables: 3.2.2
numexpr: 2.6.0
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.2
lxml: 3.6.0
bs4: 4.4.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.13
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.40.0
pandas_datareader: None

@jreback
Contributor

jreback commented Jul 22, 2016

you are expecting the sum of objects and floats to actually work, but it doesn't; pandas catches the error and ignores the column. So these results are all as expected

you are simply getting back the Two column.

In [32]: df
Out[32]: 
   One  Two
0    A  1.4
1  1.2  3.2
2  NaN  4.5

In [37]: df.dtypes
Out[37]: 
One     object
Two    float64
dtype: object
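As an aside for readers: the `object` dtype reported above does not show which Python types a column actually holds. Below is a minimal sketch for listing them, assuming the same `df` as in the report. `object_column_types` is a hypothetical helper written for illustration, not a pandas function.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "One": pd.Series(["A", 1.2, np.nan]),
    "Two": pd.Series([1.4, 3.2, 4.5]),
})


def object_column_types(frame):
    """Map each object-dtype column to the sorted names of the Python types it holds."""
    return {
        col: sorted({type(value).__name__ for value in frame[col]})
        for col in frame.columns
        if frame[col].dtype == object
    }


types_found = object_column_types(df)
print(types_found)  # {'One': ['float', 'str']}
```

Note that `np.nan` is itself a Python `float`, so a column of strings with missing values also shows up as mixed under this check.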

In [33]: df.sum(axis=1)
Out[33]: 
0    1.4
1    3.2
2    4.5
dtype: float64

operations between object and float columns don't work

In [43]: df['One'].values
Out[43]: array(['A', 1.2, nan], dtype=object)

In [44]: df['Two'].values
Out[44]: array([ 1.4,  3.2,  4.5])

In [45]: df['One'].values + df['Two'].values
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-45-391c55893f92> in <module>()
----> 1 df['One'].values + df['Two'].values

TypeError: cannot concatenate 'str' and 'float' objects

you probably want to use pd.to_numeric to make the invalid values nan.

In [53]: df2 = df.apply(lambda x: pd.to_numeric(x, errors='coerce'))

In [54]: df2
Out[54]: 
   One  Two
0  NaN  1.4
1  1.2  3.2
2  NaN  4.5

In [55]: df2.dtypes
Out[55]: 
One    float64
Two    float64
dtype: object

In [56]: df2.sum(axis=1)
Out[56]: 
0    1.4
1    4.4
2    4.5
dtype: float64
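The two routes mentioned in this comment (leaving the object column out, or coercing it to numeric) can also be written so that the choice is visible in the code rather than left to the reduction. This is only a sketch under the assumption that the frame looks like the one in the report; `DataFrame.select_dtypes` and `pd.to_numeric` are existing pandas calls, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "One": pd.Series(["A", 1.2, np.nan]),
    "Two": pd.Series([1.4, 3.2, 4.5]),
})

# Option 1: keep only numeric columns, so dropping 'One' is an explicit step.
numeric_only_sum = df.select_dtypes(include="number").sum(axis=1)

# Option 2: coerce 'One' to numeric; values that cannot be parsed become NaN.
coerced = df.assign(One=pd.to_numeric(df["One"], errors="coerce"))
coerced_sum = coerced.sum(axis=1)

# Count how many non-missing values were lost to coercion (here: the 'A').
values_lost = int(coerced["One"].isna().sum() - df["One"].isna().sum())

print(numeric_only_sum)
print(coerced_sum)
print(values_lost)
```

Counting the values lost to coercion is one way to notice when `errors='coerce'` has discarded more data than intended.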

@jreback jreback closed this as completed Jul 22, 2016
@jreback jreback added Dtype Conversions Unexpected or buggy dtype conversions Usage Question labels Jul 22, 2016
@mmcky
Contributor Author

mmcky commented Jul 22, 2016

Thanks for your response @jreback. I see your reasoning with .to_numeric() -- good call.

But focusing on the main case, using the first example:

import pandas as pd
import numpy as np
data = {
    'One' : pd.Series(['A', 1.2, np.nan]),
    'Two' : pd.Series([1.4, 3.2, 4.5])
}
df = pd.DataFrame(data)

If a column meets the condition above, it gets replaced with a column of 0s, removing valid data and converting it to a numeric dtype.

df[['One']].sum(axis=1)

will produce a column of 0 values, replacing both the np.nan values and all other valid data. This is very hard to detect because the entire column becomes 0 and its dtype changes to float64. I think this is dangerous default behavior.
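One defensive pattern for the concern raised here is to refuse non-numeric columns up front instead of letting the reduction decide what to do with them. The sketch below assumes that failing loudly is acceptable for the caller. `checked_row_sum` is a hypothetical helper, not part of pandas, while `pd.api.types.is_numeric_dtype` is an existing pandas function.

```python
import numpy as np
import pandas as pd


def checked_row_sum(frame):
    """Row-wise sum that raises on non-numeric columns instead of dropping or zeroing them."""
    non_numeric = [
        col for col in frame.columns
        if not pd.api.types.is_numeric_dtype(frame[col])
    ]
    if non_numeric:
        raise TypeError(f"cannot sum non-numeric columns: {non_numeric}")
    return frame.sum(axis=1)


df = pd.DataFrame({
    "One": pd.Series(["A", 1.2, np.nan]),
    "Two": pd.Series([1.4, 3.2, 4.5]),
})

# The heterogeneous column is rejected rather than turned into zeros.
error_message = None
try:
    checked_row_sum(df[["One"]])
except TypeError as exc:
    error_message = str(exc)
print(error_message)  # cannot sum non-numeric columns: ['One']

# A purely numeric selection sums as usual.
total = checked_row_sum(df[["Two"]])
print(total)
```

In a groupby pipeline like the one described in the original report, a check of this kind would surface the affected columns before their data is replaced.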

@jreback
Contributor

jreback commented Jul 22, 2016

not sure where that's coming from, I think something (bottleneck, numpy) is trying to be helpful here.

ok i'll reopen, but pls edit down the top just to include that example. It is too much otherwise.

@jreback jreback reopened this Jul 22, 2016
@jreback
Contributor

jreback commented Jul 22, 2016

and pls don't include images at all, just copy-paste text (format using markdown). the more helpful to readers (and the shorter), the more likely someone will look at this.

@mmcky
Contributor Author

mmcky commented Jul 22, 2016

Thanks @jreback. Pandas is great! Original question has been edited.

@jorisvandenbossche jorisvandenbossche added Bug Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff and removed Dtype Conversions Unexpected or buggy dtype conversions Usage Question labels Jul 23, 2016
@jorisvandenbossche jorisvandenbossche added this to the Next Major Release milestone Jul 23, 2016
@mroeschke
Member

Looks to be fixed on master. Could use a test.

In [15]: import pandas as pd
    ...: import numpy as np
    ...: data = {
    ...:     'One' : pd.Series(['A', 1.2, np.nan]),
    ...:     'Two' : pd.Series([1.4, 3.2, 4.5])
    ...: }
    ...: df = pd.DataFrame(data)

In [16]: df[['One']].sum(axis=1)
    ...:
Out[16]:
0      A
1    1.2
2      0
dtype: object

In [17]: pd.__version__
Out[17]: '0.26.0.dev0+734.g0de99558b'

@mroeschke mroeschke added good first issue Needs Tests Unit test(s) needed to prevent regressions and removed Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Bug labels Nov 2, 2019
@simonjayhawkins simonjayhawkins modified the milestones: Contributions Welcome, 1.0 Jan 7, 2020

5 participants