DOC: Clarifiy fill_value behavior in arithmetic ops #19653

Closed
HagaiHargil opened this issue Feb 12, 2018 · 11 comments · Fixed by #19675
Labels
Docs good first issue Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Numeric Operations Arithmetic, Comparison, and Logical operations Usage Question
Milestone

Comments

@HagaiHargil
Contributor

HagaiHargil commented Feb 12, 2018

When adding two DataFrames using df1.add(df2) one can use the fill_value parameter to fill in any NaNs that might come up. This parameter seems pretty broken:

import numpy as np
import pandas as pd

a = pd.DataFrame(np.ones((3, 2)))
b = pd.DataFrame(np.ones((4, 3)))
print(a + b)  # all good:
#      0    1   2
# 0  2.0  2.0 NaN
# 1  2.0  2.0 NaN
# 2  2.0  2.0 NaN
# 3  NaN  NaN NaN

print(a.add(b))  # all good
print(a.add(b, fill_value=0))  # broken
#      0    1    2
# 0  2.0  2.0  1.0
# 1  2.0  2.0  1.0
# 2  2.0  2.0  1.0
# 3  1.0  1.0  1.0

As you see, it filled the NaNs with 1.0. Changing fill_value=1 will fill everything with 2.0. However, changing some of the values inside b leads to more peculiar results, and I couldn't really connect the dots and find some pattern.

This was observed on Python 3.6 on both Linux and Windows.

Thanks.

```
commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-514.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.22.0
pytest: None
pip: 9.0.1
setuptools: 38.4.0
Cython: 0.27.3
numpy: 1.14.0
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: 1.6.7
patsy: None
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.1.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.9999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
```

@jreback
Contributor

jreback commented Feb 12, 2018

pls read the documentation. This is doing an align then a fillna operation.

In [1]: a = pd.DataFrame(np.ones((3, 2))) * 2
   ...: b = pd.DataFrame(np.ones((4, 3)))
   ...: 

In [2]: a
Out[2]: 
     0    1
0  2.0  2.0
1  2.0  2.0
2  2.0  2.0

In [3]: b
Out[3]: 
     0    1    2
0  1.0  1.0  1.0
1  1.0  1.0  1.0
2  1.0  1.0  1.0
3  1.0  1.0  1.0

In [4]: ax, bx = a.align(b)

In [5]: ax
Out[5]: 
     0    1   2
0  2.0  2.0 NaN
1  2.0  2.0 NaN
2  2.0  2.0 NaN
3  NaN  NaN NaN

In [6]: bx
Out[6]: 
     0    1    2
0  1.0  1.0  1.0
1  1.0  1.0  1.0
2  1.0  1.0  1.0
3  1.0  1.0  1.0

In [7]: ax.fillna(0) + bx.fillna(0)
Out[7]: 
     0    1    2
0  3.0  3.0  1.0
1  3.0  3.0  1.0
2  3.0  3.0  1.0
3  1.0  1.0  1.0

In [8]: a.add(b,fill_value=0)
Out[8]: 
     0    1    2
0  3.0  3.0  1.0
1  3.0  3.0  1.0
2  3.0  3.0  1.0
3  1.0  1.0  1.0
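
The `fillna` recipe in `In [7]` matches `add` here only because no position is missing on both sides after alignment. A minimal sketch of a closer emulation follows; `add_with_fill` is a hypothetical helper written for illustration, not a pandas function, and it assumes the rule quoted from the docs that a location missing in both inputs stays missing.

```python
import numpy as np
import pandas as pd


def add_with_fill(left, right, fill_value):
    """Hypothetical helper that mimics left.add(right, fill_value=...)."""
    # Outer-align both frames so they share the same index and columns.
    lx, rx = left.align(right)
    # Positions that are NaN on both sides must stay NaN in the result.
    both_missing = lx.isna() & rx.isna()
    result = lx.fillna(fill_value) + rx.fillna(fill_value)
    return result.mask(both_missing)


a = pd.DataFrame(np.ones((3, 2))) * 2
b = pd.DataFrame(np.ones((4, 3)))
# (3, 2) has no entry in `a` and is NaN in `b`: missing on both sides.
b.iloc[3, 2] = np.nan

expected = a.add(b, fill_value=0)
emulated = add_with_fill(a, b, fill_value=0)
print(emulated.equals(expected))
```

A plain `ax.fillna(0) + bx.fillna(0)` would put `0.0` at position `(3, 2)`, whereas `add` leaves it as NaN.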

@jreback jreback closed this as completed Feb 12, 2018
@jreback jreback added Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Usage Question Numeric Operations Arithmetic, Comparison, and Logical operations labels Feb 12, 2018
@jreback jreback added this to the No action milestone Feb 12, 2018
@HagaiHargil
Contributor Author

HagaiHargil commented Feb 12, 2018

Hi,

I wouldn't post a GitHub issue without reading the documentation first :)

The documentation can be understood both ways, and actually I think that my "explanation" is more intuitive and straight-forward.

It currently simply states that:

fill_value: Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing.

It's very unclear that it refers to two types of NaN values: (a) those that already exist in the data, and (b) those generated by the alignment step, before the addition step. It does not cover the NaNs that are visible after a simple addition of the DataFrames, which are the most relevant to the end user.

At the least, I believe the documentation should be edited, don't you think?
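
To make the two readings concrete, here is a small sketch using the frames from the original post. The variable names `filled_before` and `filled_after` are made up for illustration: the first is what `fill_value` does, the second is the "fill the NaNs of the result" reading.

```python
import numpy as np
import pandas as pd

a = pd.DataFrame(np.ones((3, 2)))
b = pd.DataFrame(np.ones((4, 3)))

# What fill_value does: fill gaps in the aligned inputs, then add.
filled_before = a.add(b, fill_value=0)

# The other reading: add first, then fill NaNs in the result.
filled_after = (a + b).fillna(0)

# Row 3, column 2 exists only in `b`.
print(filled_before.loc[3, 2])  # 1.0  (0 + 1)
print(filled_after.loc[3, 2])   # 0.0  (NaN replaced afterwards)
```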

@jreback
Contributor

jreback commented Feb 12, 2018

you are welcome to PR a doc update if u think it is not clear

@TomAugspurger
Contributor

Yep, I can see the point of confusion.

@HagaiHargil are you comfortable making a PR clarifying things?

pandas/pandas/core/ops.py

Lines 248 to 259 in a277108

_flex_doc_SERIES = """
{desc} of series and other, element-wise (binary operator `{op_name}`).

Equivalent to ``{equiv}``, but with support to substitute a fill_value for
missing data in one of the inputs.

Parameters
----------
other : Series or scalar value
fill_value : None or float value, default None (NaN)
    Fill missing (NaN) values with this value. If both Series are
    missing, the result will be missing

and

pandas/pandas/core/ops.py

Lines 273 to 284 in a277108

_arith_doc_FRAME = """
Binary operator %s with support to substitute a fill_value for missing data in
one of the inputs

Parameters
----------
other : Series, DataFrame, or constant
axis : {0, 1, 'index', 'columns'}
    For Series input, axis to match Series index on
fill_value : None or float value, default None
    Fill missing (NaN) values with this value. If both DataFrame locations are
    missing, the result will be missing

@TomAugspurger TomAugspurger reopened this Feb 12, 2018
@TomAugspurger TomAugspurger changed the title DataFrame.add() with fill_value parameter fills in wrong value DOC: Clarifiy fill_value behavior in arithmetic ops Feb 12, 2018
@TomAugspurger TomAugspurger modified the milestones: No action, Next Major Release Feb 12, 2018
@HagaiHargil
Contributor Author

I'd be happy to, I'm just having a hard time phrasing it. Here's a draft:

fill_value : None or float value, default None
    Fill existing missing (NaN) values, and any new element needed for
    successful array alignment, with this value before computation. If data
    in both corresponding DataFrame locations is missing, the result will be
    missing

I'd also like to add an example, similar to the one in my initial post. Should it be added in both locations as well?

@TomAugspurger
Contributor

Fill existing missing (NaN) values

That part is incorrect. Existing missing values will still be NA after the operation. Only newly created missing values (created in the background by the align) will be filled with fill_value. Everything else looks good though. So maybe something like

Fill missing values created by alignment with this value before doing the operation.
Existing missing values will not be filled by ``fill_value``.
In [19]: a = pd.Series([1, 1, np.nan, np.nan], index=['a', 'b', 'c', 'd'])

In [20]: b = pd.Series([1, 1, np.nan, np.nan], index=['b', 'c', 'd', 'e'])

In [21]: a
Out[21]:
a    1.0
b    1.0
c    NaN
d    NaN
dtype: float64

In [22]: b
Out[22]:
b    1.0
c    1.0
d    NaN
e    NaN
dtype: float64

In [23]: a.add(b, fill_value=0)
Out[23]:
a    1.0
b    2.0
c    1.0
d    NaN
e    NaN
dtype: float64

Should it be added in both locations as well?

Yep, both places (one using Series, one using DataFrame). Feel free to just put an example for add in there, even though it'll show up for e.g. Series.sub. I think people will get the idea.

It'd be good to include an example that has existing missing values and newly created missing values.
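
As a rough illustration of how the same filling carries over to `Series.sub`, here is a sketch with made-up data. It mixes an existing NaN with gaps created by alignment; the side that gets filled matters because subtraction is not symmetric.

```python
import numpy as np
import pandas as pd

# 'b' holds an existing NaN on the left; 'a' and 'c' each exist on one side only.
left = pd.Series([1.0, np.nan], index=["a", "b"])
right = pd.Series([1.0, 1.0], index=["b", "c"])

result = left.sub(right, fill_value=0)

print(result["a"])  # 1.0   (1 - 0: right side filled)
print(result["b"])  # -1.0  (0 - 1: existing NaN on the left filled)
print(result["c"])  # -1.0  (0 - 1: left side filled)
```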

@HagaiHargil
Contributor Author

Umm, I might be misunderstanding you, but I think that it does fill existing NA values:

a = pd.DataFrame(np.ones((3, 2)))
b = pd.DataFrame(np.ones((4, 3)))

a.iloc[0, 0] = np.nan
a.add(b, fill_value=10)
#       0     1     2
# 0  11.0   2.0  11.0
# 1   2.0   2.0  11.0
# 2   2.0   2.0  11.0
# 3  11.0  11.0  11.0

I consider the NA value in a[0, 0] to be existing, in which case fill_value does turn it into a 10 before the addition.

@TomAugspurger
Contributor

I was apparently confused. It seems to fill whenever the left and right values are not both NA. So valid + NA, both existing, will be filled.

In [90]: a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])

In [91]: b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'c_', 'd'])

In [92]: a
Out[92]:
a    1.0
b    1.0
c    1.0
d    NaN
dtype: float64

In [93]: b
Out[93]:
a     1.0
b     NaN
c_    1.0
d     NaN
dtype: float64

In [94]: a.add(b, fill_value=0)
Out[94]:
a     2.0
b     1.0
c     1.0
c_    1.0
d     NaN
dtype: float64

| op | output |
| --- | --- |
| valid + valid | valid |
| valid + existing NA | valid (valid + fill_value) |
| valid + new NA | valid (valid + fill_value) |
| new NA + valid | valid (fill_value + valid) |
| existing NA + existing NA | NA |

That's a bit strange to me, but not worth changing at this point I think.
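
The rows of the table above can be checked one label at a time. This is a sketch with made-up labels; the last label, `new_na`, is an extra case not listed in the table (a gap created by alignment meeting an existing NaN), which behaves like label `e` in the earlier `In [23]` example.

```python
import numpy as np
import pandas as pd

left = pd.Series({"vv": 1.0, "v_na": 1.0, "v_new": 1.0, "na_na": np.nan})
right = pd.Series(
    {"vv": 1.0, "v_na": np.nan, "new_v": 1.0, "na_na": np.nan, "new_na": np.nan}
)

out = left.add(right, fill_value=10)

print(out["vv"])      # 2.0   valid + valid
print(out["v_na"])    # 11.0  valid + existing NA
print(out["v_new"])   # 11.0  valid + new NA (label absent on the right)
print(out["new_v"])   # 11.0  new NA + valid (label absent on the left)
print(out["na_na"])   # nan   existing NA + existing NA
print(out["new_na"])  # nan   new NA + existing NA
```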

@ZhuBaohe
Contributor

DataFrame.add(other, axis='columns', level=None, fill_value=None)
Addition of dataframe and other, element-wise (binary operator add).
Equivalent to dataframe + other, but with support to substitute a fill_value for missing data in one of the inputs.

I think the document is clear.

@TomAugspurger
Contributor

TomAugspurger commented Feb 13, 2018 via email

@HagaiHargil
Contributor Author

I posted #19675. @TomAugspurger Notice that after rebuilding the docs I found out that without creating the last two commits in that PR, referring to line 338 and below, the pd.DataFrame docs didn't update, only the pd.Series ones.

@jreback jreback modified the milestones: Next Major Release, 0.23.0 Feb 15, 2018
4 participants