Skip to content

BUG: Don't ignore na_rep in DataFrame.to_html #36690

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 31 commits into from
Oct 24, 2020
Merged

BUG: Don't ignore na_rep in DataFrame.to_html #36690

merged 31 commits into from
Oct 24, 2020

Conversation

dsaxton
Copy link
Member

@dsaxton dsaxton commented Sep 27, 2020

@dsaxton dsaxton added Bug IO HTML read_html, to_html, Styler.apply, Styler.applymap labels Sep 27, 2020
@simonjayhawkins
Copy link
Member

Off the top of my head I recall a discussion where this may not be a bug. There may be a duplicate issue.

I've not checked the code. but if a user supplies a formatter, they are responsible for the na values too. Is that what is happening here?

@dsaxton
Copy link
Member Author

dsaxton commented Sep 29, 2020

Off the top of my head I recall a discussion where this may not be a bug. There may be a duplicate issue.

I've not checked the code. but if a user supplies a formatter, they are responsible for the na values too. Is that what is happening here?

There was an old PR similar to this that was almost merged but ended up being closed as stale, so I assume this is considered a real bug.

I could see how it might also be seen as ambiguous since NaN is technically a float, so that the user is asking for two different formats for these values (but in that case maybe the docs should explicitly clarify what happens in this situation).

@simonjayhawkins
Copy link
Member

does this also close #9046? and presumably to_string does the same?

@simonjayhawkins
Copy link
Member

I've not yet found the issue but from

float_format: one-parameter function, optional, default None
Formatter function to apply to columns’ elements if they are floats. The result of this function must be a unicode string.

so the changes here break existing code (this is a long standing behaviour even if considered a bug)

def my_formatter(x):
    if np.isnan(x):
        return "ted"
    else:
        return str(x)


df = pd.DataFrame(
    [
        ["A", 1.2225],
        [
            "A",
        ],
    ],
    columns=["Group", "Data"],
)
df.to_html(float_format=my_formatter)

1.1.2
image

this PR
image

@dsaxton
Copy link
Member Author

dsaxton commented Sep 29, 2020

does this also close #9046? and presumably to_string does the same?

Yes, tests added for those cases

@dsaxton
Copy link
Member Author

dsaxton commented Sep 30, 2020

so the changes here break existing code (this is a long standing behaviour even if considered a bug)

Ah I was forgetting "NaN" is actually the default and not None. Made a change which fixes this but leaves a different "bug" (albiet a more obscure one, it only ignores the specific string "NaN"; I don't think it's possible to respect both arguments in absolutely all cases):

[ins] In [1]: import numpy as np
         ...: import pandas as pd
         ...:
         ...: def my_formatter(x):
         ...:     if np.isnan(x):
         ...:         return "ted"
         ...:     else:
         ...:         return str(x)
         ...:
         ...:
         ...: df = pd.DataFrame(
         ...:     [
         ...:         ["A", 1.2225],
         ...:         [
         ...:             "A",
         ...:         ],
         ...:     ],
         ...:     columns=["Group", "Data"],
         ...: )
         ...: print(df.to_html(na_rep="NaN", float_format=my_formatter))
         ...:
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Group</th>
      <th>Data</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>A</td>
      <td>1.2225</td>
    </tr>
    <tr>
      <th>1</th>
      <td>A</td>
      <td>ted</td>
    </tr>
  </tbody>
</table>

@jreback jreback added this to the 1.2 milestone Oct 1, 2020
@@ -1533,7 +1533,14 @@ def format_values_with(float_format):
def _format_strings(self) -> List[str]:
# shortcut
if self.formatter is not None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually this entire section you added should be instead on L1440

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and you will want to do something like

mask = isna(values)	
values = np.array(values, dtype="object")
values[mask] = na_rep
imask = (~mask).ravel(
values.flat[imask] = np.array(	
                [formatter(val) for val in values.ravel()[imask]]
            )

you will want to factor that out into a function and use it above in 2 (or more places)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand, get_result_as_array never actually gets called here. Are you thinking the formatter should already be handling NaN?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no my point is to share code; you are doing virtually the same thing, just in another way

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok I see what you mean

@jreback jreback added IO LaTeX to_latex Output-Formatting __repr__ of pandas objects, to_string labels Oct 1, 2020
@jreback
Copy link
Contributor

jreback commented Oct 1, 2020

K Can add a release note once the doc is started for 1.1.4

generally we won't do older bug fixes on a point release; only regressions. the reason is sometimes bug fixes can introduce a further bug :-> want to minimize the risk.

@@ -3432,3 +3432,14 @@ def test_format_remove_leading_space_dataframe(input_array, expected):
# GH: 24980
df = pd.DataFrame(input_array).to_string(index=False)
assert df == expected


@pytest.mark.parametrize("na_rep, string", [("NaN", "nan"), ("Ted", "Ted")])
Copy link
Member

@simonjayhawkins simonjayhawkins Oct 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm not sure about using "nan" when na_rep is "NaN", although that is the string output of "{:.2f}".format(np.nan).

Maybe could use lib.no_default for na_rep, raise if float_format also specified and use "NaN" if float_format not specified.

maybe worth considering passing np.nan to func if passed to float_format, and if result is a string assume func handles missing values, something like

try:
   func_handles_na = isinstance(func(np.nan), str)
except Execption:
   func_handles_na = False

and then use na_rep if func_handles_na is False. but I'm not sure if we gain anything.

The main issue with the custom formatters (applies to formatters kwarg as well) is that we do not give complete control to the custom formatter. off the top of my head, strings maybe trimmed, spaces added, precision changed, truncation applied.

The other issue is that the EAs use the custom formatters. So this is not necessarily an easy issue to fix in isolation.

</tr>
</tbody>
</table>"""
assert result == expected
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for html tests, can use expected_html fixture. see test_to_html_justify for usage as template.

@simonjayhawkins
Copy link
Member

so returning to issue #13828, the fact that "{:.2f}".format(np.nan)" is "nan", I think the output is correct. and this is not a bug #36690 (comment)

but raising if na_rep is specified when float_format is specified maybe a better solution.

currently, in this PR

image

This is inconsistent.

@dsaxton
Copy link
Member Author

dsaxton commented Oct 2, 2020

but raising if na_rep is specified when float_format is specified maybe a better solution.

Raising when both are specified makes sense to me (or rather when float_format is specified and na_rep != "NaN", unless the default were also changed), since there is in fact no way to apply this consistently

@simonjayhawkins
Copy link
Member

despite my previous comments, I think we could only raise if not a mixed frame. otherwise may cause another regression.

I looked at sorting out some of the formatting in the past, and it's very easy to break things. (needs many more tests with lots of parameterisation to know for sure)

It is maybe that we do need to break things, and maybe the case in #36690 (comment) is one of those.

Copy link
Member

@simonjayhawkins simonjayhawkins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @dsaxton lgtm.

regarding the corner case where a user custom function handles missing values which is now broken, any idea on how we should communicate that to the users?

@dsaxton
Copy link
Member Author

dsaxton commented Oct 14, 2020

Thanks @dsaxton lgtm.

regarding the corner case where a user custom function handles missing values which is now broken, any idea on how we should communicate that to the users?

Maybe warrants a note in user_guide/io.rst?

@simonjayhawkins
Copy link
Member

Maybe warrants a note in user_guide/io.rst?

how about just a versionchanged tag for float_format with a one-liner along the lines "If a function is passed, only non-missing values are passed to the function and na_rep is used for missing values"

@dsaxton
Copy link
Member Author

dsaxton commented Oct 18, 2020

how about just a versionchanged tag for float_format with a one-liner along the lines "If a function is passed, only non-missing values are passed to the function and na_rep is used for missing values"

Added a note and versionchanged to the common docstring

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor comment, ping on greenish

@@ -1444,8 +1449,19 @@ def get_result_as_array(self) -> np.ndarray:
Returns the float values converted into strings using
the parameters given at initialisation, as a numpy array
"""

def format_with_na_rep(values, formatter, na_rep):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you type

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some types

def format_with_na_rep(values, formatter, na_rep):
mask = isna(values)
formatted = np.array(
[
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you use this function more generally? (e.g. maybe define it in the module); can be a followup as well.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems similar patterns occur elsewhere, although most are at the scalar level

@dsaxton
Copy link
Member Author

dsaxton commented Oct 23, 2020

@jreback I think this is good to go, CI failure is unrelated

@jreback jreback merged commit d240cb8 into pandas-dev:master Oct 24, 2020
@jreback
Copy link
Contributor

jreback commented Oct 24, 2020

thanks @dsaxton

@dsaxton dsaxton deleted the na_rep-with-float_format branch October 24, 2020 02:51
JulianWgs pushed a commit to JulianWgs/pandas that referenced this pull request Oct 26, 2020
* BUG: Don't ignore na_rep in DataFrame.to_html

* Back to list

* Move test and cover

* Test for to_latex

* More tests

* Maybe

* Note

* Refactor

* Move note

* Nothing

* Fixup

* Remove

* Doc

* Type
kesmit13 pushed a commit to kesmit13/pandas that referenced this pull request Nov 2, 2020
* BUG: Don't ignore na_rep in DataFrame.to_html

* Back to list

* Move test and cover

* Test for to_latex

* More tests

* Maybe

* Note

* Refactor

* Move note

* Nothing

* Fixup

* Remove

* Doc

* Type
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug IO HTML read_html, to_html, Styler.apply, Styler.applymap IO LaTeX to_latex Output-Formatting __repr__ of pandas objects, to_string
Projects
None yet
Development

Successfully merging this pull request may close these issues.

to_html ignores na_rep when float_format set DataFrame.to_latex() should honor na_rep after formatter.
3 participants