-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: Don't ignore na_rep in DataFrame.to_html #36690
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Off the top of my head I recall a discussion where this may not be a bug. There may be a duplicate issue. I've not checked the code. but if a user supplies a formatter, they are responsible for the na values too. Is that what is happening here? |
There was an old PR similar to this that was almost merged but ended up being closed as stale, so I assume this is considered a real bug. I could see how it might also be seen as ambiguous since NaN is technically a float, so that the user is asking for two different formats for these values (but in that case maybe the docs should explicitly clarify what happens in this situation). |
does this also close #9046? and presumably to_string does the same? |
I've not yet found the issue but from
so the changes here break existing code (this is a long standing behaviour even if considered a bug)
|
Yes, tests added for those cases |
Ah I was forgetting "NaN" is actually the default and not None. Made a change which fixes this but leaves a different "bug" (albiet a more obscure one, it only ignores the specific string "NaN"; I don't think it's possible to respect both arguments in absolutely all cases): [ins] In [1]: import numpy as np
...: import pandas as pd
...:
...: def my_formatter(x):
...: if np.isnan(x):
...: return "ted"
...: else:
...: return str(x)
...:
...:
...: df = pd.DataFrame(
...: [
...: ["A", 1.2225],
...: [
...: "A",
...: ],
...: ],
...: columns=["Group", "Data"],
...: )
...: print(df.to_html(na_rep="NaN", float_format=my_formatter))
...:
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Group</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>A</td>
<td>1.2225</td>
</tr>
<tr>
<th>1</th>
<td>A</td>
<td>ted</td>
</tr>
</tbody>
</table> |
pandas/io/formats/format.py
Outdated
@@ -1533,7 +1533,14 @@ def format_values_with(float_format): | |||
def _format_strings(self) -> List[str]: | |||
# shortcut | |||
if self.formatter is not None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually this entire section you added should be instead on L1440
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and you will want to do something like
mask = isna(values)
values = np.array(values, dtype="object")
values[mask] = na_rep
imask = (~mask).ravel(
values.flat[imask] = np.array(
[formatter(val) for val in values.ravel()[imask]]
)
you will want to factor that out into a function and use it above in 2 (or more places)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I understand, get_result_as_array
never actually gets called here. Are you thinking the formatter should already be handling NaN?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no my point is to share code; you are doing virtually the same thing, just in another way
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok I see what you mean
generally we won't do older bug fixes on a point release; only regressions. the reason is sometimes bug fixes can introduce a further bug :-> want to minimize the risk. |
@@ -3432,3 +3432,14 @@ def test_format_remove_leading_space_dataframe(input_array, expected): | |||
# GH: 24980 | |||
df = pd.DataFrame(input_array).to_string(index=False) | |||
assert df == expected | |||
|
|||
|
|||
@pytest.mark.parametrize("na_rep, string", [("NaN", "nan"), ("Ted", "Ted")]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'm not sure about using "nan" when na_rep
is "NaN", although that is the string output of "{:.2f}".format(np.nan)
.
Maybe could use lib.no_default
for na_rep
, raise if float_format
also specified and use "NaN" if float_format
not specified.
maybe worth considering passing np.nan
to func
if passed to float_format
, and if result is a string assume func
handles missing values, something like
try:
func_handles_na = isinstance(func(np.nan), str)
except Execption:
func_handles_na = False
and then use na_rep
if func_handles_na
is False
. but I'm not sure if we gain anything.
The main issue with the custom formatters (applies to formatters
kwarg as well) is that we do not give complete control to the custom formatter. off the top of my head, strings maybe trimmed, spaces added, precision changed, truncation applied.
The other issue is that the EAs use the custom formatters. So this is not necessarily an easy issue to fix in isolation.
</tr> | ||
</tbody> | ||
</table>""" | ||
assert result == expected |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for html tests, can use expected_html fixture. see test_to_html_justify for usage as template.
so returning to issue #13828, the fact that "{:.2f}".format(np.nan)" is "nan", I think the output is correct. and this is not a bug #36690 (comment) but raising if currently, in this PR This is inconsistent. |
Raising when both are specified makes sense to me (or rather when float_format is specified and na_rep != "NaN", unless the default were also changed), since there is in fact no way to apply this consistently |
despite my previous comments, I think we could only raise if not a mixed frame. otherwise may cause another regression. I looked at sorting out some of the formatting in the past, and it's very easy to break things. (needs many more tests with lots of parameterisation to know for sure) It is maybe that we do need to break things, and maybe the case in #36690 (comment) is one of those. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @dsaxton lgtm.
regarding the corner case where a user custom function handles missing values which is now broken, any idea on how we should communicate that to the users?
Maybe warrants a note in user_guide/io.rst? |
how about just a versionchanged tag for float_format with a one-liner along the lines "If a function is passed, only non-missing values are passed to the function and na_rep is used for missing values" |
Added a note and versionchanged to the common docstring |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor comment, ping on greenish
pandas/io/formats/format.py
Outdated
@@ -1444,8 +1449,19 @@ def get_result_as_array(self) -> np.ndarray: | |||
Returns the float values converted into strings using | |||
the parameters given at initialisation, as a numpy array | |||
""" | |||
|
|||
def format_with_na_rep(values, formatter, na_rep): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you type
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added some types
def format_with_na_rep(values, formatter, na_rep): | ||
mask = isna(values) | ||
formatted = np.array( | ||
[ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you use this function more generally? (e.g. maybe define it in the module); can be a followup as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems similar patterns occur elsewhere, although most are at the scalar level
@jreback I think this is good to go, CI failure is unrelated |
thanks @dsaxton |
* BUG: Don't ignore na_rep in DataFrame.to_html * Back to list * Move test and cover * Test for to_latex * More tests * Maybe * Note * Refactor * Move note * Nothing * Fixup * Remove * Doc * Type
* BUG: Don't ignore na_rep in DataFrame.to_html * Back to list * Move test and cover * Test for to_latex * More tests * Maybe * Note * Refactor * Move note * Nothing * Fixup * Remove * Doc * Type
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff