Skip to content

BUG: float-like string, trailing 0 truncation #38759

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Dec 29, 2020
Merged
Show file tree
Hide file tree
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions doc/source/whatsnew/v1.2.1.rst
Original file line number Diff line number Diff line change
Expand Up @@ -15,6 +15,7 @@ including other versions of pandas.
Fixed regressions
~~~~~~~~~~~~~~~~~
- The deprecated attributes ``_AXIS_NAMES`` and ``_AXIS_NUMBERS`` of :class:`DataFrame` and :class:`Series` will no longer show up in ``dir`` or ``inspect.getmembers`` calls (:issue:`38740`)
- Bug in float-like strings having trailing 0's truncated (:issue:`38708`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add in "repr of" and "after the decimal" here for clarity that it is just a bug in regards to presentation (which is still important!).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these are not floats, rather strings in an object dtype

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

-

.. ---------------------------------------------------------------------------
Expand Down
31 changes: 21 additions & 10 deletions pandas/io/formats/format.py
Original file line number Diff line number Diff line change
Expand Up @@ -1310,7 +1310,9 @@ def _format(x):
tpl = " {v}"
fmt_values.append(tpl.format(v=_format(v)))

fmt_values = _trim_zeros_float(str_floats=fmt_values, decimal=".")
fmt_values = _trim_zeros_float(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is a larger issue to address first. these are not floats at all. these are object dtypes that are reprering. why are they hitting this path is the issue. we likley do not want to touch the formatters.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there are likley getting inferred as floats and that is why the formatter is picked. this is incorrect.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the GenericArrayFormatter - seems reasonable for object type data. I think hitting _trim_zeros_float in GenericArrayFormatter makes sense because of a case like

s = pd.Series([1.20, "1.00"])
repr(s)

which should hit _trim_zeros_float so that 1.20 is truncated to 1.2.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, sure but a string should not be truncated. that's where the issue is.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i see the issue here.

instead of _trim_zeros_float being a function that accepts a ndarray/list of strings, just have it process a single value, then call it when needed in the loop on L1299.

this makes the code much simpler. i am not entirely sure why it was done this way.

Copy link
Member Author

@mzeitlin11 mzeitlin11 Dec 28, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the reason why it was done that way is that floats are formatted with fixed-width, e.g.

s = pd.Series([1.230, 1.20000])
repr(s)

gives

0    1.23
1    1.20
dtype: float64

_trim_zeros_float was designed for usage in FloatArrayFormatter (so it needs the whole list to keep fixed-width, since it only truncates if all values can be truncated). The usage of this function in GenericArrayFormatter is what caused the regression. I think then it makes sense to keep _trim_zeros_float as is for that original purpose, and GenericArrayFormatter can use a much simpler solution for truncation since it does not have the same fixed width requirements.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok that sounds fine, tests are pretty sensitive here so you can refactor.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the help here, this is much cleaner!

str_floats=fmt_values, decimal=".", is_float_type=is_float_type
)

return fmt_values

Expand Down Expand Up @@ -1828,35 +1830,44 @@ def _trim_zeros_complex(str_complexes: np.ndarray, decimal: str = ".") -> List[s


def _trim_zeros_float(
str_floats: Union[np.ndarray, List[str]], decimal: str = "."
str_floats: Union[np.ndarray, List[str]],
decimal: str = ".",
is_float_type: Optional[np.ndarray] = None,
) -> List[str]:
"""
Trims zeros, leaving just one before the decimal points if need be.
"""
trimmed = str_floats
number_regex = re.compile(fr"^\s*[\+-]?[0-9]+\{decimal}[0-9]*$")
float_locs = (
is_float_type
if is_float_type is not None
else np.ones(len(trimmed), dtype=bool)
)

def is_number_with_decimal(x):
return re.match(number_regex, x) is not None
def is_number_with_decimal(x, i):
return re.match(number_regex, x) is not None and float_locs[i]

def should_trim(values: Union[np.ndarray, List[str]]) -> bool:
"""
Determine if an array of strings should be trimmed.

Returns True if all numbers containing decimals (defined by the
above regular expression) within the array end in a zero, otherwise
returns False.
above regular expression and the specification of float locations
`float_locs`) within the array end in a zero, otherwise returns False.
"""
numbers = [x for x in values if is_number_with_decimal(x)]
numbers = [x for i, x in enumerate(values) if is_number_with_decimal(x, i)]
return len(numbers) > 0 and all(x.endswith("0") for x in numbers)

while should_trim(trimmed):
trimmed = [x[:-1] if is_number_with_decimal(x) else x for x in trimmed]
trimmed = [
x[:-1] if is_number_with_decimal(x, i) else x for i, x in enumerate(trimmed)
]

# leave one 0 after the decimal points if need be.
result = [
x + "0" if is_number_with_decimal(x) and x.endswith(decimal) else x
for x in trimmed
x + "0" if is_number_with_decimal(x, i) and x.endswith(decimal) else x
for i, x in enumerate(trimmed)
]
return result

Expand Down
12 changes: 12 additions & 0 deletions pandas/tests/series/test_repr.py
Original file line number Diff line number Diff line change
Expand Up @@ -237,6 +237,18 @@ def test_series_repr_nat(self):
)
assert result == expected

def test_series_repr_float_like_object_no_truncate(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you parametrize/combine these two tests using the parameters data and expected, and add in cases where np.nan/None values are present, including data = [np.nan] and data = [None].

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

series = Series(["3.50"])
result = repr(series)
expected = "0 3.50\ndtype: object"
assert result == expected

def test_mixed_series_repr_float_like_object_no_truncate(self):
series = Series([1.20, "1.00"])
result = repr(series)
expected = "0 1.2\n1 1.00\ndtype: object"
assert result == expected


class TestCategoricalRepr:
def test_categorical_repr_unicode(self):
Expand Down