-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: repr of Categorical does not distinguish int and str. #34222
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
pandas/io/formats/format.py
Outdated
@@ -1237,6 +1237,8 @@ def _format(x): | |||
fmt_values.append(f" {_format(v)}") | |||
elif is_float_type[i]: | |||
fmt_values.append(float_format(v)) | |||
elif isinstance(v, str): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
GenericArrayFormatter has a quoting
parameter. format_array does not have this parameter to be able to pass it on.
_repr_categories
in pandas\core\arrays\categorical.py
uses format_array
.
is it feasible to add quoting
parameter to format_array instead of changing this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@simonjayhawkins thanks for the suggestion!
So, in _repr_categories
from pandas\core\arrays\categorical.py
, format_array
is called. This could be changed to, say,
category_strs = fmt.format_array(self.categories, None, quoting=csv.QUOTE_NONNUMERIC)
and then, inside format_array
, pass quoting
on to fmt_klass
, so that (here) GenericArrayFormatter
would be initialised with it.
It's unclear to me me what GenericArrayFormatter
currently does with quoting
, I don't see it being used. Should there be no need to change anything here in _format_strings
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's unclear to me me what
GenericArrayFormatter
currently does withquoting
, I don't see it being used. Should there be no need to change anything here in_format_strings
?
that does seem strange, I must have assumed wrongly where the quoting parameter is used. will look again, but your suggestions sgtm.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
instead of this you need to adjust _format, L1221; you maybe able to simply adjust the args to pprint_thing which can quote as needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, thanks for pointing me towards that, that was useful!
So, pprint_thing
takes a quote_string
argument, which is boolean, while quoting
is Optional[int]
, so I've tried doing this by setting
quote_strings = self.quoting is not None and self.quoting != QUOTE_NONE
and then passing quote_strings
to pprint_thing
Does this have impact on the Series repr, or is only for the Categorical (the array) repr? |
Should only be for the categorical repr, will update accordingly |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @MarcoGorelli generally lgtm. some doctest failures.
doc/source/whatsnew/v1.1.0.rst
Outdated
@@ -997,6 +997,7 @@ I/O | |||
- Bug in :meth:`~SQLDatabase.execute` was raising a ``ProgrammingError`` for some DB-API drivers when the SQL statement contained the `%` character and no parameters were present (:issue:`34211`) | |||
- Bug in :meth:`~pandas.io.stata.StataReader` which resulted in categorical variables with difference dtypes when reading data using an iterator. (:issue:`31544`) | |||
- :meth:`HDFStore.keys` has now an optional `include` parameter that allows the retrieval of all native HDF5 table names (:issue:`29916`) | |||
- Repr of :class:`Categorical` was not distinguishing between int and str (:issue:`33676`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The is probably more appropriate for the Categorical section to be consistent with the similar change for Sparse, see #34352. (NOTE: 1.1 does not yet have a Categorical section)
Also, I think safe to say this was a bug since the repr of the categories in the dtype repr included the quotes #34352 (comment)
lambda x: pprint_thing( | ||
x, escape_chars=("\t", "\r", "\n"), quote_strings=quote_strings | ||
) | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as the ternary expression is getting more complex, an if else may now be more readable. maybe use partial instead of lambda and maybe move quote_strings assignment inside the relevant if else block.
pandas/core/arrays/categorical.py
Outdated
@@ -1921,7 +1928,7 @@ def _get_repr(self, length=True, na_rep="NaN", footer=True) -> str: | |||
from pandas.io.formats import format as fmt | |||
|
|||
formatter = fmt.CategoricalFormatter( | |||
self, length=length, na_rep=na_rep, footer=footer | |||
self, length=length, na_rep=na_rep, footer=footer, quoting=QUOTE_NONNUMERIC |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe don't need to pass this through here, since CategoricalFormatter should always be QUOTE_NONNUMERIC
pandas/io/formats/format.py
Outdated
@@ -170,12 +171,14 @@ def __init__( | |||
length: bool = True, | |||
na_rep: str = "NaN", | |||
footer: bool = True, | |||
quoting: Optional[int] = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe redundant see above comment.
pandas/io/formats/format.py
Outdated
): | ||
self.categorical = categorical | ||
self.buf = buf if buf is not None else StringIO("") | ||
self.na_rep = na_rep | ||
self.length = length | ||
self.footer = footer | ||
self.quoting = quoting |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same
@@ -200,6 +203,7 @@ def _get_formatted_values(self) -> List[str]: | |||
None, | |||
float_format=None, | |||
na_rep=self.na_rep, | |||
quoting=self.quoting, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
quoting always QUOTE_NONNUMERIC
pandas/core/arrays/categorical.py
Outdated
category_strs = fmt.format_array(self.categories, None) | ||
category_strs = fmt.format_array( | ||
self.categories, None, quoting=QUOTE_NONNUMERIC | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe partial here to avoid duplication
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @MarcoGorelli lgtm ex doctest failures
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @MarcoGorelli lgtm. @jreback
@@ -1872,13 +1874,16 @@ def _repr_categories(self): | |||
) | |||
from pandas.io.formats import format as fmt | |||
|
|||
format_array = partial( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since you changed this in pandas/io/format.py is it also necessary here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it's necessary so that GenericArrayFormatter
is initialised with it
lgtm ex a question. |
thanks @MarcoGorelli very nice |
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff