Skip to content

BUG: repr of Categorical does not distinguish int and str. #34222

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 17 commits into from
Jun 24, 2020

Conversation

MarcoGorelli
Copy link
Member

@MarcoGorelli MarcoGorelli commented May 17, 2020

@MarcoGorelli MarcoGorelli requested review from simonjayhawkins and removed request for simonjayhawkins May 17, 2020 10:43
@MarcoGorelli MarcoGorelli changed the title quote string elements BUG: repr of Categorical does not distinguish int and str. May 17, 2020
@MarcoGorelli MarcoGorelli marked this pull request as draft May 17, 2020 10:49
@@ -1237,6 +1237,8 @@ def _format(x):
fmt_values.append(f" {_format(v)}")
elif is_float_type[i]:
fmt_values.append(float_format(v))
elif isinstance(v, str):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GenericArrayFormatter has a quoting parameter. format_array does not have this parameter to be able to pass it on.

_repr_categories in pandas\core\arrays\categorical.py uses format_array.

is it feasible to add quoting parameter to format_array instead of changing this?

Copy link
Member Author

@MarcoGorelli MarcoGorelli May 21, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@simonjayhawkins thanks for the suggestion!

So, in _repr_categories from pandas\core\arrays\categorical.py, format_array is called. This could be changed to, say,

category_strs = fmt.format_array(self.categories, None, quoting=csv.QUOTE_NONNUMERIC)

and then, inside format_array, pass quoting on to fmt_klass, so that (here) GenericArrayFormatter would be initialised with it.

It's unclear to me me what GenericArrayFormatter currently does with quoting, I don't see it being used. Should there be no need to change anything here in _format_strings?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's unclear to me me what GenericArrayFormatter currently does with quoting, I don't see it being used. Should there be no need to change anything here in _format_strings?

that does seem strange, I must have assumed wrongly where the quoting parameter is used. will look again, but your suggestions sgtm.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of this you need to adjust _format, L1221; you maybe able to simply adjust the args to pprint_thing which can quote as needed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, thanks for pointing me towards that, that was useful!

So, pprint_thing takes a quote_string argument, which is boolean, while quoting is Optional[int], so I've tried doing this by setting

quote_strings = self.quoting is not None and self.quoting != QUOTE_NONE

and then passing quote_strings to pprint_thing

@jorisvandenbossche
Copy link
Member

Does this have impact on the Series repr, or is only for the Categorical (the array) repr?

@MarcoGorelli
Copy link
Member Author

Should only be for the categorical repr, will update accordingly

@jreback jreback added Categorical Categorical Data Type Output-Formatting __repr__ of pandas objects, to_string labels Jun 14, 2020
@MarcoGorelli MarcoGorelli marked this pull request as ready for review June 18, 2020 10:59
Copy link
Member

@simonjayhawkins simonjayhawkins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @MarcoGorelli generally lgtm. some doctest failures.

@@ -997,6 +997,7 @@ I/O
- Bug in :meth:`~SQLDatabase.execute` was raising a ``ProgrammingError`` for some DB-API drivers when the SQL statement contained the `%` character and no parameters were present (:issue:`34211`)
- Bug in :meth:`~pandas.io.stata.StataReader` which resulted in categorical variables with difference dtypes when reading data using an iterator. (:issue:`31544`)
- :meth:`HDFStore.keys` has now an optional `include` parameter that allows the retrieval of all native HDF5 table names (:issue:`29916`)
- Repr of :class:`Categorical` was not distinguishing between int and str (:issue:`33676`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The is probably more appropriate for the Categorical section to be consistent with the similar change for Sparse, see #34352. (NOTE: 1.1 does not yet have a Categorical section)

Also, I think safe to say this was a bug since the repr of the categories in the dtype repr included the quotes #34352 (comment)

lambda x: pprint_thing(
x, escape_chars=("\t", "\r", "\n"), quote_strings=quote_strings
)
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as the ternary expression is getting more complex, an if else may now be more readable. maybe use partial instead of lambda and maybe move quote_strings assignment inside the relevant if else block.

@@ -1921,7 +1928,7 @@ def _get_repr(self, length=True, na_rep="NaN", footer=True) -> str:
from pandas.io.formats import format as fmt

formatter = fmt.CategoricalFormatter(
self, length=length, na_rep=na_rep, footer=footer
self, length=length, na_rep=na_rep, footer=footer, quoting=QUOTE_NONNUMERIC
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe don't need to pass this through here, since CategoricalFormatter should always be QUOTE_NONNUMERIC

@@ -170,12 +171,14 @@ def __init__(
length: bool = True,
na_rep: str = "NaN",
footer: bool = True,
quoting: Optional[int] = None,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe redundant see above comment.

):
self.categorical = categorical
self.buf = buf if buf is not None else StringIO("")
self.na_rep = na_rep
self.length = length
self.footer = footer
self.quoting = quoting
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

@@ -200,6 +203,7 @@ def _get_formatted_values(self) -> List[str]:
None,
float_format=None,
na_rep=self.na_rep,
quoting=self.quoting,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

quoting always QUOTE_NONNUMERIC

category_strs = fmt.format_array(self.categories, None)
category_strs = fmt.format_array(
self.categories, None, quoting=QUOTE_NONNUMERIC
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe partial here to avoid duplication

@simonjayhawkins simonjayhawkins added this to the 1.1 milestone Jun 22, 2020
@MarcoGorelli MarcoGorelli marked this pull request as draft June 23, 2020 07:35
Copy link
Member

@simonjayhawkins simonjayhawkins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @MarcoGorelli lgtm ex doctest failures

@MarcoGorelli MarcoGorelli marked this pull request as ready for review June 24, 2020 09:15
Copy link
Member

@simonjayhawkins simonjayhawkins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @MarcoGorelli lgtm. @jreback

@@ -1872,13 +1874,16 @@ def _repr_categories(self):
)
from pandas.io.formats import format as fmt

format_array = partial(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since you changed this in pandas/io/format.py is it also necessary here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's necessary so that GenericArrayFormatter is initialised with it

@jreback
Copy link
Contributor

jreback commented Jun 24, 2020

lgtm ex a question.

@jreback jreback merged commit db48799 into pandas-dev:master Jun 24, 2020
@jreback
Copy link
Contributor

jreback commented Jun 24, 2020

thanks @MarcoGorelli very nice

@MarcoGorelli MarcoGorelli deleted the cat-rep branch June 25, 2020 07:03
fangchenli pushed a commit to fangchenli/pandas that referenced this pull request Jun 27, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Categorical Categorical Data Type Output-Formatting __repr__ of pandas objects, to_string
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: repr of Categorical does not distinguish int and str.
4 participants