REF: remove ExtensionArrayFormatter #26833
Codecov Report
@@ Coverage Diff @@
## master #26833 +/- ##
==========================================
+ Coverage 90.45% 91.85% +1.39%
==========================================
Files 179 179
Lines 50706 50700 -6
==========================================
+ Hits 45866 46570 +704
+ Misses 4840 4130 -710
Continue to review full report at Codecov.
|
if isinstance(values, (ABCIndexClass, ABCSeries)):
    values = values._values

if is_categorical_dtype(values.dtype):
why would we not have both of these in the below if/elif clause?
we're deciding which values to use here in the same way as ExtensionArrayFormatter did.
the following if/else clause is selecting the Formatter class to use based on those values.
Seems like the idea is to extract the datetime64[ns] from the Categorical, and then reuse the Datetime64Formatter by going into the if / elif below.
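To make the two steps concrete (first choose the values, then choose the formatter), here is a minimal toy sketch. It is not the pandas code: `ToyCategorical`, `format_datetimes`, `format_generic` and `toy_format_array` are hypothetical names invented for illustration, and plain Python lists stand in for arrays.

```python
from datetime import datetime


class ToyCategorical:
    """Hypothetical stand-in for a categorical: integer codes into categories."""

    def __init__(self, codes, categories):
        self.codes = list(codes)
        self.categories = list(categories)

    def dense_values(self):
        # A code of -1 marks a missing value in this toy model.
        return [self.categories[c] if c >= 0 else None for c in self.codes]


def format_datetimes(values):
    # Toy datetime formatter: date part only, "NaT" for missing.
    return ["NaT" if v is None else v.strftime("%Y-%m-%d") for v in values]


def format_generic(values):
    # Toy fallback formatter: str() for values, "NaN" for missing.
    return ["NaN" if v is None else str(v) for v in values]


def toy_format_array(values):
    # Step 1: decide which values to format (unwrap the categorical).
    if isinstance(values, ToyCategorical):
        values = values.dense_values()
    # Step 2: select the formatter based on the unwrapped values.
    present = [v for v in values if v is not None]
    if present and all(isinstance(v, datetime) for v in present):
        return format_datetimes(values)
    return format_generic(values)


cat = ToyCategorical([0, 1, -1, 0], [datetime(2019, 6, 13), datetime(2019, 6, 14)])
print(toy_format_array(cat))  # ['2019-06-13', '2019-06-14', 'NaT', '2019-06-13']
```

The point of the sketch is only that the unwrapping step and the formatter-selection step are separate decisions, which is why they sit in separate clauses.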
yes i get the idea, trying to see if the logic can somehow be simpler
We could potentially define Categorical._formatter, which would provide the appropriate scalar formatter based on its .dtype? Not sure if that'll work or not.
hmm we actually define this
def _formatter(self, boxed=False):
    # Defer to CategoricalFormatter's formatter.
    return None
maybe we are not really using this attribute fully?
format_array is used by to_string, to_html, _repr_html, to_latex and for the repr of many objects.
it should in theory be simpler and more generic.
it may be beneficial to move some logic out into the objects themselves so that format_array can work with any extension array and not require this special casing. (i think that is outside of the scope of this PR)
This PR is intended to remove the call to format_array from within ExtensionArrayFormatter so that the formatter parameter of format_array can be used for custom formatters without defaults being applied.
ExtensionArrayFormatter was dispatching back to format_array to then dispatch to the appropriate (another) Formatter.
maybe we are not really using this attribute fully?
agreed. many just return None to defer to the Formatters.
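As a rough illustration of why a pre-formatting pass can get in the way of a custom formatter, here is a toy model. It does not reproduce the pandas internals: `default_formatter`, `single_pass`, `double_pass` and `custom` are hypothetical names, and the double pass is only an assumed simplification of "format with defaults, then dispatch again".

```python
def default_formatter(x):
    # Toy default: missing values become the literal string "NaN".
    return "NaN" if x is None else str(x)


def single_pass(values, formatter=None):
    # A caller-supplied formatter sees the raw values; the default
    # is applied only when no formatter is given.
    fmt = formatter if formatter is not None else default_formatter
    return [fmt(v) for v in values]


def double_pass(values, formatter=None):
    # Assumed shape of the problem: defaults are applied first, so the
    # custom formatter only ever receives already-formatted strings.
    pre_formatted = [default_formatter(v) for v in values]
    fmt = formatter if formatter is not None else str
    return [fmt(v) for v in pre_formatted]


def custom(x):
    # Hypothetical user formatter that wants "-" for missing values.
    return "-" if x is None else str(x)


print(single_pass([1, None], custom))  # ['1', '-']
print(double_pass([1, None], custom))  # ['1', 'NaN']
```

In the double-pass version the custom formatter never sees `None`, so its missing-value handling is silently overridden by the default.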
ok this is fine then. I agree, think need to refactor how |
and we need to get rid of that leading_space business too? |
yes I think that's right |
if isinstance(values, (ABCIndexClass, ABCSeries)):
    values = values._values

formatter = values._formatter(boxed=True)
@simonjayhawkins Is this still used in another place? (not very familiar with the formatting code, but wondering where this is called to ensure the underlying can determine the formatting)
I don't see it used elsewhere in this PR, or not anymore on master. It might be that we didn't have good tests to cover behaviour where the ExtensionArray deviated from the "normal" behaviour to catch this.
It might be that we didn't have good tests to cover behaviour where the ExtensionArray deviated from the "normal" behaviour to catch this.
i suspect that more tests will need to be added. particularly with cases where na_rep is passed to to_html etc or
we do have a specific issue here concerning EAs #25099
categorical, sparse, period etc return None (to defer) or just str (with boxed=True), so the _formatter was not adding anything.
for datetimelike we have
def _formatter(self, boxed=False):
    # TODO: Remove Datetime & DatetimeTZ formatters.
    return "'{}'".format
so will be removed.
for integer array, we have
def _formatter(self, boxed=False):
    def fmt(x):
        if isna(x):
            return 'NaN'
        return str(x)
    return fmt
so this is now not being called following the removal of the pre-formatting step. but this likely contributes to the na_rep issues.
IMO the _formatter methods should be used from within the Formatter classes, not passed to them, and cannot assign 'NaN' explicitly.
this should be part of a subsequent refactor, see #26833 (comment) and #26837
i should have a follow-on ready shortly continuing the format_array cleanup. (maybe not till next week due to PyLondinium)
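For the na_rep point above, a minimal sketch of one possible direction: the scalar formatter takes the missing-value representation as a parameter instead of hard-coding 'NaN'. `make_int_formatter` is a hypothetical helper, not an existing pandas function, and `None` / float NaN stand in for the array's missing-value marker.

```python
import math


def make_int_formatter(na_rep="NaN"):
    """Build a scalar formatter that honours the caller's na_rep."""

    def fmt(x):
        # Treat None and float NaN as missing in this toy model.
        if x is None or (isinstance(x, float) and math.isnan(x)):
            return na_rep
        return str(x)

    return fmt


fmt = make_int_formatter(na_rep="-")
print([fmt(v) for v in [1, None, 3]])  # ['1', '-', '3']
```

Whether such an option should be threaded through _formatter itself or handled inside the Formatter classes is exactly the open design question in this thread; the sketch only shows that the representation need not be fixed at the scalar level.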
I would then propose to revert this PR, and do a proper refactor all at once using this The |
EA authors would therefore also need to ensure that additional formatting options can be honoured, eg max_colwidth, float_format, na_rep etc. Should this be their responsibility? |
Well, that's indeed something to be discussed how to do that (that's what I meant above with looking into adding additional arguments to We probably should have done that thought exercise when introducing the |
a i'm OK if you want to revert for now though. |
This reverts commit a00659a.
I added a revert in the PR where I am adding a test: #26845 |
Is there already an issue to have the general discussion of how to deal with formatting options and ExtensionArrays ? |
no problem. i'll approve on green. |
i'll open a master tracker where we can add things like the Int64 na_rep issue. |
@simonjayhawkins where is the discussion issue? I ran into another issue with formatters that I'd like to discuss before moving forward with this; there's a suggestion/feature request I'd like to put up. |
I've not opened a separate issue yet for three reasons
feel free to open a new issue. |
How is the formatting discussion related to that PR? |
We can maybe use #26837 as the general issue to discuss this? |
IIUC, before EAs, but we now have an ExtensionArrayFormatter that includes logic to convert to a numpy array
whereas before, _formatting_values was used and conversion to a numpy array looked like
even though _formatting_values is deprecated:
it is still used, but does not necessarily return a numpy array
so when creating a repr_html of a DataFrame which uses _format_col, we pass a It seems a bit convoluted. i'm not sure why a numpy array is not returned from _formatting_values. I think the issues are related, since the discussions revolve around numpy arrays from extension arrays, and formatting probably works best if it is done on a numpy array with an object dtype. performance is probably not an issue for formatting. |
@simonjayhawkins That's exactly the discussion we should have. But not in a merged PR :-) |
|
git diff upstream/master -u -- "*.py" | flake8 --diff