Skip to content

REF: remove ExtensionArrayFormatter #26833

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jun 13, 2019

Conversation

simonjayhawkins
Copy link
Member

@simonjayhawkins simonjayhawkins added Output-Formatting __repr__ of pandas objects, to_string Refactor Internal refactoring of code labels Jun 13, 2019
@codecov
Copy link

codecov bot commented Jun 13, 2019

Codecov Report

Merging #26833 into master will increase coverage by 1.39%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #26833      +/-   ##
==========================================
+ Coverage   90.45%   91.85%   +1.39%     
==========================================
  Files         179      179              
  Lines       50706    50700       -6     
==========================================
+ Hits        45866    46570     +704     
+ Misses       4840     4130     -710
Flag Coverage Δ
#multiple 90.44% <100%> (-0.01%) ⬇️
#single 41.12% <20%> (?)
Impacted Files Coverage Δ
pandas/io/formats/format.py 97.89% <100%> (-0.02%) ⬇️
pandas/core/indexes/interval.py 96.11% <100%> (ø) ⬆️
pandas/core/arrays/base.py 98.87% <0%> (-0.57%) ⬇️
pandas/core/arrays/period.py 98.3% <0%> (-0.25%) ⬇️
pandas/core/arrays/sparse.py 93.59% <0%> (-0.12%) ⬇️
pandas/core/arrays/categorical.py 95.8% <0%> (ø) ⬆️
pandas/util/testing.py 90.94% <0%> (+0.1%) ⬆️
pandas/core/indexes/datetimes.py 96.37% <0%> (+0.16%) ⬆️
pandas/io/formats/printing.py 86.09% <0%> (+1.6%) ⬆️
pandas/io/clipboard/clipboards.py 34.78% <0%> (+2.89%) ⬆️
... and 4 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e8132a2...a777fa4. Read the comment docs.

1 similar comment
@codecov
Copy link

codecov bot commented Jun 13, 2019

Codecov Report

Merging #26833 into master will increase coverage by 1.39%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #26833      +/-   ##
==========================================
+ Coverage   90.45%   91.85%   +1.39%     
==========================================
  Files         179      179              
  Lines       50706    50700       -6     
==========================================
+ Hits        45866    46570     +704     
+ Misses       4840     4130     -710
Flag Coverage Δ
#multiple 90.44% <100%> (-0.01%) ⬇️
#single 41.12% <20%> (?)
Impacted Files Coverage Δ
pandas/io/formats/format.py 97.89% <100%> (-0.02%) ⬇️
pandas/core/indexes/interval.py 96.11% <100%> (ø) ⬆️
pandas/core/arrays/base.py 98.87% <0%> (-0.57%) ⬇️
pandas/core/arrays/period.py 98.3% <0%> (-0.25%) ⬇️
pandas/core/arrays/sparse.py 93.59% <0%> (-0.12%) ⬇️
pandas/core/arrays/categorical.py 95.8% <0%> (ø) ⬆️
pandas/util/testing.py 90.94% <0%> (+0.1%) ⬆️
pandas/core/indexes/datetimes.py 96.37% <0%> (+0.16%) ⬆️
pandas/io/formats/printing.py 86.09% <0%> (+1.6%) ⬆️
pandas/io/clipboard/clipboards.py 34.78% <0%> (+2.89%) ⬆️
... and 4 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e8132a2...a777fa4. Read the comment docs.

if isinstance(values, (ABCIndexClass, ABCSeries)):
values = values._values

if is_categorical_dtype(values.dtype):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why would we not have both of these in the below if/elif clause?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we're deciding which values to use here in the same way as ExtensionArrayFormatter did.

the following if/else clause is selecting the Formatter class to use based on those values.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like the idea is to extract the datetime64[ns] from the Categorical, and then reuse the Datetime64Formatter by going into the if / elif below.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes i get the idea, trying to see if the logic can somehow be simpler

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could potentially define Categorical._formatter, which would provide the appropriate scalar formatter based on it's .dtype? Not sure if that'll work or not.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm we actually define this

    def _formatter(self, boxed=False):
        # Defer to CategoricalFormatter's formatter.
        return None

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we are not really using this attribute fully?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

format_array is used by to_string, to_html, _repr_html, to_latex and for the repr of many objects.

it should in theory be simpler and more generic.

it may be beneficial to move some logic out into the objects themselves so that format_array can work with any extension array and not require this special casing. (i think that is outside of the scope of this PR)

This PR is intended to remove the call to format_array from within ExtensionArrayFormatter so that the formatter parameter of format_array can be used for custom formatters wihout defaults being applied.

ExtensionArrayFormatter was dispatching back to format_array to then dispatch to the appropriate (another) Formatter.

maybe we are not really using this attribute fully?

agreed. many just return None to defer to the Formatters.

@jreback jreback added this to the 0.25.0 milestone Jun 13, 2019
@jreback
Copy link
Contributor

jreback commented Jun 13, 2019

ok this is fine then. I agree, think need to refactor how format_array works, @simonjayhawkins maybe make an issue for it? to push more functionaily to the objects (and PandasArray) rather than encoding the conversions where they are.

@jreback jreback merged commit a00659a into pandas-dev:master Jun 13, 2019
@simonjayhawkins
Copy link
Member Author

and we need to get rid of that leading_space business too?

@jreback
Copy link
Contributor

jreback commented Jun 13, 2019

and we need to get rid of that leading_space business too?

yes I think that's right

if isinstance(values, (ABCIndexClass, ABCSeries)):
values = values._values

formatter = values._formatter(boxed=True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@simonjayhawkins Is this still used in another place? (not very familiar with the formatting code, but wondering where this is called to ensure the underlying can determine the formatting)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see it used elsewhere in this PR, or not anymore on master. It might be that we didn't have good tests to cover behaviour where the ExtensionArray deviated from the "normal" behaviour to catch this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be that we didn't have good tests to cover behaviour where the ExtensionArray deviated from the "normal" behaviour to catch this.

i suspect that more tests will need to be added. particularly with cases where na_rep is passed to to_html etc or

we do have a specific issue here concerning EAs #25099

categorical, sparse, period etc return None (to defer) or just str (with boxed=True), so the _formatter was not adding anything.

for datetimelike we have

    def _formatter(self, boxed=False):
        # TODO: Remove Datetime & DatetimeTZ formatters.
        return "'{}'".format

so will be removed.

for integer array, we have

    def _formatter(self, boxed=False):
        def fmt(x):
            if isna(x):
                return 'NaN'
            return str(x)
        return fmt

so this is now not being called following the removal of the pre-formatting step. but this likely contributes to the na_rep issues.

IMO the _formatter methods should be used from within the Formatter classes, not passed to them. and cannot assign 'NaN' explcitily.

this should be part a subsequent refactor, see #26833 (comment) and #26837

i should have a follow-on ready shortly continuing the format_array cleanup. (maybe not till next week due to PyLondinium)

@jorisvandenbossche
Copy link
Member

i should have a follow-on ready shortly continuing the format_array cleanup. (maybe not till next week due to PyLondinium)

I would then propose to revert this PR, and do a proper refactor all at once using this _formatter method in the Formatter class as you mention.
We might also need to think about if we need to add additonal arguments to _formatter to enable some of the options (but didn't look enough in detail to the formatting code to assess this)

The _formatter is essential to let EA authors override the formatting (and eg now master will break what I am working on in GeoPandas). It's a bit unfortunate we didn't have proper tests for this.

@simonjayhawkins
Copy link
Member Author

The _formatter is essential to let EA authors override the formatting (and eg now master will break what I am working on in GeoPandas). It's a bit unfortunate we didn't have proper tests for this.

EA authors would therefore also need to ensure that additional formatting options can be honoured, eg max_colwidth, float_format, na_rep etc. Should this be their responsibility?

@jorisvandenbossche
Copy link
Member

Well, that's indeed something to be discussed how to do that (that's what I meant above with looking into adding additional arguments to _formatter).

We probably should have done that thought exercise when introducing the _formatter method, but that's too late now and the _formatter is in the EA interface, so we should honor it (at least for external EAs).

@simonjayhawkins
Copy link
Member Author

a formatters argument is in some of the to_* methods api, and we need to honour that too.

i'm OK if you want to revert for now though.

jorisvandenbossche added a commit to jorisvandenbossche/pandas that referenced this pull request Jun 14, 2019
@jorisvandenbossche
Copy link
Member

I added a revert in the PR where I am adding a test: #26845

@jorisvandenbossche
Copy link
Member

Is there already an issue to have the general discussion of how to deal with formatting options and ExtensionArrays ?

@simonjayhawkins
Copy link
Member Author

I added a revert in the PR where I am adding a test: #26845

no problem. i'll approve on green.

@simonjayhawkins
Copy link
Member Author

Is there already an issue to have the general discussion of how to deal with formatting options and ExtensionArrays ?

i'll open a master tracker where we can add things like the Int64 na_rep issue.

jreback pushed a commit that referenced this pull request Jun 14, 2019
…yFormatter removal (#26845)

* TST: test custom _formatter for ExtensionArray

* Revert "REF: remove ExtensionArrayFormatter (#26833)"

This reverts commit a00659a.
@ghost
Copy link

ghost commented Jun 17, 2019

i'll open a master tracker where we can add things like the Int64 na_rep issue.

@simonjayhawkins where is the discussion issue? I ran into another issues with formatters I'd like to discuss before moving forward with this there's a suggestion/feature request I'd like to put up.

@simonjayhawkins
Copy link
Member Author

i'll open a master tracker where we can add things like the Int64 na_rep issue.

@simonjayhawkins where is the discussion issue? I ran into another issues with formatters I'd like to discuss before moving forward with this there's a suggestion/feature request I'd like to put up.

I've not opened a separate issue yet for three reasons

feel free to open a new issue.

@jorisvandenbossche
Copy link
Member

awaiting the outcome of #26817

How is the formatting discussion related to that PR?

@jorisvandenbossche
Copy link
Member

We can maybe use #26837 as the general issue to discuss this?

@simonjayhawkins
Copy link
Member Author

simonjayhawkins commented Jun 17, 2019

awaiting the outcome of #26817

How is the formatting discussion related to that PR?

IIUC correctly, before EAs, format_array formatted an numpy array. and GenericArrayFormatter expects an numpy array, used mainly for numpy arrays with dtype=object

but we now have a ExtensionArrayFormatter that includes logic to convert to a numpy array

        values = self.values
        if isinstance(values, (ABCIndexClass, ABCSeries)):
            values = values._values

        formatter = values._formatter(boxed=True)

        if is_categorical_dtype(values.dtype):
            # Categorical is special for now, so that we can preserve tzinfo
            array = values.get_values()
        else:
            array = np.asarray(values)

        fmt_values = format_array(array,

whereas before, _formatting_values was used and conversion to a numpy array looked like

    def _format_col(self, i):
        frame = self.tr_frame
        formatter = self._get_formatter(i)
        values_to_format = frame.iloc[:, i]._formatting_values()
        return format_array(values_to_format, formatter,
                            float_format=self.float_format, na_rep=self.na_rep,
                            space=self.col_space, decimal=self.decimal)

even though _formatting_values is depecated:

    def _formatting_values(self) -> np.ndarray:
        # At the moment, this has to be an array since we use result.dtype
        """
        An array of values to be printed in, e.g. the Series repr

        .. deprecated:: 0.24.0

           Use :meth:`ExtensionArray._formatter` instead.
        """

it is still used. but does not necessarily return an numpy array

>>> import pandas as pd
>>> df = pd.DataFrame({"c": [2,3,float('nan')]})
>>> df = df.astype("Int64")
>>> df.c._formatting_values()
<IntegerArray>
[2, 3, NaN]
Length: 3, dtype: Int64

so when creating a repr_html of a DataFrame which uses _format_col, we pass a <IntegerArray> to format_array, which then uses ExtensionArrayFormatter to get the values as an numpy array and then call
format_array again which then formats the numpy array using GenericArrayFormatter.

It seems a bit convoluted.

i'm not sure why a numpy array is not returned from _formatting_values.

I think the issues are related since the discussions revolve around numpy arrays from extension arrays and formatting probably works best if the formatting is done on a numpy array with am object dtype. performance is probably not an issue for formatting.

@jorisvandenbossche
Copy link
Member

@simonjayhawkins That's exactly the discussion we should have. But not in a merged PR :-)
Can you just put that comment in a new issue? (or in #26837, same for me)

@jorisvandenbossche
Copy link
Member

@pilkibun can you move your comment to #26837 as well?

@ghost
Copy link

ghost commented Jun 19, 2019

@pilkibun can you move your comment to #26837 as well?

#26837 (comment)

@simonjayhawkins simonjayhawkins deleted the ExtensionArrayFormatter branch June 21, 2019 09:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Output-Formatting __repr__ of pandas objects, to_string Refactor Internal refactoring of code
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants