Skip to content

REF: remove special-casing for internal EAs from format_array #26965

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

simonjayhawkins
Copy link
Member

cc @jorisvandenbossche

changes to tests for spacing before Datetime64TZ, which now have additional space.

@simonjayhawkins simonjayhawkins added ExtensionArray Extending pandas with custom dtypes or arrays. Output-Formatting __repr__ of pandas objects, to_string labels Jun 20, 2019
Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some quick comments (didn't look in detail yet). This can certainly be a nice clean-up, but I don't think it closes the discussion in #26837

@@ -881,7 +881,7 @@ def _formatter(
``boxed=True``.
"""
if boxed:
return str
return None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How can None ever be a correct formatter function?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to pass nested values through unchanged to allow custom formatting options on the underlying floats

>>> import scipy.sparse
>>> from pandas import SparseSeries
>>> sparse = SparseSeries.from_coo(scipy.sparse.rand(350, 18))
C:\Users\simon\OneDrive\code\pandas-simonjayhawkins\pandas\core\sparse\scipy_sparse.py:151: FutureWarning: Series.to_sparse is deprecated and will be removed in a future version
  s = s.to_sparse()  # TODO: specify kind?
>>> sparse._formatting_values()._formatting_values()
array([0.1057938 , 0.12346779, 0.66926098, 0.63135293, 0.51537009,
       0.83084751, 0.11707738, 0.94582471, 0.53306677, 0.76971943,
       0.43649237, 0.21300667, 0.36538291, 0.75462779, 0.2525123 ,
       0.97204424, 0.62077108, 0.77137921, 0.77416063, 0.06010261,
       0.0758256 , 0.81359155, 0.493831  , 0.86517892, 0.99948485,
       0.16560856, 0.10064932, 0.2234172 , 0.37112586, 0.35745351,
       0.03196741, 0.99648038, 0.39904145, 0.17162981, 0.98518366,
       0.85796259, 0.91980664, 0.78111288, 0.00913085, 0.61513879,
       0.68807133, 0.13308309, 0.99738934, 0.41772456, 0.30984901,
       0.23902977, 0.52296826, 0.90989797, 0.25194294, 0.28347783,
       0.76426202, 0.68137012, 0.00918782, 0.6749851 , 0.3858443 ,
       0.82595983, 0.72062635, 0.63630261, 0.13853372, 0.40467435,
       0.03361887, 0.90591611, 0.50676725])
>>>
>>> sparse._formatting_values()._formatter
<bound method SparseArray._formatter of [0.10579380195508936, 0.123467793554345, 0.6692609807557325, 0.631352928127322, 0.5153700914920036, 0.8308475120205755, 0.11707738027292702, 0.9458247148462586, 0.53306677
34689285, 0.7697194280043602, 0.436492370827794, 0.21300667109238547, 0.3653829053917591, 0.7546277896220214, 0.2525123038343048, 0.9720442422636566, 0.62077107713076, 0.7713792114209488, 0.7741606333773862, 0.0
60102611123280636, 0.07582560258774373, 0.8135915546849382, 0.4938310020146096, 0.865178922460138, 0.9994848548516182, 0.16560855997590118, 0.10064932274817606, 0.22341720411731247, 0.3711258583673639, 0.3574535
112638453, 0.03196740969420808, 0.9964803793350895, 0.399041447277493, 0.17162981330873828, 0.985183664900777, 0.8579625879222734, 0.9198066431764218, 0.7811128816049261, 0.009130849663441576, 0.6151387882173937
, 0.6880713270054183, 0.13308308845790984, 0.997389339850816, 0.4177245615919253, 0.3098490139048895, 0.23902976566001577, 0.5229682615072562, 0.9098979745269118, 0.25194293787078936, 0.28347782689329026, 0.7642
620222948321, 0.6813701212402914, 0.009187820750616082, 0.6749850993894757, 0.3858442970367967, 0.8259598335077263, 0.7206263455689199, 0.6363026132729022, 0.1385337245480187, 0.4046743457416371, 0.0336188668302
50727, 0.905916112003116, 0.5067672489799548]
Fill: nan
BlockIndex
Block locations: array([0])
Block lengths: array([63])>
>>>

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't fully understand the example you show.

But in any case, I would argue that such a fix should not be in the base array, but only in those EAs that need it? (but maybe that doesn't make sense, as I don't really understand the fix :-))

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the formatting is now recursive. so multiple formatters can be applied.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this may be a better example of recursing until a numpy array returned.

>>> from pandas import period_range, Categorical
>>> idx = period_range('2011-01-01 09:00', freq='H', periods=5)
>>> c = Categorical(idx, ordered=True)
>>> exp = """[2011-01-01 09:00, 2011-01-01 10:00, 2011-01-01 11:00, 2011-01-01 12:00, 2011-01-01 13:00]
... Categories (5, period[H]): [2011-01-01 09:00 < 2011-01-01 10:00 < 2011-01-01 11:00 < 2011-01-01 12:00 <
...                     2011-01-01 13:00]"""  # noqa
>>>
>>> c
[2011-01-01 09:00, 2011-01-01 10:00, 2011-01-01 11:00, 2011-01-01 12:00, 2011-01-01 13:00]
Categories (5, period[H]): [2011-01-01 09:00 < 2011-01-01 10:00 < 2011-01-01 11:00 < 2011-01-01 12:00 <
                            2011-01-01 13:00]
>>>
>>>
>>> type(c)
<class 'pandas.core.arrays.categorical.Categorical'>
>>>
>>> c._formatting_values()
PeriodIndex(['2011-01-01 09:00', '2011-01-01 10:00', '2011-01-01 11:00',
             '2011-01-01 12:00', '2011-01-01 13:00'],
            dtype='period[H]', freq='H')
>>>
>>> type(c._formatting_values())
<class 'pandas.core.indexes.period.PeriodIndex'>
>>>
>>> c._formatting_values().dtype
period[H]
>>>
>>> type(c._formatter(boxed=True))
<class 'NoneType'>
>>>
>>> c._formatting_values()._formatting_values()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'PeriodIndex' object has no attribute '_formatting_values'
>>>
>>> type(c.get_values())
<class 'pandas.core.indexes.period.PeriodIndex'>
>>>
>>> c.get_values()
PeriodIndex(['2011-01-01 09:00', '2011-01-01 10:00', '2011-01-01 11:00',
             '2011-01-01 12:00', '2011-01-01 13:00'],
            dtype='period[H]', freq='H')
>>>
>>> c.get_values()._values
<PeriodArray>
['2011-01-01 09:00', '2011-01-01 10:00', '2011-01-01 11:00',
 '2011-01-01 12:00', '2011-01-01 13:00']
Length: 5, dtype: period[H]
>>>
>>> c.get_values()._values._formatter
<bound method PeriodArray._formatter of <PeriodArray>
['2011-01-01 09:00', '2011-01-01 10:00', '2011-01-01 11:00',
 '2011-01-01 12:00', '2011-01-01 13:00']
Length: 5, dtype: period[H]>
>>>
>>> c.get_values()._values._formatting_values()
array([Period('2011-01-01 09:00', 'H'), Period('2011-01-01 10:00', 'H'),
       Period('2011-01-01 11:00', 'H'), Period('2011-01-01 12:00', 'H'),
       Period('2011-01-01 13:00', 'H')], dtype=object)
>>>

@@ -457,6 +457,9 @@ def _formatter(self, boxed=False):
# Defer to CategoricalFormatter's formatter.
return None

def _formatting_values(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_formatting_values is currently deprecated on EAs, so why was this needed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to support the recursive behaviour and remove the special casing.

@codecov
Copy link

codecov bot commented Jun 20, 2019

Codecov Report

Merging #26965 into master will decrease coverage by <.01%.
The diff coverage is 94.28%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #26965      +/-   ##
==========================================
- Coverage   91.87%   91.86%   -0.01%     
==========================================
  Files         180      180              
  Lines       50746    50743       -3     
==========================================
- Hits        46623    46616       -7     
- Misses       4123     4127       +4
Flag Coverage Δ
#multiple 90.46% <94.28%> (-0.01%) ⬇️
#single 41.12% <62.85%> (-0.07%) ⬇️
Impacted Files Coverage Δ
pandas/core/arrays/base.py 100% <100%> (+0.56%) ⬆️
pandas/io/formats/format.py 98% <100%> (+0.09%) ⬆️
pandas/core/arrays/timedeltas.py 88.82% <100%> (ø) ⬆️
pandas/core/indexes/interval.py 96.44% <100%> (ø) ⬆️
pandas/core/arrays/categorical.py 95.81% <100%> (-0.12%) ⬇️
pandas/core/arrays/datetimelike.py 97.6% <66.66%> (-0.33%) ⬇️
pandas/io/gbq.py 88.88% <0%> (-11.12%) ⬇️
pandas/core/frame.py 96.89% <0%> (-0.12%) ⬇️
pandas/io/formats/printing.py 87.2% <0%> (+0.47%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update cfd65e9...5d023cc. Read the comment docs.

@codecov
Copy link

codecov bot commented Jun 20, 2019

Codecov Report

Merging #26965 into master will decrease coverage by 1.41%.
The diff coverage is 94.28%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #26965      +/-   ##
==========================================
- Coverage   91.87%   90.46%   -1.42%     
==========================================
  Files         180      180              
  Lines       50746    50741       -5     
==========================================
- Hits        46623    45902     -721     
- Misses       4123     4839     +716
Flag Coverage Δ
#multiple 90.46% <94.28%> (-0.01%) ⬇️
#single ?
Impacted Files Coverage Δ
pandas/core/internals/blocks.py 94.37% <ø> (-0.01%) ⬇️
pandas/core/arrays/base.py 100% <100%> (+0.56%) ⬆️
pandas/io/formats/format.py 98% <100%> (+0.09%) ⬆️
pandas/core/arrays/timedeltas.py 88.82% <100%> (ø) ⬆️
pandas/core/indexes/interval.py 96.44% <100%> (ø) ⬆️
pandas/core/arrays/categorical.py 95.68% <100%> (-0.25%) ⬇️
pandas/core/arrays/datetimelike.py 97.6% <66.66%> (-0.33%) ⬇️
pandas/core/computation/pytables.py 62.5% <0%> (-27.75%) ⬇️
pandas/io/pytables.py 63.82% <0%> (-26.48%) ⬇️
pandas/io/gbq.py 88.88% <0%> (-11.12%) ⬇️
... and 8 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update cfd65e9...1ef1969. Read the comment docs.

@jorisvandenbossche
Copy link
Member

the formatting is now recursive. so multiple formatters can be applied.

Sorry, that is still too cryptic for me. Can you try to explain what is going on?

@jorisvandenbossche
Copy link
Member

the formatting is now recursive. so multiple formatters can be applied.

So this is specific for Categorical ? Where the actual values (whether they be string, datetime, etc) can still format themselves?
And before, this was handled with some special casing in formats.py itself I suppose?

@simonjayhawkins
Copy link
Member Author

Some quick comments (didn't look in detail yet). This can certainly be a nice clean-up, but I don't think it closes the discussion in #26837

if the special-casing is removed we can close that issue. New issues can be raised for the other points being raised during that discussion.

@jorisvandenbossche
Copy link
Member

I personally think we need a better reason to un-deprecate EA._formatting_values. There needs to be a clear, recommended interface for EAs, now we are adding something back just for Categorical (and maybe Sparse?). That moves the special case from our formatting code to the EA interface, not sure that is a net improvement.

@simonjayhawkins
Copy link
Member Author

these are internal EAs. EA authors may need similar nesting functionality.

Internal EAs should not be special cased and use an interface available to all.

@simonjayhawkins
Copy link
Member Author

@jorisvandenbossche do you want to move the discussion back to the issue? This PR is a proof of concept and will obviously require further discussion and agreement before _formatting_values is undeprecated.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice code removal; do we need some tests around _get_formatted_values()?

@@ -883,14 +880,34 @@ def format_array(values, formatter, float_format=None, na_rep='NaN',
List[str]
"""

def _get_formatted_values(values):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you type & add a doc-string

except AttributeError:
formatter = None

def _format_values(values):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would make this a module level function (with types & doc-string), maybe _format_as_array


try:
values = values._formatting_values()
return _format_values(_get_formatted_values(values))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can the try/except just be around values = values._formatting_values()?

@jbrockmendel
Copy link
Member

@simonjayhawkins can you rebase

@jorisvandenbossche
Copy link
Member

@jbrockmendel I don't think is ready for getting merged, see above discussion.

@simonjayhawkins
Copy link
Member Author

closing for now

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ExtensionArray Extending pandas with custom dtypes or arrays. Output-Formatting __repr__ of pandas objects, to_string
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Refactor format_array in pandas\io\formats\format.py
4 participants