ExtensionArray: deprecation of _formatting_values not working? #24858

Closed
jorisvandenbossche opened this issue Jan 21, 2019 · 5 comments
Labels
Bug ExtensionArray Extending pandas with custom dtypes or arrays.

Comments

@jorisvandenbossche
Member

I am doing a small demo ExtensionArray, and it seems the repr is not working properly without defining _formatting_values.

Reproducible example (it's not a complete implementation, but I think the error here is not caused by that):

import numpy as np
import pandas as pd
from pandas.api.extensions import ExtensionDtype, ExtensionArray

from shapely.geometry.base import BaseGeometry

class GeometryDtype(ExtensionDtype):

    @property
    def name(self):
        return "my-geometry-type"

    @property
    def type(self):
        """The scalar type"""
        return BaseGeometry

    @classmethod
    def construct_from_string(cls, string):
        # name is an instance property, so compare against an instance's name
        if string == cls().name:
            return cls()
        else:
            raise TypeError("Cannot construct a '{}' from "
                            "'{}'".format(cls, string))

class GeometryArray(ExtensionArray):
    
    def __init__(self, geoms):
        self._data = geoms
    
    @property
    def dtype(self):
        return GeometryDtype()
    
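    @classmethod
    def _from_sequence(cls, scalars, dtype=None, copy=False):
        # Build the array from a sequence of scalars, stored as objects
        return cls(np.asarray(scalars, dtype=object))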
    
    def __len__(self):
        return len(self._data)
    
    def __getitem__(self, key):
        if isinstance(key, int):
            return self._data[key]
        else:
            return GeometryArray(self._data[key])
        
    def isna(self):
        return np.array([val is None for val in self._data], dtype=bool)
    
    def _formatting_values(self):
        # type: () -> np.ndarray
        # At the moment, this has to be an array since we use result.dtype
        """
        An array of values to be printed in, e.g. the Series repr

        .. deprecated:: 0.24.0

           Use :meth:`ExtensionArray._formatter` instead.
        """
        return np.array(self, dtype=object)

from shapely.geometry import Point

df = pd.DataFrame({
    'a': [1, 2, 3],
    'geoms': GeometryArray(np.array([Point(0, 0), Point(1, 1), Point(2, 2)], dtype=object))})
repr(df)

gives "TypeError: float() argument must be a string or a number, not 'Point'".

full error message
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
~/scipy/repos/ipython/IPython/core/formatters.py in __call__(self, obj)
    343             method = get_real_method(obj, self.print_method)
    344             if method is not None:
--> 345                 return method()
    346             return None
    347         else:

~/scipy/pandas/pandas/core/frame.py in _repr_html_(self)
    665 
    666             return self.to_html(max_rows=max_rows, max_cols=max_cols,
--> 667                                 show_dimensions=show_dimensions, notebook=True)
    668         else:
    669             return None

~/scipy/pandas/pandas/core/frame.py in to_html(self, buf, columns, col_space, header, index, na_rep, formatters, float_format, sparsify, index_names, justify, max_rows, max_cols, show_dimensions, decimal, bold_rows, classes, escape, notebook, border, table_id, render_links)
   2253                                            render_links=render_links)
   2254         # TODO: a generic formatter wld b in DataFrameFormatter
-> 2255         formatter.to_html(classes=classes, notebook=notebook, border=border)
   2256 
   2257         if buf is None:

~/scipy/pandas/pandas/io/formats/format.py in to_html(self, classes, notebook, border)
    733         from pandas.io.formats.html import HTMLFormatter, NotebookFormatter
    734         Klass = NotebookFormatter if notebook else HTMLFormatter
--> 735         html = Klass(self, classes=classes, border=border).render()
    736         if hasattr(self.buf, 'write'):
    737             buffer_put_lines(self.buf, html)

~/scipy/pandas/pandas/io/formats/html.py in render(self)
    527         self.write('<div>')
    528         self.write_style()
--> 529         super(NotebookFormatter, self).render()
    530         self.write('</div>')
    531         return self.elements

~/scipy/pandas/pandas/io/formats/html.py in render(self)
    144 
    145     def render(self):
--> 146         self._write_table()
    147 
    148         if self.should_show_dimensions:

~/scipy/pandas/pandas/io/formats/html.py in _write_table(self, indent)
    180             self._write_header(indent + self.indent_delta)
    181 
--> 182         self._write_body(indent + self.indent_delta)
    183 
    184         self.write('</table>', indent)

~/scipy/pandas/pandas/io/formats/html.py in _write_body(self, indent)
    323     def _write_body(self, indent):
    324         self.write('<tbody>', indent)
--> 325         fmt_values = {i: self.fmt._format_col(i) for i in range(self.ncols)}
    326 
    327         # write values

~/scipy/pandas/pandas/io/formats/html.py in <dictcomp>(.0)
    323     def _write_body(self, indent):
    324         self.write('<tbody>', indent)
--> 325         fmt_values = {i: self.fmt._format_col(i) for i in range(self.ncols)}
    326 
    327         # write values

~/scipy/pandas/pandas/io/formats/format.py in _format_col(self, i)
    712         return format_array(values_to_format, formatter,
    713                             float_format=self.float_format, na_rep=self.na_rep,
--> 714                             space=self.col_space, decimal=self.decimal)
    715 
    716     def to_html(self, classes=None, notebook=False, border=None):

~/scipy/pandas/pandas/io/formats/format.py in format_array(values, formatter, float_format, na_rep, digits, space, justify, decimal, leading_space)
    911                         leading_space=leading_space)
    912 
--> 913     return fmt_obj.get_result()
    914 
    915 

~/scipy/pandas/pandas/io/formats/format.py in get_result(self)
    932 
    933     def get_result(self):
--> 934         fmt_values = self._format_strings()
    935         return _make_fixed_width(fmt_values, self.justify)
    936 

~/scipy/pandas/pandas/io/formats/format.py in _format_strings(self)
   1179             array = values.get_values()
   1180         else:
-> 1181             array = np.asarray(values)
   1182 
   1183         fmt_values = format_array(array,

~/miniconda3/envs/dev/lib/python3.5/site-packages/numpy/core/numeric.py in asarray(a, dtype, order)
    529 
    530     """
--> 531     return array(a, dtype, copy=False, order=order)
    532 
    533 

TypeError: float() argument must be a string or a number, not 'Point'

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
~/scipy/repos/ipython/IPython/core/formatters.py in __call__(self, obj)
    700                 type_pprinters=self.type_printers,
    701                 deferred_pprinters=self.deferred_printers)
--> 702             printer.pretty(obj)
    703             printer.flush()
    704             return stream.getvalue()

~/scipy/repos/ipython/IPython/lib/pretty.py in pretty(self, obj)
    400                         if cls is not object \
    401                                 and callable(cls.__dict__.get('__repr__')):
--> 402                             return _repr_pprint(obj, self, cycle)
    403 
    404             return _default_pprint(obj, self, cycle)

~/scipy/repos/ipython/IPython/lib/pretty.py in _repr_pprint(obj, p, cycle)
    695     """A pprint that just redirects to the normal repr function."""
    696     # Find newlines and replace them with p.break_()
--> 697     output = repr(obj)
    698     for idx,output_line in enumerate(output.splitlines()):
    699         if idx:

~/scipy/pandas/pandas/core/base.py in __repr__(self)
     76         Yields Bytestring in Py2, Unicode String in py3.
     77         """
---> 78         return str(self)
     79 
     80 

~/scipy/pandas/pandas/core/base.py in __str__(self)
     55 
     56         if compat.PY3:
---> 57             return self.__unicode__()
     58         return self.__bytes__()
     59 

~/scipy/pandas/pandas/core/frame.py in __unicode__(self)
    631             width = None
    632         self.to_string(buf=buf, max_rows=max_rows, max_cols=max_cols,
--> 633                        line_width=width, show_dimensions=show_dimensions)
    634 
    635         return buf.getvalue()

~/scipy/pandas/pandas/core/frame.py in to_string(self, buf, columns, col_space, header, index, na_rep, formatters, float_format, sparsify, index_names, justify, max_rows, max_cols, show_dimensions, decimal, line_width)
    712                                            decimal=decimal,
    713                                            line_width=line_width)
--> 714         formatter.to_string()
    715 
    716         if buf is None:

~/scipy/pandas/pandas/io/formats/format.py in to_string(self)
    602         else:
    603 
--> 604             strcols = self._to_str_columns()
    605             if self.line_width is None:  # no need to wrap around just print
    606                 # the whole frame

~/scipy/pandas/pandas/io/formats/format.py in _to_str_columns(self)
    537                 header_colwidth = max(self.col_space or 0,
    538                                       *(self.adj.len(x) for x in cheader))
--> 539                 fmt_values = self._format_col(i)
    540                 fmt_values = _make_fixed_width(fmt_values, self.justify,
    541                                                minimum=header_colwidth,

~/scipy/pandas/pandas/io/formats/format.py in _format_col(self, i)
    712         return format_array(values_to_format, formatter,
    713                             float_format=self.float_format, na_rep=self.na_rep,
--> 714                             space=self.col_space, decimal=self.decimal)
    715 
    716     def to_html(self, classes=None, notebook=False, border=None):

~/scipy/pandas/pandas/io/formats/format.py in format_array(values, formatter, float_format, na_rep, digits, space, justify, decimal, leading_space)
    911                         leading_space=leading_space)
    912 
--> 913     return fmt_obj.get_result()
    914 
    915 

~/scipy/pandas/pandas/io/formats/format.py in get_result(self)
    932 
    933     def get_result(self):
--> 934         fmt_values = self._format_strings()
    935         return _make_fixed_width(fmt_values, self.justify)
    936 

~/scipy/pandas/pandas/io/formats/format.py in _format_strings(self)
   1179             array = values.get_values()
   1180         else:
-> 1181             array = np.asarray(values)
   1182 
   1183         fmt_values = format_array(array,

~/miniconda3/envs/dev/lib/python3.5/site-packages/numpy/core/numeric.py in asarray(a, dtype, order)
    529 
    530     """
--> 531     return array(a, dtype, copy=False, order=order)
    532 
    533 

TypeError: float() argument must be a string or a number, not 'Point'

So it seems that it still tries to convert the values to an array, as the default of _formatting_values does (return np.array(self)), instead of calling the _formatter on the individual elements.
If I add a _formatting_values that returns an object array, it works correctly.

cc @TomAugspurger

jorisvandenbossche added the ExtensionArray label on Jan 21, 2019
@jorisvandenbossche
Member Author

OK, so it might be due to the incomplete implementation. Because if I add a def __array__, it also works (without needing to define _formatting_values).
But currently the interface docs don't list __array__, or the ability to be converted to an ndarray, as mandatory (although the latter is maybe obvious).

The special case here is that I have objects that have a length, so converting them to a numpy array with np.array(vals) fails (with the error above), and you need to specify the object dtype explicitly, as in np.array(vals, dtype=object).
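
A minimal sketch of the two points above, using plain tuples as a stand-in for scalars that have a length. The `PointLikeArray` class and its names are hypothetical and not part of pandas or shapely. It shows np.array(vals) unpacking such elements into an extra dimension. It also shows an `__array__` method that fills a preallocated object array, which gives NumPy an unambiguous 1-D result:

```python
import numpy as np

# Tuples stand in for scalars that have a length (an assumption for
# illustration; the issue itself uses shapely geometries).
vals = [(0, 0), (1, 1), (2, 2)]

# Without guidance NumPy unpacks each element into a second dimension.
unpacked = np.array(vals)
print(unpacked.shape)  # (3, 2)


class PointLikeArray:
    """Hypothetical container, not a full ExtensionArray."""

    def __init__(self, data):
        self._data = list(data)

    def __len__(self):
        return len(self._data)

    def __getitem__(self, key):
        return self._data[key]

    def __array__(self, dtype=None, copy=None):
        # Preallocate a 1-D object array and assign element by element so
        # that each scalar is stored as-is rather than unpacked.
        out = np.empty(len(self._data), dtype=object)
        for i, val in enumerate(self._data):
            out[i] = val
        return out


converted = np.asarray(PointLikeArray(vals))
print(converted.shape, converted.dtype)  # (3,) object
```

Filling a preallocated array is used here because, for true Python sequences such as tuples, passing dtype=object alone would still unpack the elements.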

It is probably obvious that np.array(EA) needs to work (I was just trying to make a minimal demo example, so therefore didn't do this yet).
But should we be more explicit about what the expected return type is here? Is it expected to be an array of the scalar type (EA.dtype.type)?

And maybe we should be explicit in _formatter that the formatting function is called on elements after converting to an ndarray, not on scalars of the EA directly. Currently we don't assume these are the same (e.g. for a DatetimeArray, it can be an np.datetime64 instance vs a pd.Timestamp).
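
The DatetimeArray difference can be checked with a short sketch (the variable names are illustrative only): the scalar taken from the ExtensionArray and the scalar taken from its ndarray conversion have different types, so a formatting function may see either, depending on where it is applied.

```python
import numpy as np
import pandas as pd

# DatetimeArray backing a tz-naive DatetimeIndex.
arr = pd.to_datetime(["2019-01-21"]).array

ea_scalar = arr[0]              # scalar obtained from the ExtensionArray
nd_scalar = np.asarray(arr)[0]  # scalar obtained after ndarray conversion

print(type(ea_scalar).__name__)  # Timestamp
print(type(nd_scalar).__name__)  # datetime64
```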

@TomAugspurger
Contributor

Rewriting the formatting code to handle either ndarrays or ExtensionArrays would be the best path forward for users, right? Then we wouldn't have an intermediate conversion to ndarray?

@jorisvandenbossche
Member Author

Normally it still first slices the part of the full EA that is needed for the display, so a possibly costly conversion of the full EA to an ndarray is avoided in that case. Given that, I am not sure it is that important to avoid this conversion to ndarray?
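
A rough sketch of why slicing first keeps the conversion cheap. `CountingArray` is a hypothetical stand-in, not pandas code, that records how many elements pass through `__array__`:

```python
import numpy as np


class CountingArray:
    """Hypothetical container that records how many elements it converts."""

    def __init__(self, data):
        self._data = list(data)
        self.converted = 0

    def __len__(self):
        return len(self._data)

    def __getitem__(self, key):
        if isinstance(key, slice):
            return CountingArray(self._data[key])
        return self._data[key]

    def __array__(self, dtype=None, copy=None):
        # Record the size of every conversion request.
        self.converted += len(self._data)
        out = np.empty(len(self._data), dtype=object)
        for i, val in enumerate(self._data):
            out[i] = val
        return out


full = CountingArray(range(100_000))
head = full[:5]           # slice first, as described for the display code
shown = np.asarray(head)  # only the sliced part is converted

# The full array was never converted, so full.converted is still 0.
print(len(shown), full.converted)
```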

@TomAugspurger
Contributor

Agreed. So pass dtype=object in _formatting_values?

@mroeschke
Member

I don't see _formatting_values defined anywhere in the code base, so I'm assuming it got removed or refactored and this is no longer an issue. Closing, but if it's relevant in another method we can reopen.
