Table Schema bombs with MultiIndex #15996

Closed
rgbkrk opened this issue Apr 14, 2017 · 13 comments · Fixed by #16132
Labels
Bug MultiIndex Output-Formatting __repr__ of pandas objects, to_string

Comments

@rgbkrk
Contributor

rgbkrk commented Apr 14, 2017

Using the new (OK, yet to be released) Table Schema output with the generated MultiIndex mentioned in #15379, I noticed that it produces a traceback (and falls back to HTML).

Code Sample

import pandas as pd
import numpy as np
pd.options.display.html.table_schema = True

midx = pd.MultiIndex.from_product([['A', 'B'], ['a', 'b', 'c']])

df = pd.DataFrame(np.random.randn(5, len(midx)), columns=midx)

df

Problem description

Full Traceback
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/Users/kylek/code/src/github.com/pandas-dev/pandas/pandas/indexes/multi.py in _convert_can_do_setop(self, other)
   2513                 try:
-> 2514                     other = MultiIndex.from_tuples(other)
   2515                 except:

/Users/kylek/code/src/github.com/pandas-dev/pandas/pandas/indexes/multi.py in from_tuples(cls, tuples, sortorder, names)
   1128         elif isinstance(tuples, list):
-> 1129             arrays = list(lib.to_object_array_tuples(tuples).T)
   1130         else:

TypeError: Argument 'rows' has incorrect type (expected list, got FrozenList)

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
/usr/local/lib/python3.6/site-packages/IPython/core/formatters.py in __call__(self, obj)
    880             method = get_real_method(obj, self.print_method)
    881             if method is not None:
--> 882                 method()
    883                 return True
    884 

/Users/kylek/code/src/github.com/pandas-dev/pandas/pandas/core/generic.py in _ipython_display_(self)
    138         latex = self._repr_latex_() if hasattr(self, '_repr_latex_') else None
    139         html = self._repr_html_() if hasattr(self, '_repr_html_') else None
--> 140         table_schema = self._repr_table_schema_()
    141         # We need the inital newline since we aren't going through the
    142         # usual __repr__. See

/Users/kylek/code/src/github.com/pandas-dev/pandas/pandas/core/generic.py in _repr_table_schema_(self)
    156         if config.get_option("display.html.table_schema"):
    157             data = self.head(config.get_option('display.max_rows'))
--> 158             payload = json.loads(data.to_json(orient='table'),
    159                                  object_pairs_hook=collections.OrderedDict)
    160             return payload

/Users/kylek/code/src/github.com/pandas-dev/pandas/pandas/core/generic.py in to_json(self, path_or_buf, orient, date_format, double_precision, force_ascii, date_unit, default_handler, lines)
   1232                             force_ascii=force_ascii, date_unit=date_unit,
   1233                             default_handler=default_handler,
-> 1234                             lines=lines)
   1235 
   1236     def to_hdf(self, path_or_buf, key, **kwargs):

/Users/kylek/code/src/github.com/pandas-dev/pandas/pandas/io/json/json.py in to_json(path_or_buf, obj, orient, date_format, double_precision, force_ascii, date_unit, default_handler, lines)
     44         obj, orient=orient, date_format=date_format,
     45         double_precision=double_precision, ensure_ascii=force_ascii,
---> 46         date_unit=date_unit, default_handler=default_handler).write()
     47 
     48     if lines:

/Users/kylek/code/src/github.com/pandas-dev/pandas/pandas/io/json/json.py in __init__(self, obj, orient, date_format, double_precision, ensure_ascii, date_unit, default_handler)
    141         # TODO: Do this timedelta properly in objToJSON.c See GH #15137
    142         if ((obj.ndim == 1) and (obj.name in set(obj.index.names)) or
--> 143                 len(obj.columns & obj.index.names)):
    144             msg = "Overlapping names between the index and columns"
    145             raise ValueError(msg)

/Users/kylek/code/src/github.com/pandas-dev/pandas/pandas/indexes/base.py in __and__(self, other)
   2046 
   2047     def __and__(self, other):
-> 2048         return self.intersection(other)
   2049 
   2050     def __or__(self, other):

/Users/kylek/code/src/github.com/pandas-dev/pandas/pandas/indexes/multi.py in intersection(self, other)
   2447         """
   2448         self._assert_can_do_setop(other)
-> 2449         other, result_names = self._convert_can_do_setop(other)
   2450 
   2451         if self.equals(other):

/Users/kylek/code/src/github.com/pandas-dev/pandas/pandas/indexes/multi.py in _convert_can_do_setop(self, other)
   2514                     other = MultiIndex.from_tuples(other)
   2515                 except:
-> 2516                     raise TypeError(msg)
   2517         else:
   2518             result_names = self.names if self.names == other.names else None

TypeError: other must be a MultiIndex or a list of tuples

The key line in there is "expected list, got FrozenList", which comes from arrays = list(lib.to_object_array_tuples(tuples).T).
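
One possible way to sidestep this, sketched under assumptions: the overlap check could compare plain Python sets instead of going through the Index & operator, which is what routes into MultiIndex.intersection in the traceback. Here overlapping_names is a hypothetical helper, not pandas API, and the literal tuples stand in for what iterating over MultiIndex columns yields.

```python
def overlapping_names(column_labels, index_names):
    """Return the labels used both as a column label and as an index name.

    Hypothetical helper for illustration only; not part of pandas.
    """
    # A plain set intersection never calls Index.__and__, so it does not
    # care whether the column labels are strings or tuples.
    return set(index_names).intersection(column_labels)


# Column labels shaped like the MultiIndex in the report: ('A', 'a'), ...
columns = [(outer, inner) for outer in "AB" for inner in "abc"]

# An unnamed index has names [None]; no tuple equals None, so no overlap.
print(overlapping_names(columns, [None]))

# A flat frame where 'x' is both a column and the index name does overlap.
print(overlapping_names(["x", "y"], ["x"]))
```

This only removes the TypeError from the name check. It says nothing about whether the rest of the Table Schema writer can represent tuple field names.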

@jreback
Contributor

jreback commented Apr 14, 2017

cc @TomAugspurger

@jreback jreback added this to the 0.20.0 milestone Apr 14, 2017
@jreback jreback added Bug MultiIndex Output-Formatting __repr__ of pandas objects, to_string labels Apr 14, 2017
@TomAugspurger
Contributor

Right... I think I knew about this and forgot to handle it. I think the issue is that the spec wants the field names to be a string. We would probably represent this as a tuple.

So we can do this, it just means we won't be compliant. Thoughts? I guess we could have a strict mode that would raise in cases like this? But for use-cases like sending data to nteract, we don't really want to raise an exception, so we'd be laxer.

@rgbkrk
Contributor Author

rgbkrk commented Apr 14, 2017

Ah right, I remember a little bit of chatter on this. I think it would be OK to only put out an HTML table in cases like these; I'd prefer compliance over niceties for MultiIndex. We can move specs forward over time.

/cc @pwalsh

@TomAugspurger
Contributor

Oh, we also generate invalid JSON with a MultiIndex in the columns 😬 #15273. Don't think I'll have time to get to that before the release.

If you think falling back in these cases is appropriate, I'll put that logic in the publish part.

Another option is to serialize all the level names down to a string like <level1>-<level2>... and add some additional fields with the information needed to deserialize it (number of levels, separator...)
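
A minimal sketch of that flattening idea, assuming string level values and a separator that does not appear in any level except possibly the last. The names flatten_labels and unflatten_labels and the "nlevels" / "sep" metadata keys are made up for illustration; they are not part of pandas or of the Table Schema spec.

```python
def flatten_labels(labels, sep="-"):
    """Join each tuple label into one string and record how to undo it."""
    flat = [sep.join(str(level) for level in label) for label in labels]
    # Extra fields a reader would need to split the names again.
    meta = {"nlevels": len(labels[0]) if labels else 0, "sep": sep}
    return flat, meta


def unflatten_labels(flat, meta):
    """Split flattened names back into tuples using the recorded metadata."""
    # Limit the number of splits so a separator inside the last level survives.
    maxsplit = meta["nlevels"] - 1
    return [tuple(name.split(meta["sep"], maxsplit)) for name in flat]


labels = [("A", "a"), ("A", "b"), ("B", "a")]
flat, meta = flatten_labels(labels)
print(flat)                          # flattened field names
print(unflatten_labels(flat, meta))  # tuples recovered from the strings
```

Non-string level values come back as strings with this approach, so a real implementation would also need to record the level dtypes.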

@jorisvandenbossche
Member

There were indeed still a bunch of cases where the json schema generation errors, see the list in the PR: #14904 (comment) (and multi-index columns is one of them). We should probably open an issue for those (or use this one as the general follow-up issue).

@jreback
Contributor

jreback commented Apr 14, 2017

we could add a NotImplementedError for now?

jreback added a commit to jreback/pandas that referenced this issue Apr 22, 2017
@jreback jreback modified the milestones: Next Major Release, 0.20.0 Apr 22, 2017
@TomAugspurger TomAugspurger reopened this Apr 22, 2017
@TomAugspurger
Contributor

@rgbkrk what's the ideal behavior here, as far as front-ends are concerned? Do you want us to just not have an "application/vnd.dataresource+json" key (text/html should still be there)? Is there another channel we should publish information on? I'm looking through the Jupyter client docs now to see if there are any recommendations.

@rgbkrk
Contributor Author

rgbkrk commented Apr 24, 2017

Yeah, just don't have the application/vnd.dataresource+json if it's not supported and only provide text/html.

@rgbkrk
Contributor Author

rgbkrk commented Apr 24, 2017

As for another channel, are you asking about where to send errors to?

@TomAugspurger
Contributor

As for another channel, are you asking about where to send errors to?

Yeah, my memory is a bit hazy, but I thought there were other channels to publish errors. It looks like I was misremembering though. Is there some way to notify you that the vnd.dataresource+json version failed?

@rgbkrk
Contributor Author

rgbkrk commented Apr 24, 2017

Beyond a traceback? There's probably a way to push out other errors. Emitting a warning is fine too.

What do you think @minrk and @Carreau?

@Carreau
Contributor

Carreau commented Apr 25, 2017

A warning on stderr will be printed in red in the classic notebook. (I believe it should be yellow, because red is scary.) An error will stop the execution, while output printed on stderr is treated as just "text" and the notebook should keep processing as planned. Does that answer your question?

@TomAugspurger
Contributor

Does that answer your question?

I think so, thanks. I'll emit a warning when we fail to serialize the object, and publish just the text and html reprs.
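
A rough sketch of that plan, not the actual pandas change: build the bundle of reprs, try the Table Schema serialization, and on failure warn and leave the "application/vnd.dataresource+json" key out. The build_mimebundle function and its make_table_schema callback are hypothetical names used only for this illustration.

```python
import warnings

TABLE_SCHEMA_MIME = "application/vnd.dataresource+json"


def build_mimebundle(text, html, make_table_schema):
    """Return a dict of reprs, dropping the Table Schema key if it fails."""
    bundle = {"text/plain": text, "text/html": html}
    try:
        bundle[TABLE_SCHEMA_MIME] = make_table_schema()
    except Exception as exc:
        # Broad catch on purpose in this sketch: any serialization failure
        # should degrade to text/html rather than break the display.
        warnings.warn(
            f"Table Schema repr failed, publishing text and HTML only: {exc}"
        )
    return bundle


def failing_schema():
    # Stand-in for the MultiIndex failure reported in this issue.
    raise TypeError("other must be a MultiIndex or a list of tuples")


bundle = build_mimebundle("df as text", "<table></table>", failing_schema)
print(sorted(bundle))  # only the text and HTML keys remain
```

A warning goes to stderr rather than stopping execution, which matches the behavior described above for the classic notebook.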
