ENH: DataFrame.style sparsified MultiIndex

TomAugspurger · TomAugspurger · commit 5d163ceb8a39 · 2016-08-04T06:43:39.000-05:00
- [x] closes #11655
- [x] tests added / passed
- [x] passes ``git diff upstream/master | flake8 --diff``
- [x] whatsnew entry

[Notebook comparing `DataFrame._repr_html_` to `DataFrame.style`](https://gist.github.com/609c398f814b4a505bf4f406670e457e)

I think we're identical for non-truncated DataFrames. Truncation has not been implemented in `Styler` yet.

Along the way I noticed two other things that ended up needing fixing.

1. `DataFrame.columns.names` were not displayed
2. CSS classes weren't being assigned correctly to row labels.

The fixes ended up being pretty intertwined, so I've put them in a single PR. Unfortunately, the commits are a bit jumbled as well :/

Author: Tom Augspurger <tom.augspurger88@gmail.com>

Closes #13775 from TomAugspurger/style-sparse-mi-2 and squashes the following commits:

7c03a72 [Tom Augspurger] ENH: DataFrame.style column names
ecba615 [Tom Augspurger] ENH: MultiIndex Structure for DataFrame.style
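
For readers skimming the diff, here is a rough, pandas-free sketch of the span bookkeeping behind the sparsified rendering. It is not the code in this PR: `level_spans` and `is_visible` are hypothetical names, the input is assumed to be a plain list of label tuples, and leaving the innermost level unmerged is an assumption about the desired output. The result maps `(level, start_position)` to the number of rows (or columns) the header cell should span; positions without an entry are the ones that get hidden.

```python
def level_spans(labels):
    """Map (level, start_position) -> span for a list of label tuples.

    Hypothetical helper for illustration only; it mimics the idea of
    merging repeated outer-level labels into one spanning header cell.
    """
    spans = {}
    if not labels:
        return spans
    n_levels = len(labels[0])
    for level in range(n_levels):
        # Assumption: the innermost level is never merged.
        innermost = level == n_levels - 1
        start = 0
        for pos, label in enumerate(labels):
            # A label continues the current span only if the whole prefix
            # up to and including this level matches the previous entry.
            same_prefix = (pos > 0 and
                           label[:level + 1] == labels[pos - 1][:level + 1])
            if same_prefix and not innermost:
                spans[(level, start)] += 1
            else:
                start = pos
                spans[(level, start)] = 1
    return spans


def is_visible(position, level, spans):
    """A header cell is rendered only where a span starts."""
    return (level, position) in spans


# Two rows under "a", one row under "b".
labels = [("a", 0), ("a", 1), ("b", 0)]
spans = level_spans(labels)
```

In this example the outer label `"a"` gets a span of 2 starting at position 0, so the cell at position 1 of level 0 is not visible; a template could then emit the span as a `rowspan` or `colspan` attribute on the visible cell.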
diff --git a/doc/source/html-styling.ipynb b/doc/source/html-styling.ipynb
@@ -788,6 +788,27 @@
     "We hope to collect some useful ones either in pandas, or preferable in a new package that [builds on top](#Extensibility) the tools here."
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# CSS Classes\n",
+    "\n",
+    "Certain CSS classes are attached to cells.\n",
+    "\n",
+    "- Index and Column names include `index_name` and `level<k>` where `k` is its level in a MultiIndex\n",
+    "- Index label cells include\n",
+    "  + `row_heading`\n",
+    "  + `row<n>` where `n` is the numeric position of the row\n",
+    "  + `level<k>` where `k` is the level in a MultiIndex\n",
+    "- Column label cells include\n",
+    "  + `col_heading`\n",
+    "  + `col<n>` where `n` is the numeric position of the column\n",
+    "  + `level<k>` where `k` is the level in a MultiIndex\n",
+    "- Blank cells include `blank`\n",
+    "- Data cells include `data`"
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {},
diff --git a/doc/source/whatsnew/v0.19.0.txt b/doc/source/whatsnew/v0.19.0.txt
@@ -373,6 +373,8 @@ Other enhancements
 - ``Series.append`` now supports the ``ignore_index`` option (:issue:`13677`)
 - ``.to_stata()`` and ``StataWriter`` can now write variable labels to Stata dta files using a dictionary to make column names to labels (:issue:`13535`, :issue:`13536`)
 - ``.to_stata()`` and ``StataWriter`` will automatically convert ``datetime64[ns]`` columns to Stata format ``%tc``, rather than raising a ``ValueError`` (:issue:`12259`)
+- ``DataFrame.style`` will now render sparsified MultiIndexes (:issue:`11655`)
+- ``DataFrame.style`` will now show column level names (e.g. ``DataFrame.columns.names``) (:issue:`13775`)
 - ``DataFrame`` has gained support to re-order the columns based on the values
   in a row using ``df.sort_values(by='...', axis=1)`` (:issue:`10806`)
 
@@ -884,10 +886,10 @@ Bug Fixes
 - Bug in ``groupby`` with ``as_index=False`` returns all NaN's when grouping on multiple columns including a categorical one (:issue:`13204`)
 - Bug in ``df.groupby(...)[...]`` where getitem with ``Int64Index`` raised an error (:issue:`13731`)
 
+- Bug in the CSS classes assigned to ``DataFrame.style`` for index names. Previously they were assigned ``"col_heading level<n> col<c>"`` where ``n`` was the number of levels + 1. Now they are assigned ``"index_name level<n>"``, where ``n`` is the correct level for that MultiIndex.
 - Bug where ``pd.read_gbq()`` could throw ``ImportError: No module named discovery`` as a result of a naming conflict with another python package called apiclient  (:issue:`13454`)
 - Bug in ``Index.union`` returns an incorrect result with a named empty index (:issue:`13432`)
 - Bugs in ``Index.difference`` and ``DataFrame.join`` raise in Python3 when using mixed-integer indexes (:issue:`13432`, :issue:`12814`)
-
 - Bug in ``.to_excel()`` when DataFrame contains a MultiIndex which contains a label with a NaN value (:issue:`13511`)
 - Bug in ``pd.read_csv`` in Python 2.x with non-UTF8 encoded, multi-character separated data (:issue:`3404`)
 - Bug in ``Index`` raises ``KeyError`` displaying incorrect column when column is not in the df and columns contains duplicate values (:issue:`13822`)
diff --git a/pandas/formats/style.py b/pandas/formats/style.py
@@ -21,7 +21,9 @@
 
 import numpy as np
 import pandas as pd
-from pandas.compat import lzip, range
+from pandas.compat import range
+from pandas.core.config import get_option
+import pandas.core.common as com
 from pandas.core.indexing import _maybe_numeric_slice, _non_reducing_slice
 try:
     import matplotlib.pyplot as plt
@@ -79,6 +81,24 @@ class Styler(object):
     to automatically render itself. Otherwise call Styler.render to get
     the genterated HTML.
 
+    CSS classes are attached to the generated HTML
+
+    * Index and Column names include ``index_name`` and ``level<k>``
+      where `k` is its level in a MultiIndex
+    * Index label cells include
+
+      * ``row_heading``
+      * ``row<n>`` where `n` is the numeric position of the row
+      * ``level<k>`` where `k` is the level in a MultiIndex
+
+    * Column label cells include
+      * ``col_heading``
+      * ``col<n>`` where `n` is the numeric position of the column
+      * ``level<k>`` where `k` is the level in a MultiIndex
+
+    * Blank cells include ``blank``
+    * Data cells include ``data``
+
     See Also
     --------
     pandas.DataFrame.style
@@ -110,7 +130,10 @@ class Styler(object):
             {% for r in head %}
             <tr>
                 {% for c in r %}
-                <{{c.type}} class="{{c.class}}">{{c.value}}
+                {% if c.is_visible != False %}
+                <{{c.type}} class="{{c.class}}" {{ c.attributes|join(" ") }}>
+                  {{c.value}}
+                {% endif %}
                 {% endfor %}
             </tr>
             {% endfor %}
@@ -119,8 +142,11 @@ class Styler(object):
             {% for r in body %}
             <tr>
                 {% for c in r %}
-                <{{c.type}} id="T_{{uuid}}{{c.id}}" class="{{c.class}}">
+                {% if c.is_visible != False %}
+                <{{c.type}} id="T_{{uuid}}{{c.id}}"
+                 class="{{c.class}}" {{ c.attributes|join(" ") }}>
                     {{ c.display_value }}
+                {% endif %}
                 {% endfor %}
             </tr>
             {% endfor %}
@@ -148,7 +174,7 @@ def __init__(self, data, precision=None, table_styles=None, uuid=None,
         self.table_styles = table_styles
         self.caption = caption
         if precision is None:
-            precision = pd.options.display.precision
+            precision = get_option('display.precision')
         self.precision = precision
         self.table_attributes = table_attributes
         # display_funcs maps (row, col) -> formatting function
@@ -177,21 +203,26 @@ def _translate(self):
         uuid = self.uuid or str(uuid1()).replace("-", "_")
         ROW_HEADING_CLASS = "row_heading"
         COL_HEADING_CLASS = "col_heading"
+        INDEX_NAME_CLASS = "index_name"
+
         DATA_CLASS = "data"
         BLANK_CLASS = "blank"
         BLANK_VALUE = ""
 
+        def format_attr(pair):
+            return "{key}={value}".format(**pair)
+
+        # for sparsifying a MultiIndex
+        idx_lengths = _get_level_lengths(self.index)
+        col_lengths = _get_level_lengths(self.columns)
+
         cell_context = dict()
 
         n_rlvls = self.data.index.nlevels
         n_clvls = self.data.columns.nlevels
         rlabels = self.data.index.tolist()
         clabels = self.data.columns.tolist()
 
-        idx_values = self.data.index.format(sparsify=False, adjoin=False,
-                                            names=False)
-        idx_values = lzip(*idx_values)
-
         if n_rlvls == 1:
             rlabels = [[x] for x in rlabels]
         if n_clvls == 1:
@@ -202,9 +233,24 @@ def _translate(self):
         head = []
 
         for r in range(n_clvls):
+            # Blank for Index columns...
             row_es = [{"type": "th",
                        "value": BLANK_VALUE,
-                       "class": " ".join([BLANK_CLASS])}] * n_rlvls
+                       "display_value": BLANK_VALUE,
+                       "is_visible": True,
+                       "class": " ".join([BLANK_CLASS])}] * (n_rlvls - 1)
+
+            # ... except maybe the last for columns.names
+            name = self.data.columns.names[r]
+            cs = [BLANK_CLASS if name is None else INDEX_NAME_CLASS,
+                  "level%s" % r]
+            name = BLANK_VALUE if name is None else name
+            row_es.append({"type": "th",
+                           "value": name,
+                           "display_value": name,
+                           "class": " ".join(cs),
+                           "is_visible": True})
+
             for c in range(len(clabels[0])):
                 cs = [COL_HEADING_CLASS, "level%s" % r, "col%s" % c]
                 cs.extend(cell_context.get(
@@ -213,16 +259,23 @@ def _translate(self):
                 row_es.append({"type": "th",
                                "value": value,
                                "display_value": value,
-                               "class": " ".join(cs)})
+                               "class": " ".join(cs),
+                               "is_visible": _is_visible(c, r, col_lengths),
+                               "attributes": [
+                                   format_attr({"key": "colspan",
+                                                "value": col_lengths.get(
+                                                    (r, c), 1)})
+                               ]})
             head.append(row_es)
 
-        if self.data.index.names and self.data.index.names != [None]:
+        if self.data.index.names and not all(x is None
+                                             for x in self.data.index.names):
             index_header_row = []
 
             for c, name in enumerate(self.data.index.names):
-                cs = [COL_HEADING_CLASS,
-                      "level%s" % (n_clvls + 1),
-                      "col%s" % c]
+                cs = [INDEX_NAME_CLASS,
+                      "level%s" % c]
+                name = '' if name is None else name
                 index_header_row.append({"type": "th", "value": name,
                                          "class": " ".join(cs)})
 
@@ -236,12 +289,17 @@ def _translate(self):
 
         body = []
         for r, idx in enumerate(self.data.index):
-            cs = [ROW_HEADING_CLASS, "level%s" % c, "row%s" % r]
-            cs.extend(
-                cell_context.get("row_headings", {}).get(r, {}).get(c, []))
+            #  cs.extend(
+            #    cell_context.get("row_headings", {}).get(r, {}).get(c, []))
             row_es = [{"type": "th",
+                       "is_visible": _is_visible(r, c, idx_lengths),
+                       "attributes": [
+                           format_attr({"key": "rowspan",
+                                        "value": idx_lengths.get((c, r), 1)})
+                       ],
                        "value": rlabels[r][c],
-                       "class": " ".join(cs),
+                       "class": " ".join([ROW_HEADING_CLASS, "level%s" % c,
+                                          "row%s" % r]),
                        "display_value": rlabels[r][c]}
                       for c in range(len(rlabels[r]))]
 
@@ -893,6 +951,40 @@ def _highlight_extrema(data, color='yellow', max_=True):
                                 index=data.index, columns=data.columns)
 
 
+def _is_visible(idx_row, idx_col, lengths):
+    """
+    True if ``lengths`` has a ``(idx_col, idx_row)`` key, i.e. a span starts there.
+    """
+    return (idx_col, idx_row) in lengths
+
+
+def _get_level_lengths(index):
+    """
+    Given an index, find the level length for each element.
+
+    Result is a dictionary of (level, initial_position): span
+    """
+    sentinel = com.sentinel_factory()
+    levels = index.format(sparsify=sentinel, adjoin=False, names=False)
+
+    if index.nlevels == 1:
+        return {(0, i): 1 for i, value in enumerate(levels)}
+
+    lengths = {}
+
+    for i, lvl in enumerate(levels):
+        for j, row in enumerate(lvl):
+            if not get_option('display.multi_sparse'):
+                lengths[(i, j)] = 1
+            elif row != sentinel:
+                last_label = j
+                lengths[(i, last_label)] = 1
+            else:
+                lengths[(i, last_label)] += 1
+
+    return lengths
+
+
 def _maybe_wrap_formatter(formatter):
     if is_string_like(formatter):
         return lambda x: formatter.format(x)
diff --git a/pandas/tests/formats/test_style.py b/pandas/tests/formats/test_style.py