ENH: IO header formatting #15548

mcocdawc · 2017-03-02T13:51:56Z

closes Header option for to_latex #15536
tests added / passed
passes git diff upstream/master | flake8 --diff
whatsnew entry

In this branch I changed pd.formats.format.DataFrameFormatter._to_str_columns which
is used by to_latex and to_string to format output nicely.
They support now optional headers for the column titles.
See the appended example jupyter notebook.

to_html does not use pd.formats.format.DataFrameFormatter._to_str_columns and remains an issue.

Test cases were created.

jreback · 2017-03-02T13:57:22Z

pandas/tests/formats/test_format.py

@@ -1125,6 +1125,17 @@ def test_to_string_no_header(self):

        self.assertEqual(df_s, expected)

+    def test_to_latex_specified_header(self):


move to pandas/tests/formats/test_to_latex (new)

Sorry that was the wrong function name. It should be:
test_to_string_specified_header

jreback · 2017-03-02T14:04:51Z

so this has a bit of overlap with code in CSVFormatter._save_header. anything that you can combine would be helpful (e.g. maybe make a member function that can be called from both for example).

mcocdawc

As far as I understand it, that would not work very well with the abstract structure that is used at the moment.
The _save_header method diretly writes into the file without knowing about the values of other rows in the same column.
For a CSV file this is not a problem, because the string formatting of each row can be independent
from the other rows.
(Justification is simply not an issue for CSV files).
In the case of LaTeX or string tables however the formatting of the header row depends on the n-1 other rows.
The maximum columnwidth determines the justification of every row.
For this reason the _to_str_column method exists.

One could rewrite the CSVFormatter to use _to_str_column as well.
But I assume that the person who introduced a specific CSVFormatter did so for performance reasons.
It is unlikely that size and performance is a problem of tables exported for LaTeX,
but it could be one for CSV files.

I just speculate, so please correct me, if I am wrong.

What could be done also with the current logical structure is to rewrite to_html.

mcocdawc · 2017-03-02T14:10:16Z

pandas/formats/format.py

+                                               minimum=(self.col_space or 0),
+                                               adj=self.adj)
+                stringified.append(fmt_values)
+        elif has_aliases:


It might be worth noting that the only difference between the elif has_aliases: and else: conditional branches is the test for raising a ValueError and cheader = [self.header[i]] instead of cheader = str_columns[i].

It would be possible to write instead:

if not (has_aliases or self.header): # same code block pass else: if has_aliases: if len(self.header) != len(self.columns): raise ValueError(('Writing %d cols but got %d aliases' % (len(self.columns), len(self.header)))) stringified = [] for i, c in enumerate(frame): if has_aliases: cheader = [self.header[i]] else: cheader = str_columns[i] header_colwidth = max(self.col_space or 0, *(self.adj.len(x) for x in cheader)) fmt_values = self._format_col(i) fmt_values = _make_fixed_width(fmt_values, self.justify, minimum=header_colwidth, adj=self.adj) max_len = max(np.max([self.adj.len(x) for x in fmt_values]), header_colwidth) cheader = self.adj.justify(cheader, max_len, mode=self.justify) stringified.append(cheader + fmt_values)

What do you prefer? Redundancy vs. readability.

@mcocdawc readability!

Aye, so I keep it as it is

jreback · 2017-03-02T14:26:57Z

fyi your travis job was cancled. master is breaking, so just wait a min while I fix (then rebase).

jreback · 2017-03-02T14:29:01Z

@mcocdawc about combining code. I merely want you to make a helper method or function (maybe _header_from_columns, which can be called from both _save_header and _str_columns as these seem to be doing a lot of the same things (in part)

mcocdawc · 2017-03-02T14:42:59Z

This function _header_from_columns is then neither a method of the formatCSVFormatter nor of the format.DataFrameFormatter class alone.

So should it just reside as function in the format namespace and then I make a construction like this:

def f(self, x):
    return self.value * x

class example1:
    def __init__(self):
        self.value = 5
    def f(self, x):
        return f(self, x)

class example2:
    def __init__(self):
        self.value = 10
    def f(self, x):
        return f(self, x)

Or what is the preferred structure?

jreback · 2017-03-02T14:48:00Z

@mcocdawc you can call it as a function directly, no? It might be more convenient as a method (e.g. in DataFrameFormatter, because then its easy to call and has access to state (e.g. .header).

FYI you can rebase on master now and push

mcocdawc · 2017-03-02T14:56:44Z

Ok, but if I define it as method in DataFrameFormatter that relies on the state of a DataFrameFormatter object. How could it be reused by a CSVFormatter object.

Should it be similar to this construct?:

class example3:
    def __init__(self, value):
        self.value = value
    def g(self):
        return self.value * 10

class example4:
    def __init__(self, value):
        self.g = example3(value).g

jreback · 2017-03-02T15:04:01Z

yeah this is a dumb problem. CSVFormatter should really inherit from DataFrameFormatter. So what you can do instead is aimply add a Mixin

e.g.

class FormatterMixin(object):
     def _my_shared_method(.....):
          .....

class DataFrameFormatter(...., FormatterMixin):
       ....

class CSVFormatter(....., FormatterMixin):
      ....

maybe they could be combined more. I would actually do this as a separate PR (merged before this one). To make this clean.

mcocdawc · 2017-03-02T16:02:57Z

Of course, a Mixin Class really looks like a clean solution implementationwise.
Still there is the question of why CSVFormatter and DataFrameFormatter differ so much.
Is this just legacy because there was only CSVFormatter in the beginning and DataFrameFormatter came later? Or is there a real performance reason why CSVFormatter writes directly. Because if "directly writing" is not required, one could easily include CSV into DataFrameFormatter.
I am really interested into improving this code part, but I am a student in the middle of my exam period. So I cannot start doing such a project before the middle of April.
The PR does not introduce any structural changes, but it is still better than the current state. So for the time being I would include it; but that is your decision.
If you see flaws in code style..., I am happy to learn.

jreback · 2017-03-02T16:21:48Z

Still there is the question of why CSVFormatter and DataFrameFormatter differ so much.
Is this just legacy because there was only CSVFormatter in the beginning and DataFrameFormatter came later? Or is there a real performance reason why CSVFormatter writes directly. Because if "directly writing" is not required, one could easily include CSV into DataFrameFormatter.

I think just historical (should be fixed)

I am really interested into improving this code part, but I am a student in the middle of my exam period. So I cannot start doing such a project before the middle of April.

no problem. please come back to help on it! (you can make an issue to describe what needs to be done so won't be lost)

jreback · 2017-03-02T16:23:08Z

you need to rebase on master and force push, you have picked up commits which are already there

something like

git fetch origin/master
git checkout thisbranch
git rebase -i origin/master
git push yourfork yourbranch -f

mcocdawc · 2017-03-02T16:31:29Z

Sorry I don't completely understand the problem and the output of your commands.
My local branch for doing stuff is IO_formatting.
I typed:

git fetch origin master 
git checkout IO_formatting
git rebase -i origin/master

The output of the last command was:

The previous cherry-pick is now empty, possibly due to conflict resolution.
If you wish to commit it anyway, use:

    git commit --allow-empty

Otherwise, please use 'git reset'
interactive rebase in progress; onto 3f91d5a
Last commands done (10 commands done):
   pick f329179 renamed test_to_string_specified_header
   pick 522dcdd ENH: _to_str_columns accepts now optional header Hereby also to_latex and to_string accept optional headers
Next commands to do (3 remaining commands):
   pick 5df79b5 Added tests for to_string and to_latex
   pick 23f420f Added tests for to_latex
You are currently rebasing branch 'IO_formatting' on '3f91d5a'.

nothing to commit, working directory clean
Could not apply 522dcddbf470aaf9458d4429f6a60bfdb00950df... ENH: _to_str_columns accepts now optional header Hereby also to_latex and to_string accept optional headers

Do I have to cheryypick something or can I proceed with git push IO_formatting -f (it defaults to my origin which is my fork, doesn't it?)

jreback · 2017-03-02T16:51:08Z

hmm, maybe just try with git rebase origin/master. it should just fast-forward.

codecov-io · 2017-03-02T18:46:05Z

Codecov Report

Merging #15548 into master will decrease coverage by -0.03%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #15548      +/-   ##
==========================================
- Coverage   91.06%   91.04%   -0.03%     
==========================================
  Files         137      137              
  Lines       49307    49314       +7     
==========================================
- Hits        44900    44896       -4     
- Misses       4407     4418      +11

Impacted Files	Coverage Δ
pandas/core/frame.py	`97.83% <100%> (-0.1%)`	❌
pandas/formats/format.py	`95.35% <100%> (+0.01%)`	✅
pandas/io/gbq.py	`25% <0%> (-58.34%)`	❌
pandas/tools/merge.py	`91.78% <0%> (-0.35%)`	❌

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0bf4532...2be1fb1. Read the comment docs.

jreback · 2017-03-02T18:48:15Z

can you add a whatsnew note. And update the .to_latex doc-string as needed.

mcocdawc · 2017-03-02T19:05:21Z

The header parameter for .to_latex comes from fmt.common_docstring.
It is also used for .to_html and .to_string.
Since .to_html is implemented in a way that it makes no use of ._to_str_columns
I will create a wrong documentation for .to_html if I change the fmt.common_docstring.
I could create a fmt.common_docstring_html which takes care of the special case of html.
Where do I create the whatsnew notes correctly.
I will create an Issue for the refactoring of the format module.

jreback · 2017-03-02T19:09:09Z

The header parameter for .to_latex comes from fmt.common_docstring.
It is also used for .to_html and .to_string.
Since .to_html is implemented in a way that it makes no use of ._to_str_columns
I will create a wrong documentation for .to_html if I change the fmt.common_docstring.
I could create a fmt.common_docstring_html which takes care of the special case of html.

you can use replaceable parameters in the doc-string with Substitution

Where do I create the whatsnew notes correctly.

doc/source/whatsnew/v0.20.0.txt in Enhancements

jreback · 2017-03-03T12:59:10Z

pandas/formats/format.py

@@ -53,7 +53,7 @@
    col_space : int, optional
        the minimum width of each column
    header : bool, optional
-        whether to print column labels, default True
+        %s


use a named argument e.g. %{header}s

jreback · 2017-03-03T13:00:00Z

pandas/formats/format.py

@@ -489,30 +489,48 @@ def _to_str_columns(self):
        str_index = self._get_formatted_index(frame)
        str_columns = self._get_formatted_column_labels(frame)

-        if self.header:
+        has_aliases = isinstance(self.header, (tuple, list, np.ndarray, Index))


use is_list_like

Just want to mention that the original reason why I wrote it in this "unpythonic" way was to have consistency with CSVFormatter.
It really seems as if CSVFormatter also could be refurbished.
But for the time being I just keep it in mind for when they are both transformed to using the Mixin class you mentioned.

jreback · 2017-03-03T13:01:35Z

pandas/formats/format.py

@@ -489,30 +489,48 @@ def _to_str_columns(self):
        str_index = self._get_formatted_index(frame)
        str_columns = self._get_formatted_column_labels(frame)

-        if self.header:
+        has_aliases = isinstance(self.header, (tuple, list, np.ndarray, Index))
+        if not (has_aliases or self.header):


so you can change this whole if test around, something like

headers = self.columns if is_list_like(self.header): # validate len(self.header) == len(self.columns) headers = self.header # do the loop from below

mcocdawc · 2017-03-06T11:40:37Z

Ok, I tried to make a clean rebase operation and it still does not work.
The bad thing is that I created an awfully cluttered history of snapshots.
My proposed solution is to create a new branch which is rebased on upstream/master from the beginning on.
I will stop this pull request then and make a new one for the new branch.

To summarise the results of your review:

format.py
- rewrite _to_str_columns to use provided aliases in the header parameter
- uses is_list_like for the conditional
- provides substitution for the header in common_docstring
frame.py
- make @Substitutions for the header parameter in the docstrings of to_latex, to_string and to_html
tests/formats
- test_format.py: Make function test_to_string_specified_header
- test_to_latex.py: Make function test_to_latex_specified_header
whatsnew/v0.20.0.txt
- summarise changes

jreback · 2017-03-06T13:19:24Z

@mcocdawc it doesn't matter to make this super clean. The merge will squash it. Though I generally squash my own PR's as they are simpler (to rebase) and understand. Lots of commits are fine, people do things differently.

jreback · 2017-03-06T13:19:51Z

the key thing is to rebase on origin/master; that will put all of your commits on the top

Hereby also to_latex and to_string accept optional headers

closes pandas-dev#15475 Author: Ben Thayer <[email protected]> Author: bthayer2365 <[email protected]> Closes pandas-dev#15506 from bthayer2365/frozen-index and squashes the following commits: 428a1b3 [Ben Thayer] Added __iadd__ test, fixed whatsnew 84ba405 [Ben Thayer] Merge branch 'master' of github.com:pandas-dev/pandas into frozen-index 8dbde1e [Ben Thayer] Rebased to upstream/master 6f6c140 [Ben Thayer] Added docstrings, depricated __iadd__, changed __add__ to use self.union() 66b3b91 [Ben Thayer] Fixed issue number 3d6cee5 [Ben Thayer] Depricated __add__ in favor of union ccd75c7 [Ben Thayer] Changed __sub__ to difference cd7de26 [Ben Thayer] Added versionadded tag in docs and renamed test_inplace to test_inplace_add for consistency 0ea8d21 [Ben Thayer] Added __isub__ and groupby example to docs 79dd958 [Ben Thayer] Updated whatsnew to reflect changes 0fc7e19 [Ben Thayer] Removed whitespace 73564ab [Ben Thayer] Added FrozenList subtraction fee7a7d [bthayer2365] Merge branch 'master' into frozen-index 6a2b48d [Ben Thayer] Added docstrings, depricated __iadd__, changed __add__ to use self.union() 2ab85cb [Ben Thayer] Fixed issue number cb95089 [Ben Thayer] Depricated __add__ in favor of union 2e43849 [Ben Thayer] Changed __sub__ to difference fdcfbbb [Ben Thayer] Added versionadded tag in docs and renamed test_inplace to test_inplace_add for consistency 2fad2f7 [Ben Thayer] Added __isub__ and groupby example to docs cd73faa [Ben Thayer] Updated whatsnew to reflect changes f6381a8 [Ben Thayer] Removed whitespace ada7cda [Ben Thayer] Added FrozenList subtraction

xref pandas-dev#15506

jreback · 2017-03-06T13:30:17Z

@mcocdawc I fixed your rebase. if any comments left by @jorisvandenbossche

mcocdawc · 2017-03-09T10:40:47Z

Thank you very much. Do I have to do now anything anymore for this commit?

jorisvandenbossche · 2017-03-09T10:59:02Z

@mcocdawc Thanks!

…ng column names (pandas-dev#15548) closes pandas-dev#15536

jreback reviewed Mar 2, 2017

View reviewed changes

jreback added Enhancement IO LaTeX to_latex Output-Formatting __repr__ of pandas objects, to_string labels Mar 2, 2017

jreback changed the title ~~Io formatting~~ ENH: IO header formatting Mar 2, 2017

mcocdawc commented Mar 2, 2017

View reviewed changes

jreback requested changes Mar 3, 2017

View reviewed changes

mcocdawc added 4 commits March 6, 2017 08:20

ENH: _to_str_columns accepts now optional header

dda6ee9

Hereby also to_latex and to_string accept optional headers

Added tests for to_string and to_latex

a9bec30

Added tests for to_latex

384fbd7

renamed test_to_string_specified_header

64c45c5

mcocdawc and others added 7 commits March 6, 2017 08:21

Added tests for to_string and to_latex

78ec8aa

renamed test_to_string_specified_header

87a3838

TST: remove deprecated usages of FrozenList.__add__ from test code

82a9e3e

xref pandas-dev#15506

Added WhatsNew + Docstrings to_latex to_string

60822e9

Nicer logic in _to_str_columns +Modified docstring

915e8a0

fix rebase

2be1fb1

jreback force-pushed the IO_formatting branch from 4c95d8e to 2be1fb1 Compare March 6, 2017 13:29

jreback approved these changes Mar 6, 2017

View reviewed changes

jorisvandenbossche approved these changes Mar 9, 2017

View reviewed changes

jorisvandenbossche merged commit ae0a92a into pandas-dev:master Mar 9, 2017

jorisvandenbossche added this to the 0.20.0 milestone Mar 9, 2017

AnkurDedania pushed a commit to AnkurDedania/pandas that referenced this pull request Mar 21, 2017

ENH: to_string/to_latex now accept list-like header arg for overwriti…

250dc8d

…ng column names (pandas-dev#15548) closes pandas-dev#15536

		@@ -1125,6 +1125,17 @@ def test_to_string_no_header(self):

		self.assertEqual(df_s, expected)

		def test_to_latex_specified_header(self):

Uh oh!

ENH: IO header formatting #15548

ENH: IO header formatting #15548

Uh oh!

Conversation

mcocdawc commented Mar 2, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jreback commented Mar 2, 2017

Uh oh!

mcocdawc left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jreback commented Mar 2, 2017

Uh oh!

jreback commented Mar 2, 2017

Uh oh!

mcocdawc commented Mar 2, 2017

Uh oh!

jreback commented Mar 2, 2017

Uh oh!

mcocdawc commented Mar 2, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jreback commented Mar 2, 2017

Uh oh!

mcocdawc commented Mar 2, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jreback commented Mar 2, 2017

Uh oh!

jreback commented Mar 2, 2017

Uh oh!

mcocdawc commented Mar 2, 2017

Uh oh!

jreback commented Mar 2, 2017

Uh oh!

codecov-io commented Mar 2, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

jreback commented Mar 2, 2017

Uh oh!

mcocdawc commented Mar 2, 2017

Uh oh!

jreback commented Mar 2, 2017

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mcocdawc commented Mar 6, 2017

Uh oh!

jreback commented Mar 6, 2017

Uh oh!

jreback commented Mar 6, 2017

Uh oh!

jreback commented Mar 6, 2017

Uh oh!

mcocdawc commented Mar 9, 2017

Uh oh!

jorisvandenbossche commented Mar 9, 2017

Uh oh!

Uh oh!

mcocdawc commented Mar 2, 2017 •

edited

Loading

mcocdawc left a comment •

edited

Loading

mcocdawc commented Mar 2, 2017 •

edited

Loading

mcocdawc commented Mar 2, 2017 •

edited

Loading

codecov-io commented Mar 2, 2017 •

edited

Loading