Skip to content

ENH: IO header formatting #15548

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 11 commits into from
Mar 9, 2017
Merged

Conversation

mcocdawc
Copy link
Contributor

@mcocdawc mcocdawc commented Mar 2, 2017

In this branch I changed pd.formats.format.DataFrameFormatter._to_str_columns which
is used by to_latex and to_string to format output nicely.
They support now optional headers for the column titles.
See the appended example jupyter notebook.

to_html does not use pd.formats.format.DataFrameFormatter._to_str_columns and remains an issue.

Test cases were created.

@@ -1125,6 +1125,17 @@ def test_to_string_no_header(self):

self.assertEqual(df_s, expected)

def test_to_latex_specified_header(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move to pandas/tests/formats/test_to_latex (new)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry that was the wrong function name. It should be:
test_to_string_specified_header

@jreback
Copy link
Contributor

jreback commented Mar 2, 2017

so this has a bit of overlap with code in CSVFormatter._save_header. anything that you can combine would be helpful (e.g. maybe make a member function that can be called from both for example).

@jreback jreback added Enhancement IO LaTeX to_latex Output-Formatting __repr__ of pandas objects, to_string labels Mar 2, 2017
@jreback jreback changed the title Io formatting ENH: IO header formatting Mar 2, 2017
Copy link
Contributor Author

@mcocdawc mcocdawc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I understand it, that would not work very well with the abstract structure that is used at the moment.
The _save_header method diretly writes into the file without knowing about the values of other rows in the same column.
For a CSV file this is not a problem, because the string formatting of each row can be independent
from the other rows.
(Justification is simply not an issue for CSV files).
In the case of LaTeX or string tables however the formatting of the header row depends on the n-1 other rows.
The maximum columnwidth determines the justification of every row.
For this reason the _to_str_column method exists.

One could rewrite the CSVFormatter to use _to_str_column as well.
But I assume that the person who introduced a specific CSVFormatter did so for performance reasons.
It is unlikely that size and performance is a problem of tables exported for LaTeX,
but it could be one for CSV files.

I just speculate, so please correct me, if I am wrong.

What could be done also with the current logical structure is to rewrite to_html.

minimum=(self.col_space or 0),
adj=self.adj)
stringified.append(fmt_values)
elif has_aliases:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be worth noting that the only difference between the elif has_aliases: and else: conditional branches is the test for raising a ValueError and cheader = [self.header[i]] instead of cheader = str_columns[i].

It would be possible to write instead:

if not (has_aliases or self.header):
    # same code block
    pass
else:
    if has_aliases:
        if len(self.header) != len(self.columns):
            raise ValueError(('Writing %d cols but got %d aliases'
                              % (len(self.columns), len(self.header))))
    stringified = []
    for i, c in enumerate(frame):
        if has_aliases:
            cheader = [self.header[i]]
        else:
            cheader = str_columns[i]
          
        header_colwidth = max(self.col_space or 0,
                              *(self.adj.len(x) for x in cheader))
        fmt_values = self._format_col(i)
        fmt_values = _make_fixed_width(fmt_values, self.justify,
                                       minimum=header_colwidth,
                                       adj=self.adj)

        max_len = max(np.max([self.adj.len(x) for x in fmt_values]),
                      header_colwidth)
        cheader = self.adj.justify(cheader, max_len, mode=self.justify)
        stringified.append(cheader + fmt_values)

What do you prefer? Redundancy vs. readability.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mcocdawc readability!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aye, so I keep it as it is

@jreback
Copy link
Contributor

jreback commented Mar 2, 2017

fyi your travis job was cancled. master is breaking, so just wait a min while I fix (then rebase).

@jreback
Copy link
Contributor

jreback commented Mar 2, 2017

@mcocdawc about combining code. I merely want you to make a helper method or function (maybe _header_from_columns, which can be called from both _save_header and _str_columns as these seem to be doing a lot of the same things (in part)

@mcocdawc
Copy link
Contributor Author

mcocdawc commented Mar 2, 2017

This function _header_from_columns is then neither a method of the formatCSVFormatter nor of the format.DataFrameFormatter class alone.

So should it just reside as function in the format namespace and then I make a construction like this:

def f(self, x):
    return self.value * x

class example1:
    def __init__(self):
        self.value = 5
    def f(self, x):
        return f(self, x)

class example2:
    def __init__(self):
        self.value = 10
    def f(self, x):
        return f(self, x)

Or what is the preferred structure?

@jreback
Copy link
Contributor

jreback commented Mar 2, 2017

@mcocdawc you can call it as a function directly, no? It might be more convenient as a method (e.g. in DataFrameFormatter, because then its easy to call and has access to state (e.g. .header).

FYI you can rebase on master now and push

@mcocdawc
Copy link
Contributor Author

mcocdawc commented Mar 2, 2017

Ok, but if I define it as method in DataFrameFormatter that relies on the state of a DataFrameFormatter object. How could it be reused by a CSVFormatter object.

Should it be similar to this construct?:

class example3:
    def __init__(self, value):
        self.value = value
    def g(self):
        return self.value * 10

class example4:
    def __init__(self, value):
        self.g = example3(value).g

@jreback
Copy link
Contributor

jreback commented Mar 2, 2017

yeah this is a dumb problem. CSVFormatter should really inherit from DataFrameFormatter. So what you can do instead is aimply add a Mixin

e.g.

class FormatterMixin(object):
     def _my_shared_method(.....):
          .....

class DataFrameFormatter(...., FormatterMixin):
       ....

class CSVFormatter(....., FormatterMixin):
      ....

maybe they could be combined more. I would actually do this as a separate PR (merged before this one). To make this clean.

@mcocdawc
Copy link
Contributor Author

mcocdawc commented Mar 2, 2017

  1. Of course, a Mixin Class really looks like a clean solution implementationwise.
  2. Still there is the question of why CSVFormatter and DataFrameFormatter differ so much.
    Is this just legacy because there was only CSVFormatter in the beginning and DataFrameFormatter came later? Or is there a real performance reason why CSVFormatter writes directly. Because if "directly writing" is not required, one could easily include CSV into DataFrameFormatter.
  3. I am really interested into improving this code part, but I am a student in the middle of my exam period. So I cannot start doing such a project before the middle of April.
  4. The PR does not introduce any structural changes, but it is still better than the current state. So for the time being I would include it; but that is your decision.
  5. If you see flaws in code style..., I am happy to learn.

@jreback
Copy link
Contributor

jreback commented Mar 2, 2017

  1. Still there is the question of why CSVFormatter and DataFrameFormatter differ so much.
    Is this just legacy because there was only CSVFormatter in the beginning and DataFrameFormatter came later? Or is there a real performance reason why CSVFormatter writes directly. Because if "directly writing" is not required, one could easily include CSV into DataFrameFormatter.

I think just historical (should be fixed)

  1. I am really interested into improving this code part, but I am a student in the middle of my exam period. So I cannot start doing such a project before the middle of April.

no problem. please come back to help on it! (you can make an issue to describe what needs to be done so won't be lost)

@jreback
Copy link
Contributor

jreback commented Mar 2, 2017

you need to rebase on master and force push, you have picked up commits which are already there

something like

git fetch origin/master
git checkout thisbranch
git rebase -i origin/master
git push yourfork yourbranch -f

@mcocdawc
Copy link
Contributor Author

mcocdawc commented Mar 2, 2017

Sorry I don't completely understand the problem and the output of your commands.
My local branch for doing stuff is IO_formatting.
I typed:

git fetch origin master 
git checkout IO_formatting
git rebase -i origin/master

The output of the last command was:

The previous cherry-pick is now empty, possibly due to conflict resolution.
If you wish to commit it anyway, use:

    git commit --allow-empty

Otherwise, please use 'git reset'
interactive rebase in progress; onto 3f91d5a
Last commands done (10 commands done):
   pick f329179 renamed test_to_string_specified_header
   pick 522dcdd ENH: _to_str_columns accepts now optional header Hereby also to_latex and to_string accept optional headers
Next commands to do (3 remaining commands):
   pick 5df79b5 Added tests for to_string and to_latex
   pick 23f420f Added tests for to_latex
You are currently rebasing branch 'IO_formatting' on '3f91d5a'.

nothing to commit, working directory clean
Could not apply 522dcddbf470aaf9458d4429f6a60bfdb00950df... ENH: _to_str_columns accepts now optional header Hereby also to_latex and to_string accept optional headers

Do I have to cheryypick something or can I proceed with git push IO_formatting -f (it defaults to my origin which is my fork, doesn't it?)

@jreback
Copy link
Contributor

jreback commented Mar 2, 2017

hmm, maybe just try with git rebase origin/master. it should just fast-forward.

@codecov-io
Copy link

codecov-io commented Mar 2, 2017

Codecov Report

Merging #15548 into master will decrease coverage by -0.03%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #15548      +/-   ##
==========================================
- Coverage   91.06%   91.04%   -0.03%     
==========================================
  Files         137      137              
  Lines       49307    49314       +7     
==========================================
- Hits        44900    44896       -4     
- Misses       4407     4418      +11
Impacted Files Coverage Δ
pandas/core/frame.py 97.83% <100%> (-0.1%)
pandas/formats/format.py 95.35% <100%> (+0.01%)
pandas/io/gbq.py 25% <0%> (-58.34%)
pandas/tools/merge.py 91.78% <0%> (-0.35%)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0bf4532...2be1fb1. Read the comment docs.

@jreback
Copy link
Contributor

jreback commented Mar 2, 2017

can you add a whatsnew note. And update the .to_latex doc-string as needed.

@mcocdawc
Copy link
Contributor Author

mcocdawc commented Mar 2, 2017

  1. The header parameter for .to_latex comes from fmt.common_docstring.
    It is also used for .to_html and .to_string.
    Since .to_html is implemented in a way that it makes no use of ._to_str_columns
    I will create a wrong documentation for .to_html if I change the fmt.common_docstring.
    I could create a fmt.common_docstring_html which takes care of the special case of html.

  2. Where do I create the whatsnew notes correctly.

  3. I will create an Issue for the refactoring of the format module.

@jreback
Copy link
Contributor

jreback commented Mar 2, 2017

  1. The header parameter for .to_latex comes from fmt.common_docstring.
    It is also used for .to_html and .to_string.
    Since .to_html is implemented in a way that it makes no use of ._to_str_columns
    I will create a wrong documentation for .to_html if I change the fmt.common_docstring.
    I could create a fmt.common_docstring_html which takes care of the special case of html.

you can use replaceable parameters in the doc-string with Substitution

  1. Where do I create the whatsnew notes correctly.

doc/source/whatsnew/v0.20.0.txt in Enhancements

@@ -53,7 +53,7 @@
col_space : int, optional
the minimum width of each column
header : bool, optional
whether to print column labels, default True
%s
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use a named argument e.g. %{header}s

@@ -489,30 +489,48 @@ def _to_str_columns(self):
str_index = self._get_formatted_index(frame)
str_columns = self._get_formatted_column_labels(frame)

if self.header:
has_aliases = isinstance(self.header, (tuple, list, np.ndarray, Index))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use is_list_like

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just want to mention that the original reason why I wrote it in this "unpythonic" way was to have consistency with CSVFormatter.
It really seems as if CSVFormatter also could be refurbished.
But for the time being I just keep it in mind for when they are both transformed to using the Mixin class you mentioned.

@@ -489,30 +489,48 @@ def _to_str_columns(self):
str_index = self._get_formatted_index(frame)
str_columns = self._get_formatted_column_labels(frame)

if self.header:
has_aliases = isinstance(self.header, (tuple, list, np.ndarray, Index))
if not (has_aliases or self.header):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so you can change this whole if test around, something like

headers = self.columns
if is_list_like(self.header):
    # validate len(self.header) == len(self.columns)
    headers = self.header

# do the loop from below

@mcocdawc
Copy link
Contributor Author

mcocdawc commented Mar 6, 2017

Ok, I tried to make a clean rebase operation and it still does not work.
The bad thing is that I created an awfully cluttered history of snapshots.
My proposed solution is to create a new branch which is rebased on upstream/master from the beginning on.
I will stop this pull request then and make a new one for the new branch.

To summarise the results of your review:

  • format.py
    • rewrite _to_str_columns to use provided aliases in the header parameter
    • uses is_list_like for the conditional
    • provides substitution for the header in common_docstring
  • frame.py
    • make @Substitutions for the header parameter in the docstrings of to_latex, to_string and to_html
  • tests/formats
    • test_format.py: Make function test_to_string_specified_header
    • test_to_latex.py: Make function test_to_latex_specified_header
  • whatsnew/v0.20.0.txt
    • summarise changes

@jreback
Copy link
Contributor

jreback commented Mar 6, 2017

@mcocdawc it doesn't matter to make this super clean. The merge will squash it. Though I generally squash my own PR's as they are simpler (to rebase) and understand. Lots of commits are fine, people do things differently.

@jreback
Copy link
Contributor

jreback commented Mar 6, 2017

the key thing is to rebase on origin/master; that will put all of your commits on the top

mcocdawc and others added 7 commits March 6, 2017 08:21
closes pandas-dev#15475

Author: Ben Thayer <[email protected]>
Author: bthayer2365 <[email protected]>

Closes pandas-dev#15506 from bthayer2365/frozen-index and squashes the following commits:

428a1b3 [Ben Thayer] Added __iadd__ test, fixed whatsnew
84ba405 [Ben Thayer] Merge branch 'master' of github.com:pandas-dev/pandas into frozen-index
8dbde1e [Ben Thayer] Rebased to upstream/master
6f6c140 [Ben Thayer] Added docstrings, depricated __iadd__, changed __add__ to use self.union()
66b3b91 [Ben Thayer] Fixed issue number
3d6cee5 [Ben Thayer] Depricated __add__ in favor of union
ccd75c7 [Ben Thayer] Changed __sub__ to difference
cd7de26 [Ben Thayer] Added versionadded tag in docs and renamed test_inplace to test_inplace_add for consistency
0ea8d21 [Ben Thayer] Added __isub__ and groupby example to docs
79dd958 [Ben Thayer] Updated whatsnew to reflect changes
0fc7e19 [Ben Thayer] Removed whitespace
73564ab [Ben Thayer] Added FrozenList subtraction
fee7a7d [bthayer2365] Merge branch 'master' into frozen-index
6a2b48d [Ben Thayer] Added docstrings, depricated __iadd__, changed __add__ to use self.union()
2ab85cb [Ben Thayer] Fixed issue number
cb95089 [Ben Thayer] Depricated __add__ in favor of union
2e43849 [Ben Thayer] Changed __sub__ to difference
fdcfbbb [Ben Thayer] Added versionadded tag in docs and renamed test_inplace to test_inplace_add for consistency
2fad2f7 [Ben Thayer] Added __isub__ and groupby example to docs
cd73faa [Ben Thayer] Updated whatsnew to reflect changes
f6381a8 [Ben Thayer] Removed whitespace
ada7cda [Ben Thayer] Added FrozenList subtraction
@jreback
Copy link
Contributor

jreback commented Mar 6, 2017

@mcocdawc I fixed your rebase. if any comments left by @jorisvandenbossche

@mcocdawc
Copy link
Contributor Author

mcocdawc commented Mar 9, 2017

Thank you very much. Do I have to do now anything anymore for this commit?

@jorisvandenbossche jorisvandenbossche merged commit ae0a92a into pandas-dev:master Mar 9, 2017
@jorisvandenbossche
Copy link
Member

@mcocdawc Thanks!

@jorisvandenbossche jorisvandenbossche added this to the 0.20.0 milestone Mar 9, 2017
AnkurDedania pushed a commit to AnkurDedania/pandas that referenced this pull request Mar 21, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement IO LaTeX to_latex Output-Formatting __repr__ of pandas objects, to_string
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Header option for to_latex
5 participants