API: cut interval formatting #8595

fancychildren · 2014-10-21T13:26:30Z

It would be nice to have a number in front of every label. Putting a number like 00, 01, 02 in front of each label would make the labels sort in bin order.

Lib\site-packages\pandas\tools\tile.py

def _format_levels(bins, prec, right=True,
                   include_lowest=False):
    fmt = lambda v: _format_label(v, precision=prec)
    levels = []
    for i, (a, b) in enumerate(zip(bins, bins[1:])):
        fa, fb = fmt(a), fmt(b)

        if right and a != b and fa == fb:
            raise ValueError('precision too low')

        if right:
            # the lowest bin is closed on the left when include_lowest is set
            left = '[' if include_lowest and i == 0 else '('
            interval = '%s%s, %s]' % (left, fa, fb)
        else:
            interval = '[%s, %s)' % (fa, fb)

        # zero-padded bin number so the labels sort in bin order as strings
        levels.append('%02d: %s' % (i, interval))

    return levels
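For comparison, the same numbered labels can probably be had without patching tile.py, by building them up front and passing them through the existing labels argument of cut. A minimal sketch; the helper name numbered_labels is made up for illustration and is not part of pandas:

```python
import pandas as pd


def numbered_labels(bins, right=True):
    """Build labels like '00: (0, 50000]' for consecutive bin edges.

    Hypothetical helper, not part of pandas.
    """
    left, close = ('(', ']') if right else ('[', ')')
    return ['%02d: %s%s, %s%s' % (i, left, a, b, close)
            for i, (a, b) in enumerate(zip(bins, bins[1:]))]


bins = [0, 50000, 100000, 200000, 400000, 9999999]
upb = pd.Series([25000, 150000, 75000, 500000])
buckets = pd.cut(upb, bins=bins, labels=numbered_labels(bins))

# Plain string sorting of these labels now matches bin order.
print(sorted(buckets.astype(str)))
```

The two-digit prefix only keeps string order and bin order in agreement up to 100 bins.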


jreback · 2014-10-21T13:35:50Z

This is going to be turned into a Categorical, so ordering will happen automatically. Interested in doing this for 0.15.1?
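A small sketch of what that ordering looks like, assuming a pandas version in which cut returns an ordered categorical (the sample values are made up):

```python
import pandas as pd

df = pd.DataFrame({'upb': [450000, 25000, 150000, 75000, 60000, 300000]})
df['currLoanSize'] = pd.cut(
    df['upb'], bins=[0, 50000, 100000, 200000, 400000, 9999999])

# The bucket column is an ordered categorical, so the groupby result
# follows bin order rather than the alphabetical order of the labels.
counts = df.groupby('currLoanSize', observed=False).size()
print(counts)
```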

jreback · 2014-10-21T13:36:23Z

@fancychildren can you post a full example (one that doesn't order correctly)?

fancychildren · 2014-10-21T13:43:49Z

Here is the example. After I create the cuts and apply the summary to the dataframe, the currLoanSize buckets are ordered as strings instead of by the numeric value of the lower boundary.

After I tweak the code, it appears like this

Not sure if I am using it incorrectly, but it would be nice to be able to order by the value of the lower boundary while keeping the label.

jreback · 2014-10-21T13:45:29Z

Can you do a programmatic example, e.g. df = DataFrame....... Eventually we'll turn this into a test.

fancychildren · 2014-10-21T14:12:37Z

from io import StringIO

import numpy as np
from pandas import Series, cut, read_csv


data = StringIO('''upb coupon
0.00    3.00
25000.00    3.00
50000.00    3.00
75000.00    3.00
100000.00   3.00
125000.00   3.00
150000.00   3.00
175000.00   3.00
200000.00   3.00
225000.00   3.00
250000.00   3.00
275000.00   3.00
300000.00   3.00
325000.00   3.00
350000.00   3.00
375000.00   3.00
400000.00   3.00
425000.00   3.00
450000.00   3.00
475000.00   3.00
500000.00   3.00
525000.00   3.00
550000.00   3.00
575000.00   3.00
600000.00   3.00
625000.00   3.00
650000.00   3.00
675000.00   3.00
700000.00   3.00
725000.00   3.00
750000.00   3.00
775000.00   3.00
800000.00   3.00
825000.00   3.00
850000.00   3.00
875000.00   3.00
900000.00   3.00
925000.00   3.00
950000.00   3.00
975000.00   3.00
1000000.00  3.00
''')

# the columns as pasted here are separated by runs of whitespace, not tabs
df = read_csv(data, sep=r'\s+')

df['currLoanSize'] = cut(df['upb'], bins=[0,50000,100000,200000,400000,9999999])
df['count_pct'] = 1.0/len(df['coupon'])

def f_summary(group):
    return Series({'counts': len(group['coupon']),
                   'count%': np.sum(group['count_pct']),
                   },
                  index = ['counts', 'count%']
                  )

print(df.groupby('currLoanSize').apply(f_summary))

8one6 · 2015-11-19T20:05:39Z

I'd offer one potential direction to go with this. I think it would be great to find a way to allow users to specify the formatting of the bin labels returned by cut. As things stand now, I need to call cut twice to get the result I want. First I call cut to find the bin edges. Then I format the bin edges into appropriately formatted labels. Then finally I call cut again specifying the labels parameter.

One approach would be to allow labels to take a function as its argument. If it does, then it could be presumed to accept the list of bin edges and return a list of bin labels. Or, alternatively, the function could expect the lower and upper limits of a bin and return the label for that bin.

Here's the situation I've got in mind:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

np.random.seed(0)
df = pd.DataFrame(np.random.randn(1000, 2) * 3, columns=list('xy'))
df['y'] = (df['x'] * 3) + 5 + np.random.randn(1000) * 2
df['x_bin'], edges = pd.cut(df['x'], bins=5, retbins=True)
nice_names = ['{b:0.1f} : {t:0.1f}'.format(b=edges[i], t=edges[i+1]) for i in range(len(edges)-1)]
df['x_bin_better'] = pd.cut(df['x'], bins=5, labels=nice_names)

fig, axes = plt.subplots(3, 1, figsize=(10, 10))
axes[0].plot(df['x'], df['y'], linestyle='None', marker='o', mew=1, alpha=0.25)
sns.violinplot(x='x_bin', y='y', data=df, ax=axes[1], scale='count', palette='Blues')
sns.violinplot(x='x_bin_better', y='y', data=df, ax=axes[2], scale='count', palette='Reds')
fig.tight_layout()

To be clear, I'm not suggesting the format implemented above is better (I think the default is quite reasonable). Instead I'm suggesting we make it easier for users to override that format when appropriate.

jreback · 2015-11-19T20:16:09Z

This is pretty much solved by #7640, after which pd.cut will return an IntervalIndex that can then be easily formatted. 📌 to @shoyer

8one6 · 2015-11-19T20:23:33Z

Would you be open to accepting a pull request for adding the ability for labels to accept a function while we wait for IntervalIndex to be fleshed out? If so, do you prefer the function operate on the list of endpoints or on a pair of endpoints?

jreback · 2015-11-19T20:45:43Z

sure

I would have a 2 arg callable returning a string

shoyer · 2015-11-20T01:18:08Z

Yeah, I like the idea of accepting a function that acts on scalars for the left and right bounds. Eventually we could add this option as a method on IntervalIndex.
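Until something like that exists, the two-pass workaround described earlier in the thread can be wrapped so that the caller only supplies the two-argument callable. This is a rough sketch, not a proposed pandas API; the name cut_with_label_func and its signature are assumptions made for illustration:

```python
import numpy as np
import pandas as pd


def cut_with_label_func(x, bins, label_func, **kwargs):
    """Hypothetical wrapper: label each bin with label_func(left, right)."""
    # The first pass is only used to discover the bin edges.
    _, edges = pd.cut(x, bins=bins, retbins=True, **kwargs)
    labels = [label_func(lo, hi) for lo, hi in zip(edges[:-1], edges[1:])]
    # The second pass reuses the exact edges so labels line up with bins.
    return pd.cut(x, bins=edges, labels=labels, **kwargs)


rng = np.random.RandomState(0)
x = pd.Series(rng.randn(1000) * 3)
binned = cut_with_label_func(
    x, 5, lambda lo, hi: '{:0.1f} : {:0.1f}'.format(lo, hi))
print(binned.value_counts(sort=False))
```

One caveat with a callable returning strings: if the format is too coarse, two bins can end up with the same label, which cut rejects for ordered output.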

wesm · 2018-07-06T22:19:16Z

Now that pd.cut returns Categorical this can be resolved
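As an illustration of what this allows (a sketch assuming a pandas version where the categories of the cut result are Interval objects; the label format is arbitrary), the labels can be rebuilt after the fact without a second call to cut:

```python
import pandas as pd

upb = pd.Series([25000, 150000, 75000, 500000])
buckets = pd.cut(upb, bins=[0, 50000, 100000, 200000, 400000, 9999999])

# Each category is an Interval exposing .left and .right, so a callable
# passed to rename_categories can format the label from the bounds.
pretty = buckets.cat.rename_categories(
    lambda iv: '{:,.0f} to {:,.0f}'.format(float(iv.left), float(iv.right)))

# Sorting follows the category (bin) order, not the label text.
print(pretty.sort_values().tolist())
```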

jreback added the Categorical label Oct 21, 2014

jreback added this to the 0.15.1 milestone Oct 21, 2014

jreback added the API Design and Reshaping labels Oct 21, 2014

jreback modified the milestones: 0.16.0, 0.15.2 Nov 29, 2014

jreback added the Interval label Nov 29, 2014

jreback changed the title ~~pandas\tools\title.py~~ API: cut interval formatting Nov 29, 2014

jreback mentioned this issue Nov 29, 2014

API/ENH: create Interval class #8625

Closed

jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015

wesm closed this as completed Jul 6, 2018
