
API: cut interval formatting #8595


Closed
fancychildren opened this issue Oct 21, 2014 · 11 comments
Labels
API Design; Categorical (Categorical Data Type); Interval (Interval data type); Reshaping (Concat, Merge/Join, Stack/Unstack, Explode)

Comments

@fancychildren

It would be nice to have a number in front of every label, e.g. 00, 01, 02, so that the labels sort in the appropriate order. Here is my tweak to:

Lib\site-packages\pandas\tools\tile.py

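def _format_levels(bins, prec, right=True,
                   include_lowest=False):
    fmt = lambda v: _format_label(v, precision=prec)
    if right:
        levels = []
        # Prefix each label with a zero-padded bin number so that the
        # labels sort in bin order even when compared as strings.
        for cnter, (a, b) in enumerate(zip(bins, bins[1:])):
            fa, fb = fmt(a), fmt(b)

            if a != b and fa == fb:
                raise ValueError('precision too low')

            levels.append('%02d: (%s, %s]' % (cnter, fa, fb))

        if include_lowest:
            # Swap only the opening bracket; slicing off the first
            # character would clobber the numeric prefix instead.
            levels[0] = levels[0].replace('(', '[', 1)
    else:
        # Left-closed labels are left unnumbered, as in the original tweak.
        levels = ['[%s, %s)' % (fmt(a), fmt(b))
                  for a, b in zip(bins, bins[1:])]

    return levels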
@jreback
Contributor

jreback commented Oct 21, 2014

This is going to be turned into a Categorical, so ordering will happen automatically. Interested in doing this for 0.15.1?
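
As a rough illustration of what ordering "happening automatically" looks like, here is a minimal sketch. It assumes a recent pandas release, where `pd.cut` returns an ordered categorical; the 0.15-era behaviour discussed in this thread may differ. The sample values and variable names are made up for the example.

```python
import pandas as pd

# Made-up sample values, deliberately out of order.
upb = pd.Series([450000, 75000, 25000, 150000, 300000])

# The bins from this issue. The categories of the result follow bin order.
buckets = pd.cut(upb, bins=[0, 50000, 100000, 200000, 400000, 9999999])

# Category labels rendered as strings, in category (bin) order.
in_bin_order = [str(c) for c in buckets.cat.categories]

# The same labels sorted as plain strings, which is the ordering
# this issue complains about.
as_strings = sorted(in_bin_order)

print(in_bin_order)
print(as_strings)

# Sorting the categorical itself follows bin order, not string order.
print(buckets.sort_values().tolist())
```

Because the ordering lives in the categories rather than in the label text, the labels need no numeric prefix to sort correctly.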

@jreback jreback added the Categorical Categorical Data Type label Oct 21, 2014
@jreback jreback added this to the 0.15.1 milestone Oct 21, 2014
@jreback jreback added API Design Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Oct 21, 2014
@jreback
Contributor

jreback commented Oct 21, 2014

@fancychildren can you post a full example (one that doesn't order correctly)?

@fancychildren
Author

Here is the example. After I create the cuts and apply a summary to the DataFrame, the currLoanSize buckets are ordered as strings rather than by the numeric value of their lower boundary.

image

After I tweak the code, it appears like this:

image

I'm not sure whether I'm using it incorrectly, but it would be nice to be able to order by the value of the lower boundary while keeping the label.
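
For readers without the screenshots, the string-versus-numeric ordering can be reproduced in plain Python. This is only a sketch: the helper name `lower_bound` is hypothetical, and the labels are written by hand in the `(a, b]` format used by the code above rather than taken from real `cut` output.

```python
# Hand-written labels for the bins used in this issue.
labels = ['(0, 50000]', '(50000, 100000]', '(100000, 200000]',
          '(200000, 400000]', '(400000, 9999999]']


def lower_bound(label):
    """Return the numeric lower edge of a label such as '(0, 50000]'."""
    # Drop the opening bracket, then take the text before the comma.
    return float(label[1:].split(',')[0])


# Plain string sort: '(50000, ...' lands after '(400000, ...'.
by_string = sorted(labels)

# Sorting on the parsed lower edge keeps the labels but fixes the order.
by_lower_bound = sorted(labels, key=lower_bound)

print(by_string)
print(by_lower_bound)
```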

@jreback
Contributor

jreback commented Oct 21, 2014

Can you do a programmatic example, e.g. df = DataFrame.......? Eventually we'll turn this into a test.

@fancychildren
Author

from io import StringIO

import numpy as np
import pandas as pd


data = StringIO('''upb coupon
0.00    3.00
25000.00    3.00
50000.00    3.00
75000.00    3.00
100000.00   3.00
125000.00   3.00
150000.00   3.00
175000.00   3.00
200000.00   3.00
225000.00   3.00
250000.00   3.00
275000.00   3.00
300000.00   3.00
325000.00   3.00
350000.00   3.00
375000.00   3.00
400000.00   3.00
425000.00   3.00
450000.00   3.00
475000.00   3.00
500000.00   3.00
525000.00   3.00
550000.00   3.00
575000.00   3.00
600000.00   3.00
625000.00   3.00
650000.00   3.00
675000.00   3.00
700000.00   3.00
725000.00   3.00
750000.00   3.00
775000.00   3.00
800000.00   3.00
825000.00   3.00
850000.00   3.00
875000.00   3.00
900000.00   3.00
925000.00   3.00
950000.00   3.00
975000.00   3.00
1000000.00  3.00
''')

# Split on any run of whitespace so the data parses whether the
# columns are separated by tabs or by spaces.
df = pd.read_csv(data, sep=r'\s+')

# Bucket each loan by its balance, then give every row an equal share of 1.0.
df['currLoanSize'] = pd.cut(df['upb'], bins=[0, 50000, 100000, 200000, 400000, 9999999])
df['count_pct'] = 1.0 / len(df['coupon'])


def f_summary(group):
    # Per-bucket row count and share of all rows.
    return pd.Series({'counts': len(group['coupon']),
                      'count%': np.sum(group['count_pct'])},
                     index=['counts', 'count%'])


print(df.groupby('currLoanSize').apply(f_summary))

image

@jreback jreback modified the milestones: 0.16.0, 0.15.2 Nov 29, 2014
@jreback jreback added the Interval Interval data type label Nov 29, 2014
@jreback jreback changed the title from "pandas\tools\title.py" to "API: cut interval formatting" Nov 29, 2014
@jreback jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015
@8one6

8one6 commented Nov 19, 2015

I'd offer one potential direction to go with this. I think it would be great to find a way to allow users to specify the formatting of the bin labels returned by cut. As things stand now, I need to call cut twice to get the result I want. First I call cut to find the bin edges. Then I format the bin edges into appropriately formatted labels. Then finally I call cut again specifying the labels parameter.

One approach would be to allow labels to take a function as its argument. If it does, then it could be presumed to accept the list of bin edges and return a list of bin labels. Or, alternatively, the function could expect the lower and upper limits of a bin and return the label for that bin.

Here's the situation I've got in mind:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

np.random.seed(0)
df = pd.DataFrame(np.random.randn(1000, 2) * 3, columns=list('xy'))
df['y'] = (df['x'] * 3) + 5 + np.random.randn(1000) * 2

# First pass: let cut choose the bins and hand back the edges.
df['x_bin'], edges = pd.cut(df['x'], bins=5, retbins=True)
# Format each pair of adjacent edges into a label.
nice_names = ['{b:0.1f} : {t:0.1f}'.format(b=b, t=t) for b, t in zip(edges[:-1], edges[1:])]
# Second pass: cut again, this time supplying the formatted labels.
df['x_bin_better'] = pd.cut(df['x'], bins=5, labels=nice_names)

fig, axes = plt.subplots(3, 1, figsize=(10, 10))
axes[0].plot(df['x'], df['y'], linestyle='None', marker='o', mew=1, alpha=0.25)
# Pass x and y by keyword; newer seaborn releases no longer accept them positionally.
sns.violinplot(x='x_bin', y='y', data=df, ax=axes[1], scale='count', palette='Blues')
sns.violinplot(x='x_bin_better', y='y', data=df, ax=axes[2], scale='count', palette='Reds')
fig.tight_layout()

image

To be clear, I'm not suggesting the format implemented above is better (I think the default is quite reasonable). Instead I'm suggesting we make it easier for users to override that format when appropriate.

@jreback
Contributor

jreback commented Nov 19, 2015

This is pretty much solved by #7640: once pd.cut returns an IntervalIndex, it can easily be formatted. 📌 to @shoyer
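
For reference, here is a sketch of what formatting the intervals can look like when `pd.cut` hands back interval-valued categories. It assumes a recent pandas release. The label format and the names `fmt`, `new_labels` and `relabelled` are illustrative choices, not something proposed in this thread.

```python
import numpy as np
import pandas as pd

# Evenly spaced sample data so the example is deterministic.
x = pd.Series(np.linspace(-3, 3, 50))

# The categories of the result are pandas Interval objects.
binned = pd.cut(x, bins=5)


def fmt(interval):
    """Render a pandas Interval as 'left : right' with one decimal."""
    return '{:0.1f} : {:0.1f}'.format(interval.left, interval.right)


# Build one label per interval category, then swap the labels in.
new_labels = [fmt(iv) for iv in binned.cat.categories]
relabelled = binned.cat.rename_categories(new_labels)

print(relabelled.cat.categories.tolist())
```

Renaming the categories changes only the labels; the bin assignment and the category order stay as `cut` produced them.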

@8one6

8one6 commented Nov 19, 2015

Would you be open to accepting a pull request for adding the ability for labels to accept a function while we wait for IntervalIndex to be fleshed out? If so, do you prefer the function operate on the list of endpoints or on a pair of endpoints?

@jreback
Contributor

jreback commented Nov 19, 2015

Sure.

I would have a two-argument callable that returns a string.

@shoyer
Member

shoyer commented Nov 20, 2015

Yeah, I like the idea of accepting a function that acts on scalars for the left and right bounds. Eventually we could add this option as a method on IntervalIndex.
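
To make the proposal concrete, here is one way a two-argument callable could be emulated with a small wrapper. `cut_with_label_func` and `label_func` are hypothetical names for this sketch; `pd.cut` itself is not assumed to accept a callable for `labels`. The wrapper simply performs the two-pass approach described earlier in the thread.

```python
import numpy as np
import pandas as pd


def cut_with_label_func(x, bins, label_func):
    """Bin x with pd.cut, labelling each bin via label_func(left, right).

    Hypothetical helper: it calls pd.cut once to obtain the bin edges
    and a second time with the generated labels.
    """
    # First pass: only the edges are needed.
    _, edges = pd.cut(x, bins=bins, retbins=True)
    # One label per pair of adjacent edges.
    labels = [label_func(lo, hi) for lo, hi in zip(edges[:-1], edges[1:])]
    # Second pass: reuse the exact edges so both passes bin identically.
    return pd.cut(x, bins=edges, labels=labels)


x = pd.Series(np.linspace(-3, 3, 50))
binned = cut_with_label_func(
    x, 5, lambda lo, hi: '{:0.1f} : {:0.1f}'.format(lo, hi))
print(binned.cat.categories.tolist())
```

A callable that acts on the scalar bounds keeps the formatting logic independent of how many bins there are, which is the property the per-bin variant has over a function that takes the full list of edges.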

@wesm
Member

wesm commented Jul 6, 2018

Now that pd.cut returns Categorical this can be resolved

@wesm wesm closed this as completed Jul 6, 2018