ENH: add Series.histogram wrapping numpy.histogram #23710

Closed
bluesquall opened this issue Nov 14, 2018 · 5 comments

Comments

@bluesquall

Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd
np.random.seed(3)
s = pd.Series(np.random.normal(0, 1, 100))
# What I'd like to be able to do:
h, b = s.histogram(20)
h
# array([ 1,  1,  1,  1,  3,  3,  4, 10,  7, 11, 11,  7,  7,  5,  9,  7, 3, 2,  4,  3])
len(b)
# 21
# Or, using a numpy automated bin selection algorithm:
ah, ab = s.histogram(bins='fd')
ah
# array([ 2,  5, 13, 22, 24, 15, 12,  7])
ab
# array([-2.91573775, -2.28150187, -1.64726598, -1.01303009, -0.3787942 , 0.25544168,  0.88967757,  1.52391346,  2.15814934])

Problem description

This would be a lightweight wrapper around np.histogram, much as Series.hist seems to be a lightweight wrapper around matplotlib.pyplot.hist (at least from a user's perspective).

  • It differs from Series.hist in that it returns the histogram counts and bin edges, rather than going straight to a plot.
  • It also allows users to leverage the automatic binning algorithms and the density keyword from np.histogram.
  • But it may not work well with missing data, or with non-numerical series. (I'm happy to pull that thread further, if there's interest.)
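A minimal sketch of what such a wrapper could look like, written as a free function rather than a Series method. The name series_histogram is hypothetical and not part of pandas; dropping missing values before calling np.histogram is an assumption about the desired behaviour, and non-numeric series are not handled.

```python
import numpy as np
import pandas as pd


def series_histogram(series, bins=10, density=False):
    """Hypothetical helper (not a pandas API): counts and bin edges of a numeric Series."""
    # Assumption: missing values are simply dropped before binning.
    values = series.dropna().to_numpy(dtype=float)
    # bins may be an int, a sequence of edges, or a numpy rule name such as 'fd'.
    return np.histogram(values, bins=bins, density=density)


s = pd.Series([0.1, 0.4, np.nan, 1.2, 2.5, 2.6, None])
counts, edges = series_histogram(s, bins=3)
```

The return value mirrors np.histogram: an array of counts and an array of bin edges that is one element longer.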

In comparison, using pd.cut:

hb = pd.cut(s, 20).value_counts(sort=False)
# or
edges = np.arange(-3, 3, 0.5)
hbe = pd.cut(s, bins=edges).value_counts(sort=False)
# or
hbesi = pd.cut(s, bins=edges).value_counts().sort_index()
# but now plotting leaves an empty x-axis
hbesi.plot()
# or one with categorical labels, even though the categories are numerical intervals
hbesi.plot(kind='bar')
  • requires more typing and function calls
  • requires the user to sort by index afterward, or to remember to tell value_counts not to sort
  • returns a series with a categorical index, which leads to a categorical axis when you eventually plot the data (I'm new to pandas, and still haven't figured out how I'm supposed to change a Categorical Index to regular floats. For my immediate application, I can just use pre-defined bin edges and keep that array around, but I'd like to be able to use automated binning in the future.)
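One possible way around the categorical index, sketched under the assumption that numeric bin centers are an acceptable replacement for the interval labels: ask pd.cut for the bin edges with retbins=True and rebuild the index from them. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

np.random.seed(3)
s = pd.Series(np.random.normal(0, 1, 100))

# retbins=True also returns the array of bin edges used for the cut.
binned, edges = pd.cut(s, 20, retbins=True)

# sort_index() puts the counts back in bin order.
counts = binned.value_counts().sort_index()

# Replace the interval labels with the numeric center of each bin.
centers = 0.5 * (edges[1:] + edges[:-1])
hist = pd.Series(counts.to_numpy(), index=centers)
```

The resulting series has a float index, so hist.plot() draws against a numeric x-axis.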

Finally, using np.histogram is way faster:

%timeit hbesi = pd.cut(x,edges).value_counts().sort_index()
# 2.76 ms ± 8.75 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit hbe = pd.cut(x,edges).value_counts(sort='False')
# 2.58 ms ± 34.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit h, b = np.histogram(x,edges)
# 33.2 µs ± 46.7 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
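One caveat when reading the second timing: sort='False' passes a non-empty string, which Python treats as true, so the boolean False is what keeps the counts in bin order. A small sketch with made-up data (x and edges here are stand-ins, not the arrays that were timed):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
x = pd.Series(rng.normal(0, 1, 100))
edges = np.arange(-3, 3.5, 0.5)

# A non-empty string is truthy, whatever it spells.
string_flag_is_truthy = bool('False')

binned = pd.cut(x, edges)
# With the boolean False, the counts come back in bin order.
by_bin = binned.value_counts(sort=False)
```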

Related issues and pull requests:

#23580, #3945, #4502, #265

Output of pd.show_versions()

pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.7.1.final.0
python-bits: 64
OS: Linux
OS-release: 4.18.16-arch1-1-ARCH
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.4
pytest: None
pip: 18.0
setuptools: 40.5.0
Cython: None
numpy: 1.15.4
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.5.0
sphinx: None
patsy: 0.5.0
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: None
tables: 3.4.4
numexpr: 2.6.8
feather: None
matplotlib: 3.0.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: 1.2.13
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: 0.7.0

@jreback
Contributor

jreback commented Nov 15, 2018

Showing that a numpy routine is 'way faster' is not a great argument, as it is often the case that they don't handle dtypes fully (datetimes and such), and they certainly don't handle the pandas dtypes.

So I'm not really sure of your intent here.

Can you elaborate on an actual use case?

We care way more about things working consistently across the board, having flexibility in the APIs, and just working.

@bluesquall
Author

bluesquall commented Nov 15, 2018

My intent is to get the histogram counts, without the plot, sorted by index (like a histogram would be).

I'd prefer to do this using a method to the Series class, without having to remember to set the sorting, and without having to call a chain of functions. I'd imagine other people could use this in their workflows, but perhaps I'm the only one using histograms in intermediate steps of my processing.

I'm open to implementing this using pd.cut().value_counts(sort=False) instead of np.histogram, especially if it would allay any concerns about working consistently across data types.

# somewhere in Series...
def histogram(self, bins=10, **kwargs):  # other keywords elided
    # sort must be the boolean False; the string 'False' is truthy
    return pd.cut(self, bins).value_counts(sort=False)

But my main objections to using pd.cut().value_counts(sort=False) are:

  • that it returns a categorical index even when the series is continuous and numerical,
  • that it does not allow me to use the auto-binning methods in np.histogram, and
  • that it doesn't provide the density keyword

Perhaps the compromise would be to use pd.cut().value_counts(sort=False) unless the bins argument is one of the auto-binning methods, the density keyword is true, or some other keyword is set.
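A rough sketch of that compromise as a free function. The name histogram_compromise is hypothetical, and only the bins and density cases are covered; note that the two branches return different types, which a real implementation would need to reconcile.

```python
import numpy as np
import pandas as pd


def histogram_compromise(series, bins=10, density=False):
    """Hypothetical dispatch between pd.cut and np.histogram (not a pandas API)."""
    if isinstance(bins, str) or density:
        # Auto-binning rules and density are only available through numpy.
        values = series.dropna().to_numpy(dtype=float)
        return np.histogram(values, bins=bins, density=density)
    # Otherwise stay within pandas; sort=False keeps the counts in bin order.
    return pd.cut(series, bins).value_counts(sort=False)


np.random.seed(3)
s = pd.Series(np.random.normal(0, 1, 100))
via_pandas = histogram_compromise(s, bins=20)
via_numpy = histogram_compromise(s, bins='fd')
```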

My two present use cases involve:

  • adding histograms of the same variable from different logs to give me the histogram of the joined logs (instead of having to concatenate all the logs before doing the histogram)
  • plotting histograms of several related series stacked together as a pcolor heatmap

I'll work up some dummy data for illustrative purposes if needed. Another use case pops up when you want to look at the difference between two histograms, for example, or any other time when the histogram counts are an intermediate part of a calculation.
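As a rough illustration of these use cases with plain np.histogram: the logs below are synthetic stand-ins (their sizes and distribution are assumptions), binned on shared edges so that the counts can be stacked, summed, or subtracted.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic logs of different lengths covering a comparable range.
logs = [pd.Series(rng.normal(500, 150, n)) for n in (300, 450, 600)]
edges = np.arange(0, 1000, 100)

# One row of counts per log, all on the same bin edges.
stacked = np.vstack([np.histogram(log, edges)[0] for log in logs])

# Histogram of the joined logs, without concatenating the raw data.
combined = stacked.sum(axis=0)

# Bin-by-bin difference between two of the logs.
difference = stacked[0] - stacked[1]
```

The stacked array is already in the shape a heatmap expects; with matplotlib imported as plt, something like plt.pcolormesh(edges, np.arange(len(logs) + 1), stacked) should draw one row per log.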

I can keep going on my own work using a local method or subclassing Series, but it struck me as something that others would find useful if it was there, so I submitted it as #23580.

@bluesquall
Author

You could also think of my first use case as an extension of a stacked bar graph, if you want, but right now I don't need to differentiate the logs once the histograms are joined together.

Here's some dummy data, a bit simplified, but hopefully illustrating the point:

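import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# length of each log in minutes, so the series differ in length
minutes = np.floor(np.random.normal(60, 20, 12)).astype(int)
# assumption: the definition of the logs is missing from the post, so this
# list of series (one sample per second, values within the bin range) is a stand-in
s = [pd.Series(np.random.normal(500, 150, max(m, 1) * 60)) for m in minutes]
edges = np.arange(0, 1000, 100)
centers = 0.5 * (edges[1:] + edges[:-1])
# accumulate the histogram of the joined logs one log at a time
h, b = np.histogram(s[0], edges)
for _s in s[1:]:
    _h, _b = np.histogram(_s, edges)
    h += _h
plt.bar(centers, h, width=np.diff(edges))

# and if you want to look at it a slightly different way:
for _s in s:
    _h, _b = np.histogram(_s, edges)
    plt.bar(centers, _h, width=np.diff(edges), color='g', alpha=0.05)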

One key consideration that I made sure to include in this dummy data is that the lengths of each series are not the same, but the range of the variable x is comparable across series.

@mroeschke
Member

Thanks for the request, but it appears the core team or community hasn't been interested in this enhancement in a while, so closing. Can reopen if there's renewed interest.

@bluesquall
Author

If anyone else finds it useful to have access to the actual histogram data as part of a processing pipeline, instead of going straight to a plot, please comment.

I'd be happy to re-open this and to revise it.
