ENH: add Series.histogram wrapping numpy.histogram #23710

bluesquall · 2018-11-14T23:17:25Z

Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd
np.random.seed(3)
s = pd.Series(np.random.normal(0, 1, 100))
# What I'd like to be able to do:
h, b = s.histogram(20)
h
# array([ 1,  1,  1,  1,  3,  3,  4, 10,  7, 11, 11,  7,  7,  5,  9,  7, 3, 2,  4,  3])
len(b)
# 21
# Or, using a numpy automated bin selection algorithm:
ah, ab = s.histogram(bins='fd')
ah
# array([ 2,  5, 13, 22, 24, 15, 12,  7])
ab
# array([-2.91573775, -2.28150187, -1.64726598, -1.01303009, -0.3787942 , 0.25544168,  0.88967757,  1.52391346,  2.15814934])

Problem description

The proposed Series.histogram would be a lightweight wrapper around np.histogram, much as Series.hist appears to be a lightweight wrapper around matplotlib.pyplot.hist (at least from a user's perspective).

It differs from Series.hist in that it returns the histogram counts and bin edges, rather than going straight to a plot.
It also allows users to leverage the automatic binning algorithms and the density keyword from np.histogram.
But it may not work well with missing data, or with non-numerical series. (I'm happy to pull that thread further, if there's interest.)
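For reference, the requested behavior can be approximated today with a free function. This is only a sketch: `series_histogram` is a hypothetical name, not a pandas API, and dropping missing values before binning is an assumption about how NaN should be handled.

```python
import numpy as np
import pandas as pd


def series_histogram(series, bins=10, density=False):
    """Return (counts, bin_edges) for a numeric Series via np.histogram.

    Hypothetical helper, not part of pandas. Missing values are dropped
    first (an assumption), because np.histogram cannot infer a finite
    range from data that contains NaN.
    """
    values = series.dropna().to_numpy()
    return np.histogram(values, bins=bins, density=density)


s = pd.Series([0.0, 0.5, 1.0, np.nan, 2.0])
counts, edges = series_histogram(s, bins=2)
print(counts)  # counts per bin
print(edges)   # bin edges, one more entry than there are bins
```

Non-numerical series are not handled here; that question is left open, as in the description above.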

In comparison, using pd.cut:

hb = pd.cut(s, 20).value_counts(sort=False)
# or
edges = np.arange(-3, 3,0.5)
hbe = pd.cut(s, bins=edges).value_counts(sort=False)
# or
hbesi = pd.cut(s, bins=edges).value_counts().sort_index()
# but now plotting leaves an empty x-axis
hbesi.plot()
# or one with categorical labels, even though the categories are numerical intervals
hbesi.plot(kind='bar')

requires more typing and function calls
requires the user to sort by index afterward, or to remember to tell value_counts not to sort
returns a series with a categorical index, which leads to a categorical axis when you eventually plot the data (I'm new to pandas, and still haven't figured out how I'm supposed to change a Categorical Index to regular floats. For my immediate application, I can just use pre-defined bin edges and keep that array around, but I'd like to be able to use automated binning in the future.)
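One possible way to get a numeric axis out of the pd.cut result is to replace each interval in the index with its midpoint. The sketch below assumes the counts come from pd.cut(...).value_counts(), so the index holds pd.Interval objects; `with_bin_centers` is a hypothetical helper name, not a pandas function.

```python
import numpy as np
import pandas as pd


def with_bin_centers(counts):
    """Replace an interval-valued index with float bin centers.

    Hypothetical helper: `counts` is assumed to be the result of
    pd.cut(...).value_counts(), whose index holds pd.Interval objects.
    """
    centers = np.array([interval.mid for interval in counts.index], dtype=float)
    return pd.Series(counts.to_numpy(), index=pd.Index(centers, name="bin_center"))


s = pd.Series([0.1, 0.2, 1.5, 2.5])
edges = [0, 1, 2, 3]
binned = pd.cut(s, bins=edges).value_counts().sort_index()
numeric = with_bin_centers(binned)
print(numeric)  # counts indexed by float bin centers
```

The bin edges themselves are lost in this conversion, so it suits plotting more than further binning.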

Finally, using np.histogram is way faster:

%timeit hbesi = pd.cut(x,edges).value_counts().sort_index()
# 2.76 ms ± 8.75 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# note: sort='False' passes a non-empty string, which is truthy, so this call still sorts by count
%timeit hbe = pd.cut(x,edges).value_counts(sort='False')
# 2.58 ms ± 34.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit h, b = np.histogram(x,edges)
# 33.2 µs ± 46.7 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Related issues and pull requests:

#23580, #3945, #4502, #265

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.7.1.final.0
python-bits: 64
OS: Linux
OS-release: 4.18.16-arch1-1-ARCH
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.4
pytest: None
pip: 18.0
setuptools: 40.5.0
Cython: None
numpy: 1.15.4
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.5.0
sphinx: None
patsy: 0.5.0
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: None
tables: 3.4.4
numexpr: 2.6.8
feather: None
matplotlib: 3.0.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: 1.2.13
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: 0.7.0

jreback · 2018-11-15T00:45:11Z

Showing that a numpy routine is 'way faster' is not a great argument, as it is often the case that numpy routines don't handle dtypes fully (datetimes and such), and they certainly don't handle the pandas dtypes.

So I am not really sure of your intent here.

Can you elaborate on an actual use case?

We care way more about things working consistently across the board, having flexibility in the APIs, and just working.

bluesquall · 2018-11-15T01:32:53Z

My intent is to get the histogram counts, without the plot, sorted by index (like a histogram would be).

I'd prefer to do this using a method on the Series class, without having to remember to set the sorting, and without having to call a chain of functions. I'd imagine other people could use this in their workflows, but perhaps I'm the only one using histograms in intermediate steps of my processing.

I'm open to implementing this using pd.cut().value_counts(sort=False) instead of np.histogram, especially if it would allay any concerns about working consistently across data types.

# somewhere in Series...
def histogram(self, bins=10, ...):
    return pd.cut(self, bins).value_counts(sort=False)

But my main objections to using pd.cut().value_counts(sort=False) are:

that it returns a categorical index even when the series is continuous and numerical,
that it does not allow me to use the auto-binning methods in np.histogram, and
that it doesn't provide the density keyword.

Perhaps the compromise would be to use pd.cut().value_counts(sort=False) unless the bins arg is one of the auto-binning methods, the density kwarg is true, or some other kwarg is set.
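A rough sketch of what that compromise might look like as a standalone function follows. The name `histogram_compromise`, the choice to drop missing values on the np.histogram path, and the interval-indexed return value are all assumptions made for illustration; none of this is existing pandas behavior.

```python
import numpy as np
import pandas as pd


def histogram_compromise(series, bins=10, density=False):
    """Hypothetical dispatch between pd.cut and np.histogram.

    Not a pandas API. Uses np.histogram only when `bins` names an
    auto-binning rule or `density` is requested; otherwise falls back to
    pd.cut(...).value_counts(sort=False).
    """
    if isinstance(bins, str) or density:
        values = series.dropna().to_numpy()  # assumption: ignore missing data
        counts, edges = np.histogram(values, bins=bins, density=density)
        # np.histogram bins are closed on the left (the last bin is closed on
        # both sides), whereas pd.cut defaults to right-closed intervals.
        index = pd.IntervalIndex.from_breaks(edges, closed="left")
        return pd.Series(counts, index=index)
    return pd.cut(series, bins).value_counts(sort=False)


s = pd.Series([1.0, 2.0, 3.0, 4.0, np.nan])
by_cut = histogram_compromise(s, bins=2)                     # pd.cut path
by_auto = histogram_compromise(s, bins="auto")               # np.histogram path
by_density = histogram_compromise(s, bins=2, density=True)   # np.histogram path
```

Note that the two paths return differently typed indexes (categorical versus interval), which is one of the inconsistencies a real implementation would have to resolve.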

My two present use cases involve:

adding histograms of the same variable from different logs to give me the histogram of the joined logs (instead of having to concatenate all the logs before doing the histogram)
plotting histograms of several related series stacked together as a pcolor heatmap

I'll work up some dummy data for illustrative purposes if needed. Another use case pops up when you want to look at the difference between two histograms, for example, or any other time when the histogram counts are an intermediate part of a calculation.

I can keep going on my own work using a local method or subclassing Series, but it struck me as something that others would find useful if it was there, so I submitted it as #23580.

bluesquall · 2018-11-15T02:12:09Z

You could also think of my first use case as an extension of a stacked bar graph, if you want, but right now I don't need to differentiate the logs once the histograms are joined together.

Here's some dummy data, a bit simplified, but hopefully illustrating the point:

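import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# duration of each of the 12 logs, in minutes (so the logs differ in length)
minutes = np.floor(np.random.normal(60,20,12)).astype(int)
# ASSUMPTION: the definition of s is missing from the post; this stand-in builds
# one Series per log, one sample per second, with values inside the bin range
s = [pd.Series(np.random.uniform(0, 900, 60 * max(m, 1))) for m in minutes]
edges = np.arange(0,1000,100)
centers = 0.5*(edges[1:] + edges[:-1])
# histogram the first log, then accumulate the counts of the remaining logs
h, b = np.histogram(s[0], edges)
for _s in s[1:]:
    _h, _b = np.histogram(_s, edges)
    h += _h
plt.bar(centers, h, width=np.diff(edges))

# and if you want to look at it a slightly different way:
for _s in s:
    _h, _b = np.histogram(_s, edges)
    plt.bar(centers, _h, width=np.diff(edges), color='g', alpha=0.05)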

One key consideration that I made sure to include in this dummy data is that the lengths of each series are not the same, but the range of the variable x is comparable across series.
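Both use cases rest on the same property: with shared bin edges, per-log counts add up to the histogram of the joined logs. The sketch below checks that on made-up data (the log lengths and the uniform distribution are assumptions) and builds the 2-D array of counts that a pcolor-style heatmap would display.

```python
import numpy as np

rng = np.random.default_rng(0)
edges = np.arange(0, 1000, 100)

# Hypothetical logs: different lengths, same variable range.
logs = [rng.uniform(0, 900, size=n) for n in (50, 80, 120)]

# One row of counts per log; this 2-D array is what a heatmap would show.
per_log = np.stack([np.histogram(log, edges)[0] for log in logs])

# Summing the rows gives the histogram of the joined logs,
# without concatenating the raw data first.
summed = per_log.sum(axis=0)
joined, _ = np.histogram(np.concatenate(logs), edges)
print(np.array_equal(summed, joined))
```

The same array also supports differences between histograms, for example `per_log[0] - per_log[1]`, since every row uses identical bins.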

mroeschke · 2023-04-04T23:30:55Z

Thanks for the request, but it appears the core team or community hasn't been interested in this enhancement in a while so closing. Can reopen if there's renewed interest

bluesquall · 2023-04-19T21:50:06Z

If anyone else finds it useful to have access to the actual histogram data as part of a processing pipeline, instead of going straight to a plot, please comment.

I'd be happy to re-open this and to revise it.

gfyoung added Enhancement Visualization plotting labels Nov 15, 2018

mroeschke closed this as completed Apr 4, 2023
