ENH: add Series.histogram wrapping numpy.histogram #23710
Comments
Showing that a numpy routine is 'way faster' is not a great argument, as it is often the case that numpy routines don't handle dtypes fully (datetimes and such, and they certainly don't handle the pandas dtypes), so I am not really sure of your intent here. Can you elaborate on an actual use case? We care way more about things working consistently across the board, having flexibility in the APIs, and just working.
My intent is to get the histogram counts, without the plot, sorted by index (as a histogram would be). I'd prefer to do this using a method on the Series class, without having to remember to set the sorting, and without having to call a chain of functions. I'd imagine other people could use this in their workflows, but perhaps I'm the only one using histograms in intermediate steps of my processing. I'm open to implementing this using
# somewhere in Series...
def histogram(self, bins=10, ...):
    # sort=False (the boolean, not the string 'False') keeps the counts in bin order
    return pd.cut(self, bins).value_counts(sort=False)
But my main objections to using
Perhaps the compromise would be to use
My two present use cases involve:
I'll work up some dummy data for illustrative purposes if needed. Another use case pops up when you want to look at the difference between two histograms, for example, or any other time when the histogram counts are an intermediate part of a calculation. I can keep going on my own work using a local method or subclassing Series, but it struck me as something that others would find useful if it was there, so I submitted it as #23580.
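As a rough illustration of the "difference between two histograms" use case, here is a minimal sketch. The helper name `series_histogram` and the sample values are hypothetical and are not part of pandas or of the proposed change; the sketch only assumes that `np.histogram` is called with shared bin edges so that the two results line up.

```python
import numpy as np
import pandas as pd


def series_histogram(s, bins=10):
    """Hypothetical helper: bin counts of `s` as a Series indexed by bin interval."""
    counts, edges = np.histogram(s.dropna().to_numpy(), bins=bins)
    # np.histogram bins are half-open [left, right), except the last one,
    # which also includes its right edge; closed="left" is an approximation.
    index = pd.IntervalIndex.from_breaks(edges, closed="left")
    return pd.Series(counts, index=index, name=s.name)


# Shared edges so that both histograms use identical bins.
edges = np.arange(0, 60, 10)  # 0, 10, 20, 30, 40, 50
a = pd.Series([1, 12, 15, 27, 41, 44])  # made-up values
b = pd.Series([3, 5, 18, 33, 49])       # made-up values

hist_a = series_histogram(a, bins=edges)
hist_b = series_histogram(b, bins=edges)

# The counts are ordinary Series, so they can feed later calculations.
difference = hist_a - hist_b
print(difference)
```

Because both results share the same interval index, the subtraction aligns bin by bin without any extra sorting step.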
You could also think of my first use case as an extension of a stacked bar graph, if you want, but right now I don't need to differentiate the logs once the histograms are joined together. Here's some dummy data, a bit simplified, but hopefully illustrating the point:
import numpy as np
import matplotlib.pyplot as plt

minutes = np.floor(np.random.normal(60,20,12)).astype(int)
# `s` (its construction is not shown here) is assumed to be a list of series with differing lengths
edges = np.arange(0,1000,100)
centers = 0.5*(edges[1:] + edges[:-1])
# accumulate the per-series counts over the shared bin edges
h, b = np.histogram(s[0], edges)
for _s in s[1:]:
    _h, _b = np.histogram(_s, edges)
    h += _h
plt.bar(centers, h, width=np.diff(edges))
# and if you want to look at it a slightly different way:
for _s in s:
    _h, _b = np.histogram(_s, edges)
    plt.bar(centers, _h, width=np.diff(edges), color='g', alpha=0.05)
One key consideration that I made sure to include in this dummy data is that the lengths of each series are not the same, but the range of the variable
Thanks for the request, but it appears the core team or community hasn't been interested in this enhancement in a while, so closing. Can reopen if there's renewed interest.
If anyone else finds it useful to have access to the actual histogram data as part of a processing pipeline, instead of going straight to a plot, please comment. I'd be happy to re-open this and to revise it.
Code Sample, a copy-pastable example if possible
Problem description
This is a lightweight wrapper around np.histogram, like Series.hist seems to be a lightweight wrapper around matplotlib.pyplot.hist (at least from a user's perspective).
It differs from Series.hist in that it returns the histogram counts and bin edges, rather than going straight to a plot.
np.histogram.
In comparison, using pd.cut:
value_counts not to sort
Finally, using np.histogram is way faster:
Related issues and pull requests:
#23580, #3945, #4502, #265
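The comparison between the np.histogram route and the pd.cut route described above can be sketched as follows. The sample values are invented, and `right=False` is an assumption chosen here so that pd.cut uses the same left-closed bins as np.histogram; note that np.histogram additionally includes the right edge of the last bin, which pd.cut with `right=False` does not. No timing claim is made by this sketch.

```python
import numpy as np
import pandas as pd

s = pd.Series([5, 15, 15, 25, 95])  # made-up values, none on a bin edge
edges = np.arange(0, 110, 10)       # 0, 10, ..., 100 -> 10 bins

# Route 1: np.histogram returns counts already ordered by bin.
np_counts, np_edges = np.histogram(s, bins=edges)

# Route 2: pd.cut followed by value_counts. sort=False avoids reordering the
# counts by frequency, and sort_index() makes the bin order explicit.
cut_counts = (
    pd.cut(s, bins=edges, right=False)
    .value_counts(sort=False)
    .sort_index()
)

print(np_counts)
print(cut_counts.to_numpy())
```

For data that does not fall exactly on a bin edge, both routes give the same counts; the second one simply needs more steps to get there.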
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.7.1.final.0
python-bits: 64
OS: Linux
OS-release: 4.18.16-arch1-1-ARCH
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.23.4
pytest: None
pip: 18.0
setuptools: 40.5.0
Cython: None
numpy: 1.15.4
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.5.0
sphinx: None
patsy: 0.5.0
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: None
tables: 3.4.4
numexpr: 2.6.8
feather: None
matplotlib: 3.0.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: 1.2.13
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: 0.7.0