Using scipy's entropy function with "string" dtype #32234

Open
SylvainLan opened this issue Feb 25, 2020 · 3 comments
Labels
Bug Compat pandas objects compatability with Numpy or Python functions ExtensionArray Extending pandas with custom dtypes or arrays. Strings String extension data type and string data

Comments

@SylvainLan
Contributor

Hello,

I compute entropies of real-valued random variables using pandas and scipy.stats. For practical reasons, I bin my DataFrame's values with pd.cut, and the labels I assign to the binned values have to be strings. I wanted to use the new "string" dtype introduced in 1.0.0, but there seems to be an issue. Here is a code sample:

import pandas as pd 
import numpy as np 
from scipy.stats import entropy 
 
df = pd.DataFrame({'val': np.random.rand(1000)}) 
bins = np.linspace(0, 1, 11)
labels = [str(int(10*i)) for i in bins[:-1]] 
 
df['bin_val'] = pd.cut(df['val'], bins=bins, labels=labels) 
# This works
print(entropy(df['bin_val'].value_counts())) 
 
df['bin_val'] = df['bin_val'].astype("string") 
# This raises an error
print(entropy(df['bin_val'].value_counts()))

The error raised is TypeError: ufunc 'entr' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

@pandrey-fr

pandrey-fr commented Feb 25, 2020

The issue seems to arise from the fact that the second Series obtained from df['bin_val'].value_counts() is backed by a pandas.arrays.IntegerArray (dtype Int64) rather than a numpy.ndarray, and this is not a supported input type for scipy.stats.entropy. Similarly, explicitly calling entropy(df['bin_val'].value_counts().to_numpy()) fails because the values are cast to object dtype.

The simplest ways I found to run the computation in the second case are to call entropy(df['bin_val'].value_counts().to_numpy(dtype=int)), which feels a bit too verbose, or more simply entropy(df['bin_val'].value_counts().astype(int)), to cast from Int64 to int64.
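For reference, the explicit-cast workaround can be sketched on a small deterministic input. This is only an illustration: the `labels` Series and the variable names are made up for the example, and it assumes the counts contain no missing values (true for value_counts() output), so the cast to a NumPy integer array is lossless.

```python
import numpy as np
import pandas as pd
from scipy.stats import entropy

# Hypothetical stand-in for the binned labels in the report above.
labels = pd.Series(["0", "0", "1", "2", "2", "2"], dtype="string")

# With the "string" dtype, the counts may come back as nullable Int64
# rather than NumPy int64, depending on the pandas version.
counts = labels.value_counts()

# Cast explicitly to a plain NumPy int64 array before calling scipy.
counts_np = counts.to_numpy(dtype="int64")

# scipy normalizes the counts to probabilities and uses the natural log.
h = entropy(counts_np)
print(h)
```

Casting at the call site keeps the nullable dtype everywhere else in the DataFrame and only drops it where a NumPy-only function needs plain integers.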

@SylvainLan
Contributor Author

Thanks @pandrey-fr
So you're telling me I basically got lucky in my first scenario because the conversion was handled directly by scipy?
If that's the case, there is no need for pandas to change anything, right? I'll therefore close this issue.

@pandrey-fr

Well, basically, in the second scenario, converting your data to the recently introduced pandas.StringDtype caused the value_counts results to be returned as Int64. Not being a pandas developer, I do not know whether this is intended and/or desirable behaviour.

Note that if you converted your data to str (which pandas treats as dtype('O')) using df['bin_val'].astype(str), you would get int64 value counts, and thus not encounter this issue. In other words, the issue emerges from novel pandas features, and may well be worth reporting as these are not quite stabilized yet :-)
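A quick way to check which dtype value_counts() returns on a given installation is to print it for both kinds of input. This is a minimal sketch with made-up data; the object-dtype Series is built directly with dtype=object to mirror the dtype('O') case described above, and the dtype printed for the "string" case may differ between pandas versions.

```python
import numpy as np
import pandas as pd

values = ["0", "0", "1"]  # illustrative data only

# Object-dtype strings: counts are plain NumPy int64.
object_counts = pd.Series(values, dtype=object).value_counts()

# "string" extension dtype: counts were nullable Int64 in the report above.
string_counts = pd.Series(values, dtype="string").value_counts()

print("object input ->", object_counts.dtype)
print("string input ->", string_counts.dtype)
```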

@jbrockmendel jbrockmendel added the Strings String extension data type and string data label Oct 12, 2020
@mroeschke mroeschke added Bug ExtensionArray Extending pandas with custom dtypes or arrays. labels Jul 29, 2021
@jbrockmendel jbrockmendel added the Compat pandas objects compatability with Numpy or Python functions label Jul 27, 2023
4 participants