Description
Pandas version checks
- I have checked that the issue still exists on the latest versions of the docs on main here
Location of the documentation
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html#pandas.DataFrame.corr
Documentation problem
Since rank correlation coefficients also make sense for ordinal data, the documentation for the numeric_only option is a little confusing.
Currently, it says the following:
numeric_only : bool, default False
Include only `float`, `int` or `boolean` data.
When ordinal data (a categorical dtype with a defined order) is present in a DataFrame, it seems natural to expect that rank correlation is still computed when using numeric_only=False. For example, something like this should work:
import pandas as pd
df = pd.DataFrame({'workweek' : [4, 5, 6, 4], 'income' : ['low', 'middle', 'high', 'low']})
df['income'] = df['income'].astype('category').cat.set_categories(['low', 'middle', 'high'], ordered=True)
df.corr(method='spearman')
Yet it raises a ValueError, since it cannot convert the string 'low' to float. Moreover, using numeric categories with a specific order produces incorrect results. For example,
import pandas as pd
df = pd.DataFrame({'a' : [1, 2, 3, 4], 'b' : [4, 3, 2, 1]})
df['b'] = df['b'].astype('category').cat.set_categories([4, 3, 2, 1], ordered=True)
df.corr(method='spearman')
returns a Spearman's correlation of -1 between a and b, while the real value, respecting the declared category order, is 1.
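To illustrate what "respecting the declared order" would mean for both examples, here is a minimal sketch of a workaround. It is not something pandas does inside corr; it assumes you are willing to replace ordered categorical columns with their integer codes before computing the rank correlation. The helper name ordered_to_codes is hypothetical and used only for illustration.

```python
import pandas as pd


def ordered_to_codes(frame):
    """Hypothetical helper: swap ordered categorical columns for their integer codes."""
    out = frame.copy()
    for name in out.columns:
        col = out[name]
        if isinstance(col.dtype, pd.CategoricalDtype) and col.cat.ordered:
            # Codes follow the declared category order (first category -> 0, and so on).
            # Missing values are coded as -1, so turn those back into NaN.
            codes = col.cat.codes.astype("float64")
            out[name] = codes.where(codes >= 0)
    return out


# First example: string categories with the order low < middle < high.
df1 = pd.DataFrame({'workweek': [4, 5, 6, 4], 'income': ['low', 'middle', 'high', 'low']})
df1['income'] = df1['income'].astype('category').cat.set_categories(['low', 'middle', 'high'], ordered=True)
rho_income = ordered_to_codes(df1).corr(method='spearman').loc['workweek', 'income']

# Second example: numeric categories with the declared order 4 < 3 < 2 < 1.
df2 = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [4, 3, 2, 1]})
df2['b'] = df2['b'].astype('category').cat.set_categories([4, 3, 2, 1], ordered=True)
rho_ab = ordered_to_codes(df2).corr(method='spearman').loc['a', 'b']

print(rho_income, rho_ab)
```

With the codes in place, the first example no longer raises, and the second one yields a correlation of 1 between a and b rather than -1, because the codes carry the declared order instead of the underlying numeric values.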
Suggested fix for documentation
I believe the following additional sentence would make things a bit less confusing.
numeric_only : bool, default False
Include only `float`, `int` or `boolean` data. If False, the method will try to cast non-numeric
columns to float (note that ordinal data will, where possible, be converted ignoring the specified order).