get_dummies(df,sparse=True) does not return sparse DataFrame #10531

Closed
tgarc opened this issue Jul 8, 2015 · 6 comments
Labels
Bug · Reshaping (Concat, Merge/Join, Stack/Unstack, Explode) · Sparse (Sparse Data Type)

tgarc commented Jul 8, 2015

Just like it says in the subject. Here's an example:

In [216]: pd.version.version
Out[216]: '0.16.2'

In [217]: df = pd.DataFrame(np.random.randint(10,size=(10000,5)),columns=list('abcde'))

In [218]: df.head()
Out[218]: 
   a  b  c  d  e
0  2  6  1  4  0
1  5  2  6  8  5
2  0  8  7  3  1
3  9  2  6  3  1
4  4  8  6  8  6

In [219]: ddf = pd.get_dummies(df,columns=df.columns,sparse=True)

In [220]: ddf.head()
Out[220]: 
   a_0  a_1  a_2  a_3  a_4  a_5  a_6  a_7  a_8  a_9 ...   e_0  e_1  e_2  e_3  \
0    0  NaN    1  NaN    0    0  NaN  NaN  NaN    0 ...     1    0  NaN  NaN   
1    0  NaN    0  NaN    0    1  NaN  NaN  NaN    0 ...     0    0  NaN  NaN   
2    1  NaN    0  NaN    0    0  NaN  NaN  NaN    0 ...     0    1  NaN  NaN   
3    0  NaN    0  NaN    0    0  NaN  NaN  NaN    1 ...     0    1  NaN  NaN   
4    0  NaN    0  NaN    1    0  NaN  NaN  NaN    0 ...     0    0  NaN  NaN   

   e_4  e_5  e_6  e_7  e_8  e_9  
0  NaN    0    0  NaN  NaN  NaN  
1  NaN    1    0  NaN  NaN  NaN  
2  NaN    0    0  NaN  NaN  NaN  
3  NaN    0    0  NaN  NaN  NaN  
4  NaN    0    1  NaN  NaN  NaN  

[5 rows x 50 columns]

In [221]: type(ddf)
Out[221]: pandas.core.frame.DataFrame

In [222]: hasattr(ddf,'density')
Out[222]: False

In [223]: ddf = ddf.to_sparse()

In [224]: type(ddf)
Out[224]: pandas.sparse.frame.SparseDataFrame

In [225]: ddf.density
Out[225]: 0.1

I notice the NaN encoding in the DataFrame returned by get_dummies when sparse=True, but the returned object is a plain DataFrame rather than a SparseDataFrame. Is this expected behavior?
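For anyone repeating this check on a newer pandas, here is a minimal sketch. It assumes pandas 1.0 or later, where SparseDataFrame and to_sparse() no longer exist and sparsity is carried by per-column dtypes instead. The seeded generator and the variable names are illustrative and not part of the original report.

```python
import numpy as np
import pandas as pd

# Same shape of data as in the report: 5 columns of integers in [0, 10).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(10, size=(10000, 5)), columns=list("abcde"))

# Encode every column, asking for sparse output.
ddf = pd.get_dummies(df, columns=list(df.columns), sparse=True)

# On recent pandas the result is a regular DataFrame, so inspect the
# column dtypes instead of the container type.
all_sparse = all(isinstance(dtype, pd.SparseDtype) for dtype in ddf.dtypes)

# The .sparse accessor reports the fraction of explicitly stored values.
density = ddf.sparse.density

print(type(ddf).__name__, all_sparse, density)
```

With ten equally likely categories per column, the density should come out close to the 0.1 reported by `ddf.density` in the session above.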

Contributor

jreback commented Jul 9, 2015

So this is a bug here.
The input to concat is a list of multiple SparseDataFrames and an empty DataFrame, which then coerces to a DataFrame with SparseBlocks. So this is actually OK. A SparseDataFrame could be created (when it's a single dtype, so this itself is odd).

So this would need to be worked through a bit to see if everything makes sense.

pull-requests are welcome.
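The situation described above can be sketched on a recent pandas. This assumes pandas 1.0 or later, where there is no SparseDataFrame and the question becomes whether the sparse column dtypes survive the concat. The frame names and the tiny hand-built inputs are made up for illustration; they are not the objects get_dummies builds internally.

```python
import pandas as pd

index = pd.RangeIndex(4)

# An empty frame (index only, no columns) standing in for the
# "non-encoded columns" piece when every column is being encoded.
empty = pd.DataFrame(index=index)

# Two single-column frames backed by sparse arrays, standing in for
# the per-column dummy blocks.
sparse_a = pd.DataFrame(
    {"a_0": pd.arrays.SparseArray([1, 0, 0, 1], fill_value=0)}, index=index
)
sparse_b = pd.DataFrame(
    {"b_0": pd.arrays.SparseArray([0, 0, 1, 0], fill_value=0)}, index=index
)

# Column-wise concat of the empty frame with the sparse pieces.
combined = pd.concat([empty, sparse_a, sparse_b], axis=1)

print(type(combined).__name__)
print(combined.dtypes)
```

The container is an ordinary DataFrame either way; what matters is that each column still reports a sparse dtype.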

@jreback jreback added Bug, Reshaping (Concat, Merge/Join, Stack/Unstack, Explode), Sparse (Sparse Data Type) labels Jul 9, 2015
@jreback jreback added this to the Next Major Release milestone Jul 9, 2015
Contributor

jreback commented Jul 9, 2015

xref #8823

cc @artemyk want to have a look?

Contributor

artemyk commented Jul 9, 2015

@jreback I will make a PR --- if all columns are included in get_dummies, then it does not start concat-ing with an empty DataFrame.
This does not resolve whether concat([empty_df, nonempty_sparse_df]) should in fact return a SparseDataFrame. This could go into another issue if desired. An easy fix (?) would be for concat to drop any empty DataFrames from its list.
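The "drop empty DataFrames" idea could look roughly like the sketch below. `concat_nonempty` is a hypothetical user-side helper, not a pandas function and not necessarily what the eventual PR does; it only filters the list before handing it to `pd.concat`.

```python
import pandas as pd


def concat_nonempty(frames, **kwargs):
    """Hypothetical helper: concat only the frames that hold data.

    A frame counts as empty when DataFrame.empty is True, i.e. when
    either axis has length 0. Keyword arguments go to pd.concat.
    """
    frames = list(frames)
    kept = [frame for frame in frames if not frame.empty]
    if not kept:
        # Nothing to combine: return a copy of the first input, or a
        # fresh empty DataFrame if the list itself was empty.
        return frames[0].copy() if frames else pd.DataFrame()
    return pd.concat(kept, **kwargs)


index = pd.RangeIndex(3)
empty = pd.DataFrame(index=index)
dummies = pd.DataFrame(
    {"a_0": pd.arrays.SparseArray([1, 0, 0], fill_value=0)}, index=index
)

# The empty frame is skipped, so only the sparse frame reaches concat.
result = concat_nonempty([empty, dummies], axis=1)
print(result.dtypes)
```

One caveat with this approach: dropping an index-only frame also discards its index, which is harmless here because the remaining frames share it, but would matter if the indexes differed.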

Author

tgarc commented Jul 10, 2015

FWIW, the returned DataFrame does seem to be compressed (it is much smaller with the sparse=True flag on than with it off).
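A quick way to compare the two sizes, sketched for a recent pandas (1.0 or later is assumed; the seeded generator and variable names are illustrative). It uses `DataFrame.memory_usage`, which is only one of several ways to measure this and may not be how the sizes above were obtained.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(10, size=(10000, 5)), columns=list("abcde"))
columns = list(df.columns)

# Same encoding twice, once dense and once sparse.
dense = pd.get_dummies(df, columns=columns, sparse=False)
sparse = pd.get_dummies(df, columns=columns, sparse=True)

# Total bytes reported by pandas for each frame (index included).
dense_bytes = int(dense.memory_usage(deep=True).sum())
sparse_bytes = int(sparse.memory_usage(deep=True).sum())

print(f"dense:  {dense_bytes} bytes")
print(f"sparse: {sparse_bytes} bytes")
```

The sparse frame stores only the non-fill entries plus their positions, so with about one in ten entries set it should report fewer bytes than the dense one.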

Contributor

artemyk commented Jul 10, 2015

That makes sense --- I believe it still will have SparseBlocks.

@jreback jreback modified the milestones: 0.17.0, Next Major Release Jul 11, 2015
Contributor

jreback commented Jul 21, 2015

closed by #10535

@jreback jreback closed this as completed Jul 21, 2015