
fill_value kwarg for unstack #9746

Closed
amcpherson opened this issue Mar 29, 2015 · 3 comments
Labels: API Design; Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate); Usage Question

@amcpherson
Contributor

Currently:

In [2]: df = pd.DataFrame({'x':['a', 'a', 'b'], 'y':['j', 'k', 'j'], 'z':[0, 1, 2]})

In [3]: df.set_index(['x', 'y']).unstack()
Out[3]:
   z
y  j   k
x
a  0   1
b  2 NaN

If I want to fill with -1, I need to call fillna and then astype back to int. Ideally:

In [3]: df.set_index(['x', 'y']).unstack(fill_value=-1)
Out[3]:
   z
y  j   k
x
a  0   1
b  2  -1
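
For reference, the two-step workaround described above (fillna, then astype back to int) can be sketched as follows. This is a minimal illustration rather than a recommended pattern; the variable names `wide` and `filled` are introduced here and do not appear in the original report.

```python
import pandas as pd

df = pd.DataFrame({'x': ['a', 'a', 'b'], 'y': ['j', 'k', 'j'], 'z': [0, 1, 2]})

# Unstacking leaves a NaN in the missing (b, k) cell, which forces a float dtype.
wide = df.set_index(['x', 'y']).unstack()

# Workaround: fill the hole, then cast back to an integer dtype explicitly.
filled = wide.fillna(-1).astype('int64')

print(filled)
print(filled.dtypes)
```

Note that this route builds an intermediate floating-point table before the final cast, which is the extra work a fill_value keyword would avoid.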
@jreback
Contributor

jreback commented Mar 29, 2015

You can do this by specifying the downcast keyword. This is NOT automatic because, as a general operation, it can be expensive.

In [10]: df.set_index(['x','y']).unstack().fillna(-1,downcast='infer')
Out[10]: 
   z   
y  j  k
x      
a  0  1
b  2 -1

In [11]: df.set_index(['x','y']).unstack().fillna(-1,downcast='infer').dtypes
Out[11]: 
   y
z  j    int64
   k    int64
dtype: object

jreback closed this as completed Mar 29, 2015
jreback added the labels Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate), API Design, and Usage Question Mar 29, 2015
@amcpherson
Contributor Author

There may be some merit to this being allowed directly, even if the functionality can be accomplished with a series of operations. For instance, when trying to limit memory usage on a big dataset, perhaps it would be preferable to keep the data as np.int8.

In [15]: idx = np.array([0, 0, 1], dtype=np.int32)

In [16]: idx2 = np.array([0, 1, 0], dtype=np.int8)

In [17]: value = np.array([0, 1, 2], dtype=np.int8)

In [18]: df = pd.DataFrame({'idx':idx, 'idx2':idx2, 'value':value})

In [19]: df.dtypes
Out[19]:
idx      int32
idx2      int8
value     int8
dtype: object

In [20]: df.set_index(['idx', 'idx2']).unstack().dtypes
Out[20]:
       idx2
value  0       float64
       1       float64
dtype: object

After the unstack, my data table is suddenly much larger than necessary: the int8 values have been promoted to float64.

Also, from looking at the code this would be fairly trivial to implement, without much impact on existing code.
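
The memory concern above can be checked with a short sketch. It assumes a pandas version in which unstack accepts the fill_value keyword requested in this issue; the names `indexed`, `promoted`, and `compact` are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'idx': np.array([0, 0, 1], dtype=np.int32),
    'idx2': np.array([0, 1, 0], dtype=np.int8),
    'value': np.array([0, 1, 2], dtype=np.int8),
})
indexed = df.set_index(['idx', 'idx2'])

# Default behaviour: the missing cell becomes NaN, so values are promoted to float64.
promoted = indexed.unstack()

# Assumed keyword: -1 is representable in int8, so no promotion should be needed.
compact = indexed.unstack(fill_value=-1)

# Compare the size of the underlying value arrays.
promoted_bytes = promoted.to_numpy().nbytes
compact_bytes = compact.to_numpy().nbytes

print(promoted.dtypes, promoted_bytes)
print(compact.dtypes, compact_bytes)
```

Since float64 takes 8 bytes per cell and int8 takes 1, the difference grows linearly with the size of the unstacked table.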

@jreback
Contributor

jreback commented Mar 29, 2015

@amcpherson ok, if you can find a reasonable way to do this w/o affecting perf then would be ok to have a fill_value argument.
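
One way to see why a fill_value argument need not cost an extra pass is the NumPy sketch below. It is not pandas' implementation: `unstack_with_fill` and its parameters are hypothetical names, and the function assumes the row and column labels have already been encoded as dense integer codes.

```python
import numpy as np

def unstack_with_fill(row_codes, col_codes, values, fill_value):
    """Scatter 1-D values into a 2-D table, pre-filled with fill_value.

    Hypothetical helper for illustration; assumes integer codes starting at 0.
    """
    n_rows = int(row_codes.max()) + 1
    n_cols = int(col_codes.max()) + 1
    # Allocate the output directly in the input dtype, so no promotion occurs.
    out = np.full((n_rows, n_cols), fill_value, dtype=values.dtype)
    # Single scatter: cells with no matching (row, col) pair keep fill_value.
    out[row_codes, col_codes] = values
    return out

table = unstack_with_fill(
    np.array([0, 0, 1]),
    np.array([0, 1, 0]),
    np.array([0, 1, 2], dtype=np.int8),
    fill_value=-1,
)
print(table, table.dtype)
```

The output is allocated once in the target dtype and written once, so there is no float intermediate to fill and cast afterwards.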
