Supplying an xarray Dataset to DataFrame constructor breaks #12353

Closed
max-sixty opened this issue Feb 16, 2016 · 25 comments
Labels
Bug Closing Candidate May be closeable, needs more eyeballs Compat pandas objects compatability with Numpy or Python functions Constructors Series/DataFrame/Index/pd.array Constructors

Comments

@max-sixty
Contributor

In [1]: import xarray as xr

In [2]: import pandas as pd

In [3]: df = pd.DataFrame({'a': pd.np.random.rand(10), 'b': pd.np.random.rand(10)})

In [4]: df
Out[4]: 
          a         b
0  0.711341  0.636954
1  0.199090  0.370938
2  0.486677  0.274427
3  0.407370  0.282419
4  0.760676  0.069163
5  0.098402  0.820085
6  0.710977  0.777998
7  0.687722  0.764163
8  0.297734  0.740927
9  0.554381  0.388324

In [5]: xr.Dataset(df)
Out[5]: 
<xarray.Dataset>
Dimensions:  (dim_0: 10)
Coordinates:
  * dim_0    (dim_0) int64 0 1 2 3 4 5 6 7 8 9
Data variables:
    a        (dim_0) float64 0.7113 0.1991 0.4867 0.4074 0.7607 0.0984 0.711 ...
    b        (dim_0) float64 0.637 0.3709 0.2744 0.2824 0.06916 0.8201 0.778 ...

In [6]: pd.DataFrame(xr.Dataset(df))
---------------------------------------------------------------------------
PandasError                               Traceback (most recent call last)
<ipython-input-6-be1c7b096414> in <module>()
----> 1 pd.DataFrame(xr.Dataset(df))

/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/pandas/core/frame.py in __init__(self, data, index, columns, dtype, copy)
    303                                          copy=False)
    304             else:
--> 305                 raise PandasError('DataFrame constructor not properly called!')
    306 
    307         NDFrame.__init__(self, mgr, fastpath=True)

PandasError: DataFrame constructor not properly called!

I think this is because the constructor checks for dict rather than 'dict-like' or Mapping: https://github.com/pydata/pandas/blob/master/pandas/core/frame.py#L222
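To illustrate the distinction, here is a minimal sketch using only the standard library. `ColumnStore` is a hypothetical container invented for this example (it is not part of pandas or xarray); it implements the Mapping interface without subclassing dict, so a strict dict check rejects it.

```python
from collections.abc import Mapping


class ColumnStore(Mapping):
    """Hypothetical dict-like container that does not subclass dict."""

    def __init__(self, columns):
        self._columns = dict(columns)

    def __getitem__(self, name):
        return self._columns[name]

    def __iter__(self):
        return iter(self._columns)

    def __len__(self):
        return len(self._columns)


store = ColumnStore({"a": [1, 2], "b": [3, 4]})

# A strict type check rejects the container...
passes_dict_check = isinstance(store, dict)
# ...while an abstract-base-class check accepts it.
passes_mapping_check = isinstance(store, Mapping)
# Any Mapping can be converted to a plain dict explicitly.
as_plain_dict = dict(store)
```

Code that branches on `isinstance(data, dict)` never reaches its dict-handling path for such an object, which matches the behaviour reported above.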

@jreback
Contributor

jreback commented Feb 16, 2016

you have to call .to_dataframe() right?

this never worked before.

@jreback jreback added Compat pandas objects compatability with Numpy or Python functions Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Feb 16, 2016
@jreback
Contributor

jreback commented Feb 16, 2016

oh, this inherits from Mapping. Yeah, there are probably lots of cases where we just check for dict; we should use com.is_dict_like instead.
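A duck-typed check of this kind could look roughly like the sketch below. `looks_dict_like` and `Registry` are hypothetical names for illustration only; this is an assumption about the general approach, not the actual implementation of com.is_dict_like.

```python
def looks_dict_like(obj):
    """Hypothetical duck-typed check: obj exposes keys() and item access."""
    return hasattr(obj, "keys") and hasattr(obj, "__getitem__")


class Registry:
    """Hypothetical class that is dict-like without inheriting from dict or Mapping."""

    def __init__(self):
        self._items = {"x": 1}

    def keys(self):
        return self._items.keys()

    def __getitem__(self, key):
        return self._items[key]


checks = {
    "dict": looks_dict_like({}),
    "registry": looks_dict_like(Registry()),
    "list": looks_dict_like([1, 2]),   # has __getitem__ but no keys()
    "str": looks_dict_like("ab"),      # has __getitem__ but no keys()
}
```

The check depends only on the methods an object exposes, so it accepts containers that neither subclass dict nor register with Mapping.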

@max-sixty
Contributor Author

To be clear, .to_dataframe() works; this is a separate, more abstract issue. If we supplied any Mapping of name: array that wasn't a dict, we'd have the same issue.

@jreback
Contributor

jreback commented Feb 16, 2016

ok #12356 fixes for DataFrame, can you give me examples that should work for Series & Panel?

@max-sixty
Contributor Author

Great! Thanks.

First, series: https://github.com/pydata/pandas/blob/master/pandas/core/series.py#L166

In [12]: series = pd.Series(range(10))

In [13]: series
Out[13]: 
0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
dtype: int64

In [14]: xr.Dataset(series)
Out[14]: 
<xarray.Dataset>
Dimensions:  ()
Coordinates:
    *empty*
Data variables:
    0        int64 0
    1        int64 1
    2        int64 2
    3        int64 3
    4        int64 4
    5        int64 5
    6        int64 6
    7        int64 7
    8        int64 8
    9        int64 9

In [15]: pd.Series(xr.Dataset(series))
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-15-60e12a391f83> in <module>()
----> 1 pd.Series(xr.Dataset(series))

/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/pandas/core/series.py in __init__(self, data, index, dtype, name, copy, fastpath)
    223             else:
    224                 data = _sanitize_array(data, index, dtype, copy,
--> 225                                        raise_cast_failure=True)
    226 
    227                 data = SingleBlockManager(data, index, fastpath=True)

/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/pandas/core/series.py in _sanitize_array(data, index, dtype, copy, raise_cast_failure)
   2855 
   2856     # scalar like
-> 2857     if subarr.ndim == 0:
   2858         if isinstance(data, list):  # pragma: no cover
   2859             subarr = np.array(data, dtype=object)

/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/xarray/core/common.py in __getattr__(self, name)
    135                     return source[name]
    136         raise AttributeError("%r object has no attribute %r" %
--> 137                              (type(self).__name__, name))
    138 
    139     def __setattr__(self, name, value):

AttributeError: 'Dataset' object has no attribute 'ndim'

@max-sixty
Contributor Author

Panel: https://github.com/pydata/pandas/blob/master/pandas/core/panel.py#L160

In [18]: xr.Dataset(pd.Panel(pd.np.random.rand(5,4,3)))
Out[18]: 
<xarray.Dataset>
Dimensions:  (dim_0: 4, dim_1: 3)
Coordinates:
  * dim_0    (dim_0) int64 0 1 2 3
  * dim_1    (dim_1) int64 0 1 2
Data variables:
    0        (dim_0, dim_1) float64 0.8917 0.4159 0.6102 0.2616 0.2068 ...
    1        (dim_0, dim_1) float64 0.4132 0.7464 0.6103 0.7006 0.8255 0.63 ...
    2        (dim_0, dim_1) float64 0.7507 0.8742 0.1039 0.2819 0.06264 ...
    3        (dim_0, dim_1) float64 0.3035 0.3156 0.8926 0.0023 0.05565 ...
    4        (dim_0, dim_1) float64 0.6555 0.8872 0.04457 0.7503 0.8936 ...

In [19]: pd.Panel(xr.Dataset(pd.Panel(pd.np.random.rand(5,4,3))))
---------------------------------------------------------------------------
PandasError                               Traceback (most recent call last)
<ipython-input-19-ce579b9d3522> in <module>()
----> 1 pd.Panel(xr.Dataset(pd.Panel(pd.np.random.rand(5,4,3))))

/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/pandas/core/panel.py in __init__(self, data, items, major_axis, minor_axis, copy, dtype)
    133                  copy=False, dtype=None):
    134         self._init_data(data=data, items=items, major_axis=major_axis,
--> 135                         minor_axis=minor_axis, copy=copy, dtype=dtype)
    136 
    137     def _init_data(self, data, copy, dtype, **kwargs):

/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/pandas/core/panel.py in _init_data(self, data, copy, dtype, **kwargs)
    173             copy = False
    174         else:  # pragma: no cover
--> 175             raise PandasError('Panel constructor not properly called!')
    176 
    177         NDFrame.__init__(self, mgr, axes=axes, copy=copy, dtype=dtype)

PandasError: Panel constructor not properly called!

jreback added a commit to jreback/pandas that referenced this issue Feb 16, 2016
@jreback
Contributor

jreback commented Feb 17, 2016

the Series constructor cannot handle the 2-d input that you are passing it, so not really sure what if anything to do.

In [1]: s = Series(range(10))

In [2]: s.to_xarray()
Out[2]: 
<xarray.DataArray (index: 10)>
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
Coordinates:
  * index    (index) int64 0 1 2 3 4 5 6 7 8 9

In [3]: Series(s.to_xarray())
Out[3]: 
0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
dtype: int64

@max-sixty
Contributor Author

This goes a bit deeper than I thought, but bear with me:

  • Datasets vs DataArrays: A Dataset is a mapping of label to array. So xr.Dataset({'a': 2, 'b': 3}) has the same dimensionality as a Series.
  • .to_xarray() converts to a DataArray, which is a reasonable default; but converting to and from Datasets is also helpful
  • pd.Series(xr.Dataset({'a': 2, 'b': 3})) doesn't work for two separate reasons:
    • Because Series looks for dict rather than Mapping, and so doesn't unpack the Dataset
    • But pd.Series(dict(xr.Dataset({'a': 2, 'b': 3}))) doesn't work either, because the values in the dict are 0-dimensional arrays, and Series doesn't unpack them (see list(dict(xr.Dataset({'a': 2, 'b': 3})).values())[0] for an example value)

So my proposed changes are to do both of those - does that seem reasonable?

ref pydata/xarray#740 on going the other direction

@jreback jreback added this to the 0.18.1 milestone Feb 17, 2016
@jreback jreback modified the milestones: 0.18.2, 0.18.1 Apr 18, 2016
@jreback
Contributor

jreback commented Apr 18, 2016

pls update when you can

@max-sixty
Contributor Author

The Mapping stuff is easy, I can do that. That covers DataFrame & Panel.

Series needs to accept a dict with 0-dimensional arrays. Any ideas on how to do that? Otherwise, we can push that one off.

@jreback
Contributor

jreback commented Apr 18, 2016

what is a dict with 0th dims?

@max-sixty
Contributor Author

In [42]: ds = xr.Dataset(dict(zip(list('abcde'), range(4))))

In [43]: ds
Out[43]: 
<xarray.Dataset>
Dimensions:  ()
Coordinates:
    *empty*
Data variables:
    b        int64 1
    d        int64 3
    c        int64 2
    a        int64 0

In [44]: list(dict(ds).values())[0]
Out[44]: 
<xarray.DataArray 'b' ()>
array(1)

In [45]: list(dict(ds).values())[0].ndim
Out[45]: 0

Each value like Out[44] should become the corresponding value in a Series
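One way the unpacking could work is sketched below with plain NumPy 0-dimensional arrays standing in for the DataArray values. `unpack_zero_dim` is a hypothetical helper written for this illustration, not an existing pandas or xarray function.

```python
import numpy as np


def unpack_zero_dim(mapping):
    """Return a plain dict, replacing 0-d array-likes with Python scalars."""
    unpacked = {}
    for key, value in mapping.items():
        arr = np.asarray(value)
        # A 0-d array wraps exactly one element; .item() extracts it.
        unpacked[key] = arr.item() if arr.ndim == 0 else value
    return unpacked


data = {"a": np.array(0), "b": np.array(1), "c": np.array([1, 2])}
result = unpack_zero_dim(data)
```

Values with one or more dimensions are passed through untouched, so only the scalar-like entries change.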

@max-sixty
Contributor Author

@jreback As discussed, we're using is_dict_like to see if something is a dict or similar.

Should pandas objects be dict-like? I think that may break some code (i.e. in a Series constructor, supplying a Series is not like supplying a dictionary, given metadata).

I think ideally pandas objects should be dict-like (and potentially inherit from Mapping ref: #12056).
But given practical considerations, I think it would make sense to have is_dict_like exclude pandas objects.

Do you agree?

@jreback
Contributor

jreback commented Apr 27, 2016

no. a Series IS dict-like. as is a DataFrame

@jreback
Contributor

jreback commented Apr 27, 2016

it's not strictly necessary to inherit from Mapping to be dict-like. In fact that's the new 'way'. I don't have a problem with that actually.

@max-sixty
Contributor Author

OK great. I may have to fix some more issues then, in places where code that checks isinstance(x, dict) isn't expecting a pandas object to be treated as dict-like.

it's not strictly necessary to inherit from Mapping to be dict-like. In fact that's the new 'way'. I don't have a problem with that actually.

Can you clarify? That we check for Mapping rather than the attrs in is_dict_like? That we inherit DataFrame & Series from Mapping?

@jreback
Contributor

jreback commented Apr 27, 2016

no I think is_dict_like is just fine. It is nice to inherit from collections.Mapping, but we already have a duck-typed dict-like interface (e.g. we have the appropriate methods). I think some people will do things like:

isinstance(series, collections.Mapping) so that is the reason for that.

@max-sixty
Contributor Author

As covered here: #12056, you do need to inherit from Mapping for isinstance(series, collections.Mapping) to be True.

Other collections.abc classes, like Sized, don't need inheritance, because they recognise any class that defines the required method.

I think it would be good to have that inheritance - do you agree?
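The difference can be checked with the standard library alone. In this sketch `DuckMapping` is a hypothetical class that implements the mapping methods without inheriting from Mapping; the final step uses `Mapping.register`, which is an alternative to inheritance.

```python
from collections.abc import Mapping, Sized


class DuckMapping:
    """Hypothetical class with mapping methods but no Mapping base class."""

    def __init__(self, data):
        self._data = dict(data)

    def __getitem__(self, key):
        return self._data[key]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)

    def keys(self):
        return self._data.keys()


duck = DuckMapping({"a": 1})

# Sized recognises any class that defines __len__.
is_sized = isinstance(duck, Sized)
# Mapping performs no such structural check.
is_mapping = isinstance(duck, Mapping)

# Registering the class as a virtual subclass changes the isinstance result
# without adding Mapping's mixin methods to the class.
Mapping.register(DuckMapping)
is_mapping_after_register = isinstance(duck, Mapping)
has_mixin_items = hasattr(duck, "items")
```

Registration makes the isinstance check pass but does not supply mixin methods such as items(), so it is not a full substitute for inheritance.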

@jreback
Contributor

jreback commented Apr 27, 2016

yes I don't see a problem here

@kawochen
Contributor

It would be an API change though. Once you inherit from Mapping, users will expect the mixin methods to work as well, and there will be name conflicts, e.g. .values.
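The name conflict can be seen in a small standard-library sketch. `Labeled` is a hypothetical class used only for illustration; inheriting from Mapping gives it a `values()` method, whereas pandas objects expose `.values` as an attribute rather than a method.

```python
from collections.abc import Mapping


class Labeled(Mapping):
    """Hypothetical Mapping subclass that defines only the abstract methods."""

    def __init__(self, data):
        self._data = dict(data)

    def __getitem__(self, key):
        return self._data[key]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)


obj = Labeled({"a": 1, "b": 2})

# Mapping supplies keys(), items(), values() and get() as mixin methods.
values_is_method = callable(obj.values)
mixin_values = list(obj.values())
```

A class that already uses `values` as a plain attribute would have to either break that attribute or override the mixin in a way that violates what Mapping users expect.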

@jreback
Contributor

jreback commented Apr 27, 2016

actually that's a good point

@jorisvandenbossche
Member

@MaximilianR @jreback status of this issue?

@max-sixty
Contributor Author

#12400 is half-finished; I'll try and finish it up in the next couple of weeks

@jreback jreback modified the milestones: 0.19.0, 0.19.1 Sep 28, 2016
@jorisvandenbossche jorisvandenbossche modified the milestones: 0.20.0, 0.19.1 Oct 29, 2016
@jreback jreback modified the milestones: 0.20.0, 0.21.0, Next Major Release Mar 23, 2017
@jbrockmendel jbrockmendel added the Constructors Series/DataFrame/Index/pd.array Constructors label Jul 23, 2019
@mroeschke mroeschke added Bug and removed Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Apr 10, 2020
@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
@jbrockmendel
Member

In sanitize_array we check for a __array__ method, which xr.Dataset has, but then Dataset.__array__ raises. I think the correct thing to do here is to call obj.to_pandas() on the xarray object.
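The situation described here can be sketched without pandas or xarray. `FakeDataset` and `coerce` are hypothetical stand-ins written for this illustration; the real sanitize_array logic is more involved, and the dict returned by `to_pandas` is only a placeholder for a pandas object.

```python
class FakeDataset:
    """Hypothetical stand-in for an object whose __array__ exists but raises."""

    def __array__(self, dtype=None, copy=None):
        raise TypeError("cannot directly convert to an array")

    def to_pandas(self):
        # Placeholder for a real pandas object.
        return {"a": [1, 2]}


def coerce(obj):
    """Prefer an explicit to_pandas() conversion over the __array__ protocol."""
    if hasattr(obj, "to_pandas"):
        return obj.to_pandas()
    return obj


ds = FakeDataset()

# The presence check passes even though the call itself fails.
has_array_method = hasattr(ds, "__array__")
try:
    ds.__array__()
    array_call_succeeded = True
except TypeError:
    array_call_succeeded = False

converted = coerce(ds)
```

Checking for the attribute alone is not enough to know that conversion will succeed, which is why an explicit conversion method is the safer route.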

@jbrockmendel jbrockmendel added the Closing Candidate May be closeable, needs more eyeballs label Mar 14, 2023
@mroeschke
Member

Agreed. Since an obj.to_pandas() API exists, I don't think this needs to be handled in DataFrame.__init__, so closing.

6 participants