Skip to content

COMPAT: Objects construction compat with xarray.Dataset #12400

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

max-sixty
Copy link
Contributor

I'm changed the dict / mapping issue, but I can't get this part working (so the tests fail at the moment):

But pd.Series(dict(xr.Dataset({'a': 2, 'b': 3}))) doesn't work either, because the values in the dict are 0-th dimension arrays, and Series doesn't unpack them (list(dict(xr.Dataset({'a': 2, 'b': 3})).values())[0] for the full code)
#12353 (comment)

Is that a reasonable goal? Where would be a good place to do the conversion?

@max-sixty max-sixty changed the title Series dataset compat COMPAT: Series construction compat with xarray.Dataset Feb 20, 2016
from pandas.util.testing import assert_series_equal
import pandas.util.testing as tm

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was unintentional - PyCharm - although it is nicer now...

@max-sixty max-sixty force-pushed the series-dataset-compat branch 2 times, most recently from 6190d10 to 99f1ee7 Compare March 23, 2016 11:36
@jreback
Copy link
Contributor

jreback commented Mar 23, 2016

@MaximilianR since you are fixing this, thanks!

can you incoporate anything from my original PR that is needed: #12356

@jreback jreback added Reshaping Concat, Merge/Join, Stack/Unstack, Explode Compat pandas objects compatability with Numpy or Python functions labels Mar 23, 2016
@max-sixty max-sixty force-pushed the series-dataset-compat branch from 2f3b8de to 2477347 Compare March 24, 2016 23:29
@max-sixty max-sixty force-pushed the series-dataset-compat branch 2 times, most recently from ad80374 to 539ccd5 Compare April 6, 2016 23:11
@max-sixty max-sixty changed the title COMPAT: Series construction compat with xarray.Dataset COMPAT: Objects construction compat with xarray.Dataset Apr 22, 2016
@max-sixty max-sixty force-pushed the series-dataset-compat branch from 539ccd5 to dddb479 Compare April 22, 2016 06:17
@max-sixty max-sixty force-pushed the series-dataset-compat branch 4 times, most recently from c2d833c to 97bac68 Compare May 3, 2016 16:22
@max-sixty
Copy link
Contributor Author

max-sixty commented May 3, 2016

@jreback Ready to go, less a what's new note and removing a commented out test (this is why the build fails).

Apologies for mixing formatting & functional improvements.

Needs a look through from you, in particular the to_dataframe check.

result = Series(xr.Dataset(d))
expected = Series(d)

# disabled for the moment (will remove from PR)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because of the issues with 0-d arrays discussed in the issue

@jreback
Copy link
Contributor

jreback commented May 13, 2016

@MaximilianR if you'd like to rebase/update

@max-sixty max-sixty force-pushed the series-dataset-compat branch from 97bac68 to 836672a Compare May 17, 2016 04:23
@max-sixty
Copy link
Contributor Author

Rebased.
@jreback, awaiting your review

from pandas import compat
from pandas.compat import (long, zip, map, string_types,
iteritems)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

put things back in this file as these are not substantive changes (and things getting changed elsewhere)

@jreback
Copy link
Contributor

jreback commented May 17, 2016

you have some setting on your editor which just messes with existing (correct) linting / formatting. Pls turn that off. Makes it much harder to see what is changing.

Overall I think this approach is nice. .to_dataframe() is a good approach / very duck-like.

@max-sixty
Copy link
Contributor Author

I've been remiss about the final fixes here. Will need a couple of weeks on other commitments but I will get to this

@@ -217,11 +217,16 @@ def __init__(self, data=None, index=None, columns=None, dtype=None,
if isinstance(data, DataFrame):
data = data._data

if hasattr(data, 'to_dataframe'): # xr.Dataset
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would slightly rather skip this special check. In theory, to_dataframe might return something else.

I don't think it will be that much slower to use the generic dict path, though we then run into the issue that DataFrame._init_dict does a special check for OrderedDict when deciding whether or not to order the keys. Sadly there's no way in check whether an arbitrary Mapping type has intentionally ordered keys or not (Python-ideas discussed adding collections.abc.Ordered but I don't think it was implemented).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The broader question is whether we want objects to be able to define their own conversion to a DataFrame - I think the main options are:

  1. Everything is treated as a dict
  2. We build the code within the DataFrame constructor - i.e. that if statement checks for DataSet (and anything else we want to check)
  3. There is some duck-like method such as to_dataframe that classes can define themselves

Thoughts?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm happy with a duck-like method, but we should probably call it something very explicit like _to_pandas_dataframe_. This would be useful for a lot of other projects, not just xarray.

@max-sixty max-sixty force-pushed the series-dataset-compat branch from 836672a to 969d0c6 Compare August 25, 2016 01:11
@max-sixty max-sixty force-pushed the series-dataset-compat branch from 969d0c6 to d1b6c62 Compare August 25, 2016 01:17
@jreback
Copy link
Contributor

jreback commented Oct 6, 2016

@MaximilianR ?

@max-sixty
Copy link
Contributor Author

Apologies! I will do soon. Please don't close.

@jreback
Copy link
Contributor

jreback commented Dec 21, 2016

@MaximilianR want to rebase?

I would be onboard with allowing a special method __dataframe__ which is pretty clear IMHO.

@shoyer
Copy link
Member

shoyer commented Dec 21, 2016

I would be onboard with allowing a special method dataframe which is pretty clear IMHO.

At this point there are a number of alternative "dataframe" objects (e.g., dask, spark), many of which probably want to be coercible to pandas, so I think the special method should have pandas in the name to clarify. Hence my suggestion for _to_pandas_dataframe_.

@jreback
Copy link
Contributor

jreback commented Jan 21, 2017

status?

@jreback
Copy link
Contributor

jreback commented Mar 20, 2017

can you rebase / update

@max-sixty
Copy link
Contributor Author

Yes, thanks for your patience.

What do we want to name the method? __dataframe__? _to_dataframe?

@jreback
Copy link
Contributor

jreback commented Mar 20, 2017

__dataframe__. I don't think it matters that others have dataframes. pandas is the standard in python, just as numpy is for arrays.

@shoyer
Copy link
Member

shoyer commented Mar 20, 2017

I feel pretty strongly that the name should not just be __dataframe__ and we should have pandas in it somewhere.

If we were starting over with NumPy, the name probably wouldn't be __array__, either.

@jreback
Copy link
Contributor

jreback commented Mar 20, 2017

I don't remember If I suggested the special method, but what is wrong with .to_dataframe()? (or .to_pandas()?

@shoyer
Copy link
Member

shoyer commented Mar 20, 2017

I don't remember If I suggested the special method, but what is wrong with .to_dataframe()? (or .to_pandas()?

The main reason is that it's totally conceivable that some application might use these for something else, e.g., to_dataframe() might make a spark dataframe instead. In fact, xarray sometimes uses to_pandas() to create a pandas.Series (if the array is 1D). The advantage of looking for a special method with a name like _to_pandas_dataframe_ is that there is almost no chance that it will be used incorrectly.

@jreback
Copy link
Contributor

jreback commented Mar 20, 2017

@shoyer

Is this supposed to work?

In [30]: x = DataFrame({'A': [1, 2, 3], 'B': [3., 4., 5.]}).to_xarray()

In [31]: x
Out[31]: 
<xarray.Dataset>
Dimensions:  (index: 3)
Coordinates:
  * index    (index) int64 0 1 2
Data variables:
    A        (index) int64 1 2 3
    B        (index) float64 3.0 4.0 5.0

In [32]: np.array(x)
Out[32]: 
array(['index', 'A', 'B'], 
      dtype='<U5')

@shoyer
Copy link
Member

shoyer commented Mar 21, 2017

@jreback xarray.Dataset is not meant to be convertable to a NumPy array, so it doesn't define __array__. But it does subclass collections.Mapping, so NumPy apparently manages to convert it by iterating over the keys, like this class:

class WrapMapping(collections.Mapping):
  def __init__(self, mapping):
    self._map = mapping
  def __getitem__(self, key):
    return self._map[key]
  def __iter__(self):
    return iter(self._map)
  def __len__(self):
    return len(self._map)

>>> np.array(WrapMapping({'a': 1, 'b': 2}))
array(['b', 'a'], 
      dtype='|S1')

This is totally useless though, so I'll make a note to implement a __array__ for Dataset that just raises TypeError with a helpful error message.

@jreback
Copy link
Contributor

jreback commented Mar 21, 2017

@shoyer gotcha, ok, then I guess __pandas_dataframe__ is prob reasonable for external converters.

@jreback
Copy link
Contributor

jreback commented May 7, 2017

@MaximilianR rebase?

@jreback
Copy link
Contributor

jreback commented Jun 10, 2017

can you rebase and update?

@jreback
Copy link
Contributor

jreback commented Aug 17, 2017

closing as stale

@jreback jreback closed this Aug 17, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Compat pandas objects compatability with Numpy or Python functions Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Supplying an xarray Dataset to DataFrame constructor breaks
3 participants