BUG: need better inference for path in Series construction #9456

Closed
jreback opened this issue Feb 10, 2015 · 12 comments
Labels
Bug Dtype Conversions Unexpected or buggy dtype conversions

@jreback
Contributor

jreback commented Feb 10, 2015

This hits a path in Series.__init__ that I think needs better inference:

https://github.com/pydata/pandas/blob/master/pandas/core/series.py#L178

In [1]: d = {numpy.datetime64('2015-01-07T02:00:00.000000000+0200'): 42544017.198965244,
   ...:      numpy.datetime64('2015-01-08T02:00:00.000000000+0200'): 40512335.181958228,
   ...:      numpy.datetime64('2015-01-09T02:00:00.000000000+0200'): 39712952.781494237,
   ...:      numpy.datetime64('2015-01-12T02:00:00.000000000+0200'): 39002721.453793451}

In [2]: Series(d)
Out[2]: 
2015-01-07   NaN
2015-01-08   NaN
2015-01-09   NaN
2015-01-12   NaN
dtype: float64

In [3]: Series(d.values(),d.keys())
Out[3]: 
2015-01-07    42544017.198965
2015-01-08    40512335.181958
2015-01-09    39712952.781494
2015-01-12    39002721.453793
dtype: float64

The problem is that the index is already converted at this point, and it's not easy to get the keys/values out (except by doing so explicitly, which is better IMHO).

We need a review of what currently hits this path (simply put a halt in here and see which tests hit it), then figure out a better method.
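A minimal sketch of the explicit keys/values construction mentioned above, assuming a current pandas and NumPy. The variable names are illustrative, and the timezone offsets from the original keys are dropped for simplicity:

```python
import numpy as np
import pandas as pd

# Dict keyed by numpy.datetime64 values (offsets omitted; an assumption of this sketch).
d = {
    np.datetime64("2015-01-07T00:00:00"): 42544017.198965244,
    np.datetime64("2015-01-08T00:00:00"): 40512335.181958228,
}

# Pull keys and values out in matching order, then build the Series explicitly
# so that no dict lookup against a converted index is needed.
keys = list(d.keys())
values = [d[k] for k in keys]
explicit = pd.Series(values, index=pd.DatetimeIndex(keys))
```

Because the values are passed positionally alongside their keys, the result does not depend on how the index labels hash or compare against the original dict keys.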

@jreback jreback added Bug Good as first PR Dtype Conversions Unexpected or buggy dtype conversions labels Feb 10, 2015
@jreback jreback added this to the 0.16.0 milestone Feb 10, 2015
@jreback jreback modified the milestones: 0.16.0, Next Major Release Mar 5, 2015
@patrickfournier

I am working on this issue (not at the sprint, unfortunately).

@patrickfournier

I put some traces in the elif isinstance(data, dict): block and ran the tests in pandas.tests.test_series.

  • The if isinstance(index, DatetimeIndex) block catches these tests:

    • test_from_csv
    • test_name_printing
    • test_to_dict

    However, all three throw a TypeError exception because index.astype('O') is an Index, not a NumPy ndarray.

  • The elif isinstance(index, PeriodIndex) block catches the second Series() call in test_constructor_dict.

  • The else block catches everything else. Three tests throw a TypeError exception because data is not a dict but a dict subclass:

    • test_constructor_subclass_dict
    • test_orderedDict_ctor
    • test_orderedDict_subclass_ctor

I suggest rewriting the try block like this:

    if isinstance(index, DatetimeIndex) and lib.infer_dtype(data) != 'datetime64':
        data = lib.fast_multiget(data, index.astype('O').values, default=np.nan)
    elif isinstance(index, PeriodIndex):
        data = [data.get(i, nan) for i in index]
    else:
        data = lib.fast_multiget(data, index.values, default=np.nan)

If this is not complete nonsense, I can add a test and create a pull request.
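To illustrate the general failure mode behind the NaN results, here is a pure-Python sketch. It is not pandas internals: multiget is a hypothetical stand-in for a per-label dict lookup with a default, and the date/datetime mismatch is only an analogy for dict keys that no longer match the converted index labels.

```python
import datetime
import math
from collections import OrderedDict


def multiget(mapping, labels, default=math.nan):
    """Hypothetical helper: one .get() per label, falling back to default."""
    return [mapping.get(label, default) for label in labels]


# A dict subclass keyed by dates.
data = OrderedDict([
    (datetime.date(2015, 1, 7), 42544017.198965244),
    (datetime.date(2015, 1, 8), 40512335.181958228),
])

# Labels for the same days, but of a different type (datetime, not date).
converted_labels = [datetime.datetime(2015, 1, 7), datetime.datetime(2015, 1, 8)]

missed = multiget(data, converted_labels)  # converted labels never match the keys
found = multiget(data, list(data.keys()))  # the original keys do match
```

Using .get() also works unchanged for dict subclasses such as OrderedDict, which is the case the three failing tests listed above exercise.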

@jreback
Contributor Author

jreback commented Jun 10, 2015

closed by #10269

@jreback jreback closed this as completed Jun 10, 2015
@jreback jreback modified the milestones: 0.16.2, Next Major Release Jun 10, 2015
@ruidc
Contributor

ruidc commented Aug 18, 2015

Isn't this still an issue? E.g.:

In [1]: import pandas;import numpy;import datetime;

In [2]: ix = pandas.MultiIndex.from_arrays([pandas.Index(numpy.array([datetime.date(2015,7,31)], dtype='datetime64[D]')), numpy.array([0.1], dtype='object')])

In [3]: v = {'a':0.1}

In [4]: pandas.DataFrame(v,columns=ix)
Out[4]:
Empty DataFrame
Columns: [(2015-07-31 00:00:00, 0.1)]
Index: []

In [7]: pandas.__version__
Out[7]: u'0.16.2+286.g993942e'

@jreback
Contributor Author

jreback commented Aug 18, 2015

@ruidc what exactly do you think your example above should do? If anything, I would say it should raise, as you have all scalar values.

@ruidc
Contributor

ruidc commented Aug 19, 2015

That's not what I was trying to show. I'd expect to see the values, not an empty DataFrame or NaNs:

In [6]: pandas.DataFrame(v, index=v.keys(),columns=ix)
Out[6]:
  2015-07-31
           a
a        NaN
b        NaN

@jorisvandenbossche
Member

@ruidc If you provide a dict to DataFrame(), the dict keys map to the columns, so you would not get the above in any case (for that you have to do pd.Series(v) and convert to a frame / set the name afterwards).

You get an empty DataFrame because the dict key first supplies the column name 'a', but you then pass a different column name via columns=..., which reindexes the data provided in the dict. Since that column is not in the dict, the result is an empty DataFrame.
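A short sketch of the two mappings described here, assuming a current pandas. The column name "value" and the index [0] are illustrative choices, not something from the thread:

```python
import pandas as pd

v = {"a": 0.1}

# Series: dict keys become row labels; to_frame() then names the single column.
by_rows = pd.Series(v).to_frame(name="value")

# DataFrame: dict keys become column labels; scalars need an explicit index.
by_cols = pd.DataFrame(v, index=[0])
```

In by_rows the key 'a' labels a row, while in by_cols it labels a column.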

@jorisvandenbossche
Member

But as @jreback said, you should actually get an error:

In [38]: pd.DataFrame({'a':0.1})
ValueError: If using all scalar values, you must pass an index

In [39]: pd.DataFrame({'a':0.1}, columns=['b'])
Out[39]:
Empty DataFrame
Columns: [b]
Index: []

but it seems this is not triggered when passing another column name.
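The all-scalar case can be checked with a small sketch like the following, assuming a current pandas. The flag name raised and the index [0] are illustrative:

```python
import pandas as pd

# A dict of scalars alone gives pandas no row labels, so construction fails.
try:
    pd.DataFrame({"a": 0.1})
    raised = False
except ValueError:
    raised = True

# Supplying an index tells pandas how many rows to broadcast the scalar over.
df = pd.DataFrame({"a": 0.1}, index=[0])
```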

@ruidc
Contributor

ruidc commented Aug 19, 2015

OK, I see. I made mistakes in reducing my original problem, and was further confused that the result differs between passing columns in the constructor and setting columns afterwards.
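The difference between the two approaches can be sketched as follows, assuming a current pandas. The names selected and renamed are illustrative:

```python
import pandas as pd

# columns= in the constructor selects from the dict keys; "b" is not a key,
# so no data from "a" ends up in the result.
selected = pd.DataFrame({"a": [0.1]}, columns=["b"])

# Assigning to .columns afterwards relabels the existing column and keeps its data.
renamed = pd.DataFrame({"a": [0.1]})
renamed.columns = ["b"]
```

So the constructor argument acts as a selection by label, while the later assignment is a positional rename.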

@ruidc
Contributor

ruidc commented Aug 20, 2015

Maybe this shows the problem better, although it's not specific to MultiIndex:

In [1]: import pandas;import numpy;import datetime;
In [2]: v = datetime.date.today()
In [3]: pandas.DataFrame({v : pandas.Series(range(3),index=range(3))}, columns=[v])
Out[3]:
   2015-08-20
0           0
1           1
2           2
In [4]: v = v, v
In [5]: pandas.DataFrame({v : pandas.Series(range(3),index=range(3))}, columns=[v])
Out[5]:
  (2015-08-20, 2015-08-20)
0                      NaN
1                      NaN
2                      NaN

@jreback
Contributor Author

jreback commented Aug 20, 2015

Yeah, I suppose the last is probably a bug; please create a new issue.

@ruidc
Contributor

ruidc commented Aug 20, 2015

Thanks for confirming. Done: #10863. Sorry for the uninspired title, but I think I've been looking at this particular issue for too long to be creative.

4 participants