Skip to content

BUG: pd.to_timedelta handles missing data #5438

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Nov 6, 2013

Conversation

danielballan
Copy link
Contributor

closes #5437

Updated: Mostly following @jreback's suggestion, but we need np.timedelta64('NaT'), not pd.tslib.iNaT, because numpy cannot cast np.datetime64('NaT') as np.timedelta64 type, and throws an AssertionError.

In [1]: pd.to_timedelta(Series(['00:00:01', '00:00:02']))
Out[1]: 
0   00:00:01
1   00:00:02
dtype: timedelta64[ns]

In [2]: pd.to_timedelta(Series(['00:00:01', np.nan]))
Out[2]: 
0   00:00:01
1        NaT
dtype: timedelta64[ns]

In [3]: pd.to_timedelta(Series(['00:00:01', pd.NaT]))
Out[3]: 
0   00:00:01
1        NaT
dtype: timedelta64[ns]

@danielballan
Copy link
Contributor Author

Another issue lurking...

In [76]: Series([np.timedelta64('NaT')], dtype='<m8[ns]') + pd.to_timedelta('00:00:01')
Out[76]: 
0   NaT
dtype: timedelta64[ns]

In [77]: DataFrame([np.timedelta64('NaT')], dtype='<m8[ns]') + pd.to_timedelta('00:00:01')
Out[77]: 
                              0
0 -106751 days, 23:47:15.854776

@jreback
Copy link
Contributor

jreback commented Nov 5, 2013

fyi....this test should hold as well:

pd.to_datetime(list_of_strings) == Series(list_of_strings).apply(pd.to_datetime)

@jreback
Copy link
Contributor

jreback commented Nov 5, 2013

This patch will fix the problem from above

need a sequence of tests (you can do in a single method if y ou want, just need these combinations).

x op y

where op in +, -
x and y in Series(valid_td), Series(NaT), Frame(valid_td), Frame(NaT), NaT, timedelta64 scalar

diff --git a/pandas/core/internals.py b/pandas/core/internals.py
index 62aa95d..209b960 100644
--- a/pandas/core/internals.py
+++ b/pandas/core/internals.py
@@ -1162,15 +1162,25 @@ class TimeDeltaBlock(IntBlock):

     def _try_coerce_args(self, values, other):
         """ provide coercion to our input arguments
-            we are going to compare vs i8, so coerce to integer
-            values is always ndarra like, other may not be """
-        values = values.view('i8')
+            we are going to compare vs i8, so coerce to floats
+            repring NaT with np.nan so nans propogate
+            values is always ndarray like, other may not be """
+        def masker(v):
+            mask = isnull(v)
+            v = v.view('i8').astype('float64')
+            v[mask] = np.nan
+            return v
+
+        values = masker(values)
+
         if isnull(other) or (np.isscalar(other) and other == tslib.iNaT):
-            other = tslib.iNaT
+            other = np.nan
         elif isinstance(other, np.timedelta64):
             other = _coerce_scalar_to_timedelta_type(other,unit='s').item()
+            if other == tslib.iNaT:
+                other = np.nan
         else:
-            other = other.view('i8')
+            other = masker(v)

         return values, other

@@ -84,6 +85,8 @@ def conv(v):
r = conv(r)
elif r == tslib.iNaT:
return r
elif pd.isnull(r):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just import isnull from pandas.core.common

@danielballan
Copy link
Contributor Author

Take a look at the comments near my tests. There is still one major issue:

In [6]: s1
Out[6]:
0   00:00:01
dtype: timedelta64[ns]

In [7]: df1
Out[7]:
         0
0 00:00:01

In [8]: s1 + df1
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-8-48d4d6969771> in <module>()
----> 1 s1 + df1

/Users/danielallan/Documents/Repos/pandas-danielballan/pandas/core/ops.pyc in wrapper(left, right, name)
    453     def wrapper(left, right, name=name):
    454
--> 455         time_converted = _TimeOp.maybe_convert_for_time_op(left, right, name)
    456
    457         if time_converted is None:

/Users/danielallan/Documents/Repos/pandas-danielballan/pandas/core/ops.pyc in maybe_convert_for_time_op(cls, left, right, name)
    422         if name.startswith('__r'):
    423             name = "__" + name[3:]
--> 424         return cls(left, right, name)
    425
    426

/Users/danielallan/Documents/Repos/pandas-danielballan/pandas/core/ops.pyc in __init__(self, left, right, name)
    256
    257         lvalues = self._convert_to_array(left, name=name)
--> 258         rvalues = self._convert_to_array(right, name=name)
    259
    260         self.is_timedelta_lhs = com.is_timedelta64_dtype(left)

/Users/danielallan/Documents/Repos/pandas-danielballan/pandas/core/ops.pyc in _convert_to_array(self, values, name)
    338         elif inferred_type == 'integer':
    339             # py3 compat where dtype is 'm' but is an integer
--> 340             if values.dtype.kind == 'm':
    341                 values = values.astype('timedelta64[ns]')
    342             elif isinstance(values, pd.PeriodIndex):

/Users/danielallan/Documents/Repos/pandas-danielballan/pandas/core/generic.pyc in __getattr__(self, name)
   1504             return self[name]
   1505         raise AttributeError("'%s' object has no attribute '%s'" %
-> 1506                              (type(self).__name__, name))
   1507
   1508     def __setattr__(self, name, value):

AttributeError: 'DataFrame' object has no attribute 'dtype'


# TODO: This should really pass.
# actual = s1 + NA
# assert_series_equal(actual, sn)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's wrong here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Traceback (most recent call last):
  File "/Users/danielallan/Documents/Repos/pandas-danielballan/pandas/tseries/tests/test_timedeltas.py", line 253, in test_timedelta_ops_with_missing_values
    actual = s1 + NA
  File "/Users/danielallan/Documents/Repos/pandas-danielballan/pandas/core/ops.py", line 455, in wrapper
    time_converted = _TimeOp.maybe_convert_for_time_op(left, right, name)
  File "/Users/danielallan/Documents/Repos/pandas-danielballan/pandas/core/ops.py", line 424, in maybe_convert_for_time_op
    return cls(left, right, name)
  File "/Users/danielallan/Documents/Repos/pandas-danielballan/pandas/core/ops.py", line 258, in __init__
    rvalues = self._convert_to_array(right, name=name)
  File "/Users/danielallan/Documents/Repos/pandas-danielballan/pandas/core/ops.py", line 359, in _convert_to_array
    " operation".format(pa.array(values).dtype))
TypeError: incompatible type [float64] for a datetime/timedelta operation

@jreback
Copy link
Contributor

jreback commented Nov 5, 2013

I did a PR on your branch...incorporate the last commit..and should be good to go (though may need a couple of comparisons for the fixed tests)

@danielballan
Copy link
Contributor Author

Thanks.

I may have screwed up your PR. Thought I merged it, now don't see it....

@danielballan
Copy link
Contributor Author

nm, I see what I did. Thanks again for handling the hard parts of this. Stand by.

@danielballan
Copy link
Contributor Author

That worked.

I added a test to check

pd.to_datetime(list_of_strings) == Series(list_of_strings).apply(pd.to_datetime)

and I think apply is returning objects instead of timedelta64[ns]. Is this another case of taking the wrong path?

FAIL: test_apply_to_timedelta (pandas.tseries.tests.test_timedeltas.TestTimedeltas)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/danielallan/Documents/Repos/pandas-danielballan/pandas/tseries/tests/test_timedeltas.py", line 302, in test_apply_to_datetime
    assert_series_equal(a, b)
  File "/Users/danielallan/Documents/Repos/pandas-danielballan/pandas/util/testing.py", line 417, in assert_series_equal
    assert_attr_equal('dtype', left, right)
  File "/Users/danielallan/Documents/Repos/pandas-danielballan/pandas/util/testing.py", line 401, in assert_attr_equal
    assert_equal(left_attr,right_attr,"attr is not equal [{0}]" .format(attr))
  File "/Users/danielallan/Documents/Repos/pandas-danielballan/pandas/util/testing.py", line 384, in assert_equal
    assert a == b, "%s: %r != %r" % (msg.format(a,b), a, b)
AssertionError: attr is not equal [dtype]: dtype('<m8[ns]') != dtype('O')

@jreback
Copy link
Contributor

jreback commented Nov 5, 2013

the apply on the series is a separate and thorny topic, basically lib.maybe_convert_objects needs to be modified to return timedelta64[ns] if its detected, so this a separate issue. you can leave the tests in (but you can't really compare them), it 'works', but is the wrong dtype. Make separate issue (and separate PR) if you are so inclined.

@danielballan
Copy link
Contributor Author

I'll defer for now. This is good enough to be usable.

If the latest build passes (it should) this is ready.

@jreback
Copy link
Contributor

jreback commented Nov 5, 2013

gr8.....go ahead an create the other issue though (FYI look at the test in tests/test_frame.py/test_operators_timedelta64 where I have exactly this problem (even works because it a mixed array)

@jreback
Copy link
Contributor

jreback commented Nov 5, 2013

you are going to have to do the skip if np < 1.7 for most/all of those tests. you can possibly check if you should raise in the actual ops....in theory it should work, but not 100 %...numpy 1.6.2 is a bear

@danielballan
Copy link
Contributor Author

I pre-emptively used _skip_if_numpy_unfriendly() on all of them. Seems to me, if you're using timedeltas on np < 1.7, you're asking your trouble.

@danielballan
Copy link
Contributor Author

@jreback Looks like some datetime tests were broken. Will return to this tomorrow.

@jreback
Copy link
Contributor

jreback commented Nov 6, 2013

@danielballan I did a PR to your branch to fix the last bit of tests....

@danielballan
Copy link
Contributor Author

I put this on your PR, @jreback, but I'm duplicating it here:

One error left:

ERROR: test_timedelta_ops_with_missing_values (pandas.tseries.tests.test_timedeltas.TestTimedeltas)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/dallan/pandas-danielballan/pandas/tseries/tests/test_timedeltas.py", line 247, in test_timedelta_ops_with_missing_values
    actual = s1 + timedelta_NaT
  File "/home/dallan/pandas-danielballan/pandas/core/ops.py", line 473, in wrapper
    time_converted = _TimeOp.maybe_convert_for_time_op(left, right, name)
  File "/home/dallan/pandas-danielballan/pandas/core/ops.py", line 440, in maybe_convert_for_time_op
    return cls(left, right, name)
  File "/home/dallan/pandas-danielballan/pandas/core/ops.py", line 258, in __init__
    rvalues = self._convert_to_array(right, name=name, other=lvalues)
  File "/home/dallan/pandas-danielballan/pandas/core/ops.py", line 375, in _convert_to_array
    " operation".format(pa.array(values).dtype))
TypeError: incompatible type [timedelta64[ns]] for a datetime/timedelta operation

----------------------------------------------------------------------
Ran 829 tests in 43.246s

FAILED (SKIP=16, errors=1)

@jreback
Copy link
Contributor

jreback commented Nov 6, 2013

where is this failing? the current pr passes

@danielballan
Copy link
Contributor Author

My local copy wasn't quite right. Fine now. Ready to merge.

@danielballan
Copy link
Contributor Author

^ This is just a rebase to flatten.

@jtratner
Copy link
Contributor

jtratner commented Nov 6, 2013

+1 on nice hisrory

jreback added a commit that referenced this pull request Nov 6, 2013
BUG: pd.to_timedelta handles missing data
@jreback jreback merged commit 72a0c43 into pandas-dev:master Nov 6, 2013
@danielballan danielballan deleted the to-timedelta-missing-data branch November 6, 2013 23:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: pd.to_timedelta raises instead of returning NaT
3 participants