BUG: maybe_convert_objects mixed datetimes and timedeltas #27438

BeforeFlight · 2019-07-17T18:47:51Z

corresponding discussion 27417
tests added / passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

In order to stick to maybe_convert_objects realization it was needed to introduce Seen().nat_ param. Since we want to distinguish following situations: were seen NaT or datetime or timedelta

NOTE: in current realization array full of NaT will be converted to np.datetime64.

…ut cases.

BeforeFlight · 2019-07-17T19:01:37Z

Have also ran following tests which compares new with old realization:

check_types = [None, np.nan, pd.NaT,
               True,
               np.iinfo(np.int64).min-1, -1, 1, np.iinfo(np.int64).max+1,
               np.iinfo(np.uint64).max, np.iinfo(np.uint64).max+1,
               1.1, np.float_(1.1),
               1j, np.complex_(1j),
               datetime(2000, 1, 1),
               datetime(2000, 1, 1, 0, 0, 0, 0, timezone.utc),
               np.datetime64('2000-01-01'),
               timedelta(365),
               np.timedelta64(1, 's'),
               # Decimal(1.1),
               'object'
               ]
DT_TD_NaT_inds = [2] + list(range(14, 19))


def data_gen(perms, exclude_all_is_time=False):
    data = np.array([0]*N, dtype='object')
    for perm in perms:
        if exclude_all_is_time and all(ind in DT_TD_NaT_inds for ind in perm):
            continue
        for i in range(data.shape[0]):
            data[i] = check_types[perm[i]]
        yield data.copy()


N = 3
exclude_all_is_time = False
perms = itertools.product(list(range(len(check_types))), repeat=N)
datas = data_gen(perms, exclude_all_is_time)


class Tests:
    @pytest.mark.parametrize("data", datas)
    @pytest.mark.parametrize("safe", [1])
    # @pytest.mark.parametrize("conv_DT", [1])
    # @pytest.mark.parametrize("conv_TD", [1])
    # @pytest.mark.parametrize("safe", [0, 1])
    @pytest.mark.parametrize("conv_DT", [0, 1])
    @pytest.mark.parametrize("conv_TD", [0, 1])
    def test_maybe_convert_objects(self, data, safe, conv_DT, conv_TD):
        old = lib.maybe_convert_objects_OLD(data, 1, safe, conv_DT, conv_TD)
        new = lib.maybe_convert_objects(data, 1, safe, conv_DT, conv_TD)
        print('data:', data, data.dtype,
              '\nold:', old, old.dtype,
              '\nnew:', new, new.dtype)
        if isinstance(old, np.ndarray):
            tm.assert_numpy_array_equal(old, new)
        else:
            tm.assert_index_equal(old, new)

NOTE: have not parametrized by safe flag cause it concerns only numeric.

While running this tests with exclude_all_is_time, i.e. excluding tests with generated array fully consisting of NaT or datetime or timedelta- there is no errors all tests passed.

While exclude_all_is_time disabled (full test suite) got

Results (82.92s):
    31910 passed
      90 failed

Visually inspecting all of failed all of them concerns only BUG that is being fixed.
Failed tests. In Captured stdout call section of each of the tests there are input array - data, output of old/unfixed realization of algorithm - old and output of fixed realization - new.

BeforeFlight · 2019-07-17T19:23:31Z

Also have a question that is in general non relevant but don't understand where to ask. While adding Seen.nat_ I've changed following functions as well:

pandas/pandas/_libs/lib.pyx

Lines 1045 to 1051 in 2ae5152

    
           @property 
        
           def is_bool(self): 
        
               return not (self.datetime_ or self.numeric_ or self.timedelta_) 
        
           @property 
        
           def is_float_or_complex(self): 
        
               return not (self.bool_ or self.datetime_ or self.timedelta_)

My question is why in these functions there is no .null_ nor datetimetz_ nor .object_?

jbrockmendel · 2019-07-17T21:52:05Z

can you add an xref to the top post pointing back to the discussion (even though this doesnt close that issue)

jbrockmendel · 2019-07-17T22:00:48Z

needs tests. even if they're not passing yet please push them

My question is why in these functions there is no .null_ nor datetimetz_ nor .object_?

I don't know off the top of my head. Probably check the blame to see where this was implemented?

BeforeFlight · 2019-07-17T22:04:29Z

can you add an xref to the top post pointing back to the discussion (even though this doesnt close that issue)

Check please.

needs tests. even if they're not passing yet please push them

But in order to test like I did I need both realization old and new in the lib.pyx. On my laptop I did tests -> remove old maybe_convert_objects -> push and make PR. How it should be tested then with old function removed?

In fact this is a problem I suppose. Because even my current test suite checking only against existent realization outputs not predefined hand-made outputs. And only reviewing comparisons make it feasible to catch errors. Filling correct answers for each test (even for N=2 its already at least 3200 tests) is an overkill.

BeforeFlight · 2019-07-17T22:35:10Z

One of the solution (strange) for tests I see here is to dump tests output into file -> make function changes -> use same tests in order to comparing outputs. Then if changes will be approved this file with tests output should be updated to new outputs as well.

Another one is to have 2 copy of function in the code - old one and new one in order to check new against the old. Pretty same as I did on my laptop.
Disadvantage here is that old function must be used for test purposes only (it maybe moved in other file - in test section for example). I think this is sane decision.

gfyoung · 2019-07-17T23:55:24Z

My question is why in these functions there is no .null_ nor datetimetz_ nor .object_?

@BeforeFlight : These were implemented initially to combat large number overflow issues. Good call though to use them for purposes beyond that.

jbrockmendel · 2019-07-19T01:44:31Z

@BeforeFlight possibly helpful: coverage results for lib.pyx showing which lines are not hit in the tests:

pandas/_libs/lib.pyx 1084 70 94% 307, 370, 651, 655, 658, 699-705, 707-710, 718-732, 828, 978, 983, 988, 1245-1246, 1268, 1445, 1519-1521, 1641, 1644-1646, 1696, 1718, 1889-1890, 1903, 1935, 1944, 1954-1956, 2006-2007, 2009-2010, 2022-2024, 2128-2131, 2135-2136, 2138-2139, 2143, 2145-2146, 2148-2153, 2191

grain of salt, since cython coverage results have been buggy in the past

BeforeFlight · 2019-07-20T15:55:29Z

@jbrockmendel so how should I add tests? Copy old function in tests section?

jreback

this needs tests even before a change. its very hard to tell if what you did actually changed anything.

jreback · 2019-07-20T16:45:39Z

pandas/_libs/lib.pyx

@@ -2046,11 +2049,18 @@ def maybe_convert_objects(ndarray[object] objects, bint try_float=0,
            else:
                if not seen.bool_:
                    if seen.datetime_:
-                        if not seen.numeric_:
-                            return datetimes
+                        if not seen.timedelta_:


can you make this 1 level higher elif's, its very hard to follow these branches currently

Will do. As for me - this nested "if"s are ugly even before I've changed anything. Besides it is one of the good places to make a mistake (as it happens in current case).

jreback · 2019-07-20T16:45:48Z

pandas/_libs/lib.pyx

@@ -2077,11 +2087,18 @@ def maybe_convert_objects(ndarray[object] objects, bint try_float=0,
            else:
                if not seen.bool_:
                    if seen.datetime_:
-                        if not seen.numeric_:
-                            return datetimes
+                        if not seen.timedelta_:


same as above

@jreback I've proposed simple test in corresponding issue that returns undesired output - this PR is aimed to resolve exactly such situations. Should I add this single test only? If not - maybe_convert_objects need full test suite (for all possible types combinations) but as I tried to explain earlier there will be thousands of it which is not doable 'by hands'.

we already have quite a few tests
i am ok with systematic testing but don’t go overboard

so happy to merge tests first

so happy to merge tests first

@jreback sorry but from your answer I still can't understand - is single test which is fixed by this PR enough or not?

yes this needs at least a single test

if u want to do more comprehensive tests that xfail some (that re not passing now) is ok too (can be before or after this PR)

Will add.

The problem with comprehensive tests: it cannot be done by hands -> it needs some "trusted" function to tested against i.e. needs copying old function for tests purposes only (as one of the options) - and here I don't understand may I do so or not.

jbrockmendel · 2019-07-20T17:46:26Z

so how should I add tests? Copy old function in tests section?

There should be some managably-sized set of cases that you were trying to fix here, i.e. that were previous wrong but are now right. Write those tests specifically. No need to try have a zillion tests.

BeforeFlight · 2019-07-20T17:56:35Z

No need to try have a zillion tests

I get it. Agree that for this PR bunch of tests should be enough.

But at the same time I should mark that inconsistencies of maybe_convert_objects that I've found - all was result of thoroughly testing.
And another point to mention - this is core function (well I guess). That is why such complete test suite is needed in my opinion (not zillion tests - but complete tests; count shouldn't matter). Besides this test suite may be ran only when this concrete function is amended.

jbrockmendel · 2019-07-20T17:58:41Z

Improving the thoroughness of the test suite is great, but for a bugfix PR, the tests should be specific to the bug being fixed.

jreback · 2019-07-20T20:44:36Z

pandas/tests/dtypes/test_inference.py

+    def test_maybe_convert_objects_datetime(self):
+        # GH27438
+        arr = np.array([0, 0], dtype=object)
+        arr[0] = np.datetime64('2000-01-01')


this construction seems odd, you can just construct directly no? e.g.
arr = np.array([np.datetime...., np.timedetla.lll], dtype='object'])

don't you need a NaT somewhere? what about throwing in a np.nan as well (maybe another case)

Construction is odd, yes - will rewrite (it's leftover of previous tests sorry). Will add tests with NaT as well.

As for np.nan - not quite understand. Add something with datetimes and np.nan?

yes we likely already have a test for this, but pls check

@jreback all is green please check.

Now there are 10 tests for maybe_convert_objects + 13 tests through Dataframe constructor (there are also tests for maybe_convert_numeric but I did not count these).

By the way - no one confirm my changes with array full of NaT with both convert_DT and convert_TD checks as True (elif seen.nat_: case) - now it returns datetime without any warning of ambiguity with timedelta. Old behaviour was to return object (without considering flags state IIRC). Check it pls too.

BeforeFlight · 2019-07-21T10:54:29Z

Should I just remove this NaTs ambiguity for now and return object? Cause this concrete PR is somewhat not relative to this NaTs ambiguity problem.

jbrockmendel · 2019-07-21T15:38:11Z

Should I just remove this NaTs ambiguity for now and return object? Cause this concrete PR is somewhat not relative to this NaTs ambiguity problem.

Between here and the issue you're bringing up a lot of questions and I'm having trouble keeping track of them. To the extent that you can focus on one unambiguous bug at a time, that will make things easier for reviewers.

jreback · 2019-07-23T22:08:59Z

@BeforeFlight lgtm. does this have any user facing effects at all? meaning is there an incorrect Series/DataFrame construction currently?

BeforeFlight · 2019-07-23T22:25:27Z

Have to check (I'm not aware of all calling to maybe_convert_object) but if it used by some constructor directly - it might produce undesired results. The example that leads to this bug explains what I mean:

import numpy as np
import pandas as pd
import pandas._libs.lib as lib

data = np.array([np.datetime64('2000-01-01'), np.timedelta64(1, 's')], dtype='object')
out = lib.maybe_convert_objects(data,1,1,1,1)
print(out, out.dtype)

Output:

['2000-01-01T00:00:00.000000000' '1970-01-01T00:00:00.021389104'] datetime64[ns]

As I already mention at issue branch - datetime array is actually not initialized 2nd value at all (left it as it is after np.empty) -> might be any other in theory.

Will try to reproduce it on some constructor.

EDIT
Straightforward DataFrame...MultiIngex runs smoothly - somehow overcome such input and returns object. Need some time to investigate it why. Will do.

BeforeFlight · 2019-07-24T00:48:35Z

No, seems like there is no any effects for user at all in current code state. The reason here is that in order to produce such bug both flags convert_datetime and convert_timedelta must be True. But I've search and visually inspect all lib.maybe_convert_objects calls in pandas folder - there is no such case. Only separate flags convert_datetime or convert_timedelta but not both at same time.

jreback · 2019-07-24T12:05:52Z

thanks @BeforeFlight

…#27438)

BUG: maybe_convert_objects mixed datetimes and timedeltas in some inp…

9f0c986

…ut cases.

gfyoung added Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Numeric Operations Arithmetic, Comparison, and Logical operations labels Jul 17, 2019

jreback requested changes Jul 20, 2019

View reviewed changes

BeforeFlight added 2 commits July 20, 2019 23:02

Rewrite nested ifs. Test added.

f005531

Some fixes.

e109dbf

jreback requested changes Jul 20, 2019

View reviewed changes

BeforeFlight added 2 commits July 21, 2019 00:23

Tests with np.nan, pd.NaT added.

034f729

Linter fixes.

d416771

Array full of NaT ambiguity TODO.

f2f265e

jreback added this to the 1.0 milestone Jul 23, 2019

jreback approved these changes Jul 23, 2019

View reviewed changes

jreback merged commit 4d9016e into pandas-dev:master Jul 24, 2019

BeforeFlight deleted the maybe_convert_objects_BUG branch July 25, 2019 19:56

quintusdias pushed a commit to quintusdias/pandas_dev that referenced this pull request Aug 16, 2019

BUG: maybe_convert_objects mixed datetimes and timedeltas (pandas-dev…

4ac34d6

…#27438)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: maybe_convert_objects mixed datetimes and timedeltas #27438

BUG: maybe_convert_objects mixed datetimes and timedeltas #27438

BeforeFlight commented Jul 17, 2019 •

edited

Loading

BeforeFlight commented Jul 17, 2019 •

edited

Loading

BeforeFlight commented Jul 17, 2019

jbrockmendel commented Jul 17, 2019

jbrockmendel commented Jul 17, 2019

BeforeFlight commented Jul 17, 2019 •

edited

Loading

BeforeFlight commented Jul 17, 2019 •

edited

Loading

gfyoung commented Jul 17, 2019 •

edited

Loading

jbrockmendel commented Jul 19, 2019

BeforeFlight commented Jul 20, 2019

jreback left a comment

jreback Jul 20, 2019

BeforeFlight Jul 20, 2019

jreback Jul 20, 2019

BeforeFlight Jul 20, 2019 •

edited

Loading

jreback Jul 20, 2019

BeforeFlight Jul 20, 2019

jreback Jul 20, 2019

BeforeFlight Jul 20, 2019

jbrockmendel commented Jul 20, 2019

BeforeFlight commented Jul 20, 2019

jbrockmendel commented Jul 20, 2019

jreback Jul 20, 2019

jreback Jul 20, 2019

BeforeFlight Jul 20, 2019

jreback Jul 20, 2019

BeforeFlight Jul 20, 2019 •

edited

Loading

BeforeFlight Jul 20, 2019 •

edited

Loading

BeforeFlight commented Jul 21, 2019 •

edited

Loading

jbrockmendel commented Jul 21, 2019

jreback commented Jul 23, 2019

BeforeFlight commented Jul 23, 2019 •

edited

Loading

BeforeFlight commented Jul 24, 2019

jreback commented Jul 24, 2019

BUG: maybe_convert_objects mixed datetimes and timedeltas #27438

BUG: maybe_convert_objects mixed datetimes and timedeltas #27438

Conversation

BeforeFlight commented Jul 17, 2019 • edited Loading

BeforeFlight commented Jul 17, 2019 • edited Loading

BeforeFlight commented Jul 17, 2019

jbrockmendel commented Jul 17, 2019

jbrockmendel commented Jul 17, 2019

BeforeFlight commented Jul 17, 2019 • edited Loading

BeforeFlight commented Jul 17, 2019 • edited Loading

gfyoung commented Jul 17, 2019 • edited Loading

jbrockmendel commented Jul 19, 2019

BeforeFlight commented Jul 20, 2019

jreback left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

BeforeFlight Jul 20, 2019 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jbrockmendel commented Jul 20, 2019

BeforeFlight commented Jul 20, 2019

jbrockmendel commented Jul 20, 2019

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

BeforeFlight Jul 20, 2019 • edited Loading

Choose a reason for hiding this comment

BeforeFlight Jul 20, 2019 • edited Loading

Choose a reason for hiding this comment

BeforeFlight commented Jul 21, 2019 • edited Loading

jbrockmendel commented Jul 21, 2019

jreback commented Jul 23, 2019

BeforeFlight commented Jul 23, 2019 • edited Loading

BeforeFlight commented Jul 24, 2019

jreback commented Jul 24, 2019

BeforeFlight commented Jul 17, 2019 •

edited

Loading

BeforeFlight commented Jul 17, 2019 •

edited

Loading

BeforeFlight commented Jul 17, 2019 •

edited

Loading

BeforeFlight commented Jul 17, 2019 •

edited

Loading

gfyoung commented Jul 17, 2019 •

edited

Loading

BeforeFlight Jul 20, 2019 •

edited

Loading

BeforeFlight Jul 20, 2019 •

edited

Loading

BeforeFlight Jul 20, 2019 •

edited

Loading

BeforeFlight commented Jul 21, 2019 •

edited

Loading

BeforeFlight commented Jul 23, 2019 •

edited

Loading