Skip to content

BUG: maybe_convert_objects mixed datetimes and timedeltas #27438

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Jul 24, 2019
Merged

BUG: maybe_convert_objects mixed datetimes and timedeltas #27438

merged 6 commits into from
Jul 24, 2019

Conversation

BeforeFlight
Copy link
Contributor

@BeforeFlight BeforeFlight commented Jul 17, 2019

  • corresponding discussion 27417
  • tests added / passed
  • passes black pandas
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

In order to stick to maybe_convert_objects realization it was needed to introduce Seen().nat_ param. Since we want to distinguish following situations: were seen NaT or datetime or timedelta

NOTE: in current realization array full of NaT will be converted to np.datetime64.

@BeforeFlight
Copy link
Contributor Author

BeforeFlight commented Jul 17, 2019

Have also ran following tests which compares new with old realization:

check_types = [None, np.nan, pd.NaT,
               True,
               np.iinfo(np.int64).min-1, -1, 1, np.iinfo(np.int64).max+1,
               np.iinfo(np.uint64).max, np.iinfo(np.uint64).max+1,
               1.1, np.float_(1.1),
               1j, np.complex_(1j),
               datetime(2000, 1, 1),
               datetime(2000, 1, 1, 0, 0, 0, 0, timezone.utc),
               np.datetime64('2000-01-01'),
               timedelta(365),
               np.timedelta64(1, 's'),
               # Decimal(1.1),
               'object'
               ]
DT_TD_NaT_inds = [2] + list(range(14, 19))


def data_gen(perms, exclude_all_is_time=False):
    data = np.array([0]*N, dtype='object')
    for perm in perms:
        if exclude_all_is_time and all(ind in DT_TD_NaT_inds for ind in perm):
            continue
        for i in range(data.shape[0]):
            data[i] = check_types[perm[i]]
        yield data.copy()


N = 3
exclude_all_is_time = False
perms = itertools.product(list(range(len(check_types))), repeat=N)
datas = data_gen(perms, exclude_all_is_time)


class Tests:
    @pytest.mark.parametrize("data", datas)
    @pytest.mark.parametrize("safe", [1])
    # @pytest.mark.parametrize("conv_DT", [1])
    # @pytest.mark.parametrize("conv_TD", [1])
    # @pytest.mark.parametrize("safe", [0, 1])
    @pytest.mark.parametrize("conv_DT", [0, 1])
    @pytest.mark.parametrize("conv_TD", [0, 1])
    def test_maybe_convert_objects(self, data, safe, conv_DT, conv_TD):
        old = lib.maybe_convert_objects_OLD(data, 1, safe, conv_DT, conv_TD)
        new = lib.maybe_convert_objects(data, 1, safe, conv_DT, conv_TD)
        print('data:', data, data.dtype,
              '\nold:', old, old.dtype,
              '\nnew:', new, new.dtype)
        if isinstance(old, np.ndarray):
            tm.assert_numpy_array_equal(old, new)
        else:
            tm.assert_index_equal(old, new)

NOTE: have not parametrized by safe flag cause it concerns only numeric.

While running this tests with exclude_all_is_time, i.e. excluding tests with generated array fully consisting of NaT or datetime or timedelta- there is no errors all tests passed.

While exclude_all_is_time disabled (full test suite) got

Results (82.92s):
    31910 passed
      90 failed

Visually inspecting all of failed all of them concerns only BUG that is being fixed.
Failed tests. In Captured stdout call section of each of the tests there are input array - data, output of old/unfixed realization of algorithm - old and output of fixed realization - new.

@BeforeFlight
Copy link
Contributor Author

Also have a question that is in general non relevant but don't understand where to ask. While adding Seen.nat_ I've changed following functions as well:

pandas/pandas/_libs/lib.pyx

Lines 1045 to 1051 in 2ae5152

@property
def is_bool(self):
return not (self.datetime_ or self.numeric_ or self.timedelta_)
@property
def is_float_or_complex(self):
return not (self.bool_ or self.datetime_ or self.timedelta_)

My question is why in these functions there is no .null_ nor datetimetz_ nor .object_?

@jbrockmendel
Copy link
Member

can you add an xref to the top post pointing back to the discussion (even though this doesnt close that issue)

@jbrockmendel
Copy link
Member

needs tests. even if they're not passing yet please push them

My question is why in these functions there is no .null_ nor datetimetz_ nor .object_?

I don't know off the top of my head. Probably check the blame to see where this was implemented?

@BeforeFlight
Copy link
Contributor Author

BeforeFlight commented Jul 17, 2019

can you add an xref to the top post pointing back to the discussion (even though this doesnt close that issue)

Check please.

needs tests. even if they're not passing yet please push them

But in order to test like I did I need both realization old and new in the lib.pyx. On my laptop I did tests -> remove old maybe_convert_objects -> push and make PR. How it should be tested then with old function removed?

In fact this is a problem I suppose. Because even my current test suite checking only against existent realization outputs not predefined hand-made outputs. And only reviewing comparisons make it feasible to catch errors. Filling correct answers for each test (even for N=2 its already at least 3200 tests) is an overkill.

@BeforeFlight
Copy link
Contributor Author

BeforeFlight commented Jul 17, 2019

One of the solution (strange) for tests I see here is to dump tests output into file -> make function changes -> use same tests in order to comparing outputs. Then if changes will be approved this file with tests output should be updated to new outputs as well.

Another one is to have 2 copy of function in the code - old one and new one in order to check new against the old. Pretty same as I did on my laptop.
Disadvantage here is that old function must be used for test purposes only (it maybe moved in other file - in test section for example). I think this is sane decision.

@gfyoung gfyoung added Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Numeric Operations Arithmetic, Comparison, and Logical operations labels Jul 17, 2019
@gfyoung
Copy link
Member

gfyoung commented Jul 17, 2019

My question is why in these functions there is no .null_ nor datetimetz_ nor .object_?

@BeforeFlight : These were implemented initially to combat large number overflow issues. Good call though to use them for purposes beyond that.

@jbrockmendel
Copy link
Member

@BeforeFlight possibly helpful: coverage results for lib.pyx showing which lines are not hit in the tests:

pandas/_libs/lib.pyx 1084 70 94% 307, 370, 651, 655, 658, 699-705, 707-710, 718-732, 828, 978, 983, 988, 1245-1246, 1268, 1445, 1519-1521, 1641, 1644-1646, 1696, 1718, 1889-1890, 1903, 1935, 1944, 1954-1956, 2006-2007, 2009-2010, 2022-2024, 2128-2131, 2135-2136, 2138-2139, 2143, 2145-2146, 2148-2153, 2191

grain of salt, since cython coverage results have been buggy in the past

@BeforeFlight
Copy link
Contributor Author

@jbrockmendel so how should I add tests? Copy old function in tests section?

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this needs tests even before a change. its very hard to tell if what you did actually changed anything.

@@ -2046,11 +2049,18 @@ def maybe_convert_objects(ndarray[object] objects, bint try_float=0,
else:
if not seen.bool_:
if seen.datetime_:
if not seen.numeric_:
return datetimes
if not seen.timedelta_:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you make this 1 level higher elif's, its very hard to follow these branches currently

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do. As for me - this nested "if"s are ugly even before I've changed anything. Besides it is one of the good places to make a mistake (as it happens in current case).

@@ -2077,11 +2087,18 @@ def maybe_convert_objects(ndarray[object] objects, bint try_float=0,
else:
if not seen.bool_:
if seen.datetime_:
if not seen.numeric_:
return datetimes
if not seen.timedelta_:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above

Copy link
Contributor Author

@BeforeFlight BeforeFlight Jul 20, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback I've proposed simple test in corresponding issue that returns undesired output - this PR is aimed to resolve exactly such situations. Should I add this single test only? If not - maybe_convert_objects need full test suite (for all possible types combinations) but as I tried to explain earlier there will be thousands of it which is not doable 'by hands'.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we already have quite a few tests
i am ok with systematic testing but don’t go overboard

so happy to merge tests first

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so happy to merge tests first

@jreback sorry but from your answer I still can't understand - is single test which is fixed by this PR enough or not?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes this needs at least a single test

if u want to do more comprehensive tests that xfail some (that re not passing now) is ok too (can be before or after this PR)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will add.

The problem with comprehensive tests: it cannot be done by hands -> it needs some "trusted" function to tested against i.e. needs copying old function for tests purposes only (as one of the options) - and here I don't understand may I do so or not.

@jbrockmendel
Copy link
Member

so how should I add tests? Copy old function in tests section?

There should be some managably-sized set of cases that you were trying to fix here, i.e. that were previous wrong but are now right. Write those tests specifically. No need to try have a zillion tests.

@BeforeFlight
Copy link
Contributor Author

No need to try have a zillion tests

I get it. Agree that for this PR bunch of tests should be enough.

But at the same time I should mark that inconsistencies of maybe_convert_objects that I've found - all was result of thoroughly testing.
And another point to mention - this is core function (well I guess). That is why such complete test suite is needed in my opinion (not zillion tests - but complete tests; count shouldn't matter). Besides this test suite may be ran only when this concrete function is amended.

@jbrockmendel
Copy link
Member

Improving the thoroughness of the test suite is great, but for a bugfix PR, the tests should be specific to the bug being fixed.

def test_maybe_convert_objects_datetime(self):
# GH27438
arr = np.array([0, 0], dtype=object)
arr[0] = np.datetime64('2000-01-01')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this construction seems odd, you can just construct directly no? e.g.
arr = np.array([np.datetime...., np.timedetla.lll], dtype='object'])

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't you need a NaT somewhere? what about throwing in a np.nan as well (maybe another case)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Construction is odd, yes - will rewrite (it's leftover of previous tests sorry). Will add tests with NaT as well.

As for np.nan - not quite understand. Add something with datetimes and np.nan?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes we likely already have a test for this, but pls check

Copy link
Contributor Author

@BeforeFlight BeforeFlight Jul 20, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback all is green please check.

Now there are 10 tests for maybe_convert_objects + 13 tests through Dataframe constructor (there are also tests for maybe_convert_numeric but I did not count these).

Copy link
Contributor Author

@BeforeFlight BeforeFlight Jul 20, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By the way - no one confirm my changes with array full of NaT with both convert_DT and convert_TD checks as True (elif seen.nat_: case) - now it returns datetime without any warning of ambiguity with timedelta. Old behaviour was to return object (without considering flags state IIRC). Check it pls too.

@BeforeFlight
Copy link
Contributor Author

BeforeFlight commented Jul 21, 2019

Should I just remove this NaTs ambiguity for now and return object? Cause this concrete PR is somewhat not relative to this NaTs ambiguity problem.

@jbrockmendel
Copy link
Member

Should I just remove this NaTs ambiguity for now and return object? Cause this concrete PR is somewhat not relative to this NaTs ambiguity problem.

Between here and the issue you're bringing up a lot of questions and I'm having trouble keeping track of them. To the extent that you can focus on one unambiguous bug at a time, that will make things easier for reviewers.

@jreback jreback added this to the 1.0 milestone Jul 23, 2019
@jreback
Copy link
Contributor

jreback commented Jul 23, 2019

@BeforeFlight lgtm. does this have any user facing effects at all? meaning is there an incorrect Series/DataFrame construction currently?

@BeforeFlight
Copy link
Contributor Author

BeforeFlight commented Jul 23, 2019

Have to check (I'm not aware of all calling to maybe_convert_object) but if it used by some constructor directly - it might produce undesired results. The example that leads to this bug explains what I mean:

import numpy as np
import pandas as pd
import pandas._libs.lib as lib

data = np.array([np.datetime64('2000-01-01'), np.timedelta64(1, 's')], dtype='object')
out = lib.maybe_convert_objects(data,1,1,1,1)
print(out, out.dtype)

Output:

['2000-01-01T00:00:00.000000000' '1970-01-01T00:00:00.021389104'] datetime64[ns]

As I already mention at issue branch - datetime array is actually not initialized 2nd value at all (left it as it is after np.empty) -> might be any other in theory.

Will try to reproduce it on some constructor.

EDIT
Straightforward DataFrame...MultiIngex runs smoothly - somehow overcome such input and returns object. Need some time to investigate it why. Will do.

@BeforeFlight
Copy link
Contributor Author

No, seems like there is no any effects for user at all in current code state. The reason here is that in order to produce such bug both flags convert_datetime and convert_timedelta must be True. But I've search and visually inspect all lib.maybe_convert_objects calls in pandas folder - there is no such case. Only separate flags convert_datetime or convert_timedelta but not both at same time.

@jreback jreback merged commit 4d9016e into pandas-dev:master Jul 24, 2019
@jreback
Copy link
Contributor

jreback commented Jul 24, 2019

thanks @BeforeFlight

@BeforeFlight BeforeFlight deleted the maybe_convert_objects_BUG branch July 25, 2019 19:56
quintusdias pushed a commit to quintusdias/pandas_dev that referenced this pull request Aug 16, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Numeric Operations Arithmetic, Comparison, and Logical operations
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants