BUG: maybe_convert_objects mixed datetimes and timedeltas #27438
Conversation
Have also run the following tests, which compare the new implementation against the old one:

```python
check_types = [None, np.nan, pd.NaT,
               True,
               np.iinfo(np.int64).min - 1, -1, 1, np.iinfo(np.int64).max + 1,
               np.iinfo(np.uint64).max, np.iinfo(np.uint64).max + 1,
               1.1, np.float_(1.1),
               1j, np.complex_(1j),
               datetime(2000, 1, 1),
               datetime(2000, 1, 1, 0, 0, 0, 0, timezone.utc),
               np.datetime64('2000-01-01'),
               timedelta(365),
               np.timedelta64(1, 's'),
               # Decimal(1.1),
               'object']

# indices of the NaT/datetime/timedelta entries in check_types
DT_TD_NaT_inds = [2] + list(range(14, 19))

def data_gen(perms, exclude_all_is_time=False):
    data = np.array([0] * N, dtype='object')
    for perm in perms:
        if exclude_all_is_time and all(ind in DT_TD_NaT_inds for ind in perm):
            continue
        for i in range(data.shape[0]):
            data[i] = check_types[perm[i]]
        yield data.copy()

N = 3
exclude_all_is_time = False
perms = itertools.product(list(range(len(check_types))), repeat=N)
datas = data_gen(perms, exclude_all_is_time)

class Tests:
    @pytest.mark.parametrize("data", datas)
    @pytest.mark.parametrize("safe", [1])
    # @pytest.mark.parametrize("conv_DT", [1])
    # @pytest.mark.parametrize("conv_TD", [1])
    # @pytest.mark.parametrize("safe", [0, 1])
    @pytest.mark.parametrize("conv_DT", [0, 1])
    @pytest.mark.parametrize("conv_TD", [0, 1])
    def test_maybe_convert_objects(self, data, safe, conv_DT, conv_TD):
        old = lib.maybe_convert_objects_OLD(data, 1, safe, conv_DT, conv_TD)
        new = lib.maybe_convert_objects(data, 1, safe, conv_DT, conv_TD)
        print('data:', data, data.dtype,
              '\nold:', old, old.dtype,
              '\nnew:', new, new.dtype)
        if isinstance(old, np.ndarray):
            tm.assert_numpy_array_equal(old, new)
        else:
            tm.assert_index_equal(old, new)
```

NOTE: I have not parametrized over every flag combination. While running these tests some cases failed; visually inspecting all of the failures, every one of them concerns only the BUG that is being fixed.
Also, I have a question that is generally not relevant here, but I don't understand where else to ask. While working on this I noticed Lines 1045 to 1051 in 2ae5152. My question is why in these functions there is no …
Can you add an xref in the top post pointing back to the discussion (even though this doesn't close that issue)?
Needs tests. Even if they're not passing yet, please push them.
I don't know off the top of my head. Probably check the blame to see where this was implemented?
Check please.
But in order to test the way I did, I need both implementations, old and new, in the codebase. In fact this is a problem, I suppose, because even my current test suite checks only against the existing implementation's outputs, not predefined hand-made outputs, and only reviewing the comparisons makes it feasible to catch errors. Filling in correct answers for each test (even for …)
One (strange) solution for the tests that I see here is to dump the test outputs into a file, make the function changes, then run the same tests and compare against those outputs. If the changes are approved, the file with the reference outputs should be updated to the new outputs as well. Another option is to keep two copies of the function in the code - …
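The dump-outputs-to-a-file idea above can be sketched as a small golden-file harness. All names here are hypothetical illustrations, not pandas API; a toy dtype-inference function stands in for `maybe_convert_objects`:

```python
import json
import os
import tempfile

import numpy as np

def dump_reference(cases, fn, path):
    # record fn's output for each case, keyed by the case's repr
    ref = {repr(c): repr(fn(c)) for c in cases}
    with open(path, "w") as f:
        json.dump(ref, f)

def check_against_reference(cases, fn, path):
    # re-run fn on the same cases and compare against the dumped outputs
    with open(path) as f:
        ref = json.load(f)
    return all(ref[repr(c)] == repr(fn(c)) for c in cases)

# toy stand-in for the function under refactor
infer = lambda x: np.asarray(x).dtype.name

cases = [[1, 2], [1.5, 2.5], [True, False]]
path = os.path.join(tempfile.mkdtemp(), "reference.json")
dump_reference(cases, infer, path)          # before the refactor
print(check_against_reference(cases, infer, path))  # after: True if behaviour unchanged
```

The weakness the author notes applies here too: the reference file encodes the old behaviour, so intentional fixes show up as "failures" and the file must be regenerated once the change is approved.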
@BeforeFlight : These were implemented initially to combat large number overflow issues. Good call though to use them for purposes beyond that.
@BeforeFlight possibly helpful: coverage results for lib.pyx showing which lines are not hit in the tests:

```
pandas/_libs/lib.pyx  1084  70  94%  307, 370, 651, 655, 658, 699-705, 707-710,
718-732, 828, 978, 983, 988, 1245-1246, 1268, 1445, 1519-1521, 1641, 1644-1646,
1696, 1718, 1889-1890, 1903, 1935, 1944, 1954-1956, 2006-2007, 2009-2010,
2022-2024, 2128-2131, 2135-2136, 2138-2139, 2143, 2145-2146, 2148-2153, 2191
```

Grain of salt, since cython coverage results have been buggy in the past.
@jbrockmendel so how should I add tests? Copy old function in tests section? |
this needs tests even before a change. its very hard to tell if what you did actually changed anything.
pandas/_libs/lib.pyx (Outdated)

```diff
@@ -2046,11 +2049,18 @@ def maybe_convert_objects(ndarray[object] objects, bint try_float=0,
         else:
             if not seen.bool_:
                 if seen.datetime_:
                     if not seen.numeric_:
                         return datetimes
                     if not seen.timedelta_:
```
Can you make these one-level-higher elifs? It's very hard to follow these branches currently.
Will do. As for me, these nested ifs were ugly even before I changed anything. Besides, it's one of those good places to make a mistake (as happened in the current case).
pandas/_libs/lib.pyx (Outdated)

```diff
@@ -2077,11 +2087,18 @@ def maybe_convert_objects(ndarray[object] objects, bint try_float=0,
         else:
             if not seen.bool_:
                 if seen.datetime_:
                     if not seen.numeric_:
                         return datetimes
                     if not seen.timedelta_:
```
same as above
@jreback I've proposed a simple test in the corresponding issue that returns undesired output - this PR is aimed at resolving exactly such situations. Should I add this single test only? If not, `maybe_convert_objects` needs a full test suite (for all possible type combinations), but as I tried to explain earlier, there would be thousands of cases, which is not doable by hand.
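For scale, the "thousands of cases" figure follows directly from the harness quoted earlier: with 20 entries in `check_types` and `N = 3` slots, the number of permutations is

```python
import itertools

n_types = 20  # len(check_types) in the harness quoted earlier
N = 3
perms = itertools.product(range(n_types), repeat=N)
print(sum(1 for _ in perms))  # 8000 combinations to inspect by hand
```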
we already have quite a few tests
i am ok with systematic testing but don’t go overboard
so happy to merge tests first
> so happy to merge tests first
@jreback sorry, but from your answer I still can't understand - is the single test which is fixed by this PR enough or not?
Yes, this needs at least a single test.

If you want to do more comprehensive tests that xfail some (that are not passing now), that's ok too (can be before or after this PR).
Will add.

The problem with comprehensive tests: they cannot be written by hand, so they need some "trusted" function to test against, i.e. copying the old function for test purposes only (as one of the options) - and here I don't understand whether I may do so or not.
There should be some manageably-sized set of cases that you were trying to fix here, i.e. that were previously wrong but are now right. Write those tests specifically. No need to try to have a zillion tests.
I get it. Agree that for this PR a bunch of tests should be enough. But at the same time I should note that the inconsistencies of …
Improving the thoroughness of the test suite is great, but for a bugfix PR, the tests should be specific to the bug being fixed.
```python
def test_maybe_convert_objects_datetime(self):
    # GH27438
    arr = np.array([0, 0], dtype=object)
    arr[0] = np.datetime64('2000-01-01')
```
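As a side note, a minimal numpy-only sketch of the mixed case this test targets (an assumed illustration, not the pandas test itself):

```python
import numpy as np

# A datetime64 scalar mixed with a timedelta64 scalar: coercing this to a
# single datetime64 array would be wrong, so object dtype should be kept.
arr = np.array([np.datetime64('2000-01-01'), np.timedelta64(1, 's')], dtype=object)
print(arr.dtype)  # object
```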
This construction seems odd, you can just construct directly, no? e.g.

```python
arr = np.array([np.datetime..., np.timedelta...], dtype='object')
```
Don't you need a NaT somewhere? What about throwing in an np.nan as well (maybe as another case)?
Construction is odd, yes - will rewrite (it's a leftover of previous tests, sorry). Will add tests with NaT as well.

As for np.nan - I don't quite understand. Add something with datetimes and np.nan?
Yes, we likely already have a test for this, but please check.
@jreback all is green, please check. Now there are 10 tests for `maybe_convert_objects` plus 13 tests through the `DataFrame` constructor (there are also tests for `maybe_convert_numeric`, but I did not count these).
By the way - no one has confirmed my change for an array full of `NaT` with both `convert_DT` and `convert_TD` flags as True (the `elif seen.nat_:` case) - it now returns datetime without any warning of the ambiguity with timedelta. The old behaviour was to return object (without considering the flags' state, IIRC). Please check this too. Should I just remove this …
Between here and the issue you're bringing up a lot of questions and I'm having trouble keeping track of them. To the extent that you can focus on one unambiguous bug at a time, that will make things easier for reviewers.
@BeforeFlight lgtm. Does this have any user-facing effects at all? Meaning, is there an incorrect Series/DataFrame construction currently?
Have to check (I'm not aware of all the calls to it):

```python
import numpy as np
import pandas as pd
import pandas._libs.lib as lib

data = np.array([np.datetime64('2000-01-01'), np.timedelta64(1, 's')], dtype='object')
out = lib.maybe_convert_objects(data, 1, 1, 1, 1)
print(out, out.dtype)
```

Output:

```
['2000-01-01T00:00:00.000000000' '1970-01-01T00:00:00.021389104'] datetime64[ns]
```

As I already mentioned in the issue thread, the datetime array actually never initialized its 2nd value at all (it is left as it was after …). Will try to reproduce it through some constructor.

EDIT: …
No, it seems there are no effects for the user at all in the current code state. The reason is that in order to produce this bug, both flags …
Thanks @BeforeFlight
```
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
```
In order to stick to the `maybe_convert_objects` implementation it was necessary to introduce a `Seen().nat_` param, since we want to distinguish the following situations: whether `NaT`, `datetime`, or `timedelta` values were seen.

NOTE: in the current implementation an array full of `NaT` will be converted to `np.datetime64`.
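The flag-based scan can be sketched in plain Python (a minimal stand-in for the Cython `Seen` struct; the class, the `NAT` sentinel, and `classify` are hypothetical names, and the all-NaT case here returns object dtype, which is the author's open question rather than the current behaviour):

```python
from datetime import datetime, timedelta

NAT = object()  # stand-in sentinel for pd.NaT in this sketch

class Seen:
    """Minimal stand-in for the Seen flags used in lib.pyx."""
    def __init__(self):
        self.datetime_ = False
        self.timedelta_ = False
        self.nat_ = False

def classify(values):
    seen = Seen()
    for v in values:
        if v is NAT:
            seen.nat_ = True
        elif isinstance(v, datetime):
            seen.datetime_ = True
        elif isinstance(v, timedelta):
            seen.timedelta_ = True
    if seen.datetime_ and seen.timedelta_:
        return "object"            # mixed kinds: coercion would be ambiguous
    if seen.datetime_:
        return "datetime64[ns]"
    if seen.timedelta_:
        return "timedelta64[ns]"
    if seen.nat_:
        return "object"            # all-NaT is ambiguous too in this sketch
    return "object"

print(classify([datetime(2000, 1, 1), timedelta(1)]))  # object
```

Tracking `nat_` separately is what lets the mixed and all-NaT cases be told apart from a genuine all-datetime or all-timedelta array.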