PERF: maybe_convert_objects rethink. #27417
This has just recently been done. So if there is a comment that is no longer accurate, definitely worth changing.
These directives usually improve performance. Are you saying this is not the case here?
Can you give a ballpark of by how much performance improves? Are we talking 1% or 10% or [...]?
After running
Yes. The level of interest will depend on the size of the performance improvement. |
Some questions I left for myself - just as marks. While development is in progress I've disabled these flags in order not to get an unexpected segfault or something (same is for
Yes, of course. I'm asking how to paste it here properly (even pastebin doesn't accept it due to size > 512KiB). Should I just copy-paste all 4500 comparison changes here? |
It's fine if you want to post every 10th entry or something. Have you checked that the results are stable between multiple runs? asv often lets a lot of noise in for cython code. |
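One lightweight way to eyeball stability without a full asv run (my own suggestion, not something from the thread) is to repeat the same timing several times and look at the spread:

import timeit

import numpy as np
import pandas._libs.lib as lib

data = np.array([1] * 10**3, dtype='object')
# five repeats of the same measurement; a wide min/max spread indicates noise
times = timeit.repeat(lambda: lib.maybe_convert_objects(data), number=1000, repeat=5)
print(min(times), max(times))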
No, I ran it only once (~2h) - there are other bugs which should be tackled first (I think): first the logic must be fully correct, then performance tests. In fact the performance tests are 'trimmed' right now - inserting more other dtypes into the array (now it's only 2: at the beginning of the array and in the middle) would make performance differ more (but even 3 inserts would take about 1-1.5 days to test on my laptop, so I will run that only once the code logic is consistent). Current testing state (as at 27.07.2019): asv tests, pytest
|
Resolving the failed pytest comparisons. Here I need someone to choose the correct result from the following pytest outputs:
data: [None -9223372036854775809] object
old:  [ nan -9.22337204e+18] float64
new:  [None -9223372036854775809] object

The more I think about this problem, the more I don't like the conversion idea (allow
Now:
Line 1998 in 5b1a870
If we still want to use the safe flag, float64 and complex should be added here (i.e. any integer will be converted to float without range checking; a valid range check for float64 could be added here, but I think that's overkill). Should float and complex be added here to disable integer range checking?
|
Questions

Format here is a question followed by a proposed answer or reasoning:
- pytest
- None
- Time
- Decimal
- Design
- Complete test suite
- Cleaning
- Unrelated minor questions

Answers:
- None
- Time
- Decimal
|
To the extent that bugfixes, performance improvements, and design discussions can be kept independent, that will make things easier for reviewers |
Can you write the example there as something copy/pasteable? I tried to reproduce it locally based on my best guess of what you meant and got different results. |
import numpy as np
import pandas as pd
import pandas._libs.lib as lib
data = np.array([0]*2, dtype='object')
data[0] = np.datetime64('2000-01-01')
data[1] = np.timedelta64(1, 's')
old = lib.maybe_convert_objects(data, 1, 1, 1, 1)
print(old, old.dtype)

Output: ['2000-01-01T00:00:00.000000000' '1970-01-01T00:00:00.021389104'] datetime64[ns]

Sorry, I should have mentioned that my questions comparing outputs all come from the pytest suite I've shown earlier - you may skim it in order to fully understand what I was doing. Also, how should I separate all my questions? Should I open another issue, or comment here? Or maybe collect all questions in a single post (the 1st one, for example) and split it into 'performance', 'design' etc. parts? |
For now its fine to have them in one place. Once you move on to making a PR then its more important to keep a narrow focus. |
As for this concrete example with mixing datetime and timedelta (Lines 2079 to 2081 in 26bd34d):
We just test for seen datetimes and not seen numeric, but haven't taken timedelta into account, hence this weird output (in fact the second value in the output, '1970-01-01T00:00:00.021389104', was not filled at all - it was left as initialized by np.empty, so in theory any other date might show up here).
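To illustrate the np.empty point - the array contents really are arbitrary until written (this snippet is mine; output varies from run to run):

import numpy as np

out = np.empty(2, dtype='M8[ns]')  # uninitialized storage: arbitrary bytes
print(out)  # e.g. ['1970-01-01T00:00:00.021389104' ...] - garbage, not real data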
|
Yah that looks like a bug. PR to fix would be welcome |
Here it is not so simple. If the code is something like this:

if seen.datetime_:
    if not seen.timedelta_:
        if not seen.numeric_:
            return datetimes

Then another problem of the initial code arises - So there is a need to introduce something like a Seen().NaT attribute or something else. I have to think about it more. Am I correct that the existing code must be fully bug-fixed first (which seems correct, but that will take time)? |
type inference with NaT comes up every so often. The rule of thumb is for it to be datetime-like by default, and be timedelta-like only if that is needed to make an array all-timedelta |
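A minimal illustration of that rule of thumb through the public constructors (my own example, worth re-checking against the dev branch):

import pandas as pd

print(pd.Series([pd.NaT, pd.Timestamp('2000-01-01')]).dtype)  # datetime64[ns]
print(pd.Series([pd.NaT, pd.Timedelta('1s')]).dtype)          # timedelta64[ns]
print(pd.Series([pd.NaT, pd.NaT]).dtype)                      # datetime64[ns] - datetime-like by default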
For now I've managed to resolve all conflicts with the current version of |
There should be an
I don't know. What do we do now? Returning object-dtype seems reasonable for all-Decimal arrays.
Right now we say datetime right? There should probably be something like "ambiguous" and then we require the user to specify the dtype |
In fact I agree here. Especially if pandas now cannot work with
I've updated questions section |
The case I have in mind is |
@jbrockmendel it's been 5 days already. Please check 3rd pytest failure when you have some time (it shouldn't take long). I added minimal question with info there. |
which is the 3rd pytest failure? |
I've collected all my questions above under the histogram with asvs. The 3rd one is unresolved.
|
In the integer-> float case the conversion is lossy, right? i.e. it doesn't round-trip correctly? This might be a good case for IntegerNA cc @jreback |
Yes, conversion to |
test = np.array([np.iinfo(np.uint64).max+1, 1.1], dtype=np.object)
lib.maybe_convert_objects(test)

Output: array([18446744073709551616, 1.1], dtype=object)

As well as:

test = np.array([-1, np.iinfo(np.uint64).max, 1.1], dtype=np.object)
lib.maybe_convert_objects(test)

Output: array([-1, 18446744073709551615, 1.1], dtype=object) |
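For reference, a quick sketch (mine, not from the thread) of why falling back to float64 would be lossy for such values - they do not survive an int -> float64 -> int round trip:

import numpy as np

val = np.iinfo(np.uint64).max   # 18446744073709551615
as_float = np.float64(val)      # rounds to the nearest representable float64
print(int(as_float))            # 18446744073709551616
print(int(as_float) == val)     # False: the round trip loses the exact value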
would float128 be a valid workaround? IIRC there are some platforms/builds on which that is not available, but that may have been fixed in more recent numpys |
You mean in case |
in the case where keeping float64 would be lossy, yah |
we don’t support float128 |
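For what it's worth, what 'float128' actually provides is platform-dependent (80-bit extended precision on most x86 Linux builds, plain float64 under MSVC), which can be checked at runtime:

import numpy as np

print(np.longdouble)                      # platform-dependent alias
print(np.finfo(np.longdouble).precision)  # decimal digits actually available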
I've got an idea (maybe weird, I don't know): since pandas already has an extensions platform - how about creating the following dtype: in IntegerNA there are 2 arrays: data and mask. Why not do the same? But the mask here will be whether the value is
For example: |
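A rough sketch of how such a data+mask container might look (purely illustrative - the class name and layout are mine, not an actual pandas API):

import numpy as np

class MaskedInts:
    # toy analogue of the IntegerNA layout: raw values plus a parallel mask
    def __init__(self, values, mask):
        self.data = np.asarray(values, dtype=np.int64)  # raw integer storage
        self.mask = np.asarray(mask, dtype=bool)        # True where the value is missing

    def to_object(self):
        # materialize as an object array, restoring NaN at masked positions
        out = self.data.astype(object)
        out[self.mask] = np.nan
        return out

arr = MaskedInts([1, 0, 3], mask=[False, True, False])
print(arr.to_object())  # [1 nan 3]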
@BeforeFlight is this a bug that is causing problems in the wild? If not, we'd probably be better off putting it on the backburner. Either the int-float64 idea or the IntegerNA approach would be pretty invasive, which we're reticent about for hopefully-obvious reasons. None of this is to discourage you from making contributions, but I'm assuming that there is lower-hanging fruit you'd be interested in tackling before this. |
@jbrockmendel if I get it correctly (sorry, I'm not a native English speaker and your post contains some metaphors which I don't get)
Sorry, I can't understand this part at all. Do you mean "don't do int-float and intNA" again? |
Thanks for clarifying. I will try to express this more plainly. Is the nan+int64 bug high-priority? Fixing it (either approach discussed so far) will be a big change, which we would rather delay unless it is a high priority. |
If we do not want to make some "heavy" decision about

vals = [np.iinfo(np.uint64).max - 10**i for i in range(5)] + [1.1]
test = np.array(vals, dtype=np.object)
conv = lib.maybe_convert_objects(test)
for i, el in enumerate(conv):
    print((vals[i], format(el, '.0f')))

Output:

(18446744073709551614, '18446744073709551616')
(18446744073709551605, '18446744073709551616')
(18446744073709551515, '18446744073709551616')
(18446744073709550615, '18446744073709551616')
(18446744073709541615, '18446744073709541376')
(1.1, '1')
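The pattern above is plain float64 granularity: between 2**63 and 2**64 adjacent float64 values are 2048 apart, so any integer within one step collapses to the same float. A quick check (my own addition):

import numpy as np

print(np.spacing(np.float64(2**63)))          # 2048.0 - gap between adjacent float64s here
print(int(np.float64(18446744073709541615)))  # 18446744073709541376, matching the output above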
|
So we want to stick with the current implementation. Then we need to add some consistency here:

# FLOAT
test = np.array([None, np.iinfo(np.uint64).max+1], dtype=np.object)  # test 1
lib.maybe_convert_objects(test)

# OBJECT
test = np.array([np.nan, np.iinfo(np.uint64).max+1], dtype=np.object)  # test 2
lib.maybe_convert_objects(test)

test = np.array([1.1, np.iinfo(np.uint64).max+1], dtype=np.object)  # test 3
lib.maybe_convert_objects(test)

Outputs:

array([ nan, 1.84467441e+19])
array([nan, 18446744073709551616], dtype=object)
array([1.1, 18446744073709551616], dtype=object)

If
Possible solutions for:
Line 1998 in 5b1a870
|
Assuming this Will |
All So in case of dropping |
Pytest:

# imports assumed for this snippet (not shown in the original post)
import itertools
from datetime import datetime, timedelta, timezone

import numpy as np
import pandas as pd
import pytest

import pandas._libs.lib as lib
import pandas.util.testing as tm

check_types = [None, np.nan, pd.NaT,
True,
np.iinfo(np.int64).min-1, np.iinfo(np.int64).min,
1,
np.iinfo(np.uint64).max, np.iinfo(np.uint64).max+1,
1.1, np.float_(1.1),
1j, np.complex_(1j),
datetime(2000, 1, 1),
datetime(2000, 1, 1, 0, 0, 0, 0, timezone.utc),
datetime(2000, 1, 1, 0, 0, 0, 0, timezone(timedelta(hours=1))),
np.datetime64('2000-01-01'),
timedelta(365),
np.timedelta64(1, 's'),
'object'
]
DT_TD_NaT_inds = [2] + list(range(13, 19))  # indices of pd.NaT and the datetime/timedelta entries above
def data_gen(N, perms, exclude_all_is_time=False):
data = np.array([0]*N, dtype='object')
for perm in perms:
if exclude_all_is_time and all(ind in DT_TD_NaT_inds for ind in perm):
continue
for i in range(data.shape[0]):
data[i] = check_types[perm[i]]
yield data.copy()
N = 2
exclude_all_is_time = False
perms = itertools.product(list(range(len(check_types))), repeat=N)
datas = data_gen(N, perms, exclude_all_is_time)
data_inds = list(range(len(check_types)**N)) if not exclude_all_is_time \
    else list(range(len(check_types)**N - len(DT_TD_NaT_inds)**N))  # total perms minus the skipped all-time perms
@pytest.fixture(scope='module', params=data_inds)
def get_data():
return next(datas)
class Tests:
@pytest.mark.parametrize("safe", [0, 1])
@pytest.mark.parametrize("conv_DT", [0, 1])
@pytest.mark.parametrize("conv_TD", [0, 1])
def test_maybe_convert_objects(self, get_data, safe, conv_DT, conv_TD):
data = get_data
old = lib.maybe_convert_objects_OLD(data, try_float=1, safe=safe,
convert_datetime=conv_DT,
convert_timedelta=conv_TD)
new = lib.maybe_convert_objects(data, try_float=1, safe=safe,
convert_datetime=conv_DT,
convert_timedelta=conv_TD)
if isinstance(old, np.ndarray):
tm.assert_numpy_array_equal(old, new)
else:
            tm.assert_index_equal(old, new)

Asv:

# imports assumed for this benchmark snippet (not shown in the original post)
import itertools
from datetime import datetime, timedelta, timezone

import numpy as np
import pandas as pd

import pandas._libs.lib as lib


class MaybeConvertObjects:
dtypes = {"bool": True, "sint": -1, "uint": np.iinfo(np.uint64).max,
"float": 1.1, "complex": 1j,
"datetime": datetime(2000, 1, 1), "timedelta": timedelta(365),
"datetime_tz_utc": datetime(2000, 1, 1, 0, 0, 0, 0, timezone.utc),
"datetime_tz_utc1": datetime(2000, 1, 1, 0, 0, 0, 0,
timezone(timedelta(hours=1))),
"object": "object"}
dtypes_to_insert = list(dtypes.values()) + [None, pd.NaT]
datas = {k: np.array([v] * 10**3, dtype='object')
for k, v in dtypes.items()}
    insert_C = 2  # number of values inserted into each homogeneous array
params = [[0, 1], [0, 1], [0, 1],
list(dtypes.keys()),
list(itertools.product(list(range(len(dtypes_to_insert))),
repeat=insert_C))]
param_names = ["safe", "conv_DT", "conv_TD", "dtype", "insrt_inds"]
def setup(self, safe, conv_DT, conv_TD, dtype, insrt_inds):
data = self.datas[dtype]
# inserting some values in data
for i in range(self.insert_C):
data[(i + 1) * len(data) // (self.insert_C + 1)] = \
self.dtypes_to_insert[insrt_inds[i]]
self.data = data
def time_maybe_convert_objects(self, safe, conv_DT, conv_TD,
dtype, insrt_inds):
lib.maybe_convert_objects(self.data, try_float=0, safe=safe,
convert_datetime=conv_DT,
                                  convert_timedelta=conv_TD)

Code:

cdef inline bint is_datetime(object o):
# is_datetime also hits NaT
return PyDateTime_Check(o) or util.is_datetime64_object(o)
@cython.boundscheck(False)
@cython.wraparound(False)
cdef str check_complex(bint safe,
object[:] objects,
complex128_t[:] complexes,
int n = 0, Seen seen = None):
if seen is None:
seen = Seen()
cdef:
float64_t fnan = np.nan
Py_ssize_t i
for i in range(n, objects.shape[0]):
val = objects[i]
if util.is_complex_object(val) or util.is_float_object(val):
complexes[i] = val
elif val is None:
complexes[i] = fnan
elif not safe and util.is_integer_object(val):
if util.is_timedelta64_object(val):
return 'object'
val = int(val)
seen.saw_int(val)
if ((seen.uint_ and seen.sint_) or
not (oINT64_MIN <= val and val <= oUINT64_MAX)):
return 'object'
complexes[i] = <complex128_t>val
else:
return 'object'
return 'complex'
@cython.boundscheck(False)
@cython.wraparound(False)
cdef str check_float(bint safe,
object[:] objects,
float64_t[:] floats,
complex128_t[:] complexes,
int n = 0, Seen seen = None):
if seen is None:
seen = Seen()
cdef:
float64_t fnan = np.nan
Py_ssize_t i
for i in range(n, objects.shape[0]):
val = objects[i]
if util.is_float_object(val):
floats[i] = complexes[i] = val
elif val is None:
floats[i] = complexes[i] = fnan
elif not safe and util.is_integer_object(val):
if util.is_timedelta64_object(val):
return 'object'
val = int(val)
seen.saw_int(val)
if ((seen.uint_ and seen.sint_) or
not (oINT64_MIN <= val and val <= oUINT64_MAX)):
return 'object'
floats[i] = <float64_t>val
complexes[i] = <complex128_t>val
elif util.is_complex_object(val):
return check_complex(safe, objects, complexes, i, seen)
else:
return 'object'
return 'float'
@cython.boundscheck(False)
@cython.wraparound(False)
cdef str check_integer(bint safe,
object[:] objects,
int64_t[:] ints,
uint64_t[:] uints,
float64_t[:] floats,
complex128_t[:] complexes,
int n = 0):
cdef:
Seen seen = Seen()
Py_ssize_t i
for i in range(n, objects.shape[0]):
val = objects[i]
if util.is_integer_object(val):
# is_integer hits timedelta64
if util.is_timedelta64_object(val):
return 'object'
val = int(val)
seen.saw_int(val)
if ((seen.uint_ and seen.sint_) or
not (oINT64_MIN <= val and val <= oUINT64_MAX)):
return 'object'
if seen.uint_:
uints[i] = val
elif seen.sint_:
ints[i] = val
else:
uints[i] = val
ints[i] = val
if not safe:
floats[i] = <float64_t>val
complexes[i] = <complex128_t>val
elif safe:
return 'object'
elif util.is_float_object(val):
return check_float(safe, objects, floats, complexes, i, seen)
elif val is None:
return check_float(safe, objects, floats, complexes, i, seen)
elif util.is_complex_object(val):
return check_complex(safe, objects, complexes, i, seen)
else:
return 'object'
if seen.uint_:
return 'uint'
else:
return 'int'
@cython.boundscheck(False)
@cython.wraparound(False)
cdef str check_bool(object[:] objects,
uint8_t[:] bools,
int n = 0):
cdef Py_ssize_t i
for i in range(n, objects.shape[0]):
val = objects[i]
if util.is_bool_object(val):
bools[i] = val
else:
return 'object'
return 'bool'
@cython.boundscheck(False)
@cython.wraparound(False)
cdef str check_datetime(object[:] objects,
int64_t[:] idatetimes,
int n = 0):
cdef:
Py_ssize_t i
        bint all_is_NaT = 1  # everything so far is NaT; used for the fast 'object' fallback with mixed tz-aware values
for i in range(n, objects.shape[0]):
val = objects[i]
if is_datetime(val):
# is_datetime also hits NaT
if val is NaT:
idatetimes[i] = NPY_NAT
else:
if getattr(val, 'tzinfo', None) is not None:
                    if i > 0 and not all_is_NaT:
return 'object'
else:
return 'datetime_tz'
all_is_NaT = 0
idatetimes[i] = convert_to_tsobject(val, None, None, 0, 0).value
else:
return 'object'
return 'datetime'
@cython.boundscheck(False)
@cython.wraparound(False)
cdef str check_timedelta(object[:] objects,
int64_t[:] itimedeltas,
int n = 0):
cdef Py_ssize_t i
for i in range(n, objects.shape[0]):
val = objects[i]
if is_timedelta(val):
itimedeltas[i] = convert_to_timedelta64(val, 'ns')
elif val is NaT:
itimedeltas[i] = NPY_NAT
else:
return 'object'
return 'timedelta'
@cython.boundscheck(False)
@cython.wraparound(False)
cdef str check_NaT(bint convert_datetime,
bint convert_timedelta,
object[:] objects,
int64_t[:] idatetimes,
int64_t[:] itimedeltas,
int n = 0):
cdef Py_ssize_t i
for i in range(n, objects.shape[0]):
val = objects[i]
if val is NaT:
idatetimes[i] = itimedeltas[i] = NPY_NAT
elif convert_datetime and is_datetime(val):
return check_datetime(objects, idatetimes, i)
elif convert_timedelta and is_timedelta(val):
return check_timedelta(objects, itimedeltas, i)
else:
return 'object'
# full NaT output
if convert_datetime and convert_timedelta:
# TODO: array full of NaT ambiguity resolve here needed
return 'object'
elif convert_datetime:
return 'datetime'
elif convert_timedelta:
return 'timedelta'
else:
return 'object'
@cython.boundscheck(False)
@cython.wraparound(False)
cdef str check_None(bint safe,
object[:] objects,
float64_t[:] floats,
complex128_t[:] complexes,
int n = 0):
cdef:
float64_t fnan = np.nan
Py_ssize_t i
for i in range(n, objects.shape[0]):
val = objects[i]
if val is None:
floats[i] = complexes[i] = fnan
elif util.is_float_object(val):
return check_float(safe, objects, floats, complexes, i)
elif util.is_complex_object(val):
return check_complex(safe, objects, complexes, i)
elif not safe and util.is_integer_object(val):
return check_float(safe, objects, floats, complexes, i)
else:
return 'object'
return 'object'
@cython.boundscheck(False)
@cython.wraparound(False)
def maybe_convert_objects(ndarray[object] objects, bint try_float=0,
bint safe=0, bint convert_datetime=0,
bint convert_timedelta=0):
"""
Type inference function-- convert object array to proper dtype
"""
cdef int n = objects.shape[0]
if n == 0:
return objects
cdef:
        ndarray[uint8_t] bools = np.empty(n, dtype=np.uint8)
        ndarray[int64_t] ints = np.empty(n, dtype=np.int64)
        ndarray[uint64_t] uints = np.empty(n, dtype=np.uint64)
        ndarray[float64_t] floats = np.empty(n, dtype=np.float64)
        ndarray[complex128_t] complexes = np.empty(n, dtype=np.complex128)
if convert_datetime:
datetimes = np.empty(n, dtype='M8[ns]')
idatetimes = datetimes.view(np.int64)
if convert_timedelta:
timedeltas = np.empty(n, dtype='m8[ns]')
itimedeltas = timedeltas.view(np.int64)
# Type probing - determine type of first element
# and call corresponding check
cdef:
str convert_dtype = 'object'
object val = objects[0]
if val is None:
convert_dtype = check_None(safe, objects, floats, complexes)
elif val is NaT:
if convert_datetime and convert_timedelta:
convert_dtype = check_NaT(convert_datetime, convert_timedelta,
objects, idatetimes, itimedeltas)
elif convert_datetime:
convert_dtype = check_datetime(objects, idatetimes)
elif convert_timedelta:
convert_dtype = check_timedelta(objects, itimedeltas)
elif is_timedelta(val):
if convert_timedelta:
convert_dtype = check_timedelta(objects, itimedeltas)
elif is_datetime(val):
if convert_datetime:
convert_dtype = check_datetime(objects, idatetimes)
elif util.is_bool_object(val):
convert_dtype = check_bool(objects, bools)
elif util.is_integer_object(val):
convert_dtype = check_integer(safe, objects,
ints, uints, floats, complexes)
elif util.is_float_object(val):
convert_dtype = check_float(safe, objects, floats, complexes)
elif util.is_complex_object(val):
convert_dtype = check_complex(safe, objects, complexes)
# OUTPUT
if convert_dtype == 'bool':
return bools.view(np.bool_)
elif convert_dtype == 'int':
return ints
elif convert_dtype == 'uint':
return uints
elif convert_dtype == 'float':
return floats
elif convert_dtype == 'complex':
return complexes
elif convert_dtype == 'datetime':
return datetimes
elif convert_dtype == 'timedelta':
return timedeltas
elif convert_dtype == 'datetime_tz':
if is_datetime_with_singletz_array(objects):
from pandas import DatetimeIndex
return DatetimeIndex(objects)
    return objects
|
While exploring maybe_convert_objects I've got quite a few proposals/questions, but in the end I've come to the conclusion that the logic of the function itself is far from optimal.

Motivation

- If we have already seen, say, datetime and float, we will still process the rest of the array, and only after having seen all the elements do we check the state of the Seen() object in order to determine the result dtype. But in that example one may already conclude that the result dtype must be left as object. In other words, there is a need to do result type checking on the fly if we want to preserve the main course of the written function.
- If all values seen so far are datetime, we will still try to check each element for float, bool, etc. But all these checks may be collapsed into a single if is_datetime(val) ... else return object. I.e. for each seen dtype there should be its own checking algorithm for the remaining array elements.

These are only the main points. So instead of trying to bloat the current implementation with (maybe redundant) checks, I am suggesting to rewrite maybe_convert_objects.

Realization

The main idea of the following code is to perform type probing on the first element of the array and then pass processing to the corresponding check function; during execution, processing may be handed off to another (more narrow in its checks) function if a new dtype is discovered.

NOTE: this is a working copy, so I left open questions in code comments (marked with # QUEST:).
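As a plain-Python sketch of this probe-and-hand-off idea (names and simplifications are mine; the actual implementation is the Cython code above):

def convert(objects):
    # probe the first element, then hand off to the matching checker
    val = objects[0]
    if isinstance(val, bool):
        return check_bool(objects, 0)
    if isinstance(val, int):
        return check_integer(objects, 0)
    return 'object'

def check_bool(objects, n):
    for i in range(n, len(objects)):
        if not isinstance(objects[i], bool):
            return 'object'  # result decided on the fly, no full Seen() pass
    return 'bool'

def check_integer(objects, n):
    for i in range(n, len(objects)):
        if isinstance(objects[i], float):
            return check_float(objects, i)  # new dtype discovered: hand off mid-scan
        if isinstance(objects[i], bool) or not isinstance(objects[i], int):
            return 'object'
    return 'int'

def check_float(objects, n):
    for i in range(n, len(objects)):
        if isinstance(objects[i], bool) or not isinstance(objects[i], (int, float)):
            return 'object'
    return 'float'

print(convert([1, 2, 3]))    # 'int'
print(convert([1, 2.5, 3]))  # 'float' - handed off from check_integer
print(convert([1, 2, 'a']))  # 'object' - bails out at the first mismatch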
Code logic tests
I've done initial testing with pytest in order to compare the outputs of the old and new functions with the test suite shown above: 770/800 tests passed. For now it simply creates an array with 2 values for every combination from check_types. I'm in the process of resolving the remaining 30 failures and will then increase N.

Performance tests

Asv benchmarks: for 4100/4400 of the tests with changed performance, performance increased (what is the correct way to paste their output here? Or is it redundant?). I'm also in the process of resolving the remaining 300.

In order to continue working on this I need to know whether this will even be considered.