PERF: maybe_convert_objects rethink. #27417


Closed
BeforeFlight opened this issue Jul 16, 2019 · 37 comments
Labels
Performance Memory or execution speed performance


@BeforeFlight
Contributor

While exploring maybe_convert_objects I collected quite a few proposals and questions, but in the end I came to the conclusion that the logic of the function itself is far from optimal.

Motivation

  1. The main argument here is that we do not check already-seen data for inconsistency as we go. E.g. if we have seen a datetime and a float, we still process the rest of the array, and only after seeing all the elements do we inspect the state of the Seen() object to determine the result dtype. But in the given example one could already conclude that the result dtype must be left as object.
    In other words, the result type needs to be checked on the fly if we want to preserve the main course of the function as written.
  2. Even if the data is consistent, we still run a redundant bunch of checks. E.g. if every value seen so far is a datetime, we will still try to check for float, bool, etc., although all those checks could collapse into a single if is_datetime(val)... else return object.
    I.e. each seen dtype should have its own checking algorithm for the remaining array elements.

These are only the main points. So instead of trying to bloat the current implementation with (possibly redundant) checks, I suggest rewriting maybe_convert_objects.
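The probe-and-dispatch idea with early exit can be sketched in plain Python (hypothetical `infer_dtype`/`check_*` helpers, not the actual Cython code): probe the first element, hand the array to a checker specialized for that dtype, and bail out to 'object' the moment an incompatible element turns up.

```python
def check_bool(values):
    for v in values:
        if not isinstance(v, bool):
            return "object"          # early exit: the rest is never scanned
    return "bool"

def check_int(values):
    for v in values:
        # bool is a subclass of int, so exclude it explicitly
        if isinstance(v, bool) or not isinstance(v, int):
            return "object"
    return "int64"

def infer_dtype(values):
    """Probe the first element, then run the matching specialized checker."""
    if len(values) == 0:
        return "object"
    first = values[0]
    if isinstance(first, bool):
        return check_bool(values)
    if isinstance(first, int):
        return check_int(values)
    return "object"
```

With per-dtype checkers, a mixed array is rejected as soon as the second, incompatible element is seen, instead of after a full pass plus an inspection of Seen() state.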

Realization

The main idea of the following code is to perform type probing on the first element of the array and then hand processing to the corresponding function; during execution, processing may be handed off to another (narrower in its checks) function if a new dtype is discovered.

NOTE: this is a working copy, so I have left open questions in code comments (marked with # QUEST:)

# First declare checking functions for corresponding dtype got in probing
# QUEST: What is the correct way to wrap these functions? Maybe another .pxd?
# QUEST: Can 'nogil' be used here and in the other cdefs?
cdef str check_complex(bint safe,
                       object[:] objects,
                       complex128_t[:] complexes,
                       int n = 0):
    cdef:
        float64_t fnan = np.nan
        Py_ssize_t i
    for i in range(n, objects.shape[0]):
        val = objects[i]
        if util.is_complex_object(val) or util.is_float_object(val):
            complexes[i] = val
        elif val is None:
            complexes[i] = fnan
        elif util.is_integer_object(val):
            if util.is_timedelta64_object(val):
                return 'object'
            if not safe and oINT64_MIN <= val and val <= oUINT64_MAX:
                complexes[i] = <complex128_t>val
            else:
                return 'object'
        else:
            return 'object'
    return 'complex'

cdef str check_float(bint safe,
                     object[:] objects,
                     float64_t[:] floats,
                     complex128_t[:] complexes,
                     int n = 0):
    cdef:
        float64_t fnan = np.nan
        Py_ssize_t i
    for i in range(n, objects.shape[0]):
        val = objects[i]
        if util.is_float_object(val):
            floats[i] = complexes[i] = val
        elif val is None:
            floats[i] = complexes[i] = fnan
        elif util.is_integer_object(val):
            if util.is_timedelta64_object(val):
                return 'object'
            if not safe and oINT64_MIN <= val and val <= oUINT64_MAX:
                floats[i] = <float64_t>val
                complexes[i] = <complex128_t>val
            else:
                return 'object'
        elif util.is_complex_object(val):
            return check_complex(safe, objects, complexes, i)
        else:
            return 'object'
    return 'float'

cdef str check_integer(bint safe,
                       object[:] objects,
                       int64_t[:] ints,
                       uint64_t[:] uints,
                       float64_t[:] floats,
                       complex128_t[:] complexes,
                       int n = 0):
    cdef:
        Seen seen = Seen()
        Py_ssize_t i
    for i in range(n, objects.shape[0]):
        val = objects[i]
        if util.is_integer_object(val):
            # is_integer_object also hits timedelta64
            # QUEST: maybe check timedelta64 inside 'is_integer_object' directly?
            #        if so, remove it from here and from check_float/check_complex
            if util.is_timedelta64_object(val):
                return 'object'
            if not safe:
                floats[i] = <float64_t>val
                complexes[i] = <complex128_t>val

            val = int(val)
            seen.saw_int(val)

            # QUEST: should vals outside the [int64.min, uint64.max] range
            #        be dropped to object immediately?
            if ((seen.uint_ and seen.sint_) or
                    val > oUINT64_MAX or val < oINT64_MIN):
                return 'object'

            if seen.uint_:
                uints[i] = val
            elif seen.sint_:
                ints[i] = val
            else:
                uints[i] = val
                ints[i] = val
        elif safe:
            return 'object'
        elif util.is_float_object(val):
            return check_float(safe, objects, floats, complexes, i)
        elif val is None:
            return check_float(safe, objects, floats, complexes, i)
        elif util.is_complex_object(val):
            return check_complex(safe, objects, complexes, i)
        else:
            return 'object'

    if seen.uint_ and seen.sint_:
        # Default for intersected sint uint data
        # QUEST: should be 'sint'?
        return 'sint'
    elif seen.uint_:
        return 'uint'
    else:
        return 'sint'

cdef str check_bool(object[:] objects,
                    uint8_t[:] bools,
                    int n = 0):
    cdef Py_ssize_t i
    for i in range(n, objects.shape[0]):
        val = objects[i]
        if util.is_bool_object(val):
            bools[i] = val
        else:
            return 'object'
    return 'bool'

cdef str check_datetime(object[:] objects,
                        int64_t[:] idatetimes,
                        int n = 0):
    cdef Py_ssize_t i
    for i in range(n, objects.shape[0]):
        val = objects[i]
        if is_datetime(val):
            # QUEST: If one val has a 'tzinfo' attr - must the others have one too?
            # Or if the others do not - should 'object' be returned directly?
            if getattr(val, 'tzinfo', None) is not None:
                return 'datetime_tz'
            idatetimes[i] = convert_to_tsobject(
                val, None, None, 0, 0).value
        elif val is NaT:
            idatetimes[i] = NPY_NAT
        else:
            return 'object'
    return 'datetime'

cdef str check_timedelta(object[:] objects,
                         int64_t[:] itimedeltas,
                         int n = 0):
    cdef Py_ssize_t i
    for i in range(n, objects.shape[0]):
        val = objects[i]
        if is_timedelta(val):
            itimedeltas[i] = convert_to_timedelta64(val, 'ns')
        elif val is NaT:
            itimedeltas[i] = NPY_NAT
        else:
            return 'object'
    return 'timedelta'

cdef str check_NaT(object[:] objects,
                   int64_t[:] idatetimes,
                   int64_t[:] itimedeltas,
                   int n = 0):
    cdef Py_ssize_t i
    for i in range(n, objects.shape[0]):
        val = objects[i]
        if val is NaT:
            idatetimes[i] = itimedeltas[i] = NPY_NAT
        elif is_datetime(val):
            return check_datetime(objects, idatetimes, i)
        elif is_timedelta(val):
            return check_timedelta(objects, itimedeltas, i)
        else:
            return 'object'

    # Default for arrays of NaT will be datetime64?
    return 'datetime'

cdef str check_None(bint safe,
                    object[:] objects,
                    float64_t[:] floats,
                    complex128_t[:] complexes,
                    int n = 0):
    cdef:
        float64_t fnan = np.nan
        Py_ssize_t i
    for i in range(n, objects.shape[0]):
        val = objects[i]
        if val is None:
            floats[i] = complexes[i] = fnan
        elif util.is_float_object(val):
            return check_float(safe, objects, floats, complexes, i)
        elif util.is_complex_object(val):
            return check_complex(safe, objects, complexes, i)
        else:
            return 'object'
    # QUEST: an array full of None will be left as object - correct?
    # QUEST: at the same time [None, np.nan] is converted to float.
    #        What about [None, NaT] then? Or [None, bool]? Need consistency here.
    #        Likewise, [None, 1] is *float* now..
    return 'object'

# QUEST: Directives disabled for now because the code is unstable. Enable?
# @cython.boundscheck(False)
# @cython.wraparound(False)
def maybe_convert_objects(ndarray[object] objects, bint try_float=0,
                          bint safe=0, bint convert_datetime=0,
                          bint convert_timedelta=0):
    """
    Type inference function-- convert object array to proper dtype
    """


    # Declaring output templates
    # QUEST: doesn't this use excess memory for unnecessary types?
    #        Should the arrays be initialized only after dtype probing?
    # QUEST: are these arrays really needed at all? Instead of refilling
    #        them, maybe use objects.view or similar as the result?
    cdef:
        int n = objects.shape[0]
        ndarray[uint8_t]      bools     = np.empty(n, dtype=np.uint8)
        ndarray[int64_t]      ints      = np.empty(n, dtype=np.int64)
        ndarray[uint64_t]     uints     = np.empty(n, dtype=np.uint64)
        ndarray[float64_t]    floats    = np.empty(n, dtype=np.float64)
        ndarray[complex128_t] complexes = np.empty(n, dtype=np.complex128)

    if convert_datetime:
        datetimes = np.empty(n, dtype='M8[ns]')
        idatetimes = datetimes.view(np.int64)

    if convert_timedelta:
        timedeltas = np.empty(n, dtype='m8[ns]')
        itimedeltas = timedeltas.view(np.int64)

    # Ensure we are able to do probing at all
    if n == 0:
        return objects

    # Type probing - determine type of first element
    # and call corresponding check
    cdef:
        str convert_dtype = 'object'
        object val = objects[0]

    if val is None:
        # QUEST: Or should 'object' be returned immediately here,
        #        since floats use np.nan instead?
        convert_dtype = check_None(safe, objects, floats, complexes)
    elif val is NaT:
        if convert_datetime and convert_timedelta:
            convert_dtype = check_NaT(objects, idatetimes, itimedeltas)
        elif convert_datetime:
            convert_dtype = check_datetime(objects, idatetimes)
        elif convert_timedelta:
            convert_dtype = check_timedelta(objects, itimedeltas)
    elif is_timedelta(val):
        if convert_timedelta:
            convert_dtype = check_timedelta(objects, itimedeltas)
    elif is_datetime(val):
        if convert_datetime:
            convert_dtype = check_datetime(objects, idatetimes)
    elif util.is_bool_object(val):
        convert_dtype = check_bool(objects, bools)
    elif util.is_integer_object(val):
        convert_dtype = check_integer(safe, objects,
                                      ints, uints, floats, complexes)
    elif util.is_float_object(val):
        convert_dtype = check_float(safe, objects, floats, complexes)
    elif util.is_complex_object(val):
        convert_dtype = check_complex(safe, objects, complexes)
    # QUEST: Why 'try_float' - not 'try_decimal', by analogy with convert_timedelta?
    elif try_float and not isinstance(val, str):
        pass
        # TODO: not done yet. It is not "safe" to just try float(val)
        # # this will convert Decimal objects
        # try:
        #     convert_dtype = check_float(safe, objects, floats, complexes)
        # except Exception:
        #     convert_dtype = 'object'

    # OUTPUT
    if convert_dtype == 'bool':
        # QUEST: is the view needed here? Need to recheck this
        return bools.view(np.bool_)
    elif convert_dtype == 'sint':
        return ints
    elif convert_dtype == 'uint':
        return uints
    elif convert_dtype == 'float':
        return floats
    elif convert_dtype == 'complex':
        return complexes
    elif convert_dtype == 'datetime':
        return datetimes
    elif convert_dtype == 'timedelta':
        return timedeltas
    elif convert_dtype == 'datetime_tz':
        # we try to coerce datetimes w/tz, but they must all have the same tz
        # QUEST: haven't checked the speed of this - can it be done faster?
        # QUEST: a DatetimeIndex is returned instead of an np.ndarray.
        if is_datetime_with_singletz_array(objects):
            from pandas import DatetimeIndex
            return DatetimeIndex(objects)
    return objects

Code logic tests

I've done initial testing with pytest, comparing the outputs of the old and new functions with this test suite:

import pytest
import itertools

import numpy as np
import pandas as pd

import pandas._libs.lib as lib
import pandas.util.testing as tm

from decimal import Decimal
from datetime import datetime, timezone, timedelta


def data_gen(types):
    N = 2
    data = np.array([0]*N, dtype='object')
    for perm in itertools.product(list(range(len(types))), repeat=N):
        for i in range(data.shape[0]):
            data[i] = types[perm[i]]
        yield data.copy()


check_types = [None, np.nan, pd.NaT,
               True,
               np.iinfo(np.int64).min-1, -1, 1, np.iinfo(np.int64).max+1,
               np.iinfo(np.uint64).max, np.iinfo(np.uint64).max+1,
               1.1, np.float_(1.1),
               1j, np.complex_(1j),
               datetime(2000, 1, 1),
               datetime(2000, 1, 1, 0, 0, 0, 0, timezone.utc),
               np.datetime64('2000-01-01'),
               timedelta(365),
               np.timedelta64(1, 's'),
               # Decimal(1.1),
               'object'
               ]
datas = list(data_gen(check_types))


class Tests:
    @pytest.mark.parametrize("data", datas)
    @pytest.mark.parametrize("safe", [0, 1])
    def test_maybe_convert_objects(self, data, safe):
        old = lib.maybe_convert_objects_OLD(data, 1, safe, 1, 1)
        new = lib.maybe_convert_objects(data, 1, safe, 1, 1)
        print('data:', data, data.dtype,
              '\nold:', old, old.dtype,
              '\nnew:', new, new.dtype)
        if isinstance(old, np.ndarray):
            tm.assert_numpy_array_equal(old, new)
        else:
            tm.assert_index_equal(old, new)

770/800 tests passed
For now it simply creates an array of 2 values for every combination from check_types. I'm in the process of resolving the remaining 30 failures and will then increase N.

Performance tests

Asv benchmarks:

class MaybeConvertObjects:
    dtypes = {"bool": True, "sint": -1, "uint": np.iinfo(np.uint64).max,
              "float": 1.1, "complex": 1j,
              "datetime": datetime(2000, 1, 1), "timedelta": timedelta(365),
              "datetime_tz": datetime(2000, 1, 1, 0, 0, 0, 0, timezone.utc),
              # "decimal": Decimal(1.1),
              "object": "object"}
    dtypes_to_insert = list(dtypes.values()) + [None, pd.NaT]

    datas = {k: np.array([v] * 10**3, dtype='object')
             for k, v in dtypes.items()}

    params = [[0, 1], [0, 1], [0, 1],
              list(dtypes.keys()),
              list(itertools.product(list(range(len(dtypes_to_insert))),
                                     repeat=2))]
    param_names = ["safe", "convert_datetime", "convert_timedelta",
                   "dtype", "insrt_inds"]

    def setup(self, safe, convert_datetime, convert_timedelta,
              dtype, insrt_inds):
        data = self.datas[dtype]
        data[0] = self.dtypes_to_insert[insrt_inds[0]]
        data[len(data) // 2] = self.dtypes_to_insert[insrt_inds[1]]
        # data[2 * len(data) // 3] = self.dtypes_to_insert[insrt_inds[2]]
        self.data = data

    def time_maybe_convert_objects(self, safe, convert_datetime,
                                   convert_timedelta, dtype, insrt_inds):
        lib.maybe_convert_objects(self.data, try_float=0, safe=safe,
                                  convert_datetime=convert_datetime,
                                  convert_timedelta=convert_timedelta)

For 4100/4400 of the benchmarks with changed timings, performance increased (what is the correct way to paste their output here, or is that redundant?). I'm also in the process of resolving the remaining 300.

In order to continue working on this I need to know if this will be even considered or not.

@jbrockmendel
Member

# QUEST: maybe check timedelta64 in 'is_integer_object' directly?

This has just recently been done. So if there is a comment that is no longer accurate, definitely worth changing.

# QUEST: Disable directives for now caues unstable. Enable?
# @cython.boundscheck(False)
# @cython.wraparound(False)

These directives usually improve performance. Are you saying this is not the case here?

For 4100/4400 of tests with changed performance the performance increased (what is the correct way to paste here the output of these? or is it redundant?). I'm also in progress of resolving left 300.

Can you give a ballpark of by how much performance improves? Are we talking 1% or 10% or [...]?

what is the correct way to paste here the output of these? or is it redundant?

After running asv continuous you should see something like BENCHMARKS CHANGED SIGNIFICANTLY. Everything from that line down can be copy/pasted, within triple-ticks to format as code.

In order to continue working on this I need to know if this will be even considered or not.

Yes. The level of interest will depend on the size of the performance improvement.

@BeforeFlight
Contributor Author

BeforeFlight commented Jul 16, 2019

These directives usually improve performance. Are you saying this is not the case here?

Those are questions I left for myself, just as markers. While development is in progress I have disabled these flags so as not to get an unexpected segfault or the like (same for nogil). By the way, the perf tests therefore currently run without these potential speed-ups in the new code, while the old code keeps them.

After running asv continuous you should see something like BENCHMARKS CHANGED SIGNIFICANTLY. Everything from that line down can be copy/pasted, within triple-ticks to format as code.

Yes, of course. I'm asking how to paste it here properly (even pastebin won't accept it, since the size is > 512 KiB). Should I just copy-paste 4500 changed comparisons here?

@jbrockmendel
Member

(even pastebin don't accept it due to size > 512KiB). Just copy-paste here 4500 changes comparings?

It's fine if you want to post every 10th entry or something. Have you checked that the results are stable between multiple runs? asv often lets a lot of noise in for cython code.

@BeforeFlight
Contributor Author

BeforeFlight commented Jul 16, 2019

Have you checked that the results are stable between multiple runs? asv often lets a lot of noise in for cython code.

No, I ran it only once (~2h) - there are other bugs which should be tackled first, I think: the logic must be fully correct before performance tests.

In fact the performance tests are 'trimmed' right now: inserting more other dtypes into the array (currently it's only 2 - at the beginning and in the middle) would make performance differ more, but even 3 inserts would take about 1-1.5 days to test on my laptop, so I will run that only once the code logic is consistent.

Current testing state

As at 27.07.2019

asv tests: failed 0 (f = 1.1, insert_C = 2; histogram attachment not reproduced here)

pytest: (attachment not reproduced here)

@BeforeFlight
Contributor Author

BeforeFlight commented Jul 16, 2019

Resolving failed pytest comparisons.

Here I need someone to choose the correct result from the following pytest outputs:

  • The old version mixes datetimes and timedeltas
data: [datetime.datetime(2000, 1, 1, 0, 0) datetime.timedelta(days=365)] object
old: ['2000-01-01T00:00:00.000000000' '2116-01-05T01:59:38.762754458'] datetime64[ns]
new: [datetime.datetime(2000, 1, 1, 0, 0) datetime.timedelta(days=365)] object
  • Almost the same: NaT is considered a datetime here (but a timedelta in the new version)
data: [numpy.timedelta64(1,'s') NaT] object
old: ['1970-01-01T00:00:00.000000000'     'NaT'] datetime64[ns]
new: [1000000000    'NaT'] timedelta64[ns]
  • Ints out of the [np.int64.min, np.uint64.max] range are sometimes float (the new version returns object)
data: [None -9223372036854775809] object
old: [            nan -9.22337204e+18] float64
new: [None -9223372036854775809] object

The more I think about this problem, the more I dislike the conversion idea (allowing int -> float) at all. It is not symmetric with the other dtypes (we do not try to convert, say, timedelta or bool to int64), and secondly, we get collisions for ints with large absolute value in the float64 representation. But if we want to stick to the previous implementation:

Right now, int valid-range checking is disabled only for seen.null_:

if not seen.null_:

If we still want to use the safe flag here, float64 and complex should be added (i.e. any integer would be converted to float without range checking; a valid range check for float64 could be added here, but that is overkill I think). Should float and complex be added to the cases that disable integer range checking?
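For reference, the collision concern is concrete: float64 has a 53-bit significand, so distinct integers above 2**53 map to the same float and the conversion stops round-tripping exactly.

```python
# float64 cannot represent every integer above 2**53, so the
# int -> float conversion becomes lossy exactly there
big = 2**53
assert float(big) == float(big + 1)       # two distinct ints, one float
assert int(float(big + 1)) != big + 1     # lossy round trip
assert int(float(big - 1)) == big - 1     # below 2**53 it is still exact
```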

@shoyer shoyer changed the title PERF: maybe_convet_objects rethink. PERF: maybe_convert_objects rethink. Jul 16, 2019
@BeforeFlight
Contributor Author

BeforeFlight commented Jul 16, 2019

Questions

The format here is: a question, followed by a proposed answer or reasoning.

pytest

  • Resolve all pytest failures (if any would be added in post above).

None

  • None. A user might use it to replace any value. In the old implementation it is simply converted to np.nan. Leave it as np.nan?

    • Leaving it as np.nan (i.e. a float or complex type) seems asymmetric with NaT (i.e. a datetime or timedelta type). Other dtypes have no None representation. So I see 3 options here:
      • try to cast to np.nan (default)
      • try to cast to NaN as well as to NaT
      • simply return object
  • An array full of None - how should it be treated?

    • Depends on the previous clause. In case 1 -> float here; in case 2 or 3 -> object here.

Time

  • Ambiguity for an array full of NaT if convert_datetime and convert_timedelta.

    • Resolve to datetime, but raise a warning about the ambiguity.
  • In the case of datetimes with a single TZ, the returned object is a DatetimeIndex, not an np.ndarray - correct?

    • Correct, since np.ndarray apparently cannot handle TZ.

Decimal

  • What to do with decimal? I don't think converting it with float(val) is a good idea at all, for (at least) the following reasons:

    • Is there any guarantee it will be reconvertible without losing precision? float(val) doesn't consider its precision at all; it considers nothing except "convertible to float and not another 'major type'".
    • Even if reconversion is possible, how would one determine which values were decimals once everything is float?
    • We do not convert, say, bool to float even though it is also convertible, because we (of course) want to preserve its dtype - why not for decimal?
    • If we still decide to convert it, we have to use isinstance(val, decimal.Decimal), not a bare try float(), since otherwise we have no guarantee that the converted value was indeed a decimal.
    • pandas cannot work with the decimal dtype directly: .astype() and Decimal #26731
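The precision point can be checked directly with the stdlib decimal module; a type check also identifies Decimal values without attempting a lossy conversion:

```python
from decimal import Decimal

# float(val) silently drops precision the Decimal was carrying
d = Decimal("1.100000000000000000001")
assert Decimal(float(d)) != d            # the round trip loses digits

# explicit type detection instead of a bare "try float(val)"
assert isinstance(d, Decimal)
assert not isinstance(1.1, Decimal)
```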

Design

  • How to wrap the functions called inside maybe_convert_objects (i.e. check_float, check_bool, etc.)?
    • A .pxd file or a separate class

Complete test suite

  • It cannot be handmade. How should it be implemented?
    • As an option, preserve the old variant of maybe_convert_objects in the tests section, in order to test the new maybe_convert_objects against it.

Cleaning

  • Decimals are disabled -> should the try_float flag be dropped? If so, what about existing maybe_convert_objects calls - should those be cleaned up as well?

Unrelated minor questions:

  • How to get lib.pyx coverage? It is not present here

Answers:

None

  • Stick to the previous implementation - convert to np.nan.
  • An array full of None is float.

Time

  • Ambiguity: add a # TODO:
  • In the case of a single TZ a DatetimeIndex is returned - correct.

Decimal

  • If a decimal is found - return object.

@jbrockmendel
Member

Another stand-alone question is what to do with decimal. I don't think that converting it with float(val) is a good idea at all, for (at least) the following reasons:

To the extent that bugfixes, performance improvements, and design discussions can be kept independent, that will make things easier for reviewers.

@jbrockmendel
Member

Old version mixes datetimes and timedelta:

Can you write the example there as something copy/pasteable? I tried to reproduce it locally based on my best guess of what you meant and got different results.

@BeforeFlight
Contributor Author

BeforeFlight commented Jul 16, 2019

Can you write the example there as something copy/pasteable? I tried to reproduce it locally based on my best guess of what you meant and got different results.

import numpy as np
import pandas as pd
import pandas._libs.lib as lib

data = np.array([0]*2, dtype='object')
data[0] = np.datetime64('2000-01-01')
data[1] = np.timedelta64(1, 's')
old = lib.maybe_convert_objects(data,1,1,1,1)
print(old, old.dtype)

Output:

['2000-01-01T00:00:00.000000000' '1970-01-01T00:00:00.021389104'] datetime64[ns]

Sorry, I should have mentioned that my output-comparison questions all come from the pytest suite I showed earlier - you may skim it to fully understand what I was doing.

Also, how should I separate all my questions? Should I open another issue, or comment here in some proper way? Or maybe collect all the questions in a single post (the 1st one, for example) and split it into 'performance', 'design', etc. parts?

@jbrockmendel
Member

Also how should I separate all my questions?

For now it's fine to have them in one place. Once you move on to making a PR it becomes more important to keep a narrow focus.

@BeforeFlight
Contributor Author

BeforeFlight commented Jul 16, 2019

As for this concrete example of mixing DT with TD, I believe the problem is here (it is one of my questions about the initial code):

pandas/pandas/_libs/lib.pyx

Lines 2079 to 2081 in 26bd34d

if seen.datetime_:
    if not seen.numeric_:
        return datetimes

We just test for seen datetimes and no seen numerics, but we haven't taken timedeltas into account, hence this weird output (in fact the second value in the output, '1970-01-01T00:00:00.021389104', was not filled at all - it was left as initialized by np.empty, so in theory any other date might appear there).
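The np.empty point is easy to demonstrate: it allocates without initializing, so any slot not explicitly overwritten holds arbitrary bytes reinterpreted as a datetime.

```python
import numpy as np

# np.empty does not zero its memory; until every slot is filled,
# the contents are whatever bytes were already in the allocation
datetimes = np.empty(2, dtype='M8[ns]')
datetimes[0] = np.datetime64('2000-01-01')
# datetimes[1] was never assigned, so its value is arbitrary
```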

@jbrockmendel
Member

Yah that looks like a bug. PR to fix would be welcome

@BeforeFlight
Contributor Author

BeforeFlight commented Jul 16, 2019

It is not so simple here. If the code is something like this:

if seen.datetime_:
    if not seen.timedelta_:
        if not seen.numeric_:
            return datetimes

Then another problem with the initial code arises: NaTs are not processed properly. Every NaT is counted as a DT as well as a TD. So with the proposed fix, an array of [numpy.datetime64, pd.NaT] won't be converted to datetime (because NaT now implies both DT and TD, which we forbid in our hypothetical fix).

So there is a need to introduce something like a Seen().NaT attribute, or something else... I have to think about it more.

Am I correct that the existing code must be fully bug-fixed first (which seems right, but will take time)?

@jbrockmendel
Member

Type inference with NaT comes up every so often. The rule of thumb is for it to be datetime-like by default, and timedelta-like only if that is needed to make an array all-timedelta.
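That rule could be sketched by deferring the decision for NaT instead of counting it as both kinds at once (a hypothetical `infer_datetimelike` helper, not the existing Seen machinery):

```python
import numpy as np
import pandas as pd

def infer_datetimelike(values):
    """NaT is datetime-like by default; it only becomes timedelta-like
    when a real timedelta forces the whole array that way."""
    saw_datetime = saw_timedelta = False
    for v in values:
        if v is pd.NaT:
            continue                      # defer the decision for NaT
        elif isinstance(v, np.datetime64):
            saw_datetime = True
        elif isinstance(v, np.timedelta64):
            saw_timedelta = True
        else:
            return "object"
    if saw_datetime and saw_timedelta:
        return "object"
    if saw_timedelta:
        return "timedelta64[ns]"
    return "datetime64[ns]"               # all-NaT falls through to datetime
```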

@BeforeFlight
Contributor Author

For now I've managed to resolve all conflicts with the current version of maybe_convert_objects except for ints outside the [int64.min, uint64.max] range (my 3rd clause in "Failed pytest comparisons resolving" above). So it would be good to get the correct behaviour settled here.
Also, the "another questions" post above needs approval/suggestions too.

@jbrockmendel
Member

Also "another questions" above needs any approval/suggestions too.

isinstance(val, decimal.Decimal)

There should be an is_decimal function in _libs.lib, use that.

What to do with decimal

I don't know. What do we do now? Returning object-dtype seems reasonable for all-Decimal arrays.

Array full of NaT - how should be treated?

Right now we say datetime, right? There should probably be something like "ambiguous", and then we require the user to specify the dtype.

@BeforeFlight
Contributor Author

BeforeFlight commented Jul 20, 2019

Returning object-dtype seems reasonable for all-Decimal arrays.

In fact I agree here, especially since pandas currently cannot work with decimal directly (#26731). So the decision is object, correct?

Right now we say datetime right? There should probably be something like "ambiguous" and then we require the user to specify the dtype

I've updated questions section

@jbrockmendel
Member

But I don't think there should be any flag for ambiguity - this is a service function, not a user-facing function.

The case I have in mind is pd.Series([pd.NaT, pd.NaT]) where we infer datetime64[ns] dtype and instead we should probably raise and tell the user to be specific.

@jbrockmendel jbrockmendel added the Performance Memory or execution speed performance label Jul 21, 2019
@BeforeFlight
Contributor Author

@jbrockmendel it's been 5 days already. Please check the 3rd pytest failure when you have some time (it shouldn't take long). I added a minimal question with info there.

@jbrockmendel
Member

which is the 3rd pytest failure?

@BeforeFlight
Contributor Author

BeforeFlight commented Jul 25, 2019

I've collected all my questions above under the histogram with the asvs. The 3rd one is unresolved.
There are also other questions (not hard ones, I suppose) but they are not blocking right now.

Should float and complex be added to the cases that disable integer range checking?

@jbrockmendel
Member

In the integer -> float case the conversion is lossy, right? I.e. it doesn't round-trip correctly? This might be a good case for IntegerNA. cc @jreback

@BeforeFlight
Contributor Author

BeforeFlight commented Jul 25, 2019

Yes, conversion to float64 leads to collisions for ints starting from 2^53, i.e. at about log10(2^53) ≈ 16 decimal digits.

@BeforeFlight
Contributor Author

BeforeFlight commented Jul 26, 2019

  1. So for int + nan, return an IntegerNA array, correct?

  2. What about int + None?

  3. And what about int + float? The same question about conversion to float: should int range checking apply here too? Right now range checking is applied:

test = np.array([np.iinfo(np.uint64).max+1, 1.1], dtype=np.object)
lib.maybe_convert_objects(test)
array([18446744073709551616, 1.1], dtype=object)

As well as the not-(sint and uint) filter:

test = np.array([-1, np.iinfo(np.uint64).max, 1.1], dtype=np.object)
lib.maybe_convert_objects(test)
array([-1, 18446744073709551615, 1.1], dtype=object)

@jbrockmendel
Member

Would float128 be a valid workaround? IIRC there are some platforms/builds on which it is not available, but that may have been fixed in more recent numpy versions.

@BeforeFlight
Contributor Author

You mean in the int64 + float64 case - return float128?

@jbrockmendel
Member

in the case where keeping float64 would be lossy, yah

@jreback
Contributor

jreback commented Jul 26, 2019

we don’t support float128

@BeforeFlight
Contributor Author

BeforeFlight commented Jul 26, 2019

I've got an idea (maybe weird, I don't know): since pandas already has an extension-dtype platform, how about creating the following int-float dtype?

In IntegerNA there are 2 arrays: data and mask. Why not do the same? But the mask here would indicate whether a value is an int or a float. Moreover, I believe IntegerNA could simply be extended to this state (nan would be included in the float part).

For example, int-float64 ("fint64") would consist of a 64-bit data array plus a 1-bit array indicating whether each value is a float64 or an int64.
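A rough pure-Python sketch of what such a "fint64" container could look like (the class name and layout are hypothetical, just to make the idea concrete):

```python
import numpy as np

class FInt64Array:
    """Hypothetical 'fint64' sketch: one 64-bit payload array plus a
    boolean mask saying whether each payload is an int64 value or the
    bit pattern of a float64 (nan would live on the float side)."""

    def __init__(self, values):
        self.data = np.empty(len(values), dtype=np.int64)
        self.is_float = np.zeros(len(values), dtype=bool)
        for i, v in enumerate(values):
            if isinstance(v, float):
                self.is_float[i] = True
                # store the float64 bit pattern in the int64 slot
                self.data[i] = np.float64(v).view(np.int64)
            else:
                self.data[i] = v  # exact int64, no float rounding

    def __getitem__(self, i):
        if self.is_float[i]:
            # reinterpret the stored bits back as float64
            return float(self.data[i:i + 1].view(np.float64)[0])
        return int(self.data[i])
```

Reading back `2**62` stays exact while `1.1` round-trips bit-for-bit, which is the whole point of the split storage.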

@jbrockmendel
Member

@BeforeFlight is this a bug that is causing problems in the wild? If not, we'd probably be better off putting it on the backburner. Either the int-float64 idea or the IntegerNA approach would be pretty invasive, which we're reticent about for hopefully-obvious reasons.

None of this is to discourage you from making contributions, but I'm assuming that there is lower-hanging fruit you'd be interested in tackling before this.

@BeforeFlight
Contributor Author

BeforeFlight commented Jul 26, 2019

@jbrockmendel if I get it correctly (sorry, I'm not a native English speaker and your post contains some metaphors which I don't get): int-float and IntegerNA are too invasive, so we should not do them. Then we go one step back: what to do with int + float arrays, i.e. my recent 3 questions.

there is lower-hanging fruit you'd be interested in tackling before this.

Sorry, this I can't get at all. Do you mean again "don't do int-float and IntegerNA"?

@jbrockmendel
Member

(sorry not english-native and your post contains some metaphors which I dont' get)

Thanks for clarifying. I will try to express this more plainly.

Is the nan+int64 bug high-priority? Fixing it (either approach discussed so far) will be a big change, which we would rather delay unless it is a high priority.

@BeforeFlight
Contributor Author

BeforeFlight commented Jul 26, 2019

If we do not want to make a "heavy" decision about the int -> float conversion, we might just leave it as is (well, almost) with some "# TODO:"s. The current conversion is also lossy, but it doesn't cause any problems yet:

vals = [np.iinfo(np.uint64).max - 10**i for i in range(5)]+[1.1]
test = np.array(vals, dtype=np.object)
conv = lib.maybe_convert_objects(test)
for i, el in enumerate(conv):
    print((vals[i], format(el, '.0f')))
(18446744073709551614, '18446744073709551616')  
(18446744073709551605, '18446744073709551616')  
(18446744073709551515, '18446744073709551616')  
(18446744073709550615, '18446744073709551616')  
(18446744073709541615, '18446744073709541376')  
(1.1, '1')

The float-int question is about 80% of what's blocking my PR, I suppose. The other questions should be simpler.

@BeforeFlight
Contributor Author

BeforeFlight commented Jul 26, 2019

So we want to stick with the current implementation. Then some consistency needs to be added here:

test = np.array([None, np.iinfo(np.uint64).max+1], dtype=np.object)  # test 1
lib.maybe_convert_objects(test)
# FLOAT: array([           nan, 1.84467441e+19])

test = np.array([np.nan, np.iinfo(np.uint64).max+1], dtype=np.object)  # test 2
lib.maybe_convert_objects(test)
# OBJECT: array([nan, 18446744073709551616], dtype=object)

test = np.array([1.1, np.iinfo(np.uint64).max+1], dtype=np.object)  # test 3
lib.maybe_convert_objects(test)
# OBJECT: array([1.1, 18446744073709551616], dtype=object)

With `seen.null_`, integer range checking is disabled, but it is not disabled when a float or complex has been seen. I partially stated this in (3) today and in pytest problem number 3.

Possible solutions for `if not seen.null_:`:

  1. (Suggested) test 1 -> object, i.e. remove this `if not seen.null_:`. Then [None, np.iinfo(np.uint64).max+1] -> object. But we get a small performance drop for cases like [None, 1, 2, 3, many ints or floats], where we will check int bounds.

    • Besides, `if not seen.null_` is not symmetric with respect to nan, though None and nan are both treated as nan.

  2. tests 2, 3 -> float, i.e. `if not seen.null_:` -> `if not (seen.null_ or seen.float_ or seen.complex_):`

    • Then we allow not only [None, uint64.max+1] to be converted to float, but any [float, HUGE_int], which will be even more lossy.
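A tiny Python model of the two options may make the trade-off concrete (the flag names mirror the Cython `Seen` attributes; this is a toy model, not pandas code):

```python
def float_result_allowed(seen_null, seen_float, seen_complex,
                         int_out_of_range, option):
    """Toy model: does an out-of-int64/uint64-range integer still permit
    a float64 result under each proposed rule?"""
    if not int_out_of_range:
        return True
    if option == 1:
        # suggested rule: a huge int always forces object dtype
        return False
    # option 2: any seen null/float/complex disables the range check
    return seen_null or seen_float or seen_complex

# test 1: [None, uint64.max + 1]
print(float_result_allowed(True, False, False, True, option=1))   # False -> object
# test 3: [1.1, uint64.max + 1]
print(float_result_allowed(False, True, False, True, option=2))   # True -> float (lossy)
```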

@BeforeFlight
Contributor Author

Assuming this `if not seen.null_` is removed, I've managed to pass all 74088 pytests for N=3.

I will pytest it for N=4 (it will take quite some time) and leave the ASVs running for the night; I'll update the histogram tomorrow.

@BeforeFlight
Contributor Author

BeforeFlight commented Jul 27, 2019

All pytests for N=4 passed.
All changed ASVs also improved.

So, in the case of dropping `if not seen.null_`, I have no blocking questions here (all resolved questions with answers are under the pytest questions). Non-blocking questions can be asked in the PR.

@BeforeFlight
Contributor Author

Pytest:

import itertools
from datetime import datetime, timedelta, timezone

import numpy as np
import pytest

import pandas as pd
import pandas._libs.lib as lib
import pandas.util.testing as tm

check_types = [None, np.nan, pd.NaT,
               True,
               np.iinfo(np.int64).min-1, np.iinfo(np.int64).min,
               1,
               np.iinfo(np.uint64).max, np.iinfo(np.uint64).max+1,
               1.1, np.float_(1.1),
               1j, np.complex_(1j),
               datetime(2000, 1, 1),
               datetime(2000, 1, 1, 0, 0, 0, 0, timezone.utc),
               datetime(2000, 1, 1, 0, 0, 0, 0, timezone(timedelta(hours=1))),
               np.datetime64('2000-01-01'),
               timedelta(365),
               np.timedelta64(1, 's'),
               'object'
               ]
# indices of the datetime/timedelta/NaT entries in check_types
DT_TD_NaT_inds = [2] + list(range(13, 19))


def data_gen(N, perms, exclude_all_is_time=False):
    data = np.array([0]*N, dtype='object')
    for perm in perms:
        if exclude_all_is_time and all(ind in DT_TD_NaT_inds for ind in perm):
            continue
        for i in range(data.shape[0]):
            data[i] = check_types[perm[i]]
        yield data.copy()


N = 2
exclude_all_is_time = False
perms = itertools.product(list(range(len(check_types))), repeat=N)
datas = data_gen(N, perms, exclude_all_is_time)
data_inds = list(range(len(check_types)**N)) if not exclude_all_is_time \
    else list(range((len(check_types)-len(DT_TD_NaT_inds))**N))


@pytest.fixture(scope='module', params=data_inds)
def get_data():
    return next(datas)


class Tests:
    @pytest.mark.parametrize("safe", [0, 1])
    @pytest.mark.parametrize("conv_DT", [0, 1])
    @pytest.mark.parametrize("conv_TD", [0, 1])
    def test_maybe_convert_objects(self, get_data, safe, conv_DT, conv_TD):
        data = get_data
        old = lib.maybe_convert_objects_OLD(data, try_float=1, safe=safe,
                                            convert_datetime=conv_DT,
                                            convert_timedelta=conv_TD)
        new = lib.maybe_convert_objects(data, try_float=1, safe=safe,
                                        convert_datetime=conv_DT,
                                        convert_timedelta=conv_TD)
        if isinstance(old, np.ndarray):
            tm.assert_numpy_array_equal(old, new)
        else:
            tm.assert_index_equal(old, new)

Asv:

import itertools
from datetime import datetime, timedelta, timezone

import numpy as np

import pandas as pd
import pandas._libs.lib as lib


class MaybeConvertObjects:
    dtypes = {"bool": True, "sint": -1, "uint": np.iinfo(np.uint64).max,
              "float": 1.1, "complex": 1j,
              "datetime": datetime(2000, 1, 1), "timedelta": timedelta(365),
              "datetime_tz_utc": datetime(2000, 1, 1, 0, 0, 0, 0, timezone.utc),
              "datetime_tz_utc1": datetime(2000, 1, 1, 0, 0, 0, 0,
                                           timezone(timedelta(hours=1))),
              "object": "object"}
    dtypes_to_insert = list(dtypes.values()) + [None, pd.NaT]

    datas = {k: np.array([v] * 10**3, dtype='object')
             for k, v in dtypes.items()}

    insert_C = 2  # number of insert into homogeneous array
    params = [[0, 1], [0, 1], [0, 1],
              list(dtypes.keys()),
              list(itertools.product(list(range(len(dtypes_to_insert))),
                                     repeat=insert_C))]
    param_names = ["safe", "conv_DT", "conv_TD", "dtype", "insrt_inds"]

    def setup(self, safe, conv_DT, conv_TD, dtype, insrt_inds):
        data = self.datas[dtype]
        # inserting some values in data
        for i in range(self.insert_C):
            data[(i + 1) * len(data) // (self.insert_C + 1)] = \
                self.dtypes_to_insert[insrt_inds[i]]
        self.data = data

    def time_maybe_convert_objects(self, safe, conv_DT, conv_TD,
                                   dtype, insrt_inds):
        lib.maybe_convert_objects(self.data, try_float=0, safe=safe,
                                  convert_datetime=conv_DT,
                                  convert_timedelta=conv_TD)

Code:

cdef inline bint is_datetime(object o):
    # is_datetime also hits NaT
    return PyDateTime_Check(o) or util.is_datetime64_object(o)

cdef inline bint is_timedelta(object o):
    # companion helper used by check_timedelta/check_NaT below;
    # PyDelta_Check is cimported from cpython.datetime
    return PyDelta_Check(o) or util.is_timedelta64_object(o)

@cython.boundscheck(False)
@cython.wraparound(False)
cdef str check_complex(bint safe,
                       object[:] objects,
                       complex128_t[:] complexes,
                       int n = 0, Seen seen = None):
    if seen is None:
        seen = Seen()

    cdef:
        float64_t fnan = np.nan
        Py_ssize_t i
    for i in range(n, objects.shape[0]):
        val = objects[i]
        if util.is_complex_object(val) or util.is_float_object(val):
            complexes[i] = val
        elif val is None:
            complexes[i] = fnan
        elif not safe and util.is_integer_object(val):
            # is_integer also hits timedelta64
            if util.is_timedelta64_object(val):
                return 'object'

            val = int(val)
            seen.saw_int(val)
            if ((seen.uint_ and seen.sint_) or
                not (oINT64_MIN <= val and val <= oUINT64_MAX)):
                return 'object'

            complexes[i] = <complex128_t>val
        else:
            return 'object'
    return 'complex'

@cython.boundscheck(False)
@cython.wraparound(False)
cdef str check_float(bint safe,
                     object[:] objects,
                     float64_t[:] floats,
                     complex128_t[:] complexes,
                     int n = 0, Seen seen = None):
    if seen is None:
        seen = Seen()

    cdef:
        float64_t fnan = np.nan
        Py_ssize_t i
    for i in range(n, objects.shape[0]):
        val = objects[i]
        if util.is_float_object(val):
            floats[i] = complexes[i] = val
        elif val is None:
            floats[i] = complexes[i] = fnan
        elif not safe and util.is_integer_object(val):
            # is_integer also hits timedelta64
            if util.is_timedelta64_object(val):
                return 'object'

            val = int(val)
            seen.saw_int(val)
            if ((seen.uint_ and seen.sint_) or
                not (oINT64_MIN <= val and val <= oUINT64_MAX)):
                return 'object'

            floats[i] = <float64_t>val
            complexes[i] = <complex128_t>val
        elif util.is_complex_object(val):
            return check_complex(safe, objects, complexes, i, seen)
        else:
            return 'object'
    return 'float'

@cython.boundscheck(False)
@cython.wraparound(False)
cdef str check_integer(bint safe,
                       object[:] objects,
                       int64_t[:] ints,
                       uint64_t[:] uints,
                       float64_t[:] floats,
                       complex128_t[:] complexes,
                       int n = 0):
    cdef:
        Seen seen = Seen()
        Py_ssize_t i
    for i in range(n, objects.shape[0]):
        val = objects[i]
        if util.is_integer_object(val):
            # is_integer hits timedelta64
            if util.is_timedelta64_object(val):
                return 'object'

            val = int(val)
            seen.saw_int(val)

            if ((seen.uint_ and seen.sint_) or
                not (oINT64_MIN <= val and val <= oUINT64_MAX)):
                return 'object'

            if seen.uint_:
                uints[i] = val
            elif seen.sint_:
                ints[i] = val
            else:
                uints[i] = val
                ints[i] = val

            if not safe:
                floats[i] = <float64_t>val
                complexes[i] = <complex128_t>val
        elif safe:
            return 'object'
        elif util.is_float_object(val) or val is None:
            return check_float(safe, objects, floats, complexes, i, seen)
        elif util.is_complex_object(val):
            return check_complex(safe, objects, complexes, i, seen)
        else:
            return 'object'

    if seen.uint_:
        return 'uint'
    else:
        return 'int'

@cython.boundscheck(False)
@cython.wraparound(False)
cdef str check_bool(object[:] objects,
                    uint8_t[:] bools,
                    int n = 0):
    cdef Py_ssize_t i
    for i in range(n, objects.shape[0]):
        val = objects[i]
        if util.is_bool_object(val):
            bools[i] = val
        else:
            return 'object'
    return 'bool'

@cython.boundscheck(False)
@cython.wraparound(False)
cdef str check_datetime(object[:] objects,
                        int64_t[:] idatetimes,
                        int n = 0):
    cdef:
        Py_ssize_t i
        bint all_is_NaT = 1  # flag for `fast` object in case of different TZ
    for i in range(n, objects.shape[0]):
        val = objects[i]
        if is_datetime(val):
            # is_datetime also hits NaT
            if val is NaT:
                idatetimes[i] = NPY_NAT
            else:
                if getattr(val, 'tzinfo', None) is not None:
                    if i > 0 and not all_is_NaT:
                        return 'object'
                    else:
                        return 'datetime_tz'
                all_is_NaT = 0
                idatetimes[i] = convert_to_tsobject(val, None, None, 0, 0).value
        else:
            return 'object'
    return 'datetime'

@cython.boundscheck(False)
@cython.wraparound(False)
cdef str check_timedelta(object[:] objects,
                         int64_t[:] itimedeltas,
                         int n = 0):
    cdef Py_ssize_t i
    for i in range(n, objects.shape[0]):
        val = objects[i]
        if is_timedelta(val):
            itimedeltas[i] = convert_to_timedelta64(val, 'ns')
        elif val is NaT:
            itimedeltas[i] = NPY_NAT
        else:
            return 'object'
    return 'timedelta'

@cython.boundscheck(False)
@cython.wraparound(False)
cdef str check_NaT(bint convert_datetime,
                   bint convert_timedelta,
                   object[:] objects,
                   int64_t[:] idatetimes,
                   int64_t[:] itimedeltas,
                   int n = 0):
    cdef Py_ssize_t i
    for i in range(n, objects.shape[0]):
        val = objects[i]
        if val is NaT:
            idatetimes[i] = itimedeltas[i] = NPY_NAT
        elif convert_datetime and is_datetime(val):
            return check_datetime(objects, idatetimes, i)
        elif convert_timedelta and is_timedelta(val):
            return check_timedelta(objects, itimedeltas, i)
        else:
            return 'object'

    # full NaT output
    if convert_datetime and convert_timedelta:
        # TODO: array full of NaT ambiguity resolve here needed
        return 'object'
    elif convert_datetime:
        return 'datetime'
    elif convert_timedelta:
        return 'timedelta'
    else:
        return 'object'

@cython.boundscheck(False)
@cython.wraparound(False)
cdef str check_None(bint safe,
                    object[:] objects,
                    float64_t[:] floats,
                    complex128_t[:] complexes,
                    int n = 0):
    cdef:
        float64_t fnan = np.nan
        Py_ssize_t i
    for i in range(n, objects.shape[0]):
        val = objects[i]
        if val is None:
            floats[i] = complexes[i] = fnan
        elif util.is_float_object(val):
            return check_float(safe, objects, floats, complexes, i)
        elif util.is_complex_object(val):
            return check_complex(safe, objects, complexes, i)
        elif not safe and util.is_integer_object(val):
            return check_float(safe, objects, floats, complexes, i)
        else:
            return 'object'
    return 'object'

@cython.boundscheck(False)
@cython.wraparound(False)
def maybe_convert_objects(ndarray[object] objects, bint try_float=0,
                          bint safe=0, bint convert_datetime=0,
                          bint convert_timedelta=0):
    """
    Type inference function: convert an object array to the proper dtype.
    """
    cdef int n = objects.shape[0]
    if n == 0:
        return objects

    cdef:
        ndarray[uint8_t]      bools     = np.empty(n, dtype=np.uint8)
        ndarray[int64_t]      ints      = np.empty(n, dtype=np.int64)
        ndarray[uint64_t]     uints     = np.empty(n, dtype=np.uint64)
        ndarray[float64_t]    floats    = np.empty(n, dtype=np.float64)
        ndarray[complex128_t] complexes = np.empty(n, dtype=np.complex128)

    if convert_datetime:
        datetimes = np.empty(n, dtype='M8[ns]')
        idatetimes = datetimes.view(np.int64)

    if convert_timedelta:
        timedeltas = np.empty(n, dtype='m8[ns]')
        itimedeltas = timedeltas.view(np.int64)

    # Type probing - determine type of first element
    # and call corresponding check
    cdef:
        str convert_dtype = 'object'
        object val = objects[0]

    if val is None:
        convert_dtype = check_None(safe, objects, floats, complexes)
    elif val is NaT:
        if convert_datetime and convert_timedelta:
            convert_dtype = check_NaT(convert_datetime, convert_timedelta,
                                      objects, idatetimes, itimedeltas)
        elif convert_datetime:
            convert_dtype = check_datetime(objects, idatetimes)
        elif convert_timedelta:
            convert_dtype = check_timedelta(objects, itimedeltas)
    elif is_timedelta(val):
        if convert_timedelta:
            convert_dtype = check_timedelta(objects, itimedeltas)
    elif is_datetime(val):
        if convert_datetime:
            convert_dtype = check_datetime(objects, idatetimes)
    elif util.is_bool_object(val):
        convert_dtype = check_bool(objects, bools)
    elif util.is_integer_object(val):
        convert_dtype = check_integer(safe, objects,
                                      ints, uints, floats, complexes)
    elif util.is_float_object(val):
        convert_dtype = check_float(safe, objects, floats, complexes)
    elif util.is_complex_object(val):
        convert_dtype = check_complex(safe, objects, complexes)

    # OUTPUT
    if convert_dtype == 'bool':
        return bools.view(np.bool_)
    elif convert_dtype == 'int':
        return ints
    elif convert_dtype == 'uint':
        return uints
    elif convert_dtype == 'float':
        return floats
    elif convert_dtype == 'complex':
        return complexes
    elif convert_dtype == 'datetime':
        return datetimes
    elif convert_dtype == 'timedelta':
        return timedeltas
    elif convert_dtype == 'datetime_tz':
        if is_datetime_with_singletz_array(objects):
            from pandas import DatetimeIndex
            return DatetimeIndex(objects)
    return objects
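For readers skimming the Cython above, the probe-and-dispatch control flow can be modeled in a few lines of pure Python (bool/int/float only, no datetime handling; an illustration of the design, not the actual implementation):

```python
import numpy as np

def convert_objects_sketch(objects):
    """Probe the first element, run the matching narrow check, and hand
    off to a wider check (or bail out to object dtype) as soon as an
    inconsistent element appears."""
    n = len(objects)
    if n == 0:
        return objects

    def as_floats():
        out = np.empty(n, dtype=np.float64)
        for i, v in enumerate(objects):
            if isinstance(v, bool) or not isinstance(v, (int, float)):
                return objects              # inconsistent -> keep object
            out[i] = v
        return out

    def as_ints():
        out = np.empty(n, dtype=np.int64)
        for i, v in enumerate(objects):
            if isinstance(v, float):
                return as_floats()          # widen: rescan as float64
            if isinstance(v, bool) or not isinstance(v, int):
                return objects
            out[i] = v
        return out

    def as_bools():
        out = np.empty(n, dtype=bool)
        for i, v in enumerate(objects):
            if not isinstance(v, bool):
                return objects
            out[i] = v
        return out

    first = objects[0]
    if isinstance(first, bool):
        return as_bools()
    if isinstance(first, int):
        return as_ints()
    if isinstance(first, float):
        return as_floats()
    return objects
```

The key property is the early exit: once a non-convertible element is seen, the scan stops instead of classifying every remaining value.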
