Skip to content

BUG: Make lib.maybe_convert_objects work with uint64 #4845

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

jtratner
Copy link
Contributor

Fixes #4471.

Only when it's greater than uint64 max (and not negative, etc.)

@jreback
Copy link
Contributor

jreback commented Sep 15, 2013

unit64 should already be preserved in block manager
you don't need a separate type

@jtratner
Copy link
Contributor Author

it's not, snippet from form_block:

        elif issubclass(v.dtype.type, np.integer):
            if v.dtype == np.uint64:
                # HACK #2355 definite overflow
                if (v > 2 ** 63 - 1).any():
                    object_items.append((i, k, v))
                    continue
            int_items.append((i, k, v))

@jtratner
Copy link
Contributor Author

you know what, i think it's because of the pathway through, so I identified the wrong error here. Sorry, still getting used to the internals.

Series(np.array([5], dtype='uint64')) comes out as uint

@jtratner
Copy link
Contributor Author

here's the problem:

>>> import sys
>>> uint_arr = np.array([sys.maxint + 5], dtype='uint64')
>>> ser = Series(uint_arr)
>>> ser.dtype
dtype('uint64')
>>> df = DataFrame([uint_arr])
>>> df.dtypes
0    object
dtype: object

@jtratner
Copy link
Contributor Author

But it does work if not in a list:

In [16]: df = DataFrame(uint_arr)
In [18]: df.dtypes
Out[18]:
0    uint64
dtype: object

Time to go a-searching.

@jreback
Copy link
Contributor

jreback commented Sep 15, 2013

the problem is in interleave_dtypes it's a bit tricky

@jtratner
Copy link
Contributor Author

@jreback well, you can just take that set of lines out and it works...but then it overflows under certain ops

What's supposed to happen if you do this?

ser = Series(np.array([5], dtype='uint64'))
ser - 10

Should pandas handle that for you and try to promote? or should it just overflow?

@jtratner
Copy link
Contributor Author

Right now it ends up with (on current master)

0    18446744073709551611
Name: 0, dtype: uint64

@jtratner
Copy link
Contributor Author

The test suite implies that, at least with dataframe, it's not supposed to overflow (subtracting a uint column such that it goes negative) :

 #4414 ish in pandas/tests/test_frame.py
            # vs mix int
            if op in ['add','sub','mul']:
                result = getattr(self.mixed_int, op)(2 + self.mixed_int)
                exp = f(self.mixed_int, 2 + self.mixed_int)

                # overflow in the uint
                dtype = None
                if op in ['sub']:
                    dtype = dict(B = 'object', C = None)

@jreback
Copy link
Contributor

jreback commented Sep 15, 2013

uint64 are very funny
the problem is you just want overflow to bubble up if it happens (which may not be happening )

@jtratner
Copy link
Contributor Author

Well, right now overflow doesn't bubble up. And it doesn't bubble up in numpy either. I'd vote to not bubble up on overflow like numpy.

@jtratner
Copy link
Contributor Author

@jreback this should pass now. Behavior is to allow overflow, keep dtype as uint64. (which I believe is same behavior as numpy)

@wesm
Copy link
Member

wesm commented Sep 15, 2013

I'd like to review this before merging...I had to deal with this myself in a different context recently

@jtratner
Copy link
Contributor Author

Yeah, definitely, wasn't planning to merge until you looked at it.

@jtratner
Copy link
Contributor Author

added some more test cases just to make sure I covered all the branches that interact with uint64.

@wesm
Copy link
Member

wesm commented Sep 17, 2013

Okay. I don't think that returning uint64 for the case that you have all positive integers is a good idea. Here's why:

In [1]: np.array([1,2 ,3,4,5], dtype=np.uint64) - 5
Out[1]: 
array([18446744073709551612, 18446744073709551613, 18446744073709551614,
       18446744073709551615,                    0], dtype=uint64)

Instead, I recommend that PyLong objects only be bounds checked to see if they are over INT64_MAX, in other words:

In [2]: np.iinfo(np.int64).max
Out[2]: 9223372036854775807

In [3]: np.iinfo(np.int64).max + 1
Out[3]: 9223372036854775808L

not so bad right? Just have to be careful with doing a PyLong_Check(obj) and only checking for overflow in such cases. maybe you want to also write a unit test containing numpy.uint64 scalar values so you know you catch that one, since they will not typecheck as PyLong

In [5]: np.uint64.mro()
Out[5]: 
[numpy.uint64,
 numpy.unsignedinteger,
 numpy.integer,
 numpy.number,
 numpy.generic,
 object]

if something exceeds INT64_MAX, you should enter another function that attempts to put the integers into a uint64 array -- if it cannot the original object array is returned.

pls ping me again when this is ready to have a look at again!

@jtratner
Copy link
Contributor Author

thanks for the feedback -- I agree that uint64 is tricky with all the overflow issues, but I'm confused about the difference between what you're suggesting and how the changes work now (or maybe you're just suggesting different internal architecture). Right now, it only tries to make uint if it overflows.

>>> lib.maybe_convert_objects(np.array([5, 3, 2**63 + 15], dtype=object))
array([                  5,                   3, 9223372036854775823], dtype=uint64)

whereas if it's all integers, comes out as integer

>>> lib.maybe_convert_objects(np.array([1, 2, 3], dtype=object))
array([1, 2, 3])
>>> _.dtype
np.dtype('int64')

But what about the case where you end up with a long > INT64_MAX? Still 'uint64'? Or try for something like doubledouble or longdouble or something?

That said, it might just make sense to wrap the entire thing in a try/except clause, try to put longs in int64, and if it fails with an OverflowError, then try to make an array of uint64 instead. (which would simplify the looping code too by removing the try suite from every iteration.

When it's greater than uint64 max (and not negative, etc.)
remove extraneous typecheck against uint64

better dtype checks in test_frame

update tests to reflect actual use of uint64, etc
@jreback
Copy link
Contributor

jreback commented Sep 30, 2013

pushed the issue & PR to 0.14

@ghost
Copy link

ghost commented Jan 3, 2014

@jtratner, this is in limbo. are you waiting for input from wes or can this go forward?

@jtratner
Copy link
Contributor Author

jtratner commented Jan 3, 2014

I don't remember :P I'll take another look at what I wrote up and the
thread. Thanks for the ping :)

@jreback
Copy link
Contributor

jreback commented Feb 18, 2014

@jtratner closing for now...pls reopen if you are continuing work on this

@jreback jreback closed this Feb 18, 2014
gfyoung added a commit to forking-repos/pandas that referenced this pull request Dec 19, 2016
Adds handling for uint64 objects during conversion.
When negative numbers and uint64 are detected, we
then convert the result to object.

Picks up where pandas-devgh-4845 left off. Closes pandas-devgh-4471.
gfyoung added a commit to forking-repos/pandas that referenced this pull request Dec 19, 2016
Adds handling for uint64 objects during conversion.
When negative numbers and uint64 are detected, we
then convert the result to object.

Picks up where pandas-devgh-4845 left off. Closes pandas-devgh-4471.
gfyoung added a commit to forking-repos/pandas that referenced this pull request Dec 19, 2016
Adds handling for uint64 objects during conversion.
When negative numbers and uint64 are detected, we
then convert the result to object.

Picks up where pandas-devgh-4845 left off. Closes pandas-devgh-4471.
gfyoung added a commit to forking-repos/pandas that referenced this pull request Dec 19, 2016
Adds handling for uint64 objects during conversion.
When negative numbers and uint64 are detected, we
then convert the result to object.

Picks up where pandas-devgh-4845 left off. Closes pandas-devgh-4471.
gfyoung added a commit to forking-repos/pandas that referenced this pull request Dec 20, 2016
Adds handling for uint64 objects during conversion.
When negative numbers and uint64 are detected, we
then convert the result to object.

Picks up where pandas-devgh-4845 left off. Closes pandas-devgh-4471.
jreback pushed a commit that referenced this pull request Dec 20, 2016
Adds handling for `uint64` objects during conversion.  When negative
numbers and `uint64` are detected, we then convert the result to
`object`.    Picks up where #4845 left off. Closes #4471.

Author: gfyoung <[email protected]>

Closes #14916 from gfyoung/convert-objects-uint64 and squashes the following commits:

ed325cd [gfyoung] BUG: Convert uint64 in maybe_convert_objects
ShaharBental pushed a commit to ShaharBental/pandas that referenced this pull request Dec 26, 2016
Adds handling for `uint64` objects during conversion.  When negative
numbers and `uint64` are detected, we then convert the result to
`object`.    Picks up where pandas-dev#4845 left off. Closes pandas-dev#4471.

Author: gfyoung <[email protected]>

Closes pandas-dev#14916 from gfyoung/convert-objects-uint64 and squashes the following commits:

ed325cd [gfyoung] BUG: Convert uint64 in maybe_convert_objects
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

lib.maybe_convert_objects will fail on uint64 values that exceed int64 max
3 participants