Skip to content

ENH: Create and propagate UInt64Index #14937

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

gfyoung
Copy link
Member

@gfyoung gfyoung commented Dec 21, 2016

  1. Introduces and propagates UInt64Index, an index specifically for uint. xref BUG: Patch rank() uint64 behavior #14935

  2. Patches bug from BUG: Convert uint64 in maybe_convert_objects #14916 that makes maybe_convert_objects robust against the known numpy bug that uint64 cannot be compared to int64. This bug was caught during testing of UInt64Index.

UPDATE: Patched in #14951

@jreback jreback added Dtype Conversions Unexpected or buggy dtype conversions Enhancement Indexing Related to indexing on series/frames, not to indexes themselves labels Dec 21, 2016
@jreback
Copy link
Contributor

jreback commented Dec 21, 2016

can you separate out

Patches bug from #14916 that makes maybe_convert_objects robust against the known numpy bug that uint64 cannot be compared to int64. This bug was caught during testing of UInt64Index.

@@ -99,6 +99,7 @@ Other enhancements
unsorted MultiIndex (:issue:`11897`). This allows differentiation between errors due to lack
of sorting or an incorrect key. See :ref:`here <advanced.unsorted>`

- New ``UInt64Index`` (subclass of ``NumericIndex``) for specifically indexing unsigned integers (:issue:`14935`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

create a new sub-section for all uint64 issues (can be one later)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created.

@@ -426,6 +426,67 @@ cdef class Int64Engine(IndexEngine):

return result

cdef class UInt64Engine(IndexEngine):

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we are going to need to simplify this boilerplate, either expanding some methods in the superclass, maybe using some tempita. I hate copy-pasting code.

@codecov-io
Copy link

codecov-io commented Dec 21, 2016

Current coverage is 85.54% (diff: 95.00%)

Merging #14937 into master will increase coverage by <.01%

@@             master     #14937   diff @@
==========================================
  Files           145        145          
  Lines         51288      51350    +62   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          43872      43928    +56   
- Misses         7416       7422     +6   
  Partials          0          0          

Powered by Codecov. Last update 0e219d7...8ab6fbd

raise KeyError(val)

cdef _maybe_get_bool_indexer(self, object val):
cdef:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can prob just use fused types for routines like this (for the scalar & array type)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am planning to template, so this shouldn't be problematic anymore.

except (OverflowError, TypeError, ValueError):
pass

try:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a comment

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -177,6 +177,91 @@ def _assert_safe_casting(cls, data, subarr):
Int64Index._add_logical_methods()


class UInt64Index(NumericIndex):
"""
Immutable ndarray implementing an ordered, sliceable set. The basic object
Copy link
Contributor

@jreback jreback Dec 21, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should use a shared doc for these indexes __init__

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. Done.

_default_dtype = np.uint64

@property
def inferred_type(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm? you don't want to differentiate

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not see a reason to do so at this point.

return self.values.view('u8')

@property
def is_all_dates(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is defined in the super class (or it should be)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is defined in Index. However, that implementation is very generic and less performant. We can just return the boolean immediately if possible, which is what I do and is also done in many subclasses (e.g Int64Index, DatetimeIndex).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be a method on NumericIndex, which just returns False (then is overriden in the datetime ones).

@gfyoung
Copy link
Member Author

gfyoung commented Dec 22, 2016

@jreback : See #14951 for patching maybe_convert_objects

@gfyoung gfyoung force-pushed the create-uint64-index branch 2 times, most recently from dde6312 to 22402d6 Compare December 22, 2016 08:45
or purely non-negative, integers. Previously, handling these integers would
result in improper rounding or data-type casting, leading to incorrect results.
One notable place where this improved was in ``DataFrame`` creation (:issue:`14917`):

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you need this example. mainly wanted to put the issues together (top section is good though).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough.


- Bug in converting object elements of array-like objects to unsigned 64-bit integers (:issue:`4471`)
- Bug in ``Series.unique()`` in which unsigned 64-bit integers were causing overflow (:issue:`14721`)
- New ``UInt64Index`` (subclass of ``NumericIndex``) for specifically indexing unsigned integers (:issue:`14935`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so an example using the index might be good (e.g. using indexing)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. Done.

@jreback
Copy link
Contributor

jreback commented Dec 22, 2016

so need testing / fixes for

  • assignment to DataFrame
  • set/reset index
  • indexing (getting and setting), add a new test class

@gfyoung gfyoung force-pushed the create-uint64-index branch from 22402d6 to 4f59a12 Compare December 22, 2016 15:48
@gfyoung
Copy link
Member Author

gfyoung commented Dec 22, 2016

Assignment to DataFrame: frame/test_alter_axes.py ?
Add a new test class : indexes/test_numeric.py ?

Set / reset index ? (not sure what you mean)
Getting / setting ? (not sure what you mean)

@gfyoung gfyoung force-pushed the create-uint64-index branch from 4f59a12 to 3a6508d Compare December 22, 2016 17:04
@jreback
Copy link
Contributor

jreback commented Dec 23, 2016

Set / reset index ? (not sure what you mean)
df.set_index('uint64_col').reset_index()

Getting / setting ? (not sure what you mean)
df2 = df.set_index('uint64_col')
df2.loc[df.index[0]] (and =)

basically need a pandas/tests/indexing/test_uint64.py

@gfyoung
Copy link
Member Author

gfyoung commented Dec 23, 2016

Would appending uint to the battery of tests in tests/indexing/test_indexing.py suffice then?

@jreback
Copy link
Contributor

jreback commented Dec 23, 2016

@gfyoung yes adding uint to lots of tests would definitly help.

@gfyoung gfyoung force-pushed the create-uint64-index branch 2 times, most recently from cd9c6eb to 236fcf8 Compare December 24, 2016 22:03
jreback pushed a commit that referenced this pull request Dec 24, 2016
Adds `uint64` ranking functions to `algos.pyx` to allow for proper
ranking with `uint64`.    Also introduces partial patch for
`factorize()` by adding `uint64` hashtables and vectors for  usage.
However, this patch is only partial because the larger bug of non-
support for `uint64`  in `Index` has not been fixed (**UPDATE**:
tackled in #14937):  ~~~python  >>> from pandas import Index, np  >>>
Index(np.array([2**63], dtype=np.uint64))
Int64Index([-9223372036854775808], dtype='int64')  ~~~    Also patches
a bug in `UInt64HashTable` from #14915 that had an erroneous null
condition that was caught during testing and was hence removed.

Author: gfyoung <[email protected]>

Closes #14935 from gfyoung/core-algorithms-uint64-two and squashes the following commits:

2598cea [gfyoung] BUG: Patch rank() uint64 behavior
@gfyoung
Copy link
Member Author

gfyoung commented Jan 12, 2017

@jorisvandenbossche , @jreback : See my comment above. This might be indicative of a setup.py issue since we have to do a git clean before installing. It isn't breaking Travis, but this is a little disconcerting admittedly.

@jreback
Copy link
Contributor

jreback commented Jan 13, 2017

need

'depends': _pxi_dep['index']

here `https://github.com/pandas-dev/pandas/blob/master/setup.py#L493

@gfyoung gfyoung force-pushed the create-uint64-index branch 2 times, most recently from 0aaca35 to 452f56d Compare January 14, 2017 10:04
@@ -490,7 +490,8 @@ def pxd(name):
index={'pyxfile': 'index',
'sources': ['pandas/src/datetime/np_datetime.c',
'pandas/src/datetime/np_datetime_strings.c'],
'pxdfiles': ['src/util']},
'pxdfiles': ['src/util'],
'depends': _pxi_dep['index'] + _pxi_dep['algos']},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well this is dependent on pandas.algos, but a change in algos.pyx doesn't actually require re-compilation of index.pyx I don't think.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, good point. Fixed.

@gfyoung gfyoung force-pushed the create-uint64-index branch from 452f56d to 488dfd7 Compare January 14, 2017 19:49
@gfyoung
Copy link
Member Author

gfyoung commented Jan 14, 2017

@jreback , @jorisvandenbossche : Added the dependency in setup.py as requested. The installation should work. If it does, then this should be ready to merge then.

@jorisvandenbossche
Copy link
Member

Was playing a bit with it, and got into this:

In [15]: idx = pd.UInt64Index([1,2,3,4])

In [16]: s = pd.Series(range(4), index=idx)

In [17]: s.append(pd.Series([1,2], index=[-1,8])).index
Out[17]: Float64Index([1.0, 2.0, 3.0, 4.0, -1.0, 8.0], dtype='float64')

@shoyer
Copy link
Member

shoyer commented Jan 15, 2017

Was playing a bit with it, and got into this:

In [17]: s.append(pd.Series([1,2], index=[-1,8])).index
Out[17]: Float64Index([1.0, 2.0, 3.0, 4.0, -1.0, 8.0], dtype='float64')

This looks like NumPy's default type promotion rules for combining int64 and uint64:

In [33]: np.concatenate([np.array([1, 2], dtype=np.int64), np.array([3], dtype=np.uint64)])
Out[33]: array([ 1.,  2.,  3.])

@gfyoung
Copy link
Member Author

gfyoung commented Jan 15, 2017

@shoyer , @jorisvandenbossche : At first glance, I would agree with that diagnosis. Combining int64 and uint64 is "dangerous" from that perspective unfortunately (and out of our control). For now, I would ensure that uint64-only interactions result in uint64.

@jorisvandenbossche
Copy link
Member

Ah yes, of course, overlooked that. Naively thought for a moment this fitted in int64 and so should be that. But fully agree that we should not check for this and just follow the rules.

@jreback
Copy link
Contributor

jreback commented Jan 15, 2017

you whatsnew example is a little odd FYI, maybe expand a touch, and here's a bug

In [42]:    idx = pd.UInt64Index([1, 2, 3], name='foo')
    ...:    df = pd.DataFrame({'A' : ['a', 'b', 'c'], 'B' : range(3), 'C' : np.array([3,4, 5],dtype='uint64')}, index=idx)
    ...:    
    ...: 

In [43]: df2 = df.reset_index()

In [44]: df2
Out[44]: 
   foo  A  B  C
0    1  a  0  3
1    2  b  1  4
2    3  c  2  5

In [45]: df2.dtypes
Out[45]: 
foo    uint64
A      object
B       int64
C      uint64
dtype: object

In [46]: df2.iloc[1, 0] = np.nan

In [47]: df2
Out[47]: 
   foo  A  B  C
0  1.0  a  0  3
1  NaN  b  1  4
2  3.0  c  2  5

In [48]: df2.dtypes
Out[48]: 
foo    float64
A       object
B        int64
C       uint64
dtype: object

In [49]: df2.foo.fillna(-1)
Out[49]: 
0    1.0
1   -1.0
2    3.0
Name: foo, dtype: float64

In [50]: df2.foo.fillna(-1).astype('uint64')
Out[50]: 
0                       1
1    18446744073709551615
2                       3
Name: foo, dtype: uint64

[50] is wrong

@gfyoung
Copy link
Member Author

gfyoung commented Jan 15, 2017

@jreback :

  1. What is odd about my example, and how should I expand it if it is lacking?

  2. What is the expected behavior for [50] in your opinion, as I have a similar question in this example:

>>> i = Index([1.1, 2, 3])
>>> i.astype('int64')  # Just plain INT64
0    1
1    2
2    3
dtype: int64

@jreback
Copy link
Contributor

jreback commented Jan 15, 2017

What is odd about my example, and how should I expand it if it is lacking?

I find not naming columns as odd (in examples).

In [37]: pd.DataFrame({'A' : ['a', 'b', 'c']}, index=range(3))
Out[37]: 
   A
0  a
1  b
2  c

@jreback
Copy link
Contributor

jreback commented Jan 15, 2017

What is the expected behavior for [50] in your opinion, as I have a similar question in this example:

I would raise, that is an invalid conversion if negative values or nan's are present

In [39]: Series([np.nan, 2, 3]).astype('int64')
ValueError: Cannot convert non-finite values (NA or inf) to integer

In fact I make sure to test lots of astyping (similar to what we do with ints)

@gfyoung
Copy link
Member Author

gfyoung commented Jan 15, 2017

@jreback : The point of my example in my response was to illustrate that astype is bugged far beyond anything that I could have added with UInt64Index. I think it is better to save patching that for another time, as the bug extends beyond uint64. I will update the whatsnew though.

@gfyoung gfyoung force-pushed the create-uint64-index branch from 488dfd7 to 8ab6fbd Compare January 16, 2017 01:27
@gfyoung
Copy link
Member Author

gfyoung commented Jan 16, 2017

@jreback : Updated whatsnew, and everything is still green.

@jreback jreback closed this in 362e78d Jan 17, 2017
@jreback jreback added this to the 0.20.0 milestone Jan 17, 2017
@jreback
Copy link
Contributor

jreback commented Jan 17, 2017

thanks @gfyoung !

this is great.

if you can canvas any issues that are referenced and make sure that they are closed? (ping if not).

@jreback
Copy link
Contributor

jreback commented Jan 17, 2017

https://travis-ci.org/pandas-dev/pandas/jobs/192687204

wonder if this is a cython cache issue.

@jreback
Copy link
Contributor

jreback commented Jan 17, 2017

https://travis-ci.org/pandas-dev/pandas/jobs/192687206#L523 in particular is suspicous

@jreback
Copy link
Contributor

jreback commented Jan 17, 2017

ok I cleared the cache on master it is built now: https://travis-ci.org/pandas-dev/pandas/builds/192687200

weird, maybe there is persistence of the .pxi files somehow so if the .pyx are not changed (e.g. algos.pyx or index.pyx) as well, then the cython cache thinks they are not rebuild.

@gfyoung
Copy link
Member Author

gfyoung commented Jan 18, 2017

@jreback : Awesome that doc-building issue got fixed on Travis. I haven't found any related issues yet, but I'll pick them / resolve them eventually as I've done with the read_csv issues 😄

@gfyoung gfyoung deleted the create-uint64-index branch January 18, 2017 05:10
AnkurDedania pushed a commit to AnkurDedania/pandas that referenced this pull request Mar 21, 2017
1) Introduces and propagates `UInt64Index`, an index specifically for
`uint`.  xref pandas-dev#14935    2) <strike> Patches bug from pandas-dev#14916 that makes
`maybe_convert_objects` robust against the known `numpy` bug that
`uint64` cannot be compared to `int64`.  This bug was caught during
testing of `UInt64Index`. </strike>    **UPDATE**: Patched in pandas-dev#14951

Author: gfyoung <[email protected]>

Closes pandas-dev#14937 from gfyoung/create-uint64-index and squashes the following commits:

8ab6fbd [gfyoung] ENH: Create and propagate UInt64Index
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Dtype Conversions Unexpected or buggy dtype conversions Enhancement Indexing Related to indexing on series/frames, not to indexes themselves
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants