ENH: indexing support for reversed is_monotonic #7860


Closed
hughesadam87 opened this issue Jul 28, 2014 · 17 comments · Fixed by #8680
Labels
Enhancement Indexing Related to indexing on series/frames, not to indexes themselves Usage Question

@hughesadam87

Conceptually this is not hard: you can inspect the slice to detect the case (e.g. if start >= end or start > last_endpoint, you can use a reversed is_monotonic check). To avoid a performance hit, the slice search would need an is_monotonic_decreasing check here and would then just reverse the searching operations: https://github.com/pydata/pandas/blob/master/pandas/core/index.py#L1764

Hello,

I am working with spectral data, which for various spectral units such as
wavenumber, is often presented with decreasing spectral values along the
index. For example:

http://www.chemguide.co.uk/analysis/ir/irpropanone.GIF

In my dataframe, the index is stored in descending order (e.g. 500, 499,
498... 2, 1); however, when I try to slice it using .ix[], the slice
fails with a long KeyError traceback.

Likewise, df.plot() sorts the x values from low to high, so I need to
reverse the plot axis after the fact. Not really a big deal, but wondered
if there's a better workaround.

Any suggestions?

Note: This behavior works fine for int64 index:

# Create a dataframe and reverse its index
import numpy as np
from pandas import DataFrame

x = DataFrame(np.random.randn(50,50))
x.index = x.index[::-1]

# Slice from label 30 down to label 10
x.ix[30:10, ::]

But it fails for a float index:

x = DataFrame(np.random.randn(50,50),
              index=np.linspace(0,50))
x.index = x.index[::-1]

x.ix[30.0:10.0, ::]

With error:

 ---------------------------------------------------------------------------
 KeyError                                  Traceback (most recent call last)
 <ipython-input-68-1af3b9a79d3d> in <module>()
      11 x.index = x.index[::-1]
      12 
 ---> 13 x.ix[30.0:10.0, ::]
      14 

 /home/glue/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/pandas/core/indexing.pyc in __getitem__(self, key)
      67                 pass
      68 
 ---> 69             return self._getitem_tuple(key)
      70         else:
      71             return self._getitem_axis(key, axis=0)

 /home/glue/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/pandas/core/indexing.pyc in _getitem_tuple(self, tup)
     673                 continue
     674 
 --> 675             retval = getattr(retval, self.name)._getitem_axis(key, axis=i)
     676 
     677         return retval

 /home/glue/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/pandas/core/indexing.pyc in _getitem_axis(self, key, axis, validate_iterable)
     859         labels = self.obj._get_axis(axis)
     860         if isinstance(key, slice):
 --> 861             return self._get_slice_axis(key, axis=axis)
     862         elif _is_list_like(key) and not (isinstance(key, tuple) and
     863                                          isinstance(labels, MultiIndex)):

 /home/glue/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/pandas/core/indexing.pyc in _get_slice_axis(self, slice_obj, axis)
    1106         if not _need_slice(slice_obj):
    1107             return obj
 -> 1108         indexer = self._convert_slice_indexer(slice_obj, axis)
    1109 
    1110         if isinstance(indexer, slice):

 /home/glue/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/pandas/core/indexing.pyc in _convert_slice_indexer(self, key, axis)
     161         # if we are accessing via lowered dim, use the last dim
     162         ax = self.obj._get_axis(min(axis, self.ndim - 1))
 --> 163         return ax._convert_slice_indexer(key, typ=self.name)
     164 
     165     def _has_valid_setitem_indexer(self, indexer):

 /home/glue/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/pandas/core/index.pyc in _convert_slice_indexer(self, key, typ)
    2027 
    2028         # translate to locations
 -> 2029         return self.slice_indexer(key.start, key.stop, key.step)
    2030 
    2031     def get_value(self, series, key):

 /home/glue/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/pandas/core/index.pyc in slice_indexer(self, start, end, step)
    1704         This function assumes that the data is sorted, so use at your own peril
    1705         """
 -> 1706         start_slice, end_slice = self.slice_locs(start, end)
    1707 
    1708         # return a slice

 /home/glue/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/pandas/core/index.pyc in slice_locs(self, start, end)
    1777 
    1778         start_slice = _get_slice(0, offset=0, search_side='left',
 -> 1779                                  slice_property='start', search_value=start)
    1780         end_slice = _get_slice(len(self), offset=1, search_side='right',
    1781                                slice_property='stop', search_value=end)

 /home/glue/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/pandas/core/index.pyc in _get_slice(starting_value, offset, search_side, slice_property, search_value)
    1746 
    1747             try:
 -> 1748                 slc = self.get_loc(search_value)
    1749 
    1750                 if not is_unique:

 /home/glue/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/pandas/core/index.pyc in get_loc(self, key)
    2091         except (TypeError, NotImplementedError):
    2092             pass
 -> 2093         return super(Float64Index, self).get_loc(key)
    2094 
    2095     @property

 /home/glue/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/pandas/core/index.pyc in get_loc(self, key)
    1179         loc : int if unique index, possibly slice or mask if not
    1180         """
 -> 1181         return self._engine.get_loc(_values_from_object(key))
    1182 
    1183     def get_value(self, series, key):

 /home/glue/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/pandas/index.so in pandas.index.IndexEngine.get_loc (pandas/index.c:3354)()

 /home/glue/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/pandas/index.so in pandas.index.IndexEngine.get_loc (pandas/index.c:3234)()

 /home/glue/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/pandas/hashtable.so in pandas.hashtable.Float64HashTable.get_item (pandas/hashtable.c:9018)()

 /home/glue/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/pandas/hashtable.so in pandas.hashtable.Float64HashTable.get_item (pandas/hashtable.c:8962)()

 KeyError: 30.0
@jreback
Contributor

jreback commented Jul 28, 2014

Always include the output of pd.show_versions().

You can use:

x[(x.index>10.0)&(x.index<30.0)]

It's not clear what:

x.ix[30.0:10.0,:] would actually mean for a reversed index, as neither point is in the index. I suppose it could mean the above, but I would have to think about that.

For an integer index it's clear, because the end-points are included.
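The boolean-mask workaround above can be sketched end to end as follows. This is a minimal illustration written against a current pandas, not the 0.14.1 setup from the report; the variable names `mask` and `subset` are only illustrative.

```python
import numpy as np
import pandas as pd

# Descending float index like the one in the report:
# 50 evenly spaced labels running from 50.0 down to 0.0.
index = np.linspace(0, 50)[::-1]
x = pd.DataFrame(np.random.randn(50, 50), index=index)

# Neither 30.0 nor 10.0 is an exact label in this index, so an exact
# label lookup fails. A value comparison does not need the end points
# to exist and does not care which direction the index runs.
mask = (x.index > 10.0) & (x.index < 30.0)
subset = x[mask]

print(subset.index[[0, -1]])  # first and last selected labels
```

The rows keep their original (descending) order, since the mask only filters and never sorts.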

@cpcloud
@jorisvandenbossche

@hughesadam87
Author

Sorry, here's show_versions():

In [4]: pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.6.final.0
python-bits: 64
OS: Linux
OS-release: 2.6.32-62-generic
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: en_US.utf8

pandas: 0.14.1
nose: 1.3.0
Cython: None
numpy: 1.8.0
scipy: 0.14.0
statsmodels: 0.5.0
IPython: 3.0.0-dev
sphinx: None
patsy: 0.2.1
scikits.timeseries: None
dateutil: 2.2
pytz: 2014.4
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.3.1
openpyxl: 2.0.3
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
rpy2: None
sqlalchemy: None
pymysql: None

Thanks for the solution. I'll use it for sure, but the ix behavior should work, right? This is such a common index type in spectral data that I'd hate to require a separate slice call for this use case. Although, if it's not likely to be changed in the future, I could probably just add my own slice functions that bury this under the hood, unbeknownst to users. What do you recommend?

@jreback
Contributor

jreback commented Jul 28, 2014

Please review the docs here as well: http://pandas.pydata.org/pandas-docs/stable/indexing.html#float64index

I think this was not implemented because it's not 'cheap', in the sense that it would work if you knew that the index was monotonic but reversed (in other words, we would need both an is_monotonic_increasing and an is_monotonic_decreasing check, and could then just reverse the searching operators).
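A rough sketch of what "reversing the searching operators" could look like is given below. It assumes a current pandas, where `Index.is_monotonic_increasing` and `Index.is_monotonic_decreasing` exist; `slice_bounds` is a hypothetical helper written for this illustration, not a pandas function, and it searches a descending index by negating the values so that a plain binary search applies.

```python
import numpy as np
import pandas as pd

def slice_bounds(index, start, stop):
    """Hypothetical helper: positions [lo, hi) of the labels lying
    between start and stop (inclusive), for either sort direction."""
    values = np.asarray(index, dtype=float)
    if index.is_monotonic_increasing:
        lo = np.searchsorted(values, start, side="left")
        hi = np.searchsorted(values, stop, side="right")
    elif index.is_monotonic_decreasing:
        # Negating a descending array gives an ascending one, so the
        # same binary search works on the mirrored values.
        lo = np.searchsorted(-values, -start, side="left")
        hi = np.searchsorted(-values, -stop, side="right")
    else:
        raise KeyError("index is not monotonic")
    return int(lo), int(hi)

idx = pd.Index(np.linspace(0, 50)[::-1])   # 50.0 down to 0.0
lo, hi = slice_bounds(idx, 30.0, 10.0)
print(lo, hi, idx[lo], idx[hi - 1])
```

Because the result is a pair of positions, it can be handed to `.iloc[lo:hi]`, and the end points do not have to be exact labels.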

@hughesadam87
Author

Well, that makes sense, thanks. I'll either make my own slice wrapper, or
raise a warning to users if they try to slice reversed index data.


Adam Hughes
Physics Ph.D Candidate
George Washington University

@jreback
Contributor

jreback commented Jul 28, 2014

We'll put it on the enhancement list. If you are interested in implementing it, step up!

@jreback added this to the 0.15.1 milestone on Jul 28, 2014
@jreback changed the title from "Support for dataframe index of decreasing float values" to "ENH: indexing support for reversed is_monotonic" on Jul 28, 2014
@hughesadam87
Author

Alright, thanks. I would take a crack, but really feel like I don't know the pandas code base well enough to guarantee my solution will do more good than harm.

@shoyer
Member

shoyer commented Oct 30, 2014

I'm considering taking a crack at this, but there's one edge case I would like to clarify first. In particular: how do we want to handle slices with mismatched ordering, e.g., x.loc[10:30] for a descending index or x.loc[30:10] for an ascending index?

Keeping track of whether an index is descending or ascending is one of those details that's nice to keep track of for the user, so it would be nice if these "just work" by switching start/stop in these cases. It seems like this would be handy for cases where the index is generally monotonic but can go either direction, e.g., as is the case for a number of physical variables.

Can anyone think of unfortunate consequences to this sort of interchanging?

@jreback
Contributor

jreback commented Oct 30, 2014

@shoyer you can add to is_monotonic_float64 et al. in generated.pyx and just return, say, -1 if the index is monotonic decreasing; then it would 'keep track' internally (just as is_monotonic does now, but only for increasing).

Then I think you could easily just swap the start and stop in those cases.
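The "return -1 for monotonic decreasing" idea can be sketched in plain Python. This is only an illustration of the convention suggested above, not the Cython template in generated.pyx; `monotonic_direction` and its 1 / -1 / 0 return values are assumptions made for the example.

```python
def monotonic_direction(values):
    """Hypothetical sketch: 1 if non-decreasing, -1 if non-increasing,
    0 if neither. Stops scanning as soon as both are ruled out."""
    values = list(values)
    increasing = decreasing = True
    for prev, cur in zip(values, values[1:]):
        if cur < prev:
            increasing = False
        elif cur > prev:
            decreasing = False
        if not (increasing or decreasing):
            return 0  # early exit: no need to inspect the rest
    # At least one flag is still True here; constant or empty input
    # counts as increasing under this convention.
    return 1 if increasing else -1

print(monotonic_direction([500, 499, 498, 2, 1]))
```

A single pass tracks both directions at once, so the cost stays the same as checking for increasing order alone.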

@shoyer
Member

shoyer commented Oct 30, 2014

@jreback Excellent, I'll take a look. I'd like this to work for Int64Index, too, for the sake of consistency, although the typical case is floating point data.

@jreback
Contributor

jreback commented Oct 30, 2014

You can do it for all types; just change the template.

@hughesadam87
Author

I just wanted to point out that I did use @jreback's suggestion for the boolean expression and just put that into my getitem() indexer calls somewhere, and haven't encountered any problems since. This is probably a hacky solution, but for my use case it works fine.

Can I ask how monotonicity is determined? Are all values inspected, or just the start and final? And does is_monotonic_float64 already exist, or is this what is being proposed to be put in? It would help me if I had access to this attribute as well for when we do plotting. In fact, that might be an issue to consider. Matplotlib will try to plot from low to high values I believe, and I had to actually reverse the xlimits on calls to df.plot(). Unless my memory is mixed up...

@shoyer
Member

shoyer commented Oct 30, 2014

Here is where is_monotonic is defined: https://github.com/pydata/pandas/blob/c7bfb4e16411516ca9108af95013bc3400ba38ad/pandas/src/generate_code.py#L542

This should be an easy fix to extend to identify descending indexes. It does indeed check all values (when necessary).

The advantage of using slice syntax is that it uses numpy.ndarray views instead of making copies, so it's much faster. Also, various scientific file formats (e.g., netCDF, HDF5, OpenDAP) support reading slices directly but are much slower or have more limited support for array indexing. The latter will be handy for xray, and it will get that for free when I add this to pandas.
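The view-versus-copy difference can be checked directly in NumPy. The snippet below is a small illustration on a made-up descending array, using `np.shares_memory` to show which result still points at the original buffer.

```python
import numpy as np

data = np.arange(10.0)[::-1]              # descending values 9.0 ... 0.0

by_slice = data[2:6]                      # basic slicing: a view
by_mask = data[(data > 3) & (data < 8)]   # boolean indexing: a copy

# Both select the same values, but only the slice shares memory
# with the original array.
print(np.shares_memory(data, by_slice))
print(np.shares_memory(data, by_mask))
```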

@immerrr
Contributor

immerrr commented Oct 30, 2014

how do we want to handle slices with mismatched ordering, e.g., x.loc[10:30] for a descending index or x.loc[30:10] for an ascending index.

I'd expect the first example to work basically as

x.iloc[x.index.searchsorted(10): x.index.searchsorted(30, side='right') + 1]

with an obvious optimization potential of doing x.index.get_loc(N) if N in x.index. I'm not a fan of a slice operation counting down with a positive step value.

@shoyer
Member

shoyer commented Oct 30, 2014

I'm not a fan of a slice operation counting down with a positive step value.

The logic would go like this: if the integer indexers have start > stop and the step is not negative (i.e., the indexed object would have size 0), then swap the label indexers start and stop.

(This would probably end up in Index.slice_indexer, since it needs to know the step.)

I doubt there are many cases where users are relying on slicing returning a size 0 object due to enforcement of this ordering, but I could certainly be wrong.
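The swap rule described above can be sketched for the simplest case: an ascending index and a positive step. The snippet uses a plain sorted list with the standard-library `bisect` module instead of pandas internals, and `label_slice` is a hypothetical name used only here.

```python
from bisect import bisect_left, bisect_right

def label_slice(labels, start, stop):
    """Hypothetical sketch for an ascending list of labels and a
    positive step: swap the label bounds if the slice would be empty
    only because they were given in the wrong order."""
    lo = bisect_left(labels, start)
    hi = bisect_right(labels, stop)
    if lo > hi:
        # start lands after stop, so labels[lo:hi] would have size 0;
        # retry with the label bounds interchanged.
        lo = bisect_left(labels, stop)
        hi = bisect_right(labels, start)
    return slice(lo, hi)

labels = [0.0, 10.0, 20.0, 30.0, 40.0]
print(labels[label_slice(labels, 30.0, 10.0)])
print(labels[label_slice(labels, 10.0, 30.0)])
```

Both calls select the same labels. Negative steps and descending indexes are deliberately left out of this sketch.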

@shoyer
Member

shoyer commented Oct 30, 2014

Hmm... this could get pretty complex/unpredictable depending on whether step is positive or negative. Maybe better to avoid this for now.

@immerrr
Contributor

immerrr commented Oct 30, 2014

I doubt there are many cases where users are relying on slicing returning a size 0 object due to enforcement of this ordering, but I could certainly be wrong.

It may be me, but I see slicing as selecting values by position between lbound and ubound, with pandas being so kind to enable me writing bounds as labels rather than actual positions.

If, on the other hand, you need all values between lbound and ubound value-wise, you should either write the condition (x >= lbound) & (x <= ubound) or convince the team to add (or do it yourself) an Index.between(lower, upper) method to solve the unwieldiness of the double comparison (I always wondered how come there's an indexer_between_time for time-of-day comparisons, but no such thing for the rest of the value types). As a bonus, this method would work regardless of monotonicity.
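A value-wise selection of this kind can be sketched as a free function. `index_between` below is a hypothetical stand-in for the proposed Index.between, not an existing pandas method, and the inclusive bounds are an assumption taken from the condition quoted above.

```python
import numpy as np
import pandas as pd

def index_between(index, lower, upper):
    """Hypothetical helper, not a pandas API: boolean mask of the
    labels with lower <= label <= upper, in any index order."""
    values = np.asarray(index)
    return (values >= lower) & (values <= upper)

# Deliberately unsorted index: the mask does not rely on monotonicity.
s = pd.Series(range(6), index=[5.0, 1.0, 4.0, 2.0, 3.0, 0.0])
selected = s[index_between(s.index, 1.5, 4.0)]
print(selected)
```

The selected rows keep their original order, which is the trade-off against slicing: the mask works on any index, but it builds a copy instead of a view.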

@shoyer
Member

shoyer commented Oct 30, 2014

@immerrr OK, I think I am convinced. +1 for the idea of Index.between.
