Skip to content

Slicing with reversed datetime index #14316

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
boncelet opened this issue Sep 28, 2016 · 9 comments
Closed

Slicing with reversed datetime index #14316

boncelet opened this issue Sep 28, 2016 · 9 comments
Labels
Bug Datetime Datetime data dtype
Milestone

Comments

@boncelet
Copy link

I'm working with a time series data file from EIA.gov. It's a CSV file with the dates in reverse order (most recent first). Slicing on the dates wasn't working right. I believe these snippets illustrate the problem:

Create dataframe and date reversed dataframe.

>>> import pandas as pd
>>> dates = pd.date_range('1/1/2016',periods = 5)
>>> df = pd.DataFrame([1,2,3,4,5], index=dates, columns=['Col'])
>>> rdf = df.sort_index(axis=0, ascending=False)

First the normal order: Slice on dates, getting what we expect:

>>> print(df['2016-01-01':'2016-01-03'])
            Col
2016-01-01    1
2016-01-02    2
2016-01-03    3

Now the reverse order: I expect to get exactly the same answer (same slice on the same dataset), perhaps with rows reversed, but don't:

>>> print(rdf['2016-01-01':'2016-01-03'])
Empty DataFrame
Columns: [Col]
Index: []

Perhaps if I reverse the slice. Note only two rows are returned:

>>> print(rdf['2016-01-03':'2016-01-01'])
            Col
2016-01-03    3
2016-01-02    2

This makes sense, but it's still confusing:

>>> print(df[1:3]), print(rdf[1:3])
            Col
2016-01-02    2
2016-01-03    3
            Col
2016-01-04    4
2016-01-03    3

I could sort (or reverse) my index, but I had no reason to believe I needed to. Furthermore, other operations, such as plotting, work as expected (the time axis is correct).

Output of pd.show_versions()

## INSTALLED VERSIONS

commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 23 Stepping 6, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.18.1
nose: 1.3.7
pip: 8.1.2
setuptools: 23.0.0
Cython: 0.24
numpy: 1.11.1
scipy: 0.17.1
statsmodels: 0.6.1
xarray: None
IPython: 4.2.0
sphinx: 1.3.1
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: 1.1.0
tables: 3.2.2
numexpr: 2.6.0
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.2
lxml: 3.6.0
bs4: 4.4.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.13
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.40.0
pandas_datareader: None

@boncelet
Copy link
Author

Sorry to comment on my own issue, but here are two more examples. I get that the rows are different orders, but shouldn't both return the same number of rows?

print(rdf['2016-01-01':'2016-01-03':-1])
            Col
2016-01-02    2
2016-01-03    3
print(df['2016-01-03':'2016-01-01':-1])
            Col
2016-01-03    3
2016-01-02    2
2016-01-01    1

@shoyer
Copy link
Member

shoyer commented Sep 28, 2016

Now the reverse order: I expect to get exactly the same answer (same slice on the same dataset), perhaps with rows reversed, but don't:

>> print(rdf['2016-01-01':'2016-01-03'])
Empty DataFrame
Columns: [Col]
Index: []

This part at least is working as intended. Slicing a descending order index requires a descending slice. This mirrors how slicing an ascending order index requires an ascending slice.

Datetime indexing with strings involves some implicit rounding to match all appropriate times. It's possible that the logic is not quite right for reversed indexes.

@boncelet
Copy link
Author

The reason I said " I expect to get exactly the same answer (same slice on the same dataset)" is that I expected a slice to behave like a query on a database: give me all rows whose dates are in the range (inclusive). Clearly, the implementation will be much faster if the index is ordered (internally), but the query (slice) should still work as "expected" even if not.

@shoyer
Copy link
Member

shoyer commented Sep 28, 2016

I expected a slice to behave like a query on a database: give me all rows whose dates are in the range (inclusive).

This is a reasonable intuition, but it's the wrong mental model for slicing in pandas. A better model is: give me the continuous slice of the data enclosed by these bounds. Slicing in pandas doesn't work on indexes that are not either monotonic increasing or monotonic decreasing.

@chris-b1
Copy link
Contributor

Should label slicing with a step maybe just raise? I might be missing a valid usecase, but I'm not even sure if these are bugs, because I don't really know what to expect.

In [8]: rdf['2016-01-01':'2016-01-03':-1]
Out[8]: 
            Col
2016-01-02    2
2016-01-03    3

In [9]: df['2016-01-03':'2016-01-01':-1]
Out[9]: 
            Col
2016-01-03    3
2016-01-02    2
2016-01-01    1

@chris-b1
Copy link
Contributor

I guess there are tests here covering this behavior, so it is well defined, if a little strange.

https://github.com/pydata/pandas/blob/master/pandas/tests/indexing/test_indexing.py#L5140

@chris-b1
Copy link
Contributor

@boncelet - of these cases, the only one that looks buggy to me is this one - it should include the ending label. The others seem to just be symptoms of slicing semantics like @shoyer mentioned. Thanks for the report.

In [22]: rdf.loc['2016-01-03':'2016-01-01', :]
Out[22]: 
            Col
2016-01-03    3
2016-01-02    2

@shoyer shoyer added Bug Datetime Datetime data dtype labels Sep 28, 2016
@shoyer
Copy link
Member

shoyer commented Sep 28, 2016

I think the issue is in _maybe_cast_slice_bound. It needs to flip side for monotonic decreasing, like _searchsorted_monotonic

@boncelet
Copy link
Author

Thanks. Part of my concern is it wasn't obvious to me I needed to worry about the order of the data in the original CSV file. Same data, whether it's sorted first to last or last to first.

It turns out I was wrong about plots. At least some plots give the x-axis (time) reversed. Normally this is not what we want.

Unfortunately, this slicing (and plotting) behavior means user code has to check the order of the data and sort it if necessary. For modest datasets (like those I'm looking at) no problem. But I can imagine huge datasets where sorting will be a burden.

It might be nice to have a user settable flag to reverse the index without actually reversing the index, analogously to a numpy stride (which can be negative). That way we can assume the index is increasing even if internally it is decreasing. It'd be even nicer if the various dataframe creators set the flag automatically, thereby relieving the user of the necessity to check the order.

Sincerely, thanks to all you developers who've worked so hard to create useful tools like pandas.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Datetime Datetime data dtype
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants