-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
Slicing with reversed datetime index #14316
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Sorry to comment on my own issue, but here are two more examples. I get that the rows are different orders, but shouldn't both return the same number of rows? print(rdf['2016-01-01':'2016-01-03':-1])
Col
2016-01-02 2
2016-01-03 3 print(df['2016-01-03':'2016-01-01':-1])
Col
2016-01-03 3
2016-01-02 2
2016-01-01 1 |
This part at least is working as intended. Slicing a descending order index requires a descending slice. This mirrors how slicing an ascending order index requires an ascending slice. Datetime indexing with strings involves some implicit rounding to match all appropriate times. It's possible that the logic is not quite right for reversed indexes. |
The reason I said " I expect to get exactly the same answer (same slice on the same dataset)" is that I expected a slice to behave like a query on a database: give me all rows whose dates are in the range (inclusive). Clearly, the implementation will be much faster if the index is ordered (internally), but the query (slice) should still work as "expected" even if not. |
This is a reasonable intuition, but it's the wrong mental model for slicing in pandas. A better model is: give me the continuous slice of the data enclosed by these bounds. Slicing in pandas doesn't work on indexes that are not either monotonic increasing or monotonic decreasing. |
Should label slicing with a step maybe just raise? I might be missing a valid usecase, but I'm not even sure if these are bugs, because I don't really know what to expect.
|
I guess there are tests here covering this behavior, so it is well defined, if a little strange. https://github.com/pydata/pandas/blob/master/pandas/tests/indexing/test_indexing.py#L5140 |
@boncelet - of these cases, the only one that looks buggy to me is this one - it should include the ending label. The others seem to just be symptoms of slicing semantics like @shoyer mentioned. Thanks for the report.
|
I think the issue is in |
Thanks. Part of my concern is it wasn't obvious to me I needed to worry about the order of the data in the original CSV file. Same data, whether it's sorted first to last or last to first. It turns out I was wrong about plots. At least some plots give the x-axis (time) reversed. Normally this is not what we want. Unfortunately, this slicing (and plotting) behavior means user code has to check the order of the data and sort it if necessary. For modest datasets (like those I'm looking at) no problem. But I can imagine huge datasets where sorting will be a burden. It might be nice to have a user settable flag to reverse the index without actually reversing the index, analogously to a numpy stride (which can be negative). That way we can assume the index is increasing even if internally it is decreasing. It'd be even nicer if the various dataframe creators set the flag automatically, thereby relieving the user of the necessity to check the order. Sincerely, thanks to all you developers who've worked so hard to create useful tools like pandas. |
I'm working with a time series data file from EIA.gov. It's a CSV file with the dates in reverse order (most recent first). Slicing on the dates wasn't working right. I believe these snippets illustrate the problem:
Create dataframe and date reversed dataframe.
First the normal order: Slice on dates, getting what we expect:
Now the reverse order: I expect to get exactly the same answer (same slice on the same dataset), perhaps with rows reversed, but don't:
Perhaps if I reverse the slice. Note only two rows are returned:
This makes sense, but it's still confusing:
I could sort (or reverse) my index, but I had no reason to believe I needed to. Furthermore, other operations, such as plotting, work as expected (the time axis is correct).
Output of
pd.show_versions()
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 23 Stepping 6, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
pandas: 0.18.1
nose: 1.3.7
pip: 8.1.2
setuptools: 23.0.0
Cython: 0.24
numpy: 1.11.1
scipy: 0.17.1
statsmodels: 0.6.1
xarray: None
IPython: 4.2.0
sphinx: 1.3.1
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: 1.1.0
tables: 3.2.2
numexpr: 2.6.0
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.2
lxml: 3.6.0
bs4: 4.4.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.13
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.40.0
pandas_datareader: None
The text was updated successfully, but these errors were encountered: