Inconsistent behavior of DatetimeIndex Partial String Indexing on Series and DataFrames #14826
The first "series vs dataframe" issue is as expected / follows from the second issue. But you are correct that there might be an inconsistency in determining whether the string is a single key or a slice between the regular and irregular datetimeindex. The problem is that it is very difficult for pandas to guess/determine this when the index has no frequency.
For the first part, I completely agree. I forgot that
For the second part, is it true that for irregular indexes any string is considered to be a slice, and for regular ones only those strings that provide a date-time specification with precision less than the frequency are considered to be slices? Why is it not possible to consider a string a key (not a slice) if it is cast to a date-time that is exactly the same as one of the keys in the index, for irregular indexes?
If you have ideas to rephrase this to make it clearer, very welcome!
I am not exactly sure how it is implemented in the code, but imagine the following case: you have a timeseries with index ["2016-01-01 00:00", "2016-01-01 12:00", "2016-01-01 23:00", "2016-01-02 05:00", "2016-01-02 18:00"] (some irregular hours over two days).
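To make this setup concrete, here is a minimal sketch that uses only the standard library, not pandas itself. It treats a day-precision string as the half-open interval covering that day and lists which of the irregular timestamps fall inside it. The helper name `day_interval` is hypothetical and only serves the illustration.

```python
from datetime import datetime, timedelta

# The irregular index from the comment above.
index = [
    datetime(2016, 1, 1, 0, 0),
    datetime(2016, 1, 1, 12, 0),
    datetime(2016, 1, 1, 23, 0),
    datetime(2016, 1, 2, 5, 0),
    datetime(2016, 1, 2, 18, 0),
]


def day_interval(day_string):
    """Hypothetical helper: map 'YYYY-MM-DD' to the interval [start, end)."""
    start = datetime.strptime(day_string, "%Y-%m-%d")
    return start, start + timedelta(days=1)


start, end = day_interval("2016-01-01")
# A day-precision key can cover several entries of an hour-precision index.
matches = [ts for ts in index if start <= ts < end]
print(len(matches))  # 3
```

Under this reading, a key that is less precise than the index entries naturally behaves like a slice, because it describes a range of time rather than one instant.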
Yes, I'm going to improve the docs according to our discussion once I understand all the details. Your explanation sounds reasonable, but I cannot see why all these arguments do not apply to the case of a regular index. In fact, pandas can detect the resolution of a string-represented timestamp, this is done in function
series = pd.Series([1, 2, 3], pd.DatetimeIndex(['2016-12-07 00:00:00',
                                                '2016-12-07 01:00:00',
                                                '2016-12-07 02:00:00']))
print(series["2016-12-07 00:00:00"])
# 1
print(series["2016-12-07 00:00"])
# 1
print(series["2016-12-07"])
# 2016-12-07 00:00:00 1
# 2016-12-07 01:00:00 2
# 2016-12-07 02:00:00 3
# dtype: int64
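The idea of detecting the resolution of a timestamp string can be illustrated with a rough standard-library sketch. This is not pandas' own parser; `string_resolution` and the format table are hypothetical and only cover the `YYYY-MM-DD HH:MM:SS` family of strings used in this thread.

```python
from datetime import datetime

# (format, resolution name) pairs, tried from most to least precise.
_FORMATS = [
    ("%Y-%m-%d %H:%M:%S", "second"),
    ("%Y-%m-%d %H:%M", "minute"),
    ("%Y-%m-%d %H", "hour"),
    ("%Y-%m-%d", "day"),
    ("%Y-%m", "month"),
    ("%Y", "year"),
]


def string_resolution(text):
    """Hypothetical helper: return the finest field spelled out in `text`."""
    for fmt, reso in _FORMATS:
        try:
            datetime.strptime(text, fmt)
        except ValueError:
            continue  # this format does not match, try a coarser one
        return reso
    raise ValueError("unsupported timestamp string: %r" % text)


print(string_resolution("2016-12-07 00:00:00"))  # second
print(string_resolution("2016-12-07 00:00"))     # minute
print(string_resolution("2016-12-07"))           # day
```

In the series example above the index is hourly, so the first two keys are more precise than the index and return a scalar, while the day-precision key returns a `Series`.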
Finally, it seems that I got it. The code I'm interested in is the following:
def _partial_date_slice(self, reso, parsed, use_lhs=True, use_rhs=True):
    is_monotonic = self.is_monotonic
    if ((reso in ['day', 'hour', 'minute'] and
         not (self._resolution < Resolution.get_reso(reso) or
              not is_monotonic)) or
        (reso == 'second' and
         not (self._resolution <= Resolution.RESO_SEC or
              not is_monotonic))):
        # These resolution/monotonicity validations came from GH3931,
        # GH3452 and GH2369.
        raise KeyError
raising I'm not sure yet, why I'll try to improve the docs soon and prepare a PR that will refer to this issue.
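The nested negations in the quoted condition are easier to read after applying De Morgan's law. The sketch below is a standalone restatement, not pandas code: the numeric codes in `RESO` are assumed for illustration (smaller means more precise, and only the ordering matters), and both function names are hypothetical. It checks that the rewritten condition agrees with the original on every combination.

```python
from itertools import product

# Assumed numeric codes, smaller meaning more precise. The exact values are
# illustrative; only their relative order is used.
RESO = {"second": 3, "minute": 4, "hour": 5, "day": 6}


def raises_original(reso, index_reso, is_monotonic):
    """The condition as written in the quoted snippet."""
    return ((reso in ["day", "hour", "minute"] and
             not (index_reso < RESO[reso] or not is_monotonic)) or
            (reso == "second" and
             not (index_reso <= RESO["second"] or not is_monotonic)))


def raises_simplified(reso, index_reso, is_monotonic):
    """The same condition with the negations pushed inward."""
    if not is_monotonic:
        return False
    if reso == "second":
        # Strict comparison: equal resolution does NOT raise here.
        return index_reso > RESO["second"]
    # Non-strict comparison: equal resolution DOES raise here.
    return index_reso >= RESO[reso]


for reso, index_reso, mono in product(RESO, RESO.values(), [True, False]):
    assert raises_original(reso, index_reso, mono) == \
        raises_simplified(reso, index_reso, mono)

print(raises_simplified("minute", RESO["minute"], True))  # True
print(raises_simplified("second", RESO["second"], True))  # False
```

Written this way, the asymmetry is visible: with equal key and index resolution the `second` branch uses a strict comparison while the other branches do not.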
Btw, could anybody tell, why
I finally discovered that this was PR #3931. @jreback, could you please comment on this? Why do we introduce an inconsistency like this:
series = pd.Series([1, 2, 3, 4], pd.DatetimeIndex(['2016-12-06 23:59:00',
                                                   '2016-12-07 01:00:00',
                                                   '2016-12-07 01:01:00',
                                                   '2016-12-07 01:02:01']))
print(type(series["2016-12-07 01:01:00"]))
# <class 'pandas.core.series.Series'>
series = pd.Series([1, 2, 3, 4], pd.DatetimeIndex(['2016-12-07',
                                                   '2016-12-08',
                                                   '2016-12-09',
                                                   '2016-12-10']))
print(type(series["2016-12-07"]))
# <class 'numpy.int64'>
Why
@ischurov Thanks for digging in! So indeed, in the end it had nothing to do with the irregular/regular index (only the resolution of the index differs due to the irregularity, and this impacts how the slice is determined), but with the different treatment of second resolution versus the other resolutions. I am not sure why second resolution was handled differently from the other resolutions, and this seems rather inconsistent to me. By the way, apart from clarification in the docs, some comprehensive tests looping over combinations of different resolutions are also welcome.
if you look at the PR, DataFrames need to have a slice here (and not a single indexer). So I think this could fix the inconsistent case you enumerate above (e.g. seconds is an exact match in which case you raise So obviously this is not tested on series. But I think you'd have to introduce some logic to actually return a slice when selecting for a DataFrame, which in this case IS a slice. The other resolutions, day, hour, minute, cannot by definition ever have an exact match directly (because they always have a seconds component attached which you don't know). However, seconds is special in that it could be fully represented and actually be an exact match.
Why is that?
So AFAIK the PR made it possible to slice the above dataframe with that string index. However, I would argue that in this case this is no slice at all, but a single key (as both the indexer key and the index are of second resolution, so the result of such a string key will always be a slice of length 1?)
for a dataframe, by definition it IS always a slice, as it cannot be an exact match (wrong axis for exact matching); it can only ever be a slice, while for a series both are possible
As the issue is marked as a bug, may I ask, what is the desired behavior? Actually, I believe this is not
One can argue that if the resolution is greater than
@ischurov so can you show some short test cases that replicate the logic you have presented (and show what is changing from current).
See pandas-dev#14826. Now the following logic applies:
- If the timestamp resolution is strictly less precise than the index resolution, the timestamp is a slice, as it can (in theory) correspond to more than one element in the index. For `Series`, `[]` should return a `Series`; for `DataFrame`, a `DataFrame`.
- If the timestamp resolution is equal to the index resolution, then the timestamp is considered an attempt to get a kind of "exact match". For `Series`, `[]` should return a scalar; for `DataFrame`, it tries to find a column with this key (if any) and most probably raises `KeyError`.
- If the timestamp resolution is strictly more precise than the index resolution and does not resolve to an exact match, `KeyError` has to be raised in both cases.
The test suite is updated as well.
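The three cases in this commit message can be restated as a small decision function. This is only a sketch of the rule as described here, not the code from the pull request; the names `ORDER` and `classify` are hypothetical, and the resolution list is an illustrative subset.

```python
# Resolutions ordered from least to most precise (illustrative subset).
ORDER = ["year", "month", "day", "hour", "minute", "second"]


def classify(string_reso, index_reso, exact_match_exists=False):
    """Hypothetical helper: how a string key is treated under the rule above.

    Returns "slice", "exact" (an exact-match attempt) or "KeyError".
    """
    s, i = ORDER.index(string_reso), ORDER.index(index_reso)
    if s < i:
        # Key is less precise than the index: it may cover several entries.
        return "slice"
    if s == i:
        # Same precision: treated as an attempt at an exact match.
        return "exact"
    # Key is more precise than the index: valid only if it hits an entry.
    return "exact" if exact_match_exists else "KeyError"


print(classify("day", "hour"))       # slice
print(classify("second", "second"))  # exact
print(classify("second", "minute"))  # KeyError
```

For a `Series`, an "exact" outcome yields a scalar; for a `DataFrame`, `[]` looks the key up among the columns, so the same outcome will most probably end in `KeyError`, as the message says.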
@jreback Here is a super short summary of what's changed: Let
df = DataFrame({'a': [1, 2, 3]},
               DatetimeIndex(['2011-12-31 23:59:59',
                              '2012-01-01 00:00:00',
                              '2012-01-01 00:00:01']),
               dtype=np.int64)
Then
I reported the behaviour on StackOverflow (and @ischurov raised it here), so I believe I am not influenced by pandas design decisions/culture. As a newcomer, I expected the indexing/selection to return a consistent datatype (using the example in the response above):
My suggestion differs from @ischurov's proposition on KeyError. IMHO, if KeyError is to be used, then I would expect it to be raised when the index value does not have the same resolution as the DataFrame's Index.
I don't really understand this comment. If the resolution of the indexer is lower than that of the Index, you get a slice; if it is higher, you get a KeyError. So KeyError is already used in certain cases where the resolutions differ.
As a follow-up: are there any reasons why We added it to the docs but actually I'm not sure what is the reason for this?
series_monthly = pd.Series([1, 2, 3],
                           pd.DatetimeIndex(['2011-12',
                                             '2012-01',
                                             '2012-02']))
series_monthly.index.resolution # returns "day"
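One plausible reading of the `"day"` result, offered as an assumption rather than a statement about pandas internals: the index resolution is inferred from the stored timestamps, and `'2011-12'` is stored as midnight on the first day of the month, so no information survives that the values were meant as months. The standard-library sketch below mimics such an inference with a hypothetical helper, `inferred_resolution`, that never reports anything coarser than `"day"`.

```python
from datetime import datetime


def inferred_resolution(timestamps):
    """Hypothetical helper: finest non-zero field, never coarser than 'day'."""
    if any(ts.microsecond for ts in timestamps):
        return "microsecond"
    if any(ts.second for ts in timestamps):
        return "second"
    if any(ts.minute for ts in timestamps):
        return "minute"
    if any(ts.hour for ts in timestamps):
        return "hour"
    return "day"


# '2011-12', '2012-01', '2012-02' parsed to the first instant of each month.
monthly = [datetime(2011, 12, 1), datetime(2012, 1, 1), datetime(2012, 2, 1)]
print(inferred_resolution(monthly))  # day
```

Under that assumption, a month-precision string such as `'2011-12'` is still less precise than the index and would be treated as a slice.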
…lution
Closes pandas-dev#14826
Fix inconsistency in Partial String Index with 'second' resolution. See pandas-dev#14826. Now if the timestamp and the index both have resolution `second`, the timestamp is considered an exact match attempt and not a slice. Therefore, for `Series`, a scalar will be returned; for `DataFrame`, `KeyError` is raised.
Author: Ilya V. Schurov
Closes pandas-dev#14856 from ischurov/datetimeindex-slices and squashes the following commits:
2881a53 [Ilya V. Schurov] Merge branch 'datetimeindex-slices' of https://github.com/ischurov/pandas into datetimeindex-slices
ac8758e [Ilya V. Schurov] resolved merge conflict in whatsnew/v0.20.0.txt
0e87874 [Ilya V. Schurov] resolved merge conflict in whatsnew/v0.20.0.txt
0814e5b [Ilya V. Schurov] - Addressing code review: added reference to new docs section in whatsnew.
d215905 [Ilya V. Schurov] - Addressing code review: documentation clarification.
c287845 [Ilya V. Schurov] conflict PR pandas-dev#14856 resolved
40eddc3 [Ilya V. Schurov] - Documentation fixes
e17d210 [Ilya V. Schurov] - Whatsnew section added - Documentation section added
67e6bab [Ilya V. Schurov] Addressing code review: more comments added
c901588 [Ilya V. Schurov] Addressing code review: testing different combinations with the loop instead of copy-pasting of the code
9b55117 [Ilya V. Schurov] Addressing code review
b30039d [Ilya V. Schurov] Make flake8 happy.
cc86bdd [Ilya V. Schurov] Fix inconsistency in Partial String Index with 'second' resolution
ea51437 [Ilya V. Schurov] Made this code clearer.
This bug report is related to this SO question and the discussion there.
Summary
I believe that the current DatetimeIndex Partial String Indexing behavior is either inconsistent or underdocumented, as the result depends nontrivially on whether we are working with `Series` or `DataFrame` and whether `DateTimeIndex` is periodic or not.
`Series` vs. `DataFrame`
Here we see that the behaviour depends on what we are indexing: `Series` returns a scalar while `DataFrame` raises an exception. This exception is consistent with the documentation notice:
Why do we not get the same exception for the `Series` object?
Periodic vs. Non-periodic
In contrast with the previous example, we get an instance of `Series` here, so the same timestamp is considered as a slice, not an index. Why does it depend in such a way on the periodicity of the index?
No exceptions here, in contrast with the periodic case.
Is it intended behavior? If yes, I believe that this should be clearly documented and the rationale provided.
Output of `pd.show_versions()`
pandas: 0.19.0+157.g2466ecb
nose: 1.3.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.1
numpy: 1.11.2
scipy: None
statsmodels: None
xarray: None
IPython: 5.1.0
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
boto: None
pandas_datareader: None