Skip to content

Inconsistent behavior of DatetimeIndex Partial String Indexing on Series and DataFrames #14826

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
ischurov opened this issue Dec 8, 2016 · 18 comments
Labels
Bug Datetime Datetime data dtype Indexing Related to indexing on series/frames, not to indexes themselves
Milestone

Comments

@ischurov
Copy link
Contributor

ischurov commented Dec 8, 2016

This bugreport is related to this SO question and the discussion there.

Summary

I believe that current DatetimeIndex Partial String Indexing behavior is either inconsistent or underdocumented as the result depends nontrivially on whether we are working with Series or DataFrame and whether DateTimeIndex is periodic or not.

Series vs. DataFrame

series = pd.Series([1, 2, 3], pd.DatetimeIndex(['2016-12-07 09:00:00',
                                                '2016-12-08 09:00:00',
                                                '2016-12-09 09:00:00']))
print(type(series["2016-12-07 09:00:00"]))
# <class 'numpy.int64'>
df = pd.DataFrame(series)
df["2016-12-07 09:00:00"]

KeyError: '2016-12-07 09:00:00'

Here we see that the behaviour depends on what we are indexing: Series returns scalar while DataFrame raises an exception. This exception is consistent with the documentation notice:

Warning The following selection will raise a KeyError; otherwise this selection methodology would be inconsistent with other selection methods in pandas (as this is not a slice, nor does it resolve to one)

Why we do not get the same exception for Series object?

Periodic vs. Non-periodic

series = pd.Series([1, 2, 3], pd.DatetimeIndex(['2016-12-07 09:00:00',
                                                '2016-12-08 09:00:00',
                                                '2016-12-09 09:00:01']))
# now it is not periodic due to 1 second in the last timestamp

print(type(series["2016-12-07 09:00:00"]))
# <class 'pandas.core.series.Series'>

In contrast with the previous example, we get an instance of Series here, so the same timestamp is considered as a slice, not index. Why it depends in such a way on periodicity of the index?

df = pd.DataFrame(series)
print(type(df["2016-12-07 09:00:00"]))
# <class 'pandas.core.frame.DataFrame'>

No exceptions here, in contrast with periodic case.

Is it intended behavior? If yes, I believe that this should be clearly documented and rationale provided.

Output of pd.show_versions()

INSTALLED VERSIONS ------------------ commit: None python: 3.5.2.final.0 python-bits: 64 OS: Darwin OS-release: 16.1.0 machine: x86_64 processor: i386 byteorder: little LC_ALL: en_US.UTF-8 LANG: en_US.UTF-8 LOCALE: en_US.UTF-8

pandas: 0.19.0+157.g2466ecb
nose: 1.3.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.1
numpy: 1.11.2
scipy: None
statsmodels: None
xarray: None
IPython: 5.1.0
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
boto: None
pandas_datareader: None

@ischurov ischurov changed the title Inconsistent behaviour of DatetimeIndex Partial String Indexing on Series and DataFrames Inconsistent behavior of DatetimeIndex Partial String Indexing on Series and DataFrames Dec 8, 2016
@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Dec 8, 2016

The first "series vs dataframe" issue is as expected / follows from the second issue. __getitem__/[] indexing does not work exactly the same for series vs dataframe (http://pandas.pydata.org/pandas-docs/stable/indexing.html#basics). When provided a single key, Series will get a single value, while dataframe will try to get a column. And there is no column named "2016-12-07 09:00:00", hence the KeyError.
So this behaviour follows from the fact that in your first example, "2016-12-07 09:00:00" is not interpreted as a slice, but as a single key. Given that, the behaviour is as expected.

But, you are correct there might be an inconsistency in determining whether the string is a single key or a slice between the regular and irregular datetimeindex. The problem is that it is very difficult for pandas to guess/determine this when the index has no frequency.

@ischurov
Copy link
Contributor Author

ischurov commented Dec 8, 2016

For the first part, I completely agree. I forgot that [] for dataframe select columns if argument is not a slice (shame on me), this explains the behavior nicely. (And now I understand the rationale behind the Warning in the docs I quoted.)

For the second part, is it true that for irregular indexes any string is considered to be slice, and for regular ones only those strings that provide date-time specification with precision less then frequency is considered to be slice?

Why it is not possible to consider string a key (not a slice) if it is casted to date-time that is exactly the same as one of keys in the index for irregular indexes?

@jorisvandenbossche
Copy link
Member

And now I understand the rationale behind the Warning in the docs I quoted.

If you have ideas to rephrase this to make it clearer, very welcome!

Why it is not possible to consider string a key (not a slice) if it is casted to date-time that is exactly the same as one of keys in the index for irregular indexes?

I am not exactly sure how it is implemented in the code, but imagine the following case: you have a timeseries with index ["2016-01-01 00:00", "2016-01-01 12:00", "2016-01-01 23:00", "2016-01-02 05:00", "2016-01-02 18:00"] (some irregular hours over two days).
If you would index this with the key "2016-01-01", this can be interpreted as "give me all the data of the 1st of January", so in this case: a slice (which is what pandas does). But if you would also want to check if the provided key exactly matches with one of the labels of the index, pandas parses this key -> pd.Timestamp("2016-01-01"), which gives you "2016-01-01 00:00:00" (because pandas does not have different resolution in its datetime data type) and this would match with one of the elements (so: not a slice). So which one of the two would the user want?
(to put it in other words, pandas cannot make a distinction between s["2016-01-01"] (logically a slice) and s["2016-01-01 00:00"] (logically a single element))

@ischurov
Copy link
Contributor Author

ischurov commented Dec 8, 2016

Yes, I'm going to improve the docs according to our discussion after I understand all the details.

Your explanation sounds reasonable, but I cannot get why all these arguments do not apply to the case of regular index? In fact, pandas can detect the resolution of a string-represented timestamp, this is done in function pandas.tseries.tools.parse_time_string() and it allows pandas to distinct s["2016-01-01"] and s["2016-01-01 00:00"] in case s is indexed with regular index:

series = pd.Series([1, 2, 3], pd.DatetimeIndex(['2016-12-07 00:00:00',
                                                '2016-12-07 01:00:00',
                                                '2016-12-07 02:00:00']))
print(series["2016-12-07 00:00:00"])
# 1
print(series["2016-12-07 00:00"])
# 1
print(series["2016-12-07"])
# 2016-12-07 00:00:00    1
# 2016-12-07 01:00:00    2
# 2016-12-07 02:00:00    3
# dtype: int64

@ischurov
Copy link
Contributor Author

ischurov commented Dec 9, 2016

Finally, it seems that I got it. The code I'm interested in is the following:

    def _partial_date_slice(self, reso, parsed, use_lhs=True, use_rhs=True):
        is_monotonic = self.is_monotonic
        if ((reso in ['day', 'hour', 'minute'] and
             not (self._resolution < Resolution.get_reso(reso) or
                  not is_monotonic)) or
            (reso == 'second' and
             not (self._resolution <= Resolution.RESO_SEC or
                  not is_monotonic))):
            # These resolution/monotonicity validations came from GH3931,
            # GH3452 and GH2369.
            raise KeyError

raising KeyError here means that the timestamp cannot be coerced to a slice. The condition basically says that if the resolution of the timestamp (that we remember from the string) is less precise than the resolution of the index, it is slice, otherwise it is not. That sounds very reasonable.

I'm not sure yet, why second resolution uses different condition (non-strict inequality instead of strict one), but basically I understand what's going on here.

I'll try to improve the docs soon and prepare PR that will refer to this issue.

@ischurov
Copy link
Contributor Author

ischurov commented Dec 9, 2016

Btw, could anybody tell, why second resolution is treated in such a different way here? I tried to figure it out myself (looking at PRs mentioned near the code), but didn't succeed.

@ischurov
Copy link
Contributor Author

ischurov commented Dec 9, 2016

I finally discovered that this was PR #3931

@jreback, could you please comment on this? Why do we introduce the inconsistence like this:

series = pd.Series([1, 2, 3, 4], pd.DatetimeIndex(['2016-12-06 23:59:00',
                                                   '2016-12-07 01:00:00',
                                                   '2016-12-07 01:01:00',
                                                   '2016-12-07 01:02:01']))

print(type(series["2016-12-07 01:01:00"]))
# <class 'pandas.core.series.Series'>

series = pd.Series([1, 2, 3, 4], pd.DatetimeIndex(['2016-12-07',
                                                   '2016-12-08',
                                                   '2016-12-09',
                                                   '2016-12-10']))
print(type(series["2016-12-07"]))

# <class 'numpy.int64'>

Why second frequency is treated in different way?

@jorisvandenbossche
Copy link
Member

@ischurov Thanks for digging in! So indeed, it had in the end nothing to do with the irregular/regular index (only the resolution of the index is different due to the irregularity, and this impact how the slice is determined), but the different treatment of second resolution or higher resolutions.

I am not sure why this was added differently as the other resolution, and this seems rather inconsistent to me.

By the way, apart from clarification in the docs, some comprehensive tests looping over some combinations of different resolutions is also welcome

@jreback
Copy link
Contributor

jreback commented Dec 9, 2016

if you look at the PR, DataFrames need to have a slice here (and not a single indexer). So I think this could fix the inconsistent case you enumerate above (e.g. seconds is an exact match in which case you raise KeyError; this may be counter intuitive in how to do it, but it goes to another path that converts to a timestamp and just looks for an exact match, if its there it returns the value, otherwise raises).

So obviously this is not tested on series.

But I think you'd have to introduce some logic to actually return a slice when selecting for a DataFrame, which in this case IS a slice.

The other resolutions, day, hour, minute, cannot by definition ever have an exact match directly (because they always have a seconds component attached which you don't know). However seconds is special in that it could fully represented and actually be an exact match.

@jreback jreback added Bug Difficulty Advanced Indexing Related to indexing on series/frames, not to indexes themselves Datetime Datetime data dtype labels Dec 9, 2016
@jreback jreback added this to the Next Major Release milestone Dec 9, 2016
@jorisvandenbossche
Copy link
Member

if you look at the PR, DataFrames need to have a slice here (and not a single indexer).

Why is that?
The example you give at the top of that PR is:

In [11]: df = DataFrame(randn(5,5),columns=['open','high','low','close','volume'],index=date_range('2012-01-02 18:01:00',periods=5,tz='US/Central',freq='s'))

In [12]: df
Out[12]: 
                               open      high       low     close    volume
2012-01-02 18:01:00-06:00  0.131243  0.301542  0.128027  0.804162  1.296658
2012-01-02 18:01:01-06:00  0.341487  1.548695  0.703234  0.904201  1.422337
2012-01-02 18:01:02-06:00 -1.050453 -1.884035  1.537788 -0.821058  0.558631
2012-01-02 18:01:03-06:00  0.846885  1.045378 -0.722903 -0.613625 -0.476531
2012-01-02 18:01:04-06:00  1.186823 -0.018299 -0.513886 -1.103269 -0.311907

In [14]: df['2012-01-02 18:01:02']
Out[14]: 
                               open      high       low     close    volume
2012-01-02 18:01:02-06:00 -1.050453 -1.884035  1.537788 -0.821058  0.558631

So AFAIK the PR made possible to slice the above dataframe with that string index. However, I would argue that in this case this is no slice at all, but a single key (as both the indexer key as the index is of second resolution, so the result of such a string key will always be slice of length 1 ?)

@jreback
Copy link
Contributor

jreback commented Dec 9, 2016

for a dataframe by definition is IS a slice always as it cannot be an exact match (wrong axis for exact matching); it can only ever be a slice

while for a series both are possible

@ischurov
Copy link
Contributor Author

ischurov commented Dec 9, 2016

As the issue is marked as bug, may I ask, what is the desired behavior?

Actually, I believe this is not Series issue, it's DataFrame issue as well. I believe the following logic is consistent:

  • If timestamp resolution is strictly greater (less precise) than index resolution, timetamp is a slice as it can (in theory) correspond to more than one elements in the index. For Series, [] should return Series, for DataFrameDataFrame.
  • If timestamp resolution is equal to index resolution, then timestamp is considered as an attempt to get a kind of "exact match". For Series, [] should return scalar, for DataFrame — try to find column with this key (if any), and most probably raise KeyError.
  • If timestamp resolution is strictly less than index resolution, KeyError have to be raised in both cases.

One can argue that if the resolution is greater than second, no exact match possible. However, I believe that it's an implementation detail — how timestamps are presented internally — and from user's point of view if the resolution of index is e.g. hour, then "2016-01-01 01" is in fact exact match.

@jreback
Copy link
Contributor

jreback commented Dec 10, 2016

@ischurov so can you show some short test cases that replicate the logic you have presented (and show what is changing from current).

ischurov added a commit to ischurov/pandas that referenced this issue Dec 11, 2016
See pandas-dev#14826.

Now the following logic applies:

- If timestamp resolution is strictly less precise than index resolution, timetamp is a slice as it can (in theory) correspond to more than one elements in the index. For `Series`, `[]` should return `Series`, for `DataFrame` — `DataFrame`.
- If timestamp resolution is equal to index resolution, then timestamp is considered as an attempt to get a kind of "exact match". For `Series`, `[]` should return scalar, for `DataFrame` — try to find column with this key (if any), and most probably raise `KeyError`.
- If timestamp resolution is strictly more precise than index resolution and does not resolve to exact match, `KeyError` have to be raised in both cases.

Testsuite is updated as well.
@ischurov
Copy link
Contributor Author

@jreback see PR #14856. I added several tests to check partial string indexing with respect to the logic stated above. With current code, only test_partial_slice_second fails. The proposed solution is as simple as this.

@ischurov
Copy link
Contributor Author

@jreback Here is a super short summary of what's changed:

Let

df = DataFrame({'a': [1, 2, 3]},
                       DatetimeIndex(['2011-12-31 23:59:59',
                                      '2012-01-01 00:00:00',
                                      '2012-01-01 00:00:01']),
                       dtype=np.int64)

Then df['a']['2011-12-31 23:59:59'] should return np.int64 object 1 (now returns Series) and df['2011-12-31 23:59:59'] should raise KeyError (now return DataFrame).

@davidandreoletti
Copy link

davidandreoletti commented Dec 12, 2016

I reported the behaviour on StackOverflow (and @ischurov raised it here), so I believe I am not influenced by Pandas design decision/culture. As a new comer, I expected the indexing/selection to return a consistent datatype (using the example in the response above):

  • if the index value (eg: '2012-01-01 00:00:00') has the same resolution as the values in the DataFrame's index and the index value has an exact match then return a single scalar value.
  • Otherwise, return a Series containing all scalar values that can match the index value's resolution. This means that if the index value is has a higher resolution than the index value then a empty series is returned.

My suggestion differs from @ischurov's proposition on KeyError. IMHO, if KeyError is too be used then I would expect it to be raised when the index value is not the same resolution as the DataFrame's Index resolution.

@jorisvandenbossche
Copy link
Member

if KeyError is too be used then I would expect it to be raised when the index value is not the same resolution as the DataFrame's Index resolution.

I don't really understand this comment. If the resolution of the indexer is lower than of the Index, you get a slice, if it is higher, you get a KeyError. So KeyError is already used in certain cases where the resolutions differ.

@jreback jreback modified the milestones: 0.20.0, Next Major Release Dec 18, 2016
@ischurov
Copy link
Contributor Author

As a follow-up: are there any reasons why DatetimeIndex resolution cannot be less precise than day?

We added it to the docs but actually I'm not sure what is the reason for this?

    series_monthly = pd.Series([1, 2, 3],
                              pd.DatetimeIndex(['2011-12',
                                                '2012-01',
                                                '2012-02']))
    series_monthly.index.resolution # returns "day"

ShaharBental pushed a commit to ShaharBental/pandas that referenced this issue Dec 26, 2016
…lution

Closes pandas-dev#14826

Fix inconsistency in Partial String Index with 'second' resolution.
See pandas-dev#14826. Now if the timestamp and the index both have resolution
`second`, timestamp is considered as an exact match try and not a
slice. Therefore, for `Series`, scalar will be returned, for
`DataFrame` `KeyError` raised.

Author: Ilya V. Schurov <[email protected]>

Closes pandas-dev#14856 from ischurov/datetimeindex-slices and squashes the following commits:

2881a53 [Ilya V. Schurov] Merge branch 'datetimeindex-slices' of https://github.com/ischurov/pandas into datetimeindex-slices
ac8758e [Ilya V. Schurov] resolved merge conflict in whatsnew/v0.20.0.txt
0e87874 [Ilya V. Schurov] resolved merge conflict in whatsnew/v0.20.0.txt
0814e5b [Ilya V. Schurov] - Addressing code review: added reference to new docs section in whatsnew.
d215905 [Ilya V. Schurov] - Addressing code review: documentation clarification.
c287845 [Ilya V. Schurov] conflict PR pandas-dev#14856 resolved
40eddc3 [Ilya V. Schurov] - Documentation fixes
e17d210 [Ilya V. Schurov] - Whatsnew section added - Documentation section added
67e6bab [Ilya V. Schurov] Addressing code review: more comments added
c901588 [Ilya V. Schurov] Addressing code review: testing different combinations with the loop instead of copy-pasting of the code
9b55117 [Ilya V. Schurov] Addressing code review
b30039d [Ilya V. Schurov] Make flake8 happy.
cc86bdd [Ilya V. Schurov] Fix inconsistency in Partial String Index with 'second' resolution
ea51437 [Ilya V. Schurov] Made this code clearer.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Datetime Datetime data dtype Indexing Related to indexing on series/frames, not to indexes themselves
Projects
None yet
Development

No branches or pull requests

4 participants