CLN remove duplicated entries in valid_resos in pandas/core/indexes/datetimes.py #38121

boringow · 2020-11-27T20:23:26Z

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.

Before running, it is easy! just check the lines 560-570 of the pandas/code/indexes/datetime.py and you will understand the human error some developer did :)

# Your code here



import pandas as pd
import datetime as dt

Inside Github.csv, we have:

'''
,RxID,msgtype,message,confirmed,SNR,strength,format,icao24,correctedBits,LowConf,freqOffst
2020-08-25 11:00:08.187503455,1,10,4000,,0.0,-80.5,,,,,
2020-08-25 11:00:08.189013753,1,10,4000,,0.0,-78.5,,,,,
2020-08-25 11:00:08.189746310,1,8,,,0.0,-79.0,,,,,
2020-08-25 11:00:08.189986916,1,10,2700,,0.0,-78.0,,,,,
2020-08-25 11:00:08.190476779,1,6,20A93040502C94,1,0.0,-79.5,4.0,A1F014,,,
2020-08-25 11:00:08.190482661,1,9,0000,,0.0,-79.0,,,,,
2020-08-25 11:00:08.190963454,1,6,58EB0000171D17,1,0.0,-76.5,11.0,,,,
2020-08-25 11:00:08.191085134,1,9,7F00,,0.0,-78.5,,,,,
2020-08-25 11:00:08.191087092,1,9,0000,,0.0,-79.5,,,,,
2020-08-25 11:00:08.191123383,1,9,1000,,0.0,-72.0,,,,,
2020-08-25 11:00:08.191139020,1,7,,,0.0,-77.0,,,,,
2020-08-25 11:00:08.191150695,1,7,,,0.0,-76.5,,,,,
2020-08-25 11:00:08.191590978,1,10,0000,,0.0,-70.5,,,,,
2020-08-25 11:00:08.193015479,1,10,0000,,0.0,-83.0,,,,,
2020-08-25 11:00:08.193509041,1,9,2800,,0.0,-78.0,,,,,
2020-08-25 11:00:08.193664650,1,8,,,0.0,-80.5,,,,,
2020-08-25 11:00:08.193992571,1,10,0000,,0.0,-81.5,,,,,
2020-08-25 11:00:08.194459459,1,10,7D00,,0.0,-76.5,,,,,
2020-08-25 11:00:08.194461492,1,10,0001,,0.0,-76.0,,,,,
2020-08-25 11:00:08.195045194,1,10,0000,,0.0,-80.0,,,,,
2020-08-25 11:00:08.195061385,1,8,,,0.0,-80.0,,,,,
2020-08-25 11:00:08.195102628,1,10,0000,,0.0,-75.0,,,,,

'''

dfPKT = pd.read_csv('Github.csv',dtype={'RxId': int, 'msgtype': int, 'message': str, 'confirmed': object,'SNR': float, 'strength': float, 'format': float, 'icao24': str, 'correctedBits': float,'LowConf': float, 'freqOffst': float},index_col=0)
i=pd.DatetimeIndex(dfPKT.index,dayfirst=True)
dfPKT.set_index(i,inplace=True)

#occupancy constant
occupancyMode3Ainterrogation = 0.00015675

#How many messages are inside valid Mode A interrogation.
validModeAint=[]

#We start with the first timestamp as a reference.
Next_timestamp = dfPKT.index[0]
#Then, we get all the minutes of our study/analysis.
allminutes = dfPKT.index.floor('60S').unique()
#and for every minute, we do a data analysis.
for minute in allminutes:
    #Here, since the dataset is very big, I get small datasets of 1 minute each. 
    dfAnalysis = dfPKT.loc[(dfPKT.index > minute) & (dfPKT.index < minute + dt.timedelta(seconds=60))]
    for message, index, format, code, msgtype in zip(dfAnalysis["message"], dfAnalysis.index, dfAnalysis["format"],dfAnalysis["icao24"], dfAnalysis["msgtype"]):
    #Depending on the type of the message, we process it in one way or another
        if msgtype == 10:
            bits = bin(int(message, 16))[2:].zfill(16) #this is the 16 bits inside the messagetype 10. 
            if bits[:2] == '00': #if first two bits are zero
                '''
                when these two specific bits on the message are 0, we have to check on the same dataset if there is a message linked to it, from 8 to 13 microseconds after the actual message. HERE is were we get the error. It is trying to slice the datetime strings, and internally, it does not have a "milisecond" resolution. Altough my resolution in here is microseconds. But internally, it tries to do something with miliseconds and It breaks the code.
                '''
                if (2 in dfAnalysis[str(index + dt.timedelta(seconds=0.000008)):str(index + dt.timedelta(seconds=0.000013))]["msgtype"].to_numpy()):
                    validModeAint.append(index)
                    Next_timestamp = index + dt.timedelta(seconds=occupancyMode3Ainterrogation)

Problem description

Traceback (most recent call last):
  File "C:\Users\bkaguila\dev\anaconda\envs\Occupancies\lib\site-packages\pandas\core\indexes\datetimes.py", line 718, in slice_indexer
    return Index.slice_indexer(self, start, end, step, kind=kind)
  File "C:\Users\bkaguila\dev\anaconda\envs\Occupancies\lib\site-packages\pandas\core\indexes\base.py", line 4966, in slice_indexer
    start_slice, end_slice = self.slice_locs(start, end, step=step, kind=kind)
  File "C:\Users\bkaguila\dev\anaconda\envs\Occupancies\lib\site-packages\pandas\core\indexes\base.py", line 5169, in slice_locs
    start_slice = self.get_slice_bound(start, "left", kind)
  File "C:\Users\bkaguila\dev\anaconda\envs\Occupancies\lib\site-packages\pandas\core\indexes\base.py", line 5079, in get_slice_bound
    label = self._maybe_cast_slice_bound(label, side, kind)
  File "C:\Users\bkaguila\dev\anaconda\envs\Occupancies\lib\site-packages\pandas\core\indexes\datetimes.py", line 665, in _maybe_cast_slice_bound
    lower, upper = self._parsed_string_to_bounds(reso, parsed)
  File "C:\Users\bkaguila\dev\anaconda\envs\Occupancies\lib\site-packages\pandas\core\indexes\datetimes.py", line 536, in _parsed_string_to_bounds
    raise KeyError
KeyError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\bkaguila\dev\anaconda\envs\Occupancies\OccupancyScripts\OccupancyToKML\Github.py", line 31, in <module>
    if (2 in dfAnalysis[str(index + dt.timedelta(seconds=0.000008)):str(index + dt.timedelta(seconds=0.000013))]["msgtype"].to_numpy()):
  File "C:\Users\bkaguila\dev\anaconda\envs\Occupancies\lib\site-packages\pandas\core\frame.py", line 2881, in __getitem__
    indexer = convert_to_index_sliceable(self, key)
  File "C:\Users\bkaguila\dev\anaconda\envs\Occupancies\lib\site-packages\pandas\core\indexing.py", line 2132, in convert_to_index_sliceable
    return idx._convert_slice_indexer(key, kind="getitem")
  File "C:\Users\bkaguila\dev\anaconda\envs\Occupancies\lib\site-packages\pandas\core\indexes\base.py", line 3190, in _convert_slice_indexer
    indexer = self.slice_indexer(start, stop, step, kind=kind)
  File "C:\Users\bkaguila\dev\anaconda\envs\Occupancies\lib\site-packages\pandas\core\indexes\datetimes.py", line 728, in slice_indexer
    start_casted = self._maybe_cast_slice_bound(start, "left", kind)
  File "C:\Users\bkaguila\dev\anaconda\envs\Occupancies\lib\site-packages\pandas\core\indexes\datetimes.py", line 665, in _maybe_cast_slice_bound
    lower, upper = self._parsed_string_to_bounds(reso, parsed)
  File "C:\Users\bkaguila\dev\anaconda\envs\Occupancies\lib\site-packages\pandas\core\indexes\datetimes.py", line 536, in _parsed_string_to_bounds
    raise KeyError
KeyError

Process finished with exit code 1

The current behavior is that the indexes are partitioned, as they are strings, and the reso or Resolution class, which has nanosecond and minisecond resolution, then is passed to the datetime.py file on the pandas/core/indexes, which has the valid_resos (on line 520) wrong.

Expected Output

Process finished with exit code 0

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : 67a3d42
python : 3.9.0.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.18362
machine : AMD64
processor : Intel64 Family 6 Model 142 Stepping 12, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : English_United Kingdom.1252
pandas : 1.1.4
numpy : 1.19.4
pytz : 2020.1
dateutil : 2.8.1
pip : 20.2.4
setuptools : 50.3.1.post20201107
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.3.3
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

[paste the output of pd.show_versions() here leaving a blank line after the details tag]

The text was updated successfully, but these errors were encountered:

arw2019 · 2020-11-27T21:59:05Z

Can you take a look at this guide and provide a smaller code snippet? The idea is to localize the bug you want to fix

boringow · 2020-12-02T14:12:01Z

@jreback @arw2019 @rhshadrach

boringow · 2020-12-03T15:38:01Z

I tried to think in the developers and I trimmed my code all the maximum so you can understand it. Now it is easy for everyone to understand and I even the dataset to only 7 seconds

boringow · 2020-12-03T15:56:47Z

It is 16 lines of code, but the problem is super easy to solve:

The current behavior is that the indexes are partitioned, as they are strings, and the reso or Resolution class, which has nanosecond and minisecond resolution, then is passed to the datetime.py file on the pandas/core/indexes, which has the valid_resos (on line 520) wrong.

MarcoGorelli · 2020-12-18T15:24:36Z

Before running, it is easy! just check the lines 560-570 of the pandas/code/indexes/datetime.py and you will understand the human error some developer did :)

Thanks for the report @boringow , but this kind of comment isn't very helpful as the content of files changes all the time. Please post what you think the error is (looking at the current content of those lines:

        """
        Find the `freq` for self.insert(loc, item).
        """
        value = self._data._validate_scalar(item)
        item = self._data._box_func(value)

        freq = None
        if is_period_dtype(self.dtype):
            freq = self.freq
        elif self.freq is not None:
            # freq can be preserved on edge cases

it's not at all obvious to me)

MarcoGorelli · 2020-12-19T16:06:02Z

OK, I see what you're referring to now: the repetition in

        valid_resos = {
            "year",
            "month",
            "quarter",
            "day",
            "hour",
            "minute",
            "second",
            "minute",
            "second",
            "microsecond",
        }

(from your PR #38118)

boringow · 2021-01-23T20:20:52Z

exactly! And if I changed locally, it seems to work. It was obviously a human error.
Did you correct it on the master?

MarcoGorelli · 2021-01-24T11:32:39Z

Hi @boringow - no, this issue is still unfixed. If you'd like to submit another PR, then please include a test

boringow · 2021-01-30T15:39:12Z

@MarcoGorelli !? I do not understand, right here on issue #38118 , I reported the bug and I did a simple code to test it. I will change it now to be even simpler, but you already saw the error:
OK, I see what you're referring to now: the repetition in

    valid_resos = {
        "year",
        "month",
        "quarter",
        "day",
        "hour",
        "minute",
        "second",
        "minute",
        "second",
        "microsecond",
    }

MarcoGorelli · 2021-01-30T15:52:38Z

@boringow the bug report is fine :) What I meant was that if you make a pull request, there needs to be a test inside the pull request

boringow · 2021-01-30T16:13:15Z

@MarcoGorelli Okay! I see the BUG is fixed! sorry for disturbing, is my first time reporting a bug.
I have a simpler code but it seems to work now.
For the next time, I have to put the code on #38121, right? Confirm me if it is finally solved please
btw, the simpler code is the following:

import pandas as pd
import datetime as dt


data={'numbers':[1,2,3]}
df=pd.DataFrame(data)
i=pd.DatetimeIndex(['2020-08-25 11:00:08.187503455','2020-08-25 11:00:08.187513753','2020-08-25 11:00:08.189746310'],dayfirst=True)
df.set_index(i,inplace=True)

for index,number in zip(df.index,df['numbers']):
    if (2 in df[str(index+dt.timedelta(seconds=0.000008)):str(index+dt.timedelta(seconds=0.000013))]['numbers'].to_numpy()):
        print('The line above this raised an error.')

MarcoGorelli · 2021-01-30T18:21:19Z

I see the BUG is fixed!

I don't think so, what makes you think that?

the simpler code is the following:

I tried running the code, but didn't get any error - could you please show the output you get, and the output you were expecting to get?

I think you may be misunderstanding what we mean when we write "please add a test" - see here for an example of a pull request with a test. Look at the changes they made to the file pandas/tests/io/test_common.py . That's what a test looks like.

If you want to fix this issue, then please include a test in your pull request. If not, someone else will pick this up sooner or later

MarcoGorelli · 2021-01-30T18:33:04Z

I also don't get any error from trying to run the snippet you have in #38121 (comment)

I'll look into this more deeply later and check if it was throwing an error in the past

boringow · 2021-01-31T15:20:59Z

It does not give en error because someone solved the issue, but still, some developer did not perform his job correctly.
@MarcoGorelli :
MarcoGorelli commented on 19 Dec 2020 •

OK, I see what you're referring to now: the repetition in

    valid_resos = {
        "year",
        "month",
        "quarter",
        "day",
        "hour",
        "minute",
        "second",
        "minute",
        "second",
        "microsecond",
    }

Now, I checked the MASTER of pandas, and it appear to be like this:

    valid_resos = {
        "year",
        "month",
        "quarter",
        "day",
        "hour",
        "minute",
        "second",
        "minute",
        "second",
        "millisecond",
        "microsecond",
    }

Reference here:

pandas/pandas/core/indexes/datetimes.py

Line 562 in 0b16fb3

"millisecond",

So, as you can see, some developer add the millisecond! cool :) BUT HE DID NOT REMOVE THE DOUBLED "minute","second".

By the way, I want to become a pandas mantainer, although I have 1 year of experience I learn quite fast and this kind of errors won't happen to me :) How could I help?

MarcoGorelli · 2021-01-31T16:52:54Z

Awesome, so indeed it's fixed now

So, as you can see, some developer add the millisecond! cool :)

Yes, looks like it was fixed intentionally in #39378:

$ git log -S millisecond -p -- pandas/core/indexes/datetimes.py
commit 56476d1cd4e5b956e2597ca7361f0d6567dd1da9
Author: patrick <[email protected]>
Date:   Mon Jan 25 17:33:12 2021 +0100

    BUG: Slicing DatetimeIndex with strings containing microseconds raising KeyError (#39378)

diff --git a/pandas/core/indexes/datetimes.py b/pandas/core/indexes/datetimes.py
index 0df954e054..abe9a9c26c 100644
--- a/pandas/core/indexes/datetimes.py
+++ b/pandas/core/indexes/datetimes.py
@@ -565,6 +565,7 @@ class DatetimeIndex(DatetimeTimedeltaMixin):
             "second",
             "minute",
             "second",
+            "millisecond",
             "microsecond",
         }
         if reso.attrname not in valid_resos:

BUT HE DID NOT REMOVE THE DOUBLED "minute","second".

True - do you want to submit a new pull request (the old one would need rebasing anyway) to remove them then? As it would just be a clean-up, no tests would be required.

As an aside, gender-neutral language such as "BUT THEY DID NOT" would be appreciated

By the way, I want to become a pandas mantainer

Awesome! Perhaps start by becoming a contributor - if you're interested in maintenance, check the governance page: "The Project's Core Team will consist of Project Contributors who have produced contributions that are substantial in quality and quantity, and sustained over at least one year."

boringow · 2021-01-31T17:16:26Z

I was using the individual term for developer, as it is one persone who changed it. But yeah! surely, I want to start helping and form part of this community :). I will create a new pull request referencing this issue, so we can close it.
Thanks for your help! @MarcoGorelli

rhshadrach · 2021-01-31T17:21:08Z

@boringow The issue should remain open until the PR that closes it is merged.

boringow · 2021-01-31T17:32:34Z

Okay! sorry about that I didn't know it.

MarcoGorelli · 2021-01-31T17:42:46Z

They've opened #39503, which is clearer than this issue, so closing in favour of that one

boringow added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Nov 27, 2020

rhshadrach added Needs Info Clarification about behavior needed to assess issue and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Nov 29, 2020

arw2019 mentioned this issue Dec 2, 2020

BUG: parsed datetime string resolution incorrect #38118

Closed

6 tasks

MarcoGorelli added Clean and removed Bug Needs Info Clarification about behavior needed to assess issue labels Jan 31, 2021

boringow closed this as completed Jan 31, 2021

rhshadrach reopened this Jan 31, 2021

boringow mentioned this issue Jan 31, 2021

BUG: Repeated key words on datetime indexing methods (_parsed_string_to_bounds()) #39503

Closed

3 tasks

MarcoGorelli changed the title ~~BUG: Pull request Update datetimes.py, issue 38118~~ CLN remove duplicated entries in valid_resos in pandas/core/indexes/datetimes.py Jan 31, 2021

MarcoGorelli closed this as completed Jan 31, 2021

MarcoGorelli added this to the No action milestone Jan 31, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CLN remove duplicated entries in valid_resos in pandas/core/indexes/datetimes.py #38121

CLN remove duplicated entries in valid_resos in pandas/core/indexes/datetimes.py #38121

boringow commented Nov 27, 2020 •

edited by MarcoGorelli

Loading

arw2019 commented Nov 27, 2020

boringow commented Dec 2, 2020

boringow commented Dec 3, 2020

boringow commented Dec 3, 2020

MarcoGorelli commented Dec 18, 2020

MarcoGorelli commented Dec 19, 2020 •

edited

Loading

boringow commented Jan 23, 2021

MarcoGorelli commented Jan 24, 2021

boringow commented Jan 30, 2021

MarcoGorelli commented Jan 30, 2021

boringow commented Jan 30, 2021 •

edited by MarcoGorelli

Loading

MarcoGorelli commented Jan 30, 2021

MarcoGorelli commented Jan 30, 2021

boringow commented Jan 31, 2021 •

edited

Loading

MarcoGorelli commented Jan 31, 2021 •

edited

Loading

boringow commented Jan 31, 2021

rhshadrach commented Jan 31, 2021

boringow commented Jan 31, 2021

MarcoGorelli commented Jan 31, 2021

CLN remove duplicated entries in valid_resos in pandas/core/indexes/datetimes.py #38121

CLN remove duplicated entries in valid_resos in pandas/core/indexes/datetimes.py #38121

Comments

boringow commented Nov 27, 2020 • edited by MarcoGorelli Loading

Problem description

Expected Output

Output of pd.show_versions()

INSTALLED VERSIONS

arw2019 commented Nov 27, 2020

boringow commented Dec 2, 2020

boringow commented Dec 3, 2020

boringow commented Dec 3, 2020

MarcoGorelli commented Dec 18, 2020

MarcoGorelli commented Dec 19, 2020 • edited Loading

boringow commented Jan 23, 2021

MarcoGorelli commented Jan 24, 2021

boringow commented Jan 30, 2021

MarcoGorelli commented Jan 30, 2021

boringow commented Jan 30, 2021 • edited by MarcoGorelli Loading

MarcoGorelli commented Jan 30, 2021

MarcoGorelli commented Jan 30, 2021

boringow commented Jan 31, 2021 • edited Loading

MarcoGorelli commented Jan 31, 2021 • edited Loading

boringow commented Jan 31, 2021

rhshadrach commented Jan 31, 2021

boringow commented Jan 31, 2021

MarcoGorelli commented Jan 31, 2021

boringow commented Nov 27, 2020 •

edited by MarcoGorelli

Loading

Output of `pd.show_versions()`

MarcoGorelli commented Dec 19, 2020 •

edited

Loading

boringow commented Jan 30, 2021 •

edited by MarcoGorelli

Loading

boringow commented Jan 31, 2021 •

edited

Loading

MarcoGorelli commented Jan 31, 2021 •

edited

Loading