Skip to content

CLN remove duplicated entries in valid_resos in pandas/core/indexes/datetimes.py #38121

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
2 of 3 tasks
boringow opened this issue Nov 27, 2020 · 19 comments
Closed
2 of 3 tasks
Labels

Comments

@boringow
Copy link

boringow commented Nov 27, 2020

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Before running, it is easy! just check the lines 560-570 of the pandas/code/indexes/datetime.py and you will understand the human error some developer did :)

# Your code here



import pandas as pd
import datetime as dt

Inside Github.csv, we have:

'''
,RxID,msgtype,message,confirmed,SNR,strength,format,icao24,correctedBits,LowConf,freqOffst
2020-08-25 11:00:08.187503455,1,10,4000,,0.0,-80.5,,,,,
2020-08-25 11:00:08.189013753,1,10,4000,,0.0,-78.5,,,,,
2020-08-25 11:00:08.189746310,1,8,,,0.0,-79.0,,,,,
2020-08-25 11:00:08.189986916,1,10,2700,,0.0,-78.0,,,,,
2020-08-25 11:00:08.190476779,1,6,20A93040502C94,1,0.0,-79.5,4.0,A1F014,,,
2020-08-25 11:00:08.190482661,1,9,0000,,0.0,-79.0,,,,,
2020-08-25 11:00:08.190963454,1,6,58EB0000171D17,1,0.0,-76.5,11.0,,,,
2020-08-25 11:00:08.191085134,1,9,7F00,,0.0,-78.5,,,,,
2020-08-25 11:00:08.191087092,1,9,0000,,0.0,-79.5,,,,,
2020-08-25 11:00:08.191123383,1,9,1000,,0.0,-72.0,,,,,
2020-08-25 11:00:08.191139020,1,7,,,0.0,-77.0,,,,,
2020-08-25 11:00:08.191150695,1,7,,,0.0,-76.5,,,,,
2020-08-25 11:00:08.191590978,1,10,0000,,0.0,-70.5,,,,,
2020-08-25 11:00:08.193015479,1,10,0000,,0.0,-83.0,,,,,
2020-08-25 11:00:08.193509041,1,9,2800,,0.0,-78.0,,,,,
2020-08-25 11:00:08.193664650,1,8,,,0.0,-80.5,,,,,
2020-08-25 11:00:08.193992571,1,10,0000,,0.0,-81.5,,,,,
2020-08-25 11:00:08.194459459,1,10,7D00,,0.0,-76.5,,,,,
2020-08-25 11:00:08.194461492,1,10,0001,,0.0,-76.0,,,,,
2020-08-25 11:00:08.195045194,1,10,0000,,0.0,-80.0,,,,,
2020-08-25 11:00:08.195061385,1,8,,,0.0,-80.0,,,,,
2020-08-25 11:00:08.195102628,1,10,0000,,0.0,-75.0,,,,,

'''
dfPKT = pd.read_csv('Github.csv',dtype={'RxId': int, 'msgtype': int, 'message': str, 'confirmed': object,'SNR': float, 'strength': float, 'format': float, 'icao24': str, 'correctedBits': float,'LowConf': float, 'freqOffst': float},index_col=0)
i=pd.DatetimeIndex(dfPKT.index,dayfirst=True)
dfPKT.set_index(i,inplace=True)

#occupancy constant
occupancyMode3Ainterrogation = 0.00015675

#How many messages are inside valid Mode A interrogation.
validModeAint=[]

#We start with the first timestamp as a reference.
Next_timestamp = dfPKT.index[0]
#Then, we get all the minutes of our study/analysis.
allminutes = dfPKT.index.floor('60S').unique()
#and for every minute, we do a data analysis.
for minute in allminutes:
    #Here, since the dataset is very big, I get small datasets of 1 minute each. 
    dfAnalysis = dfPKT.loc[(dfPKT.index > minute) & (dfPKT.index < minute + dt.timedelta(seconds=60))]
    for message, index, format, code, msgtype in zip(dfAnalysis["message"], dfAnalysis.index, dfAnalysis["format"],dfAnalysis["icao24"], dfAnalysis["msgtype"]):
    #Depending on the type of the message, we process it in one way or another
        if msgtype == 10:
            bits = bin(int(message, 16))[2:].zfill(16) #this is the 16 bits inside the messagetype 10. 
            if bits[:2] == '00': #if first two bits are zero
                '''
                when these two specific bits on the message are 0, we have to check on the same dataset if there is a message linked to it, from 8 to 13 microseconds after the actual message. HERE is were we get the error. It is trying to slice the datetime strings, and internally, it does not have a "milisecond" resolution. Altough my resolution in here is microseconds. But internally, it tries to do something with miliseconds and It breaks the code.
                '''
                if (2 in dfAnalysis[str(index + dt.timedelta(seconds=0.000008)):str(index + dt.timedelta(seconds=0.000013))]["msgtype"].to_numpy()):
                    validModeAint.append(index)
                    Next_timestamp = index + dt.timedelta(seconds=occupancyMode3Ainterrogation)

Problem description

Traceback (most recent call last):
  File "C:\Users\bkaguila\dev\anaconda\envs\Occupancies\lib\site-packages\pandas\core\indexes\datetimes.py", line 718, in slice_indexer
    return Index.slice_indexer(self, start, end, step, kind=kind)
  File "C:\Users\bkaguila\dev\anaconda\envs\Occupancies\lib\site-packages\pandas\core\indexes\base.py", line 4966, in slice_indexer
    start_slice, end_slice = self.slice_locs(start, end, step=step, kind=kind)
  File "C:\Users\bkaguila\dev\anaconda\envs\Occupancies\lib\site-packages\pandas\core\indexes\base.py", line 5169, in slice_locs
    start_slice = self.get_slice_bound(start, "left", kind)
  File "C:\Users\bkaguila\dev\anaconda\envs\Occupancies\lib\site-packages\pandas\core\indexes\base.py", line 5079, in get_slice_bound
    label = self._maybe_cast_slice_bound(label, side, kind)
  File "C:\Users\bkaguila\dev\anaconda\envs\Occupancies\lib\site-packages\pandas\core\indexes\datetimes.py", line 665, in _maybe_cast_slice_bound
    lower, upper = self._parsed_string_to_bounds(reso, parsed)
  File "C:\Users\bkaguila\dev\anaconda\envs\Occupancies\lib\site-packages\pandas\core\indexes\datetimes.py", line 536, in _parsed_string_to_bounds
    raise KeyError
KeyError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\bkaguila\dev\anaconda\envs\Occupancies\OccupancyScripts\OccupancyToKML\Github.py", line 31, in <module>
    if (2 in dfAnalysis[str(index + dt.timedelta(seconds=0.000008)):str(index + dt.timedelta(seconds=0.000013))]["msgtype"].to_numpy()):
  File "C:\Users\bkaguila\dev\anaconda\envs\Occupancies\lib\site-packages\pandas\core\frame.py", line 2881, in __getitem__
    indexer = convert_to_index_sliceable(self, key)
  File "C:\Users\bkaguila\dev\anaconda\envs\Occupancies\lib\site-packages\pandas\core\indexing.py", line 2132, in convert_to_index_sliceable
    return idx._convert_slice_indexer(key, kind="getitem")
  File "C:\Users\bkaguila\dev\anaconda\envs\Occupancies\lib\site-packages\pandas\core\indexes\base.py", line 3190, in _convert_slice_indexer
    indexer = self.slice_indexer(start, stop, step, kind=kind)
  File "C:\Users\bkaguila\dev\anaconda\envs\Occupancies\lib\site-packages\pandas\core\indexes\datetimes.py", line 728, in slice_indexer
    start_casted = self._maybe_cast_slice_bound(start, "left", kind)
  File "C:\Users\bkaguila\dev\anaconda\envs\Occupancies\lib\site-packages\pandas\core\indexes\datetimes.py", line 665, in _maybe_cast_slice_bound
    lower, upper = self._parsed_string_to_bounds(reso, parsed)
  File "C:\Users\bkaguila\dev\anaconda\envs\Occupancies\lib\site-packages\pandas\core\indexes\datetimes.py", line 536, in _parsed_string_to_bounds
    raise KeyError
KeyError

Process finished with exit code 1

The current behavior is that the indexes are partitioned, as they are strings, and the reso or Resolution class, which has nanosecond and minisecond resolution, then is passed to the datetime.py file on the pandas/core/indexes, which has the valid_resos (on line 520) wrong.

Expected Output

Process finished with exit code 0

Output of pd.show_versions()

INSTALLED VERSIONS

commit : 67a3d42
python : 3.9.0.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.18362
machine : AMD64
processor : Intel64 Family 6 Model 142 Stepping 12, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : English_United Kingdom.1252
pandas : 1.1.4
numpy : 1.19.4
pytz : 2020.1
dateutil : 2.8.1
pip : 20.2.4
setuptools : 50.3.1.post20201107
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.3.3
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

[paste the output of pd.show_versions() here leaving a blank line after the details tag]

@boringow boringow added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Nov 27, 2020
@arw2019
Copy link
Member

arw2019 commented Nov 27, 2020

Can you take a look at this guide and provide a smaller code snippet? The idea is to localize the bug you want to fix

@rhshadrach rhshadrach added Needs Info Clarification about behavior needed to assess issue and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Nov 29, 2020
@boringow
Copy link
Author

boringow commented Dec 2, 2020

@jreback @arw2019 @rhshadrach

@boringow
Copy link
Author

boringow commented Dec 3, 2020

I tried to think in the developers and I trimmed my code all the maximum so you can understand it. Now it is easy for everyone to understand and I even the dataset to only 7 seconds

@boringow
Copy link
Author

boringow commented Dec 3, 2020

It is 16 lines of code, but the problem is super easy to solve:

The current behavior is that the indexes are partitioned, as they are strings, and the reso or Resolution class, which has nanosecond and minisecond resolution, then is passed to the datetime.py file on the pandas/core/indexes, which has the valid_resos (on line 520) wrong.

@MarcoGorelli
Copy link
Member

Before running, it is easy! just check the lines 560-570 of the pandas/code/indexes/datetime.py and you will understand the human error some developer did :)

Thanks for the report @boringow , but this kind of comment isn't very helpful as the content of files changes all the time. Please post what you think the error is (looking at the current content of those lines:

        """
        Find the `freq` for self.insert(loc, item).
        """
        value = self._data._validate_scalar(item)
        item = self._data._box_func(value)

        freq = None
        if is_period_dtype(self.dtype):
            freq = self.freq
        elif self.freq is not None:
            # freq can be preserved on edge cases

it's not at all obvious to me)

@MarcoGorelli
Copy link
Member

MarcoGorelli commented Dec 19, 2020

OK, I see what you're referring to now: the repetition in

        valid_resos = {
            "year",
            "month",
            "quarter",
            "day",
            "hour",
            "minute",
            "second",
            "minute",
            "second",
            "microsecond",
        }

(from your PR #38118)

@boringow
Copy link
Author

exactly! And if I changed locally, it seems to work. It was obviously a human error.
Did you correct it on the master?

@MarcoGorelli
Copy link
Member

Hi @boringow - no, this issue is still unfixed. If you'd like to submit another PR, then please include a test

@boringow
Copy link
Author

@MarcoGorelli !? I do not understand, right here on issue #38118 , I reported the bug and I did a simple code to test it. I will change it now to be even simpler, but you already saw the error:
OK, I see what you're referring to now: the repetition in

    valid_resos = {
        "year",
        "month",
        "quarter",
        "day",
        "hour",
        "minute",
        "second",
        "minute",
        "second",
        "microsecond",
    }

@MarcoGorelli
Copy link
Member

@boringow the bug report is fine :) What I meant was that if you make a pull request, there needs to be a test inside the pull request

@boringow
Copy link
Author

boringow commented Jan 30, 2021

@MarcoGorelli Okay! I see the BUG is fixed! sorry for disturbing, is my first time reporting a bug.
I have a simpler code but it seems to work now.
For the next time, I have to put the code on #38121, right? Confirm me if it is finally solved please
btw, the simpler code is the following:

import pandas as pd
import datetime as dt


data={'numbers':[1,2,3]}
df=pd.DataFrame(data)
i=pd.DatetimeIndex(['2020-08-25 11:00:08.187503455','2020-08-25 11:00:08.187513753','2020-08-25 11:00:08.189746310'],dayfirst=True)
df.set_index(i,inplace=True)

for index,number in zip(df.index,df['numbers']):
    if (2 in df[str(index+dt.timedelta(seconds=0.000008)):str(index+dt.timedelta(seconds=0.000013))]['numbers'].to_numpy()):
        print('The line above this raised an error.')

@MarcoGorelli
Copy link
Member

I see the BUG is fixed!

I don't think so, what makes you think that?

the simpler code is the following:

I tried running the code, but didn't get any error - could you please show the output you get, and the output you were expecting to get?


I think you may be misunderstanding what we mean when we write "please add a test" - see here for an example of a pull request with a test. Look at the changes they made to the file pandas/tests/io/test_common.py . That's what a test looks like.

If you want to fix this issue, then please include a test in your pull request. If not, someone else will pick this up sooner or later

@MarcoGorelli
Copy link
Member

I also don't get any error from trying to run the snippet you have in #38121 (comment)

I'll look into this more deeply later and check if it was throwing an error in the past

@boringow
Copy link
Author

boringow commented Jan 31, 2021

It does not give en error because someone solved the issue, but still, some developer did not perform his job correctly.
@MarcoGorelli :
MarcoGorelli commented on 19 Dec 2020 •

OK, I see what you're referring to now: the repetition in

    valid_resos = {
        "year",
        "month",
        "quarter",
        "day",
        "hour",
        "minute",
        "second",
        "minute",
        "second",
        "microsecond",
    }

Now, I checked the MASTER of pandas, and it appear to be like this:

    valid_resos = {
        "year",
        "month",
        "quarter",
        "day",
        "hour",
        "minute",
        "second",
        "minute",
        "second",
        "millisecond",
        "microsecond",
    }

Reference here:

So, as you can see, some developer add the millisecond! cool :) BUT HE DID NOT REMOVE THE DOUBLED "minute","second".

By the way, I want to become a pandas mantainer, although I have 1 year of experience I learn quite fast and this kind of errors won't happen to me :) How could I help?

@MarcoGorelli MarcoGorelli added Clean and removed Bug Needs Info Clarification about behavior needed to assess issue labels Jan 31, 2021
@MarcoGorelli
Copy link
Member

MarcoGorelli commented Jan 31, 2021

Awesome, so indeed it's fixed now

So, as you can see, some developer add the millisecond! cool :)

Yes, looks like it was fixed intentionally in #39378:

$ git log -S millisecond -p -- pandas/core/indexes/datetimes.py
commit 56476d1cd4e5b956e2597ca7361f0d6567dd1da9
Author: patrick <[email protected]>
Date:   Mon Jan 25 17:33:12 2021 +0100

    BUG: Slicing DatetimeIndex with strings containing microseconds raising KeyError (#39378)

diff --git a/pandas/core/indexes/datetimes.py b/pandas/core/indexes/datetimes.py
index 0df954e054..abe9a9c26c 100644
--- a/pandas/core/indexes/datetimes.py
+++ b/pandas/core/indexes/datetimes.py
@@ -565,6 +565,7 @@ class DatetimeIndex(DatetimeTimedeltaMixin):
             "second",
             "minute",
             "second",
+            "millisecond",
             "microsecond",
         }
         if reso.attrname not in valid_resos:

BUT HE DID NOT REMOVE THE DOUBLED "minute","second".

True - do you want to submit a new pull request (the old one would need rebasing anyway) to remove them then? As it would just be a clean-up, no tests would be required.

As an aside, gender-neutral language such as "BUT THEY DID NOT" would be appreciated

By the way, I want to become a pandas mantainer

Awesome! Perhaps start by becoming a contributor - if you're interested in maintenance, check the governance page: "The Project's Core Team will consist of Project Contributors who have produced contributions that are substantial in quality and quantity, and sustained over at least one year."

@boringow
Copy link
Author

I was using the individual term for developer, as it is one persone who changed it. But yeah! surely, I want to start helping and form part of this community :). I will create a new pull request referencing this issue, so we can close it.
Thanks for your help! @MarcoGorelli

@rhshadrach
Copy link
Member

@boringow The issue should remain open until the PR that closes it is merged.

@boringow
Copy link
Author

Okay! sorry about that I didn't know it.

@MarcoGorelli MarcoGorelli changed the title BUG: Pull request Update datetimes.py, issue 38118 CLN remove duplicated entries in valid_resos in pandas/core/indexes/datetimes.py Jan 31, 2021
@MarcoGorelli
Copy link
Member

They've opened #39503, which is clearer than this issue, so closing in favour of that one

@MarcoGorelli MarcoGorelli added this to the No action milestone Jan 31, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants