-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
Parse IntervalArray and IntervalIndex from strings #41451
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Parse IntervalArray and IntervalIndex from strings #41451
Conversation
…o parse string representations of Intervals
MyPy misunderstood the type of a list because I mixed callable types and functions. I explicitly annotated a list before looping over it this time, so MyPy should be happier now! Also, I fixed the doctest. |
@simonjayhawkins or others, could we trigger the workflows to see that everything works well now? |
Thank you for the feedback, @jreback! I implemented all comments that I understood. There was just one that I did not catch your meaning of:
Please let me know if this is now solved or where I can read more about what you mean. Also, I had some issues with the CI, so I pulled from upstream. We'll see what happens this time.. |
Hey! All checks passed a week ago except one that timed out after 60 minutes, so I tried merging from upstream again. Can we try the CI again now, @jreback @simonjayhawkins or others? |
pandas/core/arrays/interval.py
Outdated
for string in data: | ||
|
||
# Try to match "(left, right]" where 'left' and 'right' are breaks. | ||
breaks_match = re.match(r"\(.*,.*]", string) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
compile this regex
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can likely make this support all of the closed types i think (see above)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this regex actually strict enough? e.g. you should require this to match exactly (but whitespace is prob ok)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Complied regex in 162045a.
All string representations of the closed types are the same (maybe room for future improvement here?), so from 162045a, the user can specify the closed type as an argument.
Regarding the strictness, could you give an example for when this may pass incorrectly? In the tests, I'm trying different incorrect cases and can't find any undefined behaviour.
pandas/core/arrays/interval.py
Outdated
) | ||
|
||
conversions: list[Callable] = [int, float, to_datetime, to_timedelta] | ||
# Try to parse the breaks first as floats, then datetime, then timedelta. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm i am not a big fan of this, can we make the dtype strict e.g. interval[float]
would completely solve this issue (forcing the user to do this)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the problem here is that there IS ambiguity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The case that this string parsing functionality is relevant (and the only one I see for now) is reading from text files. The other parts of pandas infer types dynamically, so wouldn't it be best to do this by default here as well?
In 162045a, I added an optional dtype
argument. If specified (not None
), it will parse string representations first and then rely on IntervalArray.from_arrays()
to (potentially) perform the more exact conversion. The IntervalArray.from_arrays()
method unfortunately doesn't parse string representations of numeric or datetime-like values, so the IntervalArray.from_strings()
method has to implement it in some way:
In [1]: import pandas as pd
In [2]: pd.arrays.IntervalArray.from_arrays(["0", "1", "2"], ["0", "1", "2"])
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-2-677073bd0961> in <module>
----> 1 pd.arrays.IntervalArray.from_arrays(["0", "1", "2"], ["0", "1", "2"])
~/Projects/pandas/pandas/core/arrays/interval.py in from_arrays(cls, left, right, closed, copy, dtype)
479 right = _maybe_convert_platform_interval(right)
480
--> 481 return cls._simple_new(
482 left, right, closed, copy=copy, dtype=dtype, verify_integrity=True
483 )
~/Projects/pandas/pandas/core/arrays/interval.py in _simple_new(cls, left, right, closed, copy, dtype, verify_integrity)
293 "for IntervalArray"
294 )
--> 295 raise TypeError(msg)
296 elif isinstance(left, ABCPeriodIndex):
297 msg = "Period dtypes are not supported, use a PeriodIndex instead"
TypeError: category, object, and string subtypes are not supported for IntervalArray
* Added argument for different types of closed intervals * Added dtype argument to allow strict typing. * Changed tests to use pytest.mark.parametrize * Improved documentation.
Thanks for the feedback, @jreback! I think I solved all of your requests this time. |
I have implemented all fixes a few weeks ago and it passes the checks! What's the next step, @jreback @simonjayhawkins ? |
This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this. |
This pull request is still relevant and I still request a new review. |
I wish for this to stay open. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the type conversion code needs some updating & a number of tests. also pls be sure that you are testing each time things are raising.
looked ok the last time. can you address open items and merge master;; ping on green |
Hi @jreback. Thanks for all the good comments! Field season ends this week so I'll try to get it fixed next week. |
…/pandas into intervalindex_from_string
Hi again @jreback and others! Sorry for the delay with the fixes. The comments I got led to significant improvements, specifically with the added support of all the closed types. It seems CI may need to be rerun ( I am available quicker from now on regarding potential other fixes. Thanks again for the help! |
), | ||
} | ||
) | ||
def from_strings( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why are you adding public api here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By making it public, do you mean with the @Appender
decorator or by not preceding the name with a lowercase?
If the latter, how would one otherwise use this function?
pandas/core/arrays/interval.py
Outdated
), | ||
} | ||
) | ||
def from_strings( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this public? shouldn't this be _from_sequence_of_strings
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in 40af75d.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Uuuh, changing that name went out of my depth a little... Apparently there are more meanings to that name? Now tests are failing in pandas/tests/extension/test_interval.py
with testing EA types (I don't know what that means). What's the best way forward here??
can you merge master |
@mntss can you address comments and merge master (i don't think we had much left) |
@jreback perhaps there are others who can help? There's no rush for me anymore, since I just implemented these functionalities separately for the project I'm working on, but I think it would be a great feature for more people than I! To recapitulate, everything worked quite fine, but the renaming of some methods (e.g. |
Thanks for the PR, but it appears to have gone stale. If you or anyone else would like to pick this PR back up we can reopen, closing for now. |
Currently, when saving a DataFrame with an IntervalIndex as a CSV, there is no easy way to parse it again. With this PR, class methods are introduced to handle this:
As can be seen in the tests, the conversion supports each valid dtype of an Interval and raises descriptive exceptions if it fails.