Parse IntervalArray and IntervalIndex from strings #41451

erikmannerfelt · 2021-05-13T10:10:01Z

Currently, when a DataFrame with an IntervalIndex is saved as a CSV, there is no easy way to parse the index again. This PR introduces the class methods IntervalArray.from_strings() and IntervalIndex.from_strings() to handle this:

import os
import tempfile

import numpy as np
import pandas as pd

# Create a DataFrame with an IntervalIndex
df = pd.DataFrame(data=[1, 2, 3], index=pd.IntervalIndex.from_breaks([2., 3., 4., 5.]))

# Create a temporary directory to save the csv in
temp_dir = tempfile.TemporaryDirectory()
# Join the path explicitly so the file ends up inside the temporary directory
csv_path = os.path.join(temp_dir.name, 'df.csv')
df.to_csv(csv_path)

print(df)

# Read the DataFrame containing the IntervalIndex column
df2 = pd.read_csv(csv_path)

# Convert the column to an IntervalIndex
df2.index = pd.IntervalIndex.from_strings(df2.iloc[:, 0])
df2.drop(columns=df2.columns[0], inplace=True)

print(df2)

# Validate that the original and parsed indices are the same
assert np.array_equal(df.index, df2.index)

            0
(2.0, 3.0]  1
(3.0, 4.0]  2
(4.0, 5.0]  3
            0
(2.0, 3.0]  1
(3.0, 4.0]  2
(4.0, 5.0]  3

As can be seen in the tests, the conversion supports each valid dtype of an Interval and raises descriptive exceptions if it fails.
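Because `from_strings` exists only in this PR, the same round trip can be approximated with released pandas. The following is a minimal sketch, not the PR's implementation: `parse_right_closed` is a hypothetical helper that assumes every entry is a right-closed interval with float endpoints, as written by the default `to_csv` output above.

```python
import io
import re

import pandas as pd

# Hypothetical helper (not pandas API): accepts only "(left, right]" strings.
_RIGHT_CLOSED = re.compile(r"\(\s*([^,\s]+)\s*,\s*([^,\s]+)\s*\]")


def parse_right_closed(strings):
    """Build an IntervalIndex from strings such as "(2.0, 3.0]"."""
    lefts, rights = [], []
    for text in strings:
        match = _RIGHT_CLOSED.fullmatch(text.strip())
        if match is None:
            raise ValueError(f"Could not parse {text!r} as a right-closed interval")
        lefts.append(float(match.group(1)))
        rights.append(float(match.group(2)))
    return pd.IntervalIndex.from_arrays(lefts, rights, closed="right")


df = pd.DataFrame(
    {"value": [1, 2, 3]},
    index=pd.IntervalIndex.from_breaks([2.0, 3.0, 4.0, 5.0]),
)

# Write to an in-memory buffer instead of a file to keep the sketch self-contained.
buffer = io.StringIO()
df.to_csv(buffer)
buffer.seek(0)

# The interval column comes back as plain strings; convert it by hand.
df2 = pd.read_csv(buffer, index_col=0)
df2.index = parse_right_closed(df2.index)

print(df2)
```

Unlike the proposed method, this sketch does no dtype inference: integer, datetime, and timedelta endpoints would need their own conversion step.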

closes ENH: Method to recover IntervalIndex when reloaing plain-text files #23595
tests added / passed
Ensure all linting tests pass, see here for how to run them
whatsnew entry

erikmannerfelt · 2021-05-22T08:32:46Z

MyPy misunderstood the type of a list because I mixed callable types and functions. I explicitly annotated a list before looping over it this time, so MyPy should be happier now!
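As an illustration of the annotation fix described above, here is a small standard-library sketch. The converter list and the `parse_hex` function are assumptions made for the example (the PR's list holds `int`, `float`, `to_datetime` and `to_timedelta`); the point is only that declaring the element type up front spares the type checker from inferring one for a list that mixes classes and plain functions.

```python
from __future__ import annotations

from typing import Any, Callable


def parse_hex(text: str) -> int:
    """Plain function, in contrast to the classes `int` and `float`."""
    return int(text, 16)


# Annotate the list before looping over it, so every element is treated
# as "a callable taking a string".
converters: list[Callable[[str], Any]] = [int, float, parse_hex]


def convert_first(text: str) -> Any:
    """Return the result of the first converter that accepts `text`."""
    for convert in converters:
        try:
            return convert(text)
        except ValueError:
            continue
    raise ValueError(f"No converter accepted {text!r}")


print(convert_first("12"), convert_first("1.5"), convert_first("ff"))
```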

Also, I fixed the doctest.

erikmannerfelt · 2021-06-01T08:21:31Z

@simonjayhawkins or others, could we trigger the workflows to see that everything works well now?

erikmannerfelt · 2021-06-01T12:52:00Z

Thank you for the feedback, @jreback! I implemented all comments that I understood. There was just one that I did not catch your meaning of:

this is very fixed to a single closed type - is there a reason?

Please let me know if this is now solved or where I can read more about what you mean.

Also, I had some issues with the CI, so I pulled from upstream. We'll see what happens this time..

erikmannerfelt · 2021-06-07T12:44:23Z

Hey! All checks passed a week ago except one that timed out after 60 minutes, so I tried merging from upstream again. Can we try the CI again now, @jreback @simonjayhawkins or others?

jreback · 2021-06-08T22:08:01Z

pandas/core/arrays/interval.py

+        for string in data:
+
+            # Try to match "(left, right]" where 'left' and 'right' are breaks.
+            breaks_match = re.match(r"\(.*,.*]", string)


compile this regex

you can likely make this support all of the closed types i think (see above)

is this regex actually strict enough? e.g. you should require this to match exactly (but whitespace is prob ok)

Compiled regex in 162045a.

All string representations of the closed types are the same (maybe room for future improvement here?), so from 162045a, the user can specify the closed type as an argument.
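For reference, the string form of a scalar `pd.Interval` in released pandas does change its brackets with the closed type, which is what makes it possible to infer `closed` from the text (support for all closed types was added later in this PR). The sketch below shows the four forms; `BRACKETS_TO_CLOSED` and `closed_from_string` are hypothetical names used only for illustration.

```python
import pandas as pd

# Hypothetical lookup table (not pandas API): outer brackets -> closed type.
BRACKETS_TO_CLOSED = {"(]": "right", "[)": "left", "[]": "both", "()": "neither"}


def closed_from_string(text):
    """Infer the closed type from the first and last character of an interval string."""
    text = text.strip()
    return BRACKETS_TO_CLOSED[text[0] + text[-1]]


for closed in ("right", "left", "both", "neither"):
    rendered = str(pd.Interval(0, 1, closed=closed))
    print(closed, rendered, closed_from_string(rendered))
```

An explicit `closed` argument is still useful as a consistency check, for example to reject a file that mixes closed types.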

Regarding the strictness, could you give an example for when this may pass incorrectly? In the tests, I'm trying different incorrect cases and can't find any undefined behaviour.
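To make the strictness question concrete, the sketch below compares the pattern from the diff with a stricter alternative. The `strict` pattern is an assumption for illustration, not what the PR ended up using, and whether the loose matches cause problems in the PR depends on the later conversion step, which is not shown here.

```python
import re

# Pattern from the diff, used with re.match (anchored at the start only).
loose = re.compile(r"\(.*,.*]")

# Hypothetical stricter pattern: exactly two non-empty endpoints, optional
# whitespace, and nothing before or after when used with fullmatch.
strict = re.compile(r"\(\s*([^,\s()\[\]]+)\s*,\s*([^,\s()\[\]]+)\s*\]")

candidates = ["(1, 2]", "(1, 2] trailing", "(1, 2, 3]", "(, ]", " (1, 2]"]

results = {}
for text in candidates:
    loose_ok = loose.match(text) is not None
    strict_ok = strict.fullmatch(text.strip()) is not None
    results[text] = (loose_ok, strict_ok)
    print(f"{text!r:20} loose={loose_ok} strict={strict_ok}")
```

The loose pattern accepts trailing text, three endpoints, and empty endpoints, while rejecting a valid interval with leading whitespace; the strict pattern does the opposite in each case.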

jreback · 2021-06-08T22:10:45Z

pandas/core/arrays/interval.py

+                )
+
+            conversions: list[Callable] = [int, float, to_datetime, to_timedelta]
+            # Try to parse the breaks first as floats, then datetime, then timedelta.


hmm i am not a big fan of this, can we make the dtype strict e.g. interval[float] would completely solve this issue (forcing the user to do this)

the problem here is that there IS ambiguity.
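A small sketch of the ambiguity: the same string is accepted by several converters, so the result of a try-in-order scheme depends entirely on the order chosen. `first_successful` is a hypothetical helper, not code from the PR.

```python
import pandas as pd


def first_successful(text, converters):
    """Return the result of the first converter that accepts `text`."""
    for convert in converters:
        try:
            return convert(text)
        except (ValueError, TypeError):
            continue
    raise ValueError(f"No converter accepted {text!r}")


text = "2020"

# Same input, different order of attempts, different type of result.
as_number = first_successful(text, [int, float, pd.to_datetime])
as_date = first_successful(text, [pd.to_datetime, int, float])

print(type(as_number).__name__, as_number)
print(type(as_date).__name__, as_date)
```

Requiring an explicit dtype removes the dependence on ordering, at the cost of making the caller state the type.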

The case that this string parsing functionality is relevant (and the only one I see for now) is reading from text files. The other parts of pandas infer types dynamically, so wouldn't it be best to do this by default here as well?

In 162045a, I added an optional dtype argument. If specified (not None), it will parse string representations first and then rely on IntervalArray.from_arrays() to (potentially) perform the more exact conversion. The IntervalArray.from_arrays() method unfortunately doesn't parse string representations of numeric or datetime-like values, so the IntervalArray.from_strings() method has to implement it in some way:

In [1]: import pandas as pd

In [2]: pd.arrays.IntervalArray.from_arrays(["0", "1", "2"], ["0", "1", "2"])
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-2-677073bd0961> in <module>
----> 1 pd.arrays.IntervalArray.from_arrays(["0", "1", "2"], ["0", "1", "2"])

~/Projects/pandas/pandas/core/arrays/interval.py in from_arrays(cls, left, right, closed, copy, dtype)
    479         right = _maybe_convert_platform_interval(right)
    480
--> 481         return cls._simple_new(
    482             left, right, closed, copy=copy, dtype=dtype, verify_integrity=True
    483         )

~/Projects/pandas/pandas/core/arrays/interval.py in _simple_new(cls, left, right, closed, copy, dtype, verify_integrity)
    293                 "for IntervalArray"
    294             )
--> 295             raise TypeError(msg)
    296         elif isinstance(left, ABCPeriodIndex):
    297             msg = "Period dtypes are not supported, use a PeriodIndex instead"

TypeError: category, object, and string subtypes are not supported for IntervalArray
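With released pandas, the string endpoints can be converted explicitly before calling `IntervalArray.from_arrays()`. The sketch below uses `pd.to_numeric` for that step; it assumes numeric endpoints, and datetime-like endpoints would need `pd.to_datetime` or `pd.to_timedelta` instead.

```python
import pandas as pd

left_strings = ["0", "1", "2"]
right_strings = ["1", "2", "3"]

# Passing the strings directly is rejected, as in the traceback above.
try:
    pd.arrays.IntervalArray.from_arrays(left_strings, right_strings)
except TypeError as error:
    print("TypeError:", error)

# Convert the endpoints first, then build the array.
left = pd.to_numeric(left_strings)
right = pd.to_numeric(right_strings)
intervals = pd.arrays.IntervalArray.from_arrays(left, right, closed="right")

print(intervals)
```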

erikmannerfelt · 2021-06-09T10:12:06Z

Thanks for the feedback, @jreback! I think I solved all of your requests this time.

erikmannerfelt · 2021-06-29T12:06:18Z

I implemented all fixes a few weeks ago and the checks pass!

What's the next step, @jreback @simonjayhawkins ?

github-actions · 2021-08-18T00:02:37Z

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

erikmannerfelt · 2021-08-18T13:36:25Z

This pull request is still relevant and I still request a new review.

erikmannerfelt · 2021-08-18T13:37:11Z

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

I wish for this to stay open.

jreback

the type conversion code needs some updating & a number of tests. also pls be sure that you are testing each time things are raising.

jreback · 2021-08-31T23:45:12Z

looked ok the last time. can you address open items and merge master; ping on green

erikmannerfelt · 2021-09-01T11:21:24Z

looked ok the last time. can you address open items and merge master; ping on green

Hi @jreback. Thanks for all the good comments! Field season ends this week so I'll try to get it fixed next week.

erikmannerfelt · 2021-09-16T09:14:50Z

Hi again @jreback and others! Sorry for the delay with the fixes. The comments I got led to significant improvements, specifically with the added support of all the closed types.

It seems CI may need to be rerun (Posix / pytest (actions-39-slow.yaml, slow) (pull_request) fails saying Could not find conda environment: pandas-dev).

From now on, I can respond more quickly if further fixes are needed.

Thanks again for the help!

jreback · 2021-09-16T17:24:11Z

pandas/core/indexes/interval.py

+            ),
+        }
+    )
+    def from_strings(


why are you adding public api here?

By making it public, do you mean with the @Appender decorator or by not preceding the name with an underscore?

If the latter, how would one otherwise use this function?

jreback · 2021-09-16T17:25:23Z

pandas/core/arrays/interval.py

+            ),
+        }
+    )
+    def from_strings(


why is this public? shouldn't this be _from_sequence_of_strings?

Fixed in 40af75d.

Uuuh, changing that name went out of my depth a little... Apparently there are more meanings to that name? Now tests are failing in pandas/tests/extension/test_interval.py with testing EA types (I don't know what that means). What's the best way forward here??
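For context, `_from_sequence_of_strings` is one of the hooks of the pandas ExtensionArray ("EA") interface, which is probably why renaming the method to it brings the shared extension tests into play. The sketch below calls the hook on a built-in nullable integer array to show what it is expected to do. It is a private method, so treat the exact signature as an assumption that may differ between pandas versions.

```python
import pandas as pd

dtype = pd.Int64Dtype()

# Look up the array class that backs this extension dtype (IntegerArray).
array_type = dtype.construct_array_type()

# The hook receives a sequence of strings and must return an array of its own type.
parsed = array_type._from_sequence_of_strings(["1", "2", "3"], dtype=dtype)

print(type(parsed).__name__, parsed.dtype)
print(list(parsed))
```

An IntervalArray implementation under this name would be expected to honor the same contract, including the `dtype` keyword.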

jreback · 2021-10-04T00:32:26Z

can you merge master

jreback · 2021-12-23T22:51:43Z

@mntss can you address comments and merge master (i don't think we had much left)

erikmannerfelt · 2022-01-14T14:23:46Z

@jreback perhaps there are others who can help? There's no rush for me anymore, since I just implemented these functionalities separately for the project I'm working on, but I think it would be a great feature for more people than I!

To recapitulate, everything worked quite fine, but the renaming of some methods (e.g. _from_sequence_of_strings) led to errors that I simply don't understand. Either we revert these changes, or someone more knowledgeable than I chimes in!

mroeschke · 2022-03-06T01:11:19Z

Thanks for the PR, but it appears to have gone stale. If you or anyone else would like to pick this PR back up we can reopen, closing for now.

erikmannerfelt added 2 commits May 13, 2021 11:45

Added IntervalArray.from_strings() and IntervalIndex.from_strings() t…

00c7e52

…o parse string representations of Intervals

Ran pre-commit program on modified files.

d2625a5

simonjayhawkins added the Enhancement, Interval (Interval data type), and IO CSV (read_csv, to_csv) labels May 21, 2021

erikmannerfelt added 2 commits May 22, 2021 10:19

Merge branch 'master' of github.com:pandas-dev/pandas into intervalin…

88f4f69

…dex_from_string

Documented exception to mypy linting and fixed doctest.

27ae2bf

jreback requested changes Jun 1, 2021

erikmannerfelt added 4 commits June 1, 2021 13:02

Replaced str.find() calls with more robust regex matching

e6e072a

Added parsing for int64 and improved tests.

6c1b871

Merge branch 'master' of github.com:pandas-dev/pandas into intervalin…

5fd1526

…dex_from_string

Moved error test to separate test function.

d646802

erikmannerfelt requested a review from jreback June 1, 2021 12:51

Merge branch 'master' of github.com:pandas-dev/pandas into intervalin…

1355a17

…dex_from_string

Merge branch 'master' of github.com:pandas-dev/pandas into intervalin…

ec8a4cb

…dex_from_string

jreback requested changes Jun 8, 2021

erikmannerfelt added 2 commits June 9, 2021 10:14

Merge branch 'master' of github.com:pandas-dev/pandas into intervalin…

ab76e68

…dex_from_string

* Simplified pandas imports

162045a

* Added argument for different types of closed intervals
* Added dtype argument to allow strict typing.
* Changed tests to use pytest.mark.parametrize
* Improved documentation.

erikmannerfelt requested a review from jreback June 9, 2021 10:12

Merge branch 'master' into intervalindex_from_string

e702840

github-actions bot added the Stale label Aug 18, 2021

lithomas1 added Needs Review and removed Stale labels Aug 18, 2021

jreback requested changes Aug 19, 2021

Merge branch 'master' of github.com:pandas-dev/pandas into intervalin…

f8ab67c

…dex_from_string

lithomas1 removed the Needs Review label Sep 8, 2021

jreback added this to the 1.4 milestone Sep 9, 2021

erikmannerfelt added 3 commits September 15, 2021 11:58

Merge branch 'master' of github.com:pandas-dev/pandas into intervalin…

0137f3e

…dex_from_string

Added support for all closed types. Improved tests.

a8cc3b2

Merge branch 'intervalindex_from_string' of github.com:erikmannerfelt…

cb2ec12

…/pandas into intervalindex_from_string

rerun CI

27a5a2e

jreback requested changes Sep 16, 2021

Renamed IntervalArray method and removed nested try/except clauses.

40af75d

Merge branch 'master' of github.com:pandas-dev/pandas into intervalin…

cffc164

…dex_from_string

jreback removed this from the 1.4 milestone Dec 29, 2021

Merge branch 'pandas-dev:main' into intervalindex_from_string

8b29251

mroeschke closed this Mar 6, 2022
