Skip to content

Parse IntervalArray and IntervalIndex from strings #41451

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

erikmannerfelt
Copy link

Currently, when saving a DataFrame with an IntervalIndex as a CSV, there is no easy way to parse it again. With this PR, class methods are introduced to handle this:

import tempfile

import numpy as np
import pandas as pd

# Create a DataFrame with an IntervalIndex
df = pd.DataFrame(data=[1, 2, 3], index=pd.IntervalIndex.from_breaks([2., 3., 4., 5.]))

# Create a temporary directory to save the csv in
temp_dir = tempfile.TemporaryDirectory()
df.to_csv(temp_dir.name + 'df.csv')

print(df)

# Read the DataFrame containing the IntervalIndex column
df2 = pd.read_csv(temp_dir.name + 'df.csv')

# Convert the column to an IntervalIndex
df2.index = pd.IntervalIndex.from_strings(df2.iloc[:, 0])
df2.drop(columns=df2.columns[0], inplace=True)

print(df2)

# Validate that the original and parsed indices are the same
assert np.array_equal(df.index, df2.index)
            0
(2.0, 3.0]  1
(3.0, 4.0]  2
(4.0, 5.0]  3
            0
(2.0, 3.0]  1
(3.0, 4.0]  2
(4.0, 5.0]  3

As can be seen in the tests, the conversion supports each valid dtype of an Interval and raises descriptive exceptions if it fails.

@simonjayhawkins simonjayhawkins added Enhancement Interval Interval data type IO CSV read_csv, to_csv labels May 21, 2021
@erikmannerfelt
Copy link
Author

MyPy misunderstood the type of a list because I mixed callable types and functions. I explicitly annotated a list before looping over it this time, so MyPy should be happier now!

Also, I fixed the doctest.

@erikmannerfelt
Copy link
Author

@simonjayhawkins or others, could we trigger the workflows to see that everything works well now?

@erikmannerfelt erikmannerfelt requested a review from jreback June 1, 2021 12:51
@erikmannerfelt
Copy link
Author

Thank you for the feedback, @jreback! I implemented all comments that I understood. There was just one that I did not catch your meaning of:

this is very fixed to a single closed type - is there a reason?

Please let me know if this is now solved or where I can read more about what you mean.

Also, I had some issues with the CI, so I pulled from upstream. We'll see what happens this time..

@erikmannerfelt
Copy link
Author

Hey! All checks passed a week ago except one that timed out after 60 minutes, so I tried merging from upstream again. Can we try the CI again now, @jreback @simonjayhawkins or others?

for string in data:

# Try to match "(left, right]" where 'left' and 'right' are breaks.
breaks_match = re.match(r"\(.*,.*]", string)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

compile this regex

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can likely make this support all of the closed types i think (see above)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this regex actually strict enough? e.g. you should require this to match exactly (but whitespace is prob ok)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Complied regex in 162045a.

All string representations of the closed types are the same (maybe room for future improvement here?), so from 162045a, the user can specify the closed type as an argument.

Regarding the strictness, could you give an example for when this may pass incorrectly? In the tests, I'm trying different incorrect cases and can't find any undefined behaviour.

)

conversions: list[Callable] = [int, float, to_datetime, to_timedelta]
# Try to parse the breaks first as floats, then datetime, then timedelta.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm i am not a big fan of this, can we make the dtype strict e.g. interval[float] would completely solve this issue (forcing the user to do this)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the problem here is that there IS ambiguity.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The case that this string parsing functionality is relevant (and the only one I see for now) is reading from text files. The other parts of pandas infer types dynamically, so wouldn't it be best to do this by default here as well?

In 162045a, I added an optional dtype argument. If specified (not None), it will parse string representations first and then rely on IntervalArray.from_arrays() to (potentially) perform the more exact conversion. The IntervalArray.from_arrays() method unfortunately doesn't parse string representations of numeric or datetime-like values, so the IntervalArray.from_strings() method has to implement it in some way:

In [1]: import pandas as pd

In [2]: pd.arrays.IntervalArray.from_arrays(["0", "1", "2"], ["0", "1", "2"])
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-2-677073bd0961> in <module>
----> 1 pd.arrays.IntervalArray.from_arrays(["0", "1", "2"], ["0", "1", "2"])

~/Projects/pandas/pandas/core/arrays/interval.py in from_arrays(cls, left, right, closed, copy, dtype)
    479         right = _maybe_convert_platform_interval(right)
    480
--> 481         return cls._simple_new(
    482             left, right, closed, copy=copy, dtype=dtype, verify_integrity=True
    483         )

~/Projects/pandas/pandas/core/arrays/interval.py in _simple_new(cls, left, right, closed, copy, dtype, verify_integrity)
    293                 "for IntervalArray"
    294             )
--> 295             raise TypeError(msg)
    296         elif isinstance(left, ABCPeriodIndex):
    297             msg = "Period dtypes are not supported, use a PeriodIndex instead"

TypeError: category, object, and string subtypes are not supported for IntervalArray

* Added argument for different types of closed intervals
* Added dtype argument to allow strict typing.
* Changed tests to use pytest.mark.parametrize
* Improved documentation.
@erikmannerfelt
Copy link
Author

Thanks for the feedback, @jreback! I think I solved all of your requests this time.

@erikmannerfelt erikmannerfelt requested a review from jreback June 9, 2021 10:12
@erikmannerfelt
Copy link
Author

I have implemented all fixes a few weeks ago and it passes the checks!

What's the next step, @jreback @simonjayhawkins ?

@github-actions
Copy link
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Aug 18, 2021
@erikmannerfelt
Copy link
Author

This pull request is still relevant and I still request a new review.

@erikmannerfelt
Copy link
Author

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

I wish for this to stay open.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the type conversion code needs some updating & a number of tests. also pls be sure that you are testing each time things are raising.

@jreback
Copy link
Contributor

jreback commented Aug 31, 2021

looked ok the last time. can you address open items and merge master;; ping on green

@erikmannerfelt
Copy link
Author

looked ok the last time. can you address open items and merge master;; ping on green

Hi @jreback. Thanks for all the good comments! Field season ends this week so I'll try to get it fixed next week.

@jreback jreback added this to the 1.4 milestone Sep 9, 2021
@erikmannerfelt
Copy link
Author

Hi again @jreback and others! Sorry for the delay with the fixes. The comments I got led to significant improvements, specifically with the added support of all the closed types.

It seems CI may need to be rerun (Posix / pytest (actions-39-slow.yaml, slow) (pull_request) fails saying Could not find conda environment: pandas-dev).

I am available quicker from now on regarding potential other fixes.

Thanks again for the help!

),
}
)
def from_strings(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are you adding public api here?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By making it public, do you mean with the @Appender decorator or by not preceding the name with a lowercase?

If the latter, how would one otherwise use this function?

),
}
)
def from_strings(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this public? shouldn't this be _from_sequence_of_strings?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 40af75d.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Uuuh, changing that name went out of my depth a little... Apparently there are more meanings to that name? Now tests are failing in pandas/tests/extension/test_interval.py with testing EA types (I don't know what that means). What's the best way forward here??

@jreback
Copy link
Contributor

jreback commented Oct 4, 2021

can you merge master

@jreback
Copy link
Contributor

jreback commented Dec 23, 2021

@mntss can you address comments and merge master (i don't think we had much left)

@jreback jreback removed this from the 1.4 milestone Dec 29, 2021
@erikmannerfelt
Copy link
Author

@jreback perhaps there are others who can help? There's no rush for me anymore, since I just implemented these functionalities separately for the project I'm working on, but I think it would be a great feature for more people than I!

To recapitulate, everything worked quite fine, but the renaming of some methods (e.g. _from_sequence_of_strings) led to errors that I simply don't understand. Either we revert these changes, or someone more knowledgeable than I chimes in!

@mroeschke
Copy link
Member

Thanks for the PR, but it appears to have gone stale. If you or anyone else would like to pick this PR back up we can reopen, closing for now.

@mroeschke mroeschke closed this Mar 6, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement Interval Interval data type IO CSV read_csv, to_csv
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: Method to recover IntervalIndex when reloaing plain-text files
5 participants