Skip to content

ENH: improve Period parsing #48000

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
1 of 3 tasks
joooeey opened this issue Aug 7, 2022 · 1 comment
Open
1 of 3 tasks

ENH: improve Period parsing #48000

joooeey opened this issue Aug 7, 2022 · 1 comment
Labels
Enhancement Period Period data type

Comments

@joooeey
Copy link
Contributor

joooeey commented Aug 7, 2022

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

I did a long analysis of what kind of Period objects can and cannot produced from a string alone. I.e. pd.Period("2022Q2") gives a valid Period representing April-June 2022. I found lots of blank spots though. Here's a description in words. In the Additional Context section is a long list of examples.

Types of Period with lots of obvious ways to create them from just a string

Year, quarter, month, day, minute, second, millisecond, microsecond, nanosecond

Types of Period that can't be created from a string alone at all

There are other ways to create these frequencies using the full suite of arguments to the Period class but they can't be created with a string alone.

  1. Business years not going from January to December. I don't know how to specify those unambiguously though.
  2. Quarters starting with a different month than January/April/July/October. Again I'm not aware of standardized strings to specify these.
  3. Weeks. There would be two obvious ways to implement this:
    a. ISO 8601 week strings such as "2022-W31". This only covers weeks starting on a Monday. This is supported by dateutil.parser.isoparse, so would be easy to implement. However, isoparse just returns a datetime in this case and doesn't indicate that it refers to an entire week.
    b. Using the string produced by printing a week period: str(pd.Period("2017-01-25", freq="W")) => "2017-01-23/2017-01-29". This often fails silently and creates a Period with freq="min" or raises a DateParseError (see 6. & 7. for details).
    c. Using strings giving the start and end day of the week but in a slightly different form than given in b such as "20170123-20170129".
  4. Business days. No idea what their use case is and if this is a problem.

Types of Period that can't be created from their string representation alone

  1. Hours are interpreted as minutes because their string representation is "2010-11-12 13:00". However hours can be created in different ways from a string alone e.g. from "2010-11-12 13h"

In addition the frequencies listed in 1-4 can also not be recovered from their string representation. I.e. given a original = Period(...) with one of these frequencies, Period(str(original)) != original. However, adding the frequency information usually works fine: Period(str(original), freq=original.freq) == original. There is one exception though:

Some weeks raise a DateParseError when you try to re-create them from their string representation:

  1. Weeks in the 60s, 70s, 80s, or 90s of any century
  2. Weeks from the 24th century onwards

This is because in "2481-12-29/2482-01-04" the year 2482 is interpreted as 24 hours and 82 minutes. One can argue that we don't need to be able to re-convert the string representation of a Period back to a Period. However, there are tests in the codebase that rely on this functionality. These happen to be written with dates in the early 21st century so they work. Nevertheless, I propose to remove some of them in #47990.

Other specific standardized string representations that are not understood:

  1. ISO 8601 ordinal dates such as "1981-095" for "1981-04-05". Though probably not widely used, this is also supported by dateutil.parser.isoparse.

Feature Description

Process:

  1. Decide which cases given in the Problem Description we even wish to cover.
  2. Define and document acceptable inputs.
  3. Write tests.
  4. Adjust pandas/_libs/tslibs/parsing.pyx.

Alternative Solutions

  • We could expect users to do their own conversions tailored to their use cases. We can't cover all possible strings that might show up in databases anyways. After all the Period class and related classes provide plenty of other arguments besides value to control the result.
  • Pandas developers should be careful to not parse string representations of Period objects.

Additional Context

Full list of tested strings and other Period constructors.

code to create `DataFrame`
import pandas as pd
from pandas.io import clipboard

period_strings = [
    # ISO 8601 examples from Wikipedia
    "2022-08-07",
    "2022-08-07T06:41:54+00:00",
    "2022-08-07T06:41:54Z",
    "20220807T064154Z",
    "2022-W31",
    "2022-W31-7",
    "2022-219",
    # examples from the Pandas docs
    "2002Q3",
    "2000",
    "1/1/2011",
    "2011-01",
    "2012-1-1 19:00",
    # couldn't find docs but these work
    "2012-1-1 12:34:56.1",
    "2012-1-1 12:34:56.123",
    "2012-1-1 12:34:56.123456",
    "2012-1-1 12:34:56.123456789",
    "2010-11-12 13h",
]

def eval_period(s):
    try:
        return eval(s, None, {"dt": "2012-1-1 12:34:56.123456789"})
    except Exception as e:
        return e.__class__.__name__

def recreate_yes_no(old, use):
    assert "str" in use

    if not isinstance(old, pd.Period):
        return "N/A"

    freq = old.freq if "freq" in use else None

    s = str(old)
    try:
        new = pd.Period(s, freq=freq)
    except Exception as e:
        return e.__class__.__name__
    if new == old:
        return "== p"
    return f"failed: freq={new.freq}"


period_strings = pd.Series(period_strings)
str_constructors = period_strings.apply(lambda s: f"pd.Period( '{s}' )")

freq_strings = pd.Series(
    pd._libs.tslibs.dtypes._reverse_period_code_map.values()
)
freq_constructors = freq_strings.apply(
    lambda s: f"pd.Period( dt, freq='{s}' )"
)

years = pd.Series(range(2002, 2500, 30))
week_constructors = years.apply(
    lambda a: f"pd.Period( freq='W', year={a} )"
)

df = pd.DataFrame()
df["constructors"] = pd.concat(
    [str_constructors, freq_constructors, week_constructors], ignore_index=True
)

df["str(p)"] = df["constructors"].apply(eval_period)
df["p.freq"] = df["str(p)"].apply(lambda p: getattr(p, "freq", "N/A"))
df["pd.Period(str(p))"] = df["str(p)"].apply(recreate_yes_no, use=["str"])
df["pd.Period(str(p), freq=p.freq)"] = df["str(p)"].apply(
    recreate_yes_no, use=["str", "freq"]
)

clipboard.copy(df.to_markdown())

Note: dt = "2012-1-1 12:34:56.123456789"

constructors str(p) p.freq pd.Period(str(p)) pd.Period(str(p), freq=p.freq)
0 pd.Period( '2022-08-07' ) 2022-08-07 Day == p == p
1 pd.Period( '2022-08-07T06:41:54+00:00' ) 2022-08-07 06:41:54 Second == p == p
2 pd.Period( '2022-08-07T06:41:54Z' ) 2022-08-07 06:41:54 Second == p == p
3 pd.Period( '20220807T064154Z' ) 2022-08-07 06:41:54 Second == p == p
4 pd.Period( '2022-W31' ) DateParseError N/A N/A N/A
5 pd.Period( '2022-W31-7' ) DateParseError N/A N/A N/A
6 pd.Period( '2022-219' ) DateParseError N/A N/A N/A
7 pd.Period( '2002Q3' ) 2002Q3 QuarterEnd: startingMonth=12 == p == p
8 pd.Period( '2000' ) 2000 YearEnd: month=12 == p == p
9 pd.Period( '1/1/2011' ) 2011-01-01 Day == p == p
10 pd.Period( '2011-01' ) 2011-01 MonthEnd == p == p
11 pd.Period( '2012-1-1 19:00' ) 2012-01-01 19:00 Minute == p == p
12 pd.Period( '2012-1-1 12:34:56.1' ) 2012-01-01 12:34:56.100 Milli == p == p
13 pd.Period( '2012-1-1 12:34:56.123' ) 2012-01-01 12:34:56.123 Milli == p == p
14 pd.Period( '2012-1-1 12:34:56.123456' ) 2012-01-01 12:34:56.123456 Micro == p == p
15 pd.Period( '2012-1-1 12:34:56.123456789' ) 2012-01-01 12:34:56.123456789 Nano == p == p
16 pd.Period( '2010-11-12 13h' ) 2010-11-12 13:00 Hour failed: freq=Minute == p
17 pd.Period( dt, freq='A-DEC' ) 2012 YearEnd: month=12 == p == p
18 pd.Period( dt, freq='A-JAN' ) 2012 YearEnd: month=1 failed: freq=YearEnd: month=12 == p
19 pd.Period( dt, freq='A-FEB' ) 2012 YearEnd: month=2 failed: freq=YearEnd: month=12 == p
20 pd.Period( dt, freq='A-MAR' ) 2012 YearEnd: month=3 failed: freq=YearEnd: month=12 == p
21 pd.Period( dt, freq='A-APR' ) 2012 YearEnd: month=4 failed: freq=YearEnd: month=12 == p
22 pd.Period( dt, freq='A-MAY' ) 2012 YearEnd: month=5 failed: freq=YearEnd: month=12 == p
23 pd.Period( dt, freq='A-JUN' ) 2012 YearEnd: month=6 failed: freq=YearEnd: month=12 == p
24 pd.Period( dt, freq='A-JUL' ) 2012 YearEnd: month=7 failed: freq=YearEnd: month=12 == p
25 pd.Period( dt, freq='A-AUG' ) 2012 YearEnd: month=8 failed: freq=YearEnd: month=12 == p
26 pd.Period( dt, freq='A-SEP' ) 2012 YearEnd: month=9 failed: freq=YearEnd: month=12 == p
27 pd.Period( dt, freq='A-OCT' ) 2012 YearEnd: month=10 failed: freq=YearEnd: month=12 == p
28 pd.Period( dt, freq='A-NOV' ) 2012 YearEnd: month=11 failed: freq=YearEnd: month=12 == p
29 pd.Period( dt, freq='Q-DEC' ) 2012Q1 QuarterEnd: startingMonth=12 == p == p
30 pd.Period( dt, freq='Q-JAN' ) 2012Q4 QuarterEnd: startingMonth=1 failed: freq=QuarterEnd: startingMonth=12 == p
31 pd.Period( dt, freq='Q-FEB' ) 2012Q4 QuarterEnd: startingMonth=2 failed: freq=QuarterEnd: startingMonth=12 == p
32 pd.Period( dt, freq='Q-MAR' ) 2012Q4 QuarterEnd: startingMonth=3 failed: freq=QuarterEnd: startingMonth=12 == p
33 pd.Period( dt, freq='Q-APR' ) 2012Q3 QuarterEnd: startingMonth=4 failed: freq=QuarterEnd: startingMonth=12 == p
34 pd.Period( dt, freq='Q-MAY' ) 2012Q3 QuarterEnd: startingMonth=5 failed: freq=QuarterEnd: startingMonth=12 == p
35 pd.Period( dt, freq='Q-JUN' ) 2012Q3 QuarterEnd: startingMonth=6 failed: freq=QuarterEnd: startingMonth=12 == p
36 pd.Period( dt, freq='Q-JUL' ) 2012Q2 QuarterEnd: startingMonth=7 failed: freq=QuarterEnd: startingMonth=12 == p
37 pd.Period( dt, freq='Q-AUG' ) 2012Q2 QuarterEnd: startingMonth=8 failed: freq=QuarterEnd: startingMonth=12 == p
38 pd.Period( dt, freq='Q-SEP' ) 2012Q2 QuarterEnd: startingMonth=9 failed: freq=QuarterEnd: startingMonth=12 == p
39 pd.Period( dt, freq='Q-OCT' ) 2012Q1 QuarterEnd: startingMonth=10 failed: freq=QuarterEnd: startingMonth=12 == p
40 pd.Period( dt, freq='Q-NOV' ) 2012Q1 QuarterEnd: startingMonth=11 failed: freq=QuarterEnd: startingMonth=12 == p
41 pd.Period( dt, freq='M' ) 2012-01 MonthEnd == p == p
42 pd.Period( dt, freq='W-SUN' ) 2011-12-26/2012-01-01 Week: weekday=6 failed: freq=Minute == p
43 pd.Period( dt, freq='W-MON' ) 2011-12-27/2012-01-02 Week: weekday=0 failed: freq=Minute == p
44 pd.Period( dt, freq='W-TUE' ) 2011-12-28/2012-01-03 Week: weekday=1 failed: freq=Minute == p
45 pd.Period( dt, freq='W-WED' ) 2011-12-29/2012-01-04 Week: weekday=2 failed: freq=Minute == p
46 pd.Period( dt, freq='W-THU' ) 2011-12-30/2012-01-05 Week: weekday=3 failed: freq=Minute == p
47 pd.Period( dt, freq='W-FRI' ) 2011-12-31/2012-01-06 Week: weekday=4 failed: freq=Minute == p
48 pd.Period( dt, freq='W-SAT' ) 2012-01-01/2012-01-07 Week: weekday=5 failed: freq=Minute == p
49 pd.Period( dt, freq='B' ) 2012-01-02 BusinessDay failed: freq=Day == p
50 pd.Period( dt, freq='D' ) 2012-01-01 Day == p == p
51 pd.Period( dt, freq='H' ) 2012-01-01 12:00 Hour failed: freq=Minute == p
52 pd.Period( dt, freq='T' ) 2012-01-01 12:34 Minute == p == p
53 pd.Period( dt, freq='S' ) 2012-01-01 12:34:56 Second == p == p
54 pd.Period( dt, freq='L' ) 2012-01-01 12:34:56.123 Milli == p == p
55 pd.Period( dt, freq='U' ) 2012-01-01 12:34:56.123456 Micro == p == p
56 pd.Period( dt, freq='N' ) 2012-01-01 12:34:56.123456789 Nano == p == p
57 pd.Period( freq='W', year=2002 ) 2001-12-31/2002-01-06 Week: weekday=6 failed: freq=Minute == p
58 pd.Period( freq='W', year=2032 ) 2031-12-29/2032-01-04 Week: weekday=6 failed: freq=Minute == p
59 pd.Period( freq='W', year=2062 ) 2061-12-26/2062-01-01 Week: weekday=6 DateParseError DateParseError
60 pd.Period( freq='W', year=2092 ) 2091-12-31/2092-01-06 Week: weekday=6 DateParseError DateParseError
61 pd.Period( freq='W', year=2122 ) 2121-12-29/2122-01-04 Week: weekday=6 failed: freq=Minute == p
62 pd.Period( freq='W', year=2152 ) 2151-12-27/2152-01-02 Week: weekday=6 failed: freq=Minute == p
63 pd.Period( freq='W', year=2182 ) 2181-12-31/2182-01-06 Week: weekday=6 DateParseError DateParseError
64 pd.Period( freq='W', year=2212 ) 2211-12-30/2212-01-05 Week: weekday=6 failed: freq=Minute == p
65 pd.Period( freq='W', year=2242 ) 2241-12-27/2242-01-02 Week: weekday=6 failed: freq=Minute == p
66 pd.Period( freq='W', year=2272 ) 2272-01-01/2272-01-07 Week: weekday=6 DateParseError DateParseError
67 pd.Period( freq='W', year=2302 ) 2301-12-30/2302-01-05 Week: weekday=6 failed: freq=Minute == p
68 pd.Period( freq='W', year=2332 ) 2331-12-28/2332-01-03 Week: weekday=6 failed: freq=Minute == p
69 pd.Period( freq='W', year=2362 ) 2362-01-01/2362-01-07 Week: weekday=6 DateParseError DateParseError
70 pd.Period( freq='W', year=2392 ) 2391-12-30/2392-01-05 Week: weekday=6 DateParseError DateParseError
71 pd.Period( freq='W', year=2422 ) 2421-12-27/2422-01-02 Week: weekday=6 DateParseError DateParseError
72 pd.Period( freq='W', year=2452 ) 2452-01-01/2452-01-07 Week: weekday=6 DateParseError DateParseError
73 pd.Period( freq='W', year=2482 ) 2481-12-29/2482-01-04 Week: weekday=6 DateParseError DateParseError
@joooeey joooeey added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 7, 2022
@jbrockmendel jbrockmendel added the Period Period data type label Jan 23, 2023
@jbrockmendel
Copy link
Member

xref #13931

@mroeschke mroeschke removed the Needs Triage Issue that has not been reviewed by a pandas team member label Jul 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement Period Period data type
Projects
None yet
Development

No branches or pull requests

3 participants