Skip to content

Make *_range functions consistent #17482

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Sep 14, 2017

Conversation

jschendel
Copy link
Member

Comments:

  • I couldn't find any existing tests of period_range, so I created a new file test_period_range.py in a similar fashion to the existing files test_date_range.py and test_timedelta_range.py.
  • Cleaned up a few existing tests relevant to this issue (used with context where appropriate, etc.).

@gfyoung gfyoung added API Design Error Reporting Incorrect or improved errors from pandas Interval Interval data type labels Sep 9, 2017
@@ -310,6 +310,59 @@ Previously, :func:`to_datetime` did not localize datetime ``Series`` data when `

Additionally, DataFrames with datetime columns that were parsed by :func:`read_sql_table` and :func:`read_sql_query` will also be localized to UTC only if the original SQL columns were timezone aware datetime columns.

.. _whatsnew_0200.api.consistency_of_range_functions:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean "0210"

@@ -2030,7 +2031,7 @@ def date_range(start=None, end=None, periods=None, freq='D', tz=None,

Notes
-----
2 of start, end, or periods must be specified
Exactly two of start, end, or periods must be specified
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know what you're trying to say here, but it's a little confusing the final parameter is plural. Say something like this (and propagate to places where relevant):

"Of the three parameters, start, end, and periods, exactly two must be specified."

For the doc-strings, I would even consider ticking them for additional clarity.

@@ -146,14 +146,26 @@ def test_date_range_businesshour(self):

def test_range_misspecified(self):
# GH #1095
with pytest.raises(ValueError):
Copy link
Member

@gfyoung gfyoung Sep 9, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We know what the error message is here (you just added it). Let's make this test (and all other relevant ones) stronger by checking the error message too using tm.assert_raises_regexp.

@codecov
Copy link

codecov bot commented Sep 9, 2017

Codecov Report

Merging #17482 into master will decrease coverage by 0.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #17482      +/-   ##
==========================================
- Coverage   91.15%   91.14%   -0.02%     
==========================================
  Files         163      163              
  Lines       49534    49531       -3     
==========================================
- Hits        45153    45144       -9     
- Misses       4381     4387       +6
Flag Coverage Δ
#multiple 88.92% <100%> (ø) ⬆️
#single 40.22% <11.11%> (-0.06%) ⬇️
Impacted Files Coverage Δ
pandas/core/indexes/interval.py 93.11% <100%> (+0.5%) ⬆️
pandas/core/indexes/period.py 92.74% <100%> (ø) ⬆️
pandas/core/indexes/datetimes.py 95.53% <100%> (ø) ⬆️
pandas/core/indexes/timedeltas.py 90.6% <100%> (+0.01%) ⬆️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/frame.py 97.72% <0%> (-0.1%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d6df8ea...523dcea. Read the comment docs.

@codecov
Copy link

codecov bot commented Sep 9, 2017

Codecov Report

Merging #17482 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #17482      +/-   ##
==========================================
+ Coverage   91.18%   91.19%   +<.01%     
==========================================
  Files         163      163              
  Lines       49545    49582      +37     
==========================================
+ Hits        45179    45214      +35     
- Misses       4366     4368       +2
Flag Coverage Δ
#multiple 88.97% <100%> (+0.02%) ⬆️
#single 40.2% <14.28%> (-0.08%) ⬇️
Impacted Files Coverage Δ
pandas/core/indexes/interval.py 93.57% <100%> (+0.95%) ⬆️
pandas/core/indexes/period.py 92.77% <100%> (+0.03%) ⬆️
pandas/core/indexes/datetimes.py 95.53% <100%> (ø) ⬆️
pandas/core/indexes/timedeltas.py 91.19% <100%> (+0.6%) ⬆️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/frame.py 97.77% <0%> (-0.1%) ⬇️
pandas/core/common.py 91.98% <0%> (+0.32%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update fa557f7...f6cc860. Read the comment docs.

@jschendel jschendel force-pushed the consistency-range-functions branch from 523dcea to 8713abb Compare September 10, 2017 06:00
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

small comments. looks really good. a couple of in-line questions. ping when pushed.

---------------------------------------------------------------------------
ValueError: Of the three parameters, start, end, and periods, exactly two must be specified

Additionally, the endpoint parameter ``end`` was not included in the intervals produced by ``interval_range``. However, all other range functions include ``end`` in their output. To promote consistency among the range functions, ``interval_range`` will now include ``end`` as the right endpoint of the final interval, except if ``freq`` is specified in a way which skips ``end``.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this last sentence true for period as well?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think periods can cause a skip. Since we're forcing exactly two of start/ end/periods to be passed, if periods is passed then only one of start/end can be passed. The _range functions then compute the non-passed using periods/ freq/passed, so by construction nothing can get skipped. I think the only case a skip is possible is end being skipped when start/end/freq are all passed.

@@ -412,7 +412,8 @@ def __new__(cls, data=None,
def _generate(cls, start, end, periods, name, offset,
tz=None, normalize=False, ambiguous='raise', closed=None):
if com._count_not_none(start, end, periods) != 2:
raise ValueError('Must specify two of start, end, or periods')
raise ValueError('Of the three parameters, start, end, and '
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of the three parameters: start, end, and periods, comma -> :

@@ -2030,7 +2031,8 @@ def date_range(start=None, end=None, periods=None, freq='D', tz=None,

Notes
-----
2 of start, end, or periods must be specified
Of the three parameters, ``start``, ``end``, and ``periods``, exactly two
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same (and all other occurrences)

@@ -1039,43 +1039,37 @@ def interval_range(start=None, end=None, freq=None, periods=None,
Left bound for generating data
end : string or datetime-like, default None
Right bound for generating data
freq : interger, string or DateOffset, default 1
periods : interger, default None
freq : integer, string or DateOffset, default 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

somewhere mention (if not already) that freq must be consistent with the type of the start/end

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also I think there are tests but for this, but pls confirm.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Confirmed that there are tests for mixed units in TestIntervalRange.test_errors. Adding one additional test to be safe.

period_range(periods=5)

with tm.assert_raises_regex(ValueError, msg):
period_range()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add inconsistent freq w.r.t. start/end

Copy link
Member Author

@jschendel jschendel Sep 11, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current behavior on 0.20.3 (and this PR as-is) is not to raise, but to convert to freq:

In [2]: p1 = pd.Period('2017-01-01', freq='W')
   ...: p2 = pd.Period('2017-12-31', freq='W')
   ...: pd.period_range(start=p1, end=p2, freq='Q')
   ...:
Out[2]: PeriodIndex(['2017Q1', '2017Q2', '2017Q3', '2017Q4'], dtype='period[Q-DEC]', freq='Q-DEC')

When freq isn't specified, it converts to daily (default is freq='D'):

In [5]: pd.period_range(start=p1, end=p2)
Out[5]:
PeriodIndex(['2017-01-01', '2017-01-02', '2017-01-03', '2017-01-04',
             '2017-01-05', '2017-01-06', '2017-01-07', '2017-01-08',
             '2017-01-09', '2017-01-10',
             ...
             '2017-12-22', '2017-12-23', '2017-12-24', '2017-12-25',
             '2017-12-26', '2017-12-27', '2017-12-28', '2017-12-29',
             '2017-12-30', '2017-12-31'],
            dtype='period[D]', length=365, freq='D')

Same behavior with different frequencies for all of start/end/freq:

In [4]: p3 = pd.Period('2017-12-31', freq='M')
   ...: pd.period_range(start=p1, end=p3, freq='Q')
   ...:
Out[4]: PeriodIndex(['2017Q1', '2017Q2', '2017Q3', '2017Q4'], dtype='period[Q-DEC]', freq='Q-DEC')

Should this raise if any of start/end/freq are passed with different frequencies? If start/end are passed with the same frequency, and freq isn't specified, should freq be set to match start/end instead of using the daily default? (same thing if one of start/end is passed with a frequency and periods is passed)

Couldn't find a direct comparison with the other range functions to check against for consistency. interval_range will raise if freq isn't compatible with they type of start/end, but that seems a bit different.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this is meant to be the 'new' freq, while the start/end serve as the anchor endpoints. this seems fine then. (maybe adding the above example to the docs would be helpful though).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, will add tests for this behavior.

@jschendel jschendel force-pushed the consistency-range-functions branch from 8713abb to 09498fd Compare September 11, 2017 06:30
@jschendel
Copy link
Member Author

ping @jreback

Made the review changes, as well as a few additional things I noticed:

  • Found and fixed a corner case bug in my endpoint fix for interval_range, which resulted in shuffling around some of that code
    • Also added some additional checks around periods (similar to what's done in the other functions)
    • Also added some additonal tests
  • Regarding period_range when start or end are Period objects (discussed in review comments):
    • Added tests to for this scenario
    • Added an Example section to the period_range docstring covering this scenario
    • Added an example of this using the PeriodIndex constructor in timeseries.rst
  • Cleaned up the docstrings of each function and made them as consistent with each other as possible
  • Cleaned up a few error messages to make the capitalization consistent with param names

I also noticed string/datetime-like start/end doesn't work with interval_range, despite the docstring saying it does (don't think it ever did). Should that be fixed here, or is a separate issue/PR more appropriate?

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

couple of minor comments.


In [2]: pd.interval_range(start=0, end=4, periods=6)
---------------------------------------------------------------------------
ValueError: Of the three parameters, start, end, and periods, exactly two must be specified
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use : after parameters here

periods : integer or None, default None
If None, must specify start and end
periods : integer, default None
Number of dates to generate
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

number of periods

freq : string or DateOffset, default 'B' (business daily)
Frequency strings can have multiples, e.g. '5H'
Frequency strings can have multiples, e.g. '5, default
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the example here looks wrong

'exactly two must be specified')

# must all be same units or None
arr = np.array(list(com._not_none(start, end, freq)))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you could just do this above and compare len(arr) != 2

The length of each interval. Must be consistent with the
type of start and end
name : string, default None
Name of the resulting IntervalIndex
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a couple of examples?

closed : string or None, default None
name : string, default None
Name of the resulting TimedeltaIndex
closed : string, default None
Make the interval closed with respect to the given frequency to
the 'left', 'right', or both sides (None)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add some examples

@jreback
Copy link
Contributor

jreback commented Sep 11, 2017

these are all fine: #17482 (comment) thanks!

I also noticed string/datetime-like start/end doesn't work with interval_range, despite the docstring saying it does (don't think it ever did). Should that be fixed here, or is a separate issue/PR more appropriate

Datetimelike should work with interval_range, as it works with the regular constructors. I don't think strings should work though, its tricky to infer what is mean (IOW it could be a datetime, or a timedelta, or it could be an error). I would raise on strings (but accept datetimelike)

@jreback jreback added this to the 0.21.0 milestone Sep 11, 2017
@jreback
Copy link
Contributor

jreback commented Sep 11, 2017

ping on push / green.

@jschendel jschendel force-pushed the consistency-range-functions branch 2 times, most recently from d34b82c to 925c72c Compare September 12, 2017 07:41
@jschendel
Copy link
Member Author

ping @jreback

Made the review changes, and additionally:

  • Implemented support for datetime-like in interval_range
    • Added a bunch of associated tests
    • Related to previous review comment: note that both not null checks are actually necessary, as one is checking with periods, the other freq
  • Changed a couple ValueError to TypeError, as that seems like the more appropriate exception

raise ValueError("must specify 2 of start, end, periods")
if is_number(start):
is_datetime_interval = False
elif start is not None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This not correct for the start is None (and end). these must be number, datetime, or timedelta. For datetime, timedelta, then you can wrap the Timestamp/Timedelta around.

@jschendel jschendel force-pushed the consistency-range-functions branch from 925c72c to 549c402 Compare September 12, 2017 22:37
@jschendel
Copy link
Member Author

ping @jreback

Updated datetime-like support. For some reason I originally thought that datetime-like didn't include timedelta, so I shortcut to just parsing timestamps. Updated with support for timedelta and added timedelta related tests.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see if you can condense the constructor a bit. yeah I know a lot is going on,.

@@ -2091,7 +2094,7 @@ def bdate_range(start=None, end=None, periods=None, freq='B', tz=None,
def cdate_range(start=None, end=None, periods=None, freq='C', tz=None,
normalize=True, name=None, closed=None, **kwargs):
"""
**EXPERIMENTAL** Return a fixed frequency datetime index, with
**EXPERIMENTAL** Return a fixed frequency DatetimeIndex, with
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

side issue we can prob remove the experimental (but new PR for that)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did this in #17554, as there were other doc related fixes that needed to be made.


iv_type = {'numeric': True, 'timestamp': True, 'timedelta': True}

start = com._maybe_box_datetimelike(start)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about a helper function to validate this instead of repeating code.
maybe

if is_number(start) or is_number(end):
    boxer = type(start) if start is not None else type(end)
elif isinstance(start, Timestamp) or isinstance(end, Timestamp):
   boxer = Timestamp
elif .....
    ....
else:
    raise

@jschendel
Copy link
Member Author

@jreback : cleaned up the interval_range constructor

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looking really good. just a couple of minor doc changes and a question

Consistency of Range Functions
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In previous versions, there were some inconsistencies between the various range functions: ``date_range``, ``bdate_range``, ``cdate_range``, ``interval_range``, ``period_range``, and ``timedelta_range``. (:issue:`17471`).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add interval_range to the api.rst (might as well add cdate_range too), can you add these as :func:`date_range` as well.

msg = 'end must be numeric or datetime-like, got {end}'
raise ValueError(msg.format(end=end))

if is_float(periods):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

much cleaner, nice!

raise ValueError("must specify 2 of start, end, periods")
if is_number(endpoint):
if periods is None:
periods = int((end - start) // freq)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this what we do elsewhere? this is effectively floored not that this is bad, just wondering if its consistent

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's floored elsewhere too, e.g.

if is_float(periods):
periods = int(periods)

if is_float(periods):
periods = int(periods)

Clean error message, check for error in tests, fix typo in whatsnew.
addressed jreback's review, cleaned up docstrings, added documention, update to interval_range endpoint fix, additional tests
Made review related changes, implemented support for datetime-like input in interval_range.
@jschendel jschendel force-pushed the consistency-range-functions branch from d5579cb to f6cc860 Compare September 14, 2017 05:33
@jschendel
Copy link
Member Author

@jreback : Made the doc changes. Didn't see an obvious location for interval_range in api.rst, so I created a new section. Additionally fixed one minor typo in whatsnew.

@jreback jreback merged commit 2cf2566 into pandas-dev:master Sep 14, 2017
@jreback
Copy link
Contributor

jreback commented Sep 14, 2017

thanks @jschendel nice PR, and very responsive! keep em' comin!

@jschendel jschendel deleted the consistency-range-functions branch September 14, 2017 17:17
@@ -218,10 +218,19 @@ Top-level dealing with datetimelike
to_timedelta
date_range
bdate_range
cdate_range
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this one is not in the public API, so adding it here does not work

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, see #17554

Consistency of Range Functions
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In previous versions, there were some inconsistencies between the various range functions: func:`date_range`, func:`bdate_range`, func:`cdate_range`, func:`period_range`, func:`timedelta_range`, and func:`interval_range`. (:issue:`17471`).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are missing :'s for all references: :func:.. instead of func:..

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(and also cdate_range here does not work)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, see #17554

jschendel added a commit to jschendel/pandas that referenced this pull request Sep 15, 2017
@jschendel jschendel mentioned this pull request Sep 15, 2017
2 tasks
jreback pushed a commit that referenced this pull request Sep 17, 2017
jorisvandenbossche added a commit that referenced this pull request Sep 17, 2017
alanbato pushed a commit to alanbato/pandas that referenced this pull request Nov 10, 2017
alanbato pushed a commit to alanbato/pandas that referenced this pull request Nov 10, 2017
No-Stream pushed a commit to No-Stream/pandas that referenced this pull request Nov 28, 2017
No-Stream pushed a commit to No-Stream/pandas that referenced this pull request Nov 28, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Error Reporting Incorrect or improved errors from pandas Interval Interval data type
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Inconsistent behavior of *_range functions
4 participants