-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
Docstring shift #21039
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Docstring shift #21039
Conversation
Codecov Report
@@ Coverage Diff @@
## master #21039 +/- ##
=======================================
Coverage 91.99% 91.99%
=======================================
Files 167 167
Lines 50578 50578
=======================================
Hits 46530 46530
Misses 4048 4048
Continue to review full report at Codecov.
|
@TomAugspurger @jreback for a check if you have the chance |
I think this is missing the main purpose of that thread, namely that shift will not do any implicit inference of ordering or continuity for you. While the solution you provided "answers" that question, I think it would be more appropriate to document shift with / without reindexing prior to the fact to contrast the difference |
@WillAyd Thanks for the feedback. I am not sure how I could document the case without re-indexing without resulting in an error. I decided to print the intermediary result in an attempt to make this clear. Do you suggest I show the column with the shift applied instead of having it hide behind the lambda operation? |
I don't think whatever you show should have any lambda (UDFs should typically be a "last resort"). I would suggest doing a shift with the base frame, pointing out that values from 06-06 simply get moved down to 06-08 without any regard for the fact that there is a two day gap in between those. Then contrast that by reindexing then shifting to show how to move from 06-06 to 06-07 |
Okay - will change this to remove usage of lambda. Doing a shift with the base frame without any re-indexing data.shift(1, pd.Timedelta('1 days'))) moves values from 06 to 07 but I believe the re-indexing and left join is needed because of the grouping and the mismatch of indexing between the original and resulting dataframe as a result of the shift. |
pandas/core/generic.py
Outdated
Compute the difference between a column in a dataframe and | ||
its shifted version | ||
|
||
>>> data = pd.DataFrame({'mydate': [pd.to_datetime('2016-06-06'), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe use pd.Timestamp('2016-06-06')
or pd.to_datetime(['2016-06-06', ...])
to convey that to_datetime
should be used for arrays.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Move mydate
to index=
and remove the set_index
below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is useful to have a named index as the label (mydate) is used later in the process (for the left join). Would you prefer if I use index=
and then set the label using data.index.name =
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. How about
index=pd.DatetimeIndex(['2016-06-08', ... '2016-06-13'], name='mydate')
pandas/core/generic.py
Outdated
Examples | ||
-------- | ||
Compute the difference between a column in a dataframe and | ||
its shifted version |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
End with a .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
noted
pandas/core/generic.py
Outdated
>>> data.set_index('mydate', inplace=True) | ||
>>> data | ||
myvalue group | ||
mydate |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this indented too far? I thought the leftmost output should be directly below the leftmost >
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
noted
pandas/core/generic.py
Outdated
Merge result as a column named *delta* to the original data | ||
|
||
>>> result.name = 'delta' | ||
>>> data.reset_index().merge( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the reset_index
and set_index
can be avoided in 0.23+ (can join on mix of columns and index names).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have had a play and looks like the reset_index
and set_index
are necessary. reset_index
changes the delta (shifted values) from a series to a DataFrame and it exposes mydate
as column on the original data which is needed for the left join. And the set_index
just sets up mydate
as an index like in the original dataset.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#21220 for that.
I think pd.merge(data, result.to_frame(), on=['mydate', 'group'])
may work
In [45]: pd.merge(data, result.to_frame(), on=['group', 'mydate'])
Out[45]:
myvalue_x group myvalue_y
mydate
2016-06-06 1 A NaN
2016-06-08 2 A NaN
2016-06-09 3 A 1.0
2016-06-10 4 B NaN
2016-06-12 5 B NaN
2016-06-13 6 B 1.0
Hello @PyJay! Thanks for updating the PR. Cheers ! There are no PEP8 issues in this Pull Request. 🍻 Comment last updated on July 22, 2018 at 22:04 Hours UTC |
@TomAugspurger I have updated the docs as per the feedback. |
I still think this is missing the point. Can you show:
Please remove the use of the anonymous function |
@WillAyd I have updated to address to your first two points. I am still unsure about removing the lambda since the only alternative I can think of is having a one line function customized to the use case to do the same. Maybe I'm missing a better way here? |
pandas/core/generic.py
Outdated
@@ -7810,6 +7810,86 @@ def mask(self, cond, other=np.nan, inplace=False, axis=None, level=None, | |||
is not realigned. That is, use freq if you would like to extend the | |||
index when shifting and preserve the original data. | |||
|
|||
Examples | |||
-------- | |||
Compute the difference between a column in a dataframe |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whenever you reference it be sure to capitalize appropriately the word DataFrame
pandas/core/generic.py
Outdated
@@ -7810,6 +7810,86 @@ def mask(self, cond, other=np.nan, inplace=False, axis=None, level=None, | |||
is not realigned. That is, use freq if you would like to extend the | |||
index when shifting and preserve the original data. | |||
|
|||
Examples | |||
-------- | |||
Compute the difference between a column in a dataframe |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This first line isn't necessary - can simply delete
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which first line? This one? "Compute the difference between a column in a dataframe
with grouped data, and its shifted version." Or the blank line before "Examples"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe @WillAyd was referring to the "Compute the difference..." line.
pandas/core/generic.py
Outdated
Compute the difference between a column in a dataframe | ||
with grouped data, and its shifted version. | ||
|
||
>>> data = pd.DataFrame({'myvalue': [1, 2, 3, 4, 5, 6], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use the variable name df
instead of data
for consistency
pandas/core/generic.py
Outdated
If the dataframe is shifted without passing a freq argument than the | ||
values simply move down | ||
|
||
>>> data[data.group=='A'].myvalue.shift(1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of filtering to group A
you'd be better served to work with the entire frame
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. The group stuff seems to be a distraction from shift
at this point.
pandas/core/generic.py
Outdated
For the groups compute the difference between current `myvalue` and | ||
`myvalue` shifted forward by 1 day. | ||
|
||
If the dataframe is shifted without passing a freq argument than the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Simple typo - "then" instead of "than" here
pandas/core/generic.py
Outdated
|
||
>>> data[data.group=='A'].myvalue.shift(1) | ||
mydate | ||
2016-06-06 NaN |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Others might have a differing opinion but I find these examples rather confusing as you really need to think through how the data is indexed.
I mentioned it before but I think the cleanest approach would be the reindex
/ fill
set of operations I posted in the original PR - is there any reason why we can't use that here instead of the UDF? I think it more clearly explains the situation and it will certainly scale better on larger datasets, hence why I'd rather we suggest that type of usage in the documentation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure - I can use your suggestion instead. You suggested
dt_rng = pd.date_range(data.index.min(), data.index.max())
data = data.reindex(dt_rng)
data['group'] = data['group'].ffill()
data.groupby('group')['myvalue'].transform(lambda x: x-x.shift())
I think I can avoid the UDF by doing
df['myvalue'] - df.groupby('group')['myvalue'].shift(1)
I get the same answer, just wanted to confirm that it's the correct thing to do here?
Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That’s correct
@WillAyd @TomAugspurger Changes made as per feedback. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Almost there!
2016-06-13 1.0 | ||
Freq: D, Name: myvalue, dtype: float64 | ||
|
||
Concatenate result as a column named `delta` to the original data |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I appreciate the thoroughness you are aiming for here, but let's get rid of this section to the example. concat
will have a separate docstring with examples to teach that piece to users should they want to know
Examples | ||
-------- | ||
|
||
>>> df = pd.DataFrame({'group': ['A', 'A', 'A', 'B', 'B', 'B'], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This probably isn't enforceable and maybe it's just a personal preference but I think using key
and val
would be preferable to group
and myvalue
. I've seen the former used in quite a few more places so for consistency would be better to use those names.
@datapythonista maybe something to think about
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general I think it makes things clearer if the example is with real world data. In this case, if we just show basic examples of shift
(and we avoid using ffill
, reindexing
, groupby
and concat
), I don't think we need the column group
.
I think the examples are quite good for the cookbook, or the time series documentation, but for the docstring, I think we should show the basic stuff. In this case what I'd do:
- Create a simple dataframe (one column and 4 rows for example)
- Show
.shift()
with default arguments - Show how
period
can change the shifted periods - Show how
freq
can be used
And I don't know exactly what axis
does in this method, it could be shown too.
For all the rest of the stuff, feel free to contribute it to the other documents.
2016-06-12 B 5 | ||
2016-06-13 B 6 | ||
|
||
For the groups compute the difference between current `myvalue` and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe I'm missing the point but I don't think you need this sentence
For the groups compute the difference between current `myvalue` and | ||
`myvalue` shifted forward by 1 day. | ||
|
||
If `myvalue` is shifted then the values will simply move down. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good start but I think the point to stress here is that "move down" makes no consideration of the dates being in order. So maybe to emphasize better can append something like "...move down one row, regardless of the chronology of the dates"
2016-06-13 5.0 | ||
Name: myvalue, dtype: float64 | ||
|
||
We only want to shift `myvalue` forward by one day before computing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just for clearer wording I'd say "If instead you wanted to shift values forward by a day you can do this by reindexing first and filling key
"
After considering the grouping we can calculate the difference | ||
as follows | ||
|
||
>>> result = df['myvalue'] - df.groupby('group')['myvalue'].shift(1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure what you are aiming for with the subtraction but I think it's clearer if you just do df.groupby('key').shift()
here and show the result
Ahh thanks for the feedback @WillAyd I realise this has dragged on a bit! Will hopefully be able to finalise this weekend. @datapythonista were you thinking of taking this work over? |
2016-06-11 B NaN NaN | ||
2016-06-12 B 5.0 NaN | ||
2016-06-13 B 6.0 1.0 | ||
|
||
Returns | ||
------- | ||
shifted : %(klass)s |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you move Return
before the examples?
@@ -8012,12 +8012,12 @@ def mask(self, cond, other=np.nan, inplace=False, axis=None, level=None, | |||
errors=errors) | |||
|
|||
_shared_docs['shift'] = (""" | |||
Shift index by desired number of periods with an optional time freq | |||
Shift index by desired number of periods with an optional time freq. | |||
|
|||
Parameters | |||
---------- | |||
periods : int |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add the default 1
Examples | ||
-------- | ||
|
||
>>> df = pd.DataFrame({'group': ['A', 'A', 'A', 'B', 'B', 'B'], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general I think it makes things clearer if the example is with real world data. In this case, if we just show basic examples of shift
(and we avoid using ffill
, reindexing
, groupby
and concat
), I don't think we need the column group
.
I think the examples are quite good for the cookbook, or the time series documentation, but for the docstring, I think we should show the basic stuff. In this case what I'd do:
- Create a simple dataframe (one column and 4 rows for example)
- Show
.shift()
with default arguments - Show how
period
can change the shifted periods - Show how
freq
can be used
And I don't know exactly what axis
does in this method, it could be shown too.
For all the rest of the stuff, feel free to contribute it to the other documents.
Duplicate of #20472 |
git diff upstream/master -u -- "*.py" | flake8 --diff