Skip to content

Docstring shift #21039

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 24 commits into from
Closed

Docstring shift #21039

wants to merge 24 commits into from

Conversation

PyJay
Copy link

@PyJay PyJay commented May 14, 2018

@codecov
Copy link

codecov bot commented May 15, 2018

Codecov Report

Merging #21039 into master will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master   #21039   +/-   ##
=======================================
  Coverage   91.99%   91.99%           
=======================================
  Files         167      167           
  Lines       50578    50578           
=======================================
  Hits        46530    46530           
  Misses       4048     4048
Flag Coverage Δ
#multiple 90.4% <ø> (ø) ⬆️
#single 42.17% <ø> (ø) ⬆️
Impacted Files Coverage Δ
pandas/core/generic.py 96.47% <ø> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0828c25...888b97b. Read the comment docs.

@PyJay
Copy link
Author

PyJay commented May 15, 2018

@TomAugspurger @jreback for a check if you have the chance

@WillAyd
Copy link
Member

WillAyd commented May 15, 2018

I think this is missing the main purpose of that thread, namely that shift will not do any implicit inference of ordering or continuity for you.

While the solution you provided "answers" that question, I think it would be more appropriate to document shift with / without reindexing prior to the fact to contrast the difference

@PyJay
Copy link
Author

PyJay commented May 15, 2018

@WillAyd Thanks for the feedback.

I am not sure how I could document the case without re-indexing without resulting in an error. I decided to print the intermediary result in an attempt to make this clear. Do you suggest I show the column with the shift applied instead of having it hide behind the lambda operation?

@WillAyd
Copy link
Member

WillAyd commented May 15, 2018

I don't think whatever you show should have any lambda (UDFs should typically be a "last resort").

I would suggest doing a shift with the base frame, pointing out that values from 06-06 simply get moved down to 06-08 without any regard for the fact that there is a two day gap in between those. Then contrast that by reindexing then shifting to show how to move from 06-06 to 06-07

@PyJay
Copy link
Author

PyJay commented May 15, 2018

Okay - will change this to remove usage of lambda.

Doing a shift with the base frame without any re-indexing data.shift(1, pd.Timedelta('1 days'))) moves values from 06 to 07 but I believe the re-indexing and left join is needed because of the grouping and the mismatch of indexing between the original and resulting dataframe as a result of the shift.

Compute the difference between a column in a dataframe and
its shifted version

>>> data = pd.DataFrame({'mydate': [pd.to_datetime('2016-06-06'),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe use pd.Timestamp('2016-06-06') or pd.to_datetime(['2016-06-06', ...]) to convey that to_datetime should be used for arrays.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move mydate to index= and remove the set_index below.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is useful to have a named index as the label (mydate) is used later in the process (for the left join). Would you prefer if I use index= and then set the label using data.index.name = ?

Copy link
Contributor

@TomAugspurger TomAugspurger May 26, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. How about

index=pd.DatetimeIndex(['2016-06-08', ... '2016-06-13'], name='mydate')

Examples
--------
Compute the difference between a column in a dataframe and
its shifted version
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

End with a .

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

noted

>>> data.set_index('mydate', inplace=True)
>>> data
myvalue group
mydate
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this indented too far? I thought the leftmost output should be directly below the leftmost >.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

noted

Merge result as a column named *delta* to the original data

>>> result.name = 'delta'
>>> data.reset_index().merge(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the reset_index and set_index can be avoided in 0.23+ (can join on mix of columns and index names).

Copy link
Author

@PyJay PyJay May 26, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have had a play and looks like the reset_index and set_index are necessary. reset_index changes the delta (shifted values) from a series to a DataFrame and it exposes mydate as column on the original data which is needed for the left join. And the set_index just sets up mydate as an index like in the original dataset.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#21220 for that.

I think pd.merge(data, result.to_frame(), on=['mydate', 'group']) may work

In [45]: pd.merge(data, result.to_frame(), on=['group', 'mydate'])
Out[45]:
            myvalue_x group  myvalue_y
mydate
2016-06-06          1     A        NaN
2016-06-08          2     A        NaN
2016-06-09          3     A        1.0
2016-06-10          4     B        NaN
2016-06-12          5     B        NaN
2016-06-13          6     B        1.0

@gfyoung gfyoung added the Docs label May 21, 2018
@pep8speaks
Copy link

pep8speaks commented Jun 2, 2018

Hello @PyJay! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on July 22, 2018 at 22:04 Hours UTC

@PyJay
Copy link
Author

PyJay commented Jun 2, 2018

@TomAugspurger I have updated the docs as per the feedback.

@WillAyd
Copy link
Member

WillAyd commented Jun 2, 2018

I still think this is missing the point. Can you show:

  • The "as is shift", calling out that the values are simply moved down (i.e. 06-06 value goes to 06-08)
  • A reindexed version where the value from 06-06 subsequently moves to 06-07

Please remove the use of the anonymous function

@PyJay
Copy link
Author

PyJay commented Jul 1, 2018

@WillAyd I have updated to address to your first two points. I am still unsure about removing the lambda since the only alternative I can think of is having a one line function customized to the use case to do the same. Maybe I'm missing a better way here?

@@ -7810,6 +7810,86 @@ def mask(self, cond, other=np.nan, inplace=False, axis=None, level=None,
is not realigned. That is, use freq if you would like to extend the
index when shifting and preserve the original data.

Examples
--------
Compute the difference between a column in a dataframe
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whenever you reference it be sure to capitalize appropriately the word DataFrame

@@ -7810,6 +7810,86 @@ def mask(self, cond, other=np.nan, inplace=False, axis=None, level=None,
is not realigned. That is, use freq if you would like to extend the
index when shifting and preserve the original data.

Examples
--------
Compute the difference between a column in a dataframe
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This first line isn't necessary - can simply delete

Copy link
Author

@PyJay PyJay Jul 15, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which first line? This one? "Compute the difference between a column in a dataframe
with grouped data, and its shifted version." Or the blank line before "Examples"?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe @WillAyd was referring to the "Compute the difference..." line.

Compute the difference between a column in a dataframe
with grouped data, and its shifted version.

>>> data = pd.DataFrame({'myvalue': [1, 2, 3, 4, 5, 6],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use the variable name df instead of data for consistency

If the dataframe is shifted without passing a freq argument than the
values simply move down

>>> data[data.group=='A'].myvalue.shift(1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of filtering to group A you'd be better served to work with the entire frame

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. The group stuff seems to be a distraction from shift at this point.

For the groups compute the difference between current `myvalue` and
`myvalue` shifted forward by 1 day.

If the dataframe is shifted without passing a freq argument than the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Simple typo - "then" instead of "than" here


>>> data[data.group=='A'].myvalue.shift(1)
mydate
2016-06-06 NaN
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Others might have a differing opinion but I find these examples rather confusing as you really need to think through how the data is indexed.

I mentioned it before but I think the cleanest approach would be the reindex / fill set of operations I posted in the original PR - is there any reason why we can't use that here instead of the UDF? I think it more clearly explains the situation and it will certainly scale better on larger datasets, hence why I'd rather we suggest that type of usage in the documentation.

Copy link
Author

@PyJay PyJay Jul 15, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure - I can use your suggestion instead. You suggested

dt_rng = pd.date_range(data.index.min(), data.index.max())
data = data.reindex(dt_rng)
data['group'] = data['group'].ffill()
data.groupby('group')['myvalue'].transform(lambda x: x-x.shift())

I think I can avoid the UDF by doing

df['myvalue'] - df.groupby('group')['myvalue'].shift(1)

I get the same answer, just wanted to confirm that it's the correct thing to do here?
Thanks.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That’s correct

@datapythonista datapythonista self-assigned this Jul 22, 2018
@PyJay
Copy link
Author

PyJay commented Jul 22, 2018

@WillAyd @TomAugspurger Changes made as per feedback.

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Almost there!

2016-06-13 1.0
Freq: D, Name: myvalue, dtype: float64

Concatenate result as a column named `delta` to the original data
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I appreciate the thoroughness you are aiming for here, but let's get rid of this section to the example. concat will have a separate docstring with examples to teach that piece to users should they want to know

Examples
--------

>>> df = pd.DataFrame({'group': ['A', 'A', 'A', 'B', 'B', 'B'],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This probably isn't enforceable and maybe it's just a personal preference but I think using key and val would be preferable to group and myvalue. I've seen the former used in quite a few more places so for consistency would be better to use those names.

@datapythonista maybe something to think about

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general I think it makes things clearer if the example is with real world data. In this case, if we just show basic examples of shift (and we avoid using ffill, reindexing, groupby and concat), I don't think we need the column group.

I think the examples are quite good for the cookbook, or the time series documentation, but for the docstring, I think we should show the basic stuff. In this case what I'd do:

  • Create a simple dataframe (one column and 4 rows for example)
  • Show .shift() with default arguments
  • Show how period can change the shifted periods
  • Show how freq can be used

And I don't know exactly what axis does in this method, it could be shown too.

For all the rest of the stuff, feel free to contribute it to the other documents.

2016-06-12 B 5
2016-06-13 B 6

For the groups compute the difference between current `myvalue` and
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I'm missing the point but I don't think you need this sentence

For the groups compute the difference between current `myvalue` and
`myvalue` shifted forward by 1 day.

If `myvalue` is shifted then the values will simply move down.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good start but I think the point to stress here is that "move down" makes no consideration of the dates being in order. So maybe to emphasize better can append something like "...move down one row, regardless of the chronology of the dates"

2016-06-13 5.0
Name: myvalue, dtype: float64

We only want to shift `myvalue` forward by one day before computing
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just for clearer wording I'd say "If instead you wanted to shift values forward by a day you can do this by reindexing first and filling key"

After considering the grouping we can calculate the difference
as follows

>>> result = df['myvalue'] - df.groupby('group')['myvalue'].shift(1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what you are aiming for with the subtraction but I think it's clearer if you just do df.groupby('key').shift() here and show the result

@PyJay
Copy link
Author

PyJay commented Jul 23, 2018

Ahh thanks for the feedback @WillAyd I realise this has dragged on a bit! Will hopefully be able to finalise this weekend. @datapythonista were you thinking of taking this work over?

2016-06-11 B NaN NaN
2016-06-12 B 5.0 NaN
2016-06-13 B 6.0 1.0

Returns
-------
shifted : %(klass)s
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you move Return before the examples?

@@ -8012,12 +8012,12 @@ def mask(self, cond, other=np.nan, inplace=False, axis=None, level=None,
errors=errors)

_shared_docs['shift'] = ("""
Shift index by desired number of periods with an optional time freq
Shift index by desired number of periods with an optional time freq.

Parameters
----------
periods : int
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add the default 1

Examples
--------

>>> df = pd.DataFrame({'group': ['A', 'A', 'A', 'B', 'B', 'B'],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general I think it makes things clearer if the example is with real world data. In this case, if we just show basic examples of shift (and we avoid using ffill, reindexing, groupby and concat), I don't think we need the column group.

I think the examples are quite good for the cookbook, or the time series documentation, but for the docstring, I think we should show the basic stuff. In this case what I'd do:

  • Create a simple dataframe (one column and 4 rows for example)
  • Show .shift() with default arguments
  • Show how period can change the shifted periods
  • Show how freq can be used

And I don't know exactly what axis does in this method, it could be shown too.

For all the rest of the stuff, feel free to contribute it to the other documents.

@datapythonista
Copy link
Member

Duplicate of #20472

@datapythonista datapythonista marked this as a duplicate of #20472 Nov 3, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

issue when shifting with Timedelta in a groupby
6 participants