Reindex docs question / clarification #21429

Open
bzier opened this issue Jun 11, 2018 · 4 comments

@bzier

bzier commented Jun 11, 2018

From the bottom of the reindex docs here; relevant docs source here:

The index entries that did not have a value in the original data frame (for example, ‘2009-12-29’) are by default filled with NaN. If desired, we can fill in the missing values using one of several options.

For example, to backpropagate the last valid value to fill the NaN values, pass bfill as an argument to the method keyword.

>>> df2.reindex(date_index2, method='bfill')
           prices
2009-12-29     100
2009-12-30     100
2009-12-31     100
2010-01-01     100
2010-01-02     101
2010-01-03     NaN
2010-01-04     100
2010-01-05      89
2010-01-06      88
2010-01-07     NaN

Please note that the NaN value present in the original dataframe
(at index value 2010-01-03) will not be filled by any of the
value propagation schemes. This is because filling while reindexing
does not look at dataframe values, but only compares the original and
desired indexes. If you do want to fill in the NaN values present
in the original dataframe, use the fillna() method.
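
The note's distinction between index-based and value-based filling can be checked with a short sketch. The definitions of df2 and date_index2 below are assumptions, reconstructed to match the output quoted above; DataFrame.bfill() stands in here for the value-based fill that the note attributes to fillna().

```python
import numpy as np
import pandas as pd

# Assumed setup, reconstructed to match the output quoted above.
date_index = pd.date_range("2010-01-01", periods=6, freq="D")
df2 = pd.DataFrame({"prices": [100, 101, np.nan, 100, 89, 88]}, index=date_index)
date_index2 = pd.date_range("2009-12-29", periods=10, freq="D")

# Index-based fill: only labels missing from df2.index are candidates.
reindexed = df2.reindex(date_index2, method="bfill")
print(reindexed.loc[pd.Timestamp("2010-01-03"), "prices"])  # nan

# Value-based fill applied afterwards: looks at the values themselves.
filled = reindexed.bfill()
print(filled.loc[pd.Timestamp("2010-01-03"), "prices"])  # 100.0
```

Even after the value-based fill, 2010-01-07 stays NaN in this sketch, because no valid value follows it.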

Problem description

I couldn't find any duplicates when searching (though it's hard to say one isn't out there somewhere).

This is a question as much as anything. It may be my ignorance, or perhaps an oversight in the docs. The last value in the output shows 2010-01-07 NaN. It was not part of the original dataframe, so based on the note, it seems that it too should be auto-filled like the first 3 values were. I understand why 2010-01-03 NaN was not populated, but it doesn't seem right for the last value. Unless there is something I'm missing.

https://pandas-docs.github.io/pandas-docs-travis/
^^
FYI, this link from the issue template is giving a 404

@gfyoung
Member

gfyoung commented Jun 11, 2018

> FYI, this link from the issue template is giving a 404

@jreback @jorisvandenbossche : I thought we were still pushing builds of the docs on Travis?

@TomAugspurger
Contributor

> I thought we were still pushing builds of the docs on Travis?

Failing with https://travis-ci.org/pandas-dev/pandas/jobs/390828331#L2234 till #21397 is merged.

@TomAugspurger
Contributor

w.r.t. the original issue, 2010-01-07 is not filled since it's beyond the last original valid value. bfill backfills from the next valid value, and there isn't a valid value past 2010-01-07, so there's nothing to backfill. @bzier is there anything in the docstring that could better explain that? bfill and ffill are concepts most pandas users will see first via fillna: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.fillna.html
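
One way to see this is to ask the original index which position each new label would be filled from. The sketch below uses Index.get_indexer with method="bfill"; it illustrates the lookup and is not a claim about how reindex is implemented internally. The df2 and date_index2 definitions are assumptions, reconstructed to match the output quoted in the issue.

```python
import numpy as np
import pandas as pd

# Assumed setup, reconstructed to match the output quoted in the issue.
date_index = pd.date_range("2010-01-01", periods=6, freq="D")
df2 = pd.DataFrame({"prices": [100, 101, np.nan, 100, 89, 88]}, index=date_index)
date_index2 = pd.date_range("2009-12-29", periods=10, freq="D")

# For each new label, find the position of the next original label at or
# after it. A result of -1 means no such label exists.
positions = df2.index.get_indexer(date_index2, method="bfill")
print(positions.tolist())  # [0, 0, 0, 0, 1, 2, 3, 4, 5, -1]
```

The trailing -1 is 2010-01-07: no original label sits at or after it, so there is nothing to backfill from. Position 2 points at the original NaN for 2010-01-03, which is carried over unchanged.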

@bzier
Author

bzier commented Jun 11, 2018

@TomAugspurger Thanks for the clarification, that makes sense.

I am brand new to pandas, so hadn't been exposed to bfill or ffill yet. I wound up on the reindex docs from the very bottom of this pandas intro notebook from the Google Machine Learning Crash Course. The rest of that intro made sense, but they piqued my interest with the point about string indexes, so I followed the link straight to that reindexing page.


The reindexing docs all made sense and the examples made things clear, up to that point. I think a couple things threw me off. The first section introduces it as

we can fill in the missing values

and

to fill the NaN values

Neither phrase indicates that some NaN values might be left unfilled. The note underneath then goes on to explain why the one original value was not filled (2010-01-03), but says nothing about the last value at the end. It says

This is because filling while reindexing does not look at dataframe values, but only compares the original and desired indexes

which almost implies that (or at least I read it as) the original values will be left alone and all the new indexes will be filled.

I think simply adding those two sentences from your response would make it clear. For those who are familiar with the fill concepts, it will seem obvious, but I think it would provide clarity for those who aren't.

2010-01-07 is not filled since it's beyond the last original valid value. bfill backfills from the next valid value, and there isn't a valid value past 2010-01-07, so there's nothing to backfill.

Alternatively, perhaps just referencing the fill strategies in the earlier statement would be sufficient. Along those lines, one more clarification... the docs say

If desired, we can fill in the missing values using one of several options.

Does that mean then that if we were to specify ffill as the method rather than bfill, the results would have left the first three values as NaN and populated the 2010-01-07 result with the previous valid value of 88?


Thanks again for the help.
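
On the ffill question above, a minimal sketch suggests the answer is yes. The df2 and date_index2 definitions are assumptions, reconstructed to match the output quoted at the top of the issue.

```python
import numpy as np
import pandas as pd

# Assumed setup, reconstructed to match the output quoted in the issue.
date_index = pd.date_range("2010-01-01", periods=6, freq="D")
df2 = pd.DataFrame({"prices": [100, 101, np.nan, 100, 89, 88]}, index=date_index)
date_index2 = pd.date_range("2009-12-29", periods=10, freq="D")

ffilled = df2.reindex(date_index2, method="ffill")

# The three labels before the original index have nothing to fill from.
print(ffilled["prices"].iloc[:3].isna().all())  # True

# 2010-01-07 is filled from the last original label, 2010-01-06.
print(ffilled["prices"].iloc[-1])  # 88.0
```

The original NaN at 2010-01-03 is still left alone with ffill, for the same reason given in the docs note.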
