Skip to content

DOC: Improve reindex examples and docstring #10996

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Oct 26, 2015

Conversation

Winterflower
Copy link
Contributor

Fixes #10995

Done/waiting for feedback.

@jreback jreback added the Docs label Sep 4, 2015
@jreback
Copy link
Contributor

jreback commented Sep 9, 2015

lmk when you are finsihed

@jreback
Copy link
Contributor

jreback commented Oct 12, 2015

how's this coming?

@Winterflower
Copy link
Contributor Author

Working to finish this right now. ETA in 2 hours (depending on how fast the coffee kicks in...)

@Winterflower
Copy link
Contributor Author

hi @jreback , just trying to flesh out some doc for the filing methods in reindex and running into this situation

Pandas version 0.17.0

# create original dataframe
index2 = pd.date_range('1/1/2010', periods=5, freq='D')
sample_df2 = pd.DataFrame({"prices":[100, 101, np.nan, np.nan, 89]}, index=index2)
print sample_df2

>>>
            prices
2010-01-01     100
2010-01-02     101
2010-01-03     NaN
2010-01-04     NaN
2010-01-05      89

#create new index, reindex and use backfill
index_new = pd.date_range('12/29/2009', periods=8, freq='D')
sample_df2.reindex(index_new, method='bfill')

>>>
            prices
2009-12-29     100
2009-12-30     100
2009-12-31     100
2010-01-01     100
2010-01-02     101
2010-01-03     NaN
2010-01-04     NaN
2010-01-05      89

Question: Why isn't '89' being propagated backwards to fill the NaN in 2010-01-03 and 2010-01-04?
It seems that we only fill NaN's if they were introduced by the reindexing and not if they were present in the original dataframe before reindexing.
Just want to confirm that this is desired behavior and I'll make a note of it in the examples.

@jreback
Copy link
Contributor

jreback commented Oct 12, 2015

In [5]: sample_df2.reindex(index_new, method='bfill')
Out[5]: 
            prices
2009-12-29     100
2009-12-30     100
2009-12-31     100
2010-01-01     100
2010-01-02     101
2010-01-03     NaN
2010-01-04     NaN
2010-01-05      89

In [6]: sample_df2.reindex(index_new).bfill()        
Out[6]: 
            prices
2009-12-29     100
2009-12-30     100
2009-12-31     100
2010-01-01     100
2010-01-02     101
2010-01-03      89
2010-01-04      89
2010-01-05      89

see the issue here: #5669

I still don't buy that these should be different. Filling WHILE reindexing is IMHO just confusing. These should be separate operations.

@shoyer
Copy link
Member

shoyer commented Oct 12, 2015

Filling while reindexing does not look at the data values when deciding which should be propagated. That's a nice property, for a few reasons:

  1. it preserves NaNs in the original dataset
  2. If means you unambiguously preserve records when you reindex a dataframe -- what would you do otherwise if a row had a missing value in one column but not another?
  3. it means you can do the operation more efficiently, in a single take from the original array

@jreback
Copy link
Contributor

jreback commented Oct 12, 2015

@shoyer well, I disagree. Reindexing is completely different that setting the index, which is the point the other issue trying to make. Yes under certainly limited circumstances this makes sense, though I completely discount point (3) (it doesn't matter mucho IMHO).

Much more important is obvious semantics. This differs from df.reindex(....).ffill() and causes some confusion.

@shoyer
Copy link
Member

shoyer commented Oct 12, 2015

It certainly is a source of confusion, but it is following a sensible rule: the values reindexing looks at are only in the index. It's certainly worth clarifying when feasible in the docs.

I agree that efficiency does not matter here for pandas when the data is already all in memory, but it does make a difference for downstream libraries that copy the pandas API for out-of-core data. Namely, it's useful in xray when we define operations on dask arrays and also for dask.dataframe -- we never need to look at the array values to determine what the reindexing result looks like.

@jreback
Copy link
Contributor

jreback commented Oct 12, 2015

@shoyer exactly my point. Reindexing is hard enough for users. The rule is sensible, but not to the average user. This should NOT be advertised as how to fill thing at all. Rather .fillna and cousins.

@Winterflower
Copy link
Contributor Author

Thanks @jreback and @shoyer for the clarification.
I'll do my best to clarify/make note of it in the examples.

@jreback jreback added this to the 0.17.1 milestone Oct 13, 2015
@jreback
Copy link
Contributor

jreback commented Oct 13, 2015

since this is long (yeh!) do we need to paginate somehow? @jorisvandenbossche @shoyer


>>> index = ['Firefox', 'Chrome', 'Safari', 'IE10', 'Konqueror']
>>> df = pd.DataFrame({
... 'http_status':[200,200,404,404,301],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pep8: add a space after :

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, thanks! Caught a few other ones like this as well.

@Winterflower Winterflower force-pushed the pandas-reindex-doc branch 2 times, most recently from eec6a10 to 9a511fc Compare October 13, 2015 21:23
the keyword ``fill_value``. Please note: due to the fact
that the index is not monotonically increasing or
decreasing, we cannot use arguments to the keyword
``method`` to fill the ``NaN`` values.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this note should rather go in the explanation of the argument itself

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jorisvandenbossche agreed. Will fix this now, but I think I should leave this note in the examples as well, in case someone is just browsing the examples without studying the keyword descriptions in the API itself (I know I'm sometimes guilty of doing this one).

@jreback
Copy link
Contributor

jreback commented Oct 25, 2015

@Winterflower can you update according to comments

DOC: add more reindex examples

DOC: fixing some PEP8 issues

DOC: added shoyer suggestions

DOC: Fixes to reindex based on comments
@Winterflower
Copy link
Contributor Author

@jreback apologies for the delay!
added the fixes.

jreback added a commit that referenced this pull request Oct 26, 2015
DOC: Improve reindex examples and docstring
@jreback jreback merged commit 88e8d6e into pandas-dev:master Oct 26, 2015
@jreback
Copy link
Contributor

jreback commented Oct 26, 2015

thanks @Winterflower

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants