Doc: Adds example of exploding lists into columns instead of storing in dataframe cells #23041

Closed
wants to merge 6 commits into from

@mgautam98 mgautam98 commented Oct 8, 2018

codecov bot commented Oct 8, 2018

Codecov Report

Merging #23041 into master will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master   #23041   +/-   ##
=======================================
  Coverage   92.19%   92.19%           
=======================================
  Files         169      169           
  Lines       50873    50873           
=======================================
  Hits        46904    46904           
  Misses       3969     3969
Flag Coverage Δ
#multiple 90.61% <ø> (ø) ⬆️
#single 42.32% <ø> (ø) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ce1f81f...4952597. Read the comment docs.

@@ -336,3 +336,94 @@ constructors using something similar to the following:
See `the NumPy documentation on byte order
<https://docs.scipy.org/doc/numpy/user/basics.byteswapping.html>`__ for more
details.


Alternative to storing lists in DataFrame Cells
Member

This isn't an "alternative" to lists so much as a way of reshaping values; it would be better titled "Exploding List Items" or something to that effect.


Alternative to storing lists in DataFrame Cells
-----------------------------------------------
Storing nested lists/arrays inside a pandas object should be avoided for performance and memory use reasons. Instead they should be "exploded" into a flat ``DataFrame`` structure.
Member

Related to the comment above: this isn't an "alternative" to using lists within a DataFrame.


.. ipython:: python

df = pd.DataFrame({'name': ['A.J. Price'] * 3,
Member

I believe this is copied from the following SO article:

https://stackoverflow.com/questions/32468402/how-to-explode-a-list-inside-a-dataframe-cell-into-separate-rows

We need to be careful about copying and pasting items from SO into the code base. We would have to get express permission from the author to use it.

Contributor

@TomAugspurger TomAugspurger Oct 8, 2018

I think SO code snippets are CC BY-SA, so as long as we link back to the source (which we should be doing anyway) we're good.

https://stackoverflow.com/help/licensing

df

Stack the new columns as rows; this creates a new index level we'll want to drop in the next step.
Note that at this point we have a ``Series``, not a ``DataFrame``.
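
The code from the diff is truncated on this page, so the following is only a minimal sketch of the stacking step described above, not the PR's actual example. The column names ``name`` and ``tags`` and the data are hypothetical.

```python
import pandas as pd

# Hypothetical data: one list-valued cell per row.
df = pd.DataFrame({
    "name": ["a", "b"],
    "tags": [["x", "y"], ["z"]],
})

# Expand each list into its own columns (0, 1, ...); shorter lists are padded with missing values.
wide = pd.DataFrame(df["tags"].tolist(), index=df.index)

# Stack the columns into rows. The result is a Series with a two-level index;
# the inner level holds the former column labels. dropna() removes the padding.
stacked = wide.stack().dropna()

# Drop the inner index level, name the values, and join back to the other columns.
exploded = (
    df.drop(columns="tags")
      .join(stacked.reset_index(level=1, drop=True).rename("tags"))
      .reset_index(drop=True)
)
```

On pandas 0.25 or later, ``df.explode("tags")`` should cover the same reshaping in a single call.
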
Contributor

This needs to be in the cookbook instead. Please cut this down to a much simpler set of examples.

@datapythonista
Member

What I use to expand a Series with lists in the values to the corresponding DataFrame is:

import pandas

genres = pandas.Series([['drama', 'romance'], ['romance'], ['comedy', 'action']])
genres.str.join(',').str.get_dummies(',')

So, I think the content of this PR shouldn't be in the documentation (I know you just used the existing PR @mgautam98, sorry I didn't see that earlier).

Maybe a short cookbook entry showing that could be useful.

Closing, if anybody disagrees, please reopen.
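
For readers comparing the two approaches, here is a small sketch of what the ``get_dummies`` idiom from the comment above produces. It yields one indicator column per distinct value rather than one row per list item, so it is a different reshaping from the row-wise "explode" discussed in the diff. The ``|`` separator is a choice made here (the comment uses ``,``), and the sketch assumes that no label contains the separator.

```python
import pandas as pd

genres = pd.Series([["drama", "romance"], ["romance"], ["comedy", "action"]])

# Join each list into one delimited string, then split it into indicator columns.
# Assumption: no genre label contains the "|" separator.
dummies = genres.str.join("|").str.get_dummies("|")

# Columns are the distinct labels in sorted order; each row flags the labels
# present in the corresponding list.
print(dummies)
```
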

Successfully merging this pull request may close these issues.

DOC: section on caveats of storing lists inside DataFrame/Series
5 participants