Doc: Adds example of exploding lists into columns instead of storing in dataframe cells #23041

Closed
wants to merge 6 commits into from
91 changes: 91 additions & 0 deletions doc/source/gotchas.rst
@@ -336,3 +336,94 @@ constructors using something similar to the following:
See `the NumPy documentation on byte order
<https://docs.scipy.org/doc/numpy/user/basics.byteswapping.html>`__ for more
details.


Alternative to storing lists in DataFrame Cells
Member

This isn't an "alternative" to lists as much as a way of just reshaping values; this is better phrased as just "Exploding List Items" or something to the effect

-----------------------------------------------
Storing nested lists or arrays in the cells of a pandas object should be avoided, because it hurts performance and increases memory use. Instead, the nested values should be "exploded" into a flat ``DataFrame`` structure.
Member

Related to comment above this isn't an "alternative" to using lists within a DataFrame


Example of exploding nested lists into a DataFrame:

.. ipython:: python

   df = pd.DataFrame({'name': ['A.J. Price'] * 3,
Member

I believe this is copied from the following SO article:

https://stackoverflow.com/questions/32468402/how-to-explode-a-list-inside-a-dataframe-cell-into-separate-rows

Need to be careful copy / pasting items from SO into the code base. Would have to get express permission from author to use

Contributor

@TomAugspurger, Oct 8, 2018

I think SO code snippets are CC BY-SA, so as long as we link back to the source (which we should be doing anyway) then we're good.

https://stackoverflow.com/help/licensing

                      'opponent': ['76ers', 'blazers', 'bobcats']},
                     columns=['name', 'opponent'])
   df

   nearest_neighbors = [['Zach LaVine', 'Jeremy Lin',
                         'Nate Robinson', 'Isaia']] * 3
   nearest_neighbors

Expand the nested lists into columns and concatenate them with the "parent" columns.

.. ipython:: python

   df = pd.concat([df[['name', 'opponent']],
                   pd.DataFrame(nearest_neighbors)], axis=1)
   df

Set the "parent" columns as the index so that they are retained in the final DataFrame.

.. ipython:: python

   df = df.set_index(['name', 'opponent'])
   df

Stack the new columns as rows; this creates a new index level we'll want to drop in the next step.
Note that at this point we have a Series, not a DataFrame.

.. ipython:: python

   ser = df.stack()
   ser
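
To see why the next step is needed, it can help to inspect what ``stack`` returns. The sketch below uses a small hypothetical frame (the ``team`` index name and the cell values are assumptions chosen for illustration) and checks the result type and the number of index levels.

```python
import pandas as pd

# Hypothetical two-row frame with a named index and integer column labels,
# mirroring the shape produced by the previous step.
wide = pd.DataFrame([['a', 'b'], ['c', 'd']],
                    index=pd.Index(['x', 'y'], name='team'))

stacked = wide.stack()

# stack() moves the column labels into a new innermost index level,
# so the one-level index becomes a two-level MultiIndex on a Series.
print(type(stacked).__name__)
print(stacked.index.nlevels)

# Dropping that innermost level leaves only the original index labels,
# repeated once per stacked value.
flat = stacked.reset_index(level=-1, drop=True)
print(flat.index.tolist())
```

Using ``level=-1`` targets the innermost level regardless of how many "parent" levels the index has.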

Drop the extraneous index level created by the stack.

.. ipython:: python

   ser = ser.reset_index(level=2, drop=True)
   ser

Create a DataFrame from the Series.

.. ipython:: python

   df = ser.to_frame('nearest_neighbors')
   df
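
The steps above can also be written as a single method chain. This is only a sketch of the same sequence in plain Python rather than the ``ipython`` directive; the name ``exploded`` is hypothetical, and the data is the same as in the example.

```python
import pandas as pd

# Parent columns: one row per (name, opponent) pair.
df = pd.DataFrame({'name': ['A.J. Price'] * 3,
                   'opponent': ['76ers', 'blazers', 'bobcats']},
                  columns=['name', 'opponent'])

# One list of neighbors per row of df.
nearest_neighbors = [['Zach LaVine', 'Jeremy Lin',
                      'Nate Robinson', 'Isaia']] * 3

exploded = (
    # Expand the lists into columns next to the parent columns.
    pd.concat([df, pd.DataFrame(nearest_neighbors)], axis=1)
    # Keep the parent columns as the index.
    .set_index(['name', 'opponent'])
    # Turn the list columns into rows (adds an index level).
    .stack()
    # Drop the index level added by stack().
    .reset_index(level=2, drop=True)
    # Convert the resulting Series back into a one-column DataFrame.
    .to_frame('nearest_neighbors')
)
print(exploded)
```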


Example of exploding a list embedded in a DataFrame:

.. ipython:: python

   df = pd.DataFrame({'name': ['A.J. Price'] * 3,
                      'opponent': ['76ers', 'blazers', 'bobcats'],
                      'nearest_neighbors': [['Zach LaVine', 'Jeremy Lin',
                                             'Nate Robinson', 'Isaia']] * 3},
                     columns=['name', 'opponent', 'nearest_neighbors'])
   df

Create an index with the "parent" columns to be included in the final DataFrame.

.. ipython:: python

   df = df.set_index(['name', 'opponent'])
   df

Transform the column of lists into Series, which become the columns of a new DataFrame.
Note that only the index of the original ``df`` is retained; any other columns in the original ``df`` are not part of the new one.

.. ipython:: python

   df = df.nearest_neighbors.apply(pd.Series)
   df
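
As a point of comparison, the same wide frame can be built by passing the lists straight to the ``DataFrame`` constructor and reusing the existing index. This alternative is an assumption added for illustration, not part of the original example; the names ``via_apply`` and ``via_constructor`` are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['A.J. Price'] * 3,
    'opponent': ['76ers', 'blazers', 'bobcats'],
    'nearest_neighbors': [['Zach LaVine', 'Jeremy Lin',
                           'Nate Robinson', 'Isaia']] * 3,
}).set_index(['name', 'opponent'])

# Approach from the text: build one Series per list.
via_apply = df['nearest_neighbors'].apply(pd.Series)

# Alternative (not from the original example): hand the lists to the
# DataFrame constructor and reuse the existing index.
via_constructor = pd.DataFrame(df['nearest_neighbors'].tolist(),
                               index=df.index)

print(via_constructor)
```

Both results hold the same values under the same index. The constructor form avoids creating an intermediate ``Series`` for every row, which tends to matter on larger frames.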

Stack the new columns as rows; this creates a new index level we'll want to drop in the next step.
Note that at this point we have a Series, not a DataFrame.
Contributor

this needs to be in the cookbook instead. Please cut this down to a much simpler set of examples.


.. ipython:: python

   ser = df.stack()
   ser

Drop the extraneous index level created by the stack.

.. ipython:: python

   ser = ser.reset_index(level=2, drop=True)
   ser

Create a DataFrame from the Series.

.. ipython:: python

   df = ser.to_frame('nearest_neighbors')
   df
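
As a possible shortcut, assuming pandas 0.25 or later is available, ``DataFrame.explode`` performs this reshaping in one call. That method is not used in the example above; the sketch below only compares its result with the manual steps, and the names ``manual`` and ``built_in`` are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['A.J. Price'] * 3,
    'opponent': ['76ers', 'blazers', 'bobcats'],
    'nearest_neighbors': [['Zach LaVine', 'Jeremy Lin',
                           'Nate Robinson', 'Isaia']] * 3,
})

# Manual route, following the steps in the text.
manual = (
    df.set_index(['name', 'opponent'])['nearest_neighbors']
    .apply(pd.Series)
    .stack()
    .reset_index(level=2, drop=True)
    .to_frame('nearest_neighbors')
)

# Built-in route (assumes pandas >= 0.25): one row per list element,
# with the index labels repeated for each element.
built_in = df.set_index(['name', 'opponent']).explode('nearest_neighbors')

print(built_in)
```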