Doc: Adds example of exploding lists into columns instead of storing in dataframe cells #19215

Closed
wants to merge 2 commits into from
88 changes: 88 additions & 0 deletions doc/source/gotchas.rst
@@ -332,3 +332,91 @@ using something similar to the following:
See `the NumPy documentation on byte order
<https://docs.scipy.org/doc/numpy/user/basics.byteswapping.html>`__ for more
details.


Alternative to storing lists in DataFrame Cells
------------------------------------------------------
Review comment (Contributor): the underline needs to be the same length as the title

Storing nested lists or arrays inside a pandas object should be avoided for performance and memory reasons. Instead, they should be "exploded" into a flat ``DataFrame`` structure.
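To make the motivation concrete, here is a minimal sketch. The data is hypothetical and only modelled on the examples in this section. It shows that a column of lists is stored with ``object`` dtype, while the equivalent flat layout supports ordinary vectorized operations such as counting and boolean filtering.

```python
import pandas as pd

# Hypothetical data, modelled on the examples in this section.
nested = pd.DataFrame({
    'name': ['A.J. Price'] * 2,
    'opponent': ['76ers', 'blazers'],
    'nearest_neighbors': [['Zach LaVine', 'Jeremy Lin'],
                          ['Zach LaVine', 'Nate Robinson']],
})

# Each cell holds a Python list, so the column is stored as generic objects.
print(nested['nearest_neighbors'].dtype)

# The same information in a flat layout: one row per list element.
flat = pd.DataFrame({
    'name': ['A.J. Price'] * 4,
    'opponent': ['76ers', '76ers', 'blazers', 'blazers'],
    'nearest_neighbors': ['Zach LaVine', 'Jeremy Lin',
                          'Zach LaVine', 'Nate Robinson'],
})

# Vectorized operations now apply directly to the individual values.
counts = flat['nearest_neighbors'].value_counts()
mask = flat['nearest_neighbors'] == 'Jeremy Lin'
print(counts)
print(flat[mask])
```

The flat frame repeats the "parent" values on every row, which is the trade-off for being able to filter, group, and join on the individual list elements.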

Example of exploding nested lists into a DataFrame:
Review comment (Contributor): since you have 2 examples you can use another level of sub-section


.. ipython:: python

   df = pd.DataFrame({'name': ['A.J. Price'] * 3,
                      'opponent': ['76ers', 'blazers', 'bobcats']},
                     columns=['name', 'opponent'])
   df

   nearest_neighbors = [['Zach LaVine', 'Jeremy Lin', 'Nate Robinson', 'Isaia']] * 3
   nearest_neighbors

Review comment (Contributor): make things into separate ipython:: python blocks, rather than using comments (you can simply write text and not use the #)

   #. Combine the "parent" columns with the nested lists, which become columns in a new DataFrame.
   df2 = pd.concat([df[['name', 'opponent']], pd.DataFrame(nearest_neighbors)], axis=1)

Review comment (Contributor): you don't need to keep naming the dataframes, just use df = ..... or whatever

   df2

   #. Create an index from the "parent" columns to be included in the final DataFrame.
   df3 = df2.set_index(['name', 'opponent'])
   df3

   #. Stack the new columns as rows; this creates a new index level we'll want to drop in the next step.
   #  Note that at this point we have a Series, not a DataFrame.
   ser = df3.stack()
   ser

   #. Drop the extraneous index level created by the stack.
   ser.reset_index(level=2, drop=True, inplace=True)
   ser

   #. Create a DataFrame from the Series.
   df4 = ser.to_frame('nearest_neighbors')
   df4

   # All steps in one chain
   df4 = (df2.set_index(['name', 'opponent'])
          .stack()
          .reset_index(level=2, drop=True)
          .to_frame('nearest_neighbors'))
   df4
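When the lists live outside the frame, as in the example above, the same flat result can also be built without ``stack``. The following is a sketch of an alternative technique, not the approach used in this example: each parent row is repeated once per list element with ``Index.repeat``, and the flattened values are then attached as a new column. The names ``lengths`` and ``exploded`` are illustrative only.

```python
import itertools

import pandas as pd

df = pd.DataFrame({'name': ['A.J. Price'] * 3,
                   'opponent': ['76ers', 'blazers', 'bobcats']},
                  columns=['name', 'opponent'])
nearest_neighbors = [['Zach LaVine', 'Jeremy Lin', 'Nate Robinson', 'Isaia']] * 3

# Number of elements in each nested list (illustrative helper name).
lengths = [len(items) for items in nearest_neighbors]

# Repeat every parent row once per list element, then renumber the rows.
exploded = df.loc[df.index.repeat(lengths)].reset_index(drop=True)

# Flatten the nested lists in row order and attach them as a column.
exploded['nearest_neighbors'] = list(
    itertools.chain.from_iterable(nearest_neighbors))
print(exploded)
```

Unlike the ``stack`` version, this keeps ``name`` and ``opponent`` as ordinary columns rather than index levels, and it does not require the lists to have equal lengths.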

Example of exploding a list embedded in a DataFrame:

.. ipython:: python

   df = pd.DataFrame({'name': ['A.J. Price'] * 3,
                      'opponent': ['76ers', 'blazers', 'bobcats'],
                      'nearest_neighbors': [['Zach LaVine', 'Jeremy Lin', 'Nate Robinson', 'Isaia']] * 3},
                     columns=['name', 'opponent', 'nearest_neighbors'])
   df

   #. Create an index with the "parent" columns to be included in the final DataFrame.
   df2 = df.set_index(['name', 'opponent'])
   df2

   #. Transform the column with lists into Series, which become columns in a new DataFrame.
   #  Note that only the index from the original df is retained -
   #  any other columns in the original df are not part of the new df.
   df3 = df2.nearest_neighbors.apply(pd.Series)
   df3

   #. Stack the new columns as rows; this creates a new index level we'll want to drop in the next step.
   #  Note that at this point we have a Series, not a DataFrame.
   ser = df3.stack()
   ser

   #. Drop the extraneous index level created by the stack.
   ser.reset_index(level=2, drop=True, inplace=True)
   ser

   #. Create a DataFrame from the Series.
   df4 = ser.to_frame('nearest_neighbors')
   df4

   # All steps in one chain
   df4 = (df.set_index(['name', 'opponent'])
          .nearest_neighbors.apply(pd.Series)
          .stack()
          .reset_index(level=2, drop=True)
          .to_frame('nearest_neighbors'))
   df4
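For readers on a newer pandas, the embedded-list case above can likely be shortened. The sketch below assumes pandas 0.25 or later, where ``DataFrame.explode`` is available; that method is not part of this change and is shown only for comparison with the ``apply(pd.Series)`` and ``stack`` steps.

```python
import pandas as pd

df = pd.DataFrame({'name': ['A.J. Price'] * 3,
                   'opponent': ['76ers', 'blazers', 'bobcats'],
                   'nearest_neighbors': [['Zach LaVine', 'Jeremy Lin',
                                          'Nate Robinson', 'Isaia']] * 3},
                  columns=['name', 'opponent', 'nearest_neighbors'])

# Assumes pandas >= 0.25: explode turns each list element into its own row
# and repeats the index values of the row it came from.
exploded = (df.set_index(['name', 'opponent'])
              .explode('nearest_neighbors'))
print(exploded)
```

The result has the same shape as ``df4``: a ``(name, opponent)`` index and a single ``nearest_neighbors`` column with one value per row.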