
DOC: Add scaling to large datasets section #28577


Merged: 19 commits into pandas-dev:master on Oct 1, 2019

Conversation

TomAugspurger (Contributor):

Closes #28315

TomAugspurger added this to the 1.0 milestone on Sep 23, 2019

import pandas as pd
import numpy as np
from pandas.util.testing import make_timeseries
Member:

Just thinking through implications but is this something we really want to do? I feel like this is an unnecessary API exposure

Contributor Author:

Open to suggestions here. I've tried to make the API exposure as small as possible. But I need a way to

  1. Generate a semi-realistic dataset of arbitrary size.
  2. Without distracting from the overall message.

So I think the options are what I have here, or a private method like _make_timeseries() where the docs just describe the raw data contained in the file on disk (but we hide the generation of that file).
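
For context, a rough sketch of the kind of helper under discussion. The signature, column names, and defaults below are illustrative assumptions, not necessarily what ended up in pandas:

import numpy as np
import pandas as pd

def make_timeseries_sketch(start="2000-01-01", end="2000-12-31", freq="1T", seed=None):
    # Build a DatetimeIndex at the requested frequency, then attach a few
    # semi-realistic columns: an integer "id", a repeated "name", and two floats.
    index = pd.date_range(start=start, end=end, freq=freq, name="timestamp")
    state = np.random.RandomState(seed)
    n = len(index)
    return pd.DataFrame(
        {
            "id": state.randint(100, 150, size=n),
            "name": state.choice(["Alice", "Bob", "Charlie", "Dan", "Edith"], size=n),
            "x": state.rand(n) * 2 - 1,
            "y": state.rand(n) * 2 - 1,
        },
        index=index,
    )

Varying freq (or the start/end dates) scales the frame to an arbitrary size, which is the property the doc relies on.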

Member:

I think a private method and hiding the import would be preferable; maybe just a comment preceding the first usage saying # arbitrary large frame or something to that effect. The user shouldn't care about the import machinery.

Contributor Author:

Just to be clear, if I make it private, I'm not going to show it being imported.

TomAugspurger (Contributor Author), Sep 23, 2019:

Though I'll note in passing that a way to generate realistic, random sample datasets is nice :) But that's a larger discussion.

WillAyd (Member) commented Sep 23, 2019:

Thanks for addressing the comments. Generally looks good and I think a welcome addition to the docs.

TomAugspurger (Contributor Author):

And thank you for the review :)

FYI, this adds ~30s to the doc build on my machine (I haven't checked the slowdown on CI, but presumably it's longer). Is that unacceptable? I can make the examples smaller if needed.

jschendel (Member) left a comment:

Looks good overall, just some nitpicks on my part.

TomAugspurger (Contributor Author):

FYI, rendered version:

[Screenshot of the rendered "Scaling to large datasets" documentation page]

TomAugspurger (Contributor Author):

Any other thoughts here?

WillAyd (Member) left a comment:

Minor quibbles but I'd be OK with it as is anyway

.. ipython:: python

%%time
files = list(pathlib.Path("data/timeseries/").glob("ts*.parquet"))
Member:

I don't think it's necessary to encapsulate this in list().

Contributor Author:

Yeah, that's a leftover from before, when I printed out files. But that's covered by the tree output earlier on now that we aren't showing make_timeseries.

df = pd.read_parquet(path)
# ... plus a small Series `counts`, which is updated.
counts = counts.add(df['name'].value_counts(), fill_value=0)
counts.astype(int)
Member:

Is this necessary? It just seems like some cruft in here for dtype preservation. Ideally I would like to keep the code here to a minimum.

Contributor Author:

Without it, you get a float:

In [15]: t = pd.Series([1, 2])  # a small integer Series, shown here for completeness

In [16]: s = pd.Series(dtype=int)

In [17]: s.add(t, fill_value=0)
Out[17]:
0    1.0
1    2.0
dtype: float64

I think it'd be strange for a value_counts to return floating-point values in the counts.
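
For reference, the fragments under discussion assemble into a loop along these lines. This is a sketch: the data/timeseries/ts*.parquet paths come from the doc, and the int64 initialization plus the final astype(int) are what keep the counts integer-typed despite the float upcast shown above.

import pathlib
import pandas as pd

# Aggregate value counts one parquet file at a time, so only a single
# file's worth of data is ever held in memory.
counts = pd.Series(dtype="int64")
for path in pathlib.Path("data/timeseries/").glob("ts*.parquet"):
    df = pd.read_parquet(path)
    counts = counts.add(df["name"].value_counts(), fill_value=0)
counts = counts.astype(int)  # .add with fill_value upcasts to float64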

environment.yml Outdated
@@ -35,6 +35,8 @@ dependencies:
- nbconvert>=5.4.1
- nbsphinx
- pandoc
- dask
Member:

could this be an optional dependency?

Contributor Author:

This is for the dev env. We've been including all the dependencies necessary to build the docs.

Member:

must have been some misunderstanding in #27646 (comment). That was a dev-only dependency.

Member:

> This is for the dev env. We've been including all the dependencies necessary to build the docs.

That said, would it be possible to rely on just dask-core (+ what is needed for dask.dataframe), as distributed brings in a lot more dependencies?
(the one code block that shows the client could be a code-block)
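
For reference, the snippet in question is presumably along these lines; rendering it as a static code-block would mean distributed never has to be importable at doc-build time (a sketch, not the exact doc content):

from dask.distributed import Client

# Start a local cluster; shown as a non-executed code-block so the doc
# build itself does not require the distributed package.
client = Client()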

jorisvandenbossche (Member) left a comment:

Nice addition to the docs!

Not for this PR, but if we have an additional level of navigation, we could put this together with the current enhancingperf.rst in a "performance" section?

*************************

Pandas provides data structures for in-memory analytics, which makes using pandas
to analyze larger than memory datasets somewhat tricky.
Member:

This document is not only for "larger than memory" data, right? It already becomes tricky if your dataset is (some factor) smaller than your memory (because we create copies, because reading can take more memory, ...).

At least the first sections in this document equally apply as performance considerations for smaller-than-memory datasets.

Contributor Author:

Tried to clarify this a bit (in part by removing the "use efficient file formats" section).

%time _ = pd.read_parquet("timeseries.parquet")

Notice that parquet gives higher performance for reading (and writing), both
in terms of speed and lower peak memory usage. See :ref:`io` for more.
Member:

Maybe link to the section in io.rst that compares the performance of different formats?


Some workloads can be achieved with chunking: splitting a large problem like "convert this
directory of CSVs to parquet" into a bunch of small problems ("convert this individual parquet
file into a CSV. Now repeat that for each file in this directory."). As long as each chunk
Member:

"convert this individual parquet file into a CSV" -> "convert this individual CSV file into a Parquet file" ?

(then it matches the example in the previous sentence)
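
In code, the corrected example reads roughly like this (a sketch; the directory name and the presence of .csv inputs are assumptions for illustration):

import pathlib
import pandas as pd

# Chunked conversion: process one CSV at a time, so peak memory is bounded
# by the largest single file rather than by the whole directory.
for csv_path in pathlib.Path("data/timeseries/").glob("*.csv"):
    df = pd.read_csv(csv_path)
    df.to_parquet(csv_path.with_suffix(".parquet"))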


Pandas is just one library offering a DataFrame API. Because of its popularity,
pandas' API has become something of a standard that other libraries implement.
The pandas documentation maintains a list of libraries implemetning a DataFrame API
Member:

Suggested change
The pandas documentation maintains a list of libraries implemetning a DataFrame API
The pandas documentation maintains a list of libraries implementing a DataFrame API

TomAugspurger (Contributor Author):

> Not for this PR, but if we have an additional level of navigation, we could put this together with the current enhancingperf.rst in a "performance" section?

I slightly prefer smaller doc pages that cross-reference each other.

I've removed the section comparing the perf of read_csv and read_parquet. That muddies the purpose of this document, since it's talking about speed, while the rest of the document focuses primarily on memory usage.

I've updated the "subset of columns" example to use the %memit magic from memory_profiler.
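
A sketch of what that column-subset example might look like in an IPython session (it assumes memory_profiler is installed and a timeseries.parquet file like the one in the doc; the column names are illustrative):

In [1]: %load_ext memory_profiler

In [2]: import pandas as pd

In [3]: %memit pd.read_parquet("timeseries.parquet")

In [4]: %memit pd.read_parquet("timeseries.parquet", columns=["name", "x"])

The second call should report a noticeably smaller memory increment, since parquet only loads the requested columns.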

TomAugspurger (Contributor Author):

flake8-rst didn't like the Jupyter magic, so I've had to remove it.

jorisvandenbossche (Member):

> Not for this PR, but if we have an additional level of navigation, we could put this together with the current enhancingperf.rst in a "performance" section?

> I slightly prefer smaller doc pages that cross-reference each other.

Sorry, I was not very clear :) I didn't mean to suggest putting them in a single file. Because I also prefer smaller doc pages, I want to split some of the existing ones and create an extra level of hierarchy in the user guide. In that light, this one could fit together with the existing enhancingperf (which could be split into the Cython stuff and the querying) in a general "Performance" section (that has multiple sub-pages). But that's for the future, not now, so nothing to worry about for this PR.

> I've removed the section comparing the perf of read_csv and read_parquet. That muddies the purpose of this document, since it's talking about speed, while the rest of the document focuses primarily on memory usage.

I actually think that section could still be useful. read_csv vs read_parquet is not only about speed; memory-wise I would think that read_parquet is quite a bit better too (I heard that read_csv can need quite a bit more memory than the original file or the final dataframe). So maybe using %memit instead of %timeit would illustrate that?
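
The comparison being suggested would look roughly like this (a sketch with no measured numbers; it assumes the same data has been written to both timeseries.csv and timeseries.parquet and that memory_profiler is available):

In [1]: %load_ext memory_profiler

In [2]: import pandas as pd

In [3]: %memit pd.read_csv("timeseries.csv")

In [4]: %memit pd.read_parquet("timeseries.parquet")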

@@ -782,7 +782,8 @@ def test_categorical_no_compress():

def test_sort():

# http://stackoverflow.com/questions/23814368/sorting-pandas-categorical-labels-after-groupby # noqa: flake8
# http://stackoverflow.com/questions/23814368/
# sorting-pandas-categorical-labels-after-groupby # noqa: flake8
Member:

This is fixed on master (it's also what is causing the merge conflict)

- toolz>=0.7.3
- fsspec>=0.5.1
- partd>=0.3.10
- cloudpickle>=0.2.1
Member:

Can you add a comment here noting that those last four are just dependencies of dask.dataframe?

jreback (Contributor) commented Sep 26, 2019:

lgtm, needs a rebase though.

TomAugspurger (Contributor Author):

All green here.

WillAyd (Member) left a comment:

lgtm

TomAugspurger merged commit c13c13b into pandas-dev:master on Oct 1, 2019
TomAugspurger (Contributor Author):

Thanks.

TomAugspurger deleted the scale branch on October 1, 2019, 11:59
jorisvandenbossche (Member) commented Oct 1, 2019:

Not that it needed to be in this PR (could be a follow-up), but do you have an opinion on the usefulness of what I remarked in #28577 (comment) (the second paragraph, on reading files)?

josibake pushed a commit to josibake/pandas that referenced this pull request Oct 1, 2019
* DOC: Add scaling to large datasets section

Closes pandas-dev#28315
proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019
* DOC: Add scaling to large datasets section

Closes pandas-dev#28315
bongolegend pushed a commit to bongolegend/pandas that referenced this pull request Jan 1, 2020
* DOC: Add scaling to large datasets section

Closes pandas-dev#28315
Successfully merging this pull request may close these issues:

Add documentation section on Scaling
6 participants