
DOC: Add scaling to large datasets section #28577


Merged: 19 commits into pandas-dev:master on Oct 1, 2019

Conversation

TomAugspurger (Contributor):

Closes #28315

TomAugspurger added this to the 1.0 milestone on Sep 23, 2019

import pandas as pd
import numpy as np
from pandas.util.testing import make_timeseries
Member:

Just thinking through implications but is this something we really want to do? I feel like this is an unnecessary API exposure

Contributor Author:

Open to suggestions here. I've tried to make the API exposure as small as possible. But I need a way to

  1. Generate a semi-realistic dataset of arbitrary size.
  2. Without distracting from the overall message.

So I think the options are what I have here, or a private method like _make_timeseries() where the docs just describe the raw data contained in the file on disk (but we hide the generation of that file).
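
For context, a rough sketch of the kind of helper under discussion. The signature, column names, and defaults below are illustrative assumptions, not necessarily what ended up in pandas:

import numpy as np
import pandas as pd

def make_timeseries_sketch(start="2000-01-01", end="2000-12-31", freq="1T", seed=None):
    # Build a DatetimeIndex at the requested frequency, then attach a few
    # semi-realistic columns: an integer "id", a repeated "name", and two floats.
    index = pd.date_range(start=start, end=end, freq=freq, name="timestamp")
    state = np.random.RandomState(seed)
    n = len(index)
    return pd.DataFrame(
        {
            "id": state.randint(100, 150, size=n),
            "name": state.choice(["Alice", "Bob", "Charlie", "Dan", "Edith"], size=n),
            "x": state.rand(n) * 2 - 1,
            "y": state.rand(n) * 2 - 1,
        },
        index=index,
    )

Varying freq (or the start/end dates) scales the frame to an arbitrary size, which is the property the doc relies on.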

Member:

I think a private method and hiding the import would be preferable; maybe just a comment preceding the first usage saying # arbitrary large frame or something to that effect. The user shouldn't care about the import machinery.

Contributor Author:

Just to be clear, if I make it private, I'm not going to show it being imported.

TomAugspurger (Contributor Author), Sep 23, 2019:

Though I'll note in passing that a way to generate realistic, random sample datasets is nice :) But that's a larger discussion.

WillAyd (Member) commented Sep 23, 2019:

Thanks for addressing the comments. Generally looks good and I think a welcome addition to the docs.

TomAugspurger (Contributor Author):

And thank you for the review :)

FYI, this adds ~30s to the doc build on my machine (I haven't checked the slowdown on CI, but presumably it's longer). Is that unacceptable? I can make the examples smaller if needed.

jschendel (Member) left a comment:

Looks good overall, just some nitpicks on my part.

TomAugspurger (Contributor Author):

FYI, rendered version:

[Screenshot of the rendered "Scaling to large datasets" documentation page]

TomAugspurger (Contributor Author):

Any other thoughts here?

WillAyd (Member) left a comment:

Minor quibbles but I'd be OK with it as is anyway

.. ipython:: python

%%time
files = list(pathlib.Path("data/timeseries/").glob("ts*.parquet"))
Member:

I don't think it's necessary to encapsulate this in list().

Contributor Author:

Yeah, that's a leftover from before, when I printed out files. But that's covered by the tree output earlier on now that we aren't showing make_timeseries.

df = pd.read_parquet(path)
# ... plus a small Series `counts`, which is updated.
counts = counts.add(df['name'].value_counts(), fill_value=0)
counts.astype(int)
Member:

Is this necessary? It just seems like some cruft in here for dtype preservation. Ideally I would like to keep the code here to a minimum.

Contributor Author:

Without it, you get a float:

In [15]: t = pd.Series([1, 2])  # a small integer Series, shown here for completeness

In [16]: s = pd.Series(dtype=int)

In [17]: s.add(t, fill_value=0)
Out[17]:
0    1.0
1    2.0
dtype: float64

I think it'd be strange for a value_counts to return floating-point values in the counts.
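
For reference, the fragments under discussion assemble into a loop along these lines. This is a sketch: the data/timeseries/ts*.parquet paths come from the doc, and the int64 initialization plus the final astype(int) are what keep the counts integer-typed despite the float upcast shown above.

import pathlib
import pandas as pd

# Aggregate value counts one parquet file at a time, so only a single
# file's worth of data is ever held in memory.
counts = pd.Series(dtype="int64")
for path in pathlib.Path("data/timeseries/").glob("ts*.parquet"):
    df = pd.read_parquet(path)
    counts = counts.add(df["name"].value_counts(), fill_value=0)
counts = counts.astype(int)  # .add with fill_value upcasts to float64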

environment.yml Outdated
@@ -35,6 +35,8 @@ dependencies:
- nbconvert>=5.4.1
- nbsphinx
- pandoc
- dask
Member:

could this be an optional dependency?

Contributor Author:

This is for the dev env. We've been including all the dependencies necessary to build the docs.

Member:

must have been some misunderstanding in #27646 (comment). That was a dev-only dependency.

Member:

> This is for the dev env. We've been including all the dependencies necessary to build the docs.

That said, would it be possible to rely on just dask-core (+ what is needed for dask.dataframe), as distributed brings in a lot more dependencies?
(the one code block that shows the client could be a code-block)
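
For reference, the snippet in question is presumably along these lines; rendering it as a static code-block would mean distributed never has to be importable at doc-build time (a sketch, not the exact doc content):

from dask.distributed import Client

# Start a local cluster; shown as a non-executed code-block so the doc
# build itself does not require the distributed package.
client = Client()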

jorisvandenbossche (Member) left a comment:

Nice addition to the docs!

Not for this PR, but if we have an additional level of navigation, we could put this together with the current enhancingperf.rst in a "performance" section?

*************************

Pandas provides data structures for in-memory analytics, which makes using pandas
to analyze larger than memory datasets somewhat tricky.
Member:

This document is not only for "larger than memory" data, right? It already becomes tricky if your dataset is (some factor) smaller than your memory (because we create copies, because reading can take more memory, ...).

At least the first sections in this document equally apply as performance considerations for smaller-than-memory datasets.

Contributor Author:

Tried to clarify this a bit (in part by removing the "use efficient file formats" section).

%time _ = pd.read_parquet("timeseries.parquet")

Notice that parquet gives higher performance for reading (and writing), both
in terms of speed and lower peak memory usage. See :ref:`io` for more.
Member:

Maybe link to the section in io.rst that compares the performance of different formats?


Some workloads can be achieved with chunking: splitting a large problem like "convert this
directory of CSVs to parquet" into a bunch of small problems ("convert this individual parquet
file into a CSV. Now repeat that for each file in this directory."). As long as each chunk
Member:

"convert this individual parquet file into a CSV" -> "convert this individual CSV file into a Parquet file" ?

(then it matches the example in the previous sentence)
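
In code, the corrected example reads roughly like this (a sketch; the directory name and the presence of .csv inputs are assumptions for illustration):

import pathlib
import pandas as pd

# Chunked conversion: process one CSV at a time, so peak memory is bounded
# by the largest single file rather than by the whole directory.
for csv_path in pathlib.Path("data/timeseries/").glob("*.csv"):
    df = pd.read_csv(csv_path)
    df.to_parquet(csv_path.with_suffix(".parquet"))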


Pandas is just one library offering a DataFrame API. Because of its popularity,
pandas' API has become something of a standard that other libraries implement.
The pandas documentation maintains a list of libraries implemetning a DataFrame API
Member:

Suggested change
The pandas documentation maintains a list of libraries implemetning a DataFrame API
The pandas documentation maintains a list of libraries implementing a DataFrame API

TomAugspurger (Contributor Author):

> Not for this PR, but if we have an additional level of navigation, we could put this together with the current enhancingperf.rst in a "performance" section?

I slightly prefer smaller doc pages that cross-reference each other.

I've removed the section comparing the perf of read_csv and read_parquet. That muddies the purpose of this document, since it's talking about speed, while the rest of the document focuses primarily on memory usage.

I've updated the "subset of columns" example to use the %memit magic from memory_profiler.
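
A sketch of what that column-subset example might look like in an IPython session (it assumes memory_profiler is installed and a timeseries.parquet file like the one in the doc; the column names are illustrative):

In [1]: %load_ext memory_profiler

In [2]: import pandas as pd

In [3]: %memit pd.read_parquet("timeseries.parquet")

In [4]: %memit pd.read_parquet("timeseries.parquet", columns=["name", "x"])

The second call should report a noticeably smaller memory increment, since parquet only loads the requested columns.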

TomAugspurger (Contributor Author):

flake8-rst didn't like the Jupyter magic, so I've had to remove it.

jorisvandenbossche (Member):

> Not for this PR, but if we have an additional level of navigation, we could put this together with the current enhancingperf.rst in a "performance" section?

> I slightly prefer smaller doc pages that cross-reference each other.

Sorry, I was not very clear :) I didn't mean to suggest putting them in a single file. Because I also prefer smaller doc pages, I want to split some of the existing ones and create an extra level of hierarchy in the user guide. In that light, this one could fit together with the existing enhancingperf (which could be split into the Cython stuff and the querying) in a general "Performance" section (that has multiple sub-pages). But that's for the future, not now, so nothing to worry about for this PR.

> I've removed the section comparing the perf of read_csv and read_parquet. That muddies the purpose of this document, since it's talking about speed, while the rest of the document focuses primarily on memory usage.

I actually think that section could still be useful. read_csv vs read_parquet is not only about speed; memory-wise I would think that read_parquet is quite a bit better too (I heard that read_csv can need quite a bit more memory than the original file or the final dataframe). So maybe using %memit instead of %timeit would illustrate that?
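
The comparison being suggested would look roughly like this (a sketch with no measured numbers; it assumes the same data has been written to both timeseries.csv and timeseries.parquet and that memory_profiler is available):

In [1]: %load_ext memory_profiler

In [2]: import pandas as pd

In [3]: %memit pd.read_csv("timeseries.csv")

In [4]: %memit pd.read_parquet("timeseries.parquet")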

@@ -782,7 +782,8 @@ def test_categorical_no_compress():

def test_sort():

# http://stackoverflow.com/questions/23814368/sorting-pandas-categorical-labels-after-groupby # noqa: flake8
# http://stackoverflow.com/questions/23814368/
# sorting-pandas-categorical-labels-after-groupby # noqa: flake8
Member:

This is fixed on master (it's also what is causing the merge conflict)

- toolz>=0.7.3
- fsspec>=0.5.1
- partd>=0.3.10
- cloudpickle>=0.2.1
Member:

Can you add a comment here noting that those last four are just dependencies of dask.dataframe?

jreback (Contributor) commented Sep 26, 2019:

lgtm, needs a rebase though.

TomAugspurger (Contributor Author):

All green here.

WillAyd (Member) left a comment:

lgtm

TomAugspurger merged commit c13c13b into pandas-dev:master on Oct 1, 2019
TomAugspurger (Contributor Author):

Thanks.

TomAugspurger deleted the scale branch on October 1, 2019, 11:59
jorisvandenbossche (Member) commented Oct 1, 2019:

Not that it needed to be in this PR (could be a follow-up), but do you have an opinion on the usefulness of what I remarked in #28577 (comment) (the second paragraph, on reading files)?

josibake pushed a commit to josibake/pandas that referenced this pull request Oct 1, 2019
* DOC: Add scaling to large datasets section

Closes pandas-dev#28315
proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019
* DOC: Add scaling to large datasets section

Closes pandas-dev#28315
bongolegend pushed a commit to bongolegend/pandas that referenced this pull request Jan 1, 2020
* DOC: Add scaling to large datasets section

Closes pandas-dev#28315
Successfully merging this pull request may close these issues:

Add documentation section on Scaling
6 participants