DOC: Add scaling to large datasets section #28577
Conversation
doc/source/user_guide/scale.rst (Outdated)

    import pandas as pd
    import numpy as np
    from pandas.util.testing import make_timeseries
Just thinking through implications, but is this something we really want to do? I feel like this is an unnecessary API exposure.
Open to suggestions here. I've tried to make the API exposure as small as possible. But I need a way to
- Generate a semi-realistic dataset of arbitrary size.
- Without distracting from the overall message.
So I think the options are what I have here, or a private method like `_make_timeseries()`, where the docs just describe the raw data contained in the file on disk (but we hide the generation of that file).
I think a private method and hiding the import would be preferable; maybe just a comment preceding the first usage saying `# arbitrary large frame` or something to that effect. The user shouldn't care about the import machinery.
Just to be clear, if I make it private, I'm not going to show it being imported.
Though I'll note in passing that a way to generate realistic, sample random datasets is nice :) But that's a larger discussion.
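As a rough illustration of the kind of generator being discussed (a sketch only; the function name, columns, and frequency here are assumptions rather than the actual helper):

    import numpy as np
    import pandas as pd

    def _make_timeseries(start="2000-01-01", end="2000-12-31", freq="1min", seed=1):
        # Hypothetical stand-in for the private helper discussed above: a
        # semi-realistic frame of arbitrary size with a datetime index.
        index = pd.date_range(start=start, end=end, freq=freq, name="timestamp")
        n = len(index)
        state = np.random.RandomState(seed)
        names = ["Alice", "Bob", "Charlie", "Dan", "Edith", "Frank"]
        return pd.DataFrame(
            {
                "id": state.randint(100, 1000, size=n),
                "name": state.choice(names, size=n),
                "x": state.uniform(-1, 1, size=n),
                "y": state.uniform(-1, 1, size=n),
            },
            index=index,
        )

Tuning `start`/`end`/`freq` then controls how large the generated frame is.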
Thanks for addressing comments. Generally looks good and I think a welcome addition to the docs.
And thank you for the review :) FYI, this adds ~30s to the doc build on my machine (haven't checked the slowdown on CI, but presumably it's longer). Is that unacceptable? I can make the examples smaller if needed.
Looks good overall, just some nitpicks on my part.
Any other thoughts here?
Minor quibbles but I'd be OK with it as is anyway
doc/source/user_guide/scale.rst (Outdated)

    .. ipython:: python

       %%time
       files = list(pathlib.Path("data/timeseries/").glob("ts*.parquet"))
I don't think it's necessary to wrap this in `list`.
Yeah, that's a leftover from before, when I printed out `files`. But that's covered by the `tree` output earlier on, now that we aren't showing `make_timeseries`.
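For context, the chunked loop in the document looks roughly like this (a sketch assembled from the snippets quoted in this review; the path and `name` column come from those snippets):

    import pathlib

    import pandas as pd

    # glob() already returns an iterator, so no list(...) wrapper is needed
    files = pathlib.Path("data/timeseries/").glob("ts*.parquet")

    counts = pd.Series(dtype=int)
    for path in files:
        # only one chunk-sized DataFrame is in memory at a time...
        df = pd.read_parquet(path)
        # ...plus a small running Series of counts, which is updated
        counts = counts.add(df["name"].value_counts(), fill_value=0)
    counts = counts.astype(int)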
    df = pd.read_parquet(path)
    # ... plus a small Series `counts`, which is updated.
    counts = counts.add(df['name'].value_counts(), fill_value=0)
    counts.astype(int)
Is this necessary? It just seems like some cruft in here for dtype preservation. Ideally we'd keep the code here to a minimum.
Without it, you get a float:

    In [16]: s = pd.Series(dtype=int)

    In [17]: s.add(t, fill_value=0)
    Out[17]:
    0    1.0
    1    2.0
    dtype: float64

I think it'd be strange for a `value_counts` to return floating-point values in the counts.
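A small self-contained demo of that upcast (the values here are made up):

    import pandas as pd

    counts = pd.Series(dtype=int)                # running total starts out empty
    chunk = pd.Series([1, 2], index=["a", "b"])  # e.g. a value_counts() from one file

    total = counts.add(chunk, fill_value=0)
    print(total.dtype)        # float64: alignment with fill_value upcasts
    print(total.astype(int))  # cast back so the counts stay integers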
environment.yml (Outdated)

    @@ -35,6 +35,8 @@ dependencies:
      - nbconvert>=5.4.1
      - nbsphinx
      - pandoc
      - dask
could this be an optional dependency?
This is for the dev env. We've been including all the dependencies necessary to build the docs.
There must have been some misunderstanding in #27646 (comment). That was a dev-only dependency.
> This is for the dev env. We've been including all the dependencies necessary to build the docs.

That said, would it be possible to rely on just dask-core (+ what is needed for dask.dataframe), as distributed brings in a lot more dependencies?
(The one code block that shows the client could be a code-block.)
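For reference, such a non-executed block might look like this in the .rst (a sketch, not the exact snippet from the PR):

    .. code-block:: python

       from dask.distributed import Client

       # rendered in the docs but not run at doc-build time
       client = Client()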
Nice addition to the docs!
Not for this PR, but if we have an additional level of navigation, we could put this together with the current enhancingperf.rst in a "performance" section?
doc/source/user_guide/scale.rst (Outdated)

    *************************

    Pandas provides data structures for in-memory analytics, which makes using pandas
    to analyze larger than memory datasets somewhat tricky.
This document is not only for "larger than memory" data, right? It already becomes tricky if your dataset is only some factor smaller than your memory, right? (Because we create copies, because reading can take more memory, ...)
At least the first sections of this document apply equally as performance considerations for smaller-than-memory datasets.
Tried to clarify this a bit (in part by removing the "use efficient file formats" section).
doc/source/user_guide/scale.rst (Outdated)

    %time _ = pd.read_parquet("timeseries.parquet")

    Notice that parquet gives higher performance for reading (and writing), both
    in terms of speed and lower peak memory usage. See :ref:`io` for more.
Maybe link to the section in io.rst that compares the performance of different formats?
doc/source/user_guide/scale.rst (Outdated)

    Some workloads can be achieved with chunking: splitting a large problem like "convert this
    directory of CSVs to parquet" into a bunch of small problems ("convert this individual parquet
    file into a CSV. Now repeat that for each file in this directory."). As long as each chunk
"convert this individual parquet file into a CSV" -> "convert this individual CSV file into a Parquet file" ?
(then it matches the example in the previous sentence)
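As a sketch of the chunked conversion being described (directory names are illustrative):

    import pathlib

    import pandas as pd

    src = pathlib.Path("data/csv")      # hypothetical directory of input CSVs
    dst = pathlib.Path("data/parquet")  # hypothetical output directory
    dst.mkdir(parents=True, exist_ok=True)

    # one small problem per file: convert this individual CSV into a Parquet file
    for csv_path in src.glob("*.csv"):
        df = pd.read_csv(csv_path)
        df.to_parquet(dst / (csv_path.stem + ".parquet"))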
doc/source/user_guide/scale.rst (Outdated)

    Pandas is just one library offering a DataFrame API. Because of its popularity,
    pandas' API has become something of a standard that other libraries implement.
    The pandas documentation maintains a list of libraries implemetning a DataFrame API
Suggested change:

    - The pandas documentation maintains a list of libraries implemetning a DataFrame API
    + The pandas documentation maintains a list of libraries implementing a DataFrame API
I slightly prefer smaller doc pages that cross-reference each other. I've removed the section comparing the perf of `read_csv` and `read_parquet`. That muddies the purpose of this document, since it's talking about speed, while the rest of the document focuses primarily on memory usage. I've updated the "subset of columns" example to use the
flake8-rst didn't like Jupyter magics, so I've had to remove them.
Sorry, I was not very clear :) I didn't mean to suggest putting them in a single file. Because I also prefer smaller doc pages, I want to split some of the existing ones and create an extra level of hierarchy in the user guide. In that light, this one could fit together with the existing enhancingperf (which could be split into the Cython stuff and the querying) in a general "Performance" section (with multiple sub-pages). But that's for the future, not now, so nothing to worry about for this PR.
I actually think that section could still be useful. `read_csv` vs `read_parquet` is not only about speed; memory-wise I would also think that `read_parquet` is quite a bit better (I've heard that `read_csv` can need quite a bit more memory than the original file or the final dataframe). So maybe if using
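For the column-subsetting point, the comparison would look roughly like this (file and column names are assumptions based on the examples above):

    import pandas as pd

    cols = ["id", "name"]

    # usecols limits what the CSV parser keeps, but the text still has to be scanned
    df_csv = pd.read_csv("timeseries.csv", usecols=cols)

    # columns= lets parquet skip the other columns on disk entirely,
    # which helps both speed and peak memory usage
    df_parquet = pd.read_parquet("timeseries.parquet", columns=cols)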
    @@ -782,7 +782,8 @@ def test_categorical_no_compress():

    def test_sort():

    -    # http://stackoverflow.com/questions/23814368/sorting-pandas-categorical-labels-after-groupby # noqa: flake8
    +    # http://stackoverflow.com/questions/23814368/
    +    # sorting-pandas-categorical-labels-after-groupby # noqa: flake8
This is fixed on master (it's also what is causing the merge conflict)
    - toolz>=0.7.3
    - fsspec>=0.5.1
    - partd>=0.3.10
    - cloudpickle>=0.2.1
Can you add a comment here noting that these last 4 are just dependencies of dask.dataframe?
lgtm, needs a rebase though.
All green here.
lgtm
Thanks.
Not that it needed to be in this PR (could be a follow-up), but do you have an opinion on the usefulness of what I remarked in #28577 (comment) (the second paragraph, on reading files)?
* DOC: Add scaling to large datasets section Closes pandas-dev#28315
Closes #28315