DOC: update the pandas.DataFrame.to_sparse docstring #20193

Merged: 3 commits, Jul 10, 2018

Conversation

gioiab
Contributor

@gioiab gioiab commented Mar 10, 2018

Updates the docstring for the to_sparse function.

Here is the output of the validation script:


################################################################################
#################### Docstring (pandas.DataFrame.to_sparse) ####################
################################################################################

Convert to SparseDataFrame.

Implement the sparse version of the DataFrame meaning that any data matching
a specific value it's omitted in the representation. The sparse DataFrame takes
less memory on disk when pickled and in the Python interpreter.

Parameters
----------
fill_value : float, default NaN
    The specific value that should be omitted in the representation.
kind : {'block', 'integer'}
    The kind of the SparseIndex tracking where data has been omitted.
    The block kind is recommended since it’s more memory efficient:
    it tracks just the locations and sizes of the blocks of data that
    are not equal to the fill value while the integer kind keeps an
    array with all those locations.

Returns
-------
y : SparseDataFrame

See Also
--------
pandas.DataFrame.to_dense: converts the DataFrame back to the its dense form

Examples
--------

Compressing on the zero value.

>>> df = pd.DataFrame(np.random.randn(1000, 4))
>>> df.iloc[:995] = 0.
>>> sdf = df.to_sparse(fill_value=0.)
>>> sdf.density
0.005

################################################################################
################################## Validation ##################################
################################################################################

Docstring for "pandas.DataFrame.to_sparse" correct. :)

@pep8speaks

pep8speaks commented Mar 10, 2018

Hello @gioiab! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on July 10, 2018 at 03:48 Hours UTC


Implement the sparse version of the DataFrame meaning that any data matching
a specific value it's omitted in the representation. The sparse DataFrame takes
less memory on disk when pickled and in the Python interpreter.
Contributor

You can just say it has more efficient storage.

kind : {'block', 'integer'}
The kind of the SparseIndex tracking where data has been omitted.
The block kind is recommended since it’s more memory efficient:
it tracks just the locations and sizes of the blocks of data that
Contributor

can you break these into 2 bullet points


See Also
--------
pandas.DataFrame.to_dense: converts the DataFrame back to the its dense form
Contributor

pandas.SparseDataFrame.to_dense

@gioiab
Contributor Author

gioiab commented Mar 10, 2018

@jreback I've implemented the changes you requested; I'm just waiting for the CI to finish. Can you take a final look?

@gioiab
Contributor Author

gioiab commented Mar 10, 2018

@jreback the Appveyor build failed, together with at least 50 other ones, all with the same error message: Command executed with exception: Cannot index into a null array.

I don't have a retry button to queue another build. Could you please help me with this? Thanks!

Contributor

@jreback jreback left a comment

minor comments. lgtm.

the fill value:

- 'block' tracks only the locations and sizes of blocks of data;

Contributor

no blank lines between cases

Contributor Author

fixed


See Also
--------
pandas.SparseDataFrame.to_dense :
Contributor

don't need pandas. here

Contributor Author

I changed this to DataFrame.to_dense instead of pandas.SparseDataFrame.to_dense: I built the entire HTML documentation, and the hyperlink is generated correctly this way.

@jreback jreback added this to the 0.23.0 milestone Mar 11, 2018
@gioiab
Contributor Author

gioiab commented Mar 28, 2018

@jreback can I help you in some way in closing this? :)

@jreback
Contributor

jreback commented Apr 14, 2018

@gioiab can you rebase

@datapythonista if you'd have a look

@jreback jreback removed this from the 0.23.0 milestone Apr 14, 2018
@codecov

codecov bot commented Apr 14, 2018

Codecov Report

❗ No coverage uploaded for pull request base (master@44691ee).
The diff coverage is n/a.

@@            Coverage Diff            @@
##             master   #20193   +/-   ##
=========================================
  Coverage          ?   91.84%           
=========================================
  Files             ?      153           
  Lines             ?    49275           
  Branches          ?        0           
=========================================
  Hits              ?    45255           
  Misses            ?     4020           
  Partials          ?        0
| Flag | Coverage Δ |
|---|---|
| #multiple | 90.23% <ø> (?) |
| #single | 41.91% <ø> (?) |

| Impacted Files | Coverage Δ |
|---|---|
| pandas/core/frame.py | 97.15% <ø> (ø) |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 44691ee...dcdf0bb.

Member

@datapythonista datapythonista left a comment

Sorry for the late review. Great changes, I added some comments with formatting things and some ideas.

See Also
--------
DataFrame.to_dense :
converts the DataFrame back to the its dense form
Member

Can you capitalize the first letter and add a period at the end?

I think the description should start on the same line as to_dense and continue on the next line, indented, when it doesn't fit.

- 'block' tracks only the locations and sizes of blocks of data;
- 'integer' keeps an array with all the locations of the data.

The kind 'block' is recommended since it's more memory efficient.

Returns
-------
y : SparseDataFrame
Member

You can get rid of the y and just leave the type. Also, maybe it's a bit redundant in this case, but for consistency I'd add a description of what is returned.

The kind of the SparseIndex tracking where data is not equal to
the fill value:

- 'block' tracks only the locations and sizes of blocks of data;
Member

Not sure if the semi-colon at the end is intentional. If you find another docstring with a list, I'd use the same convention.

>>> df.iloc[:995] = 0.
>>> sdf = df.to_sparse(fill_value=0.)
>>> sdf.density
0.005
Member

Nice example. It's just an opinion, feel free to leave it like this, if you think that being somehow more realistic is better, but I'd have something like: pd.DataFrame([np.nan, np.nan, 1., np.nan]). You can make it sparse with default arguments, then create another example with zeros instead of NaN and use fill_value. And if you find a nice way to illustrate the difference, I'd add an example with kind='integer'.
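
A minimal sketch of the kind of examples suggested above. `DataFrame.to_sparse` was removed in pandas 1.0, so this assumes a recent pandas and uses `pd.arrays.SparseArray` as a stand-in; it is not the docstring example that was merged, and the variable names are only illustrative.

```python
import numpy as np
import pandas as pd

# Default fill value: NaN entries are the ones left out of the storage.
with_nan = pd.arrays.SparseArray([np.nan, np.nan, 1.0, np.nan])
print(with_nan.density)  # fraction of entries actually stored

# Zeros as the omitted value need an explicit fill_value.
with_zeros = pd.arrays.SparseArray([0.0, 0.0, 1.0, 0.0], fill_value=0.0)
print(with_zeros.density)

# Same data with the two index kinds; the dense values are the same either way.
as_block = pd.arrays.SparseArray(
    [0.0, 0.0, 1.0, 0.0], fill_value=0.0, kind="block"
)
as_integer = pd.arrays.SparseArray(
    [0.0, 0.0, 1.0, 0.0], fill_value=0.0, kind="integer"
)
print(type(as_block.sp_index).__name__, type(as_integer.sp_index).__name__)
```

With only one stored value out of four, both arrays have the same density; the `kind` argument changes only how the positions of stored values are tracked, not the values themselves.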


Parameters
----------
fill_value : float, default NaN
The specific value that should be omitted in the representation.
kind : {'block', 'integer'}
Member

The default is missing.

- 'block' tracks only the locations and sizes of blocks of data;
- 'integer' keeps an array with all the locations of the data.

The kind 'block' is recommended since it's more memory efficient.
Member

I'd say the same, but in a way that doesn't sound like 'block' is always better. Something like "In most cases, the default 'block' is preferred for being more memory efficient.".

Not a big difference, but I'd prefer to avoid giving the idea that 'integer' is never the best option, which is not true.
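
To make this point concrete, here is a rough pure-Python sketch of what each kind of index has to keep track of. It is not how pandas implements its sparse indexes: the helpers `integer_index`, `block_index` and `stored_ints` are hypothetical, and counting stored integers is only a crude proxy for memory use.

```python
def integer_index(values, fill_value):
    """Positions of every entry that differs from fill_value."""
    return [i for i, v in enumerate(values) if v != fill_value]


def block_index(values, fill_value):
    """(start, length) pairs for each contiguous run of non-fill entries."""
    blocks = []
    run_start = None
    for i, v in enumerate(values):
        if v != fill_value:
            if run_start is None:
                run_start = i  # a new run begins here
        elif run_start is not None:
            blocks.append((run_start, i - run_start))
            run_start = None
    if run_start is not None:  # run reaches the end of the data
        blocks.append((run_start, len(values) - run_start))
    return blocks


def stored_ints(index):
    """Crude size proxy: how many integers the index holds."""
    return sum(len(item) if isinstance(item, tuple) else 1 for item in index)


# The fill value is 0 here; a NaN fill value would need an isnan-style check.
clustered = [0, 0, 5, 6, 7, 8, 0, 0, 0, 0]  # one run of four values
scattered = [1, 0, 2, 0, 3, 0, 4, 0, 5, 0]  # five isolated values

for name, data in [("clustered", clustered), ("scattered", scattered)]:
    ints = integer_index(data, 0)
    blocks = block_index(data, 0)
    print(name, "integer:", stored_ints(ints), "block:", stored_ints(blocks))
```

In this toy model the clustered data needs fewer integers with the block layout, while the scattered data needs fewer with the integer layout, which is the sense in which 'block' is usually, but not always, the more compact choice.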

@datapythonista datapythonista merged commit 1dd05cc into pandas-dev:master Jul 10, 2018
@datapythonista
Member

Thanks @gioiab for the contribution. And sorry it took a while to merge it.

Sup3rGeo pushed a commit to Sup3rGeo/pandas that referenced this pull request Oct 1, 2018
* Updates the documentation for pandas.DataFrame.to_sparse.

* Minor fixes and adding more real world examples
Labels
Docs Sparse Sparse Data Type