DOC: Fix EX01 in DataFrame.drop_duplicates #33283

Merged
merged 9 commits into from
Apr 10, 2020

Conversation

farhanreynaldo
Contributor

  • closes #xxxx
  • tests added / passed
  • passes black pandas
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

Related to #27977.

################################################################################
################################## Validation ##################################
################################################################################

Member

@ShaharNaveh ShaharNaveh left a comment

@farhanreynaldo Thank you for working on this!


I have a few nits, otherwise LGTM


Examples
--------

Member

Remove the empty line.

... 'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
... 'rating': [4, 4, 3.5, 15, 5]},
... index=['TH', 'TH', 'ID', 'ID', 'ID'])

Member

Remove the empty line, so it won't split into sections

Contributor Author

I missed that, I'm gonna fix this issue.

Member

@ShaharNaveh ShaharNaveh left a comment

LGTM

>>> df = pd.DataFrame({'brand': brand,
... 'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
... 'rating': [4, 4, 3.5, 15, 5]},
... index=['TH', 'TH', 'ID', 'ID', 'ID'])
Member

I'm not sure if the index adds much value to the example. I'd remove it if it doesn't, so things are simpler and faster to read.

Also, did you run the validation script scripts/validate_docstrings.py pandas.DataFrame.drop_duplicates? I'm wondering if the indentation above is PEP-8 correct. The script should tell.
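
On the index question, a minimal sketch (assuming pandas is importable as `pd`; the names `with_index` and `default_index` are made up for illustration) suggests that the custom labels do not change which rows count as duplicates, which is consistent with the docstring's note that indexes are ignored:

```python
import pandas as pd

data = {
    'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
    'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
    'rating': [4, 4, 3.5, 15, 5],
}

# Same data, once with the custom index from the PR and once without it.
with_index = pd.DataFrame(data, index=['TH', 'TH', 'ID', 'ID', 'ID'])
default_index = pd.DataFrame(data)

kept_with_index = with_index.drop_duplicates()
kept_default = default_index.drop_duplicates()

# Only the row labels differ; the surviving rows are the same.
print(list(kept_with_index.index))  # ['TH', 'ID', 'ID', 'ID']
print(list(kept_default.index))     # [0, 2, 3, 4]
print(kept_with_index.reset_index(drop=True).equals(
    kept_default.reset_index(drop=True)))  # True
```

If that holds, dropping `index=` from the example loses nothing about how `drop_duplicates` behaves.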

Contributor Author

Hm, I agree, the index doesn't add any value, so I might as well remove it.
I also ran the validation script, and it returned:

(pandas-dev) E:\pandas>python scripts/validate_docstrings.py pandas.DataFrame.drop_duplicates

################################################################################
################# Docstring (pandas.DataFrame.drop_duplicates) #################
################################################################################

Return DataFrame with duplicate rows removed.

Considering certain columns is optional. Indexes, including time indexes
are ignored.

Parameters
----------
subset : column label or sequence of labels, optional
    Only consider certain columns for identifying duplicates, by
    default use all of the columns.
keep : {'first', 'last', False}, default 'first'
    Determines which duplicates (if any) to keep.
    - ``first`` : Drop duplicates except for the first occurrence.
    - ``last`` : Drop duplicates except for the last occurrence.
    - False : Drop all duplicates.
inplace : bool, default False
    Whether to drop duplicates in place or to return a copy.
ignore_index : bool, default False
    If True, the resulting axis will be labeled 0, 1, …, n - 1.

    .. versionadded:: 1.0.0

Returns
-------
DataFrame
    DataFrame with duplicates removed or None if ``inplace=True``.

See Also
--------
DataFrame.value_counts: Count unique combinations of columns.

Examples
--------
Consider dataset containing ramen rating.

>>> brand = ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie']
>>> df = pd.DataFrame({'brand': brand,
... 'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
... 'rating': [4, 4, 3.5, 15, 5]})
>>> df
    brand style  rating
0  Yum Yum   cup     4.0
1  Yum Yum   cup     4.0
2  Indomie   cup     3.5
3  Indomie  pack    15.0
4  Indomie  pack     5.0

By default, it removes duplicate rows based on all columns

>>> df.drop_duplicates()
    brand style  rating
0  Yum Yum   cup     4.0
2  Indomie   cup     3.5
3  Indomie  pack    15.0
4  Indomie  pack     5.0

To remove duplicates on specific column(s), use ``subset``

>>> df.drop_duplicates(subset=['brand'])
    brand style  rating
0  Yum Yum   cup     4.0
2  Indomie   cup     3.5

To remove duplicates and keep last occurences, use ``keep``

>>> df.drop_duplicates(subset=['brand', 'style'], keep='last')
    brand style  rating
1  Yum Yum   cup     4.0
2  Indomie   cup     3.5
4  Indomie  pack     5.0

################################################################################
################################## Validation ##################################
################################################################################
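
Separately from the script output above: the Parameters section documents `keep=False` and `ignore_index=True`, but the Examples section does not exercise them. A rough sketch of how they appear to behave on the same ramen data (assuming pandas 1.0 or later, since `ignore_index` was added in 1.0.0; the names `no_repeats` and `relabeled` are illustrative only):

```python
import pandas as pd

brand = ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie']
df = pd.DataFrame({
    'brand': brand,
    'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
    'rating': [4, 4, 3.5, 15, 5],
})

# keep=False drops every row that has a duplicate, so both
# 'Yum Yum' rows (labels 0 and 1) disappear.
no_repeats = df.drop_duplicates(keep=False)
print(list(no_repeats.index))  # [2, 3, 4]

# ignore_index=True keeps the first occurrences and relabels
# the result 0, 1, ..., n - 1.
relabeled = df.drop_duplicates(ignore_index=True)
print(list(relabeled.index))  # [0, 1, 2, 3]
```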

Member

hehe, I think the important part is the one that you didn't paste, after the Validation header. :) That's where it says whether any error has been found or everything is ok.

Contributor Author

Yeah, no error is shown after the Validation header ._.

Member

IIRC there should be a message saying there are no errors if that's the case. Maybe there is something broken.

But in any case, the CI is green, and this is a nice improvement. If there is any validation problem we can take care of it in the future.

Contributor Author

But I'm wondering which part of the documentation I could improve regarding the PEP-8 indentation. I could change it and run the validation script once again.

Member

It's the indentation, see:

df = pd.DataFrame({'brand': brand,
'style': ['cup', 'cup', 'cup', 'pack', 'pack'],

df = pd.DataFrame({'brand': brand,
                   'style': ['cup', 'cup', 'cup', 'pack', 'pack'],

df = pd.DataFrame({
    'brand': brand,
    'style': ['cup', 'cup', 'cup', 'pack', 'pack'],

The first one is the one in your code, and doesn't seem correct to me. The other two seem correct.
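
For completeness, the two layouts that look correct can be written out in full. This is only a sketch: the closing lines are taken from the docstring example, and the names `df_aligned` and `df_hanging` are illustrative. Inside a docstring, each continuation line would additionally carry the `... ` prompt.

```python
import pandas as pd

brand = ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie']

# Option 1: continuation lines aligned with the opening delimiter.
df_aligned = pd.DataFrame({'brand': brand,
                           'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
                           'rating': [4, 4, 3.5, 15, 5]})

# Option 2: hanging indent, nothing after the opening brace.
df_hanging = pd.DataFrame({
    'brand': brand,
    'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
    'rating': [4, 4, 3.5, 15, 5],
})

# Both spellings build the same DataFrame; only the layout differs.
print(df_aligned.equals(df_hanging))  # True
```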

Member

@datapythonista datapythonista left a comment

Cool, this looks great now. Thanks @farhanreynaldo

@jreback jreback added this to the 1.1 milestone Apr 10, 2020
@jreback jreback merged commit 916d1f3 into pandas-dev:master Apr 10, 2020
Contributor

jreback commented Apr 10, 2020

thanks @farhanreynaldo
