
Have the possibility for Series.unique() to return a Series rather than an array #1923


Closed
lbeltrame opened this issue Sep 17, 2012 · 13 comments · Fixed by #47274

@lbeltrame
Contributor

I admit I haven't looked at the code so there may be reasons for this, but I've found myself in the need of squeezing out duplicates from a Series but keeping the results as a Series.

Series.unique() however returns an array, so in my code I have to construct a Series twice:

import pandas
series = pandas.Series([1, 2, 3, 3])
# unique() returns an array, so the result has to be wrapped in a Series again
series = pandas.Series(series.unique())

Is this by design? If so, feel free to close this bug.
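
For readers following along, here is a minimal sketch of the behavior described above, assuming a recent pandas and an integer Series. It only shows that the result of `unique()` is a NumPy array and that a second `Series` construction is needed to get back to a Series.

```python
import numpy as np
import pandas as pd

series = pd.Series([1, 2, 3, 3])

# unique() returns the distinct values in order of appearance, without an index.
values = series.unique()
print(isinstance(values, np.ndarray))  # True

# Wrapping the array again yields a Series with a fresh 0..N-1 index.
deduplicated = pd.Series(values)
print(deduplicated.tolist())        # [1, 2, 3]
print(deduplicated.index.tolist())  # [0, 1, 2]
```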

@wesm
Member

wesm commented Sep 17, 2012

Well it's a good question. I guess the main issue is what index you should assign (default 0 to N-1 would be the only reasonable one probably, otherwise the index values where the unique values occurred).
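
The two index choices mentioned here can be compared with a small sketch. The labels `a` to `d` are made up for illustration, and the use of `np.unique(..., return_index=True)` is just one way to find first-occurrence positions, not necessarily how pandas would do it internally.

```python
import numpy as np
import pandas as pd

# Illustrative data: the string labels are an assumption for this example.
s = pd.Series([3, 1, 3, 2], index=["a", "b", "c", "d"])

# Option 1: default 0..N-1 index; the original labels are lost.
positional = pd.Series(s.unique())

# Option 2: keep the label at which each unique value first occurred.
# np.unique returns sorted values plus the position of each first occurrence;
# sorting those positions restores the order of appearance.
_, first_pos = np.unique(s.to_numpy(), return_index=True)
labelled = s.iloc[np.sort(first_pos)]

print(positional.index.tolist(), positional.tolist())  # [0, 1, 2] [3, 1, 2]
print(labelled.index.tolist(), labelled.tolist())      # ['a', 'b', 'd'] [3, 1, 2]
```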

@lbeltrame
Contributor Author

On Monday, 17 September 2012 05:35:34, Wes McKinney wrote:

assign (default 0 to N-1 would be the only reasonable one probably,
otherwise the index values where the unique values occurred).

I don't have strong opinions on either, any would be a very good improvement
over the current behavior IMO.

Luca Beltrame - KDE Forums team
KDE Science supporter
GPG key ID: 6E1A4E79

@gerigk

gerigk commented Sep 17, 2012

I think the second option (indices of the unique entries) would be helpful,
since it is easy to simply reset the index to 0...n-1 but much more
expensive to get the indices where they occur in case I am interested.
But if you added this, users would probably then ask for an option to
specify whether they get the index of the first, last or whatever occurrence of
the unique values ;-)


@wesm
Member

wesm commented Sep 17, 2012

I agree it would be helpful, but it is more expensive to compute. I'll have to think about it.

@changhiskhan
Contributor

I don't think unique should return a Series with a meaningless integer index; that seems harmful and confusing if the original Series also had an integer index.
As for computing the indices of occurrences, how about an optional return_loc parameter that is None by default (returns an ndarray) and can be "all", "first", or "last"?

@wesm
Member

wesm commented Sep 24, 2012

like how about this:

s.unique() -> no index
s.unique(index='first') -> Series
s.unique(index='last') -> Series
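
The proposed signature was never added to pandas as far as this thread shows, but its intent can be sketched with a standalone helper. The function name `unique_values` and its `index` argument are hypothetical and exist only for this illustration; the sketch leans on `Series.drop_duplicates`, which is a real pandas method.

```python
import pandas as pd


def unique_values(s, index=None):
    """Hypothetical helper mirroring the proposal above; not a pandas API."""
    if index is None:
        # No index requested: plain array of unique values.
        return s.unique()
    if index in ("first", "last"):
        # Keep the label of the first or last occurrence of each value.
        return s.drop_duplicates(keep=index)
    raise ValueError("index must be None, 'first' or 'last'")


s = pd.Series([1, 2, 3, 3], index=["a", "b", "c", "d"])
print(unique_values(s).tolist())                       # [1, 2, 3]
print(unique_values(s, index="first").index.tolist())  # ['a', 'b', 'c']
print(unique_values(s, index="last").index.tolist())   # ['a', 'b', 'd']
```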

@changhiskhan
Contributor

yeah, exactly what I was thinking

@lodagro
Contributor

lodagro commented Sep 25, 2012

s.unique() --> keep method as it is, a faster alternative to np.unique() --- no index

Add drop_duplicates() to Series?:
s.drop_duplicates(take_last=...) --> Series, index behavior like for DataFrame.drop_duplicates()

@wesm
Member

wesm commented Sep 25, 2012

That's not a bad idea either

@changhiskhan
Contributor

Maybe drop_duplicates to get first or last and then a separate method to get a reverse mapping of all indices for each unique value?
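
One possible way to build such a reverse mapping with existing pandas tools is to group the Series by its own values. This is only a sketch under the assumption of a uniquely labelled Series; the labels `a` to `d` and the name `locations` are made up for the example.

```python
import pandas as pd

# Illustrative data: the string labels are an assumption for this example.
s = pd.Series([1, 2, 3, 3], index=["a", "b", "c", "d"])

# Group the Series by its own values and collect the labels of each group.
# sort=False keeps the groups in order of first appearance.
locations = {
    value: group.index.tolist()
    for value, group in s.groupby(s, sort=False)
}

# The value 3 occurs twice, so it maps to both of its labels.
print(locations[3])  # ['c', 'd']
```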

@wesm
Member

wesm commented Oct 5, 2012

See DataFrame.duplicated, which returns a boolean array
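
As a rough illustration of the boolean-mask approach, the sketch below uses `Series.duplicated`, the Series counterpart of the DataFrame method. It assumes a pandas version in which `duplicated` accepts the `keep` argument.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 3])

# True marks an entry that repeats an earlier (or, with keep="last", a later) one.
print(s.duplicated().tolist())             # [False, False, False, True]
print(s.duplicated(keep="last").tolist())  # [False, False, True, False]
print(s.duplicated(keep=False).tolist())   # [False, False, True, True]

# Inverting the mask selects the non-duplicated entries and keeps their index.
deduplicated = s[~s.duplicated()]
print(deduplicated.index.tolist())  # [0, 1, 2]
```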

@wesm closed this as completed in 537e6a6 on Nov 28, 2012
@lazarillo

At the risk of appearing to re-open something that is dead, I thought I'd just summarize for any newcomers to this page:

series.unique was left unchanged (returns a numpy ndarray)
drop_duplicates and duplicated were added to Series.

So, if you wanted to do something like the OP and wanted to keep the indices, you'd perform:

import pandas
series = pandas.Series([1, 2, 3, 3]).drop_duplicates(keep='first')

(keep='first' retains the first occurrence of any duplicates, keep='last' retains the last occurrence, and keep=False retains NONE of the duplicates. keep='first' is the default.)
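
The three `keep` settings described above can be compared side by side. This is a minimal sketch on the OP's data; the final `reset_index(drop=True)` step is an optional addition for anyone who prefers a fresh 0..N-1 index over the original labels.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 3])

first = s.drop_duplicates()              # keep='first' is the default
last = s.drop_duplicates(keep="last")
neither = s.drop_duplicates(keep=False)  # drops every value that is duplicated

print(first.index.tolist())    # [0, 1, 2]
print(last.index.tolist())     # [0, 1, 3]
print(neither.tolist())        # [1, 2]

# Optional: discard the original labels in favor of a fresh 0..N-1 index.
renumbered = last.reset_index(drop=True)
print(renumbered.index.tolist())  # [0, 1, 2]
```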

@ddebernardy

In case anyone feels like writing a PR, the pandas.Series.unique docs don't mention what @lazarillo says and could use an extra line that references pandas.Series.drop_duplicates (and possibly vice versa).

7 participants