Add key to sorting functions #3942
Here's a specific use case that came up on StackOverflow recently.
from pandas import DataFrame
df = DataFrame({'a':range(5)},index=['0hr', '128hr', '72hr', '48hr', '96hr'])
print(df.sort())
This returns
The user would have liked to sort the index "naturally" using natsort, i.e.:
from natsort import natsort_keygen
print(df.sort(key=natsort_keygen()))
which would have returned
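For readers without natsort installed, the difference between the two orderings can be sketched with the standard library alone. This is only an illustration of the idea of a natural-sort key: `natural_key` is a hypothetical helper written for this example, not part of natsort or pandas, and it handles only simple digit runs.

```python
import re

def natural_key(label):
    """Split a label into text and digit chunks so digit runs compare as integers."""
    # re.split with a capturing group alternates text, digits, text, ...
    # so chunks at the same position always have the same type.
    parts = re.split(r'(\d+)', label)
    return [int(p) if p.isdigit() else p.lower() for p in parts]

index = ['0hr', '128hr', '72hr', '48hr', '96hr']

# Plain string sort compares character by character, so '128hr' lands before '48hr'.
plain = sorted(index)

# Sorting with the key compares the leading numbers numerically.
natural = sorted(index, key=natural_key)

print(plain)
print(natural)
```

The plain sort is what the index currently gets; the keyed sort is the order the user was after.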
|
Here's the way to do this (further you could use a
|
Agreed, in this specific example they could use I'm not really sure how to use |
see #9741 (big, long), but short answer is yes. |
If I understand properly, in order for Based on your suggestions, I am getting the impression that |
@SethMMorton I think we could add key to In categoricals, you can add things later if you want (and reorder to change the sorting order). |
Ok. Thanks for your time. I'll take a good, deep look at categoricals. |
An issue with the key argument is that it encourages writing non-vectorized code. I would rather support supplying an array to sort by instead, e.g., something like |
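The array-based alternative mentioned here can be sketched in plain Python as follows. This is an assumption about the general shape of the idea, not a pandas API: `sort_by_array` is a hypothetical helper, and the sort values are computed up front (ideally in a vectorized way) rather than by a per-element callback inside the sort.

```python
def sort_by_array(rows, sort_values):
    """Reorder rows according to a separately computed sequence of sort values."""
    if len(rows) != len(sort_values):
        raise ValueError("rows and sort_values must have the same length")
    # Argsort: positions of rows, ordered by the matching sort value.
    order = sorted(range(len(rows)), key=sort_values.__getitem__)
    return [rows[i] for i in order]

labels = ['0hr', '128hr', '72hr', '48hr', '96hr']

# The caller computes the sort values once, outside the sort itself.
hours = [int(label[:-2]) for label in labels]

result = sort_by_array(labels, hours)
print(result)
```

The trade-off is the one debated in this thread: the caller does one extra step, but the sort itself never calls back into user code.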
True. Maybe it would be nice if the |
first, sort by key is a universally known operation, why drag categories into it?
adding pandas already usefully accepts lambdas in many many places, you can always optimize later if you need to. why decide for every user and every problem that this specific boilerplate/performance tradeoff is worth it? the array you'd like to require is just as likely to be computed with the lambda you'd like to ban only with added inconvenience for users. |
With |
you wouldn't tell users to use category ordering when they want to sort the index lexically, and keyed sorting is no different. wanting to sort by a key function doesn't imply that the index is or should be categorical. categories are irrelevant unless the index data is categorical in nature to begin with. |
Let me add a more concrete example of when having a key argument would be useful. Suppose that a user had data in text files, and one of the columns contains distances with associated units, e.g. "45.232m" or "0.59472km". Let's say there are ~500,000 rows, and each has a different distance. Now, suppose the user wanted to sort based on the data in this column. Obviously, they will have to do some sort of transformation of this data to make it sortable, since a purely ordinal sort will not work. As far as I can tell, currently the two most obvious solutions are to a) make a new column of the transformation result and use that column for sorting, or b) make the column a category, sort the data in the list, and make the categories the sorted data.
import re
from pandas import read_csv
def transform_distance(x):
    """Convert string of value and unit to a float.
    Since this is just an example, skip error checking."""
    # Match the number explicitly; a greedy (.*) would swallow the 'k' of 'km'.
    m = re.match(r'([\d.]+)([mkc]?m)', x)
    units = {'m': 1, 'cm': 0.01, 'mm': 0.001, 'km': 1000}
    return float(m.group(1)) * units[m.group(2)]
df = read_csv('myfile.data')
# Sort method 1: Make a new column and sort on that.
df['distances_sort'] = df.distances.map(transform_distance)
df.sort('distances_sort')
# Sort method 2: Use categoricals
df.distances = df.distances.astype('category')
df.distances.cat.reorder_categories(sorted(df.distances, key=transform_distance), inplace=True, ordered=True)
df.sort('distances')
To me, neither seems entirely preferable, because method 1 adds extra data to the DataFrame, which takes up space and which I would have to filter out later if I want to write out to file, and method 2 requires sorting all the data in my column before I can sort the data in my DataFrame, which unless I am mistaken is not especially efficient. Things would be made worse if I then wanted to read in a second file and append that data to the DataFrame I already had, or if I wanted to modify the existing data in the "distances" column. I would then need to re-update my "distances_sort" column, or re-perform the If a
# Proposed sort method: Use a key argument
df.sort('distances', key=transform_distance)
Now, no matter how I update or modify my distances column, I do not need to do any additional pre-processing before sorting. The
# Supporting multi-column sorting with a key.
# In this case, only columns 'a' and 'c' would use the key for sorting,
# and 'b' and 'd' would sort in the standard way.
df.sort(['a', 'b', 'c', 'd'], key={'a': lambda x: x.lower(), 'c': transform_distance}) |
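The per-column key proposal can be illustrated without pandas. The sketch below is only a model of the proposed semantics, under the assumption that columns missing from the dict sort in the standard way; `sort_rows` is a hypothetical helper operating on a list of dicts, and the regex assumes simple decimal values followed by m, cm, mm, or km.

```python
import re

UNITS = {'m': 1.0, 'cm': 0.01, 'mm': 0.001, 'km': 1000.0}

def transform_distance(text):
    """Convert a string such as '0.59472km' to metres (no error checking)."""
    m = re.fullmatch(r'([\d.]+)\s*([mkc]?m)', text)
    return float(m.group(1)) * UNITS[m.group(2)]

def sort_rows(rows, columns, key=None):
    """Sort dict rows by several columns; `key` maps column name -> key function."""
    key = key or {}

    def row_key(row):
        # Columns without an entry in `key` are compared by their raw value.
        return tuple(key.get(col, lambda v: v)(row[col]) for col in columns)

    return sorted(rows, key=row_key)

rows = [
    {'a': 'beta', 'c': '0.5km'},
    {'a': 'Alpha', 'c': '45.2m'},
    {'a': 'alpha', 'c': '30cm'},
]

result = sort_rows(rows, ['a', 'c'], key={'a': str.lower, 'c': transform_distance})
for row in result:
    print(row)
```

Here 'Alpha' and 'alpha' tie on the first column once lowercased, so the distance column breaks the tie after conversion to metres.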
OK, I can see Instead of allowing key to accept dicts, what about sorting multiple columns -> the key function gets a tuple? |
As long as it is well documented how to use the key on multiple columns, I don't much care. Just having the key option would be a huge step in the right direction. |
I found this while searching for these kinds of usages. I really like @SethMMorton's idea and really hope this happens. In many circumstances, this makes more sense than categories. |
as indicated above, if someone wants to step up and add this functionality in a proper / tested way, then it is likely to be accepted. |
I did a quick dive into this since I needed it for a project, and I wanted to add some notes. First of all, I see three possible approaches to implementing this.
if key is not None:
    key_func = np.vectorize(key)
    k = key_func(k)
...
nargsort(k, ...)
This would be very useful for some of the things that have been discussed like a
if key is not None:
    # key is applied to each element in Python; sorted() yields positions into non_nans
    indexer = non_nan_idx[sorted(range(len(non_nans)), key=lambda i: key(non_nans[i]))]
else:
    indexer = non_nan_idx[non_nans.argsort(kind=kind)]
with a note making it clear that comparison with a key will be performed within Python. I personally think the second solution is attractive and fits with the Python sorting conventions (since it imitates |
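A self-contained model of the second approach, where the key is applied in Python and missing values are set aside before sorting, might look like the following. This is a sketch under assumptions, not the pandas internals: `argsort_with_key` is a hypothetical name, and missing values are always placed last here.

```python
import math

def argsort_with_key(values, key=None):
    """Return positions that sort `values`, applying `key` in Python; NaN/None go last."""
    def is_missing(v):
        return v is None or (isinstance(v, float) and math.isnan(v))

    # Separate positions of present and missing values, mirroring the
    # non_nans / non_nan_idx split in the snippet above.
    present = [i for i, v in enumerate(values) if not is_missing(v)]
    missing = [i for i, v in enumerate(values) if is_missing(v)]

    if key is None:
        present.sort(key=lambda i: values[i])
    else:
        present.sort(key=lambda i: key(values[i]))
    return present + missing

values = ['Hello', None, 'goodbye', 'apple']

indexer = argsort_with_key(values, key=str.lower)
default_indexer = argsort_with_key(values)

print([values[i] for i in indexer])
print([values[i] for i in default_indexer])
```

Because the key runs once per element in the interpreter, this is slower than a vectorized sort on large data, which is the caveat the note above refers to.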
My fork at https://github.com/ja3067/pandas has a key implemented for sort_values() and sort_index(). The following code works as expected:
>>> import pandas as pd
>>> df = pd.DataFrame(["Hello", "goodbye"])
>>> df.sort_values(0)
0
0 Hello
1 goodbye
>>> df.sort_values(0, key=str.lower)
0
1 goodbye
0 Hello
>>> df.sort_index(key=lambda x : -x)
0
1 goodbye
0 Hello
>>> ser = pd.Series([1, 2, 3])
>>> ser.sort_values(key=lambda x : -x)
2 3
1 2
0 1
>>> ser.sort_index(key=lambda x : -x)
2 3
1 2
0 1
I'm currently writing tests. |
Opened pull request #27237 to implement these features. |
Many python functions (sorting, max/min) accept a key argument, perhaps they could in pandas too.
.
The terrible motivating example was this awful hack from this question.... for which maybe one could do
This would still be an awful awful hack, but a slightly less awful one.