Commit 64d2270

Make select by substring
1 parent 7e4d306 commit 64d2270

5 files changed: +342 -0 lines changed

books.xml

Lines changed: 120 additions & 0 deletions
@@ -0,0 +1,120 @@
+<?xml version="1.0"?>
+<catalog>
+   <book id="bk101">
+      <author>Gambardella, Matthew</author>
+      <title>XML Developer's Guide</title>
+      <genre>Computer</genre>
+      <price>44.95</price>
+      <publish_date>2000-10-01</publish_date>
+      <description>An in-depth look at creating applications
+      with XML.</description>
+   </book>
+   <book id="bk102">
+      <author>Ralls, Kim</author>
+      <title>Midnight Rain</title>
+      <genre>Fantasy</genre>
+      <price>5.95</price>
+      <publish_date>2000-12-16</publish_date>
+      <description>A former architect battles corporate zombies,
+      an evil sorceress, and her own childhood to become queen
+      of the world.</description>
+   </book>
+   <book id="bk103">
+      <author>Corets, Eva</author>
+      <title>Maeve Ascendant</title>
+      <genre>Fantasy</genre>
+      <price>5.95</price>
+      <publish_date>2000-11-17</publish_date>
+      <description>After the collapse of a nanotechnology
+      society in England, the young survivors lay the
+      foundation for a new society.</description>
+   </book>
+   <book id="bk104">
+      <author>Corets, Eva</author>
+      <title>Oberon's Legacy</title>
+      <genre>Fantasy</genre>
+      <price>5.95</price>
+      <publish_date>2001-03-10</publish_date>
+      <description>In post-apocalypse England, the mysterious
+      agent known only as Oberon helps to create a new life
+      for the inhabitants of London. Sequel to Maeve
+      Ascendant.</description>
+   </book>
+   <book id="bk105">
+      <author>Corets, Eva</author>
+      <title>The Sundered Grail</title>
+      <genre>Fantasy</genre>
+      <price>5.95</price>
+      <publish_date>2001-09-10</publish_date>
+      <description>The two daughters of Maeve, half-sisters,
+      battle one another for control of England. Sequel to
+      Oberon's Legacy.</description>
+   </book>
+   <book id="bk106">
+      <author>Randall, Cynthia</author>
+      <title>Lover Birds</title>
+      <genre>Romance</genre>
+      <price>4.95</price>
+      <publish_date>2000-09-02</publish_date>
+      <description>When Carla meets Paul at an ornithology
+      conference, tempers fly as feathers get ruffled.</description>
+   </book>
+   <book id="bk107">
+      <author>Thurman, Paula</author>
+      <title>Splish Splash</title>
+      <genre>Romance</genre>
+      <price>4.95</price>
+      <publish_date>2000-11-02</publish_date>
+      <description>A deep sea diver finds true love twenty
+      thousand leagues beneath the sea.</description>
+   </book>
+   <book id="bk108">
+      <author>Knorr, Stefan</author>
+      <title>Creepy Crawlies</title>
+      <genre>Horror</genre>
+      <price>4.95</price>
+      <publish_date>2000-12-06</publish_date>
+      <description>An anthology of horror stories about roaches,
+      centipedes, scorpions and other insects.</description>
+   </book>
+   <book id="bk109">
+      <author>Kress, Peter</author>
+      <title>Paradox Lost</title>
+      <genre>Science Fiction</genre>
+      <price>6.95</price>
+      <publish_date>2000-11-02</publish_date>
+      <description>After an inadvertant trip through a Heisenberg
+      Uncertainty Device, James Salway discovers the problems
+      of being quantum.</description>
+   </book>
+   <book id="bk110">
+      <author>O'Brien, Tim</author>
+      <title>Microsoft .NET: The Programming Bible</title>
+      <genre>Computer</genre>
+      <price>36.95</price>
+      <publish_date>2000-12-09</publish_date>
+      <description>Microsoft's .NET initiative is explored in
+      detail in this deep programmer's reference.</description>
+   </book>
+   <book id="bk111">
+      <author>O'Brien, Tim</author>
+      <title>MSXML3: A Comprehensive Guide</title>
+      <genre>Computer</genre>
+      <price>36.95</price>
+      <publish_date>2000-12-01</publish_date>
+      <description>The Microsoft MSXML3 parser is covered in
+      detail, with attention to XML DOM interfaces, XSLT processing,
+      SAX and more.</description>
+   </book>
+   <book id="bk112">
+      <author>Galos, Mike</author>
+      <title>Visual Studio 7: A Comprehensive Guide</title>
+      <genre>Computer</genre>
+      <price>49.95</price>
+      <publish_date>2001-04-16</publish_date>
+      <description>Microsoft Visual Studio 7 is explored in depth,
+      looking at how Visual Basic, Visual C++, C#, and ASP+ are
+      integrated into a comprehensive development
+      environment.</description>
+   </book>
+</catalog>

pandas/core/frame.py

Lines changed: 82 additions & 0 deletions
@@ -31,8 +31,10 @@
     TYPE_CHECKING,
     Any,
     Literal,
+    List,
     cast,
     overload,
+    Union
 )
 import warnings

@@ -7716,6 +7718,86 @@ def nsmallest(
         Nauru         337000  182      NR
         """
         return selectn.SelectNFrame(self, n=n, keep=keep, columns=columns).nsmallest()
+
+    def select_by_substr(
+        self,
+        substr: Union[str, List[str]],
+        ignore_case: bool = True,
+    ) -> DataFrame | None:
+        """
+        Return columns whose names contain the specified substring(s).
+
+        Select and return all columns from the DataFrame whose names contain
+        the given substring or any of a list of substrings. By default, the
+        search is case-insensitive.
+
+        Parameters
+        ----------
+        substr : str or list of str
+            Substring or list of substrings to search for in column names.
+        ignore_case : bool, default True
+            Whether to ignore case when searching for substrings.
+
+        Returns
+        -------
+        DataFrame or None
+            DataFrame containing only the columns whose names match the
+            specified substring(s). Returns None if no columns match.
+
+        See Also
+        --------
+        DataFrame.filter : Subset the DataFrame rows or columns by index labels.
+        DataFrame.loc : Access a group of rows and columns by label(s) or a boolean array.
+
+        Notes
+        -----
+        All columns containing at least one of the provided substrings are
+        returned in their original order. If no columns match, None is returned.
+
+        Examples
+        --------
+        >>> df = pd.DataFrame({
+        ...     "first_name": ["Alice", "Bob"],
+        ...     "last_name": ["Smith", "Jones"],
+        ...     "age": [25, 30],
+        ...     "city": ["NY", "LA"]
+        ... })
+        >>> df.select_by_substr("name")
+          first_name last_name
+        0      Alice     Smith
+        1        Bob     Jones
+
+        >>> df.select_by_substr(["name", "city"])
+          first_name last_name city
+        0      Alice     Smith   NY
+        1        Bob     Jones   LA
+
+        >>> print(df.select_by_substr("AGE", ignore_case=False))  # No match due to case
+        None
+
+        >>> df.select_by_substr("AGE", ignore_case=True)
+           age
+        0   25
+        1   30
+        """
+        substr = [substr] if isinstance(substr, str) else substr
+
+        if ignore_case:
+            selected_cols = [
+                col for col in self.columns
+                if any(sub.casefold() in str(col).casefold()
+                       for sub in substr)
+            ]
+        else:
+            selected_cols = [
+                col for col in self.columns
+                if any(sub in str(col)
+                       for sub in substr)
+            ]
+
+        # Drop duplicate labels while keeping column order (set() would not).
+        selected_cols = list(dict.fromkeys(selected_cols))
+        return self[selected_cols] if selected_cols else None

     def swaplevel(self, i: Axis = -2, j: Axis = -1, axis: Axis = 0) -> DataFrame:
         """
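
The label-matching step of `select_by_substr` can be tried without a patched pandas build. The following is a minimal sketch in plain Python, not the code from this commit: `match_labels` is a hypothetical helper name, and it assumes that labels should come back in their original order and that non-string labels are compared through `str()`.

```python
from typing import Hashable, Iterable, List, Union


def match_labels(
    labels: Iterable[Hashable],
    substr: Union[str, List[str]],
    ignore_case: bool = True,
) -> List[Hashable]:
    """Return the labels containing any of the substrings, in original order."""
    needles = [substr] if isinstance(substr, str) else list(substr)
    if ignore_case:
        # casefold() is a more aggressive lower(); e.g. "ß" becomes "ss".
        needles = [n.casefold() for n in needles]

    matched = []
    for label in labels:
        text = str(label)  # tolerate non-string labels such as integers
        if ignore_case:
            text = text.casefold()
        if any(n in text for n in needles):
            matched.append(label)
    # dict.fromkeys drops duplicate labels but keeps first-seen order,
    # unlike set(), whose iteration order is arbitrary.
    return list(dict.fromkeys(matched))


columns = ["first_name", "last_name", "age", "city"]
print(match_labels(columns, "name"))  # ['first_name', 'last_name']
print(match_labels(columns, ["name", "city"]))  # ['first_name', 'last_name', 'city']
print(match_labels(columns, "AGE", ignore_case=False))  # []
```

With a DataFrame `df`, the resulting list could then be used as `df[match_labels(df.columns, "name")]`. Whether an empty match should yield `None` or an empty frame is a design choice left to the caller in this sketch.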

pandas/tests/frame/test_

Whitespace-only changes.

placeholder.txt

Lines changed: 125 additions & 0 deletions
@@ -0,0 +1,125 @@
+Return the first `n` rows ordered by `columns` in ascending order.
+
+Return the first `n` rows with the smallest values in `columns`, in
+ascending order. The columns that are not specified are returned as
+well, but not used for ordering.
+
+This method is equivalent to
+``df.sort_values(columns, ascending=True).head(n)``, but more
+performant.
+
+Parameters
+----------
+n : int
+    Number of items to retrieve.
+columns : list or str
+    Column name or names to order by.
+keep : {'first', 'last', 'all'}, default 'first'
+    Where there are duplicate values:
+
+    - ``first`` : take the first occurrence.
+    - ``last`` : take the last occurrence.
+    - ``all`` : keep all the ties of the largest item even if it means
+      selecting more than ``n`` items.
+
+Returns
+-------
+DataFrame
+    DataFrame with the first `n` rows ordered by `columns` in ascending order.
+
+See Also
+--------
+DataFrame.nlargest : Return the first `n` rows ordered by `columns` in
+    descending order.
+DataFrame.sort_values : Sort DataFrame by the values.
+DataFrame.head : Return the first `n` rows without re-ordering.
+
+Examples
+--------
+>>> df = pd.DataFrame(
+...     {
+...         "population": [
+...             59000000,
+...             65000000,
+...             434000,
+...             434000,
+...             434000,
+...             337000,
+...             337000,
+...             11300,
+...             11300,
+...         ],
+...         "GDP": [1937894, 2583560, 12011, 4520, 12128, 17036, 182, 38, 311],
+...         "alpha-2": ["IT", "FR", "MT", "MV", "BN", "IS", "NR", "TV", "AI"],
+...     },
+...     index=[
+...         "Italy",
+...         "France",
+...         "Malta",
+...         "Maldives",
+...         "Brunei",
+...         "Iceland",
+...         "Nauru",
+...         "Tuvalu",
+...         "Anguilla",
+...     ],
+... )
+>>> df
+          population      GDP alpha-2
+Italy       59000000  1937894      IT
+France      65000000  2583560      FR
+Malta         434000    12011      MT
+Maldives      434000     4520      MV
+Brunei        434000    12128      BN
+Iceland       337000    17036      IS
+Nauru         337000      182      NR
+Tuvalu         11300       38      TV
+Anguilla       11300      311      AI
+
+In the following example, we will use ``nsmallest`` to select the
+three rows having the smallest values in column "population".
+
+>>> df.nsmallest(3, "population")
+          population    GDP alpha-2
+Tuvalu         11300     38      TV
+Anguilla       11300    311      AI
+Iceland       337000  17036      IS
+
+When using ``keep='last'``, ties are resolved in reverse order:
+
+>>> df.nsmallest(3, "population", keep="last")
+          population  GDP alpha-2
+Anguilla       11300  311      AI
+Tuvalu         11300   38      TV
+Nauru         337000  182      NR
+
+When using ``keep='all'``, the number of elements kept can go beyond ``n``:
+if there are duplicate values for the largest element, all the
+ties are kept.
+
+>>> df.nsmallest(3, "population", keep="all")
+          population    GDP alpha-2
+Tuvalu         11300     38      TV
+Anguilla       11300    311      AI
+Iceland       337000  17036      IS
+Nauru         337000    182      NR
+
+However, ``nsmallest`` does not keep ``n`` distinct
+smallest elements:
+
+>>> df.nsmallest(4, "population", keep="all")
+          population    GDP alpha-2
+Tuvalu         11300     38      TV
+Anguilla       11300    311      AI
+Iceland       337000  17036      IS
+Nauru         337000    182      NR
+
+To order by the smallest values in column "population" and then "GDP", we can
+specify multiple columns like in the next example.
+
+>>> df.nsmallest(3, ["population", "GDP"])
+          population  GDP alpha-2
+Tuvalu         11300   38      TV
+Anguilla       11300  311      AI
+Nauru         337000  182      NR
+"""
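
The `keep` semantics described in the docstring text above can be modelled in plain Python. This is only an illustrative sketch of the documented behaviour, not how pandas implements `nsmallest`: the helper names `nsmallest_first` and `nsmallest_all` are hypothetical, and only the "population" column is modelled.

```python
import heapq

# (country, population) pairs from the docstring example above.
rows = [
    ("Italy", 59000000),
    ("France", 65000000),
    ("Malta", 434000),
    ("Maldives", 434000),
    ("Brunei", 434000),
    ("Iceland", 337000),
    ("Nauru", 337000),
    ("Tuvalu", 11300),
    ("Anguilla", 11300),
]


def population(row):
    return row[1]


def nsmallest_first(rows, n):
    """Model of keep='first': ties resolve in original row order."""
    # sorted() is stable, so rows with equal keys keep their input order.
    return sorted(rows, key=population)[:n]


def nsmallest_all(rows, n):
    """Model of keep='all': also keep every row tied with the n-th smallest."""
    ordered = sorted(rows, key=population)
    if n <= 0 or not ordered:
        return []
    cutoff = population(ordered[min(n, len(ordered)) - 1])
    return [row for row in ordered if population(row) <= cutoff]


first = [name for name, _ in nsmallest_first(rows, 3)]
all_ties = [name for name, _ in nsmallest_all(rows, 3)]
# heapq.nsmallest selects the same rows without sorting the whole input.
partial = [name for name, _ in heapq.nsmallest(3, rows, key=population)]

print(first)     # ['Tuvalu', 'Anguilla', 'Iceland']
print(all_ties)  # ['Tuvalu', 'Anguilla', 'Iceland', 'Nauru']
print(partial)   # ['Tuvalu', 'Anguilla', 'Iceland']
```

The `heapq` variant mirrors the docstring's point that a partial selection gives the same rows as a full sort followed by `head(n)`.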

test.py

Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
+# contributing guide: https://pandas.pydata.org/docs/dev/development/contributing.html#pushing-your-changes
+import pandas as pd
+
+df = pd.DataFrame({
+    "yes": [5000, 2, 3],
+    "Byesyes": [4, 5, 6],
+    "no": [7, 8, 9],
+    "Byesno": [10, 11, 12],
+    "YES": [13, 14, 15],
+    "NO": [16, 17, 18],
+    "YESYES": [19, 20, 21],
+})
+
+# Print the result for a substring that matches no column name
+print(df.select_by_substr("skibidi"))
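
The scratch script above only prints the result for one substring that matches nothing. A possible next step is a table of expected column labels that assertions could check. The sketch below works on the script's column names in plain Python; `expected_matches` is a hypothetical reference function written for illustration, assuming the matching rule in the method's docstring (substring containment, optional case folding, original column order).

```python
# Column names taken from the scratch script above.
columns = ["yes", "Byesyes", "no", "Byesno", "YES", "NO", "YESYES"]


def expected_matches(substr, ignore_case=True):
    """Hypothetical reference: labels containing any substring, in order."""
    subs = [substr] if isinstance(substr, str) else list(substr)
    # str.casefold normalizes case; str leaves the text unchanged.
    norm = str.casefold if ignore_case else str
    return [c for c in columns if any(norm(s) in norm(c) for s in subs)]


# (substr, ignore_case) -> column labels the selection should keep.
cases = [
    (("yes", True), ["yes", "Byesyes", "Byesno", "YES", "YESYES"]),
    (("yes", False), ["yes", "Byesyes", "Byesno"]),
    ((["no", "NO"], False), ["no", "Byesno", "NO"]),
    (("skibidi", True), []),  # no match: the docstring documents None here
]

for (substr, ignore_case), want in cases:
    got = expected_matches(substr, ignore_case)
    assert got == want, (substr, ignore_case, got)
```

In a real pandas test, each expected list would presumably be compared against `list(result.columns)` for the frame returned by `df.select_by_substr(...)`, with the no-match case checked separately.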
