@@ -22,12 +22,12 @@ The axis labeling information in pandas objects serves many purposes:
22
22
- Enables automatic and explicit data alignment
23
23
- Allows intuitive getting and setting of subsets of the data set
24
24
25
- In this section / chapter, we will focus on the latter set of functionality,
26
- namely how to slice, dice, and generally get and set subsets of pandas
27
- objects. The primary focus will be on Series and DataFrame as they have
28
- received more development attention in this area. More work will be invested in
29
- Panel and future higher-dimensional data structures in the future, especially
30
- in label-based advanced indexing.
25
+ In this section / chapter, we will focus on the final point: namely, how to
26
+ slice, dice, and generally get and set subsets of pandas objects. The primary
27
+ focus will be on Series and DataFrame as they have received more development
28
+ attention in this area. Expect more work to be invested in higher-dimensional data
29
+ structures (including Panel) in the future, especially in label-based advanced
30
+ indexing.
31
31
32
32
.. _indexing.basics:
33
33
@@ -115,19 +115,16 @@ label, respectively.
115
115
panel.major_xs(date)
116
116
panel.minor_xs('A')
117
117
118
- .. note ::
119
-
120
- See :ref: `advanced indexing <indexing.advanced >` below for an alternate and
121
- more concise way of doing the same thing.
122
-
123
118
Slicing ranges
124
119
~~~~~~~~~~~~~~
125
120
126
- :ref: `Advanced indexing <indexing.advanced >` detailed below is the most robust
127
- and consistent way of slicing ranges, e.g. ``obj[5:10] ``, across all of the data
128
- structures and their axes (except in the case of integer labels, more on that
129
- later). On Series, this syntax works exactly as expected as with an ndarray,
130
- returning a slice of the values and the corresponding labels:
121
+ The most robust and consistent way of slicing ranges along arbitrary axes is
122
+ described in the :ref:`Advanced indexing <indexing.advanced>` section detailing
123
+ the ``.ix`` method. For now, we explain the semantics of slicing using the
124
+ ``[]`` operator.
125
+
126
+ With Series, the syntax works exactly as with an ndarray, returning a slice of
127
+ the values and the corresponding labels:
131
128
132
129
.. ipython:: python
133
130
@@ -154,28 +151,37 @@ largely as a convenience since it is such a common operation.
154
151
Boolean indexing
155
152
~~~~~~~~~~~~~~~~
156
153
157
- Using a boolean vector to index a Series works exactly like an ndarray:
154
+ .. _indexing.boolean:
155
+
156
+ Using a boolean vector to index a Series works exactly as in a NumPy ndarray:
158
157
159
158
.. ipython:: python
160
159
161
160
s[s > 0]
162
161
s[(s < 0) & (s > -0.5)]
163
162
164
- Again as a convenience, selecting rows from a DataFrame using a boolean vector
165
- the same length as the DataFrame's index (for example, something derived from
166
- one of the columns of the DataFrame) is supported :
163
+ You may select rows from a DataFrame using a boolean vector the same length as
164
+ the DataFrame's index (for example, something derived from one of the columns
165
+ of the DataFrame):
167
166
168
167
.. ipython:: python
169
168
170
169
df[df['A'] > 0]
171
170
172
- As we will see later on, the same operation could be accomplished by
173
- reindexing. However, the syntax would be more verbose; hence, the inclusion of
174
- this indexing method.
171
+ Consider the ``isin`` method of Series, which returns a boolean vector that is
172
+ true wherever the Series elements exist in the passed list. This allows you to
173
+ select rows where one or more columns have values you want:
174
+
175
+ .. ipython:: python
176
+
177
+ df2 = DataFrame({'a': ['one', 'one', 'two', 'three', 'two', 'one', 'six'],
178
+                  'b': ['x', 'y', 'y', 'x', 'y', 'x', 'x'],
179
+                  'c': np.random.randn(7)})
180
+ df2[df2['a'].isin(['one', 'two'])]
175
181
176
- With the advanced indexing capabilities discussed later , you are able to do
177
- boolean indexing in any of axes or combine a boolean vector with an indexing
178
- expression on one of the other axes
182
+ Note that with the :ref:`advanced indexing <indexing.advanced>` ``ix`` method, you
183
+ may select along more than one axis using boolean vectors combined with other
184
+ indexing expressions.
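
Because ``isin`` returns an ordinary boolean vector, several such masks can be combined with ``&`` and ``|`` before indexing. The following is a minimal sketch rather than part of the original text: it assumes pandas is imported as ``pd``, and it uses fixed values for column ``c`` (instead of ``np.random.randn``) so the result is reproducible.

```python
import pandas as pd

# Same layout as the df2 example, but with fixed values in 'c'
# (an assumption made here so the output is deterministic).
df2 = pd.DataFrame({'a': ['one', 'one', 'two', 'three', 'two', 'one', 'six'],
                    'b': ['x', 'y', 'y', 'x', 'y', 'x', 'x'],
                    'c': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]})

# Each isin call yields a boolean Series aligned with df2's index.
mask_a = df2['a'].isin(['one', 'two'])
mask_b = df2['b'].isin(['x'])

# Combine the masks element-wise, then select the matching rows.
mask = mask_a & mask_b
selected = df2[mask]
print(selected)
```

Only the rows where both conditions hold are kept; the parentheses-free form works here because each mask is bound to a name before being combined.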
179
185
180
186
Indexing a DataFrame with a boolean DataFrame
181
187
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -202,19 +208,32 @@ Take Methods
202
208
203
209
TODO: Fill Me In
204
210
205
-
206
- Slicing ranges
211
+ Duplicate Data
207
212
~~~~~~~~~~~~~~
208
213
209
- Similar to Python lists and ndarrays, for convenience DataFrame
210
- supports slicing:
214
+ .. _indexing.duplicate:
211
215
212
- .. ipython :: python
216
+ If you want to identify and remove duplicate rows in a DataFrame, there are
217
+ two methods that will help: ``duplicated`` and ``drop_duplicates``. Each
218
+ takes as an argument the columns to use to identify duplicated rows.
219
+
220
+ ``duplicated`` returns a boolean vector whose length is the number of rows, and
221
+ which indicates whether a row is duplicated.
213
222
214
- df[:2 ]
215
- df[::- 1 ]
216
- df[- 3 :].T
223
+ ``drop_duplicates`` removes duplicate rows.
224
+
225
+ By default, the first observed row of a duplicate set is considered unique, but
226
+ each method has a ``take_last`` parameter that indicates the last observed row
227
+ should be taken instead.
228
+
229
+ .. ipython:: python
217
230
231
+ df2 = DataFrame({'a': ['one', 'one', 'two', 'three', 'two', 'one', 'six'],
232
+                  'b': ['x', 'y', 'y', 'x', 'y', 'x', 'x'],
233
+                  'c': np.random.randn(7)})
234
+ df2.duplicated(['a', 'b'])
235
+ df2.drop_duplicates(['a', 'b'])
236
+ df2.drop_duplicates(['a', 'b'], take_last=True)
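
The relationship between the two methods can be checked directly: dropping duplicates amounts to indexing with the negated ``duplicated`` mask. The sketch below is an illustration, not part of the original text. It assumes a recent pandas release, in which the ``take_last`` flag has been replaced by the ``keep`` argument (``keep='last'`` plays the role of ``take_last=True``), and it uses fixed values for column ``c`` so the result is reproducible.

```python
import pandas as pd

# Same layout as the df2 example, with fixed 'c' values (an assumption
# made here so the output is deterministic).
df2 = pd.DataFrame({'a': ['one', 'one', 'two', 'three', 'two', 'one', 'six'],
                    'b': ['x', 'y', 'y', 'x', 'y', 'x', 'x'],
                    'c': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]})

# True for every row whose ('a', 'b') pair was already seen earlier.
dup_mask = df2.duplicated(subset=['a', 'b'])

# Keep the first row of each duplicate set (the default behaviour).
first_kept = df2.drop_duplicates(subset=['a', 'b'])

# Keep the last row of each duplicate set instead; in newer pandas
# this is spelled keep='last' rather than take_last=True.
last_kept = df2.drop_duplicates(subset=['a', 'b'], keep='last')

# Indexing with the negated mask gives the same rows as drop_duplicates.
same_rows = df2[~dup_mask].equals(first_kept)
print(dup_mask.tolist(), list(first_kept.index), list(last_kept.index))
```

Rows 4 and 5 repeat the ``('two', 'y')`` and ``('one', 'x')`` pairs from rows 2 and 0, so the default keeps rows 0 and 2 while ``keep='last'`` keeps rows 4 and 5.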
218
237
219
238
.. _indexing.advanced:
220
239