UBC-DSCI · joelostblom · Jul 29, 2023 · Jan 5, 2023 · Jan 5, 2023 · Jan 5, 2023
@@ -1013,26 +1013,16 @@ The query (criteria we are using to select values) is input as a string. The `qu
 is less often used than the earlier approaches we introduced, but it can come in handy
 to make long chains of filtering operations a bit easier to read.
 
-(loc-iloc)=
-## Using `loc[]` to filter rows and select columns.
-```{index} pandas.DataFrame; loc[]
-```
+## Using `[]` to select ranges of columns
 
-The `[]` operation is only used when you want to filter rows or select columns;
-it cannot be used to do both operations at the same time. This is where `loc[]`
-comes in. For the first example, recall `loc[]` from Chapter {ref}`intro`,
-which lets us create a subset of columns from a data frame.
 Suppose we wanted to select only the columns `language`, `region`,
 `most_at_home` and `most_at_work` from the `tidy_lang` data set. Using what we
 learned in the chapter on {ref}`intro`, we would pass all of these column names into the square brackets.
 
 ```{code-cell} ipython3
 :tags: ["output_scroll"]
-selected_columns = tidy_lang.loc[:, ["language", "region", "most_at_home", "most_at_work"]]
-selected_columns
+tidy_lang[["language", "region", "most_at_home", "most_at_work"]]
 ```
-We pass `:` before the comma indicating we want to retrieve all rows, and the list indicates
-the columns that we want.
 
 Note that we could obtain the same result by stating that we would like all of the columns
 from `language` through `most_at_work`. Instead of passing a list of all of the column
@@ -1041,20 +1031,18 @@ you can read as "The columns from `language` to `most_at_work`".
 
 ```{code-cell} ipython3
 :tags: ["output_scroll"]
-selected_columns = tidy_lang.loc[:, "language":"most_at_work"]
-selected_columns
+tidy_lang["language":"most_at_work"]
 ```
 
 Similarly, you can ask for all of the columns including and after `language` by doing the following:
 
 ```{code-cell} ipython3
 :tags: ["output_scroll"]
-selected_columns = tidy_lang.loc[:, "language":]
-selected_columns
+tidy_lang["language":]
 ```
 
-By not putting anything after the `:`, python reads this as "from `language` until the last column".
-Although the notation for selecting a range using `:` is convienent because less code is required,
+By not putting anything after the `:`, Python reads this as "from `language` until the last column".
+Although the notation for selecting a range using `:` is convenient because less code is required,
 it must be used carefully. If you were to re-order columns or add a column to the data frame, the
 output would change. Using a list is more explicit and less prone to potential confusion.
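
As a hedged illustration of range selection, the sketch below uses a small made-up data frame (`toy`, not the real `tidy_lang` data) with the same column labels. It assumes standard pandas behaviour, where a slice passed to `loc[:, ...]` selects a range of columns by label, while a slice placed directly inside `[]` is applied to the rows.

```python
import pandas as pd

# Hypothetical toy data with the same column labels as `tidy_lang`
toy = pd.DataFrame({
    "category": ["Official", "Non-official", "Aboriginal"],
    "language": ["English", "Mandarin", "Cree"],
    "region": ["Toronto", "Vancouver", "Edmonton"],
    "most_at_home": [100, 50, 5],
    "most_at_work": [200, 10, 1],
})

# Label-based column range: all rows, columns `language` through `most_at_work`
col_range = toy.loc[:, "language":"most_at_work"]
print(list(col_range.columns))

# A positional slice inside plain [] is applied to the rows
first_two_rows = toy[0:2]
print(first_two_rows.shape)

# With a labelled row index, a label slice inside [] also selects rows
by_language = toy.set_index("language")
row_labels = list(by_language["English":"Mandarin"].index)
print(row_labels)
```

Because the row and column interpretations differ, it is worth checking the shape of the result whenever a slice is used for selection.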
 
@@ -1065,7 +1053,7 @@ us to select variables based on their names. In particular, we can use the `.str
 to choose only the columns that start with the word "most":
 
 ```{code-cell} ipython3
-tidy_lang.loc[:, tidy_lang.columns.str.startswith('most')]
+tidy_lang[tidy_lang.columns.str.startswith('most')]
 ```
 
 ```{index} pandas.Series; str.contains
@@ -1076,46 +1064,73 @@ We could also have chosen the columns containing an underscore `_` by using the
 the columns we want contain underscores and the others don't.
 
 ```{code-cell} ipython3
-tidy_lang.loc[:, tidy_lang.columns.str.contains('_')]
+tidy_lang[tidy_lang.columns.str.contains('_')]
+```
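
The name-based selection described here can be sketched on a small made-up data frame (`toy` is hypothetical, not the real `tidy_lang` data). The sketch assumes standard pandas behaviour: `columns.str.startswith` and `columns.str.contains` return one boolean per column, and passing that mask in the column position of `loc[]` keeps the matching columns. A boolean mask passed to plain `[]` is matched against the rows instead, so `loc[]` is used here.

```python
import pandas as pd

# Hypothetical toy data with the same column labels as `tidy_lang`
toy = pd.DataFrame({
    "category": ["Official", "Non-official", "Aboriginal"],
    "language": ["English", "Mandarin", "Cree"],
    "region": ["Toronto", "Vancouver", "Edmonton"],
    "most_at_home": [100, 50, 5],
    "most_at_work": [200, 10, 1],
})

# One boolean per column: True where the label starts with "most"
starts_with_most = toy.columns.str.startswith("most")
print(starts_with_most)

# The mask goes after the comma, so it is applied to the columns
most_cols = toy.loc[:, starts_with_most]
print(list(most_cols.columns))

# Same idea for labels that contain an underscore
underscore_cols = toy.loc[:, toy.columns.str.contains("_")]
print(list(underscore_cols.columns))
```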
+
+(loc-iloc)=
+## Using `loc[]` to filter rows and select columns
+
+```{index} pandas.DataFrame; loc[]
 ```
 
-There are many different functions that help with selecting
-variables based on certain criteria.
-The additional resources section at the end of this chapter
-provides a comprehensive resource on these functions.
+The `[]` operation is only used when you want to either filter rows **or** select columns;
+it cannot be used to do both operations at the same time. This is where `loc[]`
+comes in. For the first example, recall `loc[]` from Chapter {ref}`intro`,
+which lets us create a subset of columns from a data frame.
 
 ```{code-cell} ipython3
-:tags: [remove-cell]
+:tags: ["output_scroll"]
+tidy_lang.loc[
+    tidy_lang['region'] == 'Toronto',
+    ["language", "region", "most_at_home", "most_at_work"]
+]
+```
+
+`loc[]` also works with ranges of columns:
 
-# There are many different `select` helpers that select
-# variables based on certain criteria.
-# The additional resources section at the end of this chapter
-# provides a comprehensive resource on `select` helpers.
+```{code-cell} ipython3
+:tags: ["output_scroll"]
+tidy_lang.loc[
+    tidy_lang['region'] == 'Toronto',
+    "language":"most_at_work"
+]
 ```
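
To make the contrast with `[]` concrete, here is a minimal sketch on a made-up data frame (`toy` is hypothetical). It compares two chained `[]` operations, one to filter rows and one to select columns, with a single `loc[]` call that does both; under standard pandas behaviour the two results should match.

```python
import pandas as pd

# Hypothetical toy data with the same column labels as `tidy_lang`
toy = pd.DataFrame({
    "category": ["Official", "Non-official", "Aboriginal", "Official"],
    "language": ["English", "Mandarin", "Cree", "French"],
    "region": ["Toronto", "Vancouver", "Edmonton", "Toronto"],
    "most_at_home": [100, 50, 5, 20],
    "most_at_work": [200, 10, 1, 30],
})

wanted = ["language", "most_at_home"]
is_toronto = toy["region"] == "Toronto"

# Two steps with []: filter the rows first, then select the columns
two_steps = toy[is_toronto][wanted]

# One step with loc[]: row condition before the comma, columns after it
one_step = toy.loc[is_toronto, wanted]

print(one_step)
print(two_steps.equals(one_step))
```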
 
-## Using `iloc[]` to extract a range of columns
+## Using `iloc[]` to extract rows and columns by position
 ```{index} pandas.DataFrame; iloc[], column range
 ```
-Another approach for selecting columns is to use `iloc[]`,
-which provides the ability to index with integers rather than the names of the columns.
-For example, the column names of the `tidy_lang` data frame are
+Another approach for selecting rows and columns is to use `iloc[]`,
+which lets us index rows and columns by integer position rather than by label.
+For example, the column labels of the `tidy_lang` data frame are
 `['category', 'language', 'region', 'most_at_home', 'most_at_work']`.
 Using `iloc[]`, you can ask for the `language` column by requesting the
 column at index `1` (remember that Python starts counting at `0`, so the second item `'language'`
 has index `1`!).
 
 ```{code-cell} ipython3
-column = tidy_lang.iloc[:, 1]
-column
+tidy_lang.iloc[:, 1]
 ```
 
-You can also ask for multiple columns, just like we did with `[]`. We pass `:` before
-the comma, indicating we want to retrieve all rows, and `1:` after the comma
+We pass `:` before the comma, indicating we want to retrieve all rows.
+You can also ask for multiple columns:
+here we pass `1:` after the comma,
 indicating we want columns after and including index 1 (*i.e.* `language`).
 
 ```{code-cell} ipython3
-column_range = tidy_lang.iloc[:, 1:]
-column_range
+tidy_lang.iloc[:, 1:]
+```
+
+We can also use `iloc[]` to select ranges of rows, using a similar syntax.
+For example, to select the first ten rows, we could use the following:
+
+```{code-cell} ipython3
+tidy_lang.iloc[:10, :]
+```
+
+`pandas` also provides a shorthand for selecting ranges of rows by using `[]`:
+
+```{code-cell} ipython3
+tidy_lang[:10]
 ```
 
 The `iloc[]` method is less commonly used, and needs to be used with care.
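
One reason for care can be sketched on a made-up data frame (`toy` is hypothetical): positions carry no meaning of their own, so the same `iloc[]` call can return a different column once the columns are reordered. This is only an illustration of position-based indexing under that assumption.

```python
import pandas as pd

# Hypothetical toy data with the same column labels as `tidy_lang`
toy = pd.DataFrame({
    "category": ["Official", "Non-official", "Aboriginal"],
    "language": ["English", "Mandarin", "Cree"],
    "region": ["Toronto", "Vancouver", "Edmonton"],
    "most_at_home": [100, 50, 5],
    "most_at_work": [200, 10, 1],
})

# Position 1 is the second column, which here is `language`
second_col = toy.iloc[:, 1]
print(second_col.name)

# First two rows, and every column from position 1 onward
block = toy.iloc[:2, 1:]
print(block.shape)

# After reordering the columns, position 1 points at a different column
reordered = toy[["region", "category", "language", "most_at_home", "most_at_work"]]
second_col_reordered = reordered.iloc[:, 1]
print(second_col_reordered.name)
```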
@@ -1251,52 +1266,44 @@ summary statistics that you can compute with `pandas`.
 What if you want to calculate summary statistics on an entire data frame? Well,
 it turns out that the functions in {numref}`tab:basic-summary-statistics`
 can be applied to a whole data frame!
-For example, we can ask for the number of rows that each column has using `count`.
-```{code-cell} ipython3
-region_lang.count()
-```
-Not surprisingly, they are all the same. We could also ask for the `mean`, but
-some of the columns in `region_lang` contain string data with words like `"Vancouver"`
-and `"Halifax"`---for these columns there is no way for `pandas` to compute the mean.
-So we provide the keyword `numeric_only=True` so that it only computes the mean of columns with numeric values. This
-is also needed if you want the `sum` or `std`.
-```{code-cell} ipython3
-region_lang.mean(numeric_only=True)
-```
-If we ask for the `min` or the `max`, `pandas` will give you the smallest or largest number
-for columns with numeric values. For columns with text, it will return the
-least repeated value for `min` and the most repeated value for `max`. Again,
-if you only want the minimum and maximum value for
-numeric columns, you can provide `numeric_only=True`.
+For example, we can ask for the maximum value of each column using `max`.
+
 ```{code-cell} ipython3
 region_lang.max()
 ```
+
+We can see that for columns that contain string data
+with words like `"Vancouver"` and `"Halifax"`,
+the maximum value is determined by sorting the strings alphabetically
+and returning the last value.
+If we only want the maximum value for
+numeric columns,
+we can provide `numeric_only=True`:
+
 ```{code-cell} ipython3
-region_lang.min()
+region_lang.max(numeric_only=True)
 ```
 
-Similarly, if there are only some columns for which you would like to get summary statistics,
-you can first use `loc[]` and then ask for the summary statistic. An example of this is illustrated in {numref}`fig:summarize-across`.
-Later, we will talk about how you can also use a more general function, `apply`, to accomplish this.
+We could also ask for the `mean` of each column in the data frame.
+It does not make sense to compute the mean of the string columns,
+so in this case we *must* provide the keyword `numeric_only=True`
+so that the mean is only computed on columns with numeric values.
 
-```{figure} img/summarize/summarize.003.jpeg
-:name: fig:summarize-across
-:figclass: figure
-
-`loc[]` or `apply` is useful for efficiently calculating summary statistics on
-many columns at once. The darker, top row of each table represents the column
-headers.
+```{code-cell} ipython3
+region_lang.mean(numeric_only=True)
 ```
 
+If there are only some columns for which you would like to get summary statistics,
+you can first use `[]` to select those columns
+and then ask for the summary statistic,
+as we did for a single column previously.
 Let's say that we want to know
 the mean and standard deviation of all of the columns between `"mother_tongue"` and `"lang_known"`.
-We use `loc[]` to specify the columns and then `agg` to ask for both the `mean` and `std`.
+We use `[]` to specify the columns and then `agg` to ask for both the `mean` and `std`.
 ```{code-cell} ipython3
-region_lang.loc[:, "mother_tongue":"lang_known"].agg(["mean", "std"])
+region_lang["mother_tongue":"lang_known"].agg(["mean", "std"])
 ```
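
The summary-statistic calls discussed above can be tried on a small made-up data frame (`toy_region` and its values are hypothetical, not the real `region_lang` data). The sketch assumes standard pandas behaviour and selects the column range with `loc[:, ...]`, since a slice inside plain `[]` is applied to the rows.

```python
import pandas as pd

# Hypothetical toy data; column names mimic `region_lang`, values are made up
toy_region = pd.DataFrame({
    "region": ["Halifax", "Vancouver", "Toronto"],
    "mother_tongue": [10, 20, 30],
    "most_at_home": [5, 15, 25],
    "lang_known": [100, 200, 300],
})

# max on every column: for strings, the last value in sorted order
all_max = toy_region.max()
print(all_max)

# Restrict to numeric columns
numeric_max = toy_region.max(numeric_only=True)
numeric_mean = toy_region.mean(numeric_only=True)
print(numeric_mean)

# Select a label range of columns, then compute several statistics at once
summary = toy_region.loc[:, "mother_tongue":"lang_known"].agg(["mean", "std"])
print(summary)
```

In the result of `agg`, each requested statistic becomes a row and each selected column stays a column.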
 
-
-
 ## Performing operations on groups of rows using `groupby`
 
 +++