
Index Update #316


Merged: 24 commits, Nov 17, 2023
2 changes: 1 addition & 1 deletion source/_config.yml
@@ -49,7 +49,7 @@ html:
extra_navbar: Powered by <a href="https://jupyterbook.org">Jupyter Book</a> # Will be displayed underneath the left navbar.
extra_footer: "" # Will be displayed underneath the footer.
google_analytics_id: "G-7XBFF4RSN2" # A GA id that can be used to track book views.
- home_page_in_navbar: true # Whether to include your home page in the left Navigation Bar
+ home_page_in_navbar: false # Whether to include your home page in the left Navigation Bar
baseurl: "" # The base URL where your book will be hosted. Used for creating image previews and social links. e.g.: https://mypage.com/mybook/
comments:
hypothesis: false
55 changes: 34 additions & 21 deletions source/classification1.md
@@ -144,7 +144,7 @@ In this case, the file containing the breast cancer data set is a `.csv`
file with headers. We'll use the `read_csv` function with no additional
arguments, and then inspect its contents:

- ```{index} read function; read\_csv
+ ```{index} read function; read_csv
```
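The cell that follows is collapsed in this view. A minimal sketch of such a call, with the file path an assumption for illustration:

```python
import pandas as pd

# read_csv parses a comma-separated file and uses the first row
# as column headers by default, so no extra arguments are needed.
cancer = pd.read_csv("data/wdbc.csv")  # hypothetical path
cancer  # displaying the data frame shows its first and last rows
```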

```{code-cell} ipython3
@@ -183,7 +183,7 @@ total set of variables per image in this data set is:

+++

- ```{index} info
+ ```{index} DataFrame; info
```

Below we use the `info` method to preview the data frame. This method can
@@ -195,7 +195,7 @@ as well as their data types and the number of non-missing entries.
cancer.info()
```

- ```{index} unique
+ ```{index} Series; unique
```

From the summary of the data above, we can see that `Class` is of type `object`.
@@ -213,7 +213,7 @@ method. The `replace` method takes one argument: a dictionary that maps
previous values to desired new values.
We will verify the result using the `unique` method.

- ```{index} replace
+ ```{index} Series; replace
```
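The cell below is collapsed apart from its last line. A hedged sketch of the relabeling, assuming the raw file codes the classes as "M" and "B":

```python
# Map the terse category codes to human-readable labels.
cancer["Class"] = cancer["Class"].replace({
    "M": "Malignant",  # assumed original coding
    "B": "Benign",
})

# Verify that only the two new labels remain.
cancer["Class"].unique()
```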

```{code-cell} ipython3
@@ -227,7 +227,7 @@ cancer["Class"].unique()

### Exploring the cancer data

- ```{index} groupby, count
+ ```{index} DataFrame; groupby, Series;size
```

```{code-cell} ipython3
@@ -239,9 +239,9 @@ glue("malignant_pct", "{:0.0f}".format(100*cancer["Class"].value_counts(normaliz
```

Before we start doing any modeling, let's explore our data set. Below we use
- the `groupby` and `count` methods to find the number and percentage
+ the `groupby` and `size` methods to find the number and percentage
of benign and malignant tumor observations in our data set. When paired with
- `groupby`, `count` counts the number of observations for each value of the `Class`
+ `groupby`, `size` counts the number of observations for each value of the `Class`
variable. Then we calculate the percentage in each group by dividing by the total
number of observations and multiplying by 100.
The total number of observations equals the number of rows in the data frame,
@@ -256,7 +256,7 @@ tumor observations.
100 * cancer.groupby("Class").size() / cancer.shape[0]
```

- ```{index} value_counts
+ ```{index} Series; value_counts
```

The `pandas` package also has a more convenient specialized `value_counts` method for
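The rest of this passage is collapsed; roughly, the equivalent one-liner looks like this sketch:

```python
# value_counts tallies each class directly; with normalize=True it
# reports proportions instead of raw counts, no manual division needed.
cancer["Class"].value_counts(normalize=True)
```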
@@ -621,8 +621,6 @@ glue("fig:05-multiknn-1", perim_concav_with_new_point3)
Scatter plot of concavity versus perimeter with new observation represented as a red diamond.
:::

- ```{index} pandas.DataFrame; assign
- ```

```{code-cell} ipython3
new_obs_Perimeter = 0
@@ -952,7 +950,7 @@ knn = KNeighborsClassifier(n_neighbors=5)
knn
```

- ```{index} scikit-learn; X & y
+ ```{index} scikit-learn; fit, scikit-learn; predictors, scikit-learn; response
```

In order to fit the model on the breast cancer data, we need to call `fit` on
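The remainder of this cell is collapsed. Roughly, the call looks like the sketch below, assuming `Perimeter` and `Concavity` are the predictors as elsewhere in this chapter:

```python
# fit takes the predictors X (a data frame) and the response y (a series).
knn.fit(X=cancer[["Perimeter", "Concavity"]], y=cancer["Class"])
```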
@@ -1061,10 +1059,13 @@ predictors (colored by diagnosis) for both the unstandardized data we just
loaded, and the standardized version of that same data. But first, we need to
standardize the `unscaled_cancer` data set with `scikit-learn`.

- ```{index} pipeline, scikit-learn; make_column_transformer
+ ```{index} see: Pipeline; scikit-learn
```

+ ```{index} see: make_column_transformer; scikit-learn
+ ```

- ```{index} double: scikit-learn; pipeline
+ ```{index} scikit-learn;Pipeline, scikit-learn; make_column_transformer
```

The `scikit-learn` framework provides a collection of *preprocessors* used to manipulate
@@ -1090,13 +1091,13 @@ preprocessor = make_column_transformer(
preprocessor
```

- ```{index} scikit-learn; ColumnTransformer, scikit-learn; StandardScaler, scikit-learn; fit_transform
+ ```{index} scikit-learn; make_column_transformer, scikit-learn; StandardScaler
```

- ```{index} ColumnTransformer; StandardScaler
+ ```{index} see: StandardScaler; scikit-learn
```

- ```{index} scikit-learn; fit, scikit-learn; transform
+ ```{index} scikit-learn; fit, scikit-learn; make_column_selector, scikit-learn; StandardScaler
```

You can see that the preprocessor includes a single standardization step
@@ -1119,7 +1120,10 @@ preprocessor = make_column_transformer(
preprocessor
```

- ```{index} see: fit, transform, fit_transform; scikit-learn
+ ```{index} see: fit ; scikit-learn
```

+ ```{index} scikit-learn; transform
+ ```

We are now ready to standardize the numerical predictor columns in the `unscaled_cancer` data frame.
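A hedged sketch of the standardization step itself, assuming the preprocessor built above targets the `Area` and `Smoothness` columns:

```python
# fit computes each column's mean and standard deviation from the data;
# transform then applies the standardization using those statistics.
preprocessor.fit(unscaled_cancer)
scaled_cancer = preprocessor.transform(unscaled_cancer)
scaled_cancer
```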
@@ -1409,6 +1413,9 @@ detection, there are many cases in which the "important" class to identify
(presence of disease, malicious email) is much rarer than the "unimportant"
class (no disease, normal email).

+ ```{index} concat
+ ```

To better illustrate the problem, let's revisit the scaled breast cancer data,
`cancer`; except now we will remove many of the observations of malignant tumors, simulating
what the data would look like if the cancer was rare. We will do this by
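The collapsed code presumably builds the rare-class data set along these lines; keeping only three malignant rows is an assumption for illustration:

```python
import pandas as pd

# Keep all benign observations but only a handful of malignant ones,
# simulating a data set in which the positive class is rare.
rare_cancer = pd.concat((
    cancer[cancer["Class"] == "Benign"],
    cancer[cancer["Class"] == "Malignant"].head(3),
))
```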
@@ -1603,7 +1610,7 @@ Imbalanced data with background color indicating the decision of the classifier

+++

- ```{index} oversampling, scikit-learn; sample
+ ```{index} oversampling, DataFrame; sample
```

Despite the simplicity of the problem, solving it in a statistically sound manner is actually
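The `DataFrame; sample` index entry above hints at the remedy demonstrated in the collapsed cells: oversampling the rare class. A hedged sketch, reusing the `rare_cancer` frame from the earlier sketch:

```python
# Sample the malignant rows with replacement until the classes balance.
malignant = rare_cancer[rare_cancer["Class"] == "Malignant"]
benign = rare_cancer[rare_cancer["Class"] == "Benign"]
upsampled = malignant.sample(n=len(benign), replace=True, random_state=42)
balanced_cancer = pd.concat((benign, upsampled))
balanced_cancer["Class"].value_counts()
```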
@@ -1747,6 +1754,9 @@ entries, one option is to simply remove those observations prior to building
the K-nearest neighbors classifier. We can accomplish this by using the
`dropna` method prior to working with the data.

+ ```{index} missing data; dropna
+ ```

```{code-cell} ipython3
no_missing_cancer = missing_cancer.dropna()
no_missing_cancer
@@ -1758,8 +1768,11 @@ possible approach is to *impute* the missing entries, i.e., fill in synthetic
values based on the other observations in the data set. One reasonable choice
is to perform *mean imputation*, where missing entries are filled in using the
mean of the present entries in each variable. To perform mean imputation, we
- use a `SimpleImputer` transformer with the default arguments, and wrap it in a
- `ColumnTransformer` to indicate which columns need imputation.
+ use a `SimpleImputer` transformer with the default arguments, and use
+ `make_column_transformer` to indicate which columns need imputation.

+ ```{index} scikit-learn; SimpleImputer, missing data;mean imputation
+ ```
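The cell below is collapsed apart from its import; presumably it builds an imputing preprocessor along these lines (the column names are assumptions):

```python
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer

# SimpleImputer's default strategy is "mean", i.e., mean imputation.
preprocessor = make_column_transformer(
    (SimpleImputer(), ["Smoothness", "Concavity"]),  # assumed columns
)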

```{code-cell} ipython3
from sklearn.impute import SimpleImputer
@@ -1792,7 +1805,7 @@ question you are answering.
(08:puttingittogetherworkflow)=
## Putting it together in a `Pipeline`

- ```{index} scikit-learn; pipeline
+ ```{index} scikit-learn; Pipeline
```

The `scikit-learn` package collection also provides the [`Pipeline`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html?highlight=pipeline#sklearn.pipeline.Pipeline),
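The rest of this passage is collapsed. As a hedged sketch of what such a pipeline looks like, chaining a preprocessor and a classifier so both are fit together (object names assumed from earlier cells):

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# A pipeline runs each step in order; calling fit or predict on the
# pipeline applies the preprocessing and then the classifier.
knn_pipeline = make_pipeline(preprocessor, KNeighborsClassifier(n_neighbors=5))
knn_pipeline.fit(cancer[["Smoothness", "Concavity"]], cancer["Class"])
```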
68 changes: 47 additions & 21 deletions source/classification2.md
@@ -121,6 +121,9 @@
$$\mathrm{accuracy} = \frac{\mathrm{number \; of \; correct \; predictions}}{\mathrm{total \; number \; of \; predictions}}$$
Process for splitting the data and finding the prediction accuracy.
```

+ ```{index} confusion matrix
+ ```

Accuracy is a convenient, general-purpose way to summarize the performance of a classifier with
a single number. But prediction accuracy by itself does not tell the whole
story. In particular, accuracy alone only tells us how often the classifier
@@ -165,6 +168,9 @@ disastrous error, since it may lead to a patient who requires treatment not receiving it.
Since we are particularly interested in identifying malignant cases, this
classifier would likely be unacceptable even with an accuracy of 89%.

+ ```{index} positive label, negative label, true positive, true negative, false positive, false negative
+ ```

Focusing more on one label than the other is
common in classification problems. In such cases, we typically refer to the label we are more
interested in identifying as the *positive* label, and the other as the
@@ -178,6 +184,9 @@ classifier can make, corresponding to the four entries in the confusion matrix:
- **True Negative:** A benign observation that was classified as benign (bottom right in {numref}`confusion-matrix-table`).
- **False Negative:** A malignant observation that was classified as benign (top right in {numref}`confusion-matrix-table`).

+ ```{index} precision, recall
+ ```
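For reference, the standard definitions in terms of these four counts, consistent with the discussion that follows:

$$\mathrm{precision} = \frac{\mathrm{number \; of \; correct \; positive \; predictions}}{\mathrm{total \; number \; of \; positive \; predictions}}$$

$$\mathrm{recall} = \frac{\mathrm{number \; of \; correct \; positive \; predictions}}{\mathrm{total \; number \; of \; positive \; observations}}$$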

A perfect classifier would have zero false negatives and false positives (and
therefore, 100% accuracy). However, classifiers in practice will almost always
make some errors. So you should think about which kinds of error are most
@@ -358,6 +367,12 @@ in `np.random.seed` will lead to different patterns of randomness, but as long as you pick the same
value your analysis results will be the same. In the remainder of the textbook,
we will set the seed once at the beginning of each chapter.

+ ```{index} RandomState
+ ```

+ ```{index} see: RandomState; seed
+ ```

````{note}
When you use `np.random.seed`, you are really setting the seed for the `numpy`
package's *default random number generator*. Using the global default random
@@ -516,7 +531,7 @@ glue("cancer_train_nrow", "{:d}".format(len(cancer_train)))
glue("cancer_test_nrow", "{:d}".format(len(cancer_test)))
```

- ```{index} info
+ ```{index} DataFrame; info
```

We can see from the `info` method above that the training set contains {glue:text}`cancer_train_nrow` observations,
@@ -525,7 +540,7 @@ a train / test split of 75% / 25%, as desired. Recall from {numref}`Chapter %s <
that we use the `info` method to preview the number of rows, the variable names, their data types, and
missing entries of a data frame.

- ```{index} groupby, count
+ ```{index} Series; value_counts
```

We can use the `value_counts` method with the `normalize` argument set to `True`
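A sketch of that check, assuming the split objects from above:

```python
# With normalize=True, value_counts reports the fraction of rows in
# each class rather than raw counts.
cancer_train["Class"].value_counts(normalize=True)
```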
@@ -557,7 +572,7 @@ training and test data sets.

+++

- ```{index} pipeline, pipeline; make_column_transformer, pipeline; StandardScaler
+ ```{index} scikit-learn; Pipeline, scikit-learn; make_column_transformer, scikit-learn; StandardScaler
```

Fortunately, `scikit-learn` helps us handle this properly as long as we wrap our
@@ -603,7 +618,7 @@ knn_pipeline

### Predict the labels in the test set

- ```{index} pandas.concat
+ ```{index} scikit-learn; predict
```

Now that we have a K-nearest neighbors classifier object, we can use it to
@@ -622,7 +637,7 @@ cancer_test[["ID", "Class", "predicted"]]
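The collapsed cell presumably attaches the predictions to the test set; a hedged sketch, assuming `Smoothness` and `Concavity` are the predictors:

```python
# Predict labels for the test observations and store them beside the truth.
cancer_test["predicted"] = knn_pipeline.predict(
    cancer_test[["Smoothness", "Concavity"]]
)
```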
(eval-performance-clasfcn2)=
### Evaluate performance

- ```{index} scikit-learn; score
+ ```{index} scikit-learn; score, scikit-learn; precision_score, scikit-learn; recall_score
```

Finally, we can assess our classifier's performance. First, we will examine accuracy.
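A hedged sketch of the three evaluations, assuming the objects defined above and `Malignant` as the positive label:

```python
from sklearn.metrics import precision_score, recall_score

# Accuracy: score compares the pipeline's predictions to the true labels.
accuracy = knn_pipeline.score(
    cancer_test[["Smoothness", "Concavity"]], cancer_test["Class"]
)

# Precision and recall need the positive label spelled out explicitly.
precision = precision_score(
    cancer_test["Class"], cancer_test["predicted"], pos_label="Malignant"
)
recall = recall_score(
    cancer_test["Class"], cancer_test["predicted"], pos_label="Malignant"
)
```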
@@ -695,6 +710,9 @@ arguments: the actual labels first, then the predicted labels second. Note that
`crosstab` orders its columns alphabetically, but the positive label is still `Malignant`,
even if it is not in the top left corner as in the example confusion matrix earlier in this chapter.

+ ```{index} crosstab
+ ```

```{code-cell} ipython3
pd.crosstab(
cancer_test["Class"],
@@ -774,7 +792,7 @@ a recall of {glue:text}`cancer_rec_1`%.
That sounds pretty good! Wait, *is* it good?
Or do we need something higher?

- ```{index} accuracy; assessment
+ ```{index} accuracy;assessment, precision;assessment, recall;assessment
```

In general, a *good* value for accuracy (as well as precision and recall, if applicable)
@@ -1026,6 +1044,12 @@ cv_5_df = pd.DataFrame(
cv_5_df
```

+ ```{index} see: sem;standard error
+ ```

+ ```{index} standard error, DataFrame;agg
+ ```

The validation scores we are interested in are contained in the `test_score` column.
We can then aggregate the *mean* and *standard error*
of the classifier's validation accuracy across the folds.
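A sketch of that aggregation using `pandas`' `agg`, which maps directly onto the `sem` index entry added above:

```python
# Mean and standard error of the validation scores across the five folds.
cv_5_metrics = cv_5_df.agg(["mean", "sem"])
cv_5_metrics
```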
@@ -1098,6 +1122,9 @@ cv_10_metrics["test_score"]["sem"] = cv_5_metrics["test_score"]["sem"] / np.sqrt
cv_10_metrics
```

+ ```{index} cross-validation; folds
+ ```

In this case, using 10-fold instead of 5-fold cross validation did
reduce the standard error very slightly. In fact, due to the randomness in how the data are split, sometimes
you might even end up with a *higher* standard error when increasing the number of folds!
@@ -1153,6 +1180,11 @@ functionality, named `GridSearchCV`, to automatically handle the details for us.
Before we use `GridSearchCV`, we need to create a new pipeline
with a `KNeighborsClassifier` that has the number of neighbors left unspecified.

+ ```{index} see: make_pipeline; scikit-learn
+ ```
+ ```{index} scikit-learn;make_pipeline
+ ```

```{code-cell} ipython3
knn = KNeighborsClassifier()
cancer_tune_pipe = make_pipeline(cancer_preprocessor, knn)
@@ -1534,6 +1566,9 @@ us automatically. To make predictions and assess the estimated accuracy of the best model on the test set, we can use the
`score` and `predict` methods of the fit `GridSearchCV` object. We can then pass those predictions to
the `precision`, `recall`, and `crosstab` functions to assess the estimated precision and recall, and print a confusion matrix.

+ ```{index} scikit-learn;predict, scikit-learn;score, scikit-learn;precision_score, scikit-learn;recall_score, crosstab
+ ```

```{code-cell} ipython3
cancer_test["predicted"] = cancer_tune_grid.predict(
cancer_test[["Smoothness", "Concavity"]]
@@ -1637,7 +1672,7 @@ Overview of K-NN classification.

+++

- ```{index} scikit-learn, pipeline, cross-validation, K-nearest neighbors; classification, classification
+ ```{index} scikit-learn;Pipeline, cross-validation, K-nearest neighbors; classification, classification
```

The overall workflow for performing K-nearest neighbors classification using `scikit-learn` is as follows:
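The enumerated steps are collapsed in this view. A compact, hedged sketch of the whole workflow; the column names and the parameter range are assumptions:

```python
from sklearn.compose import make_column_transformer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# 1. Split the data, stratifying on the response.
cancer_train, cancer_test = train_test_split(
    cancer, train_size=0.75, stratify=cancer["Class"]
)

# 2. Standardize inside a pipeline so test-set statistics never leak
#    into the preprocessing.
preprocessor = make_column_transformer(
    (StandardScaler(), ["Smoothness", "Concavity"])  # assumed predictors
)
pipe = make_pipeline(preprocessor, KNeighborsClassifier())

# 3. Tune the number of neighbors with cross-validation on the training set.
grid = GridSearchCV(
    pipe,
    param_grid={"kneighborsclassifier__n_neighbors": range(1, 21)},
    cv=5,
)
grid.fit(cancer_train[["Smoothness", "Concavity"]], cancer_train["Class"])

# 4. Evaluate the tuned classifier once on the held-out test set.
grid.score(cancer_test[["Smoothness", "Concavity"]], cancer_test["Class"])
```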
@@ -1755,19 +1790,7 @@ for i in range(len(ks)):
cancer_tune_pipe = make_pipeline(cancer_preprocessor, KNeighborsClassifier())
param_grid = {
"kneighborsclassifier__n_neighbors": range(1, 21),
- } ## double check: in R textbook, it is tune_grid(..., grid=20), so I guess it matches RandomizedSearchCV
- ## instead of GridSeachCV?
- # param_grid_rand = {
- # "kneighborsclassifier__n_neighbors": range(1, 100),
- # }
- # cancer_tune_grid = RandomizedSearchCV(
- # estimator=cancer_tune_pipe,
- # param_distributions=param_grid_rand,
- # n_iter=20,
- # cv=5,
- # n_jobs=-1,
- # return_train_score=True,
- # )
+ }
cancer_tune_grid = GridSearchCV(
estimator=cancer_tune_pipe,
param_grid=param_grid,
@@ -1980,7 +2003,10 @@ where to learn more about advanced predictor selection methods.

+++

- ### Forward selection in `scikit-learn`
+ ### Forward selection in Python

+ ```{index} variable selection; implementation
+ ```

We now turn to implementing forward selection in Python.
First we will extract a smaller set of predictors to work with in this illustrative example&mdash;`Smoothness`,
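The surrounding cells are collapsed. As a hedged sketch of the idea, not necessarily the book's implementation, a greedy forward-selection loop over candidate predictors might look like this:

```python
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Greedy forward selection: repeatedly add whichever remaining predictor
# yields the best cross-validated accuracy. A real implementation would
# stop once accuracy plateaus; this sketch simply ranks every addition.
candidates = ["Smoothness", "Concavity", "Perimeter"]  # assumed predictor pool
selected = []
while candidates:
    scores = {
        c: cross_val_score(
            make_pipeline(StandardScaler(), KNeighborsClassifier()),
            cancer_train[selected + [c]],
            cancer_train["Class"],
            cv=5,
        ).mean()
        for c in candidates
    }
    best = max(scores, key=scores.get)  # highest mean validation accuracy
    selected.append(best)
    candidates.remove(best)
    print(selected, round(scores[best], 3))
```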