
Index Update #316


Merged: 24 commits, Nov 17, 2023
2 changes: 1 addition & 1 deletion source/_config.yml
@@ -49,7 +49,7 @@ html:
extra_navbar: Powered by <a href="https://jupyterbook.org">Jupyter Book</a> # Will be displayed underneath the left navbar.
extra_footer: "" # Will be displayed underneath the footer.
google_analytics_id: "G-7XBFF4RSN2" # A GA id that can be used to track book views.
- home_page_in_navbar: true # Whether to include your home page in the left Navigation Bar
+ home_page_in_navbar: false # Whether to include your home page in the left Navigation Bar
baseurl: "" # The base URL where your book will be hosted. Used for creating image previews and social links. e.g.: https://mypage.com/mybook/
comments:
hypothesis: false
55 changes: 34 additions & 21 deletions source/classification1.md
@@ -144,7 +144,7 @@ In this case, the file containing the breast cancer data set is a `.csv`
file with headers. We'll use the `read_csv` function with no additional
arguments, and then inspect its contents:

- ```{index} read function; read\_csv
+ ```{index} read function; read_csv
```
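The cell that follows is collapsed in this view. A minimal sketch of such a call, with the file path an assumption for illustration:

```python
import pandas as pd

# read_csv parses a comma-separated file and uses the first row
# as column headers by default, so no extra arguments are needed.
cancer = pd.read_csv("data/wdbc.csv")  # hypothetical path
cancer  # displaying the data frame shows its first and last rows
```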

```{code-cell} ipython3
@@ -183,7 +183,7 @@ total set of variables per image in this data set is:

+++

- ```{index} info
+ ```{index} DataFrame; info
```

Below we use the `info` method to preview the data frame. This method can
@@ -195,7 +195,7 @@ as well as their data types and the number of non-missing entries.
cancer.info()
```

- ```{index} unique
+ ```{index} Series; unique
```

From the summary of the data above, we can see that `Class` is of type `object`.
@@ -213,7 +213,7 @@ method. The `replace` method takes one argument: a dictionary that maps
previous values to desired new values.
We will verify the result using the `unique` method.

- ```{index} replace
+ ```{index} Series; replace
```
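The cell below is collapsed apart from its last line. A hedged sketch of the relabeling, assuming the raw file codes the classes as "M" and "B":

```python
# Map the terse category codes to human-readable labels.
cancer["Class"] = cancer["Class"].replace({
    "M": "Malignant",  # assumed original coding
    "B": "Benign",
})

# Verify that only the two new labels remain.
cancer["Class"].unique()
```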

```{code-cell} ipython3
@@ -227,7 +227,7 @@ cancer["Class"].unique()

### Exploring the cancer data

- ```{index} groupby, count
+ ```{index} DataFrame; groupby, Series;size
```

```{code-cell} ipython3
@@ -239,9 +239,9 @@ glue("malignant_pct", "{:0.0f}".format(100*cancer["Class"].value_counts(normaliz
```

Before we start doing any modeling, let's explore our data set. Below we use
- the `groupby` and `count` methods to find the number and percentage
+ the `groupby` and `size` methods to find the number and percentage
of benign and malignant tumor observations in our data set. When paired with
- `groupby`, `count` counts the number of observations for each value of the `Class`
+ `groupby`, `size` counts the number of observations for each value of the `Class`
variable. Then we calculate the percentage in each group by dividing by the total
number of observations and multiplying by 100.
The total number of observations equals the number of rows in the data frame,
@@ -256,7 +256,7 @@ tumor observations.
100 * cancer.groupby("Class").size() / cancer.shape[0]
```

- ```{index} value_counts
+ ```{index} Series; value_counts
```

The `pandas` package also has a more convenient specialized `value_counts` method for
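The rest of this passage is collapsed; roughly, the equivalent one-liner looks like this sketch:

```python
# value_counts tallies each class directly; with normalize=True it
# reports proportions instead of raw counts, no manual division needed.
cancer["Class"].value_counts(normalize=True)
```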
@@ -621,8 +621,6 @@ glue("fig:05-multiknn-1", perim_concav_with_new_point3)
Scatter plot of concavity versus perimeter with new observation represented as a red diamond.
:::

- ```{index} pandas.DataFrame; assign
- ```

```{code-cell} ipython3
new_obs_Perimeter = 0
@@ -952,7 +950,7 @@ knn = KNeighborsClassifier(n_neighbors=5)
knn
```

- ```{index} scikit-learn; X & y
+ ```{index} scikit-learn; fit, scikit-learn; predictors, scikit-learn; response
```

In order to fit the model on the breast cancer data, we need to call `fit` on
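The remainder of this cell is collapsed. Roughly, the call looks like the sketch below, assuming `Perimeter` and `Concavity` are the predictors as elsewhere in this chapter:

```python
# fit takes the predictors X (a data frame) and the response y (a series).
knn.fit(X=cancer[["Perimeter", "Concavity"]], y=cancer["Class"])
```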
@@ -1061,10 +1059,13 @@ predictors (colored by diagnosis) for both the unstandardized data we just
loaded, and the standardized version of that same data. But first, we need to
standardize the `unscaled_cancer` data set with `scikit-learn`.

- ```{index} pipeline, scikit-learn; make_column_transformer
+ ```{index} see: Pipeline; scikit-learn
```

+ ```{index} see: make_column_transformer; scikit-learn
+ ```

- ```{index} double: scikit-learn; pipeline
+ ```{index} scikit-learn;Pipeline, scikit-learn; make_column_transformer
```

The `scikit-learn` framework provides a collection of *preprocessors* used to manipulate
@@ -1090,13 +1091,13 @@ preprocessor = make_column_transformer(
preprocessor
```

- ```{index} scikit-learn; ColumnTransformer, scikit-learn; StandardScaler, scikit-learn; fit_transform
+ ```{index} scikit-learn; make_column_transformer, scikit-learn; StandardScaler
```

- ```{index} ColumnTransformer; StandardScaler
+ ```{index} see: StandardScaler; scikit-learn
```

- ```{index} scikit-learn; fit, scikit-learn; transform
+ ```{index} scikit-learn; fit, scikit-learn; make_column_selector, scikit-learn; StandardScaler
```

You can see that the preprocessor includes a single standardization step
@@ -1119,7 +1120,10 @@ preprocessor = make_column_transformer(
preprocessor
```

- ```{index} see: fit, transform, fit_transform; scikit-learn
+ ```{index} see: fit ; scikit-learn
```

+ ```{index} scikit-learn; transform
+ ```

We are now ready to standardize the numerical predictor columns in the `unscaled_cancer` data frame.
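A hedged sketch of the standardization step itself, assuming the preprocessor built above targets the `Area` and `Smoothness` columns:

```python
# fit computes each column's mean and standard deviation from the data;
# transform then applies the standardization using those statistics.
preprocessor.fit(unscaled_cancer)
scaled_cancer = preprocessor.transform(unscaled_cancer)
scaled_cancer
```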
@@ -1409,6 +1413,9 @@ detection, there are many cases in which the "important" class to identify
(presence of disease, malicious email) is much rarer than the "unimportant"
class (no disease, normal email).

+ ```{index} concat
+ ```

To better illustrate the problem, let's revisit the scaled breast cancer data,
`cancer`; except now we will remove many of the observations of malignant tumors, simulating
what the data would look like if the cancer was rare. We will do this by
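The collapsed code presumably builds the rare-class data set along these lines; keeping only three malignant rows is an assumption for illustration:

```python
import pandas as pd

# Keep all benign observations but only a handful of malignant ones,
# simulating a data set in which the positive class is rare.
rare_cancer = pd.concat((
    cancer[cancer["Class"] == "Benign"],
    cancer[cancer["Class"] == "Malignant"].head(3),
))
```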
@@ -1603,7 +1610,7 @@ Imbalanced data with background color indicating the decision of the classifier

+++

- ```{index} oversampling, scikit-learn; sample
+ ```{index} oversampling, DataFrame; sample
```

Despite the simplicity of the problem, solving it in a statistically sound manner is actually
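The `DataFrame; sample` index entry above hints at the remedy demonstrated in the collapsed cells: oversampling the rare class. A hedged sketch, reusing the `rare_cancer` frame from the earlier sketch:

```python
# Sample the malignant rows with replacement until the classes balance.
malignant = rare_cancer[rare_cancer["Class"] == "Malignant"]
benign = rare_cancer[rare_cancer["Class"] == "Benign"]
upsampled = malignant.sample(n=len(benign), replace=True, random_state=42)
balanced_cancer = pd.concat((benign, upsampled))
balanced_cancer["Class"].value_counts()
```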
@@ -1747,6 +1754,9 @@ entries, one option is to simply remove those observations prior to building
the K-nearest neighbors classifier. We can accomplish this by using the
`dropna` method prior to working with the data.

+ ```{index} missing data; dropna
+ ```

```{code-cell} ipython3
no_missing_cancer = missing_cancer.dropna()
no_missing_cancer
@@ -1758,8 +1768,11 @@ possible approach is to *impute* the missing entries, i.e., fill in synthetic
values based on the other observations in the data set. One reasonable choice
is to perform *mean imputation*, where missing entries are filled in using the
mean of the present entries in each variable. To perform mean imputation, we
- use a `SimpleImputer` transformer with the default arguments, and wrap it in a
- `ColumnTransformer` to indicate which columns need imputation.
+ use a `SimpleImputer` transformer with the default arguments, and use
+ `make_column_transformer` to indicate which columns need imputation.

+ ```{index} scikit-learn; SimpleImputer, missing data;mean imputation
+ ```
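The cell below is collapsed apart from its import; presumably it builds an imputing preprocessor along these lines (the column names are assumptions):

```python
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer

# SimpleImputer's default strategy is "mean", i.e., mean imputation.
preprocessor = make_column_transformer(
    (SimpleImputer(), ["Smoothness", "Concavity"]),  # assumed columns
)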

```{code-cell} ipython3
from sklearn.impute import SimpleImputer
@@ -1792,7 +1805,7 @@ question you are answering.
(08:puttingittogetherworkflow)=
## Putting it together in a `Pipeline`

- ```{index} scikit-learn; pipeline
+ ```{index} scikit-learn; Pipeline
```

The `scikit-learn` package collection also provides the [`Pipeline`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html?highlight=pipeline#sklearn.pipeline.Pipeline),
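The rest of this passage is collapsed. As a hedged sketch of what such a pipeline looks like, chaining a preprocessor and a classifier so both are fit together (object names assumed from earlier cells):

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# A pipeline runs each step in order; calling fit or predict on the
# pipeline applies the preprocessing and then the classifier.
knn_pipeline = make_pipeline(preprocessor, KNeighborsClassifier(n_neighbors=5))
knn_pipeline.fit(cancer[["Smoothness", "Concavity"]], cancer["Class"])
```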
68 changes: 47 additions & 21 deletions source/classification2.md
@@ -121,6 +121,9 @@
$$\mathrm{accuracy} = \frac{\mathrm{number \; of \; correct \; predictions}}{\mathrm{total \; number \; of \; predictions}}$$
Process for splitting the data and finding the prediction accuracy.
```

+ ```{index} confusion matrix
+ ```

Accuracy is a convenient, general-purpose way to summarize the performance of a classifier with
a single number. But prediction accuracy by itself does not tell the whole
story. In particular, accuracy alone only tells us how often the classifier
@@ -165,6 +168,9 @@ disastrous error, since it may lead to a patient who requires treatment not receiving it.
Since we are particularly interested in identifying malignant cases, this
classifier would likely be unacceptable even with an accuracy of 89%.

+ ```{index} positive label, negative label, true positive, true negative, false positive, false negative
+ ```

Focusing more on one label than the other is
common in classification problems. In such cases, we typically refer to the label we are more
interested in identifying as the *positive* label, and the other as the
@@ -178,6 +184,9 @@ classifier can make, corresponding to the four entries in the confusion matrix:
- **True Negative:** A benign observation that was classified as benign (bottom right in {numref}`confusion-matrix-table`).
- **False Negative:** A malignant observation that was classified as benign (top right in {numref}`confusion-matrix-table`).

+ ```{index} precision, recall
+ ```
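For reference, the standard definitions in terms of these four counts, consistent with the discussion that follows:

$$\mathrm{precision} = \frac{\mathrm{number \; of \; correct \; positive \; predictions}}{\mathrm{total \; number \; of \; positive \; predictions}}$$

$$\mathrm{recall} = \frac{\mathrm{number \; of \; correct \; positive \; predictions}}{\mathrm{total \; number \; of \; positive \; observations}}$$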

A perfect classifier would have zero false negatives and false positives (and
therefore, 100% accuracy). However, classifiers in practice will almost always
make some errors. So you should think about which kinds of error are most
@@ -358,6 +367,12 @@ in `np.random.seed` will lead to different patterns of randomness, but as long as you pick the same
value your analysis results will be the same. In the remainder of the textbook,
we will set the seed once at the beginning of each chapter.

+ ```{index} RandomState
+ ```

+ ```{index} see: RandomState; seed
+ ```

````{note}
When you use `np.random.seed`, you are really setting the seed for the `numpy`
package's *default random number generator*. Using the global default random
@@ -516,7 +531,7 @@ glue("cancer_train_nrow", "{:d}".format(len(cancer_train)))
glue("cancer_test_nrow", "{:d}".format(len(cancer_test)))
```

- ```{index} info
+ ```{index} DataFrame; info
```

We can see from the `info` method above that the training set contains {glue:text}`cancer_train_nrow` observations,
@@ -525,7 +540,7 @@ a train / test split of 75% / 25%, as desired. Recall from {numref}`Chapter %s <
that we use the `info` method to preview the number of rows, the variable names, their data types, and
missing entries of a data frame.

- ```{index} groupby, count
+ ```{index} Series; value_counts
```

We can use the `value_counts` method with the `normalize` argument set to `True`
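A sketch of that check, assuming the split objects from above:

```python
# With normalize=True, value_counts reports the fraction of rows in
# each class rather than raw counts.
cancer_train["Class"].value_counts(normalize=True)
```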
@@ -557,7 +572,7 @@ training and test data sets.

+++

- ```{index} pipeline, pipeline; make_column_transformer, pipeline; StandardScaler
+ ```{index} scikit-learn; Pipeline, scikit-learn; make_column_transformer, scikit-learn; StandardScaler
```

Fortunately, `scikit-learn` helps us handle this properly as long as we wrap our
@@ -603,7 +618,7 @@ knn_pipeline

### Predict the labels in the test set

- ```{index} pandas.concat
+ ```{index} scikit-learn; predict
```

Now that we have a K-nearest neighbors classifier object, we can use it to
@@ -622,7 +637,7 @@ cancer_test[["ID", "Class", "predicted"]]
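The collapsed cell presumably attaches the predictions to the test set; a hedged sketch, assuming `Smoothness` and `Concavity` are the predictors:

```python
# Predict labels for the test observations and store them beside the truth.
cancer_test["predicted"] = knn_pipeline.predict(
    cancer_test[["Smoothness", "Concavity"]]
)
```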
(eval-performance-clasfcn2)=
### Evaluate performance

- ```{index} scikit-learn; score
+ ```{index} scikit-learn; score, scikit-learn; precision_score, scikit-learn; recall_score
```

Finally, we can assess our classifier's performance. First, we will examine accuracy.
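A hedged sketch of the three evaluations, assuming the objects defined above and `Malignant` as the positive label:

```python
from sklearn.metrics import precision_score, recall_score

# Accuracy: score compares the pipeline's predictions to the true labels.
accuracy = knn_pipeline.score(
    cancer_test[["Smoothness", "Concavity"]], cancer_test["Class"]
)

# Precision and recall need the positive label spelled out explicitly.
precision = precision_score(
    cancer_test["Class"], cancer_test["predicted"], pos_label="Malignant"
)
recall = recall_score(
    cancer_test["Class"], cancer_test["predicted"], pos_label="Malignant"
)
```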
@@ -695,6 +710,9 @@ arguments: the actual labels first, then the predicted labels second. Note that
`crosstab` orders its columns alphabetically, but the positive label is still `Malignant`,
even if it is not in the top left corner as in the example confusion matrix earlier in this chapter.

+ ```{index} crosstab
+ ```

```{code-cell} ipython3
pd.crosstab(
cancer_test["Class"],
@@ -774,7 +792,7 @@ a recall of {glue:text}`cancer_rec_1`%.
That sounds pretty good! Wait, *is* it good?
Or do we need something higher?

- ```{index} accuracy; assessment
+ ```{index} accuracy;assessment, precision;assessment, recall;assessment
```

In general, a *good* value for accuracy (as well as precision and recall, if applicable)
@@ -1026,6 +1044,12 @@ cv_5_df = pd.DataFrame(
cv_5_df
```

+ ```{index} see: sem;standard error
+ ```

+ ```{index} standard error, DataFrame;agg
+ ```

The validation scores we are interested in are contained in the `test_score` column.
We can then aggregate the *mean* and *standard error*
of the classifier's validation accuracy across the folds.
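A sketch of that aggregation using `pandas`' `agg`, which maps directly onto the `sem` index entry added above:

```python
# Mean and standard error of the validation scores across the five folds.
cv_5_metrics = cv_5_df.agg(["mean", "sem"])
cv_5_metrics
```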
@@ -1098,6 +1122,9 @@ cv_10_metrics["test_score"]["sem"] = cv_5_metrics["test_score"]["sem"] / np.sqrt
cv_10_metrics
```

+ ```{index} cross-validation; folds
+ ```

In this case, using 10-fold instead of 5-fold cross validation did
reduce the standard error very slightly. In fact, due to the randomness in how the data are split, sometimes
you might even end up with a *higher* standard error when increasing the number of folds!
@@ -1153,6 +1180,11 @@ functionality, named `GridSearchCV`, to automatically handle the details for us.
Before we use `GridSearchCV`, we need to create a new pipeline
with a `KNeighborsClassifier` that has the number of neighbors left unspecified.

+ ```{index} see: make_pipeline; scikit-learn
+ ```
+ ```{index} scikit-learn;make_pipeline
+ ```

```{code-cell} ipython3
knn = KNeighborsClassifier()
cancer_tune_pipe = make_pipeline(cancer_preprocessor, knn)
@@ -1534,6 +1566,9 @@ us automatically. To make predictions and assess the estimated accuracy of the best model on the test set, we can use the
`score` and `predict` methods of the fit `GridSearchCV` object. We can then pass those predictions to
the `precision`, `recall`, and `crosstab` functions to assess the estimated precision and recall, and print a confusion matrix.

+ ```{index} scikit-learn;predict, scikit-learn;score, scikit-learn;precision_score, scikit-learn;recall_score, crosstab
+ ```

```{code-cell} ipython3
cancer_test["predicted"] = cancer_tune_grid.predict(
cancer_test[["Smoothness", "Concavity"]]
@@ -1637,7 +1672,7 @@ Overview of K-NN classification.

+++

- ```{index} scikit-learn, pipeline, cross-validation, K-nearest neighbors; classification, classification
+ ```{index} scikit-learn;Pipeline, cross-validation, K-nearest neighbors; classification, classification
```

The overall workflow for performing K-nearest neighbors classification using `scikit-learn` is as follows:
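The enumerated steps are collapsed in this view. A compact, hedged sketch of the whole workflow; the column names and the parameter range are assumptions:

```python
from sklearn.compose import make_column_transformer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# 1. Split the data, stratifying on the response.
cancer_train, cancer_test = train_test_split(
    cancer, train_size=0.75, stratify=cancer["Class"]
)

# 2. Standardize inside a pipeline so test-set statistics never leak
#    into the preprocessing.
preprocessor = make_column_transformer(
    (StandardScaler(), ["Smoothness", "Concavity"])  # assumed predictors
)
pipe = make_pipeline(preprocessor, KNeighborsClassifier())

# 3. Tune the number of neighbors with cross-validation on the training set.
grid = GridSearchCV(
    pipe,
    param_grid={"kneighborsclassifier__n_neighbors": range(1, 21)},
    cv=5,
)
grid.fit(cancer_train[["Smoothness", "Concavity"]], cancer_train["Class"])

# 4. Evaluate the tuned classifier once on the held-out test set.
grid.score(cancer_test[["Smoothness", "Concavity"]], cancer_test["Class"])
```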
@@ -1755,19 +1790,7 @@ for i in range(len(ks)):
cancer_tune_pipe = make_pipeline(cancer_preprocessor, KNeighborsClassifier())
param_grid = {
"kneighborsclassifier__n_neighbors": range(1, 21),
- } ## double check: in R textbook, it is tune_grid(..., grid=20), so I guess it matches RandomizedSearchCV
- ## instead of GridSeachCV?
- # param_grid_rand = {
- # "kneighborsclassifier__n_neighbors": range(1, 100),
- # }
- # cancer_tune_grid = RandomizedSearchCV(
- # estimator=cancer_tune_pipe,
- # param_distributions=param_grid_rand,
- # n_iter=20,
- # cv=5,
- # n_jobs=-1,
- # return_train_score=True,
- # )
+ }
cancer_tune_grid = GridSearchCV(
estimator=cancer_tune_pipe,
param_grid=param_grid,
@@ -1980,7 +2003,10 @@ where to learn more about advanced predictor selection methods.

+++

- ### Forward selection in `scikit-learn`
+ ### Forward selection in Python

+ ```{index} variable selection; implementation
+ ```

We now turn to implementing forward selection in Python.
First we will extract a smaller set of predictors to work with in this illustrative example&mdash;`Smoothness`,
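The surrounding cells are collapsed. As a hedged sketch of the idea, not necessarily the book's implementation, a greedy forward-selection loop over candidate predictors might look like this:

```python
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Greedy forward selection: repeatedly add whichever remaining predictor
# yields the best cross-validated accuracy. A real implementation would
# stop once accuracy plateaus; this sketch simply ranks every addition.
candidates = ["Smoothness", "Concavity", "Perimeter"]  # assumed predictor pool
selected = []
while candidates:
    scores = {
        c: cross_val_score(
            make_pipeline(StandardScaler(), KNeighborsClassifier()),
            cancer_train[selected + [c]],
            cancer_train["Class"],
            cv=5,
        ).mean()
        for c in candidates
    }
    best = max(scores, key=scores.get)  # highest mean validation accuracy
    selected.append(best)
    candidates.remove(best)
    print(selected, round(scores[best], 3))
```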