diff --git a/source/_config.yml b/source/_config.yml index 5651a12e..6bf00afd 100755 --- a/source/_config.yml +++ b/source/_config.yml @@ -49,7 +49,7 @@ html: extra_navbar: Powered by Jupyter Book # Will be displayed underneath the left navbar. extra_footer: "" # Will be displayed underneath the footer. google_analytics_id: "G-7XBFF4RSN2" # A GA id that can be used to track book views. - home_page_in_navbar: true # Whether to include your home page in the left Navigation Bar + home_page_in_navbar: false # Whether to include your home page in the left Navigation Bar baseurl: "" # The base URL where your book will be hosted. Used for creating image previews and social links. e.g.: https://mypage.com/mybook/ comments: hypothesis: false diff --git a/source/classification1.md b/source/classification1.md index a393f295..30b2d90b 100755 --- a/source/classification1.md +++ b/source/classification1.md @@ -144,7 +144,7 @@ In this case, the file containing the breast cancer data set is a `.csv` file with headers. We'll use the `read_csv` function with no additional arguments, and then inspect its contents: -```{index} read function; read\_csv +```{index} read function; read_csv ``` ```{code-cell} ipython3 @@ -183,7 +183,7 @@ total set of variables per image in this data set is: +++ -```{index} info +```{index} DataFrame; info ``` Below we use the `info` method to preview the data frame. This method can @@ -195,7 +195,7 @@ as well as their data types and the number of non-missing entries. cancer.info() ``` -```{index} unique +```{index} Series; unique ``` From the summary of the data above, we can see that `Class` is of type `object`. @@ -213,7 +213,7 @@ method. The `replace` method takes one argument: a dictionary that maps previous values to desired new values. We will verify the result using the `unique` method. -```{index} replace +```{index} Series; replace ``` ```{code-cell} ipython3 @@ -227,7 +227,7 @@ cancer["Class"].unique() ### Exploring the cancer data -```{index} groupby, count +```{index} DataFrame; groupby, Series;size ``` ```{code-cell} ipython3 @@ -239,9 +239,9 @@ glue("malignant_pct", "{:0.0f}".format(100*cancer["Class"].value_counts(normaliz ``` Before we start doing any modeling, let's explore our data set. Below we use -the `groupby` and `count` methods to find the number and percentage +the `groupby` and `size` methods to find the number and percentage of benign and malignant tumor observations in our data set. When paired with -`groupby`, `count` counts the number of observations for each value of the `Class` +`groupby`, `size` counts the number of observations for each value of the `Class` variable. Then we calculate the percentage in each group by dividing by the total number of observations and multiplying by 100. The total number of observations equals the number of rows in the data frame, @@ -256,7 +256,7 @@ tumor observations. 100 * cancer.groupby("Class").size() / cancer.shape[0] ``` -```{index} value_counts +```{index} Series; value_counts ``` The `pandas` package also has a more convenient specialized `value_counts` method for @@ -621,8 +621,6 @@ glue("fig:05-multiknn-1", perim_concav_with_new_point3) Scatter plot of concavity versus perimeter with new observation represented as a red diamond. 
::: -```{index} pandas.DataFrame; assign -``` ```{code-cell} ipython3 new_obs_Perimeter = 0 @@ -952,7 +950,7 @@ knn = KNeighborsClassifier(n_neighbors=5) knn ``` -```{index} scikit-learn; X & y +```{index} scikit-learn; fit, scikit-learn; predictors, scikit-learn; response ``` In order to fit the model on the breast cancer data, we need to call `fit` on @@ -1061,10 +1059,13 @@ predictors (colored by diagnosis) for both the unstandardized data we just loaded, and the standardized version of that same data. But first, we need to standardize the `unscaled_cancer` data set with `scikit-learn`. -```{index} pipeline, scikit-learn; make_column_transformer +```{index} see: Pipeline; scikit-learn +``` + +```{index} see: make_column_transformer; scikit-learn ``` -```{index} double: scikit-learn; pipeline +```{index} scikit-learn;Pipeline, scikit-learn; make_column_transformer ``` The `scikit-learn` framework provides a collection of *preprocessors* used to manipulate @@ -1090,13 +1091,13 @@ preprocessor = make_column_transformer( preprocessor ``` -```{index} scikit-learn; ColumnTransformer, scikit-learn; StandardScaler, scikit-learn; fit_transform +```{index} scikit-learn; make_column_transformer, scikit-learn; StandardScaler ``` -```{index} ColumnTransformer; StandardScaler +```{index} see: StandardScaler; scikit-learn ``` -```{index} scikit-learn; fit, scikit-learn; transform +```{index} scikit-learn; fit, scikit-learn; make_column_selector, scikit-learn; StandardScaler ``` You can see that the preprocessor includes a single standardization step @@ -1119,7 +1120,10 @@ preprocessor = make_column_transformer( preprocessor ``` -```{index} see: fit, transform, fit_transform; scikit-learn +```{index} see: fit ; scikit-learn +``` + +```{index} scikit-learn; transform ``` We are now ready to standardize the numerical predictor columns in the `unscaled_cancer` data frame. @@ -1409,6 +1413,9 @@ detection, there are many cases in which the "important" class to identify (presence of disease, malicious email) is much rarer than the "unimportant" class (no disease, normal email). +```{index} concat +``` + To better illustrate the problem, let's revisit the scaled breast cancer data, `cancer`; except now we will remove many of the observations of malignant tumors, simulating what the data would look like if the cancer was rare. We will do this by @@ -1603,7 +1610,7 @@ Imbalanced data with background color indicating the decision of the classifier +++ -```{index} oversampling, scikit-learn; sample +```{index} oversampling, DataFrame; sample ``` Despite the simplicity of the problem, solving it in a statistically sound manner is actually @@ -1747,6 +1754,9 @@ entries, one option is to simply remove those observations prior to building the K-nearest neighbors classifier. We can accomplish this by using the `dropna` method prior to working with the data. +```{index} missing data; dropna +``` + ```{code-cell} ipython3 no_missing_cancer = missing_cancer.dropna() no_missing_cancer @@ -1758,8 +1768,11 @@ possible approach is to *impute* the missing entries, i.e., fill in synthetic values based on the other observations in the data set. One reasonable choice is to perform *mean imputation*, where missing entries are filled in using the mean of the present entries in each variable. To perform mean imputation, we -use a `SimpleImputer` transformer with the default arguments, and wrap it in a -`ColumnTransformer` to indicate which columns need imputation. 
+use a `SimpleImputer` transformer with the default arguments, and use +`make_column_transformer` to indicate which columns need imputation. + +```{index} scikit-learn; SimpleImputer, missing data;mean imputation +``` ```{code-cell} ipython3 from sklearn.impute import SimpleImputer @@ -1792,7 +1805,7 @@ question you are answering. (08:puttingittogetherworkflow)= ## Putting it together in a `Pipeline` -```{index} scikit-learn; pipeline +```{index} scikit-learn; Pipeline ``` The `scikit-learn` package collection also provides the [`Pipeline`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html?highlight=pipeline#sklearn.pipeline.Pipeline), diff --git a/source/classification2.md b/source/classification2.md index 649b5aa3..0b7cebec 100755 --- a/source/classification2.md +++ b/source/classification2.md @@ -121,6 +121,9 @@ $$\mathrm{accuracy} = \frac{\mathrm{number \; of \; correct \; predictions}}{\ Process for splitting the data and finding the prediction accuracy. ``` +```{index} confusion matrix +``` + Accuracy is a convenient, general-purpose way to summarize the performance of a classifier with a single number. But prediction accuracy by itself does not tell the whole story. In particular, accuracy alone only tells us how often the classifier @@ -165,6 +168,9 @@ disastrous error, since it may lead to a patient who requires treatment not rece Since we are particularly interested in identifying malignant cases, this classifier would likely be unacceptable even with an accuracy of 89%. +```{index} positive label, negative label, true positive, true negative, false positive, false negative +``` + Focusing more on one label than the other is common in classification problems. In such cases, we typically refer to the label we are more interested in identifying as the *positive* label, and the other as the @@ -178,6 +184,9 @@ classifier can make, corresponding to the four entries in the confusion matrix: - **True Negative:** A benign observation that was classified as benign (bottom right in {numref}`confusion-matrix-table`). - **False Negative:** A malignant observation that was classified as benign (top right in {numref}`confusion-matrix-table`). +```{index} precision, recall +``` + A perfect classifier would have zero false negatives and false positives (and therefore, 100% accuracy). However, classifiers in practice will almost always make some errors. So you should think about which kinds of error are most @@ -358,6 +367,12 @@ in `np.random.seed` will lead to different patterns of randomness, but as long a value your analysis results will be the same. In the remainder of the textbook, we will set the seed once at the beginning of each chapter. +```{index} RandomState +``` + +```{index} see: RandomState; seed +``` + ````{note} When you use `np.random.seed`, you are really setting the seed for the `numpy` package's *default random number generator*. Using the global default random @@ -516,7 +531,7 @@ glue("cancer_train_nrow", "{:d}".format(len(cancer_train))) glue("cancer_test_nrow", "{:d}".format(len(cancer_test))) ``` -```{index} info +```{index} DataFrame; info ``` We can see from the `info` method above that the training set contains {glue:text}`cancer_train_nrow` observations, @@ -525,7 +540,7 @@ a train / test split of 75% / 25%, as desired. Recall from {numref}`Chapter %s < that we use the `info` method to preview the number of rows, the variable names, their data types, and missing entries of a data frame. 
-```{index} groupby, count +```{index} Series; value_counts ``` We can use the `value_counts` method with the `normalize` argument set to `True` @@ -557,7 +572,7 @@ training and test data sets. +++ -```{index} pipeline, pipeline; make_column_transformer, pipeline; StandardScaler +```{index} scikit-learn; Pipeline, scikit-learn; make_column_transformer, scikit-learn; StandardScaler ``` Fortunately, `scikit-learn` helps us handle this properly as long as we wrap our @@ -603,7 +618,7 @@ knn_pipeline ### Predict the labels in the test set -```{index} pandas.concat +```{index} scikit-learn; predict ``` Now that we have a K-nearest neighbors classifier object, we can use it to @@ -622,7 +637,7 @@ cancer_test[["ID", "Class", "predicted"]] (eval-performance-clasfcn2)= ### Evaluate performance -```{index} scikit-learn; score +```{index} scikit-learn; score, scikit-learn; precision_score, scikit-learn; recall_score ``` Finally, we can assess our classifier's performance. First, we will examine accuracy. @@ -695,6 +710,9 @@ arguments: the actual labels first, then the predicted labels second. Note that `crosstab` orders its columns alphabetically, but the positive label is still `Malignant`, even if it is not in the top left corner as in the example confusion matrix earlier in this chapter. +```{index} crosstab +``` + ```{code-cell} ipython3 pd.crosstab( cancer_test["Class"], @@ -774,7 +792,7 @@ a recall of {glue:text}`cancer_rec_1`%. That sounds pretty good! Wait, *is* it good? Or do we need something higher? -```{index} accuracy; assessment +```{index} accuracy;assessment, precision;assessment, recall;assessment ``` In general, a *good* value for accuracy (as well as precision and recall, if applicable) @@ -1026,6 +1044,12 @@ cv_5_df = pd.DataFrame( cv_5_df ``` +```{index} see: sem;standard error +``` + +```{index} standard error, DataFrame;agg +``` + The validation scores we are interested in are contained in the `test_score` column. We can then aggregate the *mean* and *standard error* of the classifier's validation accuracy across the folds. @@ -1098,6 +1122,9 @@ cv_10_metrics["test_score"]["sem"] = cv_5_metrics["test_score"]["sem"] / np.sqrt cv_10_metrics ``` +```{index} cross-validation; folds +``` + In this case, using 10-fold instead of 5-fold cross validation did reduce the standard error very slightly. In fact, due to the randomness in how the data are split, sometimes you might even end up with a *higher* standard error when increasing the number of folds! @@ -1153,6 +1180,11 @@ functionality, named `GridSearchCV`, to automatically handle the details for us. Before we use `GridSearchCV`, we need to create a new pipeline with a `KNeighborsClassifier` that has the number of neighbors left unspecified. +```{index} see: make_pipeline; scikit-learn +``` +```{index} scikit-learn;make_pipeline +``` + ```{code-cell} ipython3 knn = KNeighborsClassifier() cancer_tune_pipe = make_pipeline(cancer_preprocessor, knn) @@ -1534,6 +1566,9 @@ us automatically. To make predictions and assess the estimated accuracy of the b `score` and `predict` methods of the fit `GridSearchCV` object. We can then pass those predictions to the `precision`, `recall`, and `crosstab` functions to assess the estimated precision and recall, and print a confusion matrix. 
+```{index} scikit-learn;predict, scikit-learn;score, scikit-learn;precision_score, scikit-learn;recall_score, crosstab +``` + ```{code-cell} ipython3 cancer_test["predicted"] = cancer_tune_grid.predict( cancer_test[["Smoothness", "Concavity"]] @@ -1637,7 +1672,7 @@ Overview of K-NN classification. +++ -```{index} scikit-learn, pipeline, cross-validation, K-nearest neighbors; classification, classification +```{index} scikit-learn;Pipeline, cross-validation, K-nearest neighbors; classification, classification ``` The overall workflow for performing K-nearest neighbors classification using `scikit-learn` is as follows: @@ -1755,19 +1790,7 @@ for i in range(len(ks)): cancer_tune_pipe = make_pipeline(cancer_preprocessor, KNeighborsClassifier()) param_grid = { "kneighborsclassifier__n_neighbors": range(1, 21), - } ## double check: in R textbook, it is tune_grid(..., grid=20), so I guess it matches RandomizedSearchCV - ## instead of GridSeachCV? - # param_grid_rand = { - # "kneighborsclassifier__n_neighbors": range(1, 100), - # } - # cancer_tune_grid = RandomizedSearchCV( - # estimator=cancer_tune_pipe, - # param_distributions=param_grid_rand, - # n_iter=20, - # cv=5, - # n_jobs=-1, - # return_train_score=True, - # ) + } cancer_tune_grid = GridSearchCV( estimator=cancer_tune_pipe, param_grid=param_grid, @@ -1980,7 +2003,10 @@ where to learn more about advanced predictor selection methods. +++ -### Forward selection in `scikit-learn` +### Forward selection in Python + +```{index} variable selection; implementation +``` We now turn to implementing forward selection in Python. First we will extract a smaller set of predictors to work with in this illustrative example—`Smoothness`, diff --git a/source/clustering.md b/source/clustering.md index 7dc7815a..11ce494f 100755 --- a/source/clustering.md +++ b/source/clustering.md @@ -308,7 +308,7 @@ have. clus = penguins_clustered[penguins_clustered["cluster"] == 0][["bill_length_standardized", "flipper_length_standardized"]] ``` -```{index} see: within-cluster sum-of-squared-distances; WSSD +```{index} see: within-cluster sum of squared distances; WSSD ``` ```{index} WSSD @@ -623,7 +623,7 @@ are changing, and the algorithm terminates. ### Random restarts -```{index} K-means; init argument +```{index} K-means; restart ``` Unlike the classification and regression models we studied in previous chapters, K-means can get "stuck" in a bad solution. @@ -792,7 +792,10 @@ Total WSSD for K clusters ranging from 1 to 9. ## K-means in Python -```{index} K-means; kmeans function, scikit-learn; KMeans +```{index} K-means, scikit-learn; KMeans +``` + +```{index} see: KMeans; scikit-learn ``` We can perform K-means in Python using a workflow similar to those @@ -807,6 +810,9 @@ To address this problem, we typically standardize our data before clustering, which ensures that each variable has a mean of 0 and standard deviation of 1. The `StandardScaler` function in `scikit-learn` can be used to do this. +```{index} scikit-learn; StandardScaler, scikit-learn;KMeans, standardization;K-means, K-means;standardization +``` + ```{code-cell} ipython3 from sklearn.preprocessing import StandardScaler from sklearn.compose import make_column_transformer @@ -826,6 +832,9 @@ To indicate that we are performing K-means clustering, we will create a `KMeans` model object. It takes at least one argument: the number of clusters `n_clusters`, which we set to 3. 
+```{index} KMeans;n_clusters +``` + ```{code-cell} ipython3 from sklearn.cluster import KMeans @@ -833,6 +842,9 @@ kmeans = KMeans(n_clusters=3) kmeans ``` +```{index} scikit-learn;make_pipeline, scikit-learn;Pipeline, scikit-learn;fit +``` + To actually run the K-means clustering, we combine the preprocessor and model object in a `Pipeline`, and use the `fit` function. Note that the K-means algorithm uses a random initialization of assignments, but since we set @@ -846,7 +858,7 @@ penguin_clust.fit(penguins) penguin_clust ``` -```{index} K-means; inertia_, K-means; cluster_centers_, K-means; labels_, K-means; predict +```{index} KMeans; labels_, KMeans; inertia_ ``` The fit `KMeans` object—which is the second item in the @@ -874,6 +886,9 @@ adding the `:N` suffix ensures that `altair` will treat the `cluster` variable as a nominal/categorical variable, and hence use a discrete color map for the visualization. +```{index} altair; :N +``` + ```{code-cell} ipython3 cluster_plot=alt.Chart(penguins).mark_circle().encode( x=alt.X("flipper_length_mm").title("Flipper Length").scale(zero=False), @@ -895,10 +910,10 @@ glue("cluster_plot", cluster_plot, display=True) The data colored by the cluster assignments returned by K-means. ::: -```{index} WSSD; total, K-means; inertia_ +```{index} WSSD; total, KMeans; inertia_ ``` -```{index} see: WSSD; K-means inertia +```{index} see: WSSD; KMeans ``` As mentioned above, @@ -920,6 +935,9 @@ where we repeat an operation multiple times and return a list with the result. Here is an examples of a list comprehension that stores the numbers 0-2 in a list: +```{index} list comprehension +``` + ```{code-cell} ipython3 [n for n in range(3)] ``` @@ -992,9 +1010,6 @@ glue("elbow_plot", elbow_plot, display=True) A plot showing the total WSSD versus the number of clusters. ::: -```{index} K-means; init argument -``` - It looks like three clusters is the right choice for this data, since that is where the "elbow" of the line is the most distinct. In the plot, @@ -1008,6 +1023,9 @@ This is because K-means can get "stuck" in a bad solution due to an unlucky initialization of the initial center positions as we mentioned earlier in the chapter. +```{index} KMeans; n_init +``` + ```{note} It is rare that the implementation of K-means from `scikit-learn` gets stuck in a bad solution, because `scikit-learn` tries to choose diff --git a/source/inference.md b/source/inference.md index dfb36c07..44136c9c 100755 --- a/source/inference.md +++ b/source/inference.md @@ -168,7 +168,7 @@ We can find the proportion of listings for each room type by using the `value_counts` function with the `normalize` parameter as we did in previous chapters. -```{index} pandas.DataFrame; df[], count, len +```{index} DataFrame; [], DataFrame; value_counts ``` ```{code-cell} ipython3 @@ -187,13 +187,13 @@ value, {glue:text}`population_proportion`, is the population parameter. Remember parameter value is usually unknown in real data analysis problems, as it is typically not possible to make measurements for an entire population. -```{index} pandas.DataFrame; sample +```{index} DataFrame; sample, seed;numpy.random.seed ``` Instead, perhaps we can approximate it with a small subset of data! To investigate this idea, let's try randomly selecting 40 listings (*i.e.,* taking a random sample of size 40 from our population), and computing the proportion for that sample. -We will use the `sample` method of the `pandas.DataFrame` +We will use the `sample` method of the `DataFrame` object to take the sample. 
The argument `n` of `sample` is the size of the sample to take and since we are starting to use randomness here, we are also setting the random seed via numpy to make the results reproducible. @@ -213,6 +213,9 @@ airbnb.sample(n=40)["room_type"].value_counts(normalize=True) glue("sample_1_proportion", "{:.3f}".format(airbnb.sample(n=40, random_state=155)["room_type"].value_counts(normalize=True)["Entire home/apt"])) ``` +```{index} DataFrame; value_counts +``` + Here we see that the proportion of entire home/apartment listings in this random sample is {glue:text}`sample_1_proportion`. Wow—that's close to our true population value! But remember, we computed the proportion using a random sample of size 40. @@ -245,7 +248,7 @@ commonly refer to as $n$) from a population is called a **sampling distribution**. The sampling distribution will help us see how much we would expect our sample proportions from this population to vary for samples of size 40. -```{index} pandas.DataFrame; sample +```{index} DataFrame; sample ``` We again use the `sample` to take samples of size 40 from our @@ -281,6 +284,9 @@ to compute the number of qualified observations in each sample; finally compute Both the first and last few entries of the resulting data frame are printed below to show that we end up with 20,000 point estimates, one for each of the 20,000 samples. +```{index} DataFrame;groupby, DataFrame;reset_index +``` + ```{code-cell} ipython3 ( samples @@ -473,7 +479,7 @@ The price per night of all Airbnb rentals in Vancouver, BC is \${glue:text}`population_mean`, on average. This value is our population parameter since we are calculating it using the population data. -```{index} pandas.DataFrame; sample +```{index} DataFrame; sample ``` Now suppose we did not have access to the population data (which is usually the @@ -492,6 +498,9 @@ We can create a histogram to visualize the distribution of observations in the sample ({numref}`fig:11-example-means-sample-hist`), and calculate the mean of our sample. +```{index} altair;mark_bar +``` + ```{code-cell} ipython3 :tags: [remove-output] @@ -978,7 +987,7 @@ mean of the sample is \${glue:text}`estimate_mean`. Remember, in practice, we usually only have this one sample from the population. So this sample and estimate are the only data we can work with. -```{index} bootstrap; in Python, scikit-learn; resample (bootstrap) +```{index} bootstrap; in Python, DataFrame; sample (bootstrap) ``` We now perform steps 1–5 listed above to generate a single bootstrap @@ -1097,6 +1106,9 @@ generate a bootstrap distribution of these point estimates. The bootstrap distribution ({numref}`fig:11-bootstrapping5`) suggests how we might expect our point estimate to behave if we take multiple samples. +```{index} DataFrame;reset_index, DataFrame;rename, DataFrame;groupby, Series;mean +``` + ```{code-cell} ipython3 boot20000_means = ( boot20000 @@ -1240,7 +1252,10 @@ Quantiles are expressed in proportions rather than percentages, so the 2.5th and 97.5th percentiles would be the 0.025 and 0.975 quantiles, respectively. 
-```{index} numpy; percentile, pandas.DataFrame; df[] +```{index} DataFrame; [], DataFrame;quantile +``` + +```{index} percentile ``` ```{code-cell} ipython3 diff --git a/source/intro.md b/source/intro.md index 576deba0..606f3d27 100755 --- a/source/intro.md +++ b/source/intro.md @@ -264,7 +264,7 @@ Non-Official & Non-Aboriginal languages,American Sign Language,2685,3020,1145,21 Non-Official & Non-Aboriginal languages,Amharic,22465,12785,200,33670 ``` -```{index} function, argument, read function; read\_csv +```{index} function, argument, read function; read_csv ``` To load this data into Python so that we can do things with it (e.g., perform @@ -437,7 +437,13 @@ can_lang ## Creating subsets of data frames with `[]` & `loc[]` -```{index} pandas.DataFrame; [], pandas.DataFrame; loc[] +```{index} see: []; DataFrame +``` + +```{index} see: loc[]; DataFrame +``` + +```{index} DataFrame; [], DataFrame; loc[], selecting columns ``` Now that we've loaded our data into Python, we can start wrangling the data to @@ -469,7 +475,7 @@ high-level categories of languages, which include "Aboriginal languages", our question we want to filter our data set so we restrict our attention to only those languages in the "Aboriginal languages" category. -```{index} pandas.DataFrame; [], filter, logical statement, logical statement; equivalency operator, string +```{index} DataFrame; [], filtering rows, logical statement, logical operator; equivalency (==), string ``` We can use the `[]` operation to obtain the subset of rows with desired values @@ -515,7 +521,7 @@ can_lang[can_lang["category"] == "Aboriginal languages"] ### Using `[]` to select columns -```{index} pandas.DataFrame; [], select; +```{index} DataFrame; [], selecting columns ``` We can also use the `[]` operation to select columns from a data frame. @@ -545,7 +551,7 @@ can_lang[["language", "mother_tongue"]] ### Using `loc[]` to filter rows and select columns -```{index} pandas.DataFrame; loc[] +```{index} DataFrame; loc[], selecting columns ``` The `[]` operation is only used when you want to filter rows *or* select columns; @@ -606,7 +612,7 @@ So it looks like the `loc[]` operation gave us the result we wanted! ## Using `sort_values` and `head` to select rows by ordered values -```{index} pandas.DataFrame; sort_values, pandas.DataFrame; head +```{index} DataFrame; sort_values, DataFrame; head ``` We have used the `[]` and `loc[]` operations on a data frame to obtain a table @@ -652,7 +658,7 @@ ten_lang (ch1-adding-modifying)= ## Adding and modifying columns -```{index} assign +```{index} adding columns, modifying columns ``` Recall that our data analysis question referred to the *count* of Canadians @@ -700,9 +706,6 @@ as a mother tongue by between 0.008% and 0.18% of the Canadian population. ## Combining analysis steps with chaining and multiline expressions -```{index} chaining methods -``` - It took us 3 steps to find the ten Aboriginal languages most often reported in 2016 as mother tongues in Canada. Starting from the `can_lang` data frame, we: @@ -771,6 +774,9 @@ what the rest of the expression is. We could, of course, put all of the code on one line of code, but splitting it across multiple lines helps a lot with code readability. +```{index} chaining +``` + We still have to handle the issue that each line of code---i.e., each step in the analysis---introduces a new temporary object. To address this issue, we can *chain* multiple operations together without assigning intermediate objects. 
The key idea of chaining is that the *output* of @@ -866,7 +872,9 @@ First, we need to import the `altair` package. ```{code-cell} ipython3 import altair as alt +``` +```{index} altair; mark_bar, altair; encoding channel ``` +++ @@ -916,7 +924,7 @@ Bar plot of the ten Aboriginal languages most often reported by Canadian residen +++ -```{index} see: .; chaining methods +```{index} see: .; chaining ``` ### Formatting `altair` charts @@ -935,7 +943,7 @@ Canadian Residents)" would be much more informative. To make the code easier to read, we're spreading it out over multiple lines just as we did in the previous section with pandas. -```{index} plot; labels, plot; axis labels +```{index} plot; labels, plot; axis labels, altair; alt.X, altair; alt.Y, altair; title ``` Adding additional labels to our visualizations that we create in `altair` is diff --git a/source/jupyter.md b/source/jupyter.md index 6f14c442..85110c29 100755 --- a/source/jupyter.md +++ b/source/jupyter.md @@ -410,15 +410,18 @@ notebook. ## Exploring data files +```{index} separator +``` + It is essential to preview data files before you try to read them into Python to see -whether or not there are column names, what the delimiters are, and if there are +whether or not there are column names, what the separators are, and if there are lines you need to skip. In Jupyter, you preview data files stored as plain text files (e.g., comma- and tab-separated files) in their plain text format ({numref}`open-data-w-editor-2`) by right-clicking on the file's name in the Jupyter file explorer, selecting **Open with**, and then selecting **Editor** ({numref}`open-data-w-editor-1`). Suppose you do not specify to open the data file with an editor. In that case, Jupyter will render a nice table -for you, and you will not be able to see the column delimiters, and therefore +for you, and you will not be able to see the column separators, and therefore you will not know which function to use, nor which arguments to use and values to specify for them. diff --git a/source/preface-text.md b/source/preface-text.md index 78148f79..39c506d2 100755 --- a/source/preface-text.md +++ b/source/preface-text.md @@ -15,11 +15,9 @@ kernelspec: # Preface -```{index} data science, auditable, reproducible +```{index} data science; definition, auditable, reproducible ``` - - This textbook aims to be an approachable introduction to the world of data science. In this book, we define **data science** as the process of generating insight from data through **reproducible** and **auditable** processes. diff --git a/source/reading.md b/source/reading.md index 61c9d53c..442e5921 100755 --- a/source/reading.md +++ b/source/reading.md @@ -88,9 +88,6 @@ with respect to your *working directory* (i.e., "where you are currently") on th On the other hand, an absolute path indicates where the file is with respect to the computer's filesystem base (or *root*) folder, regardless of where you are working. -```{index} Happiness Report -``` - Suppose our computer's filesystem looks like the picture in {numref}`Filesystem`. We are working in a file titled `worksheet_02.ipynb`, and our current working directory is `worksheet_02`; @@ -126,6 +123,15 @@ happy_data = pd.read_csv("data/happiness_report.csv") Note that there is no forward slash at the beginning of a relative path; if we accidentally typed `"/data/happiness_report.csv"`, Python would look for a folder named `data` in the root folder of the computer—but that doesn't exist! 
+```{index} path; previous, path; current +``` + +```{index} see: ..; path +``` + +```{index} see: .; path +``` + Aside from specifying places to go in a path using folder names (like `data` and `worksheet_02`), we can also specify two additional special places: the *current directory* and the *previous directory*. We indicate the current working directory with a single dot `.`, and the previous directory with two dots `..`. So for instance, if we wanted to reach the `bike_share.csv` file from the `worksheet_02` folder, we could @@ -177,7 +183,7 @@ to where the resource is located on the remote machine. (readcsv)= ### `read_csv` to read in comma-separated values files -```{index} csv, reading; separator, read function; read\_csv +```{index} csv, reading; separator, read function; read_csv ``` Now that we have learned about *where* data could be, we will learn about *how* @@ -277,7 +283,7 @@ canlang_data = pd.read_csv("data/can_lang_meta-data.csv") ParserError: Error tokenizing data. C error: Expected 1 fields in line 4, saw 6 ``` -```{index} Error +```{index} ParserError ``` ```{index} read function; skiprows argument @@ -330,7 +336,7 @@ Non-Official & Non-Aboriginal languages Amharic 22465 12785 200 33670 ```{index} see: tab-separated values; tsv ``` -```{index} tsv, read function; read_tsv +```{index} tsv ``` To read in `.tsv` (**t**ab **s**eparated **v**alues) files, we can set the `sep` argument @@ -362,7 +368,7 @@ arguments depending on the file format, our resulting data frame ### Using the `header` argument to handle missing column names -```{index} read function; header, reading; separator +```{index} read function; header argument, reading; separator ``` The `can_lang_no_names.tsv` file contains a slightly different version @@ -401,7 +407,7 @@ canlang_data = pd.read_csv( canlang_data ``` -```{index} pandas.DataFrame; rename, pandas +```{index} DataFrame; rename, pandas ``` It is best to rename your columns manually in this scenario. The current column names @@ -528,6 +534,10 @@ X?a??4VT?,D?Jq ```{index} read function; read_excel ``` +```{index} Excel spreadsheet; reading +``` + + This type of file representation allows Excel files to store additional things that you cannot store in a `.csv` file, such as fonts, text formatting, graphics, multiple sheets and more. And despite looking odd in a plain text @@ -614,12 +624,15 @@ usually stored and accessed locally on one computer from a file with a `.db` extension (or sometimes a `.sqlite` extension). Similar to Excel files, these are not plain text files and cannot be read in a plain text editor. -```{index} database; connect, ibis, ibis; ibis +```{index} database; connection, ibis; connect ``` ```{index} see: ibis; database ``` +```{index} see: database; ibis +``` + The first thing you need to do to read data into Python from a database is to connect to the database. 
For an SQLite database, we will do that using the `connect` function from the @@ -642,7 +655,7 @@ import ibis conn = ibis.sqlite.connect("data/can_lang.db") ``` -```{index} database; tables; list_tables +```{index} database; table, ibis; list_tables, ibis; sqlite ``` Often relational databases have many tables; thus, in order to retrieve @@ -656,7 +669,7 @@ tables = conn.list_tables() tables ``` -```{index} database; table, ibis; table +```{index} table, ibis; table ``` The `list_tables` function returned only one name---`"can_lang"`---which tells us @@ -672,7 +685,10 @@ canlang_table = conn.table("can_lang") canlang_table ``` -```{index} database; count, ibis; count +```{index} ibis; count +``` + +```{index} see: count; ibis ``` Although it looks like we might have obtained the whole data frame from the database, we didn't! @@ -687,7 +703,7 @@ In `ibis`, we can do that using the `count` function from the table object. canlang_table.count() ``` -```{index} execute, ibis; execute +```{index} ibis; execute ``` Wait a second...this isn't the number of rows in the database. In fact, we haven't actually sent our @@ -708,7 +724,9 @@ the *actual* text of the SQL query that `ibis` sends to the database, you can us instead of `execute`. But note that you have to pass the result of `compile` to the `str` function to turn it into a human-readable string first. -```{index} compile, ibis; compile +```{index} see: compile;ibis +``` +```{index} ibis; compile, str ``` ```{code-cell} ipython3 @@ -725,7 +743,7 @@ The `ibis` package provides lots of `pandas`-like tools for working with databas For example, we can look at the first few rows of the table by using the `head` function, followed by `execute` to retrieve the response. -```{index} database; head, ibis; +```{index} ibis; head ``` ```{code-cell} ipython3 @@ -742,7 +760,7 @@ the `language` and `mother_tongue` columns. We can use the `[]` operation with a logical statement to obtain only certain rows. Below we filter the data to include only Aboriginal languages. -```{index} database; filter, ibis; +```{index} database; filter rows, ibis; [] ``` ```{code-cell} ipython3 @@ -755,7 +773,7 @@ We didn't call `execute` because we are not ready to bring the data into Python We can still use the database to do some work to obtain *only* the small amount of data we want to work with locally in Python. Let's add the second part of our SQL query: selecting only the `language` and `mother_tongue` columns. -```{index} database; select, ibis; +```{index} database; select columns ``` ```{code-cell} ipython3 @@ -777,7 +795,7 @@ that we need for analysis; we do eventually need to call `execute`. For example, `ibis` does not provide the `tail` function to look at the last rows in a database, even though `pandas` does. -```{index} pandas.DataFrame; tail +```{index} DataFrame; tail ``` ```{code-cell} ipython3 @@ -821,6 +839,9 @@ Note that the `host` (`fakeserver.stat.ubc.ca`), `user` (`user0001`), and `password` (`abc123`) below are *not real*; you will not actually be able to connect to a database using this information. +```{index} ibis; postgres, ibis; connect +``` + ```python conn = ibis.postgres.connect( database = "can_mov_db", @@ -836,6 +857,9 @@ that connecting to and working with a Postgres database is identical to connecting to and working with an SQLite database. 
For example, we can again use `list_tables` to find out what tables are in the `can_mov_db` database: +```{index} ibis; list_tables +``` + ```python conn.list_tables() ``` @@ -848,6 +872,9 @@ We see that there are 10 tables in this database. Let's first look at the `"ratings"` table to find the lowest rating that exists in the `can_mov_db` database. +```{index} ibis; table +``` + ```python ratings_table = conn.table("ratings") ratings_table @@ -860,7 +887,7 @@ AlchemyTable: ratings num_votes int64 ``` -```{index} ibis; select +```{index} ibis; [] ``` To find the lowest rating that exists in the data base, we first need to @@ -882,7 +909,7 @@ Selection[r0] average_rating: r0.average_rating ``` -```{index} database; order_by, ibis; head, ibis; ibis +```{index} database; ordering, ibis; order_by, ibis; head ``` Next we use the `order_by` function from `ibis` order the table by `average_rating`, @@ -929,7 +956,7 @@ Databases are beneficial in a large-scale setting: ## Writing data from Python to a `.csv` file -```{index} write function; to_csv, pandas.DataFrame; to_csv +```{index} write function; to_csv, DataFrame; to_csv ``` At the middle and end of a data analysis, we often want to write a data frame @@ -1309,6 +1336,9 @@ argument—the URL of the page to scrape—and will return a list of data frames corresponding to all the tables it finds at that URL. We can see below that `read_html` found 17 tables on the Wikipedia page for Canada. +```{index} read function; read_html +``` + ```python canada_wiki_tables = pd.read_html("https://en.wikipedia.org/wiki/Canada") len(canada_wiki_tables) @@ -1356,16 +1386,16 @@ hope that it gives you enough of a basic idea that you can learn how to use another API if needed. In particular, in this book we will show you the basics of how to use the `requests` package in Python to access data from the NASA "Astronomy Picture of the Day" API (a great source of desktop backgrounds, by the way—take a look at the stunning -picture of the Rho-Ophiuchi cloud complex in {numref}`fig:NASA-API-Rho-Ophiuchi` from July 13, 2023!). +picture of the Rho-Ophiuchi cloud complex {cite:p}`rhoophiuchi` in {numref}`fig:NASA-API-Rho-Ophiuchi` from July 13, 2023!). -```{index} API; requests, NASA, API; token; key +```{index} requests, NASA, API; token ``` ```{figure} img/reading/NASA-API-Rho-Ophiuchi.png :name: fig:NASA-API-Rho-Ophiuchi :width: 400px -The James Webb Space Telescope's NIRCam image of the Rho Ophiuchi molecular cloud complex {cite:p}`rhoophiuchi`. +The James Webb Space Telescope's NIRCam image of the Rho Ophiuchi molecular cloud complex. ``` +++ @@ -1411,6 +1441,9 @@ That should be more than enough for our purposes in this section. #### Accessing the NASA API +```{index} API; HTTP, API; query parameters, API; endpoint +``` + The NASA API is what is known as an *HTTP API*: this is a particularly common kind of API, where you can obtain data simply by accessing a particular URL as if it were a regular website. To make a query to the NASA @@ -1459,6 +1492,12 @@ disks.","hdurl":"https://apod.nasa.gov/apod/image/2307/STScI-01_RhoOph.png", Rho Ophiuchi","url":"https://apod.nasa.gov/apod/image/2307/STScI-01_RhoOph1024.png"} ``` +```{index} see: JavaScript Object Notation; JSON +``` + +```{index} JSON, requests; get, requests; json +``` + Neat! There is definitely some data there, but it's a bit hard to see what it all is. As it turns out, this is a common format for data called *JSON* (JavaScript Object Notation). 
We won't encounter this kind of data much in this book, diff --git a/source/regression1.md b/source/regression1.md index d7b23af3..028b21a4 100755 --- a/source/regression1.md +++ b/source/regression1.md @@ -81,6 +81,9 @@ numerical, and so predicting them given past data is considered a regression pro ```{index} classification; comparison to regression ``` +```{index} regression; comparison to classification +``` + Just like in the classification setting, there are many possible methods that we can use to predict numerical response variables. In this chapter we will focus on the **K-nearest neighbors** algorithm {cite:p}`knnfix,knncover`, and in the next chapter @@ -136,6 +139,9 @@ is fair, or perhaps how to set the price of a new listing. We begin the analysis by loading and examining the data, as well as setting the seed value. +```{index} seed;numpy.random.seed +``` + ```{code-cell} ipython3 import altair as alt import numpy as np @@ -214,7 +220,7 @@ predict the former. ## K-nearest neighbors regression -```{index} K-nearest neighbors; regression +```{index} K-nearest neighbors, K-nearest neighbors; regression ``` Much like in the case of classification, @@ -227,7 +233,7 @@ how well it predicts house sale price. This subsample is taken to allow us to illustrate the mechanics of K-NN regression with a few data points; later in this chapter we will use all the data. -```{index} pandas.DataFrame; sample +```{index} DataFrame; sample ``` To take a small random sample of size 30, we'll use the @@ -281,7 +287,7 @@ Scatter plot of price (USD) versus house size (square feet) with vertical line i +++ -```{index} pandas.DataFrame; assign, pandas.DataFrame; head, pandas.DataFrame; sort_values, abs +```{index} DataFrame; abs, DataFrame; nsmallest ``` We will employ the same intuition from {numref}`Chapters %s ` and {numref}`%s `, and use the @@ -291,9 +297,6 @@ For the example shown in {numref}`fig:07-small-eda-regr`, we find and label the 5 nearest neighbors to our observation of a house that is 2,000 square feet. -```{index} nsmallest -``` - ```{code-cell} ipython3 small_sacramento["dist"] = (2000 - small_sacramento["sqft"]).abs() nearest_neighbors = small_sacramento.nsmallest(5, "dist") @@ -303,7 +306,6 @@ nearest_neighbors ```{code-cell} ipython3 :tags: [remove-cell] - nn_plot = small_plot + rule # plot horizontal lines which is perpendicular to x=2000 @@ -389,7 +391,7 @@ about what the data must look like for it to work. ## Training, evaluating, and tuning the model -```{index} training data, test data +```{index} training set, test set ``` As usual, we must start by putting some test data away in a lock box @@ -538,7 +540,7 @@ training or testing data. But many people just use RMSE for both, and rely on context to denote which data the root mean squared error is being calculated on. 
``` -```{index} scikit-learn, scikit-learn; pipeline, scikit-learn; make_pipeline, scikit-learn; make_column_transformer +```{index} scikit-learn, scikit-learn; Pipeline, scikit-learn; make_pipeline, scikit-learn; make_column_transformer ``` Now that we know how we can assess how well our model predicts a numerical diff --git a/source/regression2.md b/source/regression2.md index f7245fdf..0ed649e1 100755 --- a/source/regression2.md +++ b/source/regression2.md @@ -150,6 +150,9 @@ Scatter plot of sale price versus size with line of best fit for subset of the S ```{index} straight line; equation ``` +```{index} see: line; straight line +``` + The equation for the straight line is: $$\text{house sale price} = \beta_0 + \beta_1 \cdot (\text{house size}),$$ @@ -348,7 +351,7 @@ Below we illustrate how we can use the usual `scikit-learn` workflow to predict price given house size. We use a simple linear regression approach on the full Sacramento real estate data set. -```{index} scikit-learn; random_state +```{index} seed; numpy.random.seed ``` As usual, we start by loading packages, setting the seed, loading data, and @@ -731,6 +734,13 @@ mlm.fit( ``` Finally, we make predictions on the test data set to assess the quality of our model. +```{index} scikit-learn;predict, scikit-learn;mean_squared_error +``` +```{index} see: mean_squared_error;scikit-learn +``` +```{index} see: predict;scikit-learn +``` + ```{code-cell} ipython3 sacramento_test["predicted"] = mlm.predict(sacramento_test[["sqft","beds"]]) @@ -1059,10 +1069,7 @@ Scatter plot of the full data, with outlier highlighted in red. ### Multicollinearity -```{index} colinear -``` - -```{index} see: multicolinear; colinear +```{index} multicollinearity ``` The second, and much more subtle, issue can occur when performing multivariable diff --git a/source/setup.md b/source/setup.md index a540198d..e9bf4b31 100755 --- a/source/setup.md +++ b/source/setup.md @@ -66,7 +66,7 @@ exactly right! To keep things simple, we instead recommend that you install [Docker](https://docker.com). Docker lets you run your Jupyter notebooks inside a pre-built *container* that comes with precisely the right versions of all software packages needed run the worksheets that come with this book. -```{index} Docker +```{index} Docker, container ``` ```{note} @@ -85,6 +85,8 @@ installed on your computer—or even if you haven't installed Python at all! visit [the online Docker documentation](https://docs.docker.com/desktop/install/windows-install/), and download the `Docker Desktop Installer.exe` file. Double-click the file to open the installer and follow the instructions on the installation wizard, choosing **WSL-2** instead of **Hyper-V** when prompted. +```{index} Docker;installation +``` ```{note} Occasionally, when you first run Docker on Windows, you will encounter an error message. Some common errors you may see: @@ -99,6 +101,8 @@ Occasionally, when you first run Docker on Windows, you will encounter an error to help you with this, as editing the BIOS can be dangerous. Detailed instructions for doing this are beyond the scope of this book. ``` +```{index} Docker;image, Docker;tag +``` **Running JupyterLab** Run Docker Desktop. Once it is running, you need to download and run the Docker *image* that we have made available for the worksheets (an *image* is like a "snapshot" of a computer with all the right packages pre-installed). 
You only need to do this step one time; the image will remain diff --git a/source/version-control.md b/source/version-control.md index a553efdf..4a3b3d78 100755 --- a/source/version-control.md +++ b/source/version-control.md @@ -149,6 +149,9 @@ want to use one for your project. ```{index} repository, repository;local, repository;remote ``` +```{index} see: repository; version control +``` + Typically, when we put a data analysis project under version control, we create two copies of the repository ({numref}`vc1-no-changes`). One copy we use as our primary workspace where we create, edit, and delete files. @@ -197,8 +200,6 @@ one for each commit: `Created README.md` and `Added analysis draft`. ```{index} hash ``` - - The hash is a string of characters consisting of about 40 letters and numbers. The purpose of the hash is to serve as a unique identifier for the commit, and is used by Git to index project history. Although hashes are quite long—imagine @@ -233,11 +234,14 @@ name: vc2-changes Local repository with changes to files. ``` -```{index} git;add, staging area +```{index} git;add, staging area, git;commit +``` + +```{index} see: staging area; git ``` Once you reach a point that you want Git to keep a record -of the current version of your work, you need to commit +of the current version of your work, you need to **commit** (i.e., snapshot) your changes. A prerequisite to this is telling Git which files should be included in that snapshot. We call this step **adding** the files to the **staging area**. @@ -256,8 +260,6 @@ name: vc-ba2-add Adding modified files to the staging area in the local repository. ``` - - Once the files we wish to commit have been added to the staging area, we can then commit those files to the repository history ({numref}`vc-ba3-commit`). When we do this, we are required to include a helpful *commit message* to tell @@ -282,8 +284,6 @@ Committing the modified files in the staging area to the local repository histor ```{index} git;push ``` - - Once you have made one or more commits that you want to share with your collaborators, you need to **push** (i.e., send) those commits back to GitHub ({numref}`vc5-push`). This updates the history in the remote repository (i.e., GitHub) to match what you have in your @@ -330,15 +330,11 @@ name: vc7-pull Pulling changes from the remote GitHub repository to synchronize your local repository. ``` - - ## Working with remote repositories using GitHub ```{index} repository;remote, GitHub, git;clone ``` - - Now that you have been introduced to some of the key general concepts and workflows of Git version control, we will walk through the practical steps. There are several different ways to start using version control @@ -368,10 +364,9 @@ name: new-repository-01 New repositories on GitHub can be created by clicking on "New Repository" from the + menu. ``` -```{index} repository;public +```{index} repository;public, repository;private ``` - Repositories can be set up with a variety of configurations, including a name, optional description, and the inclusion (or not) of several template files. One of the most important configuration items to choose is the visibility to the outside world, @@ -394,8 +389,6 @@ name: new-repository-02 Repository configuration for a project that is public and initialized with a README.md template file. ``` - - A newly created public repository with a `README.md` template file should look something like what is shown in {numref}`new-repository-03`. 
@@ -406,8 +399,6 @@ name: new-repository-03 Respository configuration for a project that is public and initialized with a README.md template file. ``` - - +++ ### Editing files on GitHub with the pen tool @@ -437,8 +428,6 @@ The text box where edits can be made after clicking on the pen tool. ```{index} GitHub; commit ``` - - After you are done with your edits, they can be "saved" by *committing* your changes. When you *commit a file* in a repository, the version control system takes a snapshot of what the file looks like. As you continue working on the @@ -470,8 +459,6 @@ Saving changes using the pen tool requires committing those changes, and an asso ```{index} GitHub; add file ``` - - The "Add file" menu can be used to create new plain text files and upload files from your computer. To create a new plain text file, click the "Add file" drop-down menu and select the "Create new file" option @@ -487,8 +474,6 @@ New plain text files can be created directly on GitHub. ```{index} markdown ``` - - A page will open with a small text box for the file name to be entered, and a larger text box where the desired file content text can be entered. Note the two tabs, "Edit new file" and "Preview". Toggling between them lets you enter and @@ -573,8 +558,6 @@ to learn how to use Jupyter before reading this chapter. ```{index} GitHub; personal access token ``` - - To send and retrieve work between your local repository and the remote repository on GitHub, you will frequently need to authenticate with GitHub @@ -641,18 +624,11 @@ name: generate-pat-03 Display of the newly generated personal access token. ``` - - ### Cloning a repository using Jupyter - - ```{index} git;clone ``` - - *Cloning* a remote repository from GitHub to create a local repository results in a copy that knows where it was obtained from so that it knows where to send/receive @@ -758,8 +734,6 @@ Adding `eda.ipynb` makes it visible in the staging area. ```{index} git;commit ``` - - To snapshot the changes with an associated commit message, you must put a message in the text box at the bottom of the Git pane and click on the blue "Commit" button ({numref}`git-commit-01`). @@ -779,12 +753,10 @@ name: git-commit-01 A commit message must be added into the Jupyter Git extension commit text box before the blue Commit button can be used to record the commit. ``` - After "committing" the file(s), you will see there are 0 "Staged" files. You are now ready to push your changes to the remote repository on GitHub ({numref}`git-commit-03`). - ```{figure} img/version-control/git_commit_03.png --- name: git-commit-03 @@ -792,15 +764,11 @@ name: git-commit-03 After recording a commit, the staging area should be empty. ``` - - ### Pushing the commits to GitHub ```{index} git;push ``` - - To send the committed changes back to the remote repository on GitHub, you need to *push* them. To do this, click on the cloud icon with the up arrow on the Jupyter Git tab @@ -813,7 +781,6 @@ name: git-push-01 The Jupyter Git extension "push" button (circled in red). ``` - You will then be prompted to enter your GitHub username and the personal access token that you generated earlier (not your account password!). Click @@ -826,7 +793,6 @@ name: git-push-02 Enter your Git credentials to authorize the push to the remote repository. ``` - If the files were successfully pushed to the project repository on GitHub, you will be shown a success message ({numref}`git-push-03`). Click "Dismiss" to continue working in Jupyter. 
@@ -838,7 +804,6 @@ name: git-push-03 The prompt that the push was successful. ``` - If you visit the remote repository on GitHub, you will see that the changes now exist there too ({numref}`git-push-04`)! @@ -850,7 +815,6 @@ name: git-push-04 The GitHub web interface shows a preview of the commit message, and the time of the most recently pushed commit for each file. ``` - ## Collaboration ### Giving collaborators access to your project @@ -858,8 +822,6 @@ The GitHub web interface shows a preview of the commit message, and the time of ```{index} GitHub; collaborator access ``` - - As mentioned earlier, GitHub allows you to control who has access to your project. The default of both public and private projects are that only the person who created the GitHub repository has permissions to create, edit and @@ -988,7 +950,6 @@ name: merge-conflict-01 Error message that indicates that there are changes on the remote repository that you do not have locally. ``` - Usually, getting out of this situation is not too troublesome. First you need to pull the changes that exist on GitHub that you do not yet have in the local repository. Usually when this happens, Git can automatically merge the changes @@ -1010,15 +971,11 @@ same line of the same file and that Git will not be able to automatically merge the changes. ``` - - ### Handling merge conflicts ```{index} git;merge conflict ``` - - To fix the merge conflict, you need to open the offending file in a plain text editor and look for special marks that Git puts in the file to tell you where the merge conflict occurred ({numref}`merge-conflict-04`). diff --git a/source/viz.md b/source/viz.md index 1f56039d..35867bbe 100755 --- a/source/viz.md +++ b/source/viz.md @@ -241,6 +241,9 @@ The `ppm` column holds the value of CO$_{\text{2}}$ in parts per million that was measured on each date, and is type `float64`; this is the usual type for decimal numbers. +```{index} dates and times +``` + ```{note} `read_csv` was able to parse the `date_measured` column into the `datetime` vector type because it was entered @@ -267,7 +270,7 @@ and the CO$_{\text{2}}$ concentration as the `y` coordinate. We create a chart with the `alt.Chart()` function. There are a few basic aspects of a plot that we need to specify: -```{index} altair; graphical mark, altair; encoding channel +```{index} altair; graphical mark, altair; encoding channel, altair; mark_point ``` - The name of the **data frame** to visualize. @@ -649,9 +652,6 @@ glue("can_lang_plot", can_lang_plot, display=False) Scatter plot of number of Canadians reporting a language as their mother tongue vs the primary language at home ::: -```{index} escape character -``` - To make an initial improvement in the interpretability of {numref}`can_lang_plot`, we should replace the default axis @@ -663,6 +663,9 @@ where each string in the list will correspond to a new line of text. We can also increase the font size to further improve readability. +```{index} altair; multiline labels +``` + ```{code-cell} ipython3 can_lang_plot_labels = alt.Chart(can_lang).mark_circle().encode( x=alt.X("most_at_home").title( @@ -687,8 +690,6 @@ Scatter plot of number of Canadians reporting a language as their mother tongue ::: - - ```{code-cell} ipython3 :tags: ["remove-cell"] import numpy as np @@ -717,7 +718,7 @@ in the magnitude of these two numbers! 
We can confirm that the two points in the upper right-hand corner correspond to Canada's two official languages by filtering the data: -```{index} pandas.DataFrame; loc[] +```{index} DataFrame; loc[] ``` ```{code-cell} ipython3 @@ -785,6 +786,9 @@ To fix these issue, we can limit the number of ticks and gridlines to only include the seven major ones, and change the number formatting to include a suffix which makes the labels shorter. +```{index} altair; tick count, altair; tick formatting +``` + ```{code-cell} ipython3 can_lang_plot_log_revised = alt.Chart(can_lang).mark_circle().encode( x=alt.X("most_at_home") @@ -844,7 +848,7 @@ using `_` so that it is easier to read; this does not affect how Python interprets the number and is just added for readability. -```{index} pandas.DataFrame; assign, pandas.DataFrame; [[]] +```{index} DataFrame; column assignment, DataFrame; [] ``` ```{code-cell} ipython3 @@ -898,21 +902,21 @@ To fully answer the question, we need to use {numref}`can_lang_plot_percent` to assess a few key characteristics of the data: -```{index} relationship; positive negative none +```{index} relationship; positive, relationship; negative, relationship; none ``` - **Direction:** if the y variable tends to increase when the x variable increases, then y has a **positive** relationship with x. If y tends to decrease when x increases, then y has a **negative** relationship with x. If y does not meaningfully increase or decrease as x increases, then y has **little or no** relationship with x. -```{index} relationship; strong weak +```{index} relationship; strong, relationship; weak ``` - **Strength:** if the y variable *reliably* increases, decreases, or stays flat as x increases, then the relationship is **strong**. Otherwise, the relationship is **weak**. Intuitively, the relationship is strong when the scatter points are close together and look more like a "line" or "curve" than a "cloud." -```{index} relationship; linear nonlinear +```{index} relationship; linear, relationship; nonlinear ``` - **Shape:** if you can draw a straight line roughly through the data points, the relationship is **linear**. Otherwise, it is **nonlinear**. @@ -985,6 +989,9 @@ and specify that we want it on the top of the chart. This automatically changes the legend items to be laid out horizontally instead of vertically, but we could also keep the vertical layout by specifying `direction="vertical"` inside `alt.Legend`. +```{index} altair; alt.Legend +``` + ```{code-cell} ipython3 can_lang_plot_legend = alt.Chart(can_lang).mark_circle().encode( x=alt.X("most_at_home_percent") @@ -1014,6 +1021,9 @@ glue("can_lang_plot_legend", can_lang_plot_legend.properties(height=320, width=4 Scatter plot of percentage of Canadians reporting a language as their mother tongue vs the primary language at home colored by language category with the legend edited. ::: +```{index} color palette, color blindness simulator +``` + In {numref}`can_lang_plot_legend`, the points are colored with the default `altair` color scheme, which is called `"tableau10"`. This is an appropriate choice for most situations and is also easy to read for people with reduced color vision. In general, the color schemes that are used by default in Altair are adapted to the type of data that is displayed and selected to be easy to interpret both for people with good and reduced color vision. 
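A chart can also be exported to a static image if you want to inspect its colors outside of Jupyter, for example in an image viewer or an accessibility checking tool. The following is a minimal sketch, assuming the `can_lang_plot_legend` chart created above is available and that the `vl-convert-python` package (which Altair relies on for static image export) is installed; figure export is covered in more detail later in the chapter.

```python
# Minimal sketch: export the chart so it can be inspected outside of Jupyter.
# Assumes `can_lang_plot_legend` exists and vl-convert-python is installed.
can_lang_plot_legend.save("can_lang_plot_legend.png")  # raster (pixel-based) image
can_lang_plot_legend.save("can_lang_plot_legend.svg")  # vector image
```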
@@ -1021,9 +1031,6 @@ If you are unsure about a certain color combination, you can use this [color blindness simulator](https://www.color-blindness.com/coblis-color-blindness-simulator/) to check if your visualizations are color-blind friendly. -```{index} color palette; color blindness simulator -``` - All the available color schemes and information on how to create your own can be viewed [in the Altair documentation](https://altair-viz.github.io/user_guide/customization.html#customizing-colors). To change the color scheme of our chart, we can add the `scheme` argument in the `scale` of the `color` encoding. @@ -1048,7 +1055,7 @@ can_lang_plot_theme = alt.Chart(can_lang).mark_point(filled=True).encode( y=alt.Y("mother_tongue_percent") .scale(type="log") .axis(tickCount=7) - .title("Mother tongue (percentage of Canadian residents)"), + .title(["Mother tongue", "(percentage of Canadian residents)"]), color=alt.Color("category") .legend(orient="top") .title("") @@ -1081,6 +1088,9 @@ via the `Tooltip` encoding channel, so that text labels for each point show up once we hover over it with the mouse pointer. Here we also add the exact values of the variables on the x and y-axis to the tooltip. +```{index} altair; alt.Tooltip +``` + ```{code-cell} ipython3 can_lang_plot_tooltip = alt.Chart(can_lang).mark_point(filled=True).encode( x=alt.X("most_at_home_percent") @@ -1090,7 +1100,7 @@ can_lang_plot_tooltip = alt.Chart(can_lang).mark_point(filled=True).encode( y=alt.Y("mother_tongue_percent") .scale(type="log") .axis(tickCount=7) - .title("Mother tongue (percentage of Canadian residents)"), + .title(["Mother tongue", "(percentage of Canadian residents)"]), color=alt.Color("category") .legend(orient="top") .title("") @@ -1218,7 +1228,7 @@ as `sort_values` followed by `head`, but are slightly more efficient because the In general, it is good to use more specialized functions when they are available! ``` -```{index} pandas.DataFrame; nlargest; nsmallest +```{index} DataFrame; nlargest, DataFrame; nsmallest ``` ```{code-cell} ipython3 @@ -1338,7 +1348,10 @@ morley_df = pd.read_csv("data/morley.csv") morley_df ``` -```{index} distribution, altair; histogram +```{index} distribution, altair; histogram, altair; count +``` + +```{index} see: count; altair ``` In this experimental data, @@ -1416,7 +1429,7 @@ Histogram of Michelson's speed of light data. #### Adding layers to an `altair` chart -```{index} altair; +; mark_rule +```{index} altair; +, altair; mark_rule, altair; layers ``` {numref}`morley_hist` is a great start. @@ -1696,6 +1709,9 @@ When you create a histogram in `altair`, it tries to choose a reasonable number We can change the number of bins by using the `maxbins` parameter inside the `bin` method. +```{index} altair; maxbins +``` + ```{code-cell} ipython3 morley_hist_maxbins = alt.Chart(morley_df).mark_bar().encode( x=alt.X("RelativeError").bin(maxbins=30), @@ -1950,7 +1966,7 @@ bad, while raster images eventually start to look "pixelated." 
```{index} PDF ``` -```{index} see: portable document dormat; PDF +```{index} see: portable document format; PDF ``` ```{note} diff --git a/source/wrangling.md b/source/wrangling.md index 2c400af3..4cd6d36e 100755 --- a/source/wrangling.md +++ b/source/wrangling.md @@ -72,7 +72,10 @@ This knowledge will be helpful in effectively utilizing these objects in our dat ```{index} data frame; definition ``` -```{index} pandas.DataFrame +```{index} see: data frame; DataFrame +``` + +```{index} DataFrame ``` A data frame is a table-like structure for storing data in Python. Data frames are @@ -109,7 +112,7 @@ A data frame storing data regarding the population of various regions in Canada. ### What is a series? -```{index} pandas.Series +```{index} Series ``` In Python, `pandas` **series** are are objects that can contain one or more elements (like a list). @@ -117,10 +120,8 @@ They are a single column, are ordered, can be indexed, and can contain any data The `pandas` package uses `Series` objects to represent the columns in a data frame. `Series` can contain a mix of data types, but it is good practice to only include a single type in a series because all observations of one variable should be the same type. -Python -has several different basic data types, as shown in -{numref}`tab:datatype-table`. -You can create a `pandas` series using the +Python has several different basic data types, as shown in +{numref}`tab:datatype-table`. You can create a `pandas` series using the `pd.Series()` function. For example, to create the series `region` as shown in {numref}`fig:02-series`, you can write the following. @@ -140,39 +141,29 @@ region Example of a `pandas` series whose type is string. ``` - -```{code-cell} ipython3 -:tags: [remove-cell] - -# The following table was taken from DSCI511 Lecture 1, credit to Arman Seyed-Ahmadi, MDS 2021 -``` - -```{index} data types, string, integer, floating point number, boolean, list, set, dictionary, tuple, none -``` - -```{index} see: str; string +```{index} data types; string (str), data types; integer (int), data types; floating point number (float), data types; boolean (bool), data types; NoneType (none) ``` -```{index} see: int; integer +```{index} see: str; data types ``` -```{index} see: float; floating point number +```{index} see: int; data types ``` -```{index} see: bool; boolean +```{index} see: float; data types ``` -```{index} see: NoneType; none +```{index} see: bool; data types ``` -```{index} see: dict; dictionary +```{index} see: NoneType; data types ``` ```{table} Basic data types in Python :name: tab:datatype-table | Data type | Abbreviation | Description | Example | | :-------------------- | :----------- | :-------------------------------------------- | :----------------------------------------- | -| integer | `int` | positive/negative/zero whole numbers | `42` | +| integer | `int` | positive/negative/zero whole numbers | `42` | | floating point number | `float` | real number in decimal form | `3.14159` | | boolean | `bool` | true or false | `True` | | string | `str` | text | `"Hello World"` | @@ -249,6 +240,12 @@ to both `DataFrames` and `Series` as "data frames" in the text. There are other types that represent data structures in Python. We summarize the most common ones in {numref}`tab:datastruc-table`. 
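As a concrete complement to {numref}`tab:datastruc-table`, the sketch below shows how each of these structures is written as a literal in Python; the variable names and values are arbitrary illustrations, not taken from the census data files used in this chapter.

```python
# Arbitrary, illustrative values only.
cities = ["Toronto", "Montréal", "Vancouver"]  # list: ordered, allows duplicates
languages = {"English", "French"}              # set: unordered, unique elements
population = {"Toronto": 5_928_040}            # dict: maps keys to values
observation = ("Toronto", "English")           # tuple: ordered and immutable
```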
+```{index} data structures; list, data structures; set, data structures; dictionary (dict), data structures; tuple +``` + +```{index} see: dict; data structures +``` + ```{table} Basic data structures in Python :name: tab:datastruc-table | Data Structure | Description | @@ -378,7 +375,7 @@ represented as individual columns to make the data tidy. ### Tidying up: going from wide to long using `melt` -```{index} pandas.DataFrame; melt +```{index} DataFrame; melt ``` One task that is commonly performed to get data into a tidy format @@ -548,7 +545,7 @@ been met: (pivot-wider)= ### Tidying up: going from long to wide using `pivot` -```{index} pandas.DataFrame; pivot +```{index} DataFrame; pivot ``` Suppose we have observations spread across multiple rows rather than in a single @@ -654,6 +651,9 @@ lang_home_tidy.columns = [ lang_home_tidy ``` +```{index} DataFrame; reset_index +``` + In the first step, note that we added a call to `reset_index`. When `pivot` is called with multiple column names passed to the `index`, those entries become the "name" of each row that would be used when you filter rows with `[]` or `loc` rather than just simple numbers. This @@ -665,6 +665,9 @@ The second operation we applied is to rename the columns. When we perform the `p operation, it keeps the original column name `"count"` and adds the `"type"` as a second column name. Having two names for a column can be confusing! So we rename giving each column only one name. +```{index} DataFrame; info +``` + We can print out some useful information about our data frame using the `info` function. In the first row it tells us the `type` of `lang_home_tidy` (it is a `pandas` `DataFrame`). The second row tells us how many rows there are: 1070, and to index those rows, you can use numbers between @@ -697,16 +700,19 @@ more columns, and we would see the data set "widen." +++ (str-split)= -### Tidying up: using `str.split` to deal with multiple delimiters +### Tidying up: using `str.split` to deal with multiple separators + +```{index} Series; str.split, separator +``` -```{index} pandas.Series; str.split, delimiter +```{index} see: delimiter; separator ``` Data are also not considered tidy when multiple values are stored in the same cell. The data set we show below is even messier than the ones we dealt with above: the `Toronto`, `Montréal`, `Vancouver`, `Calgary` and `Edmonton` columns contain the number of Canadians reporting their primary language at home and -work in one column separated by the delimiter (`/`). The column names are the +work in one column separated by the separator (`/`). The column names are the values of a variable, *and* each value does not have its own cell! To turn this messy data into tidy data, we'll have to fix these issues. @@ -786,7 +792,7 @@ tidy_lang.info() Object columns in `pandas` data frames are columns of strings or columns with mixed types. In the previous example in {numref}`pivot-wider`, the `most_at_home` and `most_at_work` variables were `int64` (integer), which is a type of numeric data. -This change is due to the delimiter (`/`) when we read in this messy data set. +This change is due to the separator (`/`) when we read in this messy data set. Python read these columns in as string types, and by default, `str.split` will return columns with the `object` data type. @@ -828,6 +834,12 @@ This section will highlight more advanced usage of the `[]` function, including an in-depth treatment of the variety of logical statements one can use in the `[]` to select subsets of rows. 
+```{index} DataFrame; [], logical statement +``` + +```{index} see: logical statement; logical operator +``` + +++ ### Extracting columns by name @@ -867,6 +879,13 @@ tidy_lang["language"] ### Extracting rows that have a certain value with `==` + +```{index} logical operator; equivalency (==) +``` + +```{index} see: ==; logical operator +``` + Suppose we are only interested in the subset of rows in `tidy_lang` corresponding to the official languages of Canada (English and French). We can extract these rows by using the *equivalency operator* (`==`) @@ -886,6 +905,12 @@ official_langs ### Extracting rows that do not have a certain value with `!=` +```{index} logical operator; inequivalency (!=) +``` + +```{index} see: !=; logical operator +``` + What if we want all the other language categories in the data set *except* for those in the `"Official languages"` category? We can accomplish this with the `!=` operator, which means "not equal to". So if we want to find all the rows @@ -900,6 +925,12 @@ tidy_lang[tidy_lang["category"] != "Official languages"] (filter-and)= ### Extracting rows satisfying multiple conditions using `&` +```{index} logical operator; and (&) +``` + +```{index} see: &; logical operator +``` + Suppose now we want to look at only the rows for the French language in Montréal. To do this, we need to filter the data set @@ -921,6 +952,12 @@ tidy_lang[ ### Extracting rows satisfying at least one condition using `|` +```{index} logical operator; or (|) +``` + +```{index} see: |; logical operator +``` + Suppose we were interested in only those rows corresponding to cities in Alberta in the `official_langs` data set (Edmonton and Calgary). We can't use `&` as we did above because `region` @@ -940,6 +977,12 @@ official_langs[ ### Extracting rows with values in a list using `isin` +```{index} logical operator; containment (isin) +``` + +```{index} see: isin; logical operator +``` + Next, suppose we want to see the populations of our five cities. Let's read in the `region_data.csv` file that comes from the 2016 Canadian census, @@ -987,6 +1030,21 @@ pd.Series(["Vancouver", "Toronto"]).isin(pd.Series(["Toronto", "Vancouver"])) ### Extracting rows above or below a threshold using `>` and `<` +```{index} logical operator; greater than (> and >=), logical operator; less than (< and <=) +``` + +```{index} see: >; logical operator +``` + +```{index} see: >=; logical operator +``` + +```{index} see: <; logical operator +``` + +```{index} see: <=; logical operator +``` + ```{code-cell} ipython3 :tags: [remove-cell] @@ -1017,6 +1075,9 @@ than French in Montréal according to the 2016 Canadian census. ### Extracting rows using `query` +```{index} logical statement; query +``` + You can also extract rows above, below, equal or not-equal to a threshold using the `query` method. For example the following gives us the same result as when we used `official_langs[official_langs["most_at_home"] > 2669195]`. @@ -1032,7 +1093,7 @@ to make long chains of filtering operations a bit easier to read. (loc-iloc)= ## Using `loc[]` to filter rows and select columns -```{index} pandas.DataFrame; loc[] +```{index} DataFrame; loc[] ``` The `[]` operation is only used when you want to either filter rows **or** select columns; @@ -1111,7 +1172,7 @@ corresponding to the column names that start with the desired characters. 
tidy_lang.loc[:, tidy_lang.columns.str.startswith("most")] ``` -```{index} pandas.Series; str.contains +```{index} Series; str.contains ``` We could also have chosen the columns containing an underscore `_` by using the @@ -1123,7 +1184,7 @@ tidy_lang.loc[:, tidy_lang.columns.str.contains("_")] ``` ## Using `iloc[]` to extract rows and columns by position -```{index} pandas.DataFrame; iloc[], column range +```{index} DataFrame; iloc[], column range ``` Another approach for selecting rows and columns is to use `iloc[]`, which provides the ability to index with the position rather than the label of the columns. @@ -1158,7 +1219,7 @@ accidentally put in the wrong integer index! If you did not correctly remember that the `language` column was index `1`, and used `2` instead, your code might end up having a bug that is quite hard to track down. -```{index} pandas.Series; str.startswith +```{index} Series; str.startswith ``` +++ {"tags": []} @@ -1203,6 +1264,9 @@ region_lang = pd.read_csv("data/region_lang.csv") region_lang ``` +```{index} Series; min, Series; max +``` + We use `.min` to calculate the minimum and `.max` to calculate maximum number of Canadians reporting a particular language as their primary language at home, @@ -1230,6 +1294,9 @@ total number of people in the survey, we could use the `sum` summary statistic m region_lang["most_at_home"].sum() ``` +```{index} Series; sum, Series; mean, Series; median, Series; std, summary statistic +``` + Other handy summary statistics include the `mean`, `median` and `std` for computing the mean, median, and standard deviation of observations, respectively. We can also compute multiple statistics at once using `agg` to "aggregate" results. @@ -1273,6 +1340,12 @@ summary statistics that you can compute with `pandas`. +++ +++ +```{index} see: NaN; missing data +``` + +```{index} missing data +``` + ```{note} In `pandas`, the value `NaN` is often used to denote missing data. @@ -1329,7 +1402,7 @@ region_lang.loc[:, "mother_tongue":"lang_known"].agg(["mean", "std"]) +++ -```{index} pandas.DataFrame; groupby +```{index} DataFrame; groupby ``` What happens if we want to know how languages vary by region? In this case, we need a new tool that lets us group rows by region. This can be achieved @@ -1434,6 +1507,9 @@ region_lang.groupby("region")[["most_at_home", "most_at_work", "lang_known"]].ma To see how many observations there are in each group, we can use `value_counts`. +```{index} DataFrame; value_counts +``` + ```{code-cell} ipython3 :tags: ["output_scroll"] region_lang.value_counts("region") @@ -1476,11 +1552,14 @@ we can see that this would be the columns from `mother_tongue` to `lang_known`. region_lang ``` -```{index} pandas.DataFrame; apply, pandas.DataFrame; loc[] +```{index} DataFrame; apply, DataFrame; loc[] ``` We can simply call the `.astype` function to apply it across the desired range of columns. 
+```{index} DataFrame; astype, Series; astype +``` + ```{code-cell} ipython3 region_lang_nums = region_lang.loc[:, "mother_tongue":"lang_known"].astype("int32") region_lang_nums.info() @@ -1530,7 +1609,7 @@ you can use the more general [`apply`](https://pandas.pydata.org/docs/reference/ ## Modifying and adding columns -```{index} pandas.DataFrame; [] +```{index} DataFrame; [], column assignment, assign ``` When we compute summary statistics or apply functions, @@ -1666,6 +1745,10 @@ See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stab :tags: [remove-input] english_lang ``` + +```{index} SettingWithCopyWarning +``` + Wait a moment...what is that warning message? It seems to suggest that something went wrong, but if we inspect the `english_lang` data frame above, it looks like the city populations were added just fine! As it turns out, this is caused by the earlier filtering we did from `region_lang` to @@ -1680,6 +1763,9 @@ For the rest of the book, we will silence that warning to help with readability. pd.options.mode.chained_assignment = None ``` +```{index} DataFrame; merge +``` + ```{note} Inserting the data column `[4098927, 5928040, ...]` manually as we did above is generally very error-prone and is not recommended. We do it here to demonstrate another usage of `assign` and regular column assignment. @@ -1714,6 +1800,9 @@ english_lang ## Using `merge` to combine data frames +```{index} DataFrame; merge +``` + Let's return to the situation right before we added the city populations of Toronto, Montréal, Vancouver, Calgary, and Edmonton to the `english_lang` data frame. Before adding the new column, we had filtered `region_lang` to create the `english_lang` data frame containing only English speakers in the five cities