diff --git a/source/classification1.md b/source/classification1.md
index 71035fa8..ec1eaf27 100755
--- a/source/classification1.md
+++ b/source/classification1.md
@@ -955,7 +955,7 @@ In order to fit the model on the breast cancer data, we need to call `fit` on
 the model object. The `X` argument is used to specify the data for the predictor
 variables, while the `y` argument is used to specify the data for the response
 variable. So below, we set `X=cancer_train[["Perimeter", "Concavity"]]` and
-`y=cancer_train['Class']` to specify that `Class` is the target
+`y=cancer_train['Class']` to specify that `Class` is the response
 variable (the one we want to predict), and both `Perimeter` and `Concavity` are
 to be used as the predictors. Note that the `fit` function might look like it does
 not do much from the outside, but it is actually doing all the heavy lifting to train
diff --git a/source/classification2.md b/source/classification2.md
index b2a0dda6..79a893e6 100755
--- a/source/classification2.md
+++ b/source/classification2.md
@@ -373,7 +373,7 @@ that the accuracy estimates from the test data are reasonable. First, setting
 `shuffle=True` (which is the default) means the data will be shuffled before
 splitting, which ensures that any ordering present in the data does not
 influence the data that ends up in the training and testing sets.
-Second, by specifying the `stratify` parameter to be the target column of the training set,
+Second, by specifying the `stratify` parameter to be the response variable in the training set,
 it **stratifies** the data by the class label, to ensure that roughly
 the same proportion of each class ends up in both the training and testing
 sets. For example, in our data set, roughly 63% of the
diff --git a/source/regression1.md b/source/regression1.md
index 004f8324..ecd884b1 100755
--- a/source/regression1.md
+++ b/source/regression1.md
@@ -170,7 +170,7 @@ The scientific question guides our initial exploration: the columns in the data
 that we are interested in are `sqft` (house size, in livable square feet) and
 `price` (house sale price, in US dollars (USD)). The first step is to visualize
 the data as a scatter plot where we place the predictor variable
-(house size) on the x-axis, and we place the target/response variable that we
+(house size) on the x-axis, and we place the response variable that we
 want to predict (sale price) on the y-axis.
 
 > **Note:** Given that the y-axis unit is dollars in {numref}`fig:07-edaRegr`,
@@ -922,7 +922,7 @@ As the algorithm is the same, we will not cover it again in this chapter.
 We will now demonstrate a multivariable KNN regression analysis of the
 Sacramento real estate data using `scikit-learn`. This time we will use house size
 (measured in square feet) as well as number of bedrooms as our
-predictors, and continue to use house sale price as our outcome/target variable
+predictors, and continue to use house sale price as our response variable
 that we are trying to predict. It is always a good practice to do exploratory
 data analysis, such as visualizing the data, before we start modeling the data.
 {numref}`fig:07-bedscatter`
diff --git a/source/regression2.md b/source/regression2.md
index 73a7296d..ccd1557a 100755
--- a/source/regression2.md
+++ b/source/regression2.md
@@ -464,7 +464,7 @@ glue("sacr_RMSPE", "{0:,.0f}".format(RMSPE))
 ```
 
 Our final model's test error as assessed by RMSPE is {glue:text}`sacr_RMSPE`.
-Remember that this is in units of the target/response variable, and here that
+Remember that this is in units of the response variable, and here that
 is US Dollars (USD). Does this mean our model is "good" at predicting house
 sale price based off of the predictor of home size? Again, answering this is
 tricky and requires knowledge of how you intend to use the prediction.
@@ -645,7 +645,7 @@ flexible and can be quite wiggly. But there is a major interpretability advantag
 model to a straight line.
 A straight line can be defined by two numbers, the
 vertical intercept and the slope. The intercept tells us what the prediction is when
-all of the predictors are equal to 0; and the slope tells us what unit increase in the target/response
+all of the predictors are equal to 0; and the slope tells us what unit increase in the response
 variable we predict given a unit increase in the predictor variable.
 KNN regression, as simple as it is to implement and understand, has no such
 interpretability from its wiggly line.
@@ -654,7 +654,7 @@ interpretability from its wiggly line.
 ```
 
 There can, however, also be a disadvantage to using a simple linear regression
-model in some cases, particularly when the relationship between the target and
+model in some cases, particularly when the relationship between the response variable and
 the predictor is not linear, but instead some other shape (e.g., curved or
 oscillating). In these cases the prediction model from a simple linear regression
 will underfit (have high bias), meaning that model/predicted values do not
@@ -1324,7 +1324,7 @@ predictive performance.
 
 So far in this textbook we have used regression only in the context of
 prediction. However, regression can also be seen as a method to understand and
-quantify the effects of individual variables on a response / outcome of interest.
+quantify the effects of individual variables on a response variable of interest.
 In the housing example from this chapter, beyond just using past data to
 predict future sale prices, we might also be interested in describing the
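For readers checking the classification1.md hunk against the code it describes: the prose follows scikit-learn's standard `fit(X, y)` convention, where `X` holds the predictor columns and `y` the response variable. A minimal runnable sketch, assuming a `KNeighborsClassifier` (the model used in that chapter) and an invented toy stand-in for the book's `cancer_train` data frame:

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Toy stand-in for the book's cancer_train data frame (values are invented).
cancer_train = pd.DataFrame({
    "Perimeter": [87.5, 120.2, 75.0, 140.1],
    "Concavity": [0.05, 0.24, 0.03, 0.31],
    "Class": ["Benign", "Malignant", "Benign", "Malignant"],
})

knn = KNeighborsClassifier(n_neighbors=3)
# X = the predictor columns; y = the response variable (the one we predict).
knn.fit(X=cancer_train[["Perimeter", "Concavity"]], y=cancer_train["Class"])

# Predict the class of a new observation with the same predictor columns.
new_obs = pd.DataFrame({"Perimeter": [100.0], "Concavity": [0.10]})
print(knn.predict(new_obs))  # -> ['Benign'] for this toy data
```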
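Similarly, the classification2.md hunk concerns `train_test_split`'s `shuffle` and `stratify` parameters. A sketch of the behavior being described, using a made-up data frame with a 63%/37% class split to echo the proportion mentioned in the prose:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data frame with a 63%/37% Benign/Malignant split.
cancer = pd.DataFrame({
    "Perimeter": range(100),
    "Class": ["Benign"] * 63 + ["Malignant"] * 37,
})

# shuffle=True (the default) randomizes row order before splitting;
# stratify=... keeps the class proportions roughly equal across the
# training and testing sets.
cancer_train, cancer_test = train_test_split(
    cancer, train_size=0.75, shuffle=True,
    stratify=cancer["Class"], random_state=1,
)
print(cancer_train["Class"].value_counts(normalize=True))
print(cancer_test["Class"].value_counts(normalize=True))
```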
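The regression1.md and regression2.md hunks mention multivariable KNN regression and reporting RMSPE in units of the response variable. A sketch under assumed toy data (the book's analysis uses the real Sacramento housing set; the `sqft`, `beds`, and `price` values below are invented) showing a two-predictor fit and why RMSPE comes out in USD:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.neighbors import KNeighborsRegressor

# Invented stand-in for the Sacramento housing data.
sacramento = pd.DataFrame({
    "sqft": [1100, 1500, 2100, 2500, 900, 1800],
    "beds": [2, 3, 4, 4, 2, 3],
    "price": [150_000, 220_000, 310_000, 350_000, 120_000, 260_000],
})

X = sacramento[["sqft", "beds"]]  # two predictors
y = sacramento["price"]           # response variable

knn = KNeighborsRegressor(n_neighbors=2)
knn.fit(X, y)

# RMSPE = sqrt(mean squared prediction error). Because the errors are in
# USD, the RMSPE is too. (The book computes it on held-out test data; we
# reuse the toy rows here only to show the calculation.)
predictions = knn.predict(X)
rmspe = np.sqrt(mean_squared_error(y, predictions))
print(f"RMSPE: {rmspe:,.0f} USD")
```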
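Finally, the regression2.md hunk about intercept and slope interpretation maps onto the fitted attributes of scikit-learn's `LinearRegression`. A small sketch with invented house size/price pairs:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Invented house size/price pairs.
housing = pd.DataFrame({
    "sqft": [1000, 1500, 2000, 2500],
    "price": [160_000, 215_000, 280_000, 330_000],
})

lm = LinearRegression()
lm.fit(housing[["sqft"]], housing["price"])

# intercept_ is the predicted price when every predictor equals 0;
# coef_[0] is the predicted change in the response variable (USD) per
# one-unit (one square foot) increase in the predictor.
print(f"intercept: {lm.intercept_:,.0f} USD")
print(f"slope: {lm.coef_[0]:,.2f} USD per square foot")
```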