@@ -971,9 +971,14 @@ and calculate the means for
each of those replicates. Recall that this assumes that `one_sample` *looks like*
our original population; but since we do not have access to the population itself,
this is often the best we can do.
+ Note that here we break the list comprehension over multiple lines
+ so that it is easier to read.

``` {code-cell} ipython3
- boot20000 = pd.concat([one_sample.sample(frac=1, replace=True).assign(replicate=n) for n in range(20_000)])
+ boot20000 = pd.concat([
+     one_sample.sample(frac=1, replace=True).assign(replicate=n)
+     for n in range(20_000)
+ ])
boot20000
```

@@ -1005,26 +1010,44 @@ Histograms of first six replicates of bootstrap samples.

+++

- We see in {numref}`fig:11-bootstrapping-six-bootstrap-samples` how the
- bootstrap samples differ. We can also calculate the sample mean for each of
- these six replicates.
+ We see in {numref}`fig:11-bootstrapping-six-bootstrap-samples` how the distributions of the
+ bootstrap samples differ. If we calculate the sample mean for each of
+ these six samples, we can see that the means also differ between samples.
+ To compute the mean for each sample,
+ we first group by the `replicate` column, which contains the sample/replicate number.
+ Then we compute the mean of the `price` column and rename it to `sample_mean`
+ so that it is more descriptive.
+ Finally, we use `reset_index` to get the `replicate` values back as a column in the dataframe.

``` {code-cell} ipython3
- six_bootstrap_samples.groupby("replicate")["price"].mean().reset_index().rename(
-     columns={"price": "mean"}
+ (
+     six_bootstrap_samples
+     .groupby("replicate")
+     ["price"]
+     .mean()
+     .rename("sample_mean")
+     .reset_index()
)
```

- We can see that the bootstrap sample distributions and the sample means are
- different. They are different because we are sampling *with replacement*. We
- will now calculate point estimates for our 20,000 bootstrap samples and
- generate a bootstrap distribution of our point estimates. The bootstrap
+ The distributions and the means differ between the bootstrapped samples
+ because we are sampling *with replacement*.
+ If we had instead sampled *without replacement* while keeping the same sample size,
+ we would end up with the exact same values in the sample each time.
+
+ We will now calculate point estimates of the mean for our 20,000 bootstrap samples and
+ generate a bootstrap distribution of these point estimates. The bootstrap
distribution ({numref}`fig:11-bootstrapping5`) suggests how we might expect
- our point estimate to behave if we took another sample.
+ our point estimate to behave if we take multiple samples.

``` {code-cell} ipython3
- boot20000_means = boot20000.groupby("replicate")["price"].mean().reset_index().rename(
-     columns={"price": "sample_mean"}
+ boot20000_means = (
+     boot20000
+     .groupby("replicate")
+     ["price"]
+     .mean()
+     .rename("sample_mean")
+     .reset_index()
)

boot20000_means
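For trying the two changed cells in isolation, here is a minimal sketch. It assumes a made-up five-row `one_sample` (the real data is not part of this diff) and uses only six replicates instead of 20,000; `random_state` is added purely to make the draws repeatable. Since `.mean()` on a single selected column returns a Series, the sketch passes the new name directly to `rename`. That is one way to end up with a `sample_mean` column, not necessarily the form the book settles on.

```python
import pandas as pd

# Hypothetical stand-in for `one_sample`; the prices are made up.
one_sample = pd.DataFrame({"price": [100.0, 150.0, 200.0, 250.0, 300.0]})

# Draw six bootstrap replicates (the text uses 20_000; six keeps this quick).
# random_state is an addition for repeatability and is not in the diff.
boot = pd.concat([
    one_sample.sample(frac=1, replace=True, random_state=n).assign(replicate=n)
    for n in range(6)
])

# One mean per replicate: group, select the column, average,
# name the resulting Series, then turn the group labels back into a column.
boot_means = (
    boot
    .groupby("replicate")
    ["price"]
    .mean()
    .rename("sample_mean")
    .reset_index()
)
print(boot_means)

# Sampling without replacement at frac=1 only reorders the rows,
# so the set of values (and hence the mean) is the same every time.
no_replace = one_sample.sample(frac=1, replace=False, random_state=0)
print(sorted(no_replace["price"]) == sorted(one_sample["price"]))
```

With `replace=False` and `frac=1`, every draw is just a reshuffle of the same rows, which is why the final check holds and why the bootstrap relies on sampling with replacement.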