Commit f8de446

Reformat and improve the explanation of the bootstrapped mean calculation
1 parent bb548d1 commit f8de446

1 file changed, +36 -13 lines changed

source/inference.md

@@ -971,9 +971,14 @@ and calculate the means for
971971
each of those replicates. Recall that this assumes that `one_sample` *looks like*
972972
our original population; but since we do not have access to the population itself,
973973
this is often the best we can do.
974+
Note that here we break the list comprehension over multiple lines
975+
so that it is easier to read.
974976

975977
```{code-cell} ipython3
976-
boot20000 = pd.concat([one_sample.sample(frac=1, replace=True).assign(replicate=n) for n in range(20_000)])
978+
boot20000 = pd.concat([
979+
one_sample.sample(frac=1, replace=True).assign(replicate=n)
980+
for n in range(20_000)
981+
])
977982
boot20000
978983
```
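To make the resampling step concrete, here is a minimal sketch of the same `pd.concat` pattern on made-up data. The `toy_sample` data frame and its prices are hypothetical stand-ins for `one_sample`, and `random_state` is passed only so the sketch is reproducible; the original code does not set it.

```python
import pandas as pd

# Hypothetical stand-in for `one_sample`: five made-up prices.
toy_sample = pd.DataFrame({"price": [100.0, 150.0, 200.0, 250.0, 300.0]})

# Draw three bootstrap replicates, each as large as the original sample,
# and label every row with the replicate it belongs to.
toy_boot = pd.concat([
    toy_sample.sample(frac=1, replace=True, random_state=n).assign(replicate=n)
    for n in range(3)
])

# Count the rows in each replicate.
replicate_sizes = toy_boot.groupby("replicate").size()
print(replicate_sizes)
```

Because `frac=1`, every replicate has as many rows as `toy_sample`, and every resampled price is one of the original prices.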
979984

@@ -1005,26 +1010,44 @@ Histograms of first six replicates of bootstrap samples.
10051010

10061011
+++
10071012

1008-
We see in {numref}`fig:11-bootstrapping-six-bootstrap-samples` how the
1009-
bootstrap samples differ. We can also calculate the sample mean for each of
1010-
these six replicates.
1013+
We see in {numref}`fig:11-bootstrapping-six-bootstrap-samples` how the distributions of the
1014+
bootstrap samples differ. If we calculate the sample mean for each of
1015+
these six samples, we can see that they also differ between samples.
1016+
To compute the mean for each sample,
1017+
we first group by `replicate`, the column containing the sample/replicate number.
1018+
Then we compute the mean of the `price` column and rename it to `sample_mean`
1019+
so that the name is more descriptive.
1020+
Finally, we use `reset_index` to get the `replicate` values back as a column in the dataframe.
10111021

10121022
```{code-cell} ipython3
1013-
six_bootstrap_samples.groupby("replicate")["price"].mean().reset_index().rename(
1014-
columns={"price": "mean"}
1023+
(
1024+
six_bootstrap_samples
1025+
.groupby("replicate")
1026+
["price"]
1027+
.mean()
1028+
.rename("sample_mean")
1029+
.reset_index()
10151030
)
10161031
```
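The group, mean, rename, reset pattern can be tried on a small made-up example. This is only a sketch: `toy_replicates` and its values are hypothetical. After selecting `["price"]`, the result of `.mean()` is a pandas Series rather than a data frame, so the sketch renames it by passing the new name directly to `rename`.

```python
import pandas as pd

# Made-up data: two replicates with three prices each.
toy_replicates = pd.DataFrame({
    "replicate": [0, 0, 0, 1, 1, 1],
    "price": [100.0, 200.0, 300.0, 100.0, 100.0, 250.0],
})

toy_means = (
    toy_replicates
    .groupby("replicate")
    ["price"]
    .mean()                  # Series of means, indexed by replicate
    .rename("sample_mean")   # give the Series a more descriptive name
    .reset_index()           # turn the replicate index back into a column
)
print(toy_means)
```

The result is a data frame with one row per replicate and two columns, `replicate` and `sample_mean`.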
10171032

1018-
We can see that the bootstrap sample distributions and the sample means are
1019-
different. They are different because we are sampling *with replacement*. We
1020-
will now calculate point estimates for our 20,000 bootstrap samples and
1021-
generate a bootstrap distribution of our point estimates. The bootstrap
1033+
The distributions and the means differ between the bootstrapped samples
1034+
because we are sampling *with replacement*.
1035+
If we had instead sampled *without replacement*,
1036+
we would end up with the exact same values in each sample, since each one is as large as `one_sample`.
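The difference between the two sampling modes can be checked with a short sketch. The `toy_sample` data frame is a hypothetical stand-in for `one_sample`, and `random_state` is used only to make the sketch reproducible.

```python
import pandas as pd

# Hypothetical stand-in for `one_sample`: five made-up prices.
toy_sample = pd.DataFrame({"price": [100.0, 150.0, 200.0, 250.0, 300.0]})

# Without replacement, a full-size resample is only a reshuffle of the rows,
# so it contains the same values and therefore has the same mean every time.
no_replace_means = [
    toy_sample.sample(frac=1, replace=False, random_state=n)["price"].mean()
    for n in range(5)
]

# With replacement, some rows can repeat and others can be left out,
# so the mean can change from one resample to the next.
with_replace_means = [
    toy_sample.sample(frac=1, replace=True, random_state=n)["price"].mean()
    for n in range(5)
]

print(no_replace_means)
print(with_replace_means)
```

Every entry of `no_replace_means` equals the mean of `toy_sample`, whereas the entries of `with_replace_means` are free to vary between the smallest and largest price.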
1037+
1038+
We will now calculate point estimates of the mean for our 20,000 bootstrap samples and
1039+
generate a bootstrap distribution of these point estimates. The bootstrap
10221040
distribution ({numref}`fig:11-bootstrapping5`) suggests how we might expect
1023-
our point estimate to behave if we took another sample.
1041+
our point estimate to behave if we take multiple samples.
10241042

10251043
```{code-cell} ipython3
1026-
boot20000_means = boot20000.groupby("replicate")["price"].mean().reset_index().rename(
1027-
columns={"price": "sample_mean"}
1044+
boot20000_means = (
1045+
boot20000
1046+
.groupby("replicate")
1047+
["price"]
1048+
.mean()
1049+
.rename("sample_mean")
1050+
.reset_index()
10281051
)
10291052
10301053
boot20000_means
