29 | 29 | "\n",
30 | 30 | "### Sections\n",
31 | 31 | "1. [🚀 Project](#project)\n",
32 |    | - "2. [Import the Data](#data)\n",
33 |    | - "3. [Data Cleaning](#clean)\n",
34 |    | - "4. [Exploratory Data Analysis](#eda)\n"
   | 32 | + "2. [Step 1: Import the Data](#data)\n",
   | 33 | + "3. [Step 2: Data Cleaning](#clean)\n",
   | 34 | + "4. [Step 3: Data Analysis](#eda)\n"
35 | 35 | ]
36 | 36 | },
37 | 37 | {

40 | 40 | "source": [
41 | 41 | "<a id='project'></a>\n",
42 | 42 | "\n",
43 |    | - "\n",
44 | 43 | "# 🚀 Project\n",
45 | 44 | "\n",
46 | 45 | "### Data: California Health Interview Survey\n",

55 | 54 | "- `household_tenure`: Self-Reported household tenure\n",
56 | 55 | "- `interview_language`: Language of interview\n",
57 | 56 | "\n",
58 |    | - "We will bring together the basic programming, loading data, and statistical analysis/visualization techniques from this workshop to analyze this data. \n",
   | 57 | + "For this 🚀 Project, the goal we want to accomplish is **visualizing the relationship between poverty level and general health**. We will bring together basic programming and data science techniques you have learned to do this.\n",
59 | 58 | "\n",
60 |    | - "First, let's import the packages to use in this analysis:"
61 |    | - ]
62 |    | - },
63 |    | - {
64 |    | - "cell_type": "code",
65 |    | - "execution_count": null,
66 |    | - "metadata": {},
67 |    | - "outputs": [],
68 |    | - "source": [
69 |    | - "import numpy as np\n",
70 |    | - "import pandas as pd\n",
71 |    | - "import os"
   | 59 | + "🔔 **Question**: Are there other research questions you could imagine asking with this dataset?"
72 | 60 | ]
73 | 61 | },
74 | 62 | {

79 | 67 | "source": [
80 | 68 | "<a id='data'></a>\n",
81 | 69 | "\n",
82 |    | - "# Import the Data \n",
   | 70 | + "# Step 1: Importing Data \n",
83 | 71 | "\n",
84 | 72 | "Before we import our data, a few words on **filepaths**. \n",
85 | 73 | "\n",

139 | 127 | "tags": []
140 | 128 | },
141 | 129 | "source": [
142 |     | - "## 🥊 Challenge 1: Find the Data\n",
    | 130 | + "## Locating the Data\n",
143 | 131 | "\n",
144 | 132 | "Try to locate the files in the \"chis_data\" folder, which is in the \"data\" folder, which is in the main \"Python-Fundamentals\" folder. Using `pd.read_csv()`, read in all three data frames and assign them to the three variables defined below.\n",
145 | 133 | "\n",

156 | 144 | "metadata": {},
157 | 145 | "outputs": [],
158 | 146 | "source": [
    | 147 | + "import pandas as pd \n",
    | 148 | + "\n",
159 | 149 | "# YOUR CODE HERE\n",
160 | 150 | "df_eng = ...\n",
161 | 151 | "df_esp = ...\n",
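
A minimal sketch of what the filled-in cell might look like. The actual CSV filenames inside "chis_data" are not shown in this hunk, so the paths below are placeholders; substitute the real filenames and adjust the relative path to wherever the notebook sits.

```python
import pandas as pd

# Placeholder paths -- replace with the real files in data/chis_data/
df_eng = pd.read_csv("../data/chis_data/chis_eng.csv")  # English-language interviews
df_esp = pd.read_csv("../data/chis_data/chis_esp.csv")  # Spanish-language interviews

# Quick sanity check: the frames should share columns but differ in row count
print(df_eng.shape, df_esp.shape)
```
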
168 | 158 | "tags": []
169 | 159 | },
170 | 160 | "source": [
171 |     | - "## 🥊 Challenge 2: Concatenate\n",
    | 161 | + "## Concatenating DataFrames\n",
172 | 162 | "\n",
173 | 163 | "Look up the [documentation for Pandas](https://pandas.pydata.org/pandas-docs/stable/reference/general_functions.html), and see if you can find a function that **concatenates** the three DataFrames we have now. Save the concatenated list in a new variable called `df`."
174 | 164 | ]
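
The function being hinted at is `pandas.concat`. A sketch, assuming the DataFrames read in above; the name of the third variable is not visible in this hunk, so `df_other` stands in for it.

```python
# Stack the three surveys row-wise; ignore_index re-numbers the rows 0..n-1
df = pd.concat([df_eng, df_esp, df_other], ignore_index=True)

# The combined frame should have as many rows as the three parts together
print(len(df) == len(df_eng) + len(df_esp) + len(df_other))
```
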
208 | 198 | "source": [
209 | 199 | "<a id='clean'></a>\n",
210 | 200 | "\n",
211 |     | - "# Data Cleaning"
    | 201 | + "# Step 2: Data Cleaning"
212 | 202 | ]
213 | 203 | },
214 | 204 | {
215 | 205 | "cell_type": "markdown",
216 | 206 | "metadata": {},
217 | 207 | "source": [
218 |     | - "## 🥊 Challenge 3: Data Cleaning \n",
219 |     | - "\n",
220 |     | - "Often, we will want to remove some missing values in a data frame. Have a look at the `general_health` column and find the missing values using the `.isna()` method. Then, use `.sum()` to sum the amount of undefined (NaN) values."
    | 208 | + "Often, we will want to remove some missing values in a DataFrame. Have a look at the `general_health` column and find the missing values using the `.isna()` method. Then, use `.sum()` to sum the amount of undefined (NaN) values."
221 | 209 | ]
222 | 210 | },
223 | 211 | {
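
A sketch of the counting described above, plus the removal it motivates, assuming the concatenated `df`:

```python
# Count missing values in the general_health column
print(df['general_health'].isna().sum())

# Drop the rows where general_health is missing
df = df.dropna(subset=['general_health'])
```
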
253 | 241 | "source": [
254 | 242 | "<a id='eda'></a>\n",
255 | 243 | "\n",
256 |     | - "# Exploratory Data Analysis\n",
    | 244 | + "# Step 3: Data Analysis\n",
257 | 245 | "\n",
    | 246 | + "Now that we have preprocessed data, we want to analyze it. Recall that our goal is to visualize a relationship between poverty level and general health. Before we do this, we should get a better grasp of what is in our data.\n",
258 | 247 | "\n",
259 |     | - "## 🥊 Challenge 4: Count Values\n",
260 |     | - "The first thing we will want to do is count values of some features. \n",
    | 248 | + "## Counting Values\n",
    | 249 | + "The first thing we will want to do is count values of poverty levels: we want to see how many levels there are, and how the data are distributed. \n",
261 | 250 | "1. Run `value_counts()` on the `poverty_level` column. \n",
262 | 251 | "2. <span style=\"color:purple\"> Look through the [documentation](https://pandas.pydata.org/docs/reference/api/pandas.Series.value_counts.html) and **normalize** the output of `value_counts()`.</span>"
263 | 252 | ]
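
A sketch of both numbered steps, assuming the cleaned `df` from Step 2:

```python
# Step 1: raw counts of each poverty level
print(df['poverty_level'].value_counts())

# Step 2: normalized counts -- proportions that sum to 1
print(df['poverty_level'].value_counts(normalize=True))
```
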
277 | 266 | "tags": []
278 | 267 | },
279 | 268 | "source": [
280 |     | - "## 🥊 Challenge 5: Create a Function\n",
    | 269 | + "## Creating a Function\n",
    | 270 | + "\n",
    | 271 | + "It turns out that poverty is expressed \"as Times of 100% Federal Poverty Line (FPL)\". One approach to this could be to see if we can find differences in general health for people **below and above the poverty line**. \n",
281 | 272 | "\n",
282 |     | - "Let's see if we can find differences in general health for people **below and above the poverty line**. First, let's create a function that can check whether this is the case.\n",
    | 273 | + "To do this, we can create a function that takes in values of the `poverty_level` column and outputs whether that value is above or below the poverty line.\n",
283 | 274 | "\n",
284 | 275 | "1. Create a new function called `assign_level`. It takes one parameter, which we'll call `i`.\n",
285 | 276 | "2. If `i` is `0-99% FPL`, return 0. In all other cases, return 1."
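
A sketch of the function those two steps describe:

```python
def assign_level(i):
    """Return 0 for values below the poverty line ('0-99% FPL'), 1 otherwise."""
    if i == '0-99% FPL':
        return 0
    return 1
```
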
298 | 289 | "cell_type": "markdown",
299 | 290 | "metadata": {},
300 | 291 | "source": [
301 |     | - "## 🥊 Challenge 6: Apply a Function\n",
    | 292 | + "## Applying a Function\n",
302 | 293 | "\n",
303 |     | - "Using the function we created and `apply()` method, we want to create a new column in our DataFrame. \n",
    | 294 | + "Recall that we can use the `apply()` method in Pandas to apply our new function to the `poverty_level` column of our DataFrame. We also want to save the output of this `apply()` method to a new column in our DataFrame. \n",
304 | 295 | "\n",
305 | 296 | "1. Use the `apply()` method on the `poverty_level` column. Pass your `assign_level` function as the argument.\n",
306 | 297 | "3. Save the result of this operation in a new column in your `df`, called `above_poverty_line`."
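
A sketch of those two steps, assuming `df` and `assign_level` from above:

```python
# Apply assign_level to every value of poverty_level and keep the result
df['above_poverty_line'] = df['poverty_level'].apply(assign_level)

# The new column holds 0 (below the line) or 1 (at or above it)
print(df['above_poverty_line'].value_counts())
```
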
319 | 310 | "cell_type": "markdown",
320 | 311 | "metadata": {},
321 | 312 | "source": [
322 |     | - "## 🥊 Challenge 7: Subset DataFrames\n",
    | 313 | + "## Subsetting a DataFrame\n",
323 | 314 | "\n",
324 |     | - "We want to create two bar plots of general health – for people above and below the poverty line. \n",
    | 315 | + "In order to create two bar plots of general health – for people above and below the poverty line – we can create two DataFrames for these groups. We can then plot the values in these DataFrames \"on top of\" one another in a barplot.\n",
    | 316 | + "\n",
    | 317 | + "Recall that we can subset DataFrames with Boolean masks. For instance, say we have a DataFrame `counts` with a column `A`. If we want to create a new DataFrame called `above_800`, which only contains the values over 800 in column `A` of `counts`, we would write:\n",
    | 318 | + "\n",
    | 319 | + "```\n",
    | 320 | + "above_800 = counts[counts['A'] > 800]\n",
    | 321 | + "```\n",
    | 322 | + "\n",
    | 323 | + "Let's perform the same operation on our data.\n",
325 | 324 | "\n",
326 | 325 | "1. Create a new DataFrame, `df_below`. It will be a subset of our `df`, based on the condition that the value in `above_poverty_line` is 0.\n",
327 | 326 | "2. Create a new DataFrame, `df_above`. It will be a subset of our `df`, based on the condition that the value in `above_poverty_line` is 1."
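
A sketch of the two subsets, mirroring the `above_800` example:

```python
# Boolean masks on the column created in the previous step
df_below = df[df['above_poverty_line'] == 0]  # below the poverty line
df_above = df[df['above_poverty_line'] == 1]  # at or above the poverty line

print(len(df_below), len(df_above))
```
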
342 | 341 | "cell_type": "markdown",
343 | 342 | "metadata": {},
344 | 343 | "source": [
345 |     | - "## 🥊 Challenge 8: Bar Plot of Value Counts\n",
    | 344 | + "## Creating the Visualization\n",
    | 345 | + "\n",
    | 346 | + "Finally, let's create our bar plots. We will create 2 plots in 1 cell, which will be plotted on top of one another. \n",
346 | 347 | "\n",
347 |     | - "Finally, let's create our bar plots. Fill in the blanks below, following the steps. \n",
    | 348 | + "Fill in the blanks below, following the steps. \n",
348 | 349 | "\n",
349 |     | - "1. Run a **normalized** `value_counts()` on the `general_health` column of `df_above`.\n",
    | 350 | + "1. Run a **normalized** `value_counts()` on the `general_health` column of `df_above` and `df_below`.\n",
350 | 351 | "2. Run `plot()` on the output of the resulting DataFrame. Enter the values for two arguments: `kind` must be set to `bar`, and `alpha` must be set to `.5`.\n"
351 | 352 | ]
352 | 353 | },

360 | 361 | "source": [
361 | 362 | "# YOUR CODE HERE\n",
362 | 363 | "df_above[...].value_counts(...).plot(kind=..., alpha=...);\n",
363 |     | - "df_below['general_health'].value_counts(normalize=True).plot(kind=..., alpha=...,color='maroon');"
    | 364 | + "df_below[...].value_counts(...).plot(kind=..., alpha=...,color='maroon');"
364 | 365 | ]
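
One way the blanks might be filled in, assuming matplotlib is installed (pandas' `.plot()` draws with it). Both calls run in one cell, so the bars land on the same axes and overlap:

```python
# Normalized bar plots of general health, overlaid for the two groups
df_above['general_health'].value_counts(normalize=True).plot(kind='bar', alpha=0.5)
df_below['general_health'].value_counts(normalize=True).plot(kind='bar', alpha=0.5, color='maroon');
```
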
365 | 366 | },
366 | 367 | {
|
0 commit comments