29 | 29 | "\n",
30 | 30 | "### Sections\n",
31 | 31 | "1. [🚀 Project](#project)\n",
32 |    | - "2. [Import the Data](#data)\n",
33 |    | - "3. [Data Cleaning](#clean)\n",
34 |    | - "4. [Exploratory Data Analysis](#eda)\n"
   | 32 | + "2. [Step 1: Import the Data](#data)\n",
   | 33 | + "3. [Step 2: Data Cleaning](#clean)\n",
   | 34 | + "4. [Step 3: Data Analysis](#eda)\n"
35 | 35 | ]
36 | 36 | },
37 | 37 | {

40 | 40 | "source": [
41 | 41 | "<a id='project'></a>\n",
42 | 42 | "\n",
43 |    | - "\n",
44 | 43 | "# 🚀 Project\n",
45 | 44 | "\n",
46 | 45 | "### Data: California Health Interview Survey\n",

55 | 54 | "- `household_tenure`: Self-Reported household tenure\n",
56 | 55 | "- `interview_language`: Language of interview\n",
57 | 56 | "\n",
58 |    | - "We will bring together the basic programming, loading data, and statistical analysis/visualization techniques from this workshop to analyze this data. \n",
   | 57 | + "For this 🚀 Project, the goal we want to accomplish is **visualizing the relationship between poverty level and general health**. We will bring together basic programming and data science techniques you have learned to do this.\n",
59 | 58 | "\n",
60 |    | - "First, let's import the packages to use in this analysis:"
61 |    | - ]
62 |    | - },
63 |    | - {
64 |    | - "cell_type": "code",
65 |    | - "execution_count": null,
66 |    | - "metadata": {},
67 |    | - "outputs": [],
68 |    | - "source": [
69 |    | - "import numpy as np\n",
70 |    | - "import pandas as pd\n",
71 |    | - "import os"
   | 59 | + "🔔 **Question**: Are there other research questions you could imagine asking with this dataset?"
72 | 60 | ]
73 | 61 | },
74 | 62 | {

79 | 67 | "source": [
80 | 68 | "<a id='data'></a>\n",
81 | 69 | "\n",
82 |    | - "# Import the Data \n",
   | 70 | + "# Step 1: Importing Data \n",
83 | 71 | "\n",
84 | 72 | "Before we import our data, a few words on **filepaths**. \n",
85 | 73 | "\n",

139 | 127 | "tags": []
140 | 128 | },
141 | 129 | "source": [
142 |     | - "## 🥊 Challenge 1: Find the Data\n",
    | 130 | + "## Locating the Data\n",
143 | 131 | "\n",
144 | 132 | "Try to locate the files in the \"chis_data\" folder, which is in the \"data\" folder, which is in the main \"Python-Fundamentals\" folder. Using `pd.read_csv()`, read in all three data frames and assign them to the three variables defined below.\n",
145 | 133 | "\n",

156 | 144 | "metadata": {},
157 | 145 | "outputs": [],
158 | 146 | "source": [
    | 147 | + "import pandas as pd \n",
    | 148 | + "\n",
159 | 149 | "# YOUR CODE HERE\n",
160 | 150 | "df_eng = ...\n",
161 | 151 | "df_esp = ...\n",
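
A minimal sketch of what the filled-in cell might look like. The actual CSV filenames inside "chis_data" are not shown in this hunk, so the paths below are placeholders; substitute the real filenames and adjust the relative path to wherever the notebook sits.

```python
import pandas as pd

# Placeholder paths -- replace with the real files in data/chis_data/
df_eng = pd.read_csv("../data/chis_data/chis_eng.csv")  # English-language interviews
df_esp = pd.read_csv("../data/chis_data/chis_esp.csv")  # Spanish-language interviews

# Quick sanity check: the frames should share columns but differ in row count
print(df_eng.shape, df_esp.shape)
```
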
168 | 158 | "tags": []
169 | 159 | },
170 | 160 | "source": [
171 |     | - "## 🥊 Challenge 2: Concatenate\n",
    | 161 | + "## Concatenating DataFrames\n",
172 | 162 | "\n",
173 | 163 | "Look up the [documentation for Pandas](https://pandas.pydata.org/pandas-docs/stable/reference/general_functions.html), and see if you can find a function that **concatenates** the three DataFrames we have now. Save the concatenated list in a new variable called `df`."
174 | 164 | ]
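
The function being hinted at is `pandas.concat`. A sketch, assuming the DataFrames read in above; the name of the third variable is not visible in this hunk, so `df_other` stands in for it.

```python
# Stack the three surveys row-wise; ignore_index re-numbers the rows 0..n-1
df = pd.concat([df_eng, df_esp, df_other], ignore_index=True)

# The combined frame should have as many rows as the three parts together
print(len(df) == len(df_eng) + len(df_esp) + len(df_other))
```
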
208 | 198 | "source": [
209 | 199 | "<a id='clean'></a>\n",
210 | 200 | "\n",
211 |     | - "# Data Cleaning"
    | 201 | + "# Step 2: Data Cleaning"
212 | 202 | ]
213 | 203 | },
214 | 204 | {
215 | 205 | "cell_type": "markdown",
216 | 206 | "metadata": {},
217 | 207 | "source": [
218 |     | - "## 🥊 Challenge 3: Data Cleaning \n",
219 |     | - "\n",
220 |     | - "Often, we will want to remove some missing values in a data frame. Have a look at the `general_health` column and find the missing values using the `.isna()` method. Then, use `.sum()` to sum the amount of undefined (NaN) values."
    | 208 | + "Often, we will want to remove some missing values in a DataFrame. Have a look at the `general_health` column and find the missing values using the `.isna()` method. Then, use `.sum()` to sum the amount of undefined (NaN) values."
221 | 209 | ]
222 | 210 | },
223 | 211 | {
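
A sketch of the counting described above, plus the removal it motivates, assuming the concatenated `df`:

```python
# Count missing values in the general_health column
print(df['general_health'].isna().sum())

# Drop the rows where general_health is missing
df = df.dropna(subset=['general_health'])
```
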
253 | 241 | "source": [
254 | 242 | "<a id='eda'></a>\n",
255 | 243 | "\n",
256 |     | - "# Exploratory Data Analysis\n",
    | 244 | + "# Step 3: Data Analysis\n",
257 | 245 | "\n",
    | 246 | + "Now that we have preprocessed data, we want to analyze it. Recall that our goal is to visualize a relationship between poverty level and general health. Before we do this, we should get a better grasp of what is in our data.\n",
258 | 247 | "\n",
259 |     | - "## 🥊 Challenge 4: Count Values\n",
260 |     | - "The first thing we will want to do is count values of some features. \n",
    | 248 | + "## Counting Values\n",
    | 249 | + "The first thing we will want to do is count values of poverty levels: we want to see how many levels there are, and how the data are distributed. \n",
261 | 250 | "1. Run `value_counts()` on the `poverty_level` column. \n",
262 | 251 | "2. <span style=\"color:purple\"> Look through the [documentation](https://pandas.pydata.org/docs/reference/api/pandas.Series.value_counts.html) and **normalize** the output of `value_counts()`.</span>"
263 | 252 | ]
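
A sketch of both numbered steps, assuming the cleaned `df` from Step 2:

```python
# Step 1: raw counts of each poverty level
print(df['poverty_level'].value_counts())

# Step 2: normalized counts -- proportions that sum to 1
print(df['poverty_level'].value_counts(normalize=True))
```
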
277 | 266 | "tags": []
278 | 267 | },
279 | 268 | "source": [
280 |     | - "## 🥊 Challenge 5: Create a Function\n",
    | 269 | + "## Creating a Function\n",
    | 270 | + "\n",
    | 271 | + "It turns out that poverty is expressed \"as Times of 100% Federal Poverty Line (FPL)\". One approach to this could be to see if we can find differences in general health for people **below and above the poverty line**. \n",
281 | 272 | "\n",
282 |     | - "Let's see if we can find differences in general health for people **below and above the poverty line**. First, let's create a function that can check whether this is the case.\n",
    | 273 | + "To do this, we can create a function that takes in values of the `poverty_level` column and outputs whether that value is above or below the poverty line.\n",
283 | 274 | "\n",
284 | 275 | "1. Create a new function called `assign_level`. It takes one parameter, which we'll call `i`.\n",
285 | 276 | "2. If `i` is `0-99% FPL`, return 0. In all other cases, return 1."
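
A sketch of the function those two steps describe:

```python
def assign_level(i):
    """Return 0 for values below the poverty line ('0-99% FPL'), 1 otherwise."""
    if i == '0-99% FPL':
        return 0
    return 1
```
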
298 | 289 | "cell_type": "markdown",
299 | 290 | "metadata": {},
300 | 291 | "source": [
301 |     | - "## 🥊 Challenge 6: Apply a Function\n",
    | 292 | + "## Applying a Function\n",
302 | 293 | "\n",
303 |     | - "Using the function we created and `apply()` method, we want to create a new column in our DataFrame. \n",
    | 294 | + "Recall that we can use the `apply()` method in Pandas to apply our new function to the `poverty_level` column of our DataFrame. We also want to save the output of this `apply()` method to a new column in our DataFrame. \n",
304 | 295 | "\n",
305 | 296 | "1. Use the `apply()` method on the `poverty_level` column. Pass your `assign_level` function as the argument.\n",
306 | 297 | "3. Save the result of this operation in a new column in your `df`, called `above_poverty_line`."
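
A sketch of those two steps, assuming `df` and `assign_level` from above:

```python
# Apply assign_level to every value of poverty_level and keep the result
df['above_poverty_line'] = df['poverty_level'].apply(assign_level)

# The new column holds 0 (below the line) or 1 (at or above it)
print(df['above_poverty_line'].value_counts())
```
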
319 | 310 | "cell_type": "markdown",
320 | 311 | "metadata": {},
321 | 312 | "source": [
322 |     | - "## 🥊 Challenge 7: Subset DataFrames\n",
    | 313 | + "## Subsetting a DataFrame\n",
323 | 314 | "\n",
324 |     | - "We want to create two bar plots of general health – for people above and below the poverty line. \n",
    | 315 | + "In order to create two bar plots of general health – for people above and below the poverty line – we can create two DataFrames for these groups. We can then plot the values in these DataFrames \"on top of\" one another in a barplot.\n",
    | 316 | + "\n",
    | 317 | + "Recall that we can subset DataFrames with Boolean masks. For instance, say we have a DataFrame `counts` with a column `A`. If we want to create a new DataFrame called `above_800`, which only contains the values over 800 in column `A` of `counts`, we would write:\n",
    | 318 | + "\n",
    | 319 | + "```\n",
    | 320 | + "above_800 = counts[counts['A'] > 800]\n",
    | 321 | + "```\n",
    | 322 | + "\n",
    | 323 | + "Let's perform the same operation on our data.\n",
325 | 324 | "\n",
326 | 325 | "1. Create a new DataFrame, `df_below`. It will be a subset of our `df`, based on the condition that the value in `above_poverty_line` is 0.\n",
327 | 326 | "2. Create a new DataFrame, `df_above`. It will be a subset of our `df`, based on the condition that the value in `above_poverty_line` is 1."
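
A sketch of the two subsets, mirroring the `above_800` example:

```python
# Boolean masks on the column created in the previous step
df_below = df[df['above_poverty_line'] == 0]  # below the poverty line
df_above = df[df['above_poverty_line'] == 1]  # at or above the poverty line

print(len(df_below), len(df_above))
```
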
342 | 341 | "cell_type": "markdown",
343 | 342 | "metadata": {},
344 | 343 | "source": [
345 |     | - "## 🥊 Challenge 8: Bar Plot of Value Counts\n",
    | 344 | + "## Creating the Visualization\n",
    | 345 | + "\n",
    | 346 | + "Finally, let's create our bar plots. We will create 2 plots in 1 cell, which will be plotted on top of one another. \n",
346 | 347 | "\n",
347 |     | - "Finally, let's create our bar plots. Fill in the blanks below, following the steps. \n",
    | 348 | + "Fill in the blanks below, following the steps. \n",
348 | 349 | "\n",
349 |     | - "1. Run a **normalized** `value_counts()` on the `general_health` column of `df_above`.\n",
    | 350 | + "1. Run a **normalized** `value_counts()` on the `general_health` column of `df_above` and `df_below`.\n",
350 | 351 | "2. Run `plot()` on the output of the resulting DataFrame. Enter the values for two arguments: `kind` must be set to `bar`, and `alpha` must be set to `.5`.\n"
351 | 352 | ]
352 | 353 | },

360 | 361 | "source": [
361 | 362 | "# YOUR CODE HERE\n",
362 | 363 | "df_above[...].value_counts(...).plot(kind=..., alpha=...);\n",
363 |     | - "df_below['general_health'].value_counts(normalize=True).plot(kind=..., alpha=...,color='maroon');"
    | 364 | + "df_below[...].value_counts(...).plot(kind=..., alpha=...,color='maroon');"
364 | 365 | ]
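
One way the blanks might be filled in, assuming matplotlib is installed (pandas' `.plot()` draws with it). Both calls run in one cell, so the bars land on the same axes and overlap:

```python
# Normalized bar plots of general health, overlaid for the two groups
df_above['general_health'].value_counts(normalize=True).plot(kind='bar', alpha=0.5)
df_below['general_health'].value_counts(normalize=True).plot(kind='bar', alpha=0.5, color='maroon');
```
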
365 | 366 | },
366 | 367 | {
|
0 commit comments