diff --git a/img/inspect.png b/img/inspect.png
new file mode 100644
index 0000000..113b173
Binary files /dev/null and b/img/inspect.png differ
diff --git a/lessons/02_web_scraping.ipynb b/lessons/02_web_scraping.ipynb
index 385806a..d5157fa 100644
--- a/lessons/02_web_scraping.ipynb
+++ b/lessons/02_web_scraping.ipynb
@@ -8,6 +8,17 @@
"\n",
"* * * \n",
"\n",
+ "
\n",
+ " \n",
+ "### Learning Objectives \n",
+ " \n",
+ "* Understand when and when not to resort to web scraping.\n",
+ "* Become confident in using BeautifulSoup as a tool for web scraping.\n",
+ "* Understand the difference between tags, attributes, and attribute values.\n",
+ "* Use BeautifulSoup on a real-world website.\n",
+ "
\n",
+ "\n",
+ "\n",
"### Icons used in this notebook\n",
"🔔 **Question**: A quick question to help you understand what's going on.
\n",
"🥊 **Challenge**: Interactive exercise. We'll work through these in the workshop!
\n",
@@ -16,14 +27,19 @@
"🎬 **Demo**: Showing off something more advanced – so you know what Python can be used for!
\n",
"\n",
"### Learning Objectives\n",
- "1. [Reflection: To Scape Or Not To Scrape](#when)\n",
- "2. [Extracting and Parsing HTML](#extract)\n",
- "3. [Scraping the Illinois General Assembly](#scrape)"
+ "1. [To Scape Or Not To Scrape](#when)\n",
+ "2. [Installation](#install)\n",
+ "3. [BeautifulSoup: A Quick Example](#ex)\n",
+ "4. [Our Data](#data)\n",
+ "5. [Extracting and Parsing HTML](#extract)\n",
+ "6. [Scraping the Illinois General Assembly](#scrape)"
]
},
{
"cell_type": "markdown",
- "metadata": {},
+ "metadata": {
+ "tags": []
+ },
"source": [
"\n",
"\n",
@@ -31,16 +47,16 @@
"\n",
"When we'd like to access data from the web, we first have to make sure if the website we are interested in offers a Web API. Platforms like Twitter, Reddit, and the New York Times offer APIs. **Check out D-Lab's [Python Web APIs](https://github.com/dlab-berkeley/Python-Web-APIs) workshop if you want to learn how to use APIs.**\n",
"\n",
- "However, there are often cases when a Web API does not exist. In these cases, we may have to resort to web scraping, where we extract the underlying HTML from a web page, and directly obtain the information we want. There are several packages in Python we can use to accomplish these tasks. We'll focus two packages: Requests and Beautiful Soup.\n",
- "\n",
- "Our case study will be scraping information on the [state senators of Illinois](http://www.ilga.gov/senate), as well as the [list of bills](http://www.ilga.gov/senate/SenatorBills.asp?MemberID=1911&GA=98&Primary=True) each senator has sponsored. Before we get started, peruse these websites to take a look at their structure."
+ "However, there are often cases when a Web API does not exist. In these cases, we may have to resort to web scraping, where we extract the underlying HTML from a web page, and directly obtain the information we want. There are several packages in Python we can use to accomplish these tasks. We'll focus two packages: Requests and Beautiful Soup."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "## Installation\n",
+ "\n",
+ "\n",
+ "# Installation\n",
"\n",
"We will use two main packages: [Requests](http://docs.python-requests.org/en/latest/user/quickstart/) and [Beautiful Soup](http://www.crummy.com/software/BeautifulSoup/bs4/doc/). Go ahead and install these packages, if you haven't already:"
]
@@ -94,6 +110,166 @@
"import time"
]
},
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "\n",
+ "\n",
+ "# BeautifulSoup: A Quick Example\n",
+ "\n",
+ "Let's consider a simple HTML structure:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "html_content = \"\"\"\n",
+ " \n",
+ " Sample Page\n",
+ " \n",
+ " \n",
+ " \n",
+ " \n",
+ " \n",
+ "
First paragraph.
\n",
+ "
Second paragraph.
\n",
+ "
Visit Example\n",
+ "
\n",
+ "
Nested paragraph.
\n",
+ "

\n",
+ "
\n",
+ "
\n",
+ " \n",
+ "\"\"\"\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We can call `BeautifulSoup` on this `html_content`. This will return an object (called a **soup object**) which contains all of the HTML in the original document."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "soup = BeautifulSoup(html_content, 'html.parser')"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "# Let's have a look\n",
+ "print(soup.prettify())"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "💡 **Tip:** `.prettify()` is a really useful method that retains the indentation of the original HTML. This makes it a lot more readable!"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The output looks pretty similar to the original, but now it's organized in a `soup` object that allows us to more easily traverse the HTML."
+ ]
+ },
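+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "For instance, we can navigate the tree by using tag names as attributes of the soup object. Here's a quick sketch of that idea, using the tags from our `html_content` above:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "# Navigate directly to the first tag with a given name\n",
+    "print(soup.title)\n",
+    "# Attribute access can be chained: the first <p> inside <body>\n",
+    "print(soup.body.p)"
+   ]
+  },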
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## `find_all`\n",
+ "\n",
+ "Let's search through this HTML using `BeautifulSoup`. We will search for ALL `p` tags in the HTML:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "paragraphs = soup.find_all('p')\n",
+ "for para in paragraphs:\n",
+ " print(para)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "There are a lot of methods we can use to get more specific data (such as the text content itself), but this is the basic functionality of `BeautifulSoup`. Let's now look at a real-world example."
+ ]
+ },
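+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "# Print only the text inside each paragraph\n",
+    "for para in paragraphs:\n",
+    "    print(para.text)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "This is the basic functionality of `BeautifulSoup`. After a quick challenge, we'll look at a real-world example."
+   ]
+  },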
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 🥊 Challenge 1: Find h1\n",
+ "\n",
+ "We can also use `find()` to find the first available tag in this HTML. Use it to find the `h1` tag in the soup!\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "# YOUR CODE HERE\n",
+ "soup.find_all('h1')\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "\n",
+ "# Our Data\n",
+ "\n",
+ "Our case study will be scraping information on the [state senators of Illinois](http://www.ilga.gov/senate).\n",
+ "\n",
+ "**Let's open this website to take a look at its structure!**"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Here's what happens if you click \"Inspect\" in your browser:\n",
+ "\n",
+ "
\n",
+ "\n",
+ "On the right-hand side, you see the HTML that makes up the website. To the right of that is the CSS linked to those elements.\n",
+ "\n",
+ "Right-clicking on any part on the webpage and Inspecting it will automatically shpow you the part of the HTML that you are highlighting.\n",
+ "\n",
+ "💡 **Tip**: If you want to see the full HTML code, you can right-click on the webpage and select \"View Page Source\".\n"
+ ]
+ },
{
"cell_type": "markdown",
"metadata": {},
@@ -104,7 +280,7 @@
"\n",
"In order to succesfully scrape and analyse HTML, we'll be going through the following 4 steps:\n",
"1. Make a GET request\n",
- "2. Parse the page with Beautiful Soup\n",
+ "2. Parse the page with `BeautifulSoup`\n",
"3. Search for HTML elements\n",
"4. Get attributes and text of these elements"
]
@@ -143,17 +319,19 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "## Step 2: Parse the Page with Beautiful Soup\n",
+ "## Step 2: Parse the Page with `BeautifulSoup`\n",
"\n",
- "Now, we use the `BeautifulSoup` function to parse the reponse into an HTML tree. This returns an object (called a **soup object**) which contains all of the HTML in the original document.\n",
+ "Now, we use the `BeautifulSoup` function to parse the reponse into an HTML tree. This returns a **soup object** which contains all of the HTML in the original document.\n",
"\n",
- "If you run into an error about a parser library, make sure you've installed the `lxml` package to provide Beautiful Soup with the necessary parsing tools."
+ "⚠️ **Warning**: If you run into an error about a parser library, make sure you've installed the `lxml` package to provide Beautiful Soup with the necessary parsing tools."
]
},
{
"cell_type": "code",
"execution_count": null,
- "metadata": {},
+ "metadata": {
+ "tags": []
+ },
"outputs": [],
"source": [
"# Parse the response into an HTML tree\n",
@@ -162,13 +340,6 @@
"print(soup.prettify()[:1000])"
]
},
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "The output looks pretty similar to the above, but now it's organized in a `soup` object which allows us to more easily traverse the page."
- ]
- },
{
"cell_type": "markdown",
"metadata": {},
@@ -177,21 +348,23 @@
"\n",
"Beautiful Soup has a number of functions to find useful components on a page. Beautiful Soup lets you find elements by their:\n",
"\n",
- "1. HTML tags\n",
+ "1. HTML Tags\n",
"2. HTML Attributes\n",
"3. CSS Selectors\n",
"\n",
- "Let's search first for **HTML tags**. \n",
+ "Let's search first for **HTML tags**, like we did before. \n",
"\n",
"The function `find_all` searches the `soup` tree to find all the elements with an a particular HTML tag, and returns all of those elements.\n",
"\n",
- "What does the example below do?"
+ "🔔 **Question**: What does the example below do?"
]
},
{
"cell_type": "code",
"execution_count": null,
- "metadata": {},
+ "metadata": {
+ "tags": []
+ },
"outputs": [],
"source": [
"# Find all elements with a certain tag\n",
@@ -203,9 +376,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "Because `find_all()` is the most popular method in the Beautiful Soup search API, you can use a shortcut for it. If you treat the BeautifulSoup object as though it were a function, then it’s the same as calling `find_all()` on that object. \n",
- "\n",
- "These two lines of code are equivalent:"
+ "How many links did we obtain?"
]
},
{
@@ -215,25 +386,6 @@
"tags": []
},
"outputs": [],
- "source": [
- "a_tags = soup.find_all(\"a\")\n",
- "a_tags_alt = soup(\"a\")\n",
- "print(a_tags[0])\n",
- "print(a_tags_alt[0])"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "How many links did we obtain?"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
"source": [
"print(len(a_tags))"
]
@@ -246,7 +398,7 @@
"\n",
"What if we wanted to search for HTML tags with certain attributes, such as particular CSS classes? \n",
"\n",
- "We can do this by adding an additional argument to the `find_all`. In the example below, we are finding all the `a` tags, and then filtering those with `class_=\"sidemenu\"`."
+ "We can do this by adding an additional argument to the `find_all`. In the example below, we are finding all the `a` tags, and then filtering those with `class_=\"sidemenu\"`. That means we'll only get the `a`-tags that also have a `class` attribute called `sidemenu`."
]
},
{
@@ -266,9 +418,13 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "A more efficient way to search for elements on a website is via a **CSS selector**. For this we have to use a different method called `select()`. Just pass a string into the `.select()` to get all elements with that string as a valid CSS selector.\n",
+ "## `find_all` and `select`\n",
+ "\n",
+ "Another way to search for elements on a website is via a **CSS selector**. This method is particularly useful when you're familiar with CSS and want to leverage that knowledge to navigate and search through the document.\n",
+ "\n",
+ "For this we can use a method called `select()`. You can pass a string into `.select()` to get all elements with that string as a valid CSS selector.\n",
"\n",
- "In the example above, we can use `\"a.sidemenu\"` as a CSS selector, which returns all `a` tags with class `sidemenu`."
+ "For instance, we can use `\"a.sidemenu\"` as a CSS selector, which returns all `a` tags with class `sidemenu`--just like we did above!"
]
},
{
@@ -288,7 +444,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "## 🥊 Challenge: Find All\n",
+ "## 🥊 Challenge 2: Find All\n",
"\n",
"Use BeautifulSoup to find all the `a` elements with class `mainmenu`."
]
@@ -306,14 +462,11 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "## Step 4: Get Attributes and Text of Elements\n",
+ "## Step 4: Get Text or Attribute Values\n",
"\n",
- "Once we identify elements, we want the access information in that element. Usually, this means two things:\n",
+ "Once we identify elements, we want the access information in that element. Usually, we will be interested in webpage text, or attribute values.\n",
"\n",
- "1. Text\n",
- "2. Attributes\n",
- "\n",
- "Getting the text inside an element is easy. All we have to do is use the `text` member of a `tag` object:"
+ "To do this, we first get a tag object. For instance, let's grab that `a` tag with the `sidemenu` attribute: "
]
},
{
@@ -329,17 +482,27 @@
"\n",
"# Examine the first link\n",
"first_link = side_menu_links[0]\n",
- "print(first_link)\n",
- "\n",
- "# What class is this variable?\n",
- "print('Class: ', type(first_link))"
+ "print(first_link)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "It's a Beautiful Soup tag! This means it has a `text` member:"
+ "What we just printed is a beautifulSoup object. It's a little piece of HTML. To recap:\n",
+ "* `` is the element or tag.\n",
+ "* `class` is an attribute.\n",
+ "* `\"sidemenu\"` is the value of the `class` attribute.\n",
+ "* `href` is another attribute.\n",
+ "* `\"/senate/default.asp\"` is the value of the href attribute.\n",
+ "* \"Members\" is the text content of the element.\n"
+ ]
+ },
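+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "As a quick sketch, we can pull these pieces out of the tag object directly: `.name` returns the tag name, and `.attrs` returns a dictionary of all attributes."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "# The tag name and a dictionary of all its attributes\n",
+    "print(first_link.name)\n",
+    "print(first_link.attrs)"
+   ]
+  },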
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "To get the text of a BeautifulSoup object, we can call a Python attribute called `text`."
]
},
{
@@ -357,6 +520,8 @@
"cell_type": "markdown",
"metadata": {},
"source": [
+ "## Getting URLs\n",
+ "\n",
"Sometimes we want the value of certain attributes. This is particularly relevant for `a` tags, or links, where the `href` attribute tells us where the link goes.\n",
"\n",
"💡 **Tip**: You can access a tag’s attributes by treating the tag like a dictionary:"
@@ -377,7 +542,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "## 🥊 Challenge: Extract specific attributes\n",
+ "## 🥊 Challenge 3: Extract specific attributes\n",
"\n",
"Extract all `href` attributes for each `mainmenu` URL."
]
@@ -443,7 +608,9 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {},
+ "metadata": {
+ "tags": []
+ },
"outputs": [],
"source": [
"# Get all table row elements\n",
@@ -461,7 +628,9 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {},
+ "metadata": {
+ "tags": []
+ },
"outputs": [],
"source": [
"# Returns every ‘tr tr tr’ css selector in the page\n",
@@ -481,7 +650,9 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {},
+ "metadata": {
+ "tags": []
+ },
"outputs": [],
"source": [
"example_row = rows[2]\n",
@@ -502,7 +673,9 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {},
+ "metadata": {
+ "tags": []
+ },
"outputs": [],
"source": [
"for cell in example_row.select('td'):\n",
@@ -546,7 +719,9 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {},
+ "metadata": {
+ "tags": []
+ },
"outputs": [],
"source": [
"# Select only those 'td' tags with class 'detail' \n",
@@ -564,7 +739,9 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {},
+ "metadata": {
+ "tags": []
+ },
"outputs": [],
"source": [
"# Keep only the text in each of those cells\n",
@@ -583,7 +760,9 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {},
+ "metadata": {
+ "tags": []
+ },
"outputs": [],
"source": [
"print(row_data[0]) # Name\n",
@@ -603,7 +782,9 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {},
+ "metadata": {
+ "tags": []
+ },
"outputs": [],
"source": [
"print('Row 0:\\n', rows[0], '\\n')\n",
@@ -623,7 +804,9 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {},
+ "metadata": {
+ "tags": []
+ },
"outputs": [],
"source": [
"# Bad rows\n",
@@ -645,7 +828,9 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {},
+ "metadata": {
+ "tags": []
+ },
"outputs": [],
"source": [
"good_rows = [row for row in rows if len(row) == 5]\n",
@@ -666,7 +851,9 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {},
+ "metadata": {
+ "tags": []
+ },
"outputs": [],
"source": [
"rows[2].select('td.detail') "
@@ -675,7 +862,9 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {},
+ "metadata": {
+ "tags": []
+ },
"outputs": [],
"source": [
"# Bad row\n",
@@ -738,111 +927,6 @@
" members.append(senator)"
]
},
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# Should be 61\n",
- "len(members)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Let's take a look at what we have in `members`."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "print(members[:5])"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## 🥊 Challenge: Get `href` elements pointing to members' bills \n",
- "\n",
- "The code above retrieves information on: \n",
- "\n",
- "- the senator's name,\n",
- "- their district number,\n",
- "- and their party.\n",
- "\n",
- "We now want to retrieve the URL for each senator's list of bills. Each URL will follow a specific format. \n",
- "\n",
- "The format for the list of bills for a given senator is:\n",
- "\n",
- "`http://www.ilga.gov/senate/SenatorBills.asp?GA=98&MemberID=[MEMBER_ID]&Primary=True`\n",
- "\n",
- "to get something like:\n",
- "\n",
- "`http://www.ilga.gov/senate/SenatorBills.asp?MemberID=1911&GA=98&Primary=True`\n",
- "\n",
- "in which `MEMBER_ID=1911`. \n",
- "\n",
- "You should be able to see that, unfortunately, `MEMBER_ID` is not currently something pulled out in our scraping code.\n",
- "\n",
- "Your initial task is to modify the code above so that we also **retrieve the full URL which points to the corresponding page of primary-sponsored bills**, for each member, and return it along with their name, district, and party.\n",
- "\n",
- "Tips: \n",
- "\n",
- "* To do this, you will want to get the appropriate anchor element (``) in each legislator's row of the table. You can again use the `.select()` method on the `row` object in the loop to do this — similar to the command that finds all of the `td.detail` cells in the row. Remember that we only want the link to the legislator's bills, not the committees or the legislator's profile page.\n",
- "* The anchor elements' HTML will look like `Bills`. The string in the `href` attribute contains the **relative** link we are after. You can access an attribute of a BeatifulSoup `Tag` object the same way you access a Python dictionary: `anchor['attributeName']`. See the documentation for more details.\n",
- "* There are a _lot_ of different ways to use BeautifulSoup to get things done. whatever you need to do to pull the `href` out is fine.\n",
- "\n",
- "The code has been partially filled out for you. Fill it in where it says `#YOUR CODE HERE`. Save the path into an object called `full_path`."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "tags": []
- },
- "outputs": [],
- "source": [
- "# Make a GET request\n",
- "req = requests.get('http://www.ilga.gov/senate/default.asp?GA=98')\n",
- "# Read the content of the server’s response\n",
- "src = req.text\n",
- "# Soup it\n",
- "soup = BeautifulSoup(src, \"lxml\")\n",
- "# Create empty list to store our data\n",
- "members = []\n",
- "\n",
- "# Returns every ‘tr tr tr’ css selector in the page\n",
- "rows = soup.select('tr tr tr')\n",
- "# Get rid of junk rows\n",
- "rows = [row for row in rows if row.select('td.detail')]\n",
- "\n",
- "# Loop through all rows\n",
- "for row in rows:\n",
- " # Select only those 'td' tags with class 'detail'\n",
- " detail_cells = row.select('td.detail') \n",
- " # Keep only the text in each of those cells\n",
- " row_data = [cell.text for cell in detail_cells]\n",
- " # Collect information\n",
- " name = row_data[0]\n",
- " district = int(row_data[3])\n",
- " party = row_data[4]\n",
- "\n",
- " # YOUR CODE HERE\n",
- " full_path = ''\n",
- "\n",
- " # Store in a tuple\n",
- " senator = (name, district, party, full_path)\n",
- " # Append to list\n",
- " members.append(senator)"
- ]
- },
{
"cell_type": "code",
"execution_count": null,
@@ -851,30 +935,15 @@
},
"outputs": [],
"source": [
- "# Uncomment to test \n",
- "# members[:5]"
+ "# Should be 61\n",
+ "len(members)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "## 🥊 Challenge: Modularize Your Code\n",
- "\n",
- "Turn the code above into a function that accepts a URL, scrapes the URL for its senators, and returns a list of tuples containing information about each senator. "
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "tags": []
- },
- "outputs": [],
- "source": [
- "# YOUR CODE HERE\n",
- "def get_members(url):\n",
- " return [___]\n"
+ "Let's take a look at what we have in `members`."
]
},
{
@@ -885,103 +954,25 @@
},
"outputs": [],
"source": [
- "# Test your code\n",
- "url = 'http://www.ilga.gov/senate/default.asp?GA=98'\n",
- "senate_members = get_members(url)\n",
- "len(senate_members)"
+ "print(members[:5])"
]
},
{
"cell_type": "markdown",
- "metadata": {},
- "source": [
- "## 🥊 Take-home Challenge: Writing a Scraper Function\n",
- "\n",
- "We want to scrape the webpages corresponding to bills sponsored by each bills.\n",
- "\n",
- "Write a function called `get_bills(url)` to parse a given bills URL. This will involve:\n",
- "\n",
- " - requesting the URL using the `requests` library\n",
- " - using the features of the `BeautifulSoup` library to find all of the `` elements with the class `billlist`\n",
- " - return a _list_ of tuples, each with:\n",
- " - description (2nd column)\n",
- " - chamber (S or H) (3rd column)\n",
- " - the last action (4th column)\n",
- " - the last action date (5th column)\n",
- " \n",
- "This function has been partially completed. Fill in the rest."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "tags": []
- },
- "outputs": [],
- "source": [
- "def get_bills(url):\n",
- " src = requests.get(url).text\n",
- " soup = BeautifulSoup(src)\n",
- " rows = soup.select('tr')\n",
- " bills = []\n",
- " for row in rows:\n",
- " # YOUR CODE HERE\n",
- " bill_id =\n",
- " description =\n",
- " chamber =\n",
- " last_action =\n",
- " last_action_date =\n",
- " bill = (bill_id, description, chamber, last_action, last_action_date)\n",
- " bills.append(bill)\n",
- " return bills"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
"metadata": {
+ "jp-MarkdownHeadingCollapsed": true,
"tags": []
},
- "outputs": [],
- "source": [
- "# Uncomment to test your code\n",
- "# test_url = senate_members[0][3]\n",
- "# get_bills(test_url)[0:5]"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
"source": [
- "### Scrape All Bills\n",
+ " \n",
"\n",
- "Finally, create a dictionary `bills_dict` which maps a district number (the key) onto a list of bills (the value) coming from that district. You can do this by looping over all of the senate members in `members_dict` and calling `get_bills()` for each of their associated bill URLs.\n",
+ "## ❗ Key Points\n",
"\n",
- "**NOTE:** please call the function `time.sleep(1)` for each iteration of the loop, so that we don't destroy the state's web site."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "tags": []
- },
- "outputs": [],
- "source": [
- "# YOUR CODE HERE\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "tags": []
- },
- "outputs": [],
- "source": [
- "# Uncomment to test your code\n",
- "# bills_dict[52]"
+ "* `BeautifulSoup` creates so-called soup objects from HTML that you can search through.\n",
+ "* The `find_all()` method searches through a soup object for a specified tag and attributes, e.g. `find_all('a', class_='sidemenu')`.\n",
+ "* The `select()` method searches through a soup object using CSS selectors, e.g. `select('a.sidemenu')`.\n",
+ "* Scraping is often a matter of searching through HTML code and, step by step, getting the right subset of information.\n",
+ " "
]
}
],
@@ -1002,7 +993,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
- "version": "3.8.13"
+ "version": "3.11.3"
},
"vscode": {
"interpreter": {
diff --git a/solutions/02_web_scraping_solutions.ipynb b/solutions/02_web_scraping_solutions.ipynb
index a21532e..53603f8 100644
--- a/solutions/02_web_scraping_solutions.ipynb
+++ b/solutions/02_web_scraping_solutions.ipynb
@@ -38,242 +38,41 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "## Challenge: Find All\n",
+ "## 🥊 Challenge 1: Find h1\n",
"\n",
- "Use Beautiful Soup to find all the `a` elements with class `mainmenu`."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "soup.select(\"a.mainmenu\")"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Challenge: Extract Specific Attributes\n",
- "\n",
- "Extract all `href` attributes for each `mainmenu` URL."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "[link['href'] for link in soup.select(\"a.mainmenu\")]"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Challenge: Get `href` elements pointing to members' bills\n",
- "\n",
- "The code above retrieves information on: \n",
- "\n",
- "- the senator's name,\n",
- "- their district number,\n",
- "- and their party.\n",
- "\n",
- "We now want to retrieve the URL for each senator's list of bills. Each URL will follow a specific format. \n",
- "\n",
- "The format for the list of bills for a given senator is:\n",
- "\n",
- "`http://www.ilga.gov/senate/SenatorBills.asp?GA=98&MemberID=[MEMBER_ID]&Primary=True`\n",
- "\n",
- "to get something like:\n",
- "\n",
- "`http://www.ilga.gov/senate/SenatorBills.asp?MemberID=1911&GA=98&Primary=True`\n",
- "\n",
- "in which `MEMBER_ID=1911`. \n",
- "\n",
- "You should be able to see that, unfortunately, `MEMBER_ID` is not currently something pulled out in our scraping code.\n",
- "\n",
- "Your initial task is to modify the code above so that we also **retrieve the full URL which points to the corresponding page of primary-sponsored bills**, for each member, and return it along with their name, district, and party.\n",
- "\n",
- "Tips: \n",
- "\n",
- "* To do this, you will want to get the appropriate anchor element (``) in each legislator's row of the table. You can again use the `.select()` method on the `row` object in the loop to do this — similar to the command that finds all of the `td.detail` cells in the row. Remember that we only want the link to the legislator's bills, not the committees or the legislator's profile page.\n",
- "* The anchor elements' HTML will look like `Bills`. The string in the `href` attribute contains the **relative** link we are after. You can access an attribute of a BeatifulSoup `Tag` object the same way you access a Python dictionary: `anchor['attributeName']`. See the documentation for more details.\n",
- "* There are a _lot_ of different ways to use BeautifulSoup to get things done. whatever you need to do to pull the `href` out is fine.\n",
- "\n",
- "The code has been partially filled out for you. Fill it in where it says `#YOUR CODE HERE`. Save the path into an object called `full_path`."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# Make a GET request\n",
- "req = requests.get('http://www.ilga.gov/senate/default.asp?GA=98')\n",
- "# Read the content of the server’s response\n",
- "src = req.text\n",
- "# Soup it\n",
- "soup = BeautifulSoup(src, \"lxml\")\n",
- "# Create empty list to store our data\n",
- "members = []\n",
- "\n",
- "# Returns every ‘tr tr tr’ css selector in the page\n",
- "rows = soup.select('tr tr tr')\n",
- "# Get rid of junk rows\n",
- "rows = [row for row in rows if row.select('td.detail')]\n",
- "\n",
- "# Loop through all rows\n",
- "for row in rows:\n",
- " # Select only those 'td' tags with class 'detail'\n",
- " detail_cells = row.select('td.detail') \n",
- " # Keep only the text in each of those cells\n",
- " row_data = [cell.text for cell in detail_cells]\n",
- " # Collect information\n",
- " name = row_data[0]\n",
- " district = int(row_data[3])\n",
- " party = row_data[4]\n",
- " \n",
- " # YOUR CODE HERE\n",
- " # Extract href\n",
- " href = row.select('a')[1]['href']\n",
- " # Create full path\n",
- " full_path = \"http://www.ilga.gov/senate/\" + href + \"&Primary=True\"\n",
- " \n",
- " # Store in a tuple\n",
- " senator = (name, district, party, full_path)\n",
- " # Append to list\n",
- " members.append(senator)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "members[:5]"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Challenge: Modularize Your Code\n",
- "\n",
- "Turn the code above into a function that accepts a URL, scrapes the URL for its senators, and returns a list of tuples containing information about each senator. "
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "def get_members(url):\n",
- " # Make a GET request\n",
- " req = requests.get(url)\n",
- " # Read the content of the server’s response\n",
- " src = req.text\n",
- " # Soup it\n",
- " soup = BeautifulSoup(src, \"lxml\")\n",
- " # Create empty list to store our data\n",
- " members = []\n",
- "\n",
- " # Returns every ‘tr tr tr’ css selector in the page\n",
- " rows = soup.select('tr tr tr')\n",
- " # Get rid of junk rows\n",
- " rows = [row for row in rows if row.select('td.detail')]\n",
- "\n",
- " # Loop through all rows\n",
- " for row in rows:\n",
- " # Select only those 'td' tags with class 'detail'\n",
- " detail_cells = row.select('td.detail') \n",
- " # Keep only the text in each of those cells\n",
- " row_data = [cell.text for cell in detail_cells]\n",
- " # Collect information\n",
- " name = row_data[0]\n",
- " district = int(row_data[3])\n",
- " party = row_data[4]\n",
- "\n",
- " # YOUR CODE HERE\n",
- " # Extract href\n",
- " href = row.select('a')[1]['href']\n",
- " # Create full path\n",
- " full_path = \"http://www.ilga.gov/senate/\" + href + \"&Primary=True\"\n",
- "\n",
- " # Store in a tuple\n",
- " senator = (name, district, party, full_path)\n",
- " # Append to list\n",
- " members.append(senator)\n",
- " return(members)"
+ "We can also use `find()` to find the first available tag in this HTML. Use it to find the `h1` tag in the soup!\n"
]
},
{
"cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
+ "execution_count": 60,
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "[]"
+ ]
+ },
+ "execution_count": 60,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
"source": [
- "# Test your code!\n",
- "url = 'http://www.ilga.gov/senate/default.asp?GA=98'\n",
- "senate_members = get_members(url)\n",
- "len(senate_members)"
+ "# YOUR CODE HERE\n",
+ "soup.find_all('h1')\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "## Take-home Challenge: Writing a Scraper Function\n",
- "\n",
- "We want to scrape the webpages corresponding to bills sponsored by each bills.\n",
+ "## 🥊 Challenge 2: Find All\n",
"\n",
- "Write a function called `get_bills(url)` to parse a given bills URL. This will involve:\n",
- "\n",
- " - requesting the URL using the `requests` library\n",
- " - using the features of the `BeautifulSoup` library to find all of the ` | ` elements with the class `billlist`\n",
- " - return a _list_ of tuples, each with:\n",
- " - description (2nd column)\n",
- " - chamber (S or H) (3rd column)\n",
- " - the last action (4th column)\n",
- " - the last action date (5th column)\n",
- " \n",
- "This function has been partially completed. Fill in the rest."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "def get_bills(url):\n",
- " src = requests.get(url).text\n",
- " soup = BeautifulSoup(src)\n",
- " rows = soup.select('tr tr tr')\n",
- " bills = []\n",
- " # Iterate over rows\n",
- " for row in rows:\n",
- " # Grab all bill list cells\n",
- " cells = row.select('td.billlist')\n",
- " # Keep in mind the name of the senator is not a billlist class!\n",
- " if len(cells) == 5:\n",
- " row_text = [cell.text for cell in cells]\n",
- " # Extract info from row text\n",
- " bill_id = row_text[0]\n",
- " description = row_text[1]\n",
- " chamber = row_text[2]\n",
- " last_action = row_text[3]\n",
- " last_action_date = row_text[4]\n",
- " # Consolidate bill info\n",
- " bill = (bill_id, description, chamber, last_action, last_action_date)\n",
- " bills.append(bill)\n",
- " return bills"
+ "Use Beautiful Soup to find all the `a` elements with class `mainmenu`."
]
},
{
@@ -282,31 +81,16 @@
"metadata": {},
"outputs": [],
"source": [
- "test_url = senate_members[0][3]\n",
- "get_bills(test_url)[0:5]"
+ "soup.select(\"a.mainmenu\")"
]
},
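+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "An equivalent approach uses `find_all` with the `class_` keyword argument:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "soup.find_all(\"a\", class_=\"mainmenu\")"
+   ]
+  },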
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "### Scrape All Bills\n",
+ "## 🥊 Challenge 3: Extract Specific Attributes\n",
"\n",
- "Finally, create a dictionary `bills_dict` which maps a district number (the key) onto a list of bills (the value) coming from that district. You can do this by looping over all of the senate members in `members_dict` and calling `get_bills()` for each of their associated bill URLs.\n",
- "\n",
- "**NOTE:** please call the function `time.sleep(1)` for each iteration of the loop, so that we don't destroy the state's web site."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 134,
- "metadata": {},
- "outputs": [],
- "source": [
- "bills_dict = {}\n",
- "for member in senate_members[:5]:\n",
- " bills_dict[member[1]] = get_bills(member[3])\n",
- " time.sleep(1)"
+ "Extract all `href` attributes for each `mainmenu` URL."
]
},
{
@@ -315,7 +99,7 @@
"metadata": {},
"outputs": [],
"source": [
- "len(bills_dict[52])"
+ "[link['href'] for link in soup.select(\"a.mainmenu\")]"
]
}
],
@@ -336,7 +120,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
- "version": "3.8.13"
+ "version": "3.11.3"
}
},
"nbformat": 4,
|