diff --git a/img/inspect.png b/img/inspect.png new file mode 100644 index 0000000..113b173 Binary files /dev/null and b/img/inspect.png differ diff --git a/lessons/02_web_scraping.ipynb b/lessons/02_web_scraping.ipynb index 385806a..d5157fa 100644 --- a/lessons/02_web_scraping.ipynb +++ b/lessons/02_web_scraping.ipynb @@ -8,6 +8,17 @@ "\n", "* * * \n", "\n", + "
\n", + " \n", + "### Learning Objectives \n", + " \n", + "* Understand when and when not to resort to web scraping.\n", + "* Become confident in using BeautifulSoup as a tool for web scraping.\n", + "* Understand the difference between tags, attributes, and attribute values.\n", + "* Use BeautifulSoup on a real-world website.\n", + "
\n", + "\n", + "\n", "### Icons used in this notebook\n", "🔔 **Question**: A quick question to help you understand what's going on.
\n", "🥊 **Challenge**: Interactive exercise. We'll work through these in the workshop!
\n", @@ -16,14 +27,19 @@ "🎬 **Demo**: Showing off something more advanced – so you know what Python can be used for!
\n", "\n", "### Learning Objectives\n", - "1. [Reflection: To Scape Or Not To Scrape](#when)\n", - "2. [Extracting and Parsing HTML](#extract)\n", - "3. [Scraping the Illinois General Assembly](#scrape)" + "1. [To Scape Or Not To Scrape](#when)\n", + "2. [Installation](#install)\n", + "3. [BeautifulSoup: A Quick Example](#ex)\n", + "4. [Our Data](#data)\n", + "5. [Extracting and Parsing HTML](#extract)\n", + "6. [Scraping the Illinois General Assembly](#scrape)" ] }, { "cell_type": "markdown", - "metadata": {}, + "metadata": { + "tags": [] + }, "source": [ "\n", "\n", @@ -31,16 +47,16 @@ "\n", "When we'd like to access data from the web, we first have to make sure if the website we are interested in offers a Web API. Platforms like Twitter, Reddit, and the New York Times offer APIs. **Check out D-Lab's [Python Web APIs](https://github.com/dlab-berkeley/Python-Web-APIs) workshop if you want to learn how to use APIs.**\n", "\n", - "However, there are often cases when a Web API does not exist. In these cases, we may have to resort to web scraping, where we extract the underlying HTML from a web page, and directly obtain the information we want. There are several packages in Python we can use to accomplish these tasks. We'll focus two packages: Requests and Beautiful Soup.\n", - "\n", - "Our case study will be scraping information on the [state senators of Illinois](http://www.ilga.gov/senate), as well as the [list of bills](http://www.ilga.gov/senate/SenatorBills.asp?MemberID=1911&GA=98&Primary=True) each senator has sponsored. Before we get started, peruse these websites to take a look at their structure." + "However, there are often cases when a Web API does not exist. In these cases, we may have to resort to web scraping, where we extract the underlying HTML from a web page, and directly obtain the information we want. There are several packages in Python we can use to accomplish these tasks. We'll focus two packages: Requests and Beautiful Soup." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## Installation\n", + "\n", + "\n", + "# Installation\n", "\n", "We will use two main packages: [Requests](http://docs.python-requests.org/en/latest/user/quickstart/) and [Beautiful Soup](http://www.crummy.com/software/BeautifulSoup/bs4/doc/). Go ahead and install these packages, if you haven't already:" ] @@ -94,6 +110,166 @@ "import time" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "\n", + "# BeautifulSoup: A Quick Example\n", + "\n", + "Let's consider a simple HTML structure:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "html_content = \"\"\"\n", + " \n", + " Sample Page\n", + " \n", + " \n", + " \n", + "
\n", + "

Welcome to the Sample Page

\n", + "

First paragraph.

\n", + "

Second paragraph.

\n", + " Visit Example\n", + "
\n", + "

Nested paragraph.

\n", + " \"Sample\n", + "
\n", + "
\n", + " \n", + "\"\"\"\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can call `BeautifulSoup` on this `html_content`. This will return an object (called a **soup object**) which contains all of the HTML in the original document." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "soup = BeautifulSoup(html_content, 'html.parser')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "# Let's have a look\n", + "print(soup.prettify())" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "💡 **Tip:** `.prettify()` is a really useful method that retains the indentation of the original HTML. This makes it a lot more readable!" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The output looks pretty similar to the original, but now it's organized in a `soup` object that allows us to more easily traverse the HTML." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## `find_all`\n", + "\n", + "Let's search through this HTML using `BeautifulSoup`. We will search for ALL `p` tags in the HTML:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "paragraphs = soup.find_all('p')\n", + "for para in paragraphs:\n", + " print(para)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "There are a lot of methods we can use to get more specific data (such as the text content itself), but this is the basic functionality of `BeautifulSoup`. Let's now look at a real-world example." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 🥊 Challenge 1: Find h1\n", + "\n", + "We can also use `find()` to find the first available tag in this HTML. Use it to find the `h1` tag in the soup!\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "# YOUR CODE HERE\n", + "soup.find_all('h1')\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "# Our Data\n", + "\n", + "Our case study will be scraping information on the [state senators of Illinois](http://www.ilga.gov/senate).\n", + "\n", + "**Let's open this website to take a look at its structure!**" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Here's what happens if you click \"Inspect\" in your browser:\n", + "\n", + "\"inspect\n", + "\n", + "On the right-hand side, you see the HTML that makes up the website. To the right of that is the CSS linked to those elements.\n", + "\n", + "Right-clicking on any part on the webpage and Inspecting it will automatically shpow you the part of the HTML that you are highlighting.\n", + "\n", + "💡 **Tip**: If you want to see the full HTML code, you can right-click on the webpage and select \"View Page Source\".\n" + ] + }, { "cell_type": "markdown", "metadata": {}, @@ -104,7 +280,7 @@ "\n", "In order to succesfully scrape and analyse HTML, we'll be going through the following 4 steps:\n", "1. Make a GET request\n", - "2. Parse the page with Beautiful Soup\n", + "2. Parse the page with `BeautifulSoup`\n", "3. Search for HTML elements\n", "4. 
@@ -143,17 +319,19 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## Step 2: Parse the Page with Beautiful Soup\n", + "## Step 2: Parse the Page with `BeautifulSoup`\n", "\n", - "Now, we use the `BeautifulSoup` function to parse the reponse into an HTML tree. This returns an object (called a **soup object**) which contains all of the HTML in the original document.\n", + "Now, we use the `BeautifulSoup` function to parse the response into an HTML tree. This returns a **soup object** which contains all of the HTML in the original document.\n", "\n", - "If you run into an error about a parser library, make sure you've installed the `lxml` package to provide Beautiful Soup with the necessary parsing tools." + "⚠️ **Warning**: If you run into an error about a parser library, make sure you've installed the `lxml` package to provide Beautiful Soup with the necessary parsing tools." ] }, { "cell_type": "code", "execution_count": null, - "metadata": {}, + "metadata": { + "tags": [] + }, "outputs": [], "source": [ "# Parse the response into an HTML tree\n", "soup = BeautifulSoup(src, \"lxml\")\n", "# Take a look\n", "print(soup.prettify()[:1000])" ] }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The output looks pretty similar to the above, but now it's organized in a `soup` object which allows us to more easily traverse the page." - ] - }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 3: Search for HTML Elements\n", "\n", "Beautiful Soup has a number of functions to find useful components on a page. Beautiful Soup lets you find elements by their:\n", "\n", - "1. HTML tags\n", + "1. HTML Tags\n", "2. HTML Attributes\n", "3. CSS Selectors\n", "\n", - "Let's search first for **HTML tags**. \n", + "Let's search first for **HTML tags**, like we did before. \n", "\n", "The function `find_all` searches the `soup` tree to find all the elements with a particular HTML tag, and returns all of those elements.\n", "\n", - "What does the example below do?" + "🔔 **Question**: What does the example below do?" ] }, { "cell_type": "code", "execution_count": null, - "metadata": {}, + "metadata": { + "tags": [] + }, "outputs": [], "source": [ "# Find all elements with a certain tag\n", "a_tags = soup.find_all(\"a\")\n", "print(a_tags[:10])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Because `find_all()` is the most popular method in the Beautiful Soup search API, you can use a shortcut for it. If you treat the BeautifulSoup object as though it were a function, then it’s the same as calling `find_all()` on that object. \n", - "\n", - "These two lines of code are equivalent:" + "How many links did we obtain?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], - "source": [ - "a_tags = soup.find_all(\"a\")\n", - "a_tags_alt = soup(\"a\")\n", - "print(a_tags[0])\n", - "print(a_tags_alt[0])" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "How many links did we obtain?" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], "source": [ "print(len(a_tags))" ] }, @@ -246,7 +398,7 @@ "\n", "What if we wanted to search for HTML tags with certain attributes, such as particular CSS classes? \n", "\n", - "We can do this by adding an additional argument to the `find_all`. In the example below, we are finding all the `a` tags, and then filtering those with `class_=\"sidemenu\"`." + "We can do this by adding an additional argument to the `find_all`. In the example below, we are finding all the `a` tags, and then filtering those with `class_=\"sidemenu\"`. That means we'll only get the `a` tags whose `class` attribute has the value `sidemenu`." ] },
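+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "💡 **Tip**: The same filter can also be written with an explicit `attrs` dictionary, which works for any attribute. A quick sketch:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "# attrs= takes a dictionary mapping attribute names to values\n", + "print(len(soup.find_all('a', attrs={'class': 'sidemenu'})))\n", + "# Keyword filters work for other attributes too:\n", + "# href=True keeps only tags that have an href attribute at all\n", + "print(len(soup.find_all('a', href=True)))" + ] + },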
@@ -266,9 +418,13 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "A more efficient way to search for elements on a website is via a **CSS selector**. For this we have to use a different method called `select()`. Just pass a string into the `.select()` to get all elements with that string as a valid CSS selector.\n", + "## `find_all` and `select`\n", + "\n", + "Another way to search for elements on a website is via a **CSS selector**. This method is particularly useful when you're familiar with CSS and want to leverage that knowledge to navigate and search through the document.\n", + "\n", + "For this we can use a method called `select()`. You can pass a string into `.select()` to get all elements with that string as a valid CSS selector.\n", "\n", - "In the example above, we can use `\"a.sidemenu\"` as a CSS selector, which returns all `a` tags with class `sidemenu`." + "For instance, we can use `\"a.sidemenu\"` as a CSS selector, which returns all `a` tags with class `sidemenu`, just like we did above!" ] }, @@ -288,7 +444,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## 🥊 Challenge: Find All\n", + "## 🥊 Challenge 2: Find All\n", "\n", "Use BeautifulSoup to find all the `a` elements with class `mainmenu`." ] }, @@ -306,14 +462,11 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## Step 4: Get Attributes and Text of Elements\n", + "## Step 4: Get Text or Attribute Values\n", "\n", - "Once we identify elements, we want the access information in that element. Usually, this means two things:\n", + "Once we identify elements, we want to access the information in them. Usually, we will be interested in webpage text, or attribute values.\n", "\n", - "1. Text\n", - "2. Attributes\n", - "\n", - "Getting the text inside an element is easy. All we have to do is use the `text` member of a `tag` object:" + "To do this, we first get a tag object. For instance, let's grab an `a` tag with the `sidemenu` class:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "# Get all sidemenu links as a list\n", "side_menu_links = soup.select(\"a.sidemenu\")\n", "\n", "# Examine the first link\n", "first_link = side_menu_links[0]\n", - "print(first_link)\n", - "\n", - "# What class is this variable?\n", - "print('Class: ', type(first_link))" + "print(first_link)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "It's a Beautiful Soup tag! This means it has a `text` member:" + "What we just printed is a BeautifulSoup tag object. It's a little piece of HTML. To recap:\n", + "* `<a>` is the element or tag.\n", + "* `class` is an attribute.\n", + "* `\"sidemenu\"` is the value of the `class` attribute.\n", + "* `href` is another attribute.\n", + "* `\"/senate/default.asp\"` is the value of the `href` attribute.\n", + "* \"Members\" is the text content of the element.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To get the text of a BeautifulSoup object, we can call a Python attribute called `text`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "first_link.text" ] }, @@ -357,6 +520,8 @@ "cell_type": "markdown", "metadata": {}, "source": [ + "## Getting URLs\n", + "\n", "Sometimes we want the value of certain attributes. 
This is particularly relevant for `a` tags, or links, where the `href` attribute tells us where the link goes.\n", "\n", "💡 **Tip**: You can access a tag’s attributes by treating the tag like a dictionary:" @@ -377,7 +542,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## 🥊 Challenge: Extract specific attributes\n", + "## 🥊 Challenge 3: Extract specific attributes\n", "\n", "Extract all `href` attributes for each `mainmenu` URL." ] @@ -443,7 +608,9 @@ { "cell_type": "code", "execution_count": null, - "metadata": {}, + "metadata": { + "tags": [] + }, "outputs": [], "source": [ "# Get all table row elements\n", @@ -461,7 +628,9 @@ { "cell_type": "code", "execution_count": null, - "metadata": {}, + "metadata": { + "tags": [] + }, "outputs": [], "source": [ "# Returns every ‘tr tr tr’ css selector in the page\n", @@ -481,7 +650,9 @@ { "cell_type": "code", "execution_count": null, - "metadata": {}, + "metadata": { + "tags": [] + }, "outputs": [], "source": [ "example_row = rows[2]\n", @@ -502,7 +673,9 @@ { "cell_type": "code", "execution_count": null, - "metadata": {}, + "metadata": { + "tags": [] + }, "outputs": [], "source": [ "for cell in example_row.select('td'):\n", @@ -546,7 +719,9 @@ { "cell_type": "code", "execution_count": null, - "metadata": {}, + "metadata": { + "tags": [] + }, "outputs": [], "source": [ "# Select only those 'td' tags with class 'detail' \n", @@ -564,7 +739,9 @@ { "cell_type": "code", "execution_count": null, - "metadata": {}, + "metadata": { + "tags": [] + }, "outputs": [], "source": [ "# Keep only the text in each of those cells\n", @@ -583,7 +760,9 @@ { "cell_type": "code", "execution_count": null, - "metadata": {}, + "metadata": { + "tags": [] + }, "outputs": [], "source": [ "print(row_data[0]) # Name\n", @@ -603,7 +782,9 @@ { "cell_type": "code", "execution_count": null, - "metadata": {}, + "metadata": { + "tags": [] + }, "outputs": [], "source": [ "print('Row 0:\\n', rows[0], '\\n')\n", @@ -623,7 +804,9 @@ { "cell_type": "code", "execution_count": null, - "metadata": {}, + "metadata": { + "tags": [] + }, "outputs": [], "source": [ "# Bad rows\n", @@ -645,7 +828,9 @@ { "cell_type": "code", "execution_count": null, - "metadata": {}, + "metadata": { + "tags": [] + }, "outputs": [], "source": [ "good_rows = [row for row in rows if len(row) == 5]\n", @@ -666,7 +851,9 @@ { "cell_type": "code", "execution_count": null, - "metadata": {}, + "metadata": { + "tags": [] + }, "outputs": [], "source": [ "rows[2].select('td.detail') " @@ -675,7 +862,9 @@ { "cell_type": "code", "execution_count": null, - "metadata": {}, + "metadata": { + "tags": [] + }, "outputs": [], "source": [ "# Bad row\n", @@ -738,111 +927,6 @@ " members.append(senator)" ] }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Should be 61\n", - "len(members)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Let's take a look at what we have in `members`." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "print(members[:5])" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 🥊 Challenge: Get `href` elements pointing to members' bills \n", - "\n", - "The code above retrieves information on: \n", - "\n", - "- the senator's name,\n", - "- their district number,\n", - "- and their party.\n", - "\n", - "We now want to retrieve the URL for each senator's list of bills. Each URL will follow a specific format. 
\n", - "\n", - "The format for the list of bills for a given senator is:\n", - "\n", - "`http://www.ilga.gov/senate/SenatorBills.asp?GA=98&MemberID=[MEMBER_ID]&Primary=True`\n", - "\n", - "to get something like:\n", - "\n", - "`http://www.ilga.gov/senate/SenatorBills.asp?MemberID=1911&GA=98&Primary=True`\n", - "\n", - "in which `MEMBER_ID=1911`. \n", - "\n", - "You should be able to see that, unfortunately, `MEMBER_ID` is not currently something pulled out in our scraping code.\n", - "\n", - "Your initial task is to modify the code above so that we also **retrieve the full URL which points to the corresponding page of primary-sponsored bills**, for each member, and return it along with their name, district, and party.\n", - "\n", - "Tips: \n", - "\n", - "* To do this, you will want to get the appropriate anchor element (``) in each legislator's row of the table. You can again use the `.select()` method on the `row` object in the loop to do this — similar to the command that finds all of the `td.detail` cells in the row. Remember that we only want the link to the legislator's bills, not the committees or the legislator's profile page.\n", - "* The anchor elements' HTML will look like `Bills`. The string in the `href` attribute contains the **relative** link we are after. You can access an attribute of a BeatifulSoup `Tag` object the same way you access a Python dictionary: `anchor['attributeName']`. See the documentation for more details.\n", - "* There are a _lot_ of different ways to use BeautifulSoup to get things done. whatever you need to do to pull the `href` out is fine.\n", - "\n", - "The code has been partially filled out for you. Fill it in where it says `#YOUR CODE HERE`. Save the path into an object called `full_path`." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "# Make a GET request\n", - "req = requests.get('http://www.ilga.gov/senate/default.asp?GA=98')\n", - "# Read the content of the server’s response\n", - "src = req.text\n", - "# Soup it\n", - "soup = BeautifulSoup(src, \"lxml\")\n", - "# Create empty list to store our data\n", - "members = []\n", - "\n", - "# Returns every ‘tr tr tr’ css selector in the page\n", - "rows = soup.select('tr tr tr')\n", - "# Get rid of junk rows\n", - "rows = [row for row in rows if row.select('td.detail')]\n", - "\n", - "# Loop through all rows\n", - "for row in rows:\n", - " # Select only those 'td' tags with class 'detail'\n", - " detail_cells = row.select('td.detail') \n", - " # Keep only the text in each of those cells\n", - " row_data = [cell.text for cell in detail_cells]\n", - " # Collect information\n", - " name = row_data[0]\n", - " district = int(row_data[3])\n", - " party = row_data[4]\n", - "\n", - " # YOUR CODE HERE\n", - " full_path = ''\n", - "\n", - " # Store in a tuple\n", - " senator = (name, district, party, full_path)\n", - " # Append to list\n", - " members.append(senator)" - ] - }, { "cell_type": "code", "execution_count": null, @@ -851,30 +935,15 @@ }, "outputs": [], "source": [ - "# Uncomment to test \n", - "# members[:5]" + "# Should be 61\n", + "len(members)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## 🥊 Challenge: Modularize Your Code\n", - "\n", - "Turn the code above into a function that accepts a URL, scrapes the URL for its senators, and returns a list of tuples containing information about each senator. 
" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "# YOUR CODE HERE\n", - "def get_members(url):\n", - " return [___]\n" + "Let's take a look at what we have in `members`." ] }, { @@ -885,103 +954,25 @@ }, "outputs": [], "source": [ - "# Test your code\n", - "url = 'http://www.ilga.gov/senate/default.asp?GA=98'\n", - "senate_members = get_members(url)\n", - "len(senate_members)" + "print(members[:5])" ] }, { "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 🥊 Take-home Challenge: Writing a Scraper Function\n", - "\n", - "We want to scrape the webpages corresponding to bills sponsored by each bills.\n", - "\n", - "Write a function called `get_bills(url)` to parse a given bills URL. This will involve:\n", - "\n", - " - requesting the URL using the `requests` library\n", - " - using the features of the `BeautifulSoup` library to find all of the `` elements with the class `billlist`\n", - " - return a _list_ of tuples, each with:\n", - " - description (2nd column)\n", - " - chamber (S or H) (3rd column)\n", - " - the last action (4th column)\n", - " - the last action date (5th column)\n", - " \n", - "This function has been partially completed. Fill in the rest." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "def get_bills(url):\n", - " src = requests.get(url).text\n", - " soup = BeautifulSoup(src)\n", - " rows = soup.select('tr')\n", - " bills = []\n", - " for row in rows:\n", - " # YOUR CODE HERE\n", - " bill_id =\n", - " description =\n", - " chamber =\n", - " last_action =\n", - " last_action_date =\n", - " bill = (bill_id, description, chamber, last_action, last_action_date)\n", - " bills.append(bill)\n", - " return bills" - ] - }, - { - "cell_type": "code", - "execution_count": null, "metadata": { + "jp-MarkdownHeadingCollapsed": true, "tags": [] }, - "outputs": [], - "source": [ - "# Uncomment to test your code\n", - "# test_url = senate_members[0][3]\n", - "# get_bills(test_url)[0:5]" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, "source": [ - "### Scrape All Bills\n", + "
\n", "\n", - "Finally, create a dictionary `bills_dict` which maps a district number (the key) onto a list of bills (the value) coming from that district. You can do this by looping over all of the senate members in `members_dict` and calling `get_bills()` for each of their associated bill URLs.\n", + "## ❗ Key Points\n", "\n", - "**NOTE:** please call the function `time.sleep(1)` for each iteration of the loop, so that we don't destroy the state's web site." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "# YOUR CODE HERE\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "# Uncomment to test your code\n", - "# bills_dict[52]" + "* `BeautifulSoup` creates so-called soup objects from HTML that you can search through.\n", + "* The `find_all()` method searches through a soup object for a specified tag and attributes, e.g. `find_all('a', class_='sidemenu')`.\n", + "* The `select()` method searches through a soup object using CSS selectors, e.g. `select('a.sidemenu')`.\n", + "* Scraping is often a matter of searching through HTML code and, step by step, getting the right subset of information.\n", + "
" ] } ], @@ -1002,7 +993,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.8.13" + "version": "3.11.3" }, "vscode": { "interpreter": { diff --git a/solutions/02_web_scraping_solutions.ipynb b/solutions/02_web_scraping_solutions.ipynb index a21532e..53603f8 100644 --- a/solutions/02_web_scraping_solutions.ipynb +++ b/solutions/02_web_scraping_solutions.ipynb @@ -38,242 +38,41 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## Challenge: Find All\n", + "## 🥊 Challenge 1: Find h1\n", "\n", - "Use Beautiful Soup to find all the `a` elements with class `mainmenu`." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "soup.select(\"a.mainmenu\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Challenge: Extract Specific Attributes\n", - "\n", - "Extract all `href` attributes for each `mainmenu` URL." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "[link['href'] for link in soup.select(\"a.mainmenu\")]" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Challenge: Get `href` elements pointing to members' bills\n", - "\n", - "The code above retrieves information on: \n", - "\n", - "- the senator's name,\n", - "- their district number,\n", - "- and their party.\n", - "\n", - "We now want to retrieve the URL for each senator's list of bills. Each URL will follow a specific format. \n", - "\n", - "The format for the list of bills for a given senator is:\n", - "\n", - "`http://www.ilga.gov/senate/SenatorBills.asp?GA=98&MemberID=[MEMBER_ID]&Primary=True`\n", - "\n", - "to get something like:\n", - "\n", - "`http://www.ilga.gov/senate/SenatorBills.asp?MemberID=1911&GA=98&Primary=True`\n", - "\n", - "in which `MEMBER_ID=1911`. \n", - "\n", - "You should be able to see that, unfortunately, `MEMBER_ID` is not currently something pulled out in our scraping code.\n", - "\n", - "Your initial task is to modify the code above so that we also **retrieve the full URL which points to the corresponding page of primary-sponsored bills**, for each member, and return it along with their name, district, and party.\n", - "\n", - "Tips: \n", - "\n", - "* To do this, you will want to get the appropriate anchor element (``) in each legislator's row of the table. You can again use the `.select()` method on the `row` object in the loop to do this — similar to the command that finds all of the `td.detail` cells in the row. Remember that we only want the link to the legislator's bills, not the committees or the legislator's profile page.\n", - "* The anchor elements' HTML will look like `Bills`. The string in the `href` attribute contains the **relative** link we are after. You can access an attribute of a BeatifulSoup `Tag` object the same way you access a Python dictionary: `anchor['attributeName']`. See the documentation for more details.\n", - "* There are a _lot_ of different ways to use BeautifulSoup to get things done. whatever you need to do to pull the `href` out is fine.\n", - "\n", - "The code has been partially filled out for you. Fill it in where it says `#YOUR CODE HERE`. Save the path into an object called `full_path`." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Make a GET request\n", - "req = requests.get('http://www.ilga.gov/senate/default.asp?GA=98')\n", - "# Read the content of the server’s response\n", - "src = req.text\n", - "# Soup it\n", - "soup = BeautifulSoup(src, \"lxml\")\n", - "# Create empty list to store our data\n", - "members = []\n", - "\n", - "# Returns every ‘tr tr tr’ css selector in the page\n", - "rows = soup.select('tr tr tr')\n", - "# Get rid of junk rows\n", - "rows = [row for row in rows if row.select('td.detail')]\n", - "\n", - "# Loop through all rows\n", - "for row in rows:\n", - " # Select only those 'td' tags with class 'detail'\n", - " detail_cells = row.select('td.detail') \n", - " # Keep only the text in each of those cells\n", - " row_data = [cell.text for cell in detail_cells]\n", - " # Collect information\n", - " name = row_data[0]\n", - " district = int(row_data[3])\n", - " party = row_data[4]\n", - " \n", - " # YOUR CODE HERE\n", - " # Extract href\n", - " href = row.select('a')[1]['href']\n", - " # Create full path\n", - " full_path = \"http://www.ilga.gov/senate/\" + href + \"&Primary=True\"\n", - " \n", - " # Store in a tuple\n", - " senator = (name, district, party, full_path)\n", - " # Append to list\n", - " members.append(senator)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "members[:5]" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Challenge: Modularize Your Code\n", - "\n", - "Turn the code above into a function that accepts a URL, scrapes the URL for its senators, and returns a list of tuples containing information about each senator. " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "def get_members(url):\n", - " # Make a GET request\n", - " req = requests.get(url)\n", - " # Read the content of the server’s response\n", - " src = req.text\n", - " # Soup it\n", - " soup = BeautifulSoup(src, \"lxml\")\n", - " # Create empty list to store our data\n", - " members = []\n", - "\n", - " # Returns every ‘tr tr tr’ css selector in the page\n", - " rows = soup.select('tr tr tr')\n", - " # Get rid of junk rows\n", - " rows = [row for row in rows if row.select('td.detail')]\n", - "\n", - " # Loop through all rows\n", - " for row in rows:\n", - " # Select only those 'td' tags with class 'detail'\n", - " detail_cells = row.select('td.detail') \n", - " # Keep only the text in each of those cells\n", - " row_data = [cell.text for cell in detail_cells]\n", - " # Collect information\n", - " name = row_data[0]\n", - " district = int(row_data[3])\n", - " party = row_data[4]\n", - "\n", - " # YOUR CODE HERE\n", - " # Extract href\n", - " href = row.select('a')[1]['href']\n", - " # Create full path\n", - " full_path = \"http://www.ilga.gov/senate/\" + href + \"&Primary=True\"\n", - "\n", - " # Store in a tuple\n", - " senator = (name, district, party, full_path)\n", - " # Append to list\n", - " members.append(senator)\n", - " return(members)" + "We can also use `find()` to find the first available tag in this HTML. Use it to find the `h1` tag in the soup!\n" ] }, { "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], + "execution_count": 60, + "metadata": { + "tags": [] + }, + "outputs": [ + { + "data": { + "text/plain": [ + "[

<h1>Welcome to the Sample Page</h1>
]" + ] + }, + "execution_count": 60, + "metadata": {}, + "output_type": "execute_result" + } + ], "source": [ - "# Test your code!\n", - "url = 'http://www.ilga.gov/senate/default.asp?GA=98'\n", - "senate_members = get_members(url)\n", - "len(senate_members)" + "# YOUR CODE HERE\n", + "soup.find_all('h1')\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## Take-home Challenge: Writing a Scraper Function\n", - "\n", - "We want to scrape the webpages corresponding to bills sponsored by each bills.\n", + "## 🥊 Challenge 2: Find All\n", "\n", - "Write a function called `get_bills(url)` to parse a given bills URL. This will involve:\n", - "\n", - " - requesting the URL using the `requests` library\n", - " - using the features of the `BeautifulSoup` library to find all of the `` elements with the class `billlist`\n", - " - return a _list_ of tuples, each with:\n", - " - description (2nd column)\n", - " - chamber (S or H) (3rd column)\n", - " - the last action (4th column)\n", - " - the last action date (5th column)\n", - " \n", - "This function has been partially completed. Fill in the rest." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "def get_bills(url):\n", - " src = requests.get(url).text\n", - " soup = BeautifulSoup(src)\n", - " rows = soup.select('tr tr tr')\n", - " bills = []\n", - " # Iterate over rows\n", - " for row in rows:\n", - " # Grab all bill list cells\n", - " cells = row.select('td.billlist')\n", - " # Keep in mind the name of the senator is not a billlist class!\n", - " if len(cells) == 5:\n", - " row_text = [cell.text for cell in cells]\n", - " # Extract info from row text\n", - " bill_id = row_text[0]\n", - " description = row_text[1]\n", - " chamber = row_text[2]\n", - " last_action = row_text[3]\n", - " last_action_date = row_text[4]\n", - " # Consolidate bill info\n", - " bill = (bill_id, description, chamber, last_action, last_action_date)\n", - " bills.append(bill)\n", - " return bills" + "Use Beautiful Soup to find all the `a` elements with class `mainmenu`." ] }, { @@ -282,31 +81,16 @@ "metadata": {}, "outputs": [], "source": [ - "test_url = senate_members[0][3]\n", - "get_bills(test_url)[0:5]" + "soup.select(\"a.mainmenu\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "### Scrape All Bills\n", + "## 🥊 Challenge 3: Extract Specific Attributes\n", "\n", - "Finally, create a dictionary `bills_dict` which maps a district number (the key) onto a list of bills (the value) coming from that district. You can do this by looping over all of the senate members in `members_dict` and calling `get_bills()` for each of their associated bill URLs.\n", - "\n", - "**NOTE:** please call the function `time.sleep(1)` for each iteration of the loop, so that we don't destroy the state's web site." - ] - }, - { - "cell_type": "code", - "execution_count": 134, - "metadata": {}, - "outputs": [], - "source": [ - "bills_dict = {}\n", - "for member in senate_members[:5]:\n", - " bills_dict[member[1]] = get_bills(member[3])\n", - " time.sleep(1)" + "Extract all `href` attributes for each `mainmenu` URL." ] }, { @@ -315,7 +99,7 @@ "metadata": {}, "outputs": [], "source": [ - "len(bills_dict[52])" + "[link['href'] for link in soup.select(\"a.mainmenu\")]" ] } ], @@ -336,7 +120,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.8.13" + "version": "3.11.3" } }, "nbformat": 4,