Create instagram_crawler.py #2508

Closed
wants to merge 4 commits into from
176 changes: 176 additions & 0 deletions web_programming/instagram_crawler.py
@@ -0,0 +1,176 @@
#!/usr/bin/python3
# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup
import json

headers = \
Member

No backslashes please in Python code. See PEP8. The problem with backslashes is that whitespace to the right of the backslash breaks the script on a change that is invisible to the reader.

You can run your code through psf/black to autofix that issue as discussed in CONTRIBUTING.md
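A minimal sketch of the backslash-free form this comment asks for. Python continues a statement implicitly inside brackets, so the dict literal can open on the assignment line, and adjacent string literals inside parentheses are concatenated. The line breaks chosen here are only one possible layout, not necessarily what psf/black would produce.

```python
# Implicit line continuation inside brackets: no trailing backslash needed.
headers = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/58.0.3029.110 Safari/537.36"
    )
}
```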

{
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}

# Usage
"""
>>> user = Instagram("github")
>>> user.is_verified()
True
>>> user.get_biography()
Built for developers.

"""


class Instagram(object):
    """
    Class Instagram crawl instagram user information
    """

    def __init__(self, username):
        self.username = username
        self.url = 'https://www.instagram.com/{}/'.format(username)
Member

Suggested change
-        self.url = 'https://www.instagram.com/{}/'.format(username)
+        self.url = f'https://www.instagram.com/{username}/'

As discussed in CONTRIBUTING.md, please use f-strings where they make sense.

Member

@cclauss, Sep 29, 2020

Do we really want to query the backend (Instagram) for every one of the fields below? Most of these values are very static and are unlikely to change minute-to-minute. Perhaps it would be better to have a self.user_data dict that contained the results of self.get_json() and the other methods could use that data.
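One way the caching idea in this comment could look, as a rough sketch rather than the PR's actual code. `CachedProfile`, the injected `fetch` callable, and `fake_fetch` are hypothetical names introduced here so the example runs offline. The real class would call its own `get_json()` instead.

```python
from typing import Callable, Optional


class CachedProfile:
    """Hypothetical stand-in for the PR's Instagram class."""

    def __init__(self, username: str, fetch: Callable[[str], dict]):
        self.username = username
        self._fetch = fetch  # assumed callable that returns the profile dict
        self._user_data: Optional[dict] = None

    @property
    def user_data(self) -> dict:
        # Query the backend only on first access, then reuse the result.
        if self._user_data is None:
            self._user_data = self._fetch(self.username)
        return self._user_data

    def is_verified(self) -> bool:
        return self.user_data["is_verified"]

    def get_biography(self) -> str:
        return self.user_data["biography"]


calls = []


def fake_fetch(username: str) -> dict:
    # Offline substitute for the network request, with made-up values.
    calls.append(username)
    return {"is_verified": True, "biography": "example bio"}


user = CachedProfile("github", fake_fetch)
user.is_verified()
user.get_biography()
print(len(calls))  # 1: two field reads, one fetch
```

The trade-off is staleness: a long-lived object keeps returning the first response, which matches the comment's point that these values rarely change minute-to-minute.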


    def get_json(self):
        """
        return json of user information
        """

        html = requests.get(self.url, headers=headers)
        soup = BeautifulSoup(html.text, 'html.parser')
        try:
            info = html_1(soup)
            return info
        except:
            info = html_2(soup)
            return info
Member

Suggested change
-            info = html_1(soup)
-            return info
-        except:
-            info = html_2(soup)
-            return info
+            return html_1(soup)
+        except:  # <-- This repo does not accept bare excepts
+            return html_2(soup)

Bare excepts are discussed in PEP8 and https://realpython.com/the-most-diabolical-python-antipattern/
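A hedged sketch of how the fallback might name the exceptions it expects instead of using a bare `except`. `parse_first`, `parse_second`, and the JSON layouts they read are hypothetical simplifications of `html_1` and `html_2`. Which exceptions the real parsers raise is an assumption here and would need checking against actual pages.

```python
import json


def parse_first(raw: str) -> dict:
    # Hypothetical primary parser: expects the payload under "graphql".
    return json.loads(raw)["graphql"]["user"]


def parse_second(raw: str) -> dict:
    # Hypothetical fallback parser: expects the payload under "data".
    return json.loads(raw)["data"]["user"]


def parse_user(raw: str) -> dict:
    try:
        return parse_first(raw)
    except (KeyError, IndexError, json.JSONDecodeError):
        # Only the failures expected from a layout change trigger the fallback;
        # KeyboardInterrupt, SystemExit, and unrelated errors still propagate.
        return parse_second(raw)
```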


    def get_followers(self):
Member

Let's make this (and similar methods below) into a @property so that we can use the syntax instagram_user.number_of_followers (without the ()).

Suggested change
-    def get_followers(self):
+    @property
+    def number_of_followers(self) -> int:

Also, let's streamline to a one-line implementation like:

    return self.get_json()['edge_followed_by']['count']
    # or...
    return self.data['edge_followed_by']['count']
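To illustrate the attribute-style access the `@property` suggestion gives, here is a small self-contained sketch. The `Profile` class and the follower count are made up for the example. Only the `edge_followed_by` key comes from the code under review.

```python
class Profile:
    """Hypothetical holder for already-fetched profile data."""

    def __init__(self, data: dict):
        self.data = data

    @property
    def number_of_followers(self) -> int:
        return self.data["edge_followed_by"]["count"]


profile = Profile({"edge_followed_by": {"count": 42}})  # made-up count
print(profile.number_of_followers)  # attribute access, no parentheses
```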

Member

In general, do not create a variable that you get rid of on the very next line unless the variable name really helps the reader understand something nonobvious.

        """
        return number of followers
        """

        info = self.get_json()
        followers = info['edge_followed_by']['count']
        return followers

    def get_followings(self):
Member

Suggested change
-    def get_followings(self):
+    def get_number_of_followings(self) -> int:

        """
        return number of followings
        """

        info = self.get_json()
        following = info['edge_follow']['count']
        return following

    def get_posts(self):
Member

Suggested change
-    def get_posts(self):
+    def get_number_of_posts(self) -> int:

        """
        return number of posts
        """

        info = self.get_json()
        posts = info['edge_owner_to_timeline_media']['count']
        return posts

    def get_biography(self):
Member

Suggested change
-    def get_biography(self):
+    def get_biography(self) -> str:

        """
        return biography of user
        """

        info = self.get_json()
        bio = info['biography']
        return bio

    def get_fullname(self):
        """
        return fullname of the user
        """

        info = self.get_json()
        fullname = info['full_name']
        return fullname

    def get_username(self):
        """
        return the username of the user
        """

        info = self.get_json()
        username = info['username']
        return username

    def get_profile_pic(self):
        """
        return the link of profile picture
        """

        info = self.get_json()
        pic = info['profile_pic_url_hd']
        return pic

    def get_website(self):
        """
        return the user's website link
        """

        info = self.get_json()
        external_url = info['external_url']
        return external_url

    def get_email(self):
        """
        return the email id of user if
        available
        """

        info = self.get_json()
        return info['business_email']

    def is_verified(self):
        """
        check the user is verified
        """

        info = self.get_json()
        return info['is_verified']

    def is_private(self):
        """
        check user is private
        """

        info = self.get_json()
        return info['is_private']


def html_1(soup):
    """
    parse the html type-1 of instagram
    page
    """

    scripts = soup.find_all('script')
    main_scripts = scripts[4]
    data = main_scripts.contents[0]
    info_object = data[data.find('{"config"'):-1]
    info = json.loads(info_object)
    info = info['entry_data']['ProfilePage'][0]['graphql']['user']
    return info


def html_2(soup):
    """
    if html_1 fails, html_2 in action
    parse the html type-2 of instagram
    page
    """
    scripts = soup.find_all('script')
    main_scripts = scripts[3]
    data = main_scripts.contents[0]
    info_object = data[data.find('{"config"'):-1]
    info = json.loads(info_object)
    info = info['entry_data']['ProfilePage'][0]['graphql']['user']
    return info


user = Instagram('github')
print(user.is_verified())
print(user.get_biography())
Member

Code that is at global scope will be run by our Travis CI / pytest process as discussed in CONTRIBUTING.md.

Suggested change
-user = Instagram('github')
-print(user.is_verified())
-print(user.get_biography())
+if __name__ == '__main__':
+    user = Instagram('github')
+    print(f"{user.is_verified() = }")
+    print(f"{user.get_biography() = }")
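A small offline illustration of the pattern in the suggestion above, using a made-up `is_even` function in place of the network-backed class. The guarded block runs only when the file is executed directly, so a test runner that imports the module does not trigger it. The `=` specifier inside an f-string requires Python 3.8 or later.

```python
def is_even(number: int) -> bool:
    """
    >>> is_even(4)
    True
    """
    return number % 2 == 0


if __name__ == "__main__":
    # Runs only when the file is executed directly, not when it is imported.
    print(f"{is_even(10) = }")
```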