Create instagram_crawler.py #2508
It crawls the user's Instagram page and scrapes the data.
Cool submission!!
web_programming/instagram_crawler.py
Outdated
from bs4 import BeautifulSoup | ||
import json | ||
|
||
headers = \ |
No backslashes please in Python code. See PEP8. The problem with backslashes is that whitespace to the right of the backslash breaks the script on a change that is invisible to the reader.
You can run your code through psf/black to autofix that issue as discussed in CONTRIBUTING.md
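A minimal sketch of the backslash-free form, in case it helps. The User-Agent value here is a shortened placeholder rather than the one from the PR, and `total` is only an illustrative name:

```python
# Brackets already let a statement span several lines, so no backslash is needed.
headers = {
    "User-Agent": "Mozilla/5.0 (placeholder user agent)",  # placeholder value
}

# The same applies to long expressions: wrap them in parentheses.
total = (
    1
    + 2
    + 3
)
```

Trailing whitespace after any of these lines is harmless, which is exactly what the backslash form cannot guarantee.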
web_programming/instagram_crawler.py
Outdated
|
||
def __init__(self, username): | ||
self.username = username | ||
self.url = 'https://www.instagram.com/{}/'.format(username) |
self.url = 'https://www.instagram.com/{}/'.format(username) | |
self.url = f'https://www.instagram.com/{username}/' |
As discussed in CONTRIBUTING.md, please use f-strings where they make sense.
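For reference, a small sketch showing that both spellings build the same URL; `old_style` and `new_style` are illustrative names only:

```python
username = "github"

# str.format() and an f-string produce identical strings here;
# the f-string simply keeps the variable next to where it is used.
old_style = "https://www.instagram.com/{}/".format(username)
new_style = f"https://www.instagram.com/{username}/"

print(old_style == new_style)
```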
Do we really want to query the backend (Instagram) for every one of the fields below? Most of these values are fairly static and are unlikely to change minute-to-minute. Perhaps it would be better to have a self.user_data dict that holds the result of self.get_json(), so that the other methods can use that data.
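One possible shape for this, as a rough sketch rather than the required design. The `fetch` parameter and `fake_fetch` are hypothetical stand-ins for the PR's `get_json()` request so the example runs without network access, and the `"biography"` / `"is_verified"` keys are assumed, not taken from Instagram's actual payload:

```python
from typing import Callable


class InstagramUser:
    """Sketch: fetch the profile data once and reuse it."""

    def __init__(self, username: str, fetch: Callable[[str], dict]) -> None:
        # `fetch` is a hypothetical stand-in for the PR's get_json() request.
        self.username = username
        self.url = f"https://www.instagram.com/{username}/"
        self.user_data = fetch(self.url)  # the only backend query

    def get_biography(self) -> str:
        return self.user_data["biography"]  # assumed key

    def is_verified(self) -> bool:
        return self.user_data["is_verified"]  # assumed key


calls = []


def fake_fetch(url: str) -> dict:
    # Records each call so we can see how often the backend would be hit.
    calls.append(url)
    return {"biography": "Built for developers.", "is_verified": True}


user = InstagramUser("github", fake_fetch)
user.get_biography()
user.is_verified()
print(len(calls))  # 1: both methods reused the cached dict
```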
web_programming/instagram_crawler.py
Outdated
info = html_1(soup) | ||
return info | ||
except: | ||
info = html_2(soup) | ||
return info |
info = html_1(soup) | |
return info | |
except: | |
info = html_2(soup) | |
return info | |
return html_1(soup) | |
except: # <-- This repo does not accept bare excepts | |
return html_2(soup) |
Bare excepts are discussed in PEP8 and https://realpython.com/the-most-diabolical-python-antipattern/
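A sketch of what catching a specific exception could look like. The bodies of `html_1` and `html_2` below are invented for illustration (the real helpers parse a BeautifulSoup object), so the exception type to catch in the PR may differ:

```python
import json


def html_1(text: str) -> dict:
    # Hypothetical first parser: expects the whole text to be JSON.
    return json.loads(text)


def html_2(text: str) -> dict:
    # Hypothetical fallback parser: expects JSON after a "data=" prefix.
    return json.loads(text.split("data=", 1)[1])


def get_json(text: str) -> dict:
    try:
        return html_1(text)
    except json.JSONDecodeError:
        # Only the failure we expect from html_1 triggers the fallback;
        # KeyboardInterrupt, SystemExit, and genuine bugs still propagate.
        return html_2(text)
```

With a bare `except:`, a typo inside `html_1` would silently fall through to `html_2` and hide the real error.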
web_programming/instagram_crawler.py
Outdated
info = html_2(soup) | ||
return info | ||
|
||
def get_followers(self): |
Let's make this (and similar methods below) into a @property so that we can use the syntax instagram_user.number_of_followers (without the ()).
def get_followers(self): | |
@property | |
def number_of_followers(self) -> int: |
Also, let's streamline to a one-line implementation like:
return self.get_json()['edge_followed_by']['count']
# or...
return self.data['edge_followed_by']['count']
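Put together, the property version might look like the sketch below. The constructor taking a ready-made `data` dict is an assumption made to keep the example self-contained, and the follower count is made up:

```python
class InstagramUser:
    def __init__(self, data: dict) -> None:
        # `data` stands in for the parsed profile JSON (assumed shape).
        self.data = data

    @property
    def number_of_followers(self) -> int:
        return self.data["edge_followed_by"]["count"]


instagram_user = InstagramUser({"edge_followed_by": {"count": 42}})
print(instagram_user.number_of_followers)  # attribute-style access, no ()
```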
In general, do not create a variable that you get rid of on the very next line unless the variable name really helps the reader understand something nonobvious.
web_programming/instagram_crawler.py
Outdated
followers = info['edge_followed_by']['count'] | ||
return followers | ||
|
||
def get_followings(self): |
def get_followings(self): | |
def get_number_of_followings(self) -> int: |
web_programming/instagram_crawler.py
Outdated
following = info['edge_follow']['count'] | ||
return following | ||
|
||
def get_posts(self): |
def get_posts(self): | |
def get_number_of_posts(self) -> int: |
web_programming/instagram_crawler.py
Outdated
posts = info['edge_owner_to_timeline_media']['count'] | ||
return posts | ||
|
||
def get_biography(self): |
def get_biography(self): | |
def get_biography(self) -> str: |
web_programming/instagram_crawler.py
Outdated
user = Instagram('github') | ||
print(user.is_verified()) | ||
print(user.get_biography()) |
Code that is at global scope will be run by our Travis CI / pytest process as discussed in CONTRIBUTING.md.
user = Instagram('github') | |
print(user.is_verified()) | |
print(user.get_biography()) | |
if __name__ == '__main__': | |
user = Instagram('github') | |
print(f"{user.is_verified() = }") | |
print(f"{user.get_biography() = }") |
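A self-contained sketch of the guard, for illustration. `biography()` is a hypothetical stand-in for a call that would otherwise hit Instagram, so importing this module (as pytest does) triggers no request:

```python
def biography() -> str:
    # Hypothetical stand-in for a call that would hit the network.
    return "Built for developers."


def main() -> None:
    # The "=" specifier (Python 3.8+) prints the expression and its value.
    print(f"{biography() = }")


if __name__ == "__main__":
    # Runs only when the file is executed directly, not when it is imported.
    main()
```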
Co-authored-by: Christian Clauss <[email protected]>
Some changes have been made as requested by @cclauss.
{ | ||
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'} | ||
headers = { | ||
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36" |
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36" | |
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) " | |
"AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36" |
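As a runnable sketch of the suggestion: adjacent string literals are concatenated by Python itself, so the long value can be split without `+` or a backslash.

```python
headers = {
    # Adjacent string literals are joined at compile time; note the trailing
    # space inside the first literal so the two parts do not run together.
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
}

print(headers["User-Agent"])
```

The one thing to watch is the space at the split point: dropping it would silently produce `10_12_5)AppleWebKit/...`.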
Hey @yogeshwaran01, TravisCI finished with status TravisBuddy Request Identifier: 303e78f0-024d-11eb-aba2-872ffb2742c8 |
Closing in favor of #2509
An algorithm that crawls the user's Instagram page and scrapes the data.