Create instagram_crawler.py #2508

Closed
wants to merge 4 commits into from
Conversation

yogeshwaran01
Contributor

An algorithm that crawls a user's Instagram page and scrapes the data.

Describe your change:

  • Add an algorithm?
  • Fix a bug or typo in an existing algorithm?
  • Documentation change?

Checklist:

  • I have read CONTRIBUTING.md.
  • This pull request is all my own work -- I have not plagiarized.
  • I know that pull requests will not be merged if they fail the automated tests.
  • This PR only changes one algorithm file. To ease review, please open separate PRs for separate algorithms.
  • All new Python files are placed inside an existing directory.
  • All filenames are in all lowercase characters with no spaces or dashes.
  • All functions and variable names follow Python naming conventions.
  • All function parameters and return values are annotated with Python type hints.
  • All functions have doctests that pass the automated testing.
  • All new algorithms have a URL in its comments that points to Wikipedia or other similar explanation.
  • If this pull request resolves one or more open issues then the commit message contains Fixes: #{$ISSUE_NO}.

It crawls the user's Instagram page and scrapes the data.
Member

@cclauss cclauss left a comment

Cool submission!!

from bs4 import BeautifulSoup
import json

headers = \
Member

No backslashes please in Python code. See PEP8. The problem with backslashes is that whitespace to the right of the backslash breaks the script on a change that is invisible to the reader.

You can run your code through psf/black to autofix that issue as discussed in CONTRIBUTING.md
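
To illustrate the backslash concern, here is a minimal sketch (not code from this PR). It compiles the same statement with and without a space after the backslash, then shows the parenthesized form that avoids the problem. The source strings and variable names are made up for the demonstration.

```python
# Same statement twice; the second has one space after the backslash.
good_source = "x = 1 + \\\n    2\n"
bad_source = "x = 1 + \\ \n    2\n"

compile(good_source, "<demo>", "exec")  # compiles without complaint

try:
    compile(bad_source, "<demo>", "exec")
    trailing_space_breaks = False
except SyntaxError:
    # The invisible trailing space makes the continuation invalid.
    trailing_space_breaks = True

# Parentheses give implicit continuation, so trailing spaces are harmless.
safe_source = "x = (1 + \n    2)\n"
namespace = {}
exec(compile(safe_source, "<demo>", "exec"), namespace)

print(trailing_space_breaks, namespace["x"])
```

Wrapping the expression in parentheses, brackets, or braces is the form PEP 8 prefers.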


def __init__(self, username):
    self.username = username
    self.url = 'https://www.instagram.com/{}/'.format(username)
Member

Suggested change
- self.url = 'https://www.instagram.com/{}/'.format(username)
+ self.url = f'https://www.instagram.com/{username}/'

As discussed in CONTRIBUTING.md, please use f-strings where they make sense.

Member

@cclauss cclauss Sep 29, 2020

Do we really want to query the backend (Instagram) for every one of the fields below? Most of these values are very static and are unlikely to change minute-to-minute. Perhaps it would be better to have a self.user_data dict that contained the results of self.get_json() and the other methods could use that data.
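
A rough sketch of that idea, assuming the profile data can be fetched once in the constructor. The class name `CachedInstagramUser`, the `fetch` parameter, and `fake_fetch` are hypothetical stand-ins so the example runs without touching Instagram. In the PR the data would come from `self.get_json()`.

```python
from typing import Callable


class CachedInstagramUser:
    """Sketch: fetch the profile data once and reuse it in every method."""

    def __init__(self, username: str, fetch: Callable[[str], dict]) -> None:
        self.username = username
        self.url = f"https://www.instagram.com/{username}/"
        # Single backend call; `fetch` is a hypothetical stand-in for get_json().
        self.user_data = fetch(self.url)

    def get_followers(self) -> int:
        return self.user_data["edge_followed_by"]["count"]

    def get_followings(self) -> int:
        return self.user_data["edge_follow"]["count"]


calls = []


def fake_fetch(url: str) -> dict:
    # Records each call so we can see how often the backend would be hit.
    calls.append(url)
    return {"edge_followed_by": {"count": 10}, "edge_follow": {"count": 3}}


user = CachedInstagramUser("github", fake_fetch)
print(user.get_followers(), user.get_followings(), len(calls))
```

Both getters read from the same dict, so the backend is queried once no matter how many fields are requested.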

Comment on lines 39 to 43
    info = html_1(soup)
    return info
except:
    info = html_2(soup)
    return info
Member

Suggested change
-     info = html_1(soup)
-     return info
- except:
-     info = html_2(soup)
-     return info
+     return html_1(soup)
+ except: # <-- This repo does not accept bare excepts
+     return html_2(soup)

Bare excepts are discussed in PEP8 and https://realpython.com/the-most-diabolical-python-antipattern/
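
A small sketch of the alternative to a bare except. The thread does not show which exceptions `html_1` raises, so the parsers and the exception types below are assumptions chosen for illustration.

```python
import json


def parse_primary(text: str) -> dict:
    # Hypothetical first parser: expects valid JSON.
    return json.loads(text)


def parse_fallback(text: str) -> dict:
    # Hypothetical fallback: wraps the raw text instead of parsing it.
    return {"raw": text}


def parse(text: str) -> dict:
    try:
        return parse_primary(text)
    except (json.JSONDecodeError, KeyError):
        # Only the failures we expect trigger the fallback. Anything else
        # (KeyboardInterrupt, a NameError from a typo, ...) still surfaces.
        return parse_fallback(text)


print(parse('{"a": 1}'))
print(parse("not json"))
```

Naming the exception types keeps genuine bugs visible instead of silently routing them to the fallback.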

info = html_2(soup)
return info

def get_followers(self):
Member

Let's make this (and similar methods below) into a @property so that we can use the syntax instagram_user.number_of_followers (without the ()).

Suggested change
- def get_followers(self):
+ @property
+ def number_of_followers(self) -> int:

Also, let's streamline to a one-line implementation like:

    return self.get_json()['edge_followed_by']['count']
    # or...
    return self.data['edge_followed_by']['count']
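
For reference, a self-contained sketch of the `@property` form. The class name `InstagramProfile` and the sample dict are hypothetical. `data` stands in for whatever `get_json()` returns.

```python
class InstagramProfile:
    def __init__(self, data: dict) -> None:
        # `data` stands in for the result of get_json(); sample values only.
        self.data = data

    @property
    def number_of_followers(self) -> int:
        return self.data["edge_followed_by"]["count"]


instagram_user = InstagramProfile({"edge_followed_by": {"count": 42}})
# Attribute-style access, no parentheses needed.
print(instagram_user.number_of_followers)
```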

Member

In general, do not create a variable that you get rid of on the very next line unless the variable name really helps the reader understand something nonobvious.

followers = info['edge_followed_by']['count']
return followers

def get_followings(self):
Member

Suggested change
- def get_followings(self):
+ def get_number_of_followings(self) -> int:

following = info['edge_follow']['count']
return following

def get_posts(self):
Member

Suggested change
- def get_posts(self):
+ def get_number_of_posts(self) -> int:

posts = info['edge_owner_to_timeline_media']['count']
return posts

def get_biography(self):
Member

Suggested change
- def get_biography(self):
+ def get_biography(self) -> str:

Comment on lines 173 to 175
user = Instagram('github')
print(user.is_verified())
print(user.get_biography())
Member

Code that is at global scope will be run by our Travis CI / pytest process as discussed in CONTRIBUTING.md.

Suggested change
- user = Instagram('github')
- print(user.is_verified())
- print(user.get_biography())
+ if __name__ == '__main__':
+     user = Instagram('github')
+     print(f"{user.is_verified() = }")
+     print(f"{user.get_biography() = }")
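
A minimal sketch of why the guard matters. The `greet` function is a hypothetical stand-in for the network-backed methods in the PR. When a test runner imports the module, only the function definition executes. The demo call runs solely when the file is executed directly.

```python
def greet(name: str) -> str:
    """
    >>> greet("github")
    'Hello, github'
    """
    return f"Hello, {name}"


if __name__ == "__main__":
    # Skipped on import (e.g. by pytest or doctest); runs only as a script.
    # The "=" specifier in f-strings needs Python 3.8 or newer.
    print(f"{greet('github') = }")
```

With the guard in place, importing the module for testing never triggers the demo code at the bottom.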

Contributor Author

@yogeshwaran01 yogeshwaran01 left a comment

I have made some of the changes suggested by @cclauss.

{
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}
headers = {
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
Member

Suggested change
- "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
+ "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) "
+ "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"

@TravisBuddy

Hey @yogeshwaran01,
Something went wrong with the build.

TravisCI finished with status errored, which means the build failed because of something unrelated to the tests, such as a problem with a dependency or the build process itself.

View build log

TravisBuddy Request Identifier: 303e78f0-024d-11eb-aba2-872ffb2742c8

@cclauss
Member

cclauss commented Sep 29, 2020

Closing in favor of #2509

@cclauss cclauss closed this Sep 29, 2020