Web scraping systems: getting information from GitHub repositories using Python Selenium, an applied approach

Andres Prieto
13 min read · May 22, 2021


I want to show you how to build a web scraping system. In this article you will learn how to use Python and Selenium to get information from GitHub with a text scraping system, so let’s check the table of contents.

Quick overview

This is what we need to make this work; these are also the sections we will cover in this article.

Basics and prerequisites

  • What is “Selenium”? Why Python? What is scraping? What am I doing?
  • Basic knowledge about CSS selectors and HTML tags.

Setting up our scraping system

  • Preinstallation.
  • Virtual environment.

Text scraping system

  • Starting by imports.
  • Choosing a target.
  • Getting information.
  • Starting with JS animations (Improving searching process).
  • Advanced features.

Check the entire code

  • Summary.

We have a lot of information to cover, so let’s start by asking a couple of questions…

Basics and prerequisites

The first thing I want you to understand is that this is not a basic Python tutorial; we will cover different aspects of Python, libraries, request models and web technologies. Let’s start with the question that inspired me to learn about this:

“How can I do web scraping? And what is web scraping itself?”

What is “Selenium”? Why Python? What is scraping? What am I doing?

In my case, I started like this, asking myself: “What is scraping?”. So I started to read about it and discovered that web scraping is a simple implementation of this model:

Scraping system, general model.

Note: I made the images, because I don’t want a copyright claim.

The black box represents the code that I needed to type into my Neovim. Then I started to search for scraping libraries and realized that Python is used for machine learning, data science and all that cool stuff we know exists but don’t know how to code, and that is how I found something magic called “Selenium”: a library that lets us do web scraping. Maybe it is not the best option, but that’s what I found. Selenium describes itself as:

“Primarily it is for automating web applications for testing purposes, but is certainly not limited to just that. Boring web-based administration tasks can (and should) also be automated as well”.

As you can read, it does not set a concrete limit, and scraping sits in that gray area between testing and hacking. Now you may be wondering: “Why are we using this?” or “What are we going to do?” Both questions can be answered by my GitHub profile (I’m a web developer), so CSS selectors and HTML tags are easier for me to use than other options. Speaking of which:

Basic knowledge about CSS selectors and HTML tags

I need you to understand this: CSS is just a set of rules you may use, like something called “classes” or another thing called “identifiers”; both are “CSS selectors”. In this case, “classes” are selected with the dot character (.) and “identifiers” are selected with the hash character (#). Then, HTML tags are the many tags that structure entire websites; the most common tags we are going to use are <div>, <p>, <a> and <img>, which mean “container”, “paragraph”, “anchor” and “image” respectively. You can also search for more CSS selectors here and check a list of HTML tags here.

You already know the basics; if you did not understand something, this is a summary: we are going to create two scraping systems using Python. We will use Python instead of other options because it is a multipurpose language that is used for machine learning, and because of a library called “Selenium” that is going to provide us with scraping utilities; that library uses CSS selectors and HTML tags to get information from web pages. Clear? Here we go.

Setting up our scraping system

This is a disclaimer: from here on we are coding, so I will try to be clear, but this can be hard to follow.

Preinstallation

Step zero: I need to import Selenium, but first I want to make a Python environment in order to make my code portable and keep the Selenium driver in the same folder. Take a look at this slide from Selenium’s documentation:

Selenium browser drivers documentation.

It means two things. First, we have to select from this list the web driver corresponding to the browser we are using; it will be saved as “geckodriver” (in Firefox’s case), and then we would need to place it in our $PATH system variable. The Selenium documentation provides a guide for doing this, but that approach is less portable, so I prefer the “longer” way: we will create a variable called “path” that has to point to the folder and file of “geckodriver”. The second thing is harder to understand if you are not used to this: the $PATH variable is a system variable used to tell the system where things are.

Virtual environment

Honestly, I hate modifying system variables (and doing so would make this article even longer), so I am showing you the most portable way to do this. Let’s start by installing python-virtualenv, which is going to provide us with virtual environments for Python.

Installation of python-virtualenv on Arch Linux.

This allows us to create a project that contains only what the project needs to work. Likewise, this virtual environment provides a folder called “bin” where we can place our “geckodriver”, which avoids touching the $PATH variable.

Using “virtualenv text_scrapper” we create a new Python 3.9 virtual environment called “text_scrapper”. This environment is a folder, so we can access it using “cd text_scrapper”; then, if we list its contents using “ls”, we should see something like this:

Virtual environment folder distribution.

Note: I am using a zsh theme called p10k, so your terminal may look different; that does not matter, just look at the folders.

To start our virtual environment we have to source the “activate” file contained in the ./bin folder.

Activating the virtual environment.

It is important to understand that while your virtual environment is running (activated), every package you install will be installed into that environment. In this case, even if you move to another folder and run “pip install selenium”, the selenium package will be installed into the “text_scrapper” environment and not into the global packages as usual.

Deactivating environment.

To stop our environment you have to type “deactivate”; this command will stop the environment, and you will be able to install Python packages globally again. Don’t stop your environment until you stop developing; if you do, your project will not work.

Now that we have set up our environment and it is running, we can install Selenium and the corresponding driver. I will start by downloading the Firefox Selenium driver; I am downloading this one because I use Firefox. Remember that you have to download the web driver that corresponds to your browser, so check the list.

Downloading selenium web drivers.

Now I have my web driver in “Escritorio” (my desktop folder), and I have to move the geckodriver to “text_scrapper/bin”; I can use the “mv” command to do that:

Moving web driver to environment bin.

Once I have done this, I should see the geckodriver inside the environment’s bin folder, just like this.

geckodriver inside ./bin folder.

It is important to get this done; we will use it in the next step.
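To recap the whole setup, here is roughly the sequence of commands described above (the paths are from my machine and the Firefox driver is only an example; use your own browser’s driver and folders):

```
# Create and enter the virtual environment
virtualenv text_scrapper
cd text_scrapper

# Activate it (stop it later with: deactivate)
source ./bin/activate

# Install Selenium inside the environment
pip install selenium

# Move the downloaded driver into the environment's bin folder
mv ~/Escritorio/geckodriver ./bin/
```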

Text scraping system

Step one: make sure you have completed all the points above before starting this part. Well done. Start your virtual environment (I moved mine and renamed it, keeping it in another folder with the whole project, but the steps you have to follow are the same): source your ./bin/activate, like this:

Sourcing ./bin/activate.

Now my environment is running and I can install Selenium using pip; but before we start with that, I want you to notice that my environment is “scrapper” and my project is “text_scrapping”. Then install Selenium using “pip install selenium”.

Installing selenium.

We have installed Selenium in our environment, but we do not have any Python file yet, so let’s make an “index.py”. I will use Neovim as my code editor; you can use another one if you want.

Creating the index.py file.

Starting by imports

Inside our Python file it is necessary to import the Selenium modules, but it is unnecessary to use the whole library for web scraping, so we will use just the “selenium.webdriver” module. Let’s take a look.

Necessary imports from selenium library.

Well, it currently looks too complicated, but I am going to explain what we are importing (a sketch of these imports follows the list):

  • “import Firefox”: remember that I am importing Firefox instead of another driver because I use Firefox as my browser; if you use a different one, check the list and import that one.
  • “.common.by” helps us use CSS selectors and HTML tags to get information from the different components that we scrape.
  • “.common.keys”: this utility emulates keyboard behavior.
  • “.support.ui”: it is common for some pages to have lazy loaders; for those cases we use “WebDriverWait”, which is going to tell our scraper instance to “wait” until something happens.
  • “.support”: since we tell our scraper to “wait”, we need at least one condition to indicate when the scraper can continue its job; those conditions are the “ecs”.
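As a reference, the import section probably looks something like this (a sketch based on the points above, using the Selenium 3.x module paths this article relies on):

```python
from selenium.webdriver import Firefox                        # the Firefox driver class
from selenium.webdriver.common.by import By                   # lookups by CSS selector, tag, id...
from selenium.webdriver.common.keys import Keys               # keyboard emulation (e.g. ENTER)
from selenium.webdriver.support.ui import WebDriverWait       # explicit waits
from selenium.webdriver.support import expected_conditions as ecs  # the "ecs" wait conditions
```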

That is what we need to create our first scraper, a text scraper. What I want to do with this scraper is get information from GitHub about a particular topic, for example my favorite language, “JS”. The idea is to send a query to be searched through the GitHub search bar, then get the URL and the title of those particular repositories and print them in the terminal; something really simple.

Choosing a target

As I said before, the “target" of this scraper is going to be GitHub, if you don’t know what GitHub is, you just need to know that there is a lot of information related to software in that beautiful page, so; step three, go to your browser an search “github.com”, you should see something like this:

GitHub main page.

Now, we need to identify where the search bar is, so let’s take a look.

GitHub search bar.

Well, there it is, so now we can type our query there; but… does it look a little bit too easy? Yes, you are right: we have to tell the scraper all those things, also send the query, then find a way to get the information and, if that were not enough, we do not have a scraper yet!

Getting information

Step four, building the scraper to get information. Okay, we have the imports, but they have not been used yet. First we will use our environment to make our life easier, as I told you: it is time to create the path variable, so type “path = './<your_environment_name>/bin/geckodriver'”. Then create an instance of the Firefox browser (in your case it can be Chrome or something else) using “web_scrapper = Firefox()”; but we need to pass our path variable as a parameter, so we use the “executable_path” parameter: “web_scrapper = Firefox(executable_path=path)”.

Scraper instance.
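Since the image may be hard to read, that step looks roughly like this (a sketch; the environment name is just the one used in this article, and executable_path is the Selenium 3.x parameter described above):

```python
# Path to the geckodriver stored inside the virtual environment's bin folder
path = './text_scrapper/bin/geckodriver'

# Create the Firefox instance controlled by Selenium
web_scrapper = Firefox(executable_path=path)
```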

There you go; if you run it, you should get something like this:

Instance of Firefox created by Selenium.

As you may have noticed, what the scraper loads is not GitHub; this line solves that.

Using scraper.get() to send scraper to GitHub.
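In code, it is simply the driver’s get() method pointed at GitHub:

```python
# Navigate the scraper to GitHub's main page
web_scrapper.get('https://github.com')
```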

Let’s implement the query. In order to build the query, the scraper needs to identify the search bar tag and the search bar classes or identifier; basically, the scraper has to “see” it. So we indicate where the bar is using “By”, and we send the query using the send_keys() method. Let’s take a look.

Using inspector to check the search bar.

Use ctrl + shift + i to open the inspector, or go to the options and open the “inspector” or “developer tools”. Once you have done that, use the inspector to click on the search bar; as you can see, the HTML tag is an <input> with class="form-control input-sm header-search-input jump-to-field js-jump-to-field js-site-search-focus".

HTML tag and classes owned by GitHub search bar.

Once we identify the identifier, classes and tags, it is possible to build an expression that makes the Selenium scraper identify where it has to send the query; then we can use the find_element() method to find the search bar. Let’s try to make it.

Identifying the search bar with the scraper.
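A sketch of that lookup (the selector is an assumption built from one of the classes found with the inspector above; any class that uniquely matches the search bar works):

```python
# Locate GitHub's search bar by one of its CSS classes
search_bar = web_scrapper.find_element(By.CSS_SELECTOR, 'input.header-search-input')
```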

If it works, we should not see anything different yet, so let’s make the scraper send the query. The send_keys() method can emulate what we would write using the keyboard, and queries are submitted with the Enter key, which can also be emulated with send_keys(). Let’s see the code.

Sending queries using send_keys() method.
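Roughly, it is two calls on the element we just found (the query “javascript” is the example used later in this article):

```python
# Type the query into the search bar and press Enter to submit it
search_bar.send_keys('javascript')
search_bar.send_keys(Keys.ENTER)
```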

If you append this to your code, you will see a big difference in your scraper, so let’s try it in our small example. We will see something like this.

Sending queries to GitHub search bar using the scraper.

Once we send the query, we can run into some trouble; the most important issue is the network connection: it may vary, it is not constant, so we can use the “ecs” (expected conditions). To set them up, it is useful to check with the inspector what we got in the previous picture.

Analyzing the query.

This can be hard to follow, but take it easy. What I want to get is the “href” URL with its respective text. All of that is inside an <a> tag, the <a> tag is inside two <div> tags, that <div> tag is inside a <li> tag, and that <li> is inside a <ul> tag. Okay, that sounds like a riddle, so let’s make it easier by drawing it:

Understanding the scraper.

That looks quite simple: when the <ul> tag appears, we can be sure that the <li> tags are there. It can be a little bit weird for you if you do not have any background in web development, but it is easy: if the scraper waits for the <ul> tag to appear, it will be able to collect the information that I want. So let’s implement that.

Using “ecs” to make scraper wait for <ul> tag to appear.
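A sketch of that wait (the “repo-list” class comes from the results page inspected above; the 10-second timeout is my own choice, not something fixed by the article):

```python
# Wait until the results list (<ul class="repo-list">) is present in the page
WebDriverWait(web_scrapper, 10).until(
    ecs.presence_of_element_located((By.CSS_SELECTOR, 'ul.repo-list'))
)
```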

This time, the scraper will wait until the <ul> tag with the class “repo-list” appears; check the “Analyzing the query” image to confirm that the <ul> tag has that class.

Now that you understand a lot about scrapers, there is one more thing we have to do: getting the information from our query, which is quite simple.

Using a CSS selector to get the <a> tag.

Just use another CSS selector to get the <a> tags we have scraped.
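For example, something along these lines (the selector is an assumption based on the <ul>/<li>/<a> structure described above; the exact one in the image may differ):

```python
# Collect every repository link inside the results list
repo_links = web_scrapper.find_elements(By.CSS_SELECTOR, 'ul.repo-list li a')
```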

Showing the information.
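Since the image above shows it as a screenshot, here is a sketch of that loop (each element’s text is the repository name, and its href attribute is the URL):

```python
# Print the title and URL of every repository found
for link in repo_links:
    print(link.text, link.get_attribute('href'))
```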

Finally, we can show the information using that for loop (remember that the query I sent was “javascript”); let’s see what the scraper found:

What the scraper found.

There you go, we have created a web scraper; but there are a couple of things we can improve, so let’s check them out.

Starting with JS animations (Improving searching process)

Have you ever scrolled a page like Google Images? Sometimes, when your Wi-Fi connection is not the best, some images do not load; the same may happen to the scraper. It can be solved using JavaScript: Selenium lets its scrapers execute JavaScript, so let’s use one small script to take the top of the page and scroll it down to the bottom.

Using JS to scroll the query.
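A sketch of that scroll using execute_script() (the exact script in the article’s image may differ slightly):

```python
# Scroll from the top of the results page down to the bottom
web_scrapper.execute_script('window.scrollTo(0, document.body.scrollHeight);')
```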

We can also do other crazy things, like switching pages in the GitHub results, but that is more complicated and I want to show you other things about Selenium; check the advanced features.

Advanced features

I want to start with good practices: generally we want to close the scraper instance when we are done, and we do this using the quit() method.

Quitting the scraper.
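It is a single call:

```python
# Close the browser instance and end the WebDriver session
web_scrapper.quit()
```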

This is going to close the Firefox (or whichever browser you are using) instance.

We can use the scraper options to skip opening a visible browser window (headless mode). If you want that behavior (those visible browser windows are generally useful only while developing the scraper, because they consume RAM and so on), you have to import the options. Let’s do that.

Importing Firefox options.

Note that if you are using another browser you have to import that browser’s options. Then we can create an Options() object and use headless mode to tell the scraper that we do not want to see the browser window.

Running the scraper in headless mode.
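Putting both images together, the headless setup looks roughly like this (a sketch in the Selenium 3.x style used throughout the article):

```python
from selenium.webdriver.firefox.options import Options

options = Options()
options.headless = True  # do not open a visible browser window

# Recreate the scraper with those options
web_scrapper = Firefox(executable_path=path, options=options)
```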

One last feature: we can make the scraper use a proxy. This improves searching; it is not necessary on GitHub, but on other web pages, YouTube for example, it can avoid country restrictions.

Using proxy.
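One way to do this with Firefox is through its proxy preferences (a sketch; the host and port are hypothetical placeholders, and the article’s image may configure the proxy differently):

```python
from selenium.webdriver.firefox.options import Options

options = Options()
options.set_preference('network.proxy.type', 1)                    # 1 = manual proxy configuration
options.set_preference('network.proxy.http', 'proxy.example.com')  # hypothetical proxy host
options.set_preference('network.proxy.http_port', 3128)            # hypothetical proxy port
options.set_preference('network.proxy.ssl', 'proxy.example.com')
options.set_preference('network.proxy.ssl_port', 3128)

web_scrapper = Firefox(executable_path=path, options=options)
```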

Check the entire code
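Since this section’s embedded code may not render outside Medium, here is a consolidated sketch of the scraper built throughout this article (the selectors and the 10-second timeout are the assumptions noted in the previous sections):

```python
from selenium.webdriver import Firefox
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as ecs

# Path to the geckodriver placed inside the virtual environment's bin folder
path = './text_scrapper/bin/geckodriver'

# Create the browser instance and open GitHub
web_scrapper = Firefox(executable_path=path)
web_scrapper.get('https://github.com')

# Find the search bar, type the query and press Enter
search_bar = web_scrapper.find_element(By.CSS_SELECTOR, 'input.header-search-input')
search_bar.send_keys('javascript')
search_bar.send_keys(Keys.ENTER)

# Wait until the results list appears
WebDriverWait(web_scrapper, 10).until(
    ecs.presence_of_element_located((By.CSS_SELECTOR, 'ul.repo-list'))
)

# Scroll to the bottom so lazy-loaded results are rendered
web_scrapper.execute_script('window.scrollTo(0, document.body.scrollHeight);')

# Print the title and URL of every repository found
for link in web_scrapper.find_elements(By.CSS_SELECTOR, 'ul.repo-list li a'):
    print(link.text, link.get_attribute('href'))

# Close the browser
web_scrapper.quit()
```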

Summary

Damn, this is too long, don’t you think? Well, are you ready for more? Check part 2.


Andres Prieto

I'm a web developer and a student of systems and computational engineering who teaches himself as much as possible. I have loved GNU/Linux since I met it, and I love sharing what I learn.