Web scraping systems: getting images from Google images using Python and Selenium; an applied approach

Andres Prieto
10 min read · May 26, 2021


Hey there, I hope you liked part 1. This time I want to show you how to make something a little more complicated: this article is about web scraping focused on images, but I also want to show you a couple of cool things you can do to automate the searching process, cool? I have to say that I will assume you have read my previous article (basically to keep things shorter), so I am free to mix what we already know with a couple more things we will use; well, let's check.

Quick overview

As we saw in the previous article, there are many things we need to make a web scraper work; this time we will use a couple of extras.

Quick review

  • I said “quick”.
  • Improving our basics to make this work better.

Learning the new stuff

  • Generalizing the model.
  • Analyzing the target.
  • Keeping our query results.

Making it quiet

  • Making our model have “modes”.
  • Unbreakable.

Let’s introduce Unix

  • Improving.

That’s all for now (Summary)

References

I will start with a small summary, but this is going to be quick, just to recall those particular things about environments. If you do not understand something, just go to part one; surely what you need is there.

Quick review

I said “quick”

As you may remember, we are doing this using virtual environments, so create your virtual environment, place your geckodriver in ./<your_environment_name>/bin/, then start the environment using source ./<your_environment_name>/bin/activate; after that, install selenium and create your index.py. Check this in this image (remember that I use Firefox):

What I told you to do
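In case the image is hard to read, the terminal steps are roughly these; "scraping" is just an example environment name, and the geckodriver download location is an assumption:

```bash
# Create the environment and put geckodriver inside its bin/ directory
python -m venv scraping
mv ~/Downloads/geckodriver ./scraping/bin/

# Activate the environment, install selenium, create the script
source ./scraping/bin/activate
pip install selenium
touch index.py
```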

Once you have it, import selenium into the index.py file; it's just the same as before.

Improving our basics to make this work better

If you read the last part of the first article, I told you how to use a proxy to avoid restrictions and how to skip the navigator instance. I could teach you how to do all of those things using object orientation, but I think the topics we are going to cover are complicated enough, so let's use a functional programming approach to encapsulate the computational complexity in small functions. Instead of creating a "Scraper()" object, I'm using a function called "create_scrapper()"; then I can send the driver and return selenium's object. I could specify how to create the instance using variables, but I am not doing that for this example.

Generalizing scrapers creation
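A minimal sketch of that function, in the selenium 3 style the article was written against (the geckodriver path is an assumption):

```python
from selenium import webdriver

def create_scrapper(navigator, path):
    # Build and return the selenium driver for whatever navigator we receive
    return navigator(executable_path=path)

# Firefox example: "path" must point to the matching geckodriver
scrapper = create_scrapper(webdriver.Firefox, "./scraping/bin/geckodriver")
```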

It is important to know that I am sending the Firefox object because I use Firefox as my navigator, so the "path" variable should point to the corresponding geckodriver; if you want to be able to use multiple navigators, you can set the "path" variable using a parameter or a simple condition. This is the first improvement I want to show you, and also an introduction to the things we are going to generalize.

Learning the new stuff

Generalizing the model

To generalize this, there are a couple of things we need to know, but I need you to see something first: I'm adding some imports to make this work, so look at this image.

New imports
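In case the screenshot is hard to read, the imports should look roughly like this (the exact aliases are my assumption):

```python
# "Basic" imports: standard library, no pip install needed
from os import path as os_path
from sys import argv
from random import choice
from io import BytesIO

# Installed with pip
from requests import get
from selenium import webdriver
from PIL import Image
```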

The "Basic" section refers to those imports that do not need to be installed using pip. We will use "path" to build the file paths where the images will be saved; the images themselves will live in byte objects created with BytesIO. "argv" will be used for something cool that I will explain at the end. I'm importing "choice" because I want to get different images each time I make a query; "choice" will select just a couple of images to download from the whole set. From the "requests" library I am importing the "get()" function because this scraper will use it to download the images, using the URLs contained inside Google images' pages. The last new thing there is "PIL"; that library comes with an object called "Image", which will be used to build the image files from the byte objects created with BytesIO.

Once you have seen this, don't worry, we will see the implementation with the new imports in a moment. The first issue has already been solved: generalizing the scraper creation. Do you remember the JS animation? That's going to be the second issue; going to Google images and getting the image URLs, fetching the queries, and downloading the images using their URLs are going to be the rest. Let's solve a couple of them. We can use the same animation the previous scraper used; this time Google images is kind of an infinite scroll, but that's not going to be a problem, so we can use a function (remember that we are using a functional approach) that receives the scraper instance and runs the animation. Let's check.

Using JS animation in functions
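A minimal sketch of such a function; the number of scrolls and the pause between them are my choice, not the article's:

```python
from time import sleep

def scroll(scrapper, times=5, pause=2):
    # Jump to the bottom of the page repeatedly so the infinite
    # scroll keeps loading more images
    for _ in range(times):
        scrapper.execute_script(
            "window.scrollTo(0, document.body.scrollHeight);")
        sleep(pause)
```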

Now, let's analyze our target. As you know, our target this time is Google images; because of that, I want to show you more of the things we can do with selenium. Let's continue.

Analyzing the target

I need you to take into consideration that Google manipulates a lot of its URLs, so you may not understand the query I'm using at first; go to Google images and check the URL.

Shortened link

This is the way Google images displays its default URL; it looks quite normal, right? Well, we can express it in another way, see the next image.

Extended link

It may be a little difficult to see the URL in the image; it is this one:

https://www.google.com/search?safe=off&site=&tbm=isch&source=hp&q={q}&oq={q}&gs_l=img

It is different, but you can see that it loads Google images; otherwise, there's a mistake. Alternatively, you could make queries the previous way, using the Google search bar as we did with GitHub's search bar.

Query after being fetched

When we make a query it should look like this; let's check our target before seeing the code implementation.

Extended query after being fetched

Notice two things: first, the extended query in the URL; second, the penguins right there are going to be our target (yes, I like penguins). As you can see, the <img> tag has a CSS class called "Q4LuWd", so we can identify the images returned by the query "Antarctic penguins" with that class, but we have to take into consideration a particularity about Google images.

Analyzing the <img> tag classes

Look carefully: the <img> tags have the class "Q4LuWd", but that's a class, meaning all the images right there share it; we can build our query like this.

Fetching queries
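A sketch of that query function, wired to the pieces explained below; the name "fetch_query" and the "n_images" parameter are assumptions of mine, and scroll() and get_urls() are the helpers sketched in the other snippets:

```python
from random import choice

google_images = ("https://www.google.com/search?safe=off&site=&tbm=isch"
                 "&source=hp&q={q}&oq={q}&gs_l=img")

def fetch_query(query_scrapper, query, n_images=5):
    # Load Google images with the query already inside the URL
    query_scrapper.get(google_images.format(q=query))
    scroll(query_scrapper)

    # Every thumbnail shares the "Q4LuWd" class
    res = query_scrapper.find_elements_by_class_name("Q4LuWd")
    urls = get_urls(query_scrapper, res)

    # Randomly pick up to n_images URLs, without repeating any
    images = []
    while urls and len(images) < n_images:
        url = choice(tuple(urls))
        urls.discard(url)
        images.append(url)
    return images
```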

Let's explain this implementation. The "google_images" variable is the URL I showed you before, with another variable inside the string: "q" means the query to fetch. What we are doing here is putting our "query" parameter into the alternative URL by formatting the string; then the get() method of the scraper instance "query_scrapper" sends the full URL to be searched, and after that we use the scroll() function that we saw in "Generalizing the model". Next, "res" means response: as we saw in the first article, "res" gets the images using the class owned by the <img> tags, but what "res" contains has to be analyzed to get the URLs; that's why it is necessary to use "get_urls()" (we will see what "get_urls()" does in the next piece of code). Once the URLs have been obtained, we use choice() to randomly pick a number of images, appending each URL to the images array without repeating; finally, we return the images array.

Analyzing the individual <img> tag class

As I said, it is important to analyze the images to get the URLs; for that reason, "get_urls()" uses the "n3VNCb" class. This class is added to the images that have been clicked, so it is necessary to click the images first; we can do that using the click() method over the thumbnails. That's why the function walks over a slice of the images that were found, clicking them; then we use the find_element() method to get the full-size image, and we check whether the image source is safe by checking that the source string contains the HTTP protocol. If it does, we add that source to the "images_urls" set; at the end, the function returns those URLs. See the implementation below.

Getting URL images
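A sketch following that description; the "limit" parameter that cuts the thumbnail list down is an assumption of mine:

```python
def get_urls(scrapper, thumbnails, limit=20):
    images_urls = set()
    for thumbnail in thumbnails[:limit]:
        try:
            # Clicking a thumbnail opens the preview, whose full-size
            # <img> tag carries the "n3VNCb" class
            thumbnail.click()
            image = scrapper.find_element_by_class_name("n3VNCb")
            src = image.get_attribute("src")
            # Keep only real HTTP sources (thumbnails use base64 data)
            if src and "http" in src:
                images_urls.add(src)
        except Exception:
            continue
    return images_urls
```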

Once we have solved those issues, there is only one missing, and that one is keeping what our scraper fetched; we have already gotten the URLs, but we haven't downloaded anything yet. Let's see how to do that.

Keeping our query results

Downloading answers
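A sketch of the downloading function following the description below; the imports repeat the ones shown earlier:

```python
from io import BytesIO
from os import path as os_path

from requests import get
from PIL import Image

def download_res(urls, path):
    for cont, url in enumerate(urls):
        # First block: the download itself can fail, so skip on error
        try:
            content = get(url).content
        except Exception:
            continue
        # Second block: turn the downloaded bytes into a .jpg on disk
        try:
            image_file = BytesIO(content)
            image = Image.open(image_file).convert("RGB")
            file_path = os_path.join(path, "imagen_" + str(cont) + ".jpg")
            with open(file_path, "wb") as file:
                image.save(file, "JPEG")
        except Exception:
            continue
```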

This is the last part of the basic implementation. For each URL in the "urls" array, this function uses try-except blocks to download the image using the "get()" function from the requests library. This way of downloading content is simple, but there are a couple of things to take into consideration. First, the try-except blocks are there because we are manipulating information that comes from the internet; because of that, it can fail, which means our function has to be prepared to do something if that happens (in this implementation, it just continues). Knowing that, the "get()" function only needs the URL to download the content, and the ".content" attribute specifies that we just need the image bytes; that's what the first try-except block does.

The second try-except block is there to manage our files. "BytesIO()" creates a byte object (that's what images are for Python); then we use "Image.open(image_file).convert('RGB')" to create a new image as a file, which we keep in the "image" variable. To build a path to the file we use the "os" library (as "os_path"), which provides a simple way to use OS facilities; here our function uses the ".join()" method to create the path we need to access the images. The "path" parameter is used for that; mainly we use it to keep our images wherever we want. Then "imagen_" + str(cont) creates the image file names using the counter variable, like "imagen_0.jpg", "imagen_1.jpg", etc. (".jpg" is the extension we are using); finally, the last two lines create and save the files.

Making it quiet

What I want is to execute this scraper as a Unix command; that's why I want to add "modes". The easiest way to do that is the following.

Making our model have “modes”

I want to see what's going on behind the scraper while it is fetching my queries; that behavior can be added using a condition like this.

Adding a mode to see what is going on while fetching

Also, it might be good to see whether the scraper managed to download all the images.

Adding a mode to see what is going on while downloading
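One simple way to sketch both modes at once is a flag read from the command line plus a tiny helper that only prints when the flag is absent; the "--quiet" flag name and the "log" helper are hypothetical:

```python
from sys import argv

QUIET = "--quiet" in argv  # hypothetical flag name

def log(*messages):
    # Print progress only when the scraper is not in quiet mode
    if not QUIET:
        print(*messages)

# Inside fetch_query():   log("Fetching:", query)
# Inside download_res():  log("Downloaded:", url)
```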

Unbreakable

If I'm using Unix commands, I want to have the images in a concrete directory, but that directory might not exist; so we can use the "os" library to create that specific directory. That's simple: just check if it exists, and if it doesn't, create that path.

Creating a folder if it does not exist
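A minimal sketch of that check; the function name is my own:

```python
from os import makedirs, path as os_path

def ensure_directory(directory):
    # Create the target directory only if it does not exist yet
    if not os_path.exists(directory):
        makedirs(directory)
```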

That's all we need to make this scraper work, but... didn't I forget something? How can I run all this? Ohhh... yes, that is what I forgot... Well, I might not show you that... (Check what this can do first, then you choose whether you want to use the "sys" library and the "argv" import.)

Let’s introduce Unix

Improving

Using a simple command to fetch a default query
Using a simple command to fetch a concrete query
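Roughly, those two commands look like this; the concrete query is just the article's penguin example:

```bash
# Fetch the default query ("Baby cats")
python index.py

# Fetch a concrete query
python index.py "Antarctic penguins"
```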

That’s all for now (Summary)

The full code
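In case the screenshot is hard to read, here is a sketch of a main block in the spirit of the description below, reusing the names from the earlier snippets; the desktop folder name and the glue code are assumptions:

```python
if __name__ == "__main__":
    # argv management: everything after the script name is the query,
    # with "Baby cats" as the default
    args = [arg for arg in argv[1:] if not arg.startswith("--")]
    query = " ".join(args) if args else "Baby cats"

    scrapper = create_scrapper(webdriver.Firefox,
                               "./scraping/bin/geckodriver")
    res_urls = fetch_query(scrapper, query, n_images=5)

    # A folder inside the desktop to keep the downloads
    folder = os_path.join(os_path.expanduser("~"), "Desktop", "images")
    ensure_directory(folder)
    download_res(res_urls, folder)

    scrapper.quit()
```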

Yes... there's the way to run this, as you saw. Our default query is "Baby cats": if you do not append a query after "python index.py", that's what the scraper will fetch; if you do, it will fetch your query. That behavior comes from the scraper's main function implementation, so let's check the code: after the "argv" management, we tell the scraper which navigator we are using, "res_urls" holds the URLs of the images we want, and finally "download_res()" takes those URLs and downloads all the images into a folder inside the desktop; then we quit the scraper...

I hope you enjoyed reading; it was difficult to explain everything this scraper does (mainly because it is a lot of information), but I hope you understood. If you didn't, check my GitHub; there's a step-by-step explanation of this and the previous scraper, although it is in Spanish... I hope that is not an impediment; otherwise, use it as you want. See you next time.

References

Check my references; I read a lot to write this article and the previous one, so I think it is a good idea to share what helped me.

What is Web Scraping and What is it Used For? | Definition and Examples EXPLAINED (ParseHub) URL

The Selenium Browser Automation Project (Selenium) URL

Web Scraping with Python — A Beginner’s Guide in 2021 (Christoph Leitner) URL

Image Scraping with Python (Fabian Bosler) URL



Andres Prieto

I'm a web developer and a student of systems and computational engineering who teaches himself as much as possible; I have used GNU/Linux since I met it, and I love sharing what I learn.