Web Scraping Using Selenium And Beautifulsoup Python



Why is Python used for web scraping? Python has become the most popular language for web scraping for a number of reasons. These include its flexibility, ease of coding, dynamic typing, large collection of libraries to manipulate data, and support for the most common scraping tools, such as Scrapy, Beautiful Soup, and Selenium. This repository explains the rationale for web scraping in python. I have implemented few basic examples using selenium, have a dekko at it! This repo covers approximately 1% of the entire python web scraping. My motive is to get you familiar with the tools that python provides if you forsee your career as a Data Engineer. To align with terms, web scraping, also known as web harvesting, or web data extraction is data scraping used for data extraction from websites. The web scraping script may access the url directly using HTTP requests or through simulating a web browser. The second approach is exactly how selenium works – it simulates a web browser.

This article was also featured at Towards Data Science here.

Motivation

Web
  • Be sure to check out Selenium’s webdriver page to understand more on how to use Selenium. Personally, I use Firefox mainly (geckodriver). Personally, I use Firefox mainly (geckodriver). Sudo pip install beautifulsoup4 Beautifulsoup is great at parsing HTML data, it’s python methods are very intuitive when navigating a HTML tree.
  • Our tools will be Python and awesome packages like requests, BeautifulSoup, and Selenium. When Should You Use Web Scraping? Web scraping is the practice of automatically fetching the content of web pages designed for interaction with human users, parsing them, and extracting some information (possibly navigating links to other pages).

I’m currently looking for a new data science role, but have found it frustrating that there are so many different websites, which list different jobs and at different times. It was becoming laborious to continually check each website to see what new roles had been posted.

Web scraping using selenium and beautiful soup python code

But then I remembered; I’m a data scientist. There must be an easier way to automate this process. So I decided to create a pipeline, which involved the following steps, and to automate part of the process using Python:

1. Extract all new job postings at a regular interval

I decided to write some Python code to web-scrape jobs from the websites I was checking the most. This is what I’ll outline in this post.

2. Check the new job postings against my skills and interests

I could have tried to automate this, but I didn’t want to risk disregarding a job that may be of interest because of the criteria that I put into the code. I decided to manually review the postings that my web-scraping returned.

3. Spend one session a week applying for the new jobs that made the cut.

I have heard about people automating this stage. However, I believe chances will be better if you take the time and effort to make tailored applications, and somewhat disagree with this approach on principle.

NOTE: all code discussed in this post is available here.I’ve also shared my favourite resources for learning everything mentioned in this post at the bottom.

I decided to use BeautifulSoup and, if needed, Selenium so my import statements were the following:

One of the main sites I was checking for data science jobs was Indeed.co.uk.

(1) Extracting the initial HTML data

I was pleased to see that they had a standardised format for URL, which would make the web scraping easier. Download connect port devices driver. The URL was ‘indeed.co.uk/jobs?’ followed by ‘q=job title’ & ‘l=location’ — as below:

This made tailoring the job title and location pretty easy. I decided to create a function that would take in job title and location as arguments, so that I anybody could tailor the search:

Using the urlencode function from urllib enabled me to slot the arguments in to create the full url. I included ‘fromage=list’ and ‘sort=date’ within the URL, so that only the most recent jobs were displayed.

And

Then, with the handy help of BeautifulSoup, I could extract the HTML and parse it appropriately.

Finally, I wanted to find the the appropriate <div> that contained all of the job listings. I found this by opening the URL (indeed.co.uk/jobs?q=data+scientist&l=london) and using the ‘Inspect’ element. Using this, I could see that <td id=“resultsCol”> contained all of the job listings, so I used soup.find(id=“resultsCol”) to select all of these jobs.

(2) Extracting job details

Now that I had the ‘soup’ of HTML containing all the job listings, the next step was to extract the information I wanted, which were:

Web scraping using python
  • The job titles
  • The companies
  • The link to the full job profile
  • The date is was listed

For each of these, I again used Inspect to identify the appropriate section, and used the .find() function to identify them, as follows:

(3) Iterating over each job listing

Using ‘Inspect’ I saw that each job card was contained within a div with the class ‘jobsearch-SerpJobCard’, so I used BeautifulSoup’s .find_all function as follows:

Then, for each card I wanted to extract the 4 pieces of key information listed above and save them in a list.

I wanted to make my function generalisable, so that people could choose which characteristics they wanted to search for (out of job titles, companies, link and date listed), so I created a list ‘desired_characs’ to specify this.

For each characteristic, I looped over and added them to a list as follows:

Finally, I brought all of these into a jobs_list, which could then be exported to the chosen format — an Excel, DataFrame or otherwise:

Using ‘cols’ enabled me to specify the titles for each key for the jobs_list dictionary, based on the characteristics that were extracted.

‘extracted_info’ was a list of lists; each list containing, for example, all the job titles or all the companies.

Using these data structures made compiling the final jobs_list dictionary much easier.

(4) Storing and saving jobs

I converted the ‘jobs_list’ dictionary into a DataFrame and then exported it to the filename and filetype that the user selects with the following function:

(5) Integrating into a single function call

Finally, I wanted users to be do all of the above with a single function call. I did so as follows:

Satisfying, this produced a final result which could be called pretty easily, as follows:

The next step was to generalise my script to also take in job listings from other websites. Another site I’ve been searching a lot was CWjobs. However, adding this proved more of a challenge.

When I inspected the URL, I noticed that there wasn’t a consistent pattern with the keyword arguments.

Therefore, I decided I would use a Selenium Webdriver to interact with the website — to enter the job title and location specified, and to retrieve the search results.

(1) Downloading and initiating the driver

I use Google Chrome, so I downloaded the appropriate web driver from here and added it to my working directory. I then created a function to initiate the driver as follows:

(if using an alternative browser, you would need to download the relevant driver, and ensure it has the name and location as specified above)

(2) Using the driver to extract the job information HTML ‘soup’

Using

(3) Extracting the job information

The next step was the same as Steps 2 and 3 from the Indeed job scraping above, only tailored to CWjobs’ DOM structure, so I won’t go over it again here.

Once I finished the code, it was pretty cool seeing my web browser being controlled, without me touching the mouse or keyboard:

I was pretty happy with my functioning code, and decided to stop there (I did actually have to apply for the jobs after all).

A quick search on GitHub revealed that people had already made similar job scrapers for LinkedIn Jobs and a few other platforms.

I uploaded the code to a GitHub repository with a READme incase anybody else wanted to scrape jobs using this code.

Then I sat back and felt pretty happy with myself. Now, I just run the scripts once a week, and then pick out the jobs that I want to apply for.

If I’m honest, the time and effort gains from using the script rather than doing things manually are fairly marginal. However, I had a lot of fun writing the code and got a bit more practice with BeautifulSoup and Selenium.

As a potential next step, I may set this up on Amazon Lambda, so that it does the new job search automatically once a week without me needing to. So there’s a possible future post coming on this if I do end up doing it.

Web Scraping Using Selenium And Beautiful Soup Python Code

I hope this was helpful/interesting!

Python 3 Web Scraping

https://realpython.com/beautiful-soup-web-scraper-python/ — a great overview / refresher on web scraping with Beautiful Soup

https://towardsdatascience.com/looking-for-a-house-build-a-web-scraper-to-help-you-5ab25badc83e — another interesting use case for a web scraper: to identify potential houses to buy

https://towardsdatascience.com/controlling-the-web-with-python-6fceb22c5f08 — a great introduction of Selenium

https://www.youtube.com/watch?v=--vqRAkcWoM — a great demo of selenium in use

https://github.com/kirkhunter/linkedin-jobs-scraper — a LinkedIn scraper, which demonstrates both selenium and beautiful soup





Comments are closed.