In this post, I'm using `selenium` to demonstrate how to scrape a JavaScript-enabled page. If you have some experience of using Python for web scraping, you have probably already heard of `beautifulsoup` and `urllib`. By using the following code, we will be able to see the HTML and then use HTML tags to extract the desired elements. However, if the web page is embedded with JavaScript, you will notice that some of the HTML elements can't be seen by Beautiful Soup, because they are rendered by the JavaScript. Instead, you will only see the `<script>` tags, which indicate where the JavaScript code is placed. The desired HTML elements are rendered from the `<script>`, so an alternative is needed for this page.

Procedures of Web Scraping using Selenium
1. Prerequisite
- download the Chrome driver from here
- the current stable version is 76.0.3809.126
- choose your operating system (mac/windows/linux)
- extract the webdriver to `CHROME_DRIVER` (e.g. `./chromedriver`)
2. Launch the Chrome Driver
Use `selenium` to launch a Chrome browser by calling `webdriver.Chrome()`. A blank Chrome window should pop up. Now, let's load the page we want to extract.
Use `driver.quit()` to close the browser when you are done with testing.
3. Parse the Webpage
`selenium` provides multiple ways to locate elements in the HTML. By using Chrome Developer Tools (Chrome > More tools > Developer tools), we can easily locate the HTML elements. For example, we're going to extract the link of Details, so we point at the HTML element and copy its XPath location. In `selenium`, we can call `find_elements_by_xpath` to extract all elements matching an XPath pattern. It's worth noticing that this XPath pattern is too specific and only returns the first link instead of all the links. Therefore we need to generalize the XPath pattern to capture all the links.
Let's trace back up the levels of the XPath. Instead of using `tr[1]` to extract the first row, we use `*[contains(@role,'row')]` to capture all the rows with `role='row'`. Then, in each row element, we use the `td/a` XPath to locate the `<a>` tags
. Because the number of links is relatively big, a `tqdm` progress bar is also added to show the progress of the extraction.

4. Save the Data
Finally, we can save the links to a CSV file for later use.
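Steps 3 and 4 can be sketched end to end without a browser, using the standard library's ElementTree on a tiny invented table (a stand-in for the page Selenium would render; note that ElementTree's limited XPath only supports an exact attribute match, where Selenium's full XPath used `contains`):

```python
import csv
import io
import xml.etree.ElementTree as ET

# A tiny stand-in for the rendered page (markup invented for illustration).
HTML = """
<table>
  <tr role="row"><td><a href="/details/1">Details</a></td></tr>
  <tr role="row"><td><a href="/details/2">Details</a></td></tr>
</table>
"""

root = ET.fromstring(HTML)

# Generalized pattern: every element with role="row", then td/a inside it.
links = [a.get("href")
         for row in root.findall(".//*[@role='row']")
         for a in row.findall("td/a")]

# Step 4: save the links to CSV. An in-memory buffer is used here;
# for a real file use open("links.csv", "w", newline="").
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["link"])
writer.writerows([link] for link in links)

print(links)  # -> ['/details/1', '/details/2']
```

In the real script the list comprehension would run over `driver.find_elements_by_xpath(...)` results instead of ElementTree nodes, with `tqdm` wrapped around the outer loop.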
In this post we share how to perform web scraping of a JS-rendered website. The tools, as seen in the header, are Java with the Selenium library driving headless Chrome instances (download driver) and JSoup as the parser to fetch data from the acquired HTML.
You can view the code on GitHub.
ChromeDriver initialization
I have added some arguments to chromeOptions in the code. The driver threw exceptions without them.
Getting an instance of the JSoup `Document` class: we can get the rendered page out of ChromeDriver with `driver.getPageSource()` and hand the HTML string to `Jsoup.parse(...)`.
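As an illustration of the same handoff, the rendered page source string fed into a parser, here in Python with the standard library's `HTMLParser` standing in for JSoup (with a live driver the input would be `driver.page_source`; the sample HTML is invented):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect href attributes from <a> tags (a stdlib stand-in for a JSoup Document)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

# With a live driver this would be: collector.feed(driver.page_source)
collector = LinkCollector()
collector.feed('<p><a href="/program/1">Details</a></p>')
print(collector.links)  # -> ['/program/1']
```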
Main class (ScrapeData class)
The main work is done in the ScrapeData class, which implements the Runnable interface. Basic actions in the method run:
- visit category pages
- get links to program pages
- scrape data from program pages
- save data in database
The class constructor accepts a link to a site category page and a page number to start from.
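The real class is Java and lives in the linked repo; here is a Python sketch of the same shape, a runnable task holding a category link and a start page, with the four steps above as placeholder methods (all bodies are invented stubs):

```python
import threading

class ScrapeData(threading.Thread):
    """Sketch of the post's Runnable: one task per category page.

    The real class is Java; the method bodies below are placeholders
    for the Selenium/JSoup/MySQL work described in the post.
    """

    def __init__(self, category_url: str, start_page: int = 1):
        super().__init__()
        self.category_url = category_url
        self.start_page = start_page
        self.saved = []

    def run(self):
        for page in self.visit_category_pages():
            for link in self.get_program_links(page):
                data = self.scrape_program_page(link)
                self.save(data)

    def visit_category_pages(self):
        yield f"{self.category_url}?page={self.start_page}"

    def get_program_links(self, page_url):
        return [page_url + "/program/1"]

    def scrape_program_page(self, link):
        return {"url": link}

    def save(self, data):
        self.saved.append(data)

task = ScrapeData("https://example.org/category", start_page=2)
task.start()
task.join()
print(task.saved)
```

The `Runnable` shape matters because the post runs several of these tasks concurrently, one per category page.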
Class StudyPortalsData
The StudyPortalsData class stores the data of a single page.
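As a sketch, such a record type might look like the following (the field names are guesses for illustration; the real Java class defines its own):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class StudyPortalsData:
    # Field names are illustrative guesses, not copied from the repo.
    url: Optional[str] = None
    title: Optional[str] = None
    tuition_fee: Optional[str] = None
    duration: Optional[str] = None

record = StudyPortalsData(url="https://example.org/program/1", title="MSc Example")
print(record.title)  # -> MSc Example
```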
Class ScrapeStudyPortals
The ScrapeStudyPortals class and its main method, scrapeAllDataJSoup, retrieve the data from the current page.
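A hypothetical sketch of that per-page step (the real method uses JSoup selectors; the `<h1>` regex and the sample page below are invented for illustration):

```python
import re

def scrape_all_data(html: str) -> dict:
    """Pull a field out of one program page.

    Sketch only: the real scrapeAllDataJSoup walks the JSoup Document;
    this regex stands in for a single selector.
    """
    title = re.search(r"<h1[^>]*>(.*?)</h1>", html, re.S)
    return {"title": title.group(1).strip() if title else None}

page = "<html><body><h1> MSc Example </h1></body></html>"
print(scrape_all_data(page))  # -> {'title': 'MSc Example'}
```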
Class DataBase
The DataBase class saves data from an instance of StudyPortalsData to the database using the insertStudyPortalsData method. The third-party MySQL Connector/J library is used to connect to the database.
Methods of the DataBase class generally return a DataBase.Status as a result.
Several of these Status values are used in the isDataFull(StudyPortalsData studyPortalsData) method, which checks that all required fields contain data.
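A Python sketch of that contract, with `sqlite3` standing in for MySQL Connector/J (the table name, required fields, and Status values are invented for illustration):

```python
import sqlite3
from enum import Enum

class Status(Enum):
    """Stands in for the post's DataBase.Status (values are invented)."""
    OK = "ok"
    INCOMPLETE = "incomplete"
    ERROR = "error"

class DataBase:
    REQUIRED = ("url", "title")  # invented required fields

    def __init__(self, conn):
        self.conn = conn
        conn.execute("CREATE TABLE IF NOT EXISTS study_portals (url TEXT, title TEXT)")

    def is_data_full(self, data: dict) -> bool:
        """Check that every required field holds data (the isDataFull idea)."""
        return all(data.get(field) for field in self.REQUIRED)

    def insert_study_portals_data(self, data: dict) -> Status:
        if not self.is_data_full(data):
            return Status.INCOMPLETE
        try:
            self.conn.execute(
                "INSERT INTO study_portals (url, title) VALUES (?, ?)",
                (data["url"], data["title"]),
            )
            return Status.OK
        except sqlite3.Error:
            return Status.ERROR

db = DataBase(sqlite3.connect(":memory:"))
print(db.insert_study_portals_data({"url": "https://example.org/p/1", "title": "MSc Example"}))
print(db.insert_study_portals_data({"url": None, "title": "missing url"}))
```

Returning a status enum instead of raising lets the scraping loop log incomplete pages and keep going, which matches the post's description of the method's role.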