Today I’m going to share my personal experience of how I got into web scraping and picked up a little programming along the way.

It was around March when my MD called a sudden meeting. He wanted to retrieve an old website of ours that held a lot of data, around 2000 pages, but we couldn’t get access to it. So he planned to scrape our own website and asked for a volunteer to take it on. We had a trainee developer, but he had no knowledge of scraping either. Then an idea struck me: “What if I do it?” I wasn’t sure he would give me the project, because I had no programming experience and I’m a Literature graduate, but I wanted to at least give it a try. Thankfully, he agreed and suggested using Python with BeautifulSoup.

So I started the work. The first day was interesting, but later it began to feel boring, and I caught myself wondering, “How do programmers and developers do this all day?” I had promised him I would do it, so I kept trying, but I couldn’t hit my target with Python. Later on I started working from home, and thankfully my sister, who works in an IT company, came home for a visit. She decided to help me with the scraping, but this time she said, “We can try doing it with Java.” Compared to Python, I found Java easier and more helpful, and even as a Literature graduate I managed to do it myself with my sister’s guidance. I’m sure this article will help you scrape data even if you have no programming knowledge. So let’s get into it.
Before starting, let us see what we need for this process.
What do you have to learn or know first?
- Basic Java programming.
- A little bit about Eclipse IDE and Selenium.
What is Selenium WebDriver and why do we use it?
Selenium is used to communicate with web browsers. Simply put, it makes the browser perform the actions we tell it to. For example, if you want to scrape the title from a website, it will open the website, read the title, and bring it back to our program. From there, the program can save the data in the format we ask for. If you want the result in an Excel sheet, for example, the program gathers the data and writes it to the sheet once the process is completed.
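To make the "result in an Excel sheet" idea concrete, here is a minimal sketch of the saving step only. It assumes the scraped values are already in memory and writes them as a CSV file, a plain-text format that Excel can open. The class name CsvExport, the file name titles.csv, and the sample rows are all made up for illustration. It uses only standard Java, not Selenium.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

class CsvExport {

    // Wrap a field in quotes if it contains a comma, a quote, or a line break
    // (the usual CSV convention); quotes inside the field are doubled.
    static String escape(String field) {
        if (field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r")) {
            return "\"" + field.replace("\"", "\"\"") + "\"";
        }
        return field;
    }

    // Turn one row of values into one CSV line.
    static String toCsvLine(List<String> row) {
        List<String> escaped = new ArrayList<>();
        for (String field : row) {
            escaped.add(escape(field));
        }
        return String.join(",", escaped);
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical rows; in a real run these would come from the scraper.
        List<String> lines = new ArrayList<>();
        lines.add(toCsvLine(List.of("url", "title")));
        lines.add(toCsvLine(List.of("https://example.com/page-1", "Hello, world")));

        Path out = Path.of("titles.csv"); // hypothetical output file name
        Files.write(out, lines, StandardCharsets.UTF_8);
        System.out.println("Wrote " + lines.size() + " lines to " + out.toAbsolutePath());
    }
}
```

Double-clicking the resulting file should open it in Excel with one column for the URL and one for the title.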
What are the software and libraries that we have to install?
- Java SE
- Eclipse IDE
- Selenium WebDriver
Now we have to download and install the software, and then configure it. So let’s see what to do. Firstly,
- Install Java SE, Eclipse IDE, and Selenium WebDriver.
- Configure Eclipse IDE with Selenium.
- Add the Selenium library files.
Let me show you a simple way to do this.
1. JDK Installation
Things to do,
- Install JDK (Java Development Kit).
- Configure environment variables.
- Install JRE (Java Runtime Environment), if your JDK does not already include it.
Installing Process
- Select the latest version of the JDK and click on the download link that matches your system configuration. Run the installer once it has downloaded.
- Then, to configure the environment variables, go to Control Panel and select Advanced system settings. On the Advanced tab, choose Environment Variables.
- Here you have to set a new path for Java, so click on New, enter the Variable name (JAVA_HOME is a common choice) and the Variable value (the folder path where Java is installed, such as C:\Program Files\Java), and click on OK.
- After that, check whether Java is installed properly. Open Command Prompt and type the command java -version. If the installation succeeded, you will see the Java version.
Note: The latest versions of the Java Development Kit come with a built-in JRE (Java Runtime Environment).
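If you would like to confirm the installation from inside Java as well, a tiny program like the one below can help. It is only a sketch, and the class name JavaCheck is made up. It prints the version and install folder of whichever Java runtime is running it, using the standard system properties java.version and java.home.

```java
class JavaCheck {

    // Describe the Java runtime that is running this program.
    static String describeRuntime() {
        return "Java " + System.getProperty("java.version")
                + " installed at " + System.getProperty("java.home");
    }

    public static void main(String[] args) {
        System.out.println(describeRuntime());
    }
}
```

If the folder it prints is not the one you expected, the environment variables are probably pointing at a different Java installation.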
2. Eclipse Installation
Things to do,
- Download Eclipse for JEE Developers (Java Enterprise Edition).
- Extract the file and run the application as administrator.
Installing Process
- Download the Eclipse file that matches your system configuration.
- Locate the file and unzip it. You can create a shortcut on the desktop or open it from the folder.
- Launch Eclipse and check whether it is working.
3. Selenium Installation
Now let us see how to install Selenium and configure it in Eclipse.
Things to do,
- Download the Selenium Server and the Java client.
- Download the Google Chrome driver.
Configuring Process
- First, download Selenium Server and move the file to the C drive.
- On the same page, you will see Selenium Client & WebDriver Language Bindings. From there, download the Java client.
- Next, create a folder named Selenium (for quick reference) and move the Selenium Server file and the Java client folder into it.
- To run the program you need a browser driver (for Google Chrome or Mozilla Firefox), so download the driver for your browser. (It is linked from the same page where you downloaded the Selenium files. Scroll down the page and click on Browsers, then choose your browser by clicking on its documentation link.)
- Move the browser driver to the Selenium folder.
4. Creating a new Project and Adding Selenium Library files
- Create a New Project.
- Add a new Package and a class.
- Add Selenium Library files
Adding Process
- Open Eclipse and click on File -> New -> Java Project.
- Give the project a name (e.g. Scraping or Selenium Project) and click on Next -> Finish.
- Now you have to create a new package, so right-click on the src folder in the left sidebar and choose New -> Package.
- Give the package a name (e.g. SampleFile), then right-click on the package, choose New -> Class, and give the class a name.
- The system library files will be added to the project automatically. After that, right-click on the library entry (e.g. JRE System Library) and choose Build Path -> Configure Build Path.
- In the window that opens, click on Add External JARs. Then browse to the folder where you saved the Java client (e.g. C:\Program Files\Selenium\selenium-java), select the client library files, and click on Apply and Close.
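Before writing any scraping code, it can be worth checking that Eclipse really picked up the Selenium JARs. The sketch below is one way to do that, assuming you run it inside the same project. The class name ClasspathCheck is made up. It simply asks Java whether the org.openqa.selenium.WebDriver class can be found.

```java
class ClasspathCheck {

    // True if a class with this fully qualified name can be found.
    // The class is looked up but not initialized.
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String seleniumClass = "org.openqa.selenium.WebDriver";
        if (isOnClasspath(seleniumClass)) {
            System.out.println("Selenium was found on the build path.");
        } else {
            System.out.println("Selenium was NOT found. Re-check the Add External JARs step.");
        }
    }
}
```

If it reports that Selenium was not found, the JAR files were most likely added to a different project or not applied.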
Here is a simple code example.
package NewPackage;

import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;

public class SampleClass {

    public static void main(String[] args) {
        // Tell Selenium where chromedriver.exe is saved (change this path to match your computer)
        System.setProperty("webdriver.chrome.driver", "C:\\Users\\User1\\Desktop\\chrome\\chromedriver.exe");

        // Declaration and instantiation of objects/variables
        WebDriver driver = new ChromeDriver();
        String baseUrl = "https://www.wikipedia.org/";
        String expectedTitle = "Wikipedia";
        String actualTitle = "";

        // Launch Chrome and direct it to the base URL
        driver.get(baseUrl);

        // Get the actual value of the title
        actualTitle = driver.getTitle();

        /*
         * Compare the actual title of the page with the expected one
         * and print the result as "Test Passed!" or "Test Failed"
         * (the comparison is case-sensitive)
         */
        if (actualTitle.contentEquals(expectedTitle)) {
            System.out.println("Test Passed!");
        } else {
            System.out.println("Test Failed");
        }

        // Close the Chrome window
        driver.close();
    }
}
Explaining the code
1. Importing Packages
import org.openqa.selenium.WebDriver: This is the WebDriver interface, the type we use to control a browser (Google Chrome, Firefox, etc.). To create a Chrome browser you also need import org.openqa.selenium.chrome.ChromeDriver.
If you want to do more on a site, you have to import more packages.
2. Declaring objects and variables
WebDriver driver = new ChromeDriver();
This creates a driver object, so the Java program will control the Chrome browser through the Chrome driver.
Here, baseUrl, expectedTitle, and actualTitle are the variables.
3. Initiating a Browser Session
The browser session starts when the ChromeDriver object is created. The line below then navigates the browser to the base URL.
driver.get(baseUrl);
4. Get the actual Title
actualTitle = driver.getTitle();
This will get you the page’s title, that is, the text shown on the browser tab.
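To see what "the title" refers to without opening a browser, here is a small illustration in plain Java. It is not how Selenium works internally; it only pulls the text between the <title> tags out of a sample HTML string with a simple pattern, which is good enough for a demo but not for real pages. The class name TitleSketch and both helper methods are made up for this example. It also shows that a contentEquals comparison is case-sensitive, so "WIKIPEDIA" and "Wikipedia" do not match unless you ignore case.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TitleSketch {

    // Matches the text between <title> and </title>, in any letter case.
    private static final Pattern TITLE = Pattern.compile(
            "<title[^>]*>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Return the trimmed title text, or an empty string if there is no title tag.
    static String extractTitle(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group(1).trim() : "";
    }

    // Compare two titles while ignoring letter case and surrounding spaces.
    static boolean titleMatches(String actual, String expected) {
        return actual.trim().equalsIgnoreCase(expected.trim());
    }

    public static void main(String[] args) {
        // Hypothetical page source, standing in for a real page.
        String html = "<html><head><title> Wikipedia </title></head><body></body></html>";
        String title = extractTitle(html);

        System.out.println("Title: " + title);
        System.out.println("Exact match with WIKIPEDIA: " + title.contentEquals("WIKIPEDIA"));
        System.out.println("Match ignoring case: " + titleMatches(title, "WIKIPEDIA"));
    }
}
```

In the real scraper, driver.getTitle() does the extraction for you, so only a comparison helper like titleMatches would carry over.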
5. Close the browser
driver.close();
This closes the browser window after the process. If you want to end the whole browser session, including any other windows it opened, use driver.quit() instead.
So, this is how I learned to scrape, and it really made it easy to collect all the data from our website. I have given a simple example of getting the title from a page, so do try this method and try adding some more functions to get more details. I hope this basic tutorial on Java and Selenium will help you learn web scraping. If you have any suggestions or doubts, please leave them in the comment box below.