Copying and pasting a large amount of data from a website is tedious and time-consuming. Now imagine collecting all of that data with a few lines of code in a simple programming language. There are two programming languages that make this work simple. The following paragraphs cover those languages, their benefits, and how to use them.
Python Language + Beautiful Soup
Python is a simple, easy-to-learn language used to develop websites and applications with a GUI (Graphical User Interface). It runs on many platforms, including Windows, Linux, and macOS. Python can be combined with other programming languages, and it can also extract data from websites with the help of a library called "Beautiful Soup". Now let us see how to install this library and learn how to extract data.
Before starting with scraping, we have to install a few libraries. Let's see what they are and how to install them for Python.
Pip (Package Manager)
pip installs packages for Python. Recent Python installers usually include pip already; if yours does not, install pip after installing Python so that you can install the packages needed for web scraping.
How to download and save pip for Python?
- Open a browser on your Mac / PC.
- Then search for get-pip and open the get-pip.py page.
- Now right-click on that page and choose Save as.
- Save your file as get-pip.py and don't change the type.
Installing pip for Python
After downloading get-pip.py, you have to run it with Python. To do that, open Command Prompt on Windows or Terminal on Mac.
- Firstly, install Python on your Mac / PC.
- Then, check the installation by entering the python command in the Command Prompt or Terminal.
- If it shows an error, you have to install Python again.
- If it runs well, enter the exit() command.
- Then enter the cd desktop command to move to the desktop folder (this assumes get-pip.py was saved there).
- If that succeeds, enter python get-pip.py to run the downloaded file.
- pip will now be installed on your PC.
Note: Make sure that the internet connection is stable.
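As a quick check after the steps above, the following minimal sketch asks the current Python interpreter for its pip version. The helper name pip_version is hypothetical and used only for illustration, and the sketch assumes that pip can be run as a module (python -m pip) on your system.

```python
import subprocess
import sys


def pip_version():
    """Return pip's version line, or None if pip is not available.

    pip_version is a hypothetical helper name used for illustration.
    """
    # Run "python -m pip --version" with the same interpreter that runs this script.
    result = subprocess.run(
        [sys.executable, "-m", "pip", "--version"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return None  # pip is missing or failed to start
    return result.stdout.strip()


print(pip_version())
```

If this prints None, pip is probably not installed for that interpreter, and the get-pip.py steps need to be repeated.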
After installing pip, you have to install the requests library using pip.
- On your PC, go to the Start menu and type IDLE (Python).
- Then right-click on it and click Open file location.
- Now open the Scripts folder and copy the file path.
- Next, open the Command Prompt and type cd followed by the copied file path.
- If the command works, enter pip install requests as the next command.
- Then the requests library will be installed.
Now let us see the process of installing the Beautiful Soup library.
- Open Command Prompt and enter pip install beautifulsoup4
- After installation check whether the libraries are working.
- Open Command Prompt or Python IDLE.
- Then type the commands below.
If you are using Command Prompt, enter:
python
>>> import requests
>>> import bs4
If you are using Python IDLE, enter:
>>> import requests
>>> import bs4
If neither import prints an error, the libraries are installed.
Note: bs4 indicates BeautifulSoup4.
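The same check can be run as a small script instead of typing the imports by hand. This is only a sketch: is_installed is a hypothetical helper name, and it reports whether Python can find a module without actually importing it.

```python
import importlib.util


def is_installed(module_name):
    """Return True if module_name can be found by the import system."""
    return importlib.util.find_spec(module_name) is not None


# Report the status of the two libraries installed in the steps above.
for name in ("requests", "bs4"):
    status = "installed" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

A "missing" result suggests that the matching pip install command should be run again.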
How to install an XML or HTML parser?
After installing Beautiful Soup, you can install a parser for XML or HTML, such as lxml or html5lib. To do that, open Command Prompt and use the same pip install command.
pip install lxml
pip install html5lib
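Beautiful Soup takes the name of the parser as its second argument, so a script can pick whichever parser is installed. The sketch below assumes the parser names "lxml", "html5lib", and "html.parser" (the last one ships with Python); choose_parser is a hypothetical helper name.

```python
import importlib.util


def choose_parser():
    """Pick a parser name to pass to Beautiful Soup, preferring optional ones."""
    # "lxml" and "html5lib" are optional third-party parsers.
    for name in ("lxml", "html5lib"):
        if importlib.util.find_spec(name) is not None:
            return name
    # "html.parser" is part of the Python standard library, so it is always available.
    return "html.parser"


parser = choose_parser()
print(parser)
# Later: soup = bs4.BeautifulSoup(html_text, parser)
```

Falling back to "html.parser" means the rest of the tutorial still works even if no extra parser was installed.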
Now let us see some simple code to scrape the title of a website.
The steps below are an example of scraping the page title from a website. First, we have to start with a request.
- import requests (this loads the requests library, which downloads web pages).
- import bs4 (this loads the Beautiful Soup library, which parses the downloaded page).
- res = requests.get("https://www.wikipedia.org") (here we choose the site to scrape).
- type(res) (this shows the type of the returned object, which is a requests Response).
- If you want the whole HTML of the page, type res.text.
- Now let me take the title from the website. To do that, first type soup = bs4.BeautifulSoup(res.text, "html.parser"), then type h1 = soup.select("title"). Note that select("title") matches the page's title element; use select("h1") for the h1 heading.
- If the commands are right, h1 is a list containing the title element, and h1[0].get_text() returns its text.
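The steps above can be put together roughly as follows. This is a minimal sketch, not the only way to do it: the function names extract_title and fetch_title, the sample page, and the 10-second timeout are assumptions added for illustration. The demonstration parses a small hand-written page so that it runs without an internet connection.

```python
import bs4
import requests


def extract_title(html_text):
    """Return the text of the page's <title> element, or None if absent."""
    soup = bs4.BeautifulSoup(html_text, "html.parser")
    matches = soup.select("title")  # select() returns a list of matching elements
    if not matches:
        return None
    return matches[0].get_text().strip()


def fetch_title(url):
    """Download url and return its title (needs an internet connection)."""
    res = requests.get(url, timeout=10)
    res.raise_for_status()  # stop early on HTTP errors such as 404
    return extract_title(res.text)


# Offline demonstration on a small hand-written page.
sample = "<html><head><title>Example page</title></head><body><h1>Hello</h1></body></html>"
print(extract_title(sample))  # prints "Example page"
```

To scrape the live site used in the steps, call fetch_title("https://www.wikipedia.org").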
Beautiful Soup helps you scrape data with ease. This article provides basic information about Beautiful Soup and how to scrape data with it.