TheJoin's Blog


Scraping the web to get the best flight fares | NodeJS & Puppeteer


category: javascript | tags: nodejs, puppeteer


Puppeteer is the official tool for Headless Chrome, built by the Google Chrome team. Since the official announcement of Headless Chrome, many of the industry-standard libraries for automated testing have been discontinued by their maintainers, including PhantomJS (sad to hear, I know). NOTE: in March 2019 the Puppeteer team also released a beta of Puppeteer-Firefox, available here.

TL;DR

In this article we will scrape Skyscanner: we will log in and extract flight data with Headless Chrome, Puppeteer, Node and ExpressJS. Skyscanner has a rate-limiting mechanism in place to keep you under control, but this post will give you a good idea of what “scraping with Headless Chrome and Node” looks like. Here is the accompanying GitHub repository.

The repo is for personal or learning use only. Skyscanner provides some useful APIs you can use instead of scraping its site.

Feel the power

Puppeteer is a very powerful tool and easy to use, also for testing purposes. In this article we will see how to scrape a page by typing some input data and submitting some forms, so we can navigate through the website and get our results.

Our goal is to get the flight fares for a given destination and a date range.

Getting Started

Before we start, you need to have Node 8+ installed.

Clone this repository to your local environment (it can also be a Raspberry Pi), navigate into it and then run: npm install

This will install:
- Puppeteer
- Puppeteer Extra (a useful Puppeteer plugin)
- ExpressJS
- InquirerJS

Try it

Try running node index-inquirer.js: it will start a simple questionnaire to collect the input data, like the departure airport, the return airport and so on. Then it will launch a browser instance (Puppeteer), type the data into the form inputs and print out all the data in JSON format.

You can also try the other entry points: node index.js -h and node index-concurrency.js.

The first one is a CLI-style implementation, so you need to pass all the parameters as arguments.

The second one is implemented with ExpressJS and creates a GET endpoint you can interact with. This solution can manage concurrency by opening a new Puppeteer page/instance per request.
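As a hedged sketch of how such an endpoint could be structured (the route, query names and scrape function here are my own invention, not necessarily what the repo uses): one shared browser, and a fresh Puppeteer page per incoming request:

```javascript
// Sketch: build an Express-style handler that opens a new Puppeteer page
// per request and always closes it. `browser` is a running Puppeteer
// browser; `scrape` is whatever function extracts the fares from a page.
function makeFaresHandler(browser, scrape) {
  return async (req, res) => {
    const page = await browser.newPage(); // one tab per concurrent request
    try {
      const fares = await scrape(page, req.query);
      res.json(fares);
    } finally {
      await page.close(); // free the tab even if scraping throws
    }
  };
}

// hypothetical wiring: app.get('/fares', makeFaresHandler(browser, scrapeFares));
```

Opening a page (tab) per request is much cheaper than launching a whole browser each time, which is why sharing one browser instance is the usual pattern.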

How it works

The following example is the simplest implementation of Puppeteer:

const puppeteer = require('puppeteer');

async function run() {
  // Launch a headless Chromium instance and open a new tab
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Navigate to the homepage and capture the visible viewport
  await page.goto('https://www.skyscanner.com');
  await page.screenshot({ path: 'screenshots/skyscanner.png' });

  await browser.close();
}

run();

NOTE: the Puppeteer API is Promise-based, so we need to use the await keyword. I also suggest adding a try/catch block to handle errors.
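For example, a small helper (my own sketch, not code from the repo) can guarantee the browser is closed even when a step throws:

```javascript
// Sketch: run a task against a browser and always close it afterwards.
// `launch` is a function returning a browser (e.g. () => puppeteer.launch()).
async function withBrowser(launch, task) {
  const browser = await launch();
  try {
    return await task(browser);
  } finally {
    await browser.close(); // runs on success and on error alike
  }
}

// hypothetical usage: withBrowser(() => puppeteer.launch(), async browser => { ... });
```

Without the finally, a thrown error leaves a headless Chromium process running, which adds up quickly on a small machine like a Raspberry Pi.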

The script shown above opens a Puppeteer instance, then opens a new page and goes to the Skyscanner homepage. After the page is loaded, it takes a screenshot.

The screenshot covers only the visible viewport. You can add the fullPage: true option to capture the full page (it's more expensive in terms of time and resources). Simple, no?

Now take a look at the following implementation, which navigates through the Skyscanner page and logs in:
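The original snippet lives in the repo; a hedged sketch of the idea, written as a function that receives an already-open Puppeteer page, looks like this (every selector below is a hypothetical placeholder, not Skyscanner's real markup):

```javascript
// Sketch of the sign-in flow. All selectors are made-up placeholders.
async function signIn(page, email, password) {
  await page.click('#sign-in-button');                  // open the login popup
  await page.waitForSelector('input[name="email"]');    // wait for the AJAX-loaded form
  await page.type('input[name="email"]', email, { delay: 30 });
  await page.type('input[name="password"]', password, { delay: 30 });
  await page.click('button[type="submit"]');
  await page.waitForSelector('.logged-in-indicator');   // throws if login never completes
}
```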

The signIn function is an async function, so we can use the await keyword inside it.

The Skyscanner sign-in is an AJAX request, so there is no need to refresh our page. The script says:
- Click on the sign-in button
- Focus on the email input and type in the username
- Then focus on the password input and type in the password
- After that, click on the submit button

If everything goes well we are logged in; otherwise an Error is returned.

How to write a script

You need to analyze the target website, its resources and its requests, and work out the whole flow you have to follow in order to retrieve the results.

Puppeteer runs as a headless browser by default, but you can add the headless: false option and it will show you every phase of the script.

This feature is very useful in a testing environment, and also when scraping, because you can better monitor what's going on.

The following script is a simple example of what I’ve implemented in the repo of this article:
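As a condensed, hedged sketch of that flow (the selectors are placeholders I've invented, and the datepicker and adult/children-counter steps are omitted for brevity):

```javascript
// Sketch of the search flow: set the viewport, type the airports with a
// typing delay, submit, wait for the results, then screenshot the page.
// Selector names are assumptions for illustration only.
async function searchFlights(page, { origin, destination }) {
  await page.setViewport({ width: 1600, height: 900 });
  await page.type('#origin', origin, { delay: 30 });           // delayed typing
  await page.type('#destination', destination, { delay: 30 });
  await page.click('button[type="submit"]');
  await page.waitForNavigation({ waitUntil: 'networkidle2' }); // wait for the results page
  await page.screenshot({ path: 'screenshots/search.png' });
}
```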

I've mapped all the selectors to their functionality, like the datepicker selectors or the adult/children counter and so on.

We are simply typing in some input data, like the origin airport and the destination airport, and clicking on the submit button.

Note that some actions are delayed, like the type calls and the popup elements (waitForSelector).

This is very useful because an element can be loaded via AJAX or can be set to display: none, and Puppeteer cannot handle these situations on its own: we need to wait for an element to render (waitForSelector) or for an AJAX response or navigation to complete (waitForNavigation).

Note also that we are taking a screenshot of a Skyscanner page again, but not the homepage like before. Instead, we are taking a screenshot (1600x900) of the submitted search page.

On the results page I can run the function that retrieves all the flight data:
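A hedged sketch of such an extraction function (the .fare-card, .airline and .price selectors are invented for illustration; the real page uses different markup):

```javascript
// Sketch: extract fare data from the results page, then sort cheapest-first.
async function getFares(page) {
  // This callback runs inside the browser context
  const fares = await page.evaluate(() =>
    Array.from(document.querySelectorAll('.fare-card')).map(card => ({
      airline: card.querySelector('.airline').textContent.trim(),
      price: Number(card.querySelector('.price').textContent.replace(/[^\d.]/g, '')),
    }))
  );
  // Back in Node: cheapest flights first
  return fares.sort((a, b) => a.price - b.price);
}
```

Note that the callback passed to page.evaluate is serialized and executed in the page, so it can only use what exists in the browser context (document, window), not variables from your Node script unless you pass them as arguments.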

Conclusion

Puppeteer is a very powerful and useful tool to test and scrape a website. It's fast to learn and easy to manage.

The script shown in the repository uses some functions and classes in order to have better code quality, but it follows the same steps as the script in the Gist.

The scripts are for learning use only. Always respect the website's policy and don't scrape for commercial use.