For the past couple of years, I have spent a lot of time creating web scrapers and information extraction tools for clients and for my own service. Recently, I had an idea for improving the quality of the results while spending less time on data-extraction rules. The result of this work is the experimental JavaScript project struktur.js.
Web scraping is soul-crushing work. There are mainly two reasons for that:
Reason 1: CSS Selectors and XPath queries are hard to maintain
Whenever the website's markup changes, you need to adapt your selectors. Furthermore, those selectors assume a static response from the scraped web server. If the structure changes even slightly, the predefined CSS selectors fail miserably and the complete scrape job has to be restarted.
Many websites also have randomized class names and IDs in their HTML markup, so those attributes carry no semantic information and cannot be used in CSS selectors. The markup that Google returns for its search results, for example, is full of such obfuscated class names.
These days, many websites are built with a JavaScript frontend framework such as Angular or React. This reinforces the trend of class names bearing no semantic relation to their content.
I finally decided to think about alternative ways of web scraping when I had to commit the following code into se-scraper:
This code is ugly and probably breaks in a couple of weeks. Then I need at least two hours of my time to fix it. I simply cannot do this for 10 different websites where I have to maintain a different set of scraping rules via CSS selectors.
Reason 2: Websites have built in defenses against web scraping
Of course, it is reasonable to protect yourself against excessive and malicious scraping attempts. Nobody wants their whole body of data scraped by third parties. The reason is obvious: the value of many providers lies in the richness of the information they hold. LinkedIn's worth, for example, depends largely on its information monopoly on people in the business workforce. For this reason, they employ very strict defenses against large-scale scraping.
Those defenses block clients based on
- IP addresses
- User agents and other HTTP headers
- Whether the client has a JavaScript execution engine
- Number of interactions on the site, mouse movements, swiping behavior, pauses
- Whether the client has a previously known profile
- CAPTCHAs
It all boils down to one question: is the client a robot or a human? There is a large industry that tries to detect bots; well-known companies in this space are Distil and Imperva. They try to distinguish bots from humans with machine learning models built on indicators such as mouse movements, swipe behavior in mobile apps, accelerometer statistics and device fingerprints. This topic has enough depth for another major blog post and will not be explored here any further.
Scraping without a single CSS selector - Detecting Structures
There is not much we can do against web scraping defenses, since they are implemented on the server side. We can only invest in more resources such as IP addresses and proxies to build a larger scraping infrastructure.
In this blog post, we will tackle the first issue: the problem of reliably detecting relevant structure across websites without using XPath or CSS selectors. What exactly is relevant structure?
In the remainder of this article, we define relevant structure to be a collection of similar objects which are of interest to the user. Our task is to recognize this structure on every webpage on the Internet. It is a collection if there are at least N objects with the same `tagName` under a container node. The objects are similar if their visible content is similar. Of course this is a somewhat circular definition, because how exactly to measure similarity is still an open question.

Examples of recurring structure
All those items have a common structure when interpreted visually: they have more or less the same vertical alignment, the same font size, and the same HTML tag within the same hierarchy level.
The huge problem is that structure is created dynamically from the interplay of HTML, JavaScript and CSS. This means that the HTML structure does not necessarily resemble the visual output. Therefore, we need to operate on a rendered web page, which implies scraping with real, headless browsers using libraries such as puppeteer.
What assumptions do we make?
The input is a website rendered by a modern browser with JavaScript support. We will use puppeteer to render websites and puppeteer-extra-plugin-stealth to make the headless browser look like a real one.
We assume that structure is what humans consider to be related structure.
- Reading goes from top-left to bottom-right
- Identical horizontal/vertical alignment among objects
- More or less same size of bounding rectangles of the object of interest
- As output, we are only interested in links, text and images (which are links). The output needs to be visible/displayed in the DOM.
- Only structure that takes a major part of the visible viewport is considered structure
- There must be at least N=6 objects within a container to be considered recurring structure
When websites protect themselves against web scraping on the HTML level, they still need to present a usable website to the non-malicious visitor at some point. Our approach is to extract the information at exactly that point: after rendering.
Algorithm Description
In a first step, we find potential structure candidates:
1. Take a starting node as input. If no node is given, use the `body` element.
2. See if the element contains at least N identical elements (such as `div`, `article`, `li`, `section`, ...). If yes, mark those child nodes as potential structure candidates.
3. Visit the next node in the tree and check again for N identical elements.
After all candidates have been found, get the bounding boxes of the candidates with `getBoundingClientRect()`. If the bounding boxes align vertically and horizontally, have more or less identical dimensions, and make up a significant part of `document.body.getBoundingClientRect()`, add those elements to the potential structures. Further tests are possible; a rough sketch of this candidate-detection step follows the example below.

In a second step, we cross-correlate the contents of the structure candidates and filter out objects that are not similar enough: we compare the items within the potential structures, and if they share common characteristics, we consider those elements to form a valid structure. We are only interested in `img`, `a` and `textNode` elements. Furthermore, we consider only visible nodes.

Example: If we find 8 results in a Google search, all those results essentially have a title (a link with text), a visible link in green font (this is only green text and a child node of the title) and a snippet. The snippet typically consists of many different text nodes. Therefore one potential filter could be:
- Each object must have a link with text with font size X at the beginning of the object.
- There must be a textNode with green font after this title.
- There must be some text after this green colored text.
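As promised above, here is a minimal sketch of the candidate-detection step. It is not struktur.js itself (struktur.js is JavaScript running inside puppeteer); it re-implements the same idea in Python with Selenium, which is the tooling used later in this post. The threshold N = 6, the pixel tolerances and the example URL are assumptions.

```python
# Hypothetical sketch of the candidate-detection idea, NOT the actual struktur.js code.
from collections import Counter

from selenium import webdriver
from selenium.webdriver.common.by import By

N = 6  # minimum number of identical children to count as recurring structure

def find_structure_candidates(driver):
    candidates = []
    # Naive walk: look at every element under <body> and inspect its direct children.
    for container in driver.find_elements(By.CSS_SELECTOR, "body *"):
        children = container.find_elements(By.XPATH, "./*")
        tag_counts = Counter(child.tag_name for child in children)
        for tag, count in tag_counts.items():
            if count < N:
                continue
            group = [c for c in children if c.tag_name == tag and c.is_displayed()]
            if len(group) < N:
                continue
            rects = [c.rect for c in group]  # x, y, width, height of each child
            widths = [r["width"] for r in rects]
            xs = [r["x"] for r in rects]
            # Require roughly identical widths and identical horizontal alignment.
            if max(widths) - min(widths) < 30 and max(xs) - min(xs) < 10:
                candidates.append(group)
    return candidates

if __name__ == "__main__":
    driver = webdriver.Chrome()
    driver.get("https://news.ycombinator.com")  # any page with list-like structure
    for group in find_structure_candidates(driver):
        print(len(group), "x", group[0].tag_name)
    driver.quit()
```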
Downsides
Because our algorithm is abstract, the JSON output cannot have meaningful variable names. The algorithm cannot know which text of an object is the title, which part is a price, what is a date, and so on. This means that a post-processing step is necessary to map variable names to the output.
Furthermore, we cannot reliably know the correct N that specifies how many recurring elements should be in a structure. It depends on the website.
Run struktur yourself
Enough talking, you can test struktur.js in the following way. First download struktur.js and put it in the same path as the following script:
Then install dependencies with
And run the above script with `node`. You will see the structure found in the Google search printed as JSON. You can specify any other website you want.

In the last tutorial we learned how to leverage the Scrapy framework to solve common web scraping problems. Today we are going to take a look at Selenium (with Python ❤️) in a step-by-step tutorial.
Selenium refers to a number of different open-source projects used for browser automation. It supports bindings for all major programming languages, including our favorite language: Python.
The Selenium API uses the WebDriver protocol to control a web browser like Chrome, Firefox or Safari. The browser can run either locally or remotely.
At the beginning of the project (almost 20 years ago!) it was mostly used for cross-browser, end-to-end testing (acceptance tests).
Now it is still used for testing, but also as a general browser automation platform. And of course, it is used for web scraping!
Selenium is useful when you have to perform an action on a website such as:
- Clicking on buttons
- Filling forms
- Scrolling
- Taking a screenshot
It is also useful for executing JavaScript code. Let's say that you want to scrape a Single Page Application and you haven't found an easy way to call the underlying APIs directly. In this case, Selenium might be what you need.
Installation
We will use Chrome in our example, so make sure you have it installed on your local machine:
The `selenium` package
To install the Selenium package, as always, I recommend that you create a virtual environment (for example using virtualenv) and then:
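Assuming a standard Python setup, the package is installed with pip:

```bash
pip install selenium
```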
Quickstart
Once you have downloaded both Chrome and Chromedriver and installed the Selenium package, you should be ready to start the browser:
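The original snippet was not preserved here, but a minimal version looks like this (it assumes chromedriver is available on your PATH, or that a recent Selenium version downloads a matching driver for you):

```python
from selenium import webdriver

# Assumes chromedriver is on your PATH (recent Selenium versions can also
# fetch a matching driver automatically).
driver = webdriver.Chrome()
driver.get("https://www.google.com")
```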
This will launch Chrome in headful mode (like regular Chrome, but controlled by your Python code). You should see a message stating that the browser is controlled by automated software.
To run Chrome in headless mode (without any graphical user interface), you can run it on a server. See the following example:
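A sketch of what that looks like (the flags and URL are just reasonable defaults):

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")               # no GUI
options.add_argument("--window-size=1920,1080")  # a realistic window size helps rendering

driver = webdriver.Chrome(options=options)
driver.get("https://news.ycombinator.com")
print(driver.page_source[:500])  # first 500 characters of the rendered HTML
driver.quit()
```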
The `driver.page_source` property will return the full page HTML code. Here are two other interesting WebDriver properties:
- `driver.title` gets the page's title
- `driver.current_url` gets the current URL (this can be useful when there are redirections on the website and you need the final URL)
Locating Elements
Locating data on a website is one of the main use cases for Selenium, either for a test suite (making sure that a specific element is present/absent on the page) or to extract data and save it for further analysis (web scraping).
There are many methods available in the Selenium API to select elements on the page. You can use:
- Tag name
- Class name
- IDs
- XPath
- CSS selectors
We recently published an article explaining XPath. Don't hesitate to take a look if you aren't familiar with XPath.
As usual, the easiest way to locate an element is to open your Chrome dev tools and inspect the element that you need. A cool shortcut for this is to highlight the element you want with your mouse and then press Ctrl + Shift + C (or Cmd + Shift + C on macOS) instead of having to right click + inspect each time.
find_element

There are many ways to locate an element in Selenium. Let's say that we want to locate the h1 tag in this HTML:
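The article's original HTML snippet was not preserved, so assume something like `<h1 id="main-title" class="page-title">Welcome</h1>`; the different locator strategies then look like this (the `driver` object comes from the quickstart above):

```python
from selenium.webdriver.common.by import By

# All of these locate the same (hypothetical) h1 element.
# Older Selenium versions expose the same calls as find_element_by_id, etc.
h1 = driver.find_element(By.TAG_NAME, "h1")
h1 = driver.find_element(By.ID, "main-title")
h1 = driver.find_element(By.CLASS_NAME, "page-title")
h1 = driver.find_element(By.XPATH, "//h1")
h1 = driver.find_element(By.CSS_SELECTOR, "h1.page-title")
print(h1.text)
```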
All these methods also have a `find_elements` counterpart (note the plural) that returns a list of elements. For example, to get all anchors on a page, use the following:
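A sketch of what that looks like:

```python
from selenium.webdriver.common.by import By

all_links = driver.find_elements(By.TAG_NAME, "a")
for link in all_links:
    print(link.get_attribute("href"))
```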
Some elements aren't easily accessible with an ID or a simple class, and that's when you need an XPath expression. You also might have multiple elements with the same class (the ID is supposed to be unique).
XPath is my favorite way of locating elements on a web page. It's a powerful way to extract any element on a page, based on its absolute position in the DOM, or relative to another element.
WebElement
A `WebElement` is a Selenium object representing an HTML element. There are many actions that you can perform on those HTML elements; here are the most useful:
- Accessing the text of the element with the property `element.text`
- Clicking on the element with `element.click()`
- Accessing an attribute with `element.get_attribute('class')`
- Sending text to an input with `element.send_keys('mypassword')`
There are some other interesting methods, like `is_displayed()`, which returns True if an element is visible to the user. This can be useful for avoiding honeypots (like filling hidden inputs).
Honeypots are mechanisms used by website owners to detect bots. For example, if an HTML input has the attribute `type=hidden`, its value is supposed to be left blank. If a bot visits a page and fills all of the inputs on a form with random values, it will also fill the hidden input. A legitimate user would never fill in the hidden input's value, because it is not rendered by the browser.
That's a classic honeypot.
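A small sketch of how `is_displayed()` helps here (the selector and the text sent are placeholders):

```python
from selenium.webdriver.common.by import By

for field in driver.find_elements(By.CSS_SELECTOR, "form input"):
    if not field.is_displayed():
        continue  # hidden input: likely a honeypot, leave it blank
    field.send_keys("example value")
```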
Full example
Here is a full example using Selenium API methods we just covered.
We are going to log into Hacker News:
In our example, authenticating to Hacker News is not really useful on its own. However, you could imagine creating a bot to automatically post a link to your latest blog post.
In order to authenticate we need to:
- Go to the login page using `driver.get()`
- Select the username input using `driver.find_element_by_*` and then `element.send_keys()` to send text to the input
- Follow the same process with the password input
- Click on the login button using `element.click()`
Should be easy right? Let's see the code:
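The original snippet was lost here; a reconstruction along the lines described above looks like this (the field names `acct` and `pw` match the Hacker News login form, but verify them against the page source yourself; the credentials are placeholders):

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://news.ycombinator.com/login")

# Field names on the Hacker News login form (check the page source if they change).
login = driver.find_element(By.XPATH, "//input[@name='acct']")
password = driver.find_element(By.XPATH, "//input[@name='pw']")
submit = driver.find_element(By.XPATH, "//input[@value='login']")

login.send_keys("YOUR_USERNAME")    # placeholder credentials
password.send_keys("YOUR_PASSWORD")
submit.click()
```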
Easy, right? Now there is one important thing that is missing here. How do we know if we are logged in?
We could try a couple of things:
- Check for an error message (like “Wrong password”)
- Check for one element on the page that is only displayed once logged in.
So, we're going to check for the logout button. The logout button has the ID “logout” (easy)!
We can't just check whether the element is `None`, because all of the `find_element_by_*` methods raise an exception if the element is not found in the DOM. So we have to use a try/except block and catch the `NoSuchElementException` exception:
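A minimal sketch of that check (the ID `logout` comes from the paragraph above):

```python
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By

try:
    driver.find_element(By.ID, "logout")
    print("Successfully logged in")
except NoSuchElementException:
    print("Login failed")
```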
Taking a screenshot
We could easily take a screenshot using:
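For example (the filename is arbitrary):

```python
driver.save_screenshot('screenshot.png')  # saves a PNG of the current viewport
```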
Note that a lot of things can go wrong when you take a screenshot with Selenium. First, you have to make sure that the window size is set correctly. Then, you need to make sure that every asynchronous HTTP call made by the frontend JavaScript code has finished, and that the page is fully rendered.
In our Hacker News case it's simple and we don't have to worry about these issues.
Waiting for an element to be present
Dealing with a website that uses lots of JavaScript to render its content can be tricky. These days, more and more sites are using frameworks like Angular, React and Vue.js for their front-end. These front-end frameworks are complicated to deal with because they fire a lot of AJAX calls.
If we had to worry about an asynchronous HTTP call (or many) to an API, there are two ways to solve this:
- Use a `time.sleep(ARBITRARY_TIME)` before taking the screenshot.
- Use a `WebDriverWait` object.
If you use `time.sleep()`, you will probably use an arbitrary value. The problem is that you're either waiting too long or not long enough. Also, the website can load slowly on your local wifi connection but be ten times faster on your cloud server. With the `WebDriverWait` method, you wait exactly as long as necessary for your element/data to be loaded; the sketch after the list below waits five seconds for an element located by the ID “mySuperId” to appear. There are many other interesting expected conditions like:
- `element_to_be_clickable`
- `text_to_be_present_in_element`
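Here is a minimal sketch of the `WebDriverWait` pattern (the ID `mySuperId` is the example from the paragraph above):

```python
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to five seconds for the element with ID "mySuperId" to be present.
element = WebDriverWait(driver, 5).until(
    EC.presence_of_element_located((By.ID, "mySuperId"))
)
```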
You can find more information about this in the Selenium documentation.
Executing JavaScript
Sometimes, you may need to execute some JavaScript on the page. For example, let's say you want to take a screenshot of some information, but you first need to scroll a bit to see it. You can easily do this with Selenium:
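For example (the scroll distance is arbitrary):

```python
# Scroll down 1000 pixels, then take a screenshot of the newly visible area.
driver.execute_script("window.scrollBy(0, 1000);")
driver.save_screenshot('after_scroll.png')
```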
Conclusion
I hope you enjoyed this blog post! You should now have a good understanding of how the Selenium API works in Python. If you want to know more about how to scrape the web with Python don't hesitate to take a look at our general Python web scraping guide.
Selenium is often necessary to extract data from websites that use lots of JavaScript. The problem is that running lots of Selenium/Headless Chrome instances at scale is hard. This is one of the things we solve with ScrapingBee, our web scraping API.
Selenium is also an excellent tool to automate almost anything on the web.
If you perform repetitive tasks, like filling forms or checking information behind a login form where the website doesn't have an API, it may be a good idea to automate them with Selenium. Just don't forget this xkcd: