
Web Scraping Using C# and .NET
By Martin Cerruti | Jul 2022


Turn web pages into structured data

Photo by Maksym Kaharlytskyi on Unsplash

Even though C# has established itself as a reliable programming language, mostly in the realms of back-end applications, it isn’t the first language that comes to mind when you’re looking to build a web scraper.

While C#’s rigid type system may feel inflexible when dealing with the seemingly arbitrary structures found in most web pages, it can actually be a good, high-performing choice.

In this article, we’ll dive into how you can retrieve content using C# by sending an HTTP request, and how you can parse the response into a typed object.

If you’re using Visual Studio, simply create a new Console application, and name it anything you like. If you’re using the dotnet command line, create a folder for your project and run:

dotnet new console

Now you’ll have an empty Hello World application as your starting point.

The method you choose for retrieving content depends heavily on the type of page you’re looking to scrape. While you’re perfectly fine using a standard HttpClient for static pages, it won’t do you much good for SPAs built with React, Angular, and the like.

In order to get our data from SPAs, we’ll need to emulate a browser and have it render the JavaScript returned by the page for us. There’s slightly more to that approach, so let’s dive into a simple static page first.

First off, we’ll need an HttpClient instance. We’ll use that to make requests, and subsequently read the response:

var httpClient = new HttpClient();
var response = await httpClient.GetAsync("https://www.scrapethissite.com/pages/simple/");
var content = await response.Content.ReadAsStringAsync();

Note that the website we’re using here explicitly allows scraping. Make sure you always respect the website’s rate limits, and pages it doesn’t want scraped.

Because there isn’t much in the way of JavaScript magic happening on this page, we’re able to retrieve all of the page’s HTML in our content variable. The HttpClient method is suitable for any page where you can view the HTML via Right Click -> View Page Source and directly see the page’s content.

Using a Headless Browser

As mentioned before, if the page we’re scraping is a more dynamic application that uses JavaScript to render portions of the page on the client side, the HttpClient alone won’t do the trick.

Without evaluating the JavaScript presented by the server, we’re left with a response that contains hardly any actual content.
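For a typical single-page app, it’s little more than an empty mount point and a script tag; something like this (an illustrative example, not taken from any real site):

<!DOCTYPE html>
<html>
  <head>
    <title>My App</title>
  </head>
  <body>
    <!-- The actual content only appears after this script runs in a browser -->
    <div id="root"></div>
    <script src="/static/js/main.js"></script>
  </body>
</html>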

Which is rather useless. To evaluate JavaScript, we can do two things. Either we host a JavaScript runtime ourselves, and we run the JavaScript returned by the page through it, or we emulate a browser.

Both options are relatively complex. Hosting a JavaScript runtime like V8 (Google Chrome’s JavaScript engine) isn’t particularly difficult, but we aren’t guaranteed to get the same results a browser would. Therefore, actually running the JavaScript through a real browser is almost always preferable.

Controlling another process and emulating user input can be quite the challenge too. Fortunately for us, we don’t have to. The WebDriver protocol makes it possible to programmatically drive browser instances, which is exactly what we want to do.

This protocol has been implemented in Chrome and Firefox, and allows for so-called Headless operation of these browsers. That is, they operate without a user interface.

Two of the most popular implementations of headless browser libraries are Puppeteer, which is maintained by Google, and Playwright, which is maintained by Microsoft. They were originally intended to automate web testing, but they’ll do just fine for our purposes too.


Puppeteer is originally a Node.js library that controls Chromium. Fortunately, someone has ported this library to .NET. It’s called PuppeteerSharp and lets us control a Chromium browser from our C# code. Let’s install it and use it to find some content:

dotnet add package PuppeteerSharp

After that, let’s use Puppeteer to launch a headless Chrome instance and retrieve content from the same page we used before.
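Here’s a minimal sketch of what that can look like with PuppeteerSharp (the exact method signatures differ slightly between versions; this assumes a recent release):

using PuppeteerSharp;

// Download a compatible Chromium build (only needed on the first run)
await new BrowserFetcher().DownloadAsync();

// Launch a headless Chromium instance and open a new tab
await using var browser = await Puppeteer.LaunchAsync(new LaunchOptions { Headless = true });
var page = await browser.NewPageAsync();

// Navigate to the page and grab the fully rendered HTML,
// including anything produced by JavaScript
await page.GoToAsync("https://www.scrapethissite.com/pages/simple/");
var content = await page.GetContentAsync();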

This time, instead of simply sending a request to the remote host, we’ve launched an entire Chrome instance, pointed it to the address we’ve specified, and had it read the content of the page for us.

The HTML of the web page is now sitting in the content variable as a string, ready to go. The big difference is that any JavaScript has been rendered as well. This allows us to observe the content of SPAs too, which we wouldn’t have been able to do with the previous method.

Fun fact: if you change Headless = true to Headless = false and run the code again, you can watch the whole browser start up and navigate by itself. Try it out!

Now, we do have some content, but it’s still fairly unstructured. We certainly wouldn’t want to fill up our database with raw HTML, so we’ll need to do some processing before we can actually use the data.

The first thing that comes to mind might be “Regular Expressions!” And that would make sense, except HTML is very irregular, and this post does a better job of explaining why you shouldn’t use regular expressions to parse HTML than I ever could.

What we can do is use a library called AngleSharp to help us parse HTML efficiently. An alternative, slightly more conservative pick is HtmlAgilityPack. Both work just fine, but since AngleSharp is a bit more modern, we’ll give it a spin:

dotnet add package AngleSharp

Now the first thing AngleSharp will need is a browsing context, and a document it can populate:

var context = BrowsingContext.New(Configuration.Default);
var document = await context.OpenAsync(req => req.Content(content));

In this context, content is the raw HTML in the string variable we captured earlier.

Finding Elements

Now we have a document containing all the raw HTML, structured by the nodes of the original page. We can search through these nodes to find the exact piece of information we need.

Let’s see if we can structure this page full of countries.

First off, we’ll need to know what we’re looking for. If we use the Chrome DevTools element inspector (Ctrl/Cmd + Shift + C), we can see that the element we’re after is a div with the classes col-md-4 country. That shouldn’t be too hard to find!

Using some simple LINQ, we can obtain all the elements that match this class:

var countries = document.QuerySelectorAll("*")
.Where(e => e.LocalName == "div" && e.ClassName == "col-md-4 country")
.ToList();

I’m fully aware this can be done by using a QuerySelector as well, but we’re keeping it very simple and C#-y for now.
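For comparison, the same lookup with a CSS selector would look something like this, and should return the same elements:

// Equivalent lookup using a CSS selector instead of LINQ filtering
var countries = document.QuerySelectorAll("div.col-md-4.country").ToList();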

Running this code returns 250 elements — the exact number of countries listed on this page! We must be onto something. Now let’s get to extracting this information.

Extracting Information

Let’s get started by creating a record to hold our country information. We know the fields ahead of time; they are:

public record Country(string Name, string Capital, int Population, int Area);

Right, all good. Now comes the juicy part, and honestly, this is the part where C# does feel a little bit rough around the edges.

In order to extract the data, we have to apply some extremely rudimentary string-splitting magic.
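Here’s a sketch of that splitting, assuming each country block’s text content collapses to the country name followed by “Capital: …”, “Population: …”, and “Area (km2): …” lines:

// Needs using System.Globalization; for CultureInfo
var results = new List<Country>();

foreach (var element in countries)
{
    // Break the element's text content into trimmed, non-empty lines
    var lines = element.TextContent
        .Split('\n', StringSplitOptions.RemoveEmptyEntries)
        .Select(l => l.Trim())
        .Where(l => l.Length > 0)
        .ToArray();

    // The first line is the country name; the rest are "Label: value" pairs
    var name = lines[0];
    var capital = lines[1].Split(':')[1].Trim();
    var population = int.Parse(lines[2].Split(':')[1].Trim());

    // Areas are listed with a decimal point (e.g. 468.0), so parse as double first
    var area = (int)double.Parse(lines[3].Split(':')[1].Trim(), CultureInfo.InvariantCulture);

    results.Add(new Country(name, capital, population, area));
}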

It may not be the prettiest, but it gets the job done.

We’ve built a very, very rudimentary scraper that uses a headless Chrome instance to visit a page, retrieve the raw HTML, and parse it into a format we can use. Here’s the final code, assembled from the snippets above.
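A single top-level Program.cs along these lines should do it (a sketch; adjust package versions and namespaces to your project):

using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;
using AngleSharp;
using PuppeteerSharp;

// Download a compatible Chromium build (only needed on the first run)
await new BrowserFetcher().DownloadAsync();

// Fetch the fully rendered page with a headless Chromium instance
await using var browser = await Puppeteer.LaunchAsync(new LaunchOptions { Headless = true });
var page = await browser.NewPageAsync();
await page.GoToAsync("https://www.scrapethissite.com/pages/simple/");
var content = await page.GetContentAsync();

// Parse the raw HTML into a document we can query
var context = BrowsingContext.New(Configuration.Default);
var document = await context.OpenAsync(req => req.Content(content));

// Find every country block on the page
var countries = document.QuerySelectorAll("*")
    .Where(e => e.LocalName == "div" && e.ClassName == "col-md-4 country")
    .ToList();

var results = new List<Country>();

foreach (var element in countries)
{
    // Break the block's text content into trimmed, non-empty lines
    var lines = element.TextContent
        .Split('\n', StringSplitOptions.RemoveEmptyEntries)
        .Select(l => l.Trim())
        .Where(l => l.Length > 0)
        .ToArray();

    var name = lines[0];
    var capital = lines[1].Split(':')[1].Trim();
    var population = int.Parse(lines[2].Split(':')[1].Trim());
    // Areas include a decimal point (e.g. 468.0), so parse as double first
    var area = (int)double.Parse(lines[3].Split(':')[1].Trim(), CultureInfo.InvariantCulture);

    results.Add(new Country(name, capital, population, area));
}

foreach (var country in results)
{
    Console.WriteLine(country);
}

public record Country(string Name, string Capital, int Population, int Area);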

Take it, and see if you can adapt it to scrape a page you need information from.

If dealing with headless browsers, proxies, and CAPTCHAs doesn’t seem very appealing to you, there are services out there such as ScrapeShark that deal with all that stuff for you. Full disclosure: I am ScrapeShark’s creator, and it is available for free.

Building scrapers can seem daunting and time-consuming at first, but in the long run they’ll save you an infinite amount of time while you run your business!


