Turn web pages into structured data
Even though C# has established itself as a reliable programming language, mostly in the realm of back-end applications, it isn't the first language that comes to mind when you're looking to build a web scraper.
While C#'s rigid type system may feel inflexible when dealing with the seemingly arbitrary structures found in most web pages, it can actually be a good, high-performing choice.
In this article, we’ll dive into how you can retrieve content using C# by sending an HTTP request, and how you can parse the response into a typed object.
If you're using Visual Studio, simply create a new Console application and name it anything you like. If you're using the dotnet command line, create a folder for your project and run:
dotnet new console
Now you’ll have an empty Hello World application as your starting point.
The method you choose for retrieving content depends heavily on the type of page you're looking to scrape. While you're perfectly fine using a standard HttpClient for static pages, it won't do you much good for SPAs built with React, Angular, and the like.
First off, we'll need an HttpClient instance. We'll use that to make requests, and subsequently read the response:
var httpClient = new HttpClient();
var response = await httpClient.GetAsync("https://www.scrapethissite.com/pages/simple/");
var content = await response.Content.ReadAsStringAsync();
Note that the website we’re using here explicitly allows scraping. Make sure you always respect the website’s rate limits, and pages it doesn’t want scraped.
The raw HTML is now stored in the content variable. The HttpClient approach is suitable for any page whose content you can see directly in the HTML by doing Right Click -> View Page Source.
Using a Headless Browser
For single-page applications that render their content with JavaScript, however, a plain HttpClient won't do the trick.
Controlling another process and emulating user input can be quite the challenge too. Fortunately for us, we don’t have to. The WebDriver protocol makes it possible to programmatically drive browser instances, which is exactly what we want to do.
This protocol has been implemented in Chrome and Firefox, and allows for so-called Headless operation of these browsers. That is, they operate without a user interface.
Two of the most popular implementations of headless browser libraries are Puppeteer, which is maintained by Google, and Playwright, which is maintained by Microsoft. They were originally intended to automate web testing, but they’ll do just fine for our purposes too.
Puppeteer is originally a NodeJS library that controls Chromium. Fortunately, someone has ported this library to .NET. The port is called PuppeteerSharp and lets us control a Chromium browser from our C# code. Let's install it and use it to find some content:
dotnet add package PuppeteerSharp
After that, let’s use Puppeteer to launch a headless Chrome instance and retrieve content from the same page we’ve used before:
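A minimal sketch of that step, assuming a recent version of PuppeteerSharp (the BrowserFetcher call downloads a compatible Chromium build on first run and caches it afterwards):

```csharp
using PuppeteerSharp;

// Download a compatible Chromium build on first run (cached afterwards)
await new BrowserFetcher().DownloadAsync();

// Launch a headless Chrome instance and navigate to the page
await using var browser = await Puppeteer.LaunchAsync(new LaunchOptions { Headless = true });
await using var page = await browser.NewPageAsync();
await page.GoToAsync("https://www.scrapethissite.com/pages/simple/");

// Read the fully rendered HTML of the page
var content = await page.GetContentAsync();
```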
This time, instead of simply sending a request to the remote host, we’ve launched an entire Chrome instance, pointed it to the address we’ve specified, and had it read the content of the page for us.
The HTML of the web page is now sitting in the content variable. Fun fact: if you change Headless = true to Headless = false and run the code again, you can watch the whole browser start up and navigate by itself. Try it out!
Now, we do have some content, but it’s still fairly unstructured. We certainly wouldn’t want to fill up our database with raw HTML, so we’ll need to do some processing before we can actually use the data.
The first thing that may come to mind might be "Regular Expressions!". That would make sense, except HTML is anything but regular, and this post does a better job of explaining why you shouldn't parse HTML with regular expressions than I ever could.
What we can do instead is use a library called AngleSharp to help us parse the HTML efficiently. An alternative, slightly more conservative pick is HtmlAgilityPack. Both work just fine, but since AngleSharp is a bit more modern, we'll give it a spin:
dotnet add package AngleSharp
Now the first thing AngleSharp will need is a browsing context, and a document it can populate:
var context = BrowsingContext.New(Configuration.Default);
var document = await context.OpenAsync(req => req.Content(content));
In this context, content is the raw HTML in the string variable we captured earlier.
Now we have a document containing all the raw HTML, structured by the nodes of the original page. We can search through these nodes to find the exact piece of information we need.
Let’s see if we can structure this page full of countries.
First off, we’ll need to know what we’re looking for. If we use our Chrome Dev Tools to inspect an element (CTRL / CMD + Shift + C):
We can see that the element we're after is a div with the classes col-md-4 country — that shouldn't be too hard to find!
Using some simple LINQ, we can obtain all the elements that match this class:
var countries = document.QuerySelectorAll("*")
    .Where(e => e.LocalName == "div" && e.ClassName == "col-md-4 country")
    .ToList();
I’m fully aware this can be done by using a QuerySelector as well, but we’re keeping it very simple and C#-y for now.
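For reference, the CSS-selector version would be a one-liner, since AngleSharp supports full CSS selector syntax:

```csharp
// Same result via a CSS selector matching both classes
var countries = document.QuerySelectorAll("div.col-md-4.country").ToList();
```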
Running this code returns 250 elements — the exact number of countries listed on this page! We must be onto something. Now let’s get to extracting this information.
Let's get started by creating a record to hold our country information. We know the fields ahead of time:
public record Country(string Name, string Capital, int Population, int Area);
Right, all good. Now comes the juicy part, and honestly, this is the part where C# does feel a little bit rough around the edges.
In order to extract the data, we have to apply some extremely rudimentary string splitting magic:
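A sketch of that string splitting, assuming each country element's TextContent keeps the name and the "Capital:", "Population:" and "Area (km2):" fields on separate lines (the variable name parsedCountries is mine; check the page's actual text layout before relying on the indices):

```csharp
using System.Globalization;
using System.Linq;

// Assumption: TextContent yields one line per field, e.g.
//   Andorra
//   Capital: Andorra la Vella
//   Population: 84000
//   Area (km2): 468.0
var parsedCountries = countries.Select(element =>
{
    var lines = element.TextContent
        .Split('\n', StringSplitOptions.RemoveEmptyEntries | StringSplitOptions.TrimEntries);

    return new Country(
        Name: lines[0],
        Capital: lines[1].Split(": ")[1],
        Population: int.Parse(lines[2].Split(": ")[1]),
        // Areas contain decimals, so parse as double and truncate to fit the int field
        Area: (int)double.Parse(lines[3].Split(": ")[1], CultureInfo.InvariantCulture));
}).ToList();
```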
It may not be the prettiest, but it gets the job done.
We’ve built a very, very rudimentary scraper that uses a headless Chrome instance to visit a page, returns the raw HTML and parses it into a format we can use. Here’s the final code:
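Pieced together from the steps above, a sketch of the whole program could look like this (assumes a .NET 6+ console project with the PuppeteerSharp and AngleSharp packages installed; the extraction indices depend on the page's exact text layout):

```csharp
using System;
using System.Globalization;
using System.Linq;
using System.Threading.Tasks;
using AngleSharp;
using PuppeteerSharp;

public record Country(string Name, string Capital, int Population, int Area);

public static class Program
{
    public static async Task Main()
    {
        // 1. Fetch the rendered HTML with a headless Chromium instance
        await new BrowserFetcher().DownloadAsync();
        await using var browser = await Puppeteer.LaunchAsync(new LaunchOptions { Headless = true });
        await using var page = await browser.NewPageAsync();
        await page.GoToAsync("https://www.scrapethissite.com/pages/simple/");
        var content = await page.GetContentAsync();

        // 2. Parse the raw HTML into a queryable document
        var context = BrowsingContext.New(Configuration.Default);
        var document = await context.OpenAsync(req => req.Content(content));

        // 3. Extract the country data (assumes one field per line of text)
        var countries = document.QuerySelectorAll("*")
            .Where(e => e.LocalName == "div" && e.ClassName == "col-md-4 country")
            .Select(e =>
            {
                var lines = e.TextContent
                    .Split('\n', StringSplitOptions.RemoveEmptyEntries | StringSplitOptions.TrimEntries);
                return new Country(
                    Name: lines[0],
                    Capital: lines[1].Split(": ")[1],
                    Population: int.Parse(lines[2].Split(": ")[1]),
                    Area: (int)double.Parse(lines[3].Split(": ")[1], CultureInfo.InvariantCulture));
            })
            .ToList();

        foreach (var country in countries)
            Console.WriteLine(country);
    }
}
```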
Take it, and see if you can adapt it to scrape a page you need information from.
If dealing with headless browsers, proxies, and CAPTCHAs doesn’t seem very appealing to you, there are services out there such as ScrapeShark that deal with all that stuff for you. Full disclosure: I am ScrapeShark’s creator, and it is available for free.
Building scrapers can seem daunting and time-consuming at first, but in the long run they’ll save you an infinite amount of time while you run your business!