SECRET OF CSS

A Step-By-Step Tutorial to Web Scraping in PHP


From basic to advanced techniques through a complete example

Photo by Ross Sneddon on Unsplash

Web scraping’s become increasingly popular and is now a trending topic in the IT community. As a result, several libraries help you scrape data from a website. Here, you’ll learn how to build a web scraper in PHP using one of the most popular web scraping libraries.

In this tutorial, you’ll start with the basics of web scraping in PHP. Then, you’ll see how to get around the most popular anti-scraping systems and explore more advanced techniques and concepts, such as parallel scraping and headless browsers.

Follow this tutorial and become an expert in web scraping with PHP! Let’s not waste more time and build our first scraper in PHP.

This is the list of prerequisites you need for the simple scraper to work:

  • PHP
  • Composer

If you don’t have these installed on your system, you can download them from their official websites.

Then, you also require the following Composer library, and you can add this to your project’s dependencies with the following command:

composer require voku/simple_html_dom

Also, you’ll need the built-in cURL PHP library. cURL comes with the curl-ext PHP extension, which is present and enabled by default in most PHP packages. If the PHP package you installed doesn’t include curl-ext, you can install it as explained here.

Let’s now learn more about the dependencies mentioned here.

voku/simple_html_dom is a fork of the Simple HTML DOM Parser project that replaces string manipulation with DOMDocument and other modern PHP classes. With nearly two million installs, voku/simple_html_dom is a fast, reliable, and simple library for parsing HTML documents and performing web scraping in PHP.

curl-ext is a PHP extension that enables the cURL HTTP client in PHP, which allows you to perform HTTP requests in PHP.

You can find the code of the demo web scraper in this GitHub repo. Clone it and install the project’s dependencies with the following commands:

git clone https://github.com/Tonel/simple-scraper-php
cd simple-scraper-php
composer update

Follow this tutorial and learn how to build a web scraper app in PHP!

Here, you are going to see how to perform web scraping on https://scrapeme.live/shop/, a website designed as a scraping target.

In detail, this is what the shop looks like:

A general view of scrapeme.live/shop

As you can see, scrapeme.live is nothing more than a simple paginated list of Pokemon-inspired products. Let’s build a simple web scraper in PHP that crawls the website and scrapes data from all these products.

First, you must download the HTML of the page you want to scrape. You can easily download an HTML document in PHP with cURL as follows:
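As a sketch, the download step might look like this (error handling is omitted for brevity):

```php
<?php
// initialize a cURL session for the target page
$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, "https://scrapeme.live/shop/");
// return the response as a string instead of printing it
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
// follow HTTP redirects, if any
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);

// perform the GET request and store the HTML document
$html = curl_exec($curl);
curl_close($curl);
```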

You now have the HTML of the https://scrapeme.live/shop/ page stored in the $html variable. Load it into an HtmlDomParser instance with the str_get_html() function as below:
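A minimal sketch of this step, assuming the library is loaded through Composer’s autoloader:

```php
<?php
require_once __DIR__ . "/vendor/autoload.php";

use voku\helper\HtmlDomParser;

// parse the HTML string downloaded with cURL into a traversable DOM object
$dom = HtmlDomParser::str_get_html($html);
```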

You can now use HtmlDomParser to browse the DOM of the HTML page and start the data extraction.

Let’s now retrieve the list of all pagination links to crawl the entire website section. Right-click the pagination number HTML element and select the “Inspect” option.

Selecting the “Inspect” option to open the DevTools window

At this point, the browser should open a DevTools window or section with the DOM element highlighted, as below:

The DevTools Window after selecting a pagination number HTML element

In the DevTools window, you can see that the page-numbers CSS class identifies the pagination HTML elements. Note that a CSS class does not uniquely identify an HTML element, and many nodes could have the same class. That is precisely what happens with page-numbers on the scrapeme.live page.

Therefore, if you want to use a CSS selector to pick the elements in the DOM, you should use the CSS class along with other selectors. In particular, you can use HtmlDomParser with the .page-numbers a CSS selector to select all the pagination HTML elements on the page. Then, iterate through them to extract all the required URLs from the href attribute as follows:
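A sketch of this extraction logic might look as follows (the in_array() check is one simple way to skip the duplicated pagination bar):

```php
<?php
$paginationLinks = [];

// select all pagination link elements with a CSS selector
foreach ($dom->find(".page-numbers a") as $paginationElement) {
    // extract the target URL from the href attribute
    $paginationLink = $paginationElement->getAttribute("href");
    // skip duplicates: the pagination bar appears twice on the page
    if (!in_array($paginationLink, $paginationLinks)) {
        $paginationLinks[] = $paginationLink;
    }
}

print_r($paginationLinks);
```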

Note that the find() function allows you to extract DOM elements based on a CSS selector. Also, since the pagination element appears twice on the web page, you need to define custom logic to avoid duplicate entries in the $paginationLinks array.

If executed, this script would return:

Array (   
[0] => https://scrapeme.live/shop/page/2/
[1] => https://scrapeme.live/shop/page/3/
[2] => https://scrapeme.live/shop/page/4/
[3] => https://scrapeme.live/shop/page/46/
[4] => https://scrapeme.live/shop/page/47/
[5] => https://scrapeme.live/shop/page/48/
)

As shown, all URLs follow the same structure and are characterized by a final number that specifies the pagination number. If you want to iterate over all pages, you only need the number associated with the last page. Retrieve it as follows:
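One way to sketch this step is to parse the trailing number out of each URL and keep the maximum (the sample $paginationLinks array below stands in for the one extracted from the page):

```php
<?php
// sample pagination links, standing in for the scraped ones
$paginationLinks = [
    "https://scrapeme.live/shop/page/2/",
    "https://scrapeme.live/shop/page/3/",
    "https://scrapeme.live/shop/page/48/",
];

$highestPaginationNumber = 0;
foreach ($paginationLinks as $paginationLink) {
    // the page number is the last path segment of each pagination URL
    if (preg_match('@/page/(\d+)/?$@', $paginationLink, $matches)) {
        $highestPaginationNumber = max($highestPaginationNumber, (int) $matches[1]);
    }
}

echo $highestPaginationNumber; // 48
```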

$highestPaginationNumber will contain “48”.

Now, let’s retrieve the data associated with a single product. Again, right-click on a product and open the DevTools window with the “Inspect” option. This is what you should get:

The DevTools window after selecting a product HTML element

As you can see, a product consists of a li.product HTML element containing a URL, an image, a name, and a price. This product information is placed in a, img, h2, and span HTML elements, respectively. You can extract this data with HtmlDomParser as below:
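A sketch of the extraction logic, based on the selectors just identified:

```php
<?php
$productDataList = [];

// select each product card on the page
foreach ($dom->find("li.product") as $product) {
    // extract the product data from the child elements described above
    $productDataList[] = [
        "url"   => $product->findOne("a")->getAttribute("href"),
        "image" => $product->findOne("img")->getAttribute("src"),
        "name"  => $product->findOne("h2")->text(),
        "price" => $product->findOne("span")->text(),
    ];
}
```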

This logic extracts all product data on one page and saves it in the $productDataList array.

Now, you only have to iterate over each page and apply the scraping logic defined above:
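Putting the pieces together, the crawling loop might be sketched like this (it repeats the cURL download and parsing steps shown above for each page):

```php
<?php
require_once __DIR__ . "/vendor/autoload.php";

use voku\helper\HtmlDomParser;

$productDataList = [];

for ($page = 1; $page <= $highestPaginationNumber; $page++) {
    // download the current page with cURL
    $curl = curl_init();
    curl_setopt($curl, CURLOPT_URL, "https://scrapeme.live/shop/page/$page/");
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    $html = curl_exec($curl);
    curl_close($curl);

    // parse it and apply the product scraping logic defined above
    $dom = HtmlDomParser::str_get_html($html);
    foreach ($dom->find("li.product") as $product) {
        $productDataList[] = [
            "url"   => $product->findOne("a")->getAttribute("href"),
            "image" => $product->findOne("img")->getAttribute("src"),
            "name"  => $product->findOne("h2")->text(),
            "price" => $product->findOne("span")->text(),
        ];
    }
}
```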

Et voilà! You just learned how to build a simple web scraper in PHP!

If you want to look at the script’s entire code, you can find it here. Run it, and it’ll retrieve all the product data.

Congratulations! You just extracted all the product data automatically!

The example above uses a website designed for scraping. Extracting all data was a piece of cake, but don’t be fooled by this! Scraping a website is not always that easy, and your script may be intercepted and blocked. Find out how to prevent this from happening!

There are several possible defensive mechanisms to prevent scripts from accessing a website. These techniques try to recognize requests coming from non-human or malicious users based on their behavior and block them consequently.

Bypassing all these anti-scraping systems isn’t always easy. However, you can usually avoid most of them with two simple solutions: common HTTP headers and web proxies. Let’s now take a closer look at these two approaches.

1. Using common HTTP headers to simulate a real user

Many websites block requests that don’t appear to come from real users. Browsers, on the other hand, always set certain HTTP headers, whose exact values change from vendor to vendor. Anti-scraping systems expect these headers to be present, so you can avoid blocks by setting the appropriate HTTP headers.

Specifically, the most critical header you should always set is the User-Agent header (henceforth, UA). It is a string that identifies the application, operating system, vendor, and/or application version from which the HTTP request originates.

By default, cURL sends the curl/XX.YY.ZZ UA header, which makes the request easily identifiable as a script. You can manually set the UA header with cURL as follows:

Example:

curl_setopt($curl, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36");

This line of code sets the UA currently used by the latest version of Google Chrome. It makes the cURL requests harder to recognize as coming from a script.

You can easily find a list of valid, up-to-date, and trusted UA headers online. In most cases, setting the HTTP UA header is enough to avoid being blocked. If this isn’t enough, you can send other HTTP headers with cURL as follows:

Example:

// set the Content-Language and Authorization HTTP headers
curl_setopt($curl, CURLOPT_HTTPHEADER,
    array(
        "Content-Language: es",
        "Authorization: 32b108le1HBuSYHMuAcCrIjW72UTO3p5X78iIzq1CLuiHKgJ8fB2VdfmcS",
    )
);

2. Using web proxies to hide your IP

Anti-scraping systems tend to block users who visit many pages in a short amount of time. The primary check looks at the IP the requests come from: if the same IP makes many requests in a short time, it gets blocked. In other words, to prevent blocks on an IP, you must find a way to hide it.

One of the best ways to do it is through a proxy server. A web proxy is an intermediary server between your machine and the rest of the computers on the Internet. When performing requests through a proxy, the target website will see the IP address of the proxy server instead of yours.

Several free proxies are available online, but most are short-lived, unreliable, and often unavailable. You can use them for testing. However, you shouldn’t rely on them for a production script.

On the other hand, paid proxy services are more reliable and generally come with IP rotation. This means that the IP exposed by the proxy server changes frequently over time or with each request. This makes it harder for any single IP offered by the service to get banned, and even if that happens, you quickly get a new one.

You can set a web proxy with cURL as follows:

Example:

curl_setopt($curl, CURLOPT_PROXY, "102.68.128.214");
curl_setopt($curl, CURLOPT_PROXYPORT, "8080");
curl_setopt($curl, CURLOPT_PROXYTYPE, CURLPROXY_HTTP);

With most web proxies, setting the URL of the proxy in the first line is enough. CURLOPT_PROXYTYPE can take the following values: CURLPROXY_HTTP (default), CURLPROXY_SOCKS4, CURLPROXY_SOCKS5, CURLPROXY_SOCKS4A, or CURLPROXY_SOCKS5_HOSTNAME.

You just learned how to avoid being blocked. Let’s now dig into how to make your script faster!

Dealing with multi-threading in PHP is complex. Several libraries can support you, but the simplest and most effective solution to perform parallel scraping in PHP does not require any.

This approach to parallel scraping is to make the scraping script ready to run in multiple instances at once, which you can achieve with HTTP GET parameters.

Consider the paging example presented earlier. Instead of having a script that iterates over all pages, you can modify the script to work on smaller chunks and then launch several instances of the script in parallel.

All you have to do is pass some parameters to the script to define the boundaries of the chunk.

You can easily accomplish this by introducing two GET parameters as below:
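A sketch of how the script might read these boundaries (the from and to parameter names match the query strings used by this tutorial’s script; the default values are illustrative):

```php
<?php
// read the chunk boundaries from the query string,
// falling back to the full page range when they are missing
$from = isset($_GET["from"]) ? (int) $_GET["from"] : 1;
$to   = isset($_GET["to"])   ? (int) $_GET["to"]   : 48;

for ($page = $from; $page <= $to; $page++) {
    // apply the scraping logic defined earlier to each page in the chunk
    $pageUrl = "https://scrapeme.live/shop/page/$page/";
}
```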

Now, you can launch several instances of the script by opening these links in your browser:

https://your-domain.com/scripts/scrapeme.live/scrape-products.php?from=1&to=5
https://your-domain.com/scripts/scrapeme.live/scrape-products.php?from=6&to=10
...
https://your-domain.com/scripts/scrapeme.live/scrape-products.php?from=41&to=45
https://your-domain.com/scripts/scrapeme.live/scrape-products.php?from=46&to=48

These instances will run in parallel and scrape the website simultaneously. You can find the entire code of this new version of the scraping script here.

And there you have it! You have just learned how to extract data from a website in parallel through web scraping.

Scraping a website in parallel is already a great improvement, but there are many other advanced techniques you can adopt in your PHP web scraper. Let’s find out how to take your web scraping script to the next level.

Remember that not all data of interest on a web page are directly displayed in the browser. A web page also consists of metadata and hidden elements. To access this data, right-click on an empty section of the web page and click on “View Page Source.”

The source code of scrapeme.live/shop

Here you can see the full DOM of a web page, including hidden elements. In detail, you can find metadata about the web page in the meta HTML tags. In addition, important hidden data may be stored in <input type="hidden"/> elements.

Similarly, some data may be already present on the page via hidden HTML elements. And it’s shown by JavaScript only when a particular event occurs. Even though you cannot see the data on the page, it is still part of the DOM. Therefore, you can retrieve these hidden HTML elements with HtmlDomParser as you would with visible nodes.
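For example, here is a self-contained sketch of selecting hidden inputs (the markup is illustrative):

```php
<?php
require_once __DIR__ . "/vendor/autoload.php";

use voku\helper\HtmlDomParser;

// illustrative markup containing a hidden input
$html = '<form><input type="hidden" name="token" value="abc123"/></form>';
$dom = HtmlDomParser::str_get_html($html);

// hidden inputs are part of the DOM even though they aren't rendered,
// so you can select them just like visible nodes
foreach ($dom->find("input[type=hidden]") as $hiddenInput) {
    echo $hiddenInput->getAttribute("name") . ": " . $hiddenInput->getAttribute("value") . "\n";
}
```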

Also, remember that a web page is more than its source code. Web pages can request the browser to retrieve data asynchronously via AJAX and update their DOM accordingly. These AJAX calls generally provide valuable data, and you might need to call them from your web scraping script.

To sniff these calls, you need to use the DevTools window of your browser. Right-click on a blank section of the website, select “Inspect,” and reach the “Network” tab. In the “Fetch/XHR” tab, you can see the list of AJAX calls performed by the web page, as in the example below.

The POST AJAX call performed by a demo page

Explore all the internal tabs of the selected AJAX request to understand how to perform the AJAX call. Specifically, you can replicate a POST AJAX call with cURL as below:
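A sketch of such a POST request (the endpoint, payload, and headers below are placeholders for the values sniffed in the DevTools window):

```php
<?php
$curl = curl_init();
// placeholder endpoint: use the URL sniffed in the "Network" tab
curl_setopt($curl, CURLOPT_URL, "https://your-target-domain.com/ajax-endpoint");
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
// switch the request method to POST and set the request body
curl_setopt($curl, CURLOPT_POST, true);
curl_setopt($curl, CURLOPT_POSTFIELDS, http_build_query(["key" => "value"]));
// replicate the headers observed in the sniffed request
curl_setopt($curl, CURLOPT_HTTPHEADER, ["X-Requested-With: XMLHttpRequest"]);

$response = curl_exec($curl);
curl_close($curl);
```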

Congratulations! You just performed a POST call with cURL!

Sniffing and replicating AJAX calls is helpful for programmatically retrieving data from a website that is loaded due to user interaction. This data isn’t part of the source code of a web page and can’t be found in the HTML element obtained from a standard GET cURL request.

However, replicating all possible interactions, sniffing the AJAX calls, and calling them in your script is a cumbersome approach. Sometimes, you need to define a script that can interact with the page via JavaScript as a human user would. You can achieve this with a headless browser.

If you aren’t familiar with this concept, a headless browser is a web browser without a graphical user interface that offers automated control of a web page via code. The most popular libraries in PHP providing headless browser functionality are chrome-php and Selenium WebDriver.
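As an example, here is a minimal sketch using the chrome-php/chrome library (installed with composer require chrome-php/chrome; it requires a local Chrome/Chromium binary):

```php
<?php
require_once __DIR__ . "/vendor/autoload.php";

use HeadlessChromium\BrowserFactory;

// start a headless Chrome instance
$browserFactory = new BrowserFactory();
$browser = $browserFactory->createBrowser();

try {
    // open a page, let its JavaScript run, and grab the rendered HTML
    $page = $browser->createPage();
    $page->navigate("https://scrapeme.live/shop/")->waitForNavigation();
    $html = $page->getHtml();
} finally {
    $browser->close();
}
```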

Other useful PHP libraries you might adopt when it comes to web scraping are:

  • Guzzle: an advanced HTTP client that makes it easy to send HTTP requests and trivial to integrate with web services. You can use it as an alternative to cURL.
  • Goutte: a web scraping library that provides an advanced API for crawling websites and extracting data from their HTML web pages. Since it also includes an HTTP client, you can use it as an alternative to both voku/simple_html_dom and cURL.

Here, you’ve learned everything you should know about performing web scraping in PHP, from basic crawling to advanced techniques. As shown above, building a web scraper in PHP that can crawl a website and automatically extract data isn’t that difficult.

All you need are the correct libraries, and here we have looked at some of the most popular ones.

Also, your web scraper should be able to bypass anti-scraping systems and may have to retrieve hidden data or interact with the web page like a human user. In this tutorial, you learned how to do all this as well.


