Ultimate Web Scraper Toolkit Documentation

The Ultimate Web Scraper Toolkit is a powerful set of tools designed to handle all of your web scraping needs on nearly all web hosts. This toolkit makes RFC-compliant web requests that are indistinguishable from those of a real web browser, provides a web browser-like state engine for handling cookies and redirects, and includes a full cURL emulation layer for web hosts without the PHP cURL extension installed. A powerful tag filtering library (TagFilter) is included to easily extract and/or convert the desired content from each retrieved document.

This toolkit even comes with classes for creating custom web servers and WebSocket servers. That custom API you want the average person to install on their home computer or roll out to devices in the enterprise just became easier to deploy.

While this toolkit makes it really easy to scrape just about any content from the web, please don't do anything illegal. There is this little thing called copyright law that most countries have to protect various works.

Features

The following are a few features of the Ultimate Web Scraper Toolkit:

And much more.

License

The Ultimate Web Scraper Toolkit is extracted from Barebones CMS and the license is also your pick of MIT or LGPL. The license and restrictions are identical to the Barebones CMS License.

If you find the Ultimate Web Scraper Toolkit useful, financial donations are sincerely appreciated and go towards future development efforts.

Download

Ultimate Web Scraper Toolkit 1.0RC17 is the seventeenth release candidate of the Ultimate Web Scraper Toolkit.

Download ultimate-web-scraper-1.0rc17.zip

If you find the Ultimate Web Scraper Toolkit useful, please donate toward future development efforts.

Installation

Installing the Ultimate Web Scraper Toolkit is easy. The installation procedure is as follows:

Installation is easy. Using the toolkit is a bit more difficult.

Upgrading

Like Barebones CMS, upgrading the Ultimate Web Scraper Toolkit is easy - just upload the new files to the server and overwrite existing files.

Scraping Webpages - The Easy Way

Webpages are hard to retrieve and even harder to parse, and doing both consistently across a wide variety of web hosts and scenarios is very difficult on your own. The Ultimate Web Scraper Toolkit makes both retrieving and parsing webpages a whole lot easier.

Example usage:

<?php
	require_once "support/web_browser.php";
	require_once "support/tag_filter.php";

	// Retrieve the standard HTML parsing array for later use.
	$htmloptions = TagFilter::GetHTMLOptions();

	// Retrieve a URL.
	$url = "http://www.somesite.com/something/";
	$web = new WebBrowser();
	$result = $web->Process($url);

	// Check for connectivity and response errors.
	if (!$result["success"])  echo "Error retrieving URL.  " . $result["error"] . "\n";
	else if ($result["response"]["code"] != 200)  echo "Error retrieving URL.  Server returned:  " . $result["response"]["code"] . " " . $result["response"]["meaning"] . "\n";
	else
	{
		$baseurl = $result["url"];

		// Use TagFilter to parse the content.
		$html = TagFilter::Explode($result["body"], $htmloptions);

		// Find all anchor tags.
		echo "All the URLs:\n";
		$result2 = $html->Find("a[href]");
		if (!$result2["success"])  echo "Error parsing/finding URLs.  " . $result2["error"] . "\n";
		else
		{
			foreach ($result2["ids"] as $id)
			{
				// Fast direct access.
				echo "\t" . $html->nodes[$id]["attrs"]["href"] . "\n";
				echo "\t" . HTTP::ConvertRelativeToAbsoluteURL($baseurl, $html->nodes[$id]["attrs"]["href"]) . "\n";
			}
		}

		// Find all table rows that have 'th' tags.
		// The 'tr' tag IDs are returned.
		$result2 = $html->Filter($html->Find("tr"), "th");
		if (!$result2["success"])  echo "Error parsing/finding table rows.  " . $result2["error"] . "\n";
		else
		{
			foreach ($result2["ids"] as $id)
			{
				echo "\t" . $html->GetOuterHTML($id) . "\n\n";
			}
		}
	}
?>

Example object-oriented usage:

<?php
	require_once "support/web_browser.php";
	require_once "support/tag_filter.php";

	// Retrieve the standard HTML parsing array for later use.
	$htmloptions = TagFilter::GetHTMLOptions();

	// Retrieve a URL.
	$url = "http://www.somesite.com/something/";
	$web = new WebBrowser();
	$result = $web->Process($url);

	// Check for connectivity and response errors.
	if (!$result["success"])  echo "Error retrieving URL.  " . $result["error"] . "\n";
	else if ($result["response"]["code"] != 200)  echo "Error retrieving URL.  Server returned:  " . $result["response"]["code"] . " " . $result["response"]["meaning"] . "\n";
	else
	{
		$baseurl = $result["url"];

		// Use TagFilter to parse the content.
		$html = TagFilter::Explode($result["body"], $htmloptions);

		// Retrieve a pointer object to the root node.
		$root = $html->Get();

		// Find all anchor tags.
		echo "All the URLs:\n";
		$rows = $root->Find("a[href]");
		foreach ($rows as $row)
		{
			// Somewhat slower access.
			echo "\t" . $row->href . "\n";
			echo "\t" . HTTP::ConvertRelativeToAbsoluteURL($baseurl, $row->href) . "\n";
		}

		// Find all table rows that have 'th' tags.
		$rows = $root->Find("tr")->Filter("th");
		foreach ($rows as $row)
		{
			echo "\t" . $row->GetOuterHTML() . "\n\n";
		}
	}
?>

These brief examples retrieve a URL while emulating some flavor of Firefox, display the value of the 'href' attribute of every anchor tag that has an 'href' attribute, and find all table rows that contain 'th' tags. In addition, because the WebBrowser class was used, the code automatically handles HTTP cookies and redirects internally.

You'll get lots of mileage out of HTTP::ExtractURL(), HTTP::CondenseURL(), HTTP::ConvertRelativeToAbsoluteURL(), and other useful functions when extracting content from an HTML page and processing server responses.
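
For example, HTTP::ExtractURL() breaks a URL into its component parts and HTTP::CondenseURL() reassembles them. A minimal sketch, assuming CondenseURL() accepts the array that ExtractURL() returns (the example URL and relative path are made up):

<?php
	require_once "support/http.php";

	// Break a URL into its component parts for inspection or modification.
	$url = "http://www.somesite.com/something/?page=2#results";
	$parts = HTTP::ExtractURL($url);
	var_dump($parts);

	// Reassemble the parts into a URL string.
	echo HTTP::CondenseURL($parts) . "\n";

	// Resolve a relative link found in a page against the page's URL.
	echo HTTP::ConvertRelativeToAbsoluteURL($url, "../images/logo.png") . "\n";
?>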

See the following for in-depth documentation and extensive examples on performing document retrieval and extracting content with TagFilter: WebBrowser classes documentation, TagFilter classes documentation, and HTTP class documentation.

Scraping Webpages - The Hard Way

The previous example used the web browser emulation layer (WebBrowser) to retrieve the content. Sometimes getting into the nitty-gritty details of constructing a web request is the desired option (but only in extremely rare situations).

Example:

<?php
	require_once "support/http.php";
	require_once "support/tag_filter.php";

	// Retrieve the standard HTML parsing array for later use.
	$htmloptions = TagFilter::GetHTMLOptions();

	$url = "http://www.somesite.com/something/";
	$options = array(
		"headers" => array(
			"User-Agent" => HTTP::GetWebUserAgent("Firefox"),
			"Accept" => "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
			"Accept-Language" => "en-us,en;q=0.5",
			"Accept-Charset" => "ISO-8859-1,utf-8;q=0.7,*;q=0.7",
			"Cache-Control" => "max-age=0"
		)
	);
	$result = HTTP::RetrieveWebpage($url, $options);
	if (!$result["success"])  echo "Error retrieving URL.  " . $result["error"] . "\n";
	else if ($result["response"]["code"] != 200)  echo "Error retrieving URL.  Server returned:  " . $result["response"]["code"] . " " . $result["response"]["meaning"] . "\n";
	else
	{
		// Use TagFilter to parse the content.
		$html = TagFilter::Explode($result["body"], $htmloptions);

		// Find all anchor tags.
		echo "All the URLs:\n";
		$result2 = $html->Find("a[href]");
		if (!$result2["success"])  echo "Error parsing/finding URLs.  " . $result2["error"] . "\n";
		else
		{
			foreach ($result2["ids"] as $id)
			{
				// Fast direct access.
				echo "\t" . $html->nodes[$id]["attrs"]["href"] . "\n";
			}
		}

		// Find all table rows that have 'th' tags.
		// The 'tr' tag IDs are returned.
		$result2 = $html->Filter($html->Find("tr"), "th");
		if (!$result2["success"])  echo "Error parsing/finding table rows.  " . $result2["error"] . "\n";
		else
		{
			foreach ($result2["ids"] as $id)
			{
				echo "\t" . $html->GetOuterHTML($id) . "\n\n";
			}
		}
	}
?>

This example performs the same operation as the previous section, but doesn't get all the benefits of the web browser emulation layer such as automatically handling redirects and cookies. You should, in general, prefer using the WebBrowser class.

See the HTTP class documentation for more in-depth details and examples.

Handling HTML Forms

Traditionally, one of the hardest things to handle with web scraping is the classic HTML form. If you are like me, then you've generally just faked it and handled form submissions by bypassing the form itself (i.e. manually copying variable names). The problem is that if/when the server side changes how it does things, the old form submission code will tend to break in spectacular ways. This toolkit includes several functions designed to make real form handling a walk in the park.

Example:

<?php
	require_once "support/http.php";
	require_once "support/web_browser.php";
	require_once "support/simple_html_dom.php";

	$url = "https://www.google.com/";
	$web = new WebBrowser(array("extractforms" => true));
	$result = $web->Process($url);

	if (!$result["success"])  echo "Error retrieving URL.  " . $result["error"] . "\n";
	else if ($result["response"]["code"] != 200)  echo "Error retrieving URL.  Server returned:  " . $result["response"]["code"] . " " . $result["response"]["meaning"] . "\n";
	else if (count($result["forms"]) != 1)  echo "Was expecting one form.  Received:  " . count($result["forms"]) . "\n";
	else
	{
		$form = $result["forms"][0];

		$form->SetFormValue("q", "barebones cms");

		$result2 = $form->GenerateFormRequest("btnK");
		$result = $web->Process($result2["url"], "auto", $result2["options"]);

		if (!$result["success"])  echo "Error retrieving URL.  " . $result["error"] . "\n";
		else if ($result["response"]["code"] != 200)  echo "Error retrieving URL.  Server returned:  " . $result["response"]["code"] . " " . $result["response"]["meaning"] . "\n";
		else
		{
			// Do something with the results page here...
		}
	}
?>

This example retrieves Google's homepage, extracts the search form, modifies the search field, generates and submits the next request, and gets the response. All of that in just a few lines of code.

Note that, like the rest of the WebBrowser class, the form handler doesn't process Javascript. Very few sites actually need Javascript. For those rare, broken websites that need Javascript for the form on the page to function, I'm usually able to get away with a quick regular expression or two to pull the necessary information from the body content.
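
For example, suppose a page's Javascript injects a token that the form submission needs. A quick preg_match() against the raw body content can recover it. This is only a hypothetical sketch - the URL, variable name, and regular expression below are made up:

<?php
	require_once "support/web_browser.php";

	$url = "http://www.somesite.com/login/";
	$web = new WebBrowser();
	$result = $web->Process($url);

	if ($result["success"] && $result["response"]["code"] == 200)
	{
		// The page's Javascript would normally copy this token into a hidden form field.
		// Pull it straight out of the raw body content instead.
		if (preg_match('/var\s+csrftoken\s*=\s*"([^"]+)"/', $result["body"], $matches))
		{
			$csrftoken = $matches[1];

			// Include $csrftoken in the form submission or "postvars" as the server expects.
		}
	}
?>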

Handling Pagination

There is a common pattern in the scraping world: Pagination. This is most often seen when submitting a form and the request is passed off to a basic search engine that usually returns anywhere from 10 to 50 results.

Unfortunately, you need all 8,946 results for the database you are constructing. There are two ways to handle the scenario: Fake it or follow the links/buttons.

Let's look at "faking it" as doing so eliminates the need to handle pagination in the first place. What do I mean by this? Well, a lot of GET/POST requests in pagination scenarios pass along the "page size" to the server. Let's say you are getting back 50 results but the number '50' in a size attribute is also being sent to the server either on the first page or subsequent pages in a regular web browser. Well, what happens if you pass along the value '10000' for the page size instead? A lot, and by a lot, I mean a LOT, of server-side web facing software assumes it will only be passed the page size values provided in some client-side select box. Therefore, the server-side just casts the submitted value to an integer and passes it along to the database AND does all of its pagination calculations from that submitted value. In my experience, faking it works about 85% of the time and all of the desired server-side data can be retrieved with just one request. It's useful to note that frequently if the page size is not in the first page of search results, page 2 of those search results will generally reveal what parameter is used for page size. The ability to fake it on such a broad scale just goes to show that writing a functional search engine is a difficult task for a lot of developers.

But what if faking it doesn't work? You might encounter server-side software that can't handle processing/returning that much data and returns an error - for example, through experimentation, you discover that the server fails to return more than 3,000 rows at a time, but that's still significantly more than 50 rows at a time. Or the developer wrote their code assuming that their data might get scraped and enforces the upper limit on the page size anyway. Doing so just hurts them more than anything else, as you'll end up using more of their system resources to retrieve the same amount of data. Regardless, if you can't get it all at once, you'll have to resort to pagination at whatever limit is imposed by the server. If the requests are just URL-based, then pagination is nothing more than a glorified URL loop. Personally, I break the problem down into whether I am doing a short-term or a long-term scraping task.

By short-term scraping, I mean a script that pulls data one time and then won't be used again or will only be used rarely. In such cases, building the GET request URL by hand is the simplest and quickest way to go: enter a loop, increment the page number in the URL on each pass, and break out of the loop when the server stops returning rows of data.

Example:

<?php
	require_once "support/web_browser.php";
	require_once "support/tag_filter.php";

	// Retrieve the standard HTML parsing array for later use.
	$htmloptions = TagFilter::GetHTMLOptions();

	$web = new WebBrowser();

	$page = 1;
	do
	{
		// Retrieve a URL.
		$url = "http://www.somesite.com/something/?p=" . $page . "&s=50";

		$retries = 3;
		do
		{
			$result = $web->Process($url);
			$retries--;
			if (!$result["success"])  sleep(1);
		} while (!$result["success"] && $retries > 0);

		// Check for connectivity and response errors.
		if (!$result["success"])
		{
			echo "Error retrieving URL.  " . $result["error"] . "\n";
			
			exit();
		}

		if ($result["response"]["code"] != 200)
		{
			echo "Error retrieving URL.  Server returned:  " . $result["response"]["code"] . " " . $result["response"]["meaning"] . "\n";
			
			exit();
		}

		// Use TagFilter to parse the content.
		$html = TagFilter::Explode($result["body"], $htmloptions);

		// Retrieve a pointer object to the root node.
		$root = $html->Get();

		$found = false;
		// Attempt to extract information.
		// ... set $found to true if there is at least one row of data.

		$page++;
	} while ($found);
?>

For long-term scraping where the scraper runs on a regular schedule (e.g. cron), you will want something more robust and that means following links (or buttons) just like a regular web browser would until you get to data that has already been retrieved from a previous run of the scraper. Hopefully there is a class or ID in the HTML that you can rely on that contains the pagination links/buttons that you are interested in "clicking" on.

Example:

<?php
	require_once "support/web_browser.php";
	require_once "support/tag_filter.php";

	// Retrieve the standard HTML parsing array for later use.
	$htmloptions = TagFilter::GetHTMLOptions();

	$url = "http://www.somesite.com/something/";
	$web = new WebBrowser();

	do
	{
		// Retrieve a URL.
		$retries = 3;
		do
		{
			$result = $web->Process($url);
			$retries--;
			if (!$result["success"])  sleep(1);
		} while (!$result["success"] && $retries > 0);

		// Check for connectivity and response errors.
		if (!$result["success"])
		{
			echo "Error retrieving URL.  " . $result["error"] . "\n";
			
			exit();
		}

		if ($result["response"]["code"] != 200)
		{
			echo "Error retrieving URL.  Server returned:  " . $result["response"]["code"] . " " . $result["response"]["meaning"] . "\n";
			
			exit();
		}

		$baseurl = $result["url"];

		// Use TagFilter to parse the content.
		$html = TagFilter::Explode($result["body"], $htmloptions);

		// Retrieve a pointer object to the root node.
		$root = $html->Get();

		$found = false;
		// Attempt to extract information.
		// ... set $found to true if there is at least one row of data.

		if ($found)
		{
			$row = $root->Find("div.pagination a[href]")->Filter("/~contains:Next")->current();
			if ($row === false)  break;

			$url = HTTP::ConvertRelativeToAbsoluteURL($baseurl, $row->href);
		}
	} while ($found);
?>

This example looks for a div with a specific class containing links to various pages, finds the first link that contains the text "Next", converts the target URL to an absolute URL, and loops back around to process the next page.

For scraping websites that utilize POST requests with limited page sizes for pagination, it is almost guaranteed to be an ASP.NET website that you are interacting with. Or something equally horrible. Not only is extracting content from the HTML going to be a miserable experience regardless of what tools are used, it's also likely a government website running on underpowered hardware that takes what feels like 15 seconds to respond to each request. Use the WebBrowser form extraction tool to pull out the necessary information to go to the "Next" page, as sketched below. Other than that, I have no real recommendations other than to break down and cry while banging your head on your desk. May God have mercy on your soul.
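
Here is a rough sketch of that approach using the form extraction features shown earlier. It assumes the 'extractforms' option set in the constructor applies to every request made with that WebBrowser instance, and the URL, form index, and submit button name are all made up - inspect the real form to find the right ones:

<?php
	require_once "support/web_browser.php";

	$url = "http://www.somesite.gov/search.aspx";
	$web = new WebBrowser(array("extractforms" => true));
	$result = $web->Process($url);

	while ($result["success"] && $result["response"]["code"] == 200 && count($result["forms"]) > 0)
	{
		// Extract data from $result["body"] here.
		// Break out of the loop when no new rows are found or there is no "Next" button.

		// ASP.NET pages generally wrap the entire page in one giant form.
		$form = $result["forms"][0];

		// Some sites also require setting hidden fields (e.g. __EVENTTARGET) via SetFormValue() before submitting.
		$result2 = $form->GenerateFormRequest("btnNext");
		$result = $web->Process($result2["url"], "auto", $result2["options"]);
	}
?>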

POST Requests

Sometimes you might need to send a POST request that isn't from a form. For example, you might be writing an SDK for a RESTful API or emulating an AJAX interface. To send a POST request, simply build an options array with a "postvars" array containing the key-value pairs that the server requires.

Example:

<?php
	require_once "support/web_browser.php";

	// Send a POST request to a URL.
	$url = "http://api.somesite.com/profile";
	$web = new WebBrowser();
	$options = array(
		"postvars" => array(
			"id" => 12345,
			"firstname" => "John",
			"lastname" => "Smith"
		)
	);
	$result = $web->Process($url, "auto", $options);

	// Check for connectivity and response errors.
	if (!$result["success"])  echo "Error retrieving URL.  " . $result["error"] . "\n";
	else if ($result["response"]["code"] != 200)  echo "Error retrieving URL.  Server returned:  " . $result["response"]["code"] . " " . $result["response"]["meaning"] . "\n";
	else
	{
		// Do something with the response.
	}
?>

All of the details of sending the correct headers and content to the server are automatically handled by the WebBrowser and HTTP classes.

Uploading Files

File uploads are handled in several different ways so that very large files can be processed. The "files" option is an array of arrays that represents one or more files to upload. Note that file uploads will switch a POST request's Content-Type from "application/x-www-form-urlencoded" to "multipart/form-data".

Example:

<?php
	require_once "support/web_browser.php";

	// Retrieve a URL.
	$url = "http://api.somesite.com/photos";
	$web = new WebBrowser();
	$options = array(
		"postvars" => array(
			"uid" => 12345
		),
		"files" => array(
			array(
				"name" => "file1",
				"filename" => "mycat.jpg",
				"type" => "image/jpeg",
				"data" => file_get_contents("/path/to/mycat.jpg")
			),
			array(
				"name" => "file2",
				"filename" => "mycat-hires.jpg",
				"type" => "image/jpeg",
				"datafile" => "/path/to/mycat-hires.jpg"
			)
		)
	);
	$result = $web->Process($url, "auto", $options);

	// Check for connectivity and response errors.
	if (!$result["success"])  echo "Error retrieving URL.  " . $result["error"] . "\n";
	else if ($result["response"]["code"] != 200)  echo "Error retrieving URL.  Server returned:  " . $result["response"]["code"] . " " . $result["response"]["meaning"] . "\n";
	else
	{
		// Do something with the response.
	}
?>

Each file in the "files" array must have the following options:

One of the following options must also be provided for each file:

File uploads with extracted forms are handled similarly to the above. When calling $form->SetFormValue(), pass in an array containing the file information with "filename", "type", and "data" or "datafile". The "name" key-value will automatically be filled in when calling $form->GenerateFormRequest().
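
For example, a sketch of attaching a file to an extracted form, assuming $form and $web were obtained as in the form handling example earlier (the field name 'photo', the file details, and the submit button name are made up):

<?php
	// Attach a file to an existing form field by passing an array instead of a string.
	$form->SetFormValue("photo", array(
		"filename" => "mycat.jpg",
		"type" => "image/jpeg",
		"datafile" => "/path/to/mycat.jpg"
	));

	// Generate and submit the request as usual.
	$result2 = $form->GenerateFormRequest("submitbutton");
	$result = $web->Process($result2["url"], "auto", $result2["options"]);
?>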

Retrieving Large Files/Content

Sometimes the content to retrieve is just too large to handle completely in RAM. The Ultimate Web Scraper Toolkit sports a very impressive array of callback options allowing for retrieved information to be processed immediately instead of waiting for the request to complete. The most common use-case for using the callback options is to handle large file/content downloads. When retrieving anything over 10MB, it's a good idea to start utilizing the callback interfaces.

Example:

<?php
	require_once "support/web_browser.php";

	function DownloadFileCallback($response, $data, $opts)
	{
		if ($response["code"] == 200)
		{
			$size = ftell($opts);
			fwrite($opts, $data);

			if ($size % 1000000 > ($size + strlen($data)) % 1000000)  echo ".";
		}

		return true;
	}

	// Download a large file.
	$url = "http://downloads.somesite.com/large_file.zip";
	$fp = fopen("the_file.zip", "wb");
	$web = new WebBrowser();
	$options = array(
		"read_body_callback" => "DownloadFileCallback",
		"read_body_callback_opts" => $fp
	);
	echo "Downloading '" . $url . "'...";
	$result = $web->Process($url, "auto", $options);
	echo "\n";
	fclose($fp);

	// Check for connectivity and response errors.
	if (!$result["success"])  echo "Error retrieving URL.  " . $result["error"] . "\n";
	else if ($result["response"]["code"] != 200)  echo "Error retrieving URL.  Server returned:  " . $result["response"]["code"] . " " . $result["response"]["meaning"] . "\n";
	else
	{
		// Do something with the response.
	}
?>

The example above passes a file handle through the callback options parameter. The callback is called regularly as data is received and writes the retrieved data to the open file. It also determines whether a 1MB boundary has been crossed and, if so, echoes a dot/period to the console.

Sending Non-Standard Requests

The vast majority of requests to servers are GET, POST application/x-www-form-urlencoded, and POST multipart/form-data. However, there may be times that other request types need to be sent to a server. For example, a lot of APIs being written these days want JSON content instead of a standard POST request to be able to handle richer incoming data.

Example:

<?php
	require_once "support/web_browser.php";

	// Retrieve a URL.
	$url = "http://api.somesite.com/profile";
	$web = new WebBrowser();
	$options = array(
		"method" => "POST",
		"headers" => array(
			"Content-Type" => "application/json"
		),
		"body" => json_encode(array(
			"id" => 12345,
			"firstname" => "John",
			"lastname" => "Smith"
		))
	);
	$result = $web->Process($url, "auto", $options);

	// Check for connectivity and response errors.
	if (!$result["success"])  echo "Error retrieving URL.  " . $result["error"] . "\n";
	else if ($result["response"]["code"] != 200)  echo "Error retrieving URL.  Server returned:  " . $result["response"]["code"] . " " . $result["response"]["meaning"] . "\n";
	else
	{
		// Do something with the response.
	}
?>

Working with such APIs is best done by building a SDK. Here are several SDKs and their relevant API documentation that might be useful:

All of those SDKs utilize this toolkit.

Refined SSL Usage

By default, the Ultimate Web Scraper Toolkit verifies SSL certificate chains with the included 'support/cacert.pem' file and uses the "intermediate" Mozilla cipher suite when connecting to HTTPS URLs. Previously, the HTTP class did not validate the secure communication path because it is primarily a tool for scraping content, where "working and functional" is generally more important than the security of the data being sent and received. However, due to several unfortunate changes and bugs, the policy was changed. Fortunately, you can now control the SSL options more reliably, including simpler cipher suite selection.

Where data security is of concern, keep in mind that SSL is hard to get right and best-practices change over time as evidenced by the previous paragraph.

Example:

<?php
	require_once "support/http.php";
	require_once "support/web_browser.php";

	// Generate default SSL options using the "modern" ciphers.
	$sslopts = HTTP::GetSafeSSLOpts(true, "modern");

	// See php.net for a complete list of all options.
	$sslopts["capture_peer_cert"] = true;

	// Send a POST request to a URL.
	$url = "https://api.somesite.com/profile";
	$web = new WebBrowser();
	$options = array(
		"sslopts" => $sslopts,
		"postvars" => array(
			"id" => 12345,
			"firstname" => "John",
			"lastname" => "Smith"
		)
	);
	$result = $web->Process($url, "auto", $options);

	// Check for connectivity and response errors.
	if (!$result["success"])  echo "Error retrieving URL.  " . $result["error"] . "\n";
	else if ($result["response"]["code"] != 200)  echo "Error retrieving URL.  Server returned:  " . $result["response"]["code"] . " " . $result["response"]["meaning"] . "\n";
	else
	{
		// Do something with the response.
	}
?>

This example uses the "modern" cipher suite from Mozilla, which has stronger security properties than the default "intermediate" suite. The defaults also disable SSL/TLS compression and set a number of other options to improve the likelihood of a successful connection to an SSL/TLS enabled server. The full list of available SSL options can be found in the PHP SSL context options documentation.

Debugging

Got an API or website that's driving you crazy? A real web browser seems to work fine but your script isn't working? It might be time to dig in really deep and enable debug mode.

Example:

<?php
	require_once "support/web_browser.php";

	// Send a POST request to a URL.
	$url = "http://api.somesite.com/profile";
	$web = new WebBrowser();
	$options = array(
		"debug" => true,
		"postvars" => array(
			"id" => 12345,
			"firstname" => "John",
			"lastname" => "Smith"
		)
	);
	$result = $web->Process($url, "auto", $options);

	// Check for connectivity and response errors.
	if (!$result["success"])  echo "Error retrieving URL.  " . $result["error"] . "\n";
//	else if ($result["response"]["code"] != 200)  echo "Error retrieving URL.  Server returned:  " . $result["response"]["code"] . " " . $result["response"]["meaning"] . "\n";
	else
	{
echo "------- RAW SEND START -------\n";
echo $result["rawsend"];
echo "------- RAW SEND END -------\n\n";

echo "------- RAW RECEIVE START -------\n";
echo $result["rawrecv"];
echo "------- RAW RECEIVE END -------\n\n";
	}
?>

The "rawsend" and "rawrecv" show the exact byte-for-byte data sent and received, including any "chunked" information (i.e. Transfer-Encoding: chunked). Debug mode will also show data sent and received over SSL as plain-text, which makes it a better tool than Wireshark or an equivalent raw TCP dumping tool that can't see beyond a SSL handshake. Be sure to disable debug mode once the problem is resolved or it will chew up extra RAM.

Debugging SSL/TLS

The previous section covers issues that occur AFTER a connection to a server has been established. If you are connecting to an SSL/TLS enabled server, it is important to realize that those connections are much more fragile - through no fault of the toolkit, but rather because of SSL/TLS doing its thing. Here are the known reasons I've run into for a connection failing to establish for seemingly random reasons:

PHP does not expose much of the underlying SSL/TLS layer to applications when establishing connections, which makes it incredibly difficult to diagnose certain issues with SSL/TLS. To diagnose network related problems, use the 'openssl s_client' command line tool from the same host the problematic script is running on. Once the possibility of a network failure has been eliminated, only two common SSL/TLS certificate issues generally remain. See the section on Refined SSL Usage above for setting the "cafile", "auto_cn_match", and "auto_sni" SSL options.

If all else fails and secure, encrypted communication with the server is not required, disable the "verify_peer" and "verify_peer_name" SSL options and enable the "allow_self_signed" SSL option. Note that making these changes results in a connection that is no more secure than plaintext HTTP. Don't send passwords or other information that should be kept secure. This solution should only ever be used as a last resort. Always try to get the toolkit working with verification first.
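
Here is a minimal sketch of that last resort, assuming the "sslopts" array is passed through as PHP SSL context options as in the Refined SSL Usage section above (the URL is made up):

<?php
	require_once "support/web_browser.php";

	// Last resort only:  this connection is no more secure than plaintext HTTP.
	$url = "https://internal.somesite.com/api";
	$web = new WebBrowser();
	$options = array(
		"sslopts" => array(
			"verify_peer" => false,
			"verify_peer_name" => false,
			"allow_self_signed" => true
		)
	);
	$result = $web->Process($url, "auto", $options);

	// Check for connectivity and response errors.
	if (!$result["success"])  echo "Error retrieving URL.  " . $result["error"] . "\n";
	else if ($result["response"]["code"] != 200)  echo "Error retrieving URL.  Server returned:  " . $result["response"]["code"] . " " . $result["response"]["meaning"] . "\n";
	else
	{
		// Do something with the response.
	}
?>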

Using the cURL Emulation Layer

The cURL emulation layer is a drop-in replacement for cURL on web hosts that don't have cURL installed. This isn't some cheesy, half-baked solution. The source code carefully follows the cURL and PHP documentation. Every cURL define() and function available as of PHP 5.4.0 is implemented.

Example usage:

<?php
	if (!function_exists("curl_init"))
	{
		require_once "support/emulate_curl.php";
	}

	// Make cURL calls here...
?>

This example just shows how easy it is to add cURL support to any web host.
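
For reference, here is a slightly fuller sketch using standard cURL calls against a made-up URL. The same code runs whether the real extension or the emulation layer is loaded:

<?php
	if (!function_exists("curl_init"))
	{
		require_once "support/emulate_curl.php";
	}

	// Standard cURL calls work as usual from here on.
	$ch = curl_init();
	curl_setopt($ch, CURLOPT_URL, "http://www.somesite.com/something/");
	curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
	curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
	$data = curl_exec($ch);

	if ($data === false)  echo "cURL error:  " . curl_error($ch) . "\n";
	else  echo strlen($data) . " bytes retrieved.\n";

	curl_close($ch);
?>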

There are, however, a few limitations and differences. CURLOPT_VERBOSE is a lot more verbose. SSL/TLS support is a little flaky at times. Some things like DNS options are ignored. Only HTTP and HTTPS are supported protocols at this time. Return values from curl_getinfo() calls are close but not identical. curl_setopt() delays processing until curl_exec() is called. Multi-handle support "cheats" by performing operations linearly rather than in parallel.

Using Asynchronous Sockets

Asynchronous, or non-blocking, sockets allow for a lot of powerful functionality such as scraping multiple pages and sites simultaneously from a single script. They also allow for using certain features of HTTP such as sending a second request to a server on an active connection while the previous request's response is still arriving.

See the MultiAsyncHelper class documentation for a simple example with scraping multiple URLs with WebBrowser as well as in-depth documentation on the class.

Example advanced usage:

<?php
	require_once "support/http.php";
	require_once "support/web_browser.php";
	require_once "support/simple_html_dom.php";
	require_once "support/multi_async_helper.php";

	// The URLs we want to load.
	$urls = array(
		"http://www.barebonescms.com/",
		"http://www.cubiclesoft.com/",
		"http://www.barebonescms.com/documentation/ultimate_web_scraper_toolkit/",
	);

	// Build the queue.
	$helper = new MultiAsyncHelper();
	$helper->SetConcurrencyLimit(3);

	// Mix in a regular file handle just for fun.
	$fp = fopen(__FILE__, "rb");
	stream_set_blocking($fp, 0);
	$helper->Set("__fp", $fp, "MultiAsyncHelper::ReadOnly");

	// Add the URLs to the async helper.
	$pages = array();
	foreach ($urls as $url)
	{
		$pages[$url] = new WebBrowser();
		$pages[$url]->ProcessAsync($helper, $url, NULL, $url);
	}

	// Run the main loop.
	$result = $helper->Wait();
	while ($result["success"])
	{
		// Process the file handle if it is ready for reading.
		if (isset($result["read"]["__fp"]))
		{
			$fp = $result["read"]["__fp"];
			$data = fread($fp, 500);
			if ($data === false || feof($fp))
			{
				echo "End of file reached.\n";

				$helper->Remove("__fp");
			}
		}

		// Process everything else.
		foreach ($result["removed"] as $key => $info)
		{
			if ($key === "__fp")  continue;

			if (!$info["result"]["success"])  echo "Error retrieving URL (" . $key . ").  " . $info["result"]["error"] . "\n";
			else if ($info["result"]["response"]["code"] != 200)  echo "Error retrieving URL (" . $key . ").  Server returned:  " . $info["result"]["response"]["line"] . "\n";
			else
			{
				echo "A response was returned (" . $key . ").\n";

				// Do something with the data here...
			}

			unset($pages[$key]);
		}

		// Break out of the loop when nothing is left.
		if ($result["numleft"] < 1)  break;

		$result = $helper->Wait();
	}

	// An error occurred.
	if (!$result["success"])  var_dump($result);
?>

This is a fairly complete example that retrieves three different URLs while simultaneously reading a file, processes up to three items in the queue at a time (see SetConcurrencyLimit()), and handles the various responses appropriately. Once all items have been processed, the script exits. MultiAsyncHelper is a flexible class that handles all asynchronous stream types (not just sockets).

Using the WebSocket Layer

Once upon a time, the web used to be a sane place filled with the HTTP protocol. Then the WebSocket protocol (RFC 6455) came along. WebSocket is a bi-directional, asynchronous streaming, fragmentation-capable, frame-based protocol which allows a remote server to chug all of your available bandwidth. Awesome.

The protocol itself is a little bit difficult to deal with but the handy, creatively named WebSocket class makes talking to WebSocket servers much, much easier.

Example usage:

<?php
	// Requires both the WebBrowser and HTTP classes to work.
	require_once "support/websocket.php";
	require_once "support/web_browser.php";
	require_once "support/http.php";

	$ws = new WebSocket();

	// The first parameter is the WebSocket server.
	// The second parameter is the Origin URL.
	$result = $ws->Connect("ws://ws.something.org/", "http://www.something.org");
	if (!$result["success"])
	{
		var_dump($result);
		exit();
	}

	// Send a text frame (just an example).
	$result = $ws->Write("Testtext", WebSocket::FRAMETYPE_TEXT);

	// Send a binary frame (just an example).
	$result = $ws->Write("Testbinary", WebSocket::FRAMETYPE_BINARY);

	// Main loop.
	$result = $ws->Wait();
	while ($result["success"])
	{
		do
		{
			$result = $ws->Read();
			if (!$result["success"])  break;
			if ($result["data"] !== false)
			{
				// Do something with the data.
				var_dump($result["data"]);
			}
		} while ($result["data"] !== false);

		$result = $ws->Wait();
	}

	// An error occurred.
	var_dump($result);
?>

The WebSocket class manages two queues - a read queue and a write queue - and does most of its work in the Wait() function. If you know anything about the WebSocket protocol, you know there are control frames and non-control frames. The control frames are difficult to deal with because they usually happen mid-stream but the WebSocket class automatically takes care of all of those frames for you so that you don't have to. What that means is that when you get a packet of data from the WebSocket class, the data is intended for your application.

One important thing to note about the WebSocket and WebSocketServer classes: every major operation other than Connect() and Disconnect() is asynchronous in client mode, and Connect() and Disconnect() are also asynchronous in server mode. This means that reads and writes return immediately (they do not block), so a read may return no data. The data will eventually be sent/received in the Wait() function. When Wait() returns, there is usually something to do, but not always.

WebSocketServer implements a WebSocket server that allows a PHP WebSocket application to handle multiple clients with relative ease. WebSocketServer is an experimental product. You can try it out by running 'test_websocket_server.php' from the complete package on one command-line and 'test_websocket_client.php' from a couple more command-lines.

See the following for in-depth documentation and examples: WebSocket class documentation and WebSocketServer class documentation.

Writing a Custom Web Server

The Ultimate Web Scraper Toolkit includes a nifty class called WebServer. It allows for custom web servers to be made that do various things. One of those things might be a complex API that a user installs on their home computer. The WebServer class takes all the power and functionality of the baseline HTTP class and flips it over like a delicious pancake to serve up content.

See a complete example web server here: 'test_web_server.php'

The example starts a localhost server on port 5578 and waits for connections. It also supports upgrading connections to WebSocket. All request types - GET, POST with 'application/x-www-form-urlencoded', POST with 'form-data', and POST with JSON bodies - are supported and condensed into a single data array before being passed off to the API, assuming the user supplied an API key.

The example code forms common boilerplate logic for creating your own custom web server.

The WebServer class isn't going to win any awards for performance, beauty, or even stability against things like denial-of-service attacks. It will, however, win shiny awards for functionality and having enough features. It is written in pure PHP and has no special dependencies other than the HTTP class. Fewer external dependencies usually equates to fewer deployment problems.

See the WebServer classes documentation for in-depth details and examples.

Other Uses

The Ultimate Web Scraper Toolkit has many uses beyond pulling data down off the Internet and writing robots. For example, it can be used to scan a collection of static HTML documents on a host to find orphaned pages that are no longer being linked to:

<?php
	require_once "support/http.php";
	require_once "support/web_browser.php";
	require_once "support/simple_html_dom.php";

	// Customize options.
	$basepath = str_replace("\\", "/", dirname(__FILE__)) . "/html";
	$baseurl = "http://www.mysite.com/";
	$rootdomains = array("http://www.mysite.com/", "http://mysite.com/");
	$rootdocs = array("index.html", "index.php");
	$livescan = false;

	function LoadURLs(&$urls, $baseurl, $basepath)
	{
		if (substr($baseurl, -1) != "/")  $baseurl .= "/";

		$dir = @opendir($basepath);
		if ($dir)
		{
			while (($file = readdir($dir)) !== false)
			{
				if ($file != "." && $file != "..")
				{
					if (is_dir($basepath . "/" . $file))  LoadURLs($urls, $baseurl . $file, $basepath . "/" . $file);
					else  $urls[HTTP::ConvertRelativeToAbsoluteURL($baseurl, $file)] = $basepath . "/" . $file;
				}
			}

			closedir($dir);
		}
	}

	$html = new simple_html_dom();
	$urls = array();
	LoadURLs($urls, $baseurl, $basepath);

	// Find the root file.
	$processurls = array();
	foreach ($rootdocs as $file)
	{
		$url = HTTP::ConvertRelativeToAbsoluteURL($baseurl, $file);
		if (isset($urls[$url]))
		{
			$processurls[] = $url;

			break;
		}
	}

	// Process all URLs.
	while (count($processurls))
	{
		$url = array_shift($processurls);
		if (isset($urls[$url]))
		{
			$filename = $urls[$url];
			unset($urls[$url]);

			if (!$livescan)  $data = (string)@file_get_contents($filename);
			else
			{
				$web = new WebBrowser();
				$result = $web->Process($url);
				$data = "";

				if (!$result["success"])  echo "Error retrieving URL.  " . $result["error"] . "\n";
				else if ($result["response"]["code"] != 200)  echo "Error retrieving URL.  Server returned:  " . $result["response"]["code"] . " " . $result["response"]["meaning"] . "\n";
				else  $data = $result["body"];
			}

			$html->load($data);
			$rows = $html->find("a[href]");
			foreach ($rows as $row)
			{
				$url2 = (string)$row->href;
				foreach ($rootdomains as $domain)
				{
					if (strtolower(substr($url2, 0, strlen($domain))) == strtolower($domain))  $url2 = substr($url2, strlen($domain) - 1);
				}
				$url2 = HTTP::ConvertRelativeToAbsoluteURL($url, $url2);

				$processurls[] = $url2;
			}
		}
	}

	// Output files not referenced anywhere.
	echo "Orphaned files:\n\n";
	foreach ($urls as $url => $file)
	{
		echo $file . "\n";
	}
?>

If you have a specific example of a common scraper-related task that you'd like to see documented here, please drop by the forums.

Limitations

The Ultimate Web Scraper Toolkit is a serious piece of software that is approximately 99.5% effective across most websites. There are many things that can go wrong on the Internet and you are bound to encounter a lot of them if you need to scrape a lot of content. You'll encounter everything from rate limits to network timeouts to really gnarly HTML. You'll learn about responsibly retrying failed requests and quickly master the TagFilter class. However, your best bet is to try to find someone else who has already done the heavy lifting for whatever content you want to get. Barring the availability of the data you want in a nice, neat package (e.g. a nightly ZIP file), this toolkit plus your web browser's built-in tools to traverse the DOM will probably do the trick. The toolkit can scrape data out of some of the nastiest content in existence (ASP.NET and Microsoft Word HTML, I'm looking at you) and do it cleanly with less code than anything else out there.

The remaining 0.5% of websites are pure Javascript sites where the entire content is generated using Javascript. For those sites, you'll need a real web browser. Fortunately, there is PhantomJS (headless Webkit), which can be scripted (i.e. automated) to handle extremely ugly stuff such as the aforementioned Javascript-heavy sites. However, PhantomJS is rather resource intensive and slooooow. After all, PhantomJS emulates a real web browser which includes the full startup sequence and then it proceeds to download the entire page's content. That, in turn, can take hundreds of requests to complete and can easily include downloading things such as ads.

Honestly, in the last decade of extensively using this toolkit, I've only run into one website that absolutely required PhantomJS. Everything else I've built works great with the Ultimate Web Scraper Toolkit.

© CubicleSoft