Ultimate Web Scraper Toolkit Documentation

The Ultimate Web Scraper Toolkit is a powerful set of tools designed to handle all of your web scraping needs on nearly all web hosts. The toolkit makes RFC-compliant web requests that are indistinguishable from a real web browser, provides a web browser-like state engine for handling cookies and redirects, and includes a full cURL emulation layer for web hosts without the PHP cURL extension installed. A powerful tag filtering library (TagFilter) is included to easily extract and/or convert the desired content from each retrieved document.

This toolkit even comes with classes for creating custom web servers and WebSocket servers. That custom API you want the average person to install on their home computer or roll out to devices in the enterprise just became easier to deploy.

While this toolkit makes it really easy to scrape just about any content from the web, please don't do anything illegal. There is this little thing called copyright law that most countries have to protect various works.

Features

The following are a few features of the Ultimate Web Scraper Toolkit:

And much more.

License

The Ultimate Web Scraper Toolkit is extracted from Barebones CMS and the license is also your pick of MIT or LGPL. The license and restrictions are identical to the Barebones CMS License.

If you find the Ultimate Web Scraper Toolkit useful, financial donations are sincerely appreciated and go towards future development efforts.

Download

Ultimate Web Scraper Toolkit 1.0RC17 is the seventeenth release candidate of the Ultimate Web Scraper Toolkit.

Download ultimate-web-scraper-1.0rc17.zip

If you find the Ultimate Web Scraper Toolkit useful, please donate toward future development efforts.

Installation

Installing the Ultimate Web Scraper Toolkit is easy. The installation procedure is as follows:

Installation is easy. Using the toolkit is a bit more difficult.

Upgrading

Like Barebones CMS, upgrading the Ultimate Web Scraper Toolkit is easy - just upload the new files to the server and overwrite existing files.

Scraping Webpages - The Easy Way

Webpages are hard to retrieve and harder to parse, and doing both consistently across a wide variety of web hosts and scenarios is very difficult to do alone. The Ultimate Web Scraper Toolkit makes both retrieving and parsing webpages a whole lot easier.

Example usage:

<?php
	require_once "support/web_browser.php";
	require_once "support/tag_filter.php";

	// Retrieve the standard HTML parsing array for later use.
	$htmloptions = TagFilter::GetHTMLOptions();

	// Retrieve a URL.
	$url = "http://www.somesite.com/something/";
	$web = new WebBrowser();
	$result = $web->Process($url);

	// Check for connectivity and response errors.
	if (!$result["success"])  echo "Error retrieving URL.  " . $result["error"] . "\n";
	else if ($result["response"]["code"] != 200)  echo "Error retrieving URL.  Server returned:  " . $result["response"]["code"] . " " . $result["response"]["meaning"] . "\n";
	else
	{
		$baseurl = $result["url"];

		// Use TagFilter to parse the content.
		$html = TagFilter::Explode($result["body"], $htmloptions);

		// Find all anchor tags.
		echo "All the URLs:\n";
		$result2 = $html->Find("a[href]");
		if (!$result2["success"])  echo "Error parsing/finding URLs.  " . $result2["error"] . "\n";
		else
		{
			foreach ($result2["ids"] as $id)
			{
				// Fast direct access.
				echo "\t" . $html->nodes[$id]["attrs"]["href"] . "\n";
				echo "\t" . HTTP::ConvertRelativeToAbsoluteURL($baseurl, $html->nodes[$id]["attrs"]["href"]) . "\n";
			}
		}

		// Find all table rows that have 'th' tags.
		// The 'tr' tag IDs are returned.
		$result2 = $html->Filter($html->Find("tr"), "th");
		if (!$result2["success"])  echo "Error parsing/finding table rows.  " . $result2["error"] . "\n";
		else
		{
			foreach ($result2["ids"] as $id)
			{
				echo "\t" . $html->GetOuterHTML($id) . "\n\n";
			}
		}
	}
?>

Example object-oriented usage:

<?php
	require_once "support/web_browser.php";
	require_once "support/tag_filter.php";

	// Retrieve the standard HTML parsing array for later use.
	$htmloptions = TagFilter::GetHTMLOptions();

	// Retrieve a URL.
	$url = "http://www.somesite.com/something/";
	$web = new WebBrowser();
	$result = $web->Process($url);

	// Check for connectivity and response errors.
	if (!$result["success"])  echo "Error retrieving URL.  " . $result["error"] . "\n";
	else if ($result["response"]["code"] != 200)  echo "Error retrieving URL.  Server returned:  " . $result["response"]["code"] . " " . $result["response"]["meaning"] . "\n";
	else
	{
		$baseurl = $result["url"];

		// Use TagFilter to parse the content.
		$html = TagFilter::Explode($result["body"], $htmloptions);

		// Retrieve a pointer object to the root node.
		$root = $html->Get();

		// Find all anchor tags.
		echo "All the URLs:\n";
		$rows = $root->Find("a[href]");
		foreach ($rows as $row)
		{
			// Somewhat slower access.
			echo "\t" . $row->href . "\n";
			echo "\t" . HTTP::ConvertRelativeToAbsoluteURL($baseurl, $row->href) . "\n";
		}

		// Find all table rows that have 'th' tags.
		$rows = $root->Find("tr")->Filter("th");
		foreach ($rows as $row)
		{
			echo "\t" . $row->GetOuterHTML() . "\n\n";
		}
	}
?>

These brief examples retrieve a URL while emulating some flavor of Firefox, display the value of the 'href' attribute of every anchor tag that has an 'href' attribute, and find all table rows that contain 'th' tags. In addition, because the WebBrowser class was used, the code automatically handles HTTP cookies and redirects internally.

You'll get lots of mileage out of HTTP::ExtractURL(), HTTP::CondenseURL(), HTTP::ConvertRelativeToAbsoluteURL(), and other useful functions when extracting content from an HTML page and processing server responses.
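
For instance, here is a minimal sketch of how those URL helpers might fit together. It assumes HTTP::ExtractURL() breaks a URL into an associative array of components and HTTP::CondenseURL() reassembles that array into a URL string; only HTTP::ConvertRelativeToAbsoluteURL() appears in the examples above, so treat the other two calls as illustrative:

<?php
	require_once "support/http.php";

	$baseurl = "http://www.somesite.com/something/";

	// Resolve a relative link (e.g. extracted from an anchor tag) against the page URL.
	$absolute = HTTP::ConvertRelativeToAbsoluteURL($baseurl, "../other/page.html");
	echo $absolute . "\n";

	// Break the URL apart, inspect the pieces, and put it back together.
	// (Assumes ExtractURL()/CondenseURL() round-trip an associative array of URL components.)
	$parts = HTTP::ExtractURL($absolute);
	var_dump($parts);

	echo HTTP::CondenseURL($parts) . "\n";
?>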

See the following for in-depth documentation and extensive examples on performing document retrieval and extracting content with TagFilter: WebBrowser classes documentation, TagFilter classes documentation, and HTTP class documentation.

Scraping Webpages - The Hard Way

The previous example used the web browser emulation layer (WebBrowser) to retrieve the content. Sometimes getting into the nitty-gritty details of constructing a web request is the desired option (but only in extremely rare situations).

Example:

<?php
	require_once "support/http.php";
	require_once "support/tag_filter.php";

	// Retrieve the standard HTML parsing array for later use.
	$htmloptions = TagFilter::GetHTMLOptions();

	$url = "http://www.somesite.com/something/";
	$options = array(
		"headers" => array(
			"User-Agent" => HTTP::GetWebUserAgent("Firefox"),
			"Accept" => "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
			"Accept-Language" => "en-us,en;q=0.5",
			"Accept-Charset" => "ISO-8859-1,utf-8;q=0.7,*;q=0.7",
			"Cache-Control" => "max-age=0"
		)
	);
	$result = HTTP::RetrieveWebpage($url, $options);
	if (!$result["success"])  echo "Error retrieving URL.  " . $result["error"] . "\n";
	else if ($result["response"]["code"] != 200)  echo "Error retrieving URL.  Server returned:  " . $result["response"]["code"] . " " . $result["response"]["meaning"] . "\n";
	else
	{
		// Use TagFilter to parse the content.
		$html = TagFilter::Explode($result["body"], $htmloptions);

		// Find all anchor tags.
		echo "All the URLs:\n";
		$result2 = $html->Find("a[href]");
		if (!$result2["success"])  echo "Error parsing/finding URLs.  " . $result2["error"] . "\n";
		else
		{
			foreach ($result2["ids"] as $id)
			{
				// Fast direct access.
				echo "\t" . $html->nodes[$id]["attrs"]["href"] . "\n";
			}
		}

		// Find all table rows that have 'th' tags.
		// The 'tr' tag IDs are returned.
		$result2 = $html->Filter($html->Find("tr"), "th");
		if (!$result2["success"])  echo "Error parsing/finding table rows.  " . $result2["error"] . "\n";
		else
		{
			foreach ($result2["ids"] as $id)
			{
				echo "\t" . $html->GetOuterHTML($id) . "\n\n";
			}
		}
	}
?>

This example performs the same operation as the previous section, but doesn't get all the benefits of the web browser emulation layer such as automatically handling redirects and cookies. You should, in general, prefer using the WebBrowser class.

See the HTTP class documentation for more in-depth details and examples.

Handling HTML Forms

Traditionally, one of the hardest things to handle with web scraping is the classic HTML form. If you are like me, you've generally just faked it and handled form submissions manually by bypassing the form itself (i.e. manually copying variable names). The problem is that if/when the server side changes how it does things, the old form submission code tends to break in spectacular ways. This toolkit includes several functions designed to make real form handling a walk in the park.

Example:

<?php
	require_once "support/http.php";
	require_once "support/web_browser.php";
	require_once "support/simple_html_dom.php";

	$url = "https://www.google.com/";
	$web = new WebBrowser(array("extractforms" => true));
	$result = $web->Process($url);

	if (!$result["success"])  echo "Error retrieving URL.  " . $result["error"] . "\n";
	else if ($result["response"]["code"] != 200)  echo "Error retrieving URL.  Server returned:  " . $result["response"]["code"] . " " . $result["response"]["meaning"] . "\n";
	else if (count($result["forms"]) != 1)  echo "Was expecting one form.  Received:  " . count($result["forms"]) . "\n";
	else
	{
		$form = $result["forms"][0];

		$form->SetFormValue("q", "barebones cms");

		$result2 = $form->GenerateFormRequest("btnK");
		$result = $web->Process($result2["url"], "auto", $result2["options"]);

		if (!$result["success"])  echo "Error retrieving URL.  " . $result["error"] . "\n";
		else if ($result["response"]["code"] != 200)  echo "Error retrieving URL.  Server returned:  " . $result["response"]["code"] . " " . $result["response"]["meaning"] . "\n";
		else
		{
			// Do something with the results page here...
		}
	}
?>

This example retrieves Google's homepage, extracts the search form, modifies the search field, generates and submits the next request, and gets the response. All of that in just a few lines of code.

Note that, like the rest of the WebBrowser class, the form handler doesn't process Javascript. Very few sites actually need Javascript. For those rare, broken websites that need Javascript for the form on the page to function, I'm usually able to get away with a quick regular expression or two to pull the necessary information from the body content.
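
As a hedged sketch of that approach, the snippet below pulls a hypothetical Javascript-populated value out of the raw body with a regular expression and feeds it into the extracted form. The field name 'csrf_token' and the script pattern are made up for illustration; the $result and $form variables continue the form handling example above:

<?php
	// Continuing from a $result produced by WebBrowser with "extractforms" => true.
	$form = $result["forms"][0];

	// The page's Javascript would normally fill in this hidden field.
	// Pull the value directly out of the raw body instead (pattern is hypothetical).
	if (preg_match('/var\s+csrftoken\s*=\s*"([^"]+)"/', $result["body"], $matches))
	{
		$form->SetFormValue("csrf_token", $matches[1]);
	}

	// Then generate and submit the form request exactly as in the example above.
?>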

POST Requests

Sometimes you might need to send a POST request that isn't from a form. For example, you might be writing an SDK for a RESTful API or emulating an AJAX interface. To send a POST request, simply build an options array with a "postvars" array of key-value pairs containing the information that the server requires.

Example:

<?php
	require_once "support/web_browser.php";

	// Send a POST request to a URL.
	$url = "http://api.somesite.com/profile";
	$web = new WebBrowser();
	$options = array(
		"postvars" => array(
			"id" => 12345,
			"firstname" => "John",
			"lastname" => "Smith"
		)
	);
	$result = $web->Process($url, "auto", $options);

	// Check for connectivity and response errors.
	if (!$result["success"])  echo "Error retrieving URL.  " . $result["error"] . "\n";
	else if ($result["response"]["code"] != 200)  echo "Error retrieving URL.  Server returned:  " . $result["response"]["code"] . " " . $result["response"]["meaning"] . "\n";
	else
	{
		// Do something with the response.
	}
?>

All of the details of sending the correct headers and content to the server are automatically handled by the WebBrowser and HTTP classes.

Uploading Files

File uploads are handled several different ways so that very large files can be processed. The "files" option is an array of arrays that represents one or more files to upload. Note that file uploads will switch a POST request's Content-Type from "application/x-www-form-urlencoded" to "multipart/form-data".

Example:

<?php
	require_once "support/web_browser.php";

	// Retrieve a URL.
	$url = "http://api.somesite.com/photos";
	$web = new WebBrowser();
	$options = array(
		"postvars" => array(
			"uid" => 12345
		),
		"files" => array(
			array(
				"name" => "file1",
				"filename" => "mycat.jpg",
				"type" => "image/jpeg",
				"data" => file_get_contents("/path/to/mycat.jpg")
			),
			array(
				"name" => "file2",
				"filename" => "mycat-hires.jpg",
				"type" => "image/jpeg",
				"datafile" => "/path/to/mycat-hires.jpg"
			)
		)
	);
	$result = $web->Process($url, "auto", $options);

	// Check for connectivity and response errors.
	if (!$result["success"])  echo "Error retrieving URL.  " . $result["error"] . "\n";
	else if ($result["response"]["code"] != 200)  echo "Error retrieving URL.  Server returned:  " . $result["response"]["code"] . " " . $result["response"]["meaning"] . "\n";
	else
	{
		// Do something with the response.
	}
?>

Each file in the "files" array must have the following options: "name" (the form field name the server expects), "filename" (the filename to report to the server), and "type" (the MIME type of the file).

One of the following options must also be provided for each file: "data" (the full file content as a string) or "datafile" (the path to a file on disk, which allows very large files to be sent without loading them into RAM).

File uploads with extracted forms are handled similarly to the above. When calling $form->SetFormValue(), pass in an array containing the file information with "filename", "type", and "data" or "datafile". The "name" key-value will automatically be filled in when calling $form->GenerateFormRequest().
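
A minimal sketch of that call, using a made-up form field named "photo" and continuing from an extracted $form object:

<?php
	// Attach a file to an extracted form field (the field name is hypothetical).
	$form->SetFormValue("photo", array(
		"filename" => "mycat.jpg",
		"type" => "image/jpeg",
		"datafile" => "/path/to/mycat.jpg"
	));

	// Generate and submit the request as shown in the Handling HTML Forms section.
?>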

Retrieving Large Files/Content

Sometimes the content to retrieve is just too large to handle completely in RAM. The Ultimate Web Scraper Toolkit sports a very impressive array of callback options allowing for retrieved information to be processed immediately instead of waiting for the request to complete. The most common use-case for using the callback options is to handle large file/content downloads. When retrieving anything over 10MB, it's a good idea to start utilizing the callback interfaces.

Example:

<?php
	require_once "support/web_browser.php";

	function DownloadFileCallback($response, $data, $opts)
	{
		if ($response["code"] == 200)
		{
			$size = ftell($opts);
			fwrite($opts, $data);

			if ($size % 1000000 > ($size + strlen($data)) % 1000000)  echo ".";
		}

		return true;
	}

	// Download a large file.
	$url = "http://downloads.somesite.com/large_file.zip";
	$fp = fopen("the_file.zip", "wb");
	$web = new WebBrowser();
	$options = array(
		"read_body_callback" => "DownloadFileCallback",
		"read_body_callback_opts" => $fp
	);
	echo "Downloading '" . $url . "'...";
	$result = $web->Process($url, "auto", $options);
	echo "\n";
	fclose($fp);

	// Check for connectivity and response errors.
	if (!$result["success"])  echo "Error retrieving URL.  " . $result["error"] . "\n";
	else if ($result["response"]["code"] != 200)  echo "Error retrieving URL.  Server returned:  " . $result["response"]["code"] . " " . $result["response"]["meaning"] . "\n";
	else
	{
		// Do something with the response.
	}
?>

The example above passes a file handle through the callback options parameter. The callback is called regularly as data is received and writes the retrieved data to the open file. It also determines whether a 1MB boundary has been crossed and, if so, echoes a dot/period to the console.

Sending Non-Standard Requests

The vast majority of requests to servers are GET, POST application/x-www-form-urlencoded, and POST multipart/form-data. However, there may be times when other request types need to be sent to a server. For example, a lot of APIs being written these days want JSON content instead of a standard POST request so they can handle richer incoming data.

Example:

<?php
	require_once "support/web_browser.php";

	// Retrieve a URL.
	$url = "http://api.somesite.com/profile";
	$web = new WebBrowser();
	$options = array(
		"method" => "POST",
		"headers" => array(
			"Content-Type" => "application/json"
		),
		"body" => json_encode(array(
			"id" => 12345,
			"firstname" => "John",
			"lastname" => "Smith"
		))
	);
	$result = $web->Process($url, "auto", $options);

	// Check for connectivity and response errors.
	if (!$result["success"])  echo "Error retrieving URL.  " . $result["error"] . "\n";
	else if ($result["response"]["code"] != 200)  echo "Error retrieving URL.  Server returned:  " . $result["response"]["code"] . " " . $result["response"]["meaning"] . "\n";
	else
	{
		// Do something with the response.
	}
?>

Working with such APIs is best done by building a SDK. Here are several SDKs and their relevant API documentation that might be useful:

All of those SDKs utilize this toolkit.

Refined SSL Usage

By default, the Ultimate Web Scraper Toolkit does not verify SSL certificate chains using the included 'support/cacert.pem' file and, in general, performs very little validation of a secure communication path. The reason for this behavior is that the toolkit is primarily for scraping content where "working and functional" is generally more important than the security of the data being sent and received. Where security is of concern, keep in mind that SSL is hard to get right and best-practices change over time.

Example:

<?php
	require_once "support/web_browser.php";

	// See php.net for a complete list of available options.
	$sslopts = array(
		"ciphers" => "ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:DHE-RSA-AES128-GCM-SHA256:DHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA256:ECDHE-ECDSA-AES128-SHA:ECDHE-RSA-AES256-SHA384:ECDHE-RSA-AES128-SHA:ECDHE-ECDSA-AES256-SHA384:ECDHE-ECDSA-AES256-SHA:ECDHE-RSA-AES256-SHA:DHE-RSA-AES128-SHA256:DHE-RSA-AES128-SHA:DHE-RSA-AES256-SHA256:DHE-RSA-AES256-SHA:ECDHE-ECDSA-DES-CBC3-SHA:ECDHE-RSA-DES-CBC3-SHA:EDH-RSA-DES-CBC3-SHA:AES128-GCM-SHA256:AES256-GCM-SHA384:AES128-SHA256:AES256-SHA256:AES128-SHA:AES256-SHA:DES-CBC3-SHA:!DSS",
		"disable_compression" => true,
		"allow_self_signed" => false,
		"verify_peer" => true,
		"verify_depth" => 3,
		"capture_peer_cert" => true,
		"cafile" => str_replace("\\", "/", dirname(__FILE__)) . "/support/cacert.pem",
		"auto_cn_match" => true,
		"auto_sni" => true
	);

	// Send a POST request to a URL.
	$url = "https://api.somesite.com/profile";
	$web = new WebBrowser();
	$options = array(
		"sslopts" => $sslopts,
		"postvars" => array(
			"id" => 12345,
			"firstname" => "John",
			"lastname" => "Smith"
		)
	);
	$result = $web->Process($url, "auto", $options);

	// Check for connectivity and response errors.
	if (!$result["success"])  echo "Error retrieving URL.  " . $result["error"] . "\n";
	else if ($result["response"]["code"] != 200)  echo "Error retrieving URL.  Server returned:  " . $result["response"]["code"] . " " . $result["response"]["meaning"] . "\n";
	else
	{
		// Do something with the response.
	}
?>

This example uses the Intermediate cipher list from Mozilla, which has certain security properties, disables SSL/TLS compression, and sets a number of other options. The full list of available options can be found in the PHP SSL context options documentation.

Debugging

Got an API or website that's driving you crazy? A real web browser seems to work fine but your script isn't working? It might be time to dig in really deep and enable debug mode.

Example:

<?php
	require_once "support/web_browser.php";

	// Send a POST request to a URL.
	$url = "http://api.somesite.com/profile";
	$web = new WebBrowser();
	$options = array(
		"debug" => true,
		"postvars" => array(
			"id" => 12345,
			"firstname" => "John",
			"lastname" => "Smith"
		)
	);
	$result = $web->Process($url, "auto", $options);

	// Check for connectivity and response errors.
	if (!$result["success"])  echo "Error retrieving URL.  " . $result["error"] . "\n";
//	else if ($result["response"]["code"] != 200)  echo "Error retrieving URL.  Server returned:  " . $result["response"]["code"] . " " . $result["response"]["meaning"] . "\n";
	else
	{
echo "------- RAW SEND START -------\n";
echo $result["rawsend"];
echo "------- RAW SEND END -------\n\n";

echo "------- RAW RECEIVE START -------\n";
echo $result["rawrecv"];
echo "------- RAW RECEIVE END -------\n\n";
	}
?>

The "rawsend" and "rawrecv" show the exact byte-for-byte data sent and received, including any "chunked" information (i.e. Transfer-Encoding: chunked). Debug mode will also show data sent and received over SSL as plain-text, which makes it a better tool than Wireshark or an equivalent raw TCP dumping tool that can't see beyond a SSL handshake. Be sure to disable debug mode once the problem is resolved or it will chew up extra RAM.

Debugging SSL/TLS

The previous section covers issues that occur AFTER a connection to a server has been established. If you are connecting to an SSL/TLS enabled server, it is important to realize that those connections are much more fragile, through no fault of the toolkit, but rather because of SSL/TLS doing its thing. Here are the causes I've run into for connections that fail to establish for seemingly random reasons:

PHP does not expose much of the underlying SSL/TLS layer to applications when establishing connections, which makes it incredibly difficult to diagnose certain issues with SSL/TLS. To diagnose network-related problems, use the 'openssl s_client' command line tool from the same host the problematic script is running on. Once the possibility of a network failure has been eliminated, only two common SSL/TLS certificate issues generally remain. See the Refined SSL Usage section above for setting the "cafile", "auto_cn_match", and "auto_sni" SSL options.

If all else fails and secure, encrypted communication with the server is not required, disable the "verify_peer" and "verify_peer_name" SSL options and enable the "allow_self_signed" SSL option. Note that making these changes results in a connection that is no more secure than plaintext HTTP. Don't send passwords or other information that should be kept secure. This solution should only ever be used as a last resort. Always try to get the toolkit working with verification first.
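
As a rough last-resort sketch under those caveats, the relevant "sslopts" entries look something like this (the URL is a placeholder):

<?php
	require_once "support/web_browser.php";

	// WARNING:  Disables certificate verification.  No more secure than plaintext HTTP.
	$sslopts = array(
		"verify_peer" => false,
		"verify_peer_name" => false,
		"allow_self_signed" => true
	);

	$url = "https://internal-device.somesite.com/";
	$web = new WebBrowser();
	$options = array(
		"sslopts" => $sslopts
	);
	$result = $web->Process($url, "auto", $options);
?>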

Using the cURL Emulation Layer

The cURL emulation layer is a drop-in replacement for cURL on web hosts that don't have cURL installed. This isn't some cheesy, half-baked solution. The source code carefully follows the cURL and PHP documentation. Every define() and function is available as of PHP 5.4.0.

Example usage:

<?php
	if (!function_exists("curl_init"))
	{
		require_once "support/emulate_curl.php";
	}

	// Make cURL calls here...
?>

This example just shows how easy it is to add cURL support to any web host.
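
For reference, here is a quick sketch of ordinary cURL calls that should behave the same whether real cURL or the emulation layer is loaded (the URL is a placeholder):

<?php
	if (!function_exists("curl_init"))
	{
		require_once "support/emulate_curl.php";
	}

	// A plain GET request using the standard cURL API.
	$ch = curl_init();
	curl_setopt($ch, CURLOPT_URL, "http://www.somesite.com/something/");
	curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
	curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
	$body = curl_exec($ch);

	if ($body === false)  echo "cURL error:  " . curl_error($ch) . "\n";
	else  echo "Retrieved " . strlen($body) . " bytes.  HTTP code:  " . curl_getinfo($ch, CURLINFO_HTTP_CODE) . "\n";

	curl_close($ch);
?>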

There are, however, a few limitations and differences. CURLOPT_VERBOSE is a lot more verbose. SSL/TLS support is a little flaky at times. Some things like DNS options are ignored. Only HTTP and HTTPS are supported protocols at this time. Return values from curl_getinfo() calls are close but not identical. curl_setopt() delays processing until curl_exec() is called. Multi-handle support "cheats" by performing operations in linear execution rather than parallel execution.

Using Asynchronous Sockets

Asynchronous, or non-blocking, sockets allow for a lot of powerful functionality such as scraping multiple pages and sites simultaneously from a single script. They also allow for using certain features of HTTP such as sending a second request to a server on an active connection while the previous request's response is still arriving.

See the MultiAsyncHelper class documentation for a simple example of scraping multiple URLs with WebBrowser as well as in-depth documentation on the class.

Example advanced usage:

<?php
	require_once "support/http.php";
	require_once "support/web_browser.php";
	require_once "support/simple_html_dom.php";
	require_once "support/multi_async_helper.php";

	// The URLs we want to load.
	$urls = array(
		"http://www.barebonescms.com/",
		"http://www.cubiclesoft.com/",
		"http://www.barebonescms.com/documentation/ultimate_web_scraper_toolkit/",
	);

	// Build the queue.
	$helper = new MultiAsyncHelper();
	$helper->SetConcurrencyLimit(3);

	// Mix in a regular file handle just for fun.
	$fp = fopen(__FILE__, "rb");
	stream_set_blocking($fp, 0);
	$helper->Set("__fp", $fp, "MultiAsyncHelper::ReadOnly");

	// Add the URLs to the async helper.
	$pages = array();
	foreach ($urls as $url)
	{
		$pages[$url] = new WebBrowser();
		$pages[$url]->ProcessAsync($helper, $url, NULL, $url);
	}

	// Run the main loop.
	$result = $helper->Wait();
	while ($result["success"])
	{
		// Process the file handle if it is ready for reading.
		if (isset($result["read"]["__fp"]))
		{
			$fp = $result["read"]["__fp"];
			$data = fread($fp, 500);
			if ($data === false || feof($fp))
			{
				echo "End of file reached.\n";

				$helper->Remove("__fp");
			}
		}

		// Process everything else.
		foreach ($result["removed"] as $key => $info)
		{
			if ($key === "__fp")  continue;

			if (!$info["result"]["success"])  echo "Error retrieving URL (" . $key . ").  " . $info["result"]["error"] . "\n";
			else if ($info["result"]["response"]["code"] != 200)  echo "Error retrieving URL (" . $key . ").  Server returned:  " . $info["result"]["response"]["line"] . "\n";
			else
			{
				echo "A response was returned (" . $key . ").\n";

				// Do something with the data here...
			}

			unset($pages[$key]);
		}

		// Break out of the loop when nothing is left.
		if ($result["numleft"] < 1)  break;

		$result = $helper->Wait();
	}

	// An error occurred.
	if (!$result["success"])  var_dump($result);
?>

This is a fairly complete example that retrieves three different URLs while simultaneously reading a file, processes up to three items in the queue at a time (see SetConcurrencyLimit()), and handles the various responses appropriately. Once all items have been processed, the script exits. MultiAsyncHelper is a flexible class that handles all asynchronous stream types (not just sockets).

Using the WebSocket Layer

Once upon a time, the web used to be a sane place filled with the HTTP protocol. Then the WebSocket protocol (RFC 6455) came along. WebSocket is a bi-directional, asynchronous streaming, fragmentation-capable, frame-based protocol which allows a remote server to chug all of your available bandwidth. Awesome.

The protocol itself is a little bit difficult to deal with but the handy, creatively named WebSocket class makes talking to WebSocket servers much, much easier.

Example usage:

<?php
	// Requires both the WebBrowser and HTTP classes to work.
	require_once "support/websocket.php";
	require_once "support/web_browser.php";
	require_once "support/http.php";

	$ws = new WebSocket();

	// The first parameter is the WebSocket server.
	// The second parameter is the Origin URL.
	$result = $ws->Connect("ws://ws.something.org/", "http://www.something.org");
	if (!$result["success"])
	{
		var_dump($result);
		exit();
	}

	// Send a text frame (just an example).
	$result = $ws->Write("Testtext", WebSocket::FRAMETYPE_TEXT);

	// Send a binary frame (just an example).
	$result = $ws->Write("Testbinary", WebSocket::FRAMETYPE_BINARY);

	// Main loop.
	$result = $ws->Wait();
	while ($result["success"])
	{
		do
		{
			$result = $ws->Read();
			if (!$result["success"])  break;
			if ($result["data"] !== false)
			{
				// Do something with the data.
				var_dump($result["data"]);
			}
		} while ($result["data"] !== false);

		$result = $ws->Wait();
	}

	// An error occurred.
	var_dump($result);
?>

The WebSocket class manages two queues - a read queue and a write queue - and does most of its work in the Wait() function. If you know anything about the WebSocket protocol, you know there are control frames and non-control frames. The control frames are difficult to deal with because they usually happen mid-stream but the WebSocket class automatically takes care of all of those frames for you so that you don't have to. What that means is that when you get a packet of data from the WebSocket class, the data is intended for your application.

One important thing to note about the WebSocket and WebSocketServer classes: every major operation other than Connect() and Disconnect() is asynchronous in client mode. Connect() and Disconnect() are also asynchronous in server mode. This means that reads and writes return immediately instead of blocking, so a read may come back with no data. The data will eventually be sent/received in the Wait() function. When Wait() returns, there is usually, but not always, something to do.

WebSocketServer implements a WebSocket server that allows a PHP WebSocket application to handle multiple clients with relative ease. WebSocketServer is an experimental product. You can try it out by running 'test_websocket_server.php' from the complete package on one command-line and 'test_websocket_client.php' from a couple more command-lines.

See the following for in-depth documentation and examples: WebSocket class documentation and WebSocketServer class documentation.

Writing a Custom Web Server

The Ultimate Web Scraper Toolkit includes a nifty class called WebServer. It allows for custom web servers to be made that do various things. One of those things might be a complex API that a user installs on their home computer. The WebServer class takes all the power and functionality of the baseline HTTP class and flips it over like a delicious pancake to serve up content.

See a complete example web server here: 'test_web_server.php'

The example starts a localhost server on port 5578 and waits for connections. It also supports upgrading connections to WebSocket. All request types - GET, POST with 'application/x-www-form-urlencoded', POST with 'form-data', and POST with JSON bodies - are supported. The input is condensed into a single data array before being passed off to the API, assuming the user supplied an API key.

The example code forms common boilerplate logic for creating your own custom web server.

The WebServer class isn't going to win any awards for performance, beauty, or even stability against things like denial-of-service attacks. It will, however, win shiny awards for functionality and having enough features. It is written in pure PHP and has no special dependencies other than the HTTP class. Fewer external dependencies usually equates to fewer deployment problems.

See the WebServer classes documentation for in-depth details and examples.

Other Uses

The Ultimate Web Scraper Toolkit has many uses beyond pulling data down off the Internet and writing robots. For example, it can be used to scan a collection of static HTML documents on a host to find orphaned pages that are no longer being linked to:

<?php
	require_once "support/http.php";
	require_once "support/web_browser.php";
	require_once "support/simple_html_dom.php";

	// Customize options.
	$basepath = str_replace("\\", "/", dirname(__FILE__)) . "/html";
	$baseurl = "http://www.mysite.com/";
	$rootdomains = array("http://www.mysite.com/", "http://mysite.com/");
	$rootdocs = array("index.html", "index.php");
	$livescan = false;

	function LoadURLs(&$urls, $baseurl, $basepath)
	{
		if (substr($baseurl, -1) != "/")  $baseurl .= "/";

		$dir = @opendir($basepath);
		if ($dir)
		{
			while (($file = readdir($dir)) !== false)
			{
				if ($file != "." && $file != "..")
				{
					if (is_dir($basepath . "/" . $file))  LoadURLs($urls, $baseurl . $file, $basepath . "/" . $file);
					else  $urls[HTTP::ConvertRelativeToAbsoluteURL($baseurl, $file)] = $basepath . "/" . $file;
				}
			}

			closedir($dir);
		}
	}

	$html = new simple_html_dom();
	$urls = array();
	LoadURLs($urls, $baseurl, $basepath);

	// Find the root file.
	$processurls = array();
	foreach ($rootdocs as $file)
	{
		$url = HTTP::ConvertRelativeToAbsoluteURL($baseurl, $file);
		if (isset($urls[$url]))
		{
			$processurls[] = $url;

			break;
		}
	}

	// Process all URLs.
	while (count($processurls))
	{
		$url = array_shift($processurls);
		if (isset($urls[$url]))
		{
			$filename = $urls[$url];
			unset($urls[$url]);

			if (!$livescan)  $data = (string)@file_get_contents($filename);
			else
			{
				$web = new WebBrowser();
				$result = $web->Process($url);
				$data = "";

				if (!$result["success"])  echo "Error retrieving URL.  " . $result["error"] . "\n";
				else if ($result["response"]["code"] != 200)  echo "Error retrieving URL.  Server returned:  " . $result["response"]["code"] . " " . $result["response"]["meaning"] . "\n";
				else  $data = $result["body"];
			}

			$html->load($data);
			$rows = $html->find("a[href]");
			foreach ($rows as $row)
			{
				$url2 = (string)$row->href;
				foreach ($rootdomains as $domain)
				{
					if (strtolower(substr($url2, 0, strlen($domain))) == strtolower($domain))  $url2 = substr($url2, strlen($domain) - 1);
				}
				$url2 = HTTP::ConvertRelativeToAbsoluteURL($url, $url2);

				$processurls[] = $url2;
			}
		}
	}

	// Output files not referenced anywhere.
	echo "Orphaned files:\n\n";
	foreach ($urls as $url => $file)
	{
		echo $file . "\n";
	}
?>

If you have a specific example of a common scraper-related task that you'd like to see documented here, please drop by the forums.

Limitations

The Ultimate Web Scraper Toolkit is a serious piece of software that is approximately 99.5% effective across most websites. There are many things that can go wrong on the Internet and you are bound to encounter a lot of them if you need to scrape a lot of content. You'll encounter everything from rate limits to network timeouts to really gnarly HTML. You'll learn about responsibly retrying failed requests and quickly master the TagFilter class. However, your best bet is to try to find someone else who has already done the heavy lifting for whatever content you want to get. Barring the availability of the data you want in a nice, neat package, this toolkit plus your web browser's built-in tools to traverse the DOM will probably do the trick. The toolkit can scrape data out of some of the nastiest content in existence (ASP.NET and Microsoft Word HTML, I'm looking at you) and do it cleanly with less code than anything else out there.

The remaining 0.5% of websites are pure Javascript sites where the entire content is generated using Javascript. For those sites, you'll need a real web browser. Fortunately, there is PhantomJS (headless Webkit), which can be scripted (i.e. automated) to handle extremely ugly stuff such as the aforementioned Javascript-heavy sites. However, PhantomJS is rather resource intensive and slooooow. After all, PhantomJS emulates a real web browser which includes the full startup sequence and then it proceeds to download the entire page's content. That, in turn, can take hundreds of requests to complete and can easily include downloading things such as ads.

Honestly, in the last decade of extensively using this toolkit, I've only run into one website that absolutely required PhantomJS. Everything else I've built works great with the Ultimate Web Scraper Toolkit.
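
As a minimal sketch of the "responsibly retrying failed requests" idea mentioned in the Limitations section above (the retry count, delays, and URL are arbitrary choices for illustration):

<?php
	require_once "support/web_browser.php";

	$url = "http://www.somesite.com/something/";
	$web = new WebBrowser();

	// Retry a few times with an increasing delay between attempts.
	$maxretries = 3;
	for ($attempt = 1; $attempt <= $maxretries; $attempt++)
	{
		$result = $web->Process($url);

		// Stop as soon as a usable response arrives.
		if ($result["success"] && $result["response"]["code"] == 200)  break;

		// Back off before the next attempt to avoid hammering the server.
		if ($attempt < $maxretries)  sleep(5 * $attempt);
	}

	// Check for connectivity and response errors.
	if (!$result["success"])  echo "Error retrieving URL.  " . $result["error"] . "\n";
	else if ($result["response"]["code"] != 200)  echo "Error retrieving URL.  Server returned:  " . $result["response"]["code"] . " " . $result["response"]["meaning"] . "\n";
	else
	{
		// Do something with the response.
	}
?>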

© CubicleSoft