Ultimate Web Scraper Toolkit Documentation

The Ultimate Web Scraper Toolkit is a powerful set of tools designed to handle all of your web scraping needs on nearly all web hosts. The toolkit makes RFC-compliant web requests that are indistinguishable from a real web browser, provides a web browser-like state engine for handling cookies and redirects, and includes a full cURL emulation layer for web hosts without the PHP cURL extension installed. Simple HTML DOM is included to easily extract the desired content from each retrieved document.

While this toolkit makes it really easy to scrape just about any content from the web, please don't do anything illegal. There is this little thing called copyright law that most countries have in place to protect various works.

Features

The following are a few features of the Ultimate Web Scraper Toolkit:

Easily retrieve content from web servers with RFC-compliant HTTP/HTTPS requests that look like they came from a real web browser.
A web browser-like state engine (the WebBrowser class) that automatically handles cookies, redirects, and HTML form extraction and submission.
A full cURL emulation layer for web hosts without the PHP cURL extension installed.
Asynchronous (non-blocking) socket support for retrieving multiple URLs simultaneously.
A WebSocket client class and an experimental WebSocketServer class.
Simple HTML DOM included for extracting the desired content from retrieved documents.

And much more.

License

The Ultimate Web Scraper Toolkit is extracted from Barebones CMS and the license is also your pick of MIT or LGPL. The license and restrictions are identical to the Barebones CMS License.

If you find the Ultimate Web Scraper Toolkit useful, financial donations are sincerely appreciated and go towards future development efforts.

Download

Ultimate Web Scraper Toolkit 1.0RC14 is the fourteenth release candidate of the Ultimate Web Scraper Toolkit.

Download ultimate-web-scraper-1.0rc14.zip

If you find the Ultimate Web Scraper Toolkit useful, please donate toward future development efforts.

Installation

Installing the Ultimate Web Scraper Toolkit is easy. The installation procedure is as follows: download and extract the latest release, then upload the 'support' directory to the server so that your scripts can require_once the class files they need.

Installation is easy. Using the toolkit is a bit more difficult.
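
A quick way to confirm that the files are in place is to run a minimal retrieval script. This is only a sketch: it assumes the 'support' directory sits next to the script, and the URL is a placeholder.

<?php
	require_once "support/http.php";
	require_once "support/web_browser.php";

	// Placeholder URL - replace with a page you are allowed to retrieve.
	$url = "http://www.somesite.com/";
	$web = new WebBrowser();
	$result = $web->Process($url);

	if (!$result["success"])  echo "Error retrieving URL.  " . $result["error"] . "\n";
	else  echo "Retrieved the URL.  Server returned:  " . $result["response"]["code"] . " " . $result["response"]["meaning"] . "\n";
?>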

Upgrading

Like Barebones CMS, upgrading the Ultimate Web Scraper Toolkit is easy - just upload the new files to the server and overwrite existing files.

Scraping Webpages - The Easy Way

Webpages are hard to retrieve and harder to parse, and doing both consistently across a wide variety of web hosts and scenarios is very difficult to do on your own. The Ultimate Web Scraper Toolkit makes both retrieving and parsing webpages a whole lot easier.

Example:

<?php
	require_once "support/http.php";
	require_once "support/web_browser.php";
	require_once "support/simple_html_dom.php";

	// Simple HTML DOM tends to leak RAM like
	// a sieve.  Declare what you will need here.
	// Objects are reusable.
	$html = new simple_html_dom();

	$url = "http://www.somesite.com/something/";
	$web = new WebBrowser();
	$result = $web->Process($url);

	if (!$result["success"])  echo "Error retrieving URL.  " . $result["error"] . "\n";
	else if ($result["response"]["code"] != 200)  echo "Error retrieving URL.  Server returned:  " . $result["response"]["code"] . " " . $result["response"]["meaning"] . "\n";
	else
	{
		echo "All the URLs:\n";
		$html->load($result["body"]);
		$rows = $html->find("a[href]");
		foreach ($rows as $row)
		{
			echo "\t" . $row->href . "\n";
		}
	}
?>

This brief example retrieves a URL while emulating some flavor of Firefox and displays the value of the 'href' attribute of every anchor tag that has one. See the Simple HTML DOM documentation for details on that library. See the HTTP class for in-depth documentation on the document retrieval routines.
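
Simple HTML DOM can extract far more than anchor tags. The following sketch, along the same lines as the example above, pulls the page title and every image source. The URL is a placeholder and the selectors ('title', 'img[src]') are standard Simple HTML DOM selectors.

<?php
	require_once "support/http.php";
	require_once "support/web_browser.php";
	require_once "support/simple_html_dom.php";

	$html = new simple_html_dom();

	$url = "http://www.somesite.com/something/";
	$web = new WebBrowser();
	$result = $web->Process($url);

	if ($result["success"] && $result["response"]["code"] == 200)
	{
		$html->load($result["body"]);

		// The first <title> tag, as plain text.
		$title = $html->find("title", 0);
		if ($title !== null)  echo "Title:  " . $title->plaintext . "\n";

		// Every <img> tag that has a 'src' attribute.
		echo "All the image URLs:\n";
		foreach ($html->find("img[src]") as $row)
		{
			echo "\t" . $row->src . "\n";
		}
	}
?>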

Scraping Webpages - The Hard Way

The previous example used the web browser emulation layer to retrieve the content. Sometimes, though not usually, getting into the nitty-gritty details of constructing a web request is the desired option.

Example:

<?php
	require_once "support/http.php";
	require_once "support/simple_html_dom.php";

	// Simple HTML DOM tends to leak RAM like
	// a sieve.  Declare what you will need here.
	// Objects are reusable.
	$html = new simple_html_dom();

	$url = "http://www.somesite.com/something/";
	$options = array(
		"headers" => array(
			"User-Agent" => HTTP::GetWebUserAgent("Firefox"),
			"Accept" => "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
			"Accept-Language" => "en-us,en;q=0.5",
			"Accept-Charset" => "ISO-8859-1,utf-8;q=0.7,*;q=0.7",
			"Cache-Control" => "max-age=0"
		)
	);
	$result = HTTP::RetrieveWebpage($url, $options);
	if (!$result["success"])  echo "Error retrieving URL.  " . $result["error"] . "\n";
	else if ($result["response"]["code"] != 200)  echo "Error retrieving URL.  Server returned:  " . $result["response"]["code"] . " " . $result["response"]["meaning"] . "\n";
	else
	{
		echo "All the URLs:\n";
		$html->load($result["body"]);
		$rows = $html->find("a[href]");
		foreach ($rows as $row)
		{
			echo "\t" . $row->href . "\n";
		}
	}
?>

This example performs the same operation as the previous section, but doesn't get all the benefits of the web browser emulation layer such as automatically handling redirects and cookies.
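
When skipping the emulation layer, that state has to be carried by hand. The following sketch sends a cookie on a follow-up request using the same options format as above; the cookie name and value are hypothetical.

<?php
	require_once "support/http.php";

	$url = "http://www.somesite.com/members/";
	$options = array(
		"headers" => array(
			"User-Agent" => HTTP::GetWebUserAgent("Firefox"),
			// Hypothetical cookie.  The WebBrowser class tracks cookies automatically.
			"Cookie" => "sessionid=0123456789abcdef"
		)
	);
	$result = HTTP::RetrieveWebpage($url, $options);

	if (!$result["success"])  echo "Error retrieving URL.  " . $result["error"] . "\n";
	else  echo "Server returned:  " . $result["response"]["code"] . " " . $result["response"]["meaning"] . "\n";
?>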

Handling HTML Forms

Traditionally, one of the hardest things to handle with web scraping is the classic HTML form. If you are like me, then you've generally faked it and handled form submissions manually by bypassing the form itself (i.e. copying variable names by hand). The problem is that if/when the server side changes how it does things, the old form submission code tends to break in spectacular ways. This toolkit includes several functions designed to make real form handling a walk in the park.

Example:

<?php
	require_once "support/http.php";
	require_once "support/web_browser.php";
	require_once "support/simple_html_dom.php";

	$url = "https://www.google.com/";
	$web = new WebBrowser(array("extractforms" => true));
	$result = $web->Process($url);

	if (!$result["success"])  echo "Error retrieving URL.  " . $result["error"] . "\n";
	else if ($result["response"]["code"] != 200)  echo "Error retrieving URL.  Server returned:  " . $result["response"]["code"] . " " . $result["response"]["meaning"] . "\n";
	else if (count($result["forms"]) != 1)  echo "Was expecting one form.  Received:  " . count($result["forms"]) . "\n";
	else
	{
		$form = $result["forms"][0];

		$form->SetFormValue("q", "barebones cms");

		$result2 = $form->GenerateFormRequest("btnK");
		$result = $web->Process($result2["url"], "auto", $result2["options"]);

		if (!$result["success"])  echo "Error retrieving URL.  " . $result["error"] . "\n";
		else if ($result["response"]["code"] != 200)  echo "Error retrieving URL.  Server returned:  " . $result["response"]["code"] . " " . $result["response"]["meaning"] . "\n";
		else
		{
			// Do something with the results page here...
		}
	}
?>

This example retrieves Google's homepage, extracts the search form, modifies the search field, generates and submits the request, and gets the response. All of that in just a few lines of code.

Note that, like the rest of the WebBrowser class, the form handler doesn't process JavaScript.
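
If it isn't obvious which fields a form exposes, the extracted form objects can be dumped and inspected before filling anything in. A minimal sketch, using the same 'extractforms' option as the example above:

<?php
	require_once "support/http.php";
	require_once "support/web_browser.php";

	$url = "https://www.google.com/";
	$web = new WebBrowser(array("extractforms" => true));
	$result = $web->Process($url);

	if ($result["success"] && $result["response"]["code"] == 200)
	{
		// Dump each extracted form to see its fields and their current values.
		foreach ($result["forms"] as $num => $form)
		{
			echo "Form " . $num . ":\n";
			var_dump($form);
		}
	}
?>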

Using the cURL Emulation Layer

The cURL emulation layer is a drop-in replacement for cURL on web hosts that don't have cURL installed. This isn't some cheesy, half-baked solution. The source code carefully follows the cURL and PHP documentation, and every cURL define() and function available as of PHP 5.4.0 is implemented.

Example usage:

<?php
	if (!function_exists("curl_init"))
	{
		require_once "support/http.php";
		require_once "support/web_browser.php";
		require_once "support/emulate_curl.php";
	}

	// Make cURL calls here...
?>

This example just shows how easy it is to add cURL support to any web host.
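
Once the emulation layer (or the real cURL extension) is loaded, standard cURL calls work as usual. The following sketch retrieves a placeholder URL and checks the HTTP status code using only core cURL functions.

<?php
	if (!function_exists("curl_init"))
	{
		require_once "support/http.php";
		require_once "support/web_browser.php";
		require_once "support/emulate_curl.php";
	}

	// Placeholder URL.
	$ch = curl_init("http://www.somesite.com/something/");
	curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
	curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);

	$body = curl_exec($ch);
	if ($body === false)  echo "cURL error:  " . curl_error($ch) . "\n";
	else  echo "Server returned:  " . curl_getinfo($ch, CURLINFO_HTTP_CODE) . "\n";

	curl_close($ch);
?>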

There are, however, a few limitations and differences. CURLOPT_VERBOSE is a lot more verbose. SSL/TLS support is a little flaky at times. Some things like DNS options are ignored. Only HTTP and HTTPS are supported protocols at this time. Return values from curl_getinfo() calls are close but not the same. curl_setopt() delays processing until curl_exec() is called. Multi-handle support "cheats" by performing operations in linear execution rather than parallel execution.

Using Asynchronous Sockets

Asynchronous, or non-blocking, sockets allow for a lot of powerful functionality such as scraping multiple pages and sites simultaneously from a single script. They also allow for using weird features of HTTP such as sending a second request to a server on an active connection while the previous request's response is still arriving (HTTP pipelining).

Example usage:

<?php
	require_once "support/http.php";
	require_once "support/web_browser.php";
	require_once "support/simple_html_dom.php";
	require_once "support/multi_async_helper.php";

	// The URLs we want to load.
	$urls = array(
		"http://www.barebonescms.com/",
		"http://www.cubiclesoft.com/",
		"http://www.barebonescms.com/documentation/ultimate_web_scraper_toolkit/",
	);

	// Build the queue.
	$helper = new MultiAsyncHelper();
	$helper->SetConcurrencyLimit(3);

	// Mix in a regular file handle just for fun.
	$fp = fopen(__FILE__, "rb");
	stream_set_blocking($fp, 0);
	$helper->Set("__fp", $fp, "MultiAsyncHelper::ReadOnly");

	// Add the URLs to the async helper.
	$pages = array();
	foreach ($urls as $url)
	{
		$pages[$url] = new WebBrowser();
		$pages[$url]->ProcessAsync($helper, $url, NULL, $url);
	}

	// Run the main loop.
	$result = $helper->Wait();
	while ($result["success"])
	{
		// Process the file handle if it is ready for reading.
		if (isset($result["read"]["__fp"]))
		{
			$fp = $result["read"]["__fp"];
			$data = fread($fp, 500);
			if ($data === false || feof($fp))
			{
				echo "End of file reached.\n";

				$helper->Remove("__fp");
			}
		}

		// Process everything else.
		foreach ($result["removed"] as $key => $info)
		{
			if ($key === "__fp")  continue;

			if (!$info["result"]["success"])  echo "Error retrieving URL (" . $key . ").  " . $info["result"]["error"] . "\n";
			else if ($info["result"]["response"]["code"] != 200)  echo "Error retrieving URL (" . $key . ").  Server returned:  " . $info["result"]["response"]["line"] . "\n";
			else
			{
				echo "A response was returned (" . $key . ").\n";

				// Do something with the data here...
			}

			unset($pages[$key]);
		}

		// Break out of the loop when nothing is left.
		if ($result["numleft"] < 1)  break;

		$result = $helper->Wait();
	}

	// An error occurred.
	if (!$result["success"])  var_dump($result);
?>

This is a fairly complete example that retrieves three different URLs while simultaneously reading a file, processes up to three items in the queue at a time (see SetConcurrencyLimit()), and handles the various responses appropriately. Once all items have been processed, the script exits. MultiAsyncHelper is a flexible class that handles all asynchronous stream types (not just sockets).

Using the WebSocket Layer

Once upon a time, the web used to be a sane place filled with the HTTP protocol. Then the WebSocket protocol (RFC 6455) came along. WebSocket is a bi-directional, asynchronous streaming, fragmentation-capable, frame-based protocol which allows a remote server to chug all of your available bandwidth. Awesome.

The protocol is a little bit difficult to deal with but the handy, creatively named WebSocket class makes talking to WebSocket servers much, much easier.

Example usage:

<?php
	// Requires both the WebBrowser and HTTP classes to work.
	require_once "support/websocket.php";
	require_once "support/web_browser.php";
	require_once "support/http.php";

	$ws = new WebSocket();

	// The first parameter is the WebSocket server.
	// The second parameter is the Origin URL.
	$result = $ws->Connect("ws://ws.something.org/", "http://www.something.org");
	if (!$result["success"])
	{
		var_dump($result);
		exit();
	}

	// Send a text frame (just an example).
	$result = $ws->Write("Testtext", WebSocket::FRAMETYPE_TEXT);

	// Send a binary frame (just an example).
	$result = $ws->Write("Testbinary", WebSocket::FRAMETYPE_BINARY);

	// Main loop.
	$result = $ws->Wait();
	while ($result["success"])
	{
		do
		{
			$result = $ws->Read();
			if (!$result["success"])  break;
			if ($result["data"] !== false)
			{
				// Do something with the data.
				var_dump($result["data"]);
			}
		} while ($result["data"] !== false);

		$result = $ws->Wait();
	}

	// An error occurred.
	var_dump($result);
?>

The WebSocket class manages two queues - a read queue and a write queue - and does most of its work in the Wait() function. If you know anything about the WebSocket protocol, you know there are control frames and non-control frames. The control frames are difficult to deal with because they usually happen mid-stream but the WebSocket class automatically takes care of all of those frames for you so that you don't have to. What that means is that when you get a packet of data from the WebSocket class, the data is intended for your application.

One important thing to note about the WebSocket and WebSocketServer classes: Every major operation other than Connect() and Disconnect() is asynchronous in client mode. Connect() and Disconnect() are also asynchronous in server mode. This means that reads and writes succeed immediately (they don't block), so a read may return no data. The data will eventually be sent/received in the Wait() function. When Wait() returns, there is usually, but not always, something to do.

WebSocketServer implements a WebSocket server that allows a PHP WebSocket application to handle multiple clients with relative ease. WebSocketServer is an experimental product. You can try it out by running 'test_websocket_server.php' from the complete package in one command-line window and 'test_websocket_client.php' in one or more additional command-line windows.

See the WebSocket and WebSocketServer classes for in-depth documentation.

Other Uses

The Ultimate Web Scraper Toolkit has many uses beyond pulling data down off the Internet and writing robots. For example, it can be used to scan a collection of static HTML documents on a host to find orphaned pages that are no longer being linked to:

<?php
	require_once "support/http.php";
	require_once "support/web_browser.php";
	require_once "support/simple_html_dom.php";

	// Customize options.
	$basepath = str_replace("\\", "/", dirname(__FILE__)) . "/html";
	$baseurl = "http://www.mysite.com/";
	$rootdomains = array("http://www.mysite.com/", "http://mysite.com/");
	$rootdocs = array("index.html", "index.php");
	$livescan = false;

	function LoadURLs(&$urls, $baseurl, $basepath)
	{
		if (substr($baseurl, -1) != "/")  $baseurl .= "/";

		$dir = @opendir($basepath);
		if ($dir)
		{
			while (($file = readdir($dir)) !== false)
			{
				if ($file != "." && $file != "..")
				{
					if (is_dir($basepath . "/" . $file))  LoadURLs($urls, $baseurl . $file, $basepath . "/" . $file);
					else  $urls[HTTP::ConvertRelativeToAbsoluteURL($baseurl, $file)] = $basepath . "/" . $file;
				}
			}

			closedir($dir);
		}
	}

	$html = new simple_html_dom();
	$urls = array();
	LoadURLs($urls, $baseurl, $basepath);

	// Find the root file.
	$processurls = array();
	foreach ($rootdocs as $file)
	{
		$url = HTTP::ConvertRelativeToAbsoluteURL($baseurl, $file);
		if (isset($urls[$url]))
		{
			$processurls[] = $url;

			break;
		}
	}

	// Process all URLs.
	while (count($processurls))
	{
		$url = array_shift($processurls);
		if (isset($urls[$url]))
		{
			$filename = $urls[$url];
			unset($urls[$url]);

			if (!$livescan)  $data = (string)@file_get_contents($filename);
			else
			{
				$web = new WebBrowser();
				$result = $web->Process($url);
				$data = "";

				if (!$result["success"])  echo "Error retrieving URL.  " . $result["error"] . "\n";
				else if ($result["response"]["code"] != 200)  echo "Error retrieving URL.  Server returned:  " . $result["response"]["code"] . " " . $result["response"]["meaning"] . "\n";
				else  $data = $result["body"];
			}

			$html->load($data);
			$rows = $html->find("a[href]");
			foreach ($rows as $row)
			{
				$url2 = (string)$row->href;
				foreach ($rootdomains as $domain)
				{
					if (strtolower(substr($url2, 0, strlen($domain))) == strtolower($domain))  $url2 = substr($url2, strlen($domain) - 1);
				}
				$url2 = HTTP::ConvertRelativeToAbsoluteURL($url, $url2);

				$processurls[] = $url2;
			}
		}
	}

	// Output files not referenced anywhere.
	echo "Orphaned files:\n\n";
	foreach ($urls as $url => $file)
	{
		echo $file . "\n";
	}
?>

If you have a specific example of a common scraper-related task that you'd like to see documented here, please drop by the forums.

© CubicleSoft