Ultimate Web Scraper Toolkit Documentation

The Ultimate Web Scraper Toolkit is a powerful set of tools designed to handle all of your web scraping needs on nearly all web hosts. The toolkit includes routines that make RFC-compliant web requests which are indistinguishable from a real web browser, a web browser-like state engine for handling cookies and redirects, and a full cURL emulation layer for web hosts without the PHP cURL extension installed. Simple HTML DOM is included to easily extract the desired content from each retrieved document.

While this toolkit makes it really easy to scrape just about any content from the web, please don't do anything illegal. There is this little thing called copyright law that most countries have to protect various works.

Features

The following are a few features of the Ultimate Web Scraper Toolkit:

- Makes RFC-compliant web requests that are indistinguishable from a real web browser.
- A web browser-like state engine that handles cookies and redirects automatically.
- Real HTML form extraction and submission without manually copying variable names.
- A drop-in cURL emulation layer for web hosts without the PHP cURL extension installed.
- Bundled Simple HTML DOM for extracting the desired content from each retrieved document.

And much more.

License

The Ultimate Web Scraper Toolkit is extracted from Barebones CMS and the license is also your pick of MIT or LGPL. The license and restrictions are identical to the Barebones CMS License.

If you find the Ultimate Web Scraper Toolkit useful, financial donations are sincerely appreciated and go toward future development efforts.

Download

Ultimate Web Scraper Toolkit 1.0RC11 is the eleventh release candidate of the Ultimate Web Scraper Toolkit.

Download ultimate-web-scraper-1.0rc11.zip

If you find the Ultimate Web Scraper Toolkit useful, please donate toward future development efforts.

Installation

Installing the Ultimate Web Scraper Toolkit is easy. The installation procedure is as follows:

- Download the latest release and extract the contents.
- Upload the toolkit's 'support' directory to the desired location on the server.
- require_once the files you need (e.g. 'support/http.php' and 'support/web_browser.php') from your own scripts, as shown in the sketch below.

Installation is easy. Using the toolkit is a bit more difficult.
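
A minimal sketch of the include step, assuming the 'support' directory sits next to your script (adjust the paths to wherever you uploaded the files):

<?php
	// Adjust these paths to wherever the 'support' directory was uploaded.
	require_once "support/http.php";
	require_once "support/web_browser.php";
	require_once "support/simple_html_dom.php";

	// The toolkit's classes are now available for use.
	$web = new WebBrowser();
?>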

Upgrading

As with Barebones CMS, upgrading the Ultimate Web Scraper Toolkit is easy - just upload the new files to the server and overwrite the existing files.

Scraping Webpages - The Easy Way

Webpages are hard to retrieve and harder to parse, and doing both consistently across a wide variety of web hosts and scenarios is very difficult to do on your own. The Ultimate Web Scraper Toolkit makes both retrieving and parsing webpages a whole lot easier.

Example:

<?php
	require_once "support/http.php";
	require_once "support/web_browser.php";
	require_once "support/simple_html_dom.php";

	// Simple HTML DOM tends to leak RAM like
	// a sieve.  Declare what you will need here.
	// Objects are reusable.
	$html = new simple_html_dom();

	$url = "http://www.somesite.com/something/";
	$web = new WebBrowser();
	$result = $web->Process($url);

	if (!$result["success"])  echo "Error retrieving URL.  " . $result["error"] . "\n";
	else if ($result["response"]["code"] != 200)  echo "Error retrieving URL.  Server returned:  " . $result["response"]["code"] . " " . $result["response"]["meaning"] . "\n";
	else
	{
		echo "All the URLs:\n";
		$html->load($result["body"]);
		$rows = $html->find("a[href]");
		foreach ($rows as $row)
		{
			echo "\t" . $row->href . "\n";
		}
	}
?>

This brief example retrieves a URL while emulating some flavor of Firefox and displays the value of the 'href' attribute of all anchor tags that have an 'href' attribute. See the Simple HTML DOM documentation for details on that library. See the HTTP class for in-depth documentation on the document retrieval routines.
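
Simple HTML DOM can pull out far more than anchor tags. Here is a minimal sketch that extracts the page title and all image sources, assuming the $html and $result variables from the example above (the selectors follow standard Simple HTML DOM syntax):

<?php
	// Reuse the same $html object and the $result body from the example above.
	$html->load($result["body"]);

	// Grab the first <title> tag, if any.
	$title = $html->find("title", 0);
	echo "Title:  " . ($title !== null ? trim($title->plaintext) : "(none)") . "\n";

	// Grab the 'src' attribute of every <img> tag that has one.
	echo "All the images:\n";
	foreach ($html->find("img[src]") as $row)
	{
		echo "\t" . $row->src . "\n";
	}
?>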

Scraping Webpages - The Hard Way

The previous example used the web browser emulation layer to retrieve the content. Sometimes getting into the nitty-gritty details of constructing a web request is the desired option.

Example:

<?php
	require_once "support/http.php";
	require_once "support/simple_html_dom.php";

	// Simple HTML DOM tends to leak RAM like
	// a sieve.  Declare what you will need here.
	// Objects are reusable.
	$html = new simple_html_dom();

	$url = "http://www.somesite.com/something/";
	$options = array(
		"headers" => array(
			"User-Agent" => HTTP::GetWebUserAgent("Firefox"),
			"Accept" => "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
			"Accept-Language" => "en-us,en;q=0.5",
			"Accept-Charset" => "ISO-8859-1,utf-8;q=0.7,*;q=0.7",
			"Cache-Control" => "max-age=0"
		)
	);
	$result = HTTP::RetrieveWebpage($url, $options);
	if (!$result["success"])  echo "Error retrieving URL.  " . $result["error"] . "\n";
	else if ($result["response"]["code"] != 200)  echo "Error retrieving URL.  Server returned:  " . $result["response"]["code"] . " " . $result["response"]["meaning"] . "\n";
	else
	{
		echo "All the URLs:\n";
		$html->load($result["body"]);
		$rows = $html->find("a[href]");
		foreach ($rows as $row)
		{
			echo "\t" . $row->href . "\n";
		}
	}
?>

This example performs the same operation as the previous section, but doesn't get all the benefits of the web browser emulation layer such as automatically handling redirects and cookies.
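
The low-level options array controls more than just headers. The following sketch sends a POST request through the same HTTP::RetrieveWebpage() call. The 'method' and 'postvars' option names are assumptions based on common usage of the HTTP class - check the HTTP class documentation for the authoritative list - and the URL and form field names are placeholders:

<?php
	require_once "support/http.php";

	$url = "http://www.somesite.com/login/";

	// NOTE:  The 'method' and 'postvars' option names are assumptions here.
	// See the HTTP class documentation for the full list of supported options.
	$options = array(
		"method" => "POST",
		"headers" => array(
			"User-Agent" => HTTP::GetWebUserAgent("Firefox")
		),
		"postvars" => array(
			"user" => "exampleuser",
			"pass" => "examplepass"
		)
	);

	$result = HTTP::RetrieveWebpage($url, $options);
	if (!$result["success"])  echo "Error retrieving URL.  " . $result["error"] . "\n";
	else  echo "Server returned:  " . $result["response"]["code"] . " " . $result["response"]["meaning"] . "\n";
?>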

Handling HTML Forms

Traditionally, one of the hardest things to handle with web scraping is the classic HTML form. If you are like me, you've generally faked it and handled form submissions manually by bypassing the form itself (i.e. copying variable names by hand). The problem is that if/when the server side changes how it does things, the old form submission code tends to break in spectacular ways. This toolkit includes several functions designed to make real form handling a walk in the park.

Example:

<?php
	require_once "support/http.php";
	require_once "support/web_browser.php";
	require_once "support/simple_html_dom.php";

	$url = "https://www.google.com/";
	$web = new WebBrowser(array("extractforms" => true));
	$result = $web->Process($url);

	if (!$result["success"])  echo "Error retrieving URL.  " . $result["error"] . "\n";
	else if ($result["response"]["code"] != 200)  echo "Error retrieving URL.  Server returned:  " . $result["response"]["code"] . " " . $result["response"]["meaning"] . "\n";
	else if (count($result["forms"]) != 1)  echo "Was expecting one form.  Received:  " . count($result["forms"]) . "\n";
	else
	{
		$form = $result["forms"][0];

		$form->SetFormValue("q", "barebones cms");

		$result2 = $form->GenerateFormRequest("btnK");
		$result = $web->Process($result2["url"], "auto", $result2["options"]);

		if (!$result["success"])  echo "Error retrieving URL.  " . $result["error"] . "\n";
		else if ($result["response"]["code"] != 200)  echo "Error retrieving URL.  Server returned:  " . $result["response"]["code"] . " " . $result["response"]["meaning"] . "\n";
		else
		{
			// Do something with the results page here...
		}
	}
?>

This example retrieves Google's homepage, extracts the search form, modifies the search field, generates and submits the request, and gets the response. All of that in just a few lines of code.

Note that, like the rest of the WebBrowser class, the form handler doesn't process JavaScript.
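
Because no JavaScript runs, any field that a page script would normally populate has to be set by hand before the form is submitted. A minimal sketch, assuming the $form and $web objects from the example above; the 'csrf_token' field name, its value, and the 'submitbutton' name are placeholders:

<?php
	// Hypothetical:  a hidden field that a script on the page would normally
	// fill in.  The field name and value below are placeholders.
	$form->SetFormValue("csrf_token", "token-value-extracted-from-the-page");

	// Generate and submit the request exactly as in the previous example,
	// passing the name of the submit button to use ("btnK" in the Google example).
	$result2 = $form->GenerateFormRequest("submitbutton");
	$result = $web->Process($result2["url"], "auto", $result2["options"]);
?>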

Using the cURL Emulation Layer

The cURL emulation layer is a drop-in replacement for cURL on web hosts that don't have cURL installed. This isn't some cheesy, half-baked solution. The source code carefully follows the cURL and PHP documentation. Every cURL define() and function available as of PHP 5.4.0 is provided.

Example usage:

<?php
	if (!function_exists("curl_init"))
	{
		require_once "support/http.php";
		require_once "support/web_browser.php";
		require_once "support/emulate_curl.php";
	}

	// Make cURL calls here...
?>

This example just shows how easy it is to add cURL support to any web host.
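
For instance, here is a minimal sketch of a standard cURL request that works the same whether the real extension or the emulation layer is loaded (the URL is a placeholder):

<?php
	if (!function_exists("curl_init"))
	{
		require_once "support/http.php";
		require_once "support/web_browser.php";
		require_once "support/emulate_curl.php";
	}

	// Standard cURL calls from here on.
	$ch = curl_init("http://www.somesite.com/something/");
	curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
	curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
	$data = curl_exec($ch);

	if ($data === false)  echo "cURL error:  " . curl_error($ch) . "\n";
	else  echo "Retrieved " . strlen($data) . " bytes.\n";

	curl_close($ch);
?>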

There are a few limitations and differences, though. CURLOPT_VERBOSE is a lot more verbose. SSL/TLS support is a little flaky at times. Some options, such as the DNS options, are ignored. Only HTTP and HTTPS are supported protocols at this time. Return values from curl_getinfo() calls are close but not identical. curl_setopt() delays processing until curl_exec() is called. Multi-handle support "cheats" by performing operations sequentially rather than in parallel.

Other Uses

The Ultimate Web Scraper Toolkit has many uses beyond pulling data down off the Internet and writing robots. For example, it can be used to scan a collection of static HTML documents on a host to find orphaned pages that are no longer being linked to:

<?php
	require_once "support/http.php";
	require_once "support/web_browser.php";
	require_once "support/simple_html_dom.php";

	// Customize options.
	$basepath = str_replace("\\", "/", dirname(__FILE__)) . "/html";
	$baseurl = "http://www.mysite.com/";
	$rootdomains = array("http://www.mysite.com/", "http://mysite.com/");
	$rootdocs = array("index.html", "index.php");
	$livescan = false;

	function LoadURLs(&$urls, $baseurl, $basepath)
	{
		if (substr($baseurl, -1) != "/")  $baseurl .= "/";

		$dir = @opendir($basepath);
		if ($dir)
		{
			while (($file = readdir($dir)) !== false)
			{
				if ($file != "." && $file != "..")
				{
					if (is_dir($basepath . "/" . $file))  LoadURLs($urls, $baseurl . $file, $basepath . "/" . $file);
					else  $urls[HTTP::ConvertRelativeToAbsoluteURL($baseurl, $file)] = $basepath . "/" . $file;
				}
			}

			closedir($dir);
		}
	}

	$html = new simple_html_dom();
	$urls = array();
	LoadURLs($urls, $baseurl, $basepath);

	// Find the root file.
	$processurls = array();
	foreach ($rootdocs as $file)
	{
		$url = HTTP::ConvertRelativeToAbsoluteURL($baseurl, $file);
		if (isset($urls[$url]))
		{
			$processurls[] = $url;

			break;
		}
	}

	// Process all URLs.
	while (count($processurls))
	{
		$url = array_shift($processurls);
		if (isset($urls[$url]))
		{
			$filename = $urls[$url];
			unset($urls[$url]);

			if (!$livescan)  $data = (string)@file_get_contents($filename);
			else
			{
				$web = new WebBrowser();
				$result = $web->Process($url);
				$data = "";

				if (!$result["success"])  echo "Error retrieving URL.  " . $result["error"] . "\n";
				else if ($result["response"]["code"] != 200)  echo "Error retrieving URL.  Server returned:  " . $result["response"]["code"] . " " . $result["response"]["meaning"] . "\n";
				else  $data = $result["body"];
			}

			$html->load($data);
			$rows = $html->find("a[href]");
			foreach ($rows as $row)
			{
				$url2 = (string)$row->href;
				foreach ($rootdomains as $domain)
				{
					if (strtolower(substr($url2, 0, strlen($domain))) == strtolower($domain))  $url2 = substr($url2, strlen($domain) - 1);
				}
				$url2 = HTTP::ConvertRelativeToAbsoluteURL($url, $url2);

				$processurls[] = $url2;
			}
		}
	}

	// Output files not referenced anywhere.
	echo "Orphaned files:\n\n";
	foreach ($urls as $url => $file)
	{
		echo $file . "\n";
	}
?>

If you have a specific example of a common scraper-related task that you'd like to see documented here, please drop by the forums.

© CubicleSoft