Ultimate Web Scraper Toolkit Documentation

The Ultimate Web Scraper Toolkit is a powerful set of tools designed to handle all of your web scraping needs on most web hosts. This toolkit offers clean, RFC-compliant web scraping that makes it easy to create web requests that are indistinguishable from a real web browser, thus minimizing headaches with browser-sniffing servers. Once a document is retrieved, the included copy of the Simple HTML DOM library makes it easy to extract the desired content from the document via its jQuery-like selector syntax.

While this toolkit makes it really easy to scrape just about any content from the web, please don't do anything illegal. There is this little thing called copyright law that most countries have to protect various works.

If you are like me and familiar with the art of web scraping with PHP, you've probably run the gamut of file_get_contents(), fopen wrappers, cURL, etc. looking for something reliable. Each web host is different and there are limitations, restrictions, and annoyances to each method - in some cases, the web host blocks the functionality altogether. Also, some websites output different content based on server-side browser sniffing. The Ultimate Web Scraper Toolkit deals with all of these issues and lets you easily emulate the most popular web browsers so you don't get a headache. You're welcome.

This toolkit also comes with a full-blown web browser state engine. The web browser state engine offers a clean way to manage cookies and sessions, follows redirects (e.g. 301) automatically, and emulates web browser headers more closely. Think of it as a GUI-less web browser that wraps the core HTTP library.

The toolkit also includes a full-blown cURL emulation layer. This is a fantastic solution when a PHP library or class requires cURL but the web host doesn't have cURL installed or enabled. Simply include this toolkit before including the library and the code will work as expected with minor exceptions (e.g. multi handle support is done with serial calls instead of parallel).

Once the document is retrieved, the next step is to extract the content the application is interested in. If, like me, you have relied on preg_match() and friends in the past, you know the pain that even the most minor website redesign causes. Simple HTML DOM is the answer.
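To make the preg_match() pain concrete, here is a minimal sketch that compares a regular expression with a DOM-based approach on two versions of the same markup. It uses PHP's built-in DOMDocument class as a stand-in so that it runs without the toolkit; it is not Simple HTML DOM, and the function names and sample markup are made up for illustration.

```php
<?php
	// Hypothetical sample markup:  the same two links before and after
	// a small "redesign" (new attribute, different quotes, a line break).
	$before = '<p><a href="/news/">News</a> <a href="/about/">About</a></p>';
	$after = "<p><a class='nav' href='/news/'>News</a> <a\n href='/about/'>About</a></p>";

	// Brittle approach:  a pattern tied to one exact way of writing the tag.
	function ExtractLinksRegex($markup)
	{
		preg_match_all('/<a href="([^"]*)">/', $markup, $matches);

		return $matches[1];
	}

	// DOM approach:  parse the markup first, then ask for the attribute.
	function ExtractLinksDOM($markup)
	{
		$doc = new DOMDocument();

		// Real-world HTML is rarely valid.  Collect parser warnings quietly.
		$previous = libxml_use_internal_errors(true);
		$doc->loadHTML($markup);
		libxml_clear_errors();
		libxml_use_internal_errors($previous);

		$links = array();
		foreach ($doc->getElementsByTagName("a") as $node)
		{
			if ($node->hasAttribute("href"))  $links[] = $node->getAttribute("href");
		}

		return $links;
	}
```

Both functions return the same two links for $before. For $after, the regular expression no longer matches anything, while the DOM version still returns both links. A selector-based library such as Simple HTML DOM holds up against small markup changes for the same reason.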

License

The Ultimate Web Scraper Toolkit is extracted from Barebones CMS and the license is also your pick of MIT or LGPL. The license and restrictions are identical to the Barebones CMS License.

If you find the Ultimate Web Scraper Toolkit useful, financial donations are sincerely appreciated and go towards future development efforts.

Installation

Installing the Ultimate Web Scraper Toolkit is easy. The installation procedure is as follows:

Installation is easy. Using the toolkit is a bit more difficult.

Upgrading

Like Barebones CMS, upgrading the Ultimate Web Scraper Toolkit is easy - just upload the new files to the server and overwrite existing files.

Scraping A Webpage - The Easy Way

Webpages are hard to retrieve and harder to parse. Doing both consistently across a wide variety of web hosts and scenarios is very difficult to manage alone. The Ultimate Web Scraper Toolkit makes both retrieving and parsing webpages a whole lot easier.

Example:

<?php
	require_once "support/http.php";
	require_once "support/web_browser.php";
	require_once "support/simple_html_dom.php";

	// Simple HTML DOM tends to leak RAM like
	// a sieve.  Declare what you will need here.
	// Objects are reusable.
	$html = new simple_html_dom();

	$url = "http://www.somesite.com/something/";
	$web = new WebBrowser();
	$result = $web->Process($url);

	if (!$result["success"])  echo "Error retrieving URL.  " . $result["error"] . "\n";
	else if ($result["response"]["code"] != 200)  echo "Error retrieving URL.  Server returned:  " . $result["response"]["code"] . " " . $result["response"]["meaning"] . "\n";
	else
	{
		echo "All the URLs:\n";
		$html->load($result["body"]);
		$rows = $html->find("a[href]");
		foreach ($rows as $row)
		{
			echo "\t" . $row->href . "\n";
		}
	}
?>

This brief example retrieves a URL while emulating some flavor of Firefox and displays the value of the 'href' attribute of every anchor tag that has one. See the Simple HTML DOM documentation for details on that library. See the HTTP functions for in-depth documentation on the document retrieval routines.

Scraping A Webpage - The Hard Way

The previous example used the web browser emulation layer to retrieve the content. Sometimes getting into the nitty-gritty details of constructing a web request is the desired option.

Example:

<?php
	require_once "support/http.php";
	require_once "support/simple_html_dom.php";

	// Simple HTML DOM tends to leak RAM like
	// a sieve.  Declare what you will need here.
	// Objects are reusable.
	$html = new simple_html_dom();

	$url = "http://www.somesite.com/something/";
	$options = array(
		"headers" => array(
			"User-Agent" => GetWebUserAgent("Firefox"),
			"Accept" => "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
			"Accept-Language" => "en-us,en;q=0.5",
			"Accept-Charset" => "ISO-8859-1,utf-8;q=0.7,*;q=0.7",
			"Cache-Control" => "max-age=0"
		)
	);
	$result = RetrieveWebpage($url, $options);
	if (!$result["success"])  echo "Error retrieving URL.  " . $result["error"] . "\n";
	else if ($result["response"]["code"] != 200)  echo "Error retrieving URL.  Server returned:  " . $result["response"]["code"] . " " . $result["response"]["meaning"] . "\n";
	else
	{
		echo "All the URLs:\n";
		$html->load($result["body"]);
		$rows = $html->find("a[href]");
		foreach ($rows as $row)
		{
			echo "\t" . $row->href . "\n";
		}
	}
?>

This example performs the same operation but doesn't get all the benefits of the web browser emulation layer such as automatically handling redirects and cookies.
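To give a sense of what the web browser emulation layer takes off your hands, here is a minimal sketch of manual redirect handling. Both functions are hypothetical and not part of the toolkit. The sketch assumes headers arrive as an array of header name => value (or list of values) and that the caller supplies a fetch callback returning array("code" => ..., "headers" => ...); mapping the toolkit's actual result array onto that shape is left to the reader.

```php
<?php
	// Hypothetical helper, not part of the toolkit.  Returns the redirect
	// target as a string or false if the response is not a usable redirect.
	function GetRedirectTarget($code, array $headers)
	{
		// Status codes that tell the client to look elsewhere.
		$redirectcodes = array(301, 302, 303, 307, 308);
		if (!in_array((int)$code, $redirectcodes, true))  return false;

		// Header names are case-insensitive.
		foreach ($headers as $name => $value)
		{
			if (strtolower($name) !== "location")  continue;

			// Accept either a string or a list of strings.
			if (is_array($value))  $value = (count($value) ? reset($value) : "");
			$value = trim((string)$value);

			return ($value !== "" ? $value : false);
		}

		return false;
	}

	// Hypothetical helper.  Follows a chain of redirects by calling $fetch,
	// which is assumed to return array("code" => int, "headers" => array).
	function FollowRedirects($url, callable $fetch, $maxredirects = 5)
	{
		for ($x = 0; $x <= $maxredirects; $x++)
		{
			$response = $fetch($url);
			$target = GetRedirectTarget($response["code"], $response["headers"]);
			if ($target === false)  return array("success" => true, "url" => $url, "response" => $response);

			// Relative Location values would need resolving against $url here.
			$url = $target;
		}

		return array("success" => false, "error" => "Too many redirects.");
	}
```

The sketch skips relative URL resolution, cookies, and the method change some redirects imply. Those details are a good reason to prefer the web browser emulation layer unless low-level control is really needed.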

Using the cURL Emulation Layer

The cURL emulation layer is a drop-in replacement for cURL on web hosts that don't have cURL installed. This isn't some cheesy, half-baked solution. The source code carefully follows the cURL and PHP documentation. Every define() and function that the cURL extension offers as of PHP 5.4.0 is available.

Example usage:

<?php
	if (!function_exists("curl_init"))
	{
		require_once "support/http.php";
		require_once "support/web_browser.php";
		require_once "support/emulate_curl.php";
	}

	// Make cURL calls here...
?>

This example just shows how easy it is to add cURL support to any web host.

There are a few limitations and differences, though. CURLOPT_VERBOSE is a lot more verbose. SSL/TLS support is a little flaky at times. Some things like DNS options are ignored. Only HTTP and HTTPS are supported protocols at this time. Return values from curl_getinfo() calls are close but not the same. curl_setopt() delays processing until curl_exec() is called. Multi-handle support "cheats" by performing operations in linear execution rather than parallel execution.
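Because option processing is delayed until curl_exec(), code that should behave the same with real cURL and with the emulation layer is safest when it checks for errors after curl_exec() rather than after each curl_setopt() call. The following is a minimal sketch of that habit. FetchWithCurl() is a hypothetical helper that uses only standard cURL functions, and it assumes either the cURL extension or the emulation layer has already been loaded.

```php
<?php
	// Hypothetical helper, not part of the toolkit.
	function FetchWithCurl($url, $timeout = 10)
	{
		$ch = curl_init();
		curl_setopt($ch, CURLOPT_URL, $url);
		curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
		curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
		curl_setopt($ch, CURLOPT_TIMEOUT, $timeout);

		// Problems with the options above may only surface here.
		$body = curl_exec($ch);
		if ($body === false)
		{
			$result = array("success" => false, "error" => curl_error($ch));
		}
		else
		{
			// Only rely on widely used info values such as the HTTP status code.
			$result = array("success" => true, "code" => (int)curl_getinfo($ch, CURLINFO_HTTP_CODE), "body" => $body);
		}

		curl_close($ch);

		return $result;
	}
```

The returned array mirrors the "success"/"error" convention used by the other examples in this document, so calling code can report failures the same way regardless of which cURL implementation is active.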

© CubicleSoft