Announcement: Ultimate Web Scraper Toolkit 1.0 RC16 released
This is a minor bugfix release with a significant set of extensions to TagFilter.  TagFilter now officially replaces both Simple HTML DOM and HTML Purifier.

Over the years, I've hit Simple HTML DOM's limits, and problems like its severe RAM leaks really drove me nuts.  However, I didn't see any other really good options.  phpQuery was probably the next best bet, but it also uses XPath under the hood, which means the same memory leaks would most likely occur with object reuse.  Both libraries also use preg_match() extensively, which is a sure sign of things done wrong where EBNF grammars are involved.

Simple HTML DOM was generally quite slow and also VERY buggy when modifying the DOM.  The version I've relied on is the last known good release from 2008.  There's a newer version, but it has more bugs, so I stuck with the 2008 one.  That library can also only handle CSS2 selectors.

The only thing that held me back from making this move was the lack of a well-written CSS3 selector tokenizer.  So I did the only thing I ever do:  Rolled my own fully compliant CSS3 selector tokenizer that correctly uses a state engine and passes the official W3C CSS3 static test suite!  I've had a good run with Simple HTML DOM, but TagFilter already had significantly superior HTML parsing capabilities and now has an extremely powerful CSS3 selector engine under the hood for finding DOM elements of interest.  However, a lot of legacy code still relies on Simple HTML DOM, and it will take time to locate it all and upgrade it to TagFilter.  Therefore, Simple HTML DOM will probably still be included for a few more years.
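
For anyone wondering what "uses a state engine" means in contrast to a pile of preg_match() calls, here is a rough, language-neutral sketch in Python.  It is NOT TagFilter's code and handles only a tiny subset of selectors (tag names, #id, .class, and the descendant, child, and group separators).  The function name and token labels are made up for illustration.

```python
# Illustrative only: a tiny state-machine tokenizer for a SUBSET of CSS selectors.
# Hypothetical names; this is not TagFilter's tokenizer.
NAME_CHARS = set("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-_")

def tokenize_selector(selector):
    tokens = []          # list of (type, value) pairs
    state = "start"      # "start": expecting a new token, "name": inside an identifier
    kind = name = ""     # type and text of the identifier being collected
    saw_space = False    # whitespace seen since the last token

    # The trailing None is a sentinel that flushes the final identifier.
    for ch in list(selector) + [None]:
        if state == "name":
            if ch in NAME_CHARS:
                name += ch
                continue
            if name == "":
                raise ValueError("expected a name after '#' or '.'")
            tokens.append((kind, name))
            state = "start"   # fall through and handle ch in the start state

        if ch is None:
            break
        if ch.isspace():
            saw_space = True
            continue
        if ch in ">,":
            tokens.append(("child" if ch == ">" else "group", ch))
            saw_space = False
            continue

        # Whitespace between two simple selectors is a descendant combinator.
        if saw_space and tokens and tokens[-1][0] not in ("child", "group"):
            tokens.append(("descendant", " "))
        saw_space = False

        if ch == "#":
            kind, name = "id", ""
        elif ch == ".":
            kind, name = "class", ""
        elif ch in NAME_CHARS:
            kind, name = "tag", ch
        else:
            raise ValueError(f"unexpected character {ch!r}")
        state = "name"

    return tokens
```

Each character is looked at exactly once and the current state decides what it means, so there is no backtracking and malformed input fails at a known position.  A real CSS3 tokenizer needs many more states (attributes, pseudo-classes, escapes, strings), but the shape is the same.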

I kicked HTML Purifier to the curb a while back in favor of TagFilter.  This release formalizes that move with the new TagFilter::HTMLPurify() function.  HTML Purifier is good at purifying HTML of bad things like XSS attempts, but it is slow, I never really cared for its license, and it is extremely bloated (~730KB minified).  It's also a fairly awkward product to work with.  On the flip side, TagFilter::HTMLPurify() clears out XSS input on par with HTML Purifier, is a fraction of the size, and is pretty straightforward to use.  I recommend looking at the included test suite (tests/test_suite.php) for examples.

Finally, the TagFilterStream class is able to process an insane amount of HTML content at ~1MB/sec.  In pure PHP.  Before TagFilter, I was only able to process content through the other two libraries at 3KB/sec in some cases.  That makes TagFilterStream over 300 times faster than those options in such cases, and it can end up using very little RAM since it is a stream-capable class.
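
To show why a stream-capable design keeps RAM usage low, here is a minimal Python sketch of the general idea.  It is NOT TagFilterStream and does not reflect its API.  The class name is hypothetical, and the "parser" is deliberately naive (it ignores comments, scripts, and '>' inside quoted attributes).  The point is only that the parser state survives between chunks, so the whole document never has to sit in memory.

```python
# Illustrative only: chunk-at-a-time processing with state carried across chunks.
# Hypothetical class; deliberately naive tag handling.
class StreamingTagStripper:
    def __init__(self):
        # The only state kept between chunks: are we currently inside a tag?
        self.in_tag = False

    def feed(self, chunk):
        """Consume one chunk and return the text found outside of tags."""
        out = []
        for ch in chunk:
            if self.in_tag:
                if ch == ">":
                    self.in_tag = False
            elif ch == "<":
                self.in_tag = True
            else:
                out.append(ch)
        return "".join(out)


stripper = StreamingTagStripper()
chunks = ["<p>Hel", "lo <b", ">world</b></", "p>"]   # tags split across chunk boundaries
text = "".join(stripper.feed(chunk) for chunk in chunks)
print(text)
```

Memory use is bounded by the chunk size plus a small amount of state, no matter how large the input is.  Each chunk's output can be written out or passed along right away.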

A couple of minor fixes also made it into this release, including better/more correct 'script' and 'style' tag parsing, fake void tags, and some callback improvements.

Learn more and download Ultimate Web Scraper Toolkit to enhance your next web scraping project:
Author of Barebones CMS

© CubicleSoft