[Perl] HTML::TagParser - Yet another HTML tag parser by pure Perl implementation

HTML::TagParser is a pure Perl implementation of an HTML parser.
It provides DOM-like methods for finding elements and reading their attributes and text.
It does not insist on strict XHTML, because many HTML pages are not well-formed.
For example, many pages use <br> elements instead of <br/> and contain <p> elements that are never closed.
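
As a rough illustration of the DOM-like methods on loosely written markup, the sketch below parses a fragment containing a bare <br> and two unclosed <p> elements. The sample markup and the variable names are made up for this example. The counts in the comments assume the parser records every start tag it encounters, whether or not the tag is ever closed.

```perl
use strict;
use warnings;
use HTML::TagParser;

# Hypothetical sample: a bare <br> and two <p> elements that are never closed.
my $source = '<html><body><p id="intro" class="note">Hello<br>world'
           . '<p>Second paragraph</body></html>';

# new() parses the argument as HTML source when it looks like markup.
my $html = HTML::TagParser->new( $source );

# List context: all matching elements.
my @paragraphs = $html->getElementsByTagName( "p" );
my @breaks     = $html->getElementsByTagName( "br" );

# Look up a single element by its id attribute.
my $intro = $html->getElementById( "intro" );

print "p elements:  ", scalar(@paragraphs), "\n";    # assumed: 2
print "br elements: ", scalar(@breaks), "\n";        # assumed: 1
if ( ref $intro ) {
    print "intro tag:   ", $intro->tagName, "\n";
    print "intro class: ", $intro->getAttribute( "class" ), "\n";
}
```

The checks with ref guard against a lookup that finds nothing, in the same way as the examples further down this page.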

Download and Install

Source archive: HTML-TagParser-0.20.tar.gz (also on CPAN)
Repository: https://github.com/kawanet/HTML-TagParser

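Run make install as shown below, or just copy the single source file, HTML/TagParser.pm. The transcript below was recorded with version 0.13; substitute the version you downloaded.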

$ tar zxf HTML-TagParser-0.13.tar.gz
$ cd HTML-TagParser-0.13
$ perl Makefile.PL && make
Checking if your kit is complete...
Looks good
Writing Makefile for HTML-TagParser
cp lib/HTML/TagParser.pm blib/lib/HTML/TagParser.pm
Manifying blib/man3/HTML::TagParser.3
$ make test
PERL_DL_NONLAZY=1 /usr/bin/perl "-MExtUtils::Command::MM" "-e" "test_harness(0, 'blib/lib', 'blib/arch')" t/*.t
t/01_new.........ok
t/02_parse.......ok
t/03_open........ok
t/04_fetch.......ok
        4/5 skipped: URI::Fetch is not loaded.
t/05_charset.....ok
t/06_japanese....ok
t/07_getelem.....ok
t/08_nest........ok
t/09_broken......ok
t/10_escape......ok
t/20_index-j.....ok
t/21_index-e.....ok
t/22_yahoo.......ok
t/23_flickr......ok
All tests successful, 4 subtests skipped.
Files=14, Tests=121,  3 wallclock secs ( 1.12 cusr +  0.19 csys =  1.31 CPU)
$ sudo make install

The URI::Fetch module is required only if you want to fetch an HTML file via HTTP.
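
Since URI::Fetch is optional, a script may want to check for it before trying to fetch over HTTP. The following is a minimal sketch using only core Perl; the helper name has_uri_fetch and the messages are hypothetical and not part of HTML::TagParser.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the optional URI::Fetch module
# can be loaded, 0 otherwise. eval traps the failure of require.
sub has_uri_fetch {
    return eval { require URI::Fetch; 1 } ? 1 : 0;
}

if ( has_uri_fetch() ) {
    print "URI::Fetch is available: fetching via HTTP should work.\n";
} else {
    print "URI::Fetch is missing: install it from CPAN, or parse local files only.\n";
}
```

This mirrors what the test suite does above, where the fetch tests are skipped when URI::Fetch is not loaded.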

Examples

Parse an HTML file and print the text of its <title> element.

use HTML::TagParser;
# In scalar context, getElementsByTagName returns the first matching element.
my $html = HTML::TagParser->new( "index-j.html" );
my $elem = $html->getElementsByTagName( "title" );
print "<title>", $elem->innerText(), "</title>\n" if ref $elem;

Parse an HTML source string and print the action attribute of its first <form> element.

use HTML::TagParser;
# new() also accepts HTML source text directly instead of a file name.
my $html = HTML::TagParser->new( '<html><form action="hoge.cgi"></form></html>' );
my $elem = $html->getElementsByTagName( "form" );
print "<form action=\"", $elem->getAttribute("action"), "\">\n" if ref $elem;

Fetch an HTML file via HTTP and display all of its <a> elements with their attributes.

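use HTML::TagParser;
# Passing a URL makes new() fetch the page over HTTP (requires URI::Fetch).
my $html = HTML::TagParser->new( "http://www.kawa.net/xp/index-e.html" );
# In list context, getElementsByTagName returns every matching element.
my @list = $html->getElementsByTagName( "a" );
foreach my $elem ( @list ) {
    my $tagname = $elem->tagName;
    my $attr = $elem->attributes;    # hash reference: attribute name => value
    my $text = $elem->innerText;
    print "<$tagname";
    foreach my $key ( sort keys %$attr ) {
        print " $key=\"$attr->{$key}\"";
    }
    # Print an element without text as an empty tag.
    if ( $text eq "" ) {
        print " />\n";
    } else {
        print ">$text</$tagname>\n";
    }
}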

Kawa.netxp © Copyright 2006-2012 Yusuke Kawasaki