HTML::TagParser is a pure-Perl implementation for parsing HTML files.
This module provides some DOM-like methods.
This module is not strict about XHTML format
because many HTML pages are not strict either.
For example, many pages use <br> elements instead of <br/>
and have <p> elements which are not closed.
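As a small sketch of that leniency, the snippet below parses a string containing a bare <br> and two unclosed <p> elements, then counts what the parser finds. The HTML string is made up for illustration, and the snippet assumes HTML::TagParser is already installed.

```perl
use strict;
use warnings;
use HTML::TagParser;

# Illustrative, loosely written HTML: <br> has no slash and
# neither <p> element is ever closed.
my $source = '<html><body><p>first<br>second<p>third</body></html>';

# A string that contains tags is parsed directly as HTML source.
my $html = HTML::TagParser->new( $source );

# In list context getElementsByTagName returns every matching element.
my @paragraphs = $html->getElementsByTagName( "p" );
my @breaks     = $html->getElementsByTagName( "br" );

printf "%d <p> and %d <br> element(s) found\n",
    scalar @paragraphs, scalar @breaks;
```

No error is raised for the missing closing tags; the elements are simply reported as they appear in the source.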
Source archive:
HTML-TagParser-0.20.tar.gz
Repository:
https://github.com/kawanet/HTML-TagParser
Run make install as shown below, or just pick up the single source file: HTML/TagParser.pm.
    $ tar zxf HTML-TagParser-0.13.tar.gz
    $ cd HTML-TagParser-0.13
    $ perl Makefile.PL && make
    Checking if your kit is complete...
    Looks good
    Writing Makefile for HTML-TagParser
    cp lib/HTML/TagParser.pm blib/lib/HTML/TagParser.pm
    Manifying blib/man3/HTML::TagParser.3
    $ make test
    PERL_DL_NONLAZY=1 /usr/bin/perl "-MExtUtils::Command::MM" "-e" "test_harness(0, 'blib/lib', 'blib/arch')" t/*.t
    t/01_new.........ok
    t/02_parse.......ok
    t/03_open........ok
    t/04_fetch.......ok
            4/5 skipped: URI::Fetch is not loaded.
    t/05_charset.....ok
    t/06_japanese....ok
    t/07_getelem.....ok
    t/08_nest........ok
    t/09_broken......ok
    t/10_escape......ok
    t/20_index-j.....ok
    t/21_index-e.....ok
    t/22_yahoo.......ok
    t/23_flickr......ok
    All tests successful, 4 subtests skipped.
    Files=14, Tests=121, 3 wallclock secs ( 1.12 cusr + 0.19 csys = 1.31 CPU)
    $ sudo make install
The URI::Fetch module is required if you wish to fetch an HTML file via HTTP.
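One way to find out beforehand whether HTTP fetching is likely to work is to test whether URI::Fetch can be loaded. This is only a sketch using core Perl; the variable name $have_uri_fetch is chosen here for illustration and is not part of HTML::TagParser.

```perl
use strict;
use warnings;

# Try to load URI::Fetch inside eval so a missing module does not
# abort the script. The result is 1 when the module loaded, else 0.
my $have_uri_fetch = eval { require URI::Fetch; 1 } ? 1 : 0;

if ( $have_uri_fetch ) {
    print "URI::Fetch is installed; fetching via HTTP should be possible.\n";
} else {
    print "URI::Fetch is not installed; parse local files or strings instead.\n";
}
```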
Parse an HTML file and find its <title> element's value.

    use HTML::TagParser;

    my $html = HTML::TagParser->new( "index-j.html" );
    # In scalar context the first matching element is returned.
    my $elem = $html->getElementsByTagName( "title" );
    print "<title>", $elem->innerText(), "</title>\n" if ref $elem;

Parse an HTML source string and find the action attribute of its first <form> element.

    use HTML::TagParser;

    my $html = HTML::TagParser->new( '<html><form action="hoge.cgi"></form></html>' );
    my $elem = $html->getElementsByTagName( "form" );
    print "<form action=\"", $elem->getAttribute("action"), "\">\n" if ref $elem;

Fetch an HTML file via HTTP and display all of its <a> elements and their attributes.

    use HTML::TagParser;

    my $html = HTML::TagParser->new( "http://www.kawa.net/xp/index-e.html" );
    # In list context every matching element is returned.
    my @list = $html->getElementsByTagName( "a" );
    foreach my $elem ( @list ) {
        my $tagname = $elem->tagName;
        my $attr = $elem->attributes;
        my $text = $elem->innerText;
        print "<$tagname";
        foreach my $key ( sort keys %$attr ) {
            print " $key=\"$attr->{$key}\"";
        }
        if ( $text eq "" ) {
            print " />\n";
        } else {
            print ">$text</$tagname>\n";
        }
    }
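
The HTTP example needs network access and URI::Fetch. To try the same kind of loop offline, a rough variant can parse an inline string instead and collect only the href values. The HTML below and the @links array are invented for illustration.

```perl
use strict;
use warnings;
use HTML::TagParser;

# Hypothetical inline document standing in for a fetched page.
my $source = <<'HTML';
<html><body>
<a href="/one.html">One</a>
<a href="/two.html" title="second">Two</a>
<a name="anchor-only"></a>
</body></html>
HTML

my $html = HTML::TagParser->new( $source );

# Collect the href attribute of every <a> element that has one.
my @links;
foreach my $elem ( $html->getElementsByTagName( "a" ) ) {
    my $href = $elem->getAttribute( "href" );
    next unless defined $href;    # skip anchors without an href
    push @links, $href;
}

print "$_\n" for @links;
```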