Microformats and a Web page tagger

While preparing an experiment for my master’s degree, I needed to manually tag terms from Web pages and store these term/tag pairs for later processing. This set of term/tag pairs was the basis for evaluating the performance of two extraction methods I worked on. The experiment can be summarized as follows: a page is transformed into a set of terms, which are manually tagged and then processed to extract semi-structured data in the form of postal addresses. Performance is evaluated by comparing the manually tagged set with the resulting extraction.
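To make the comparison step concrete, here is a minimal sketch. It assumes both the manual tags and the extraction output are stored as dictionaries keyed by term index, and that precision and recall are the measures of interest; neither assumption comes from the experiment itself, and the tag names are made up for illustration.

```python
def evaluate(gold, predicted):
    """Compare manually tagged terms (gold) with extracted ones (predicted).

    Both arguments map a term index to a tag name. The tag names used
    below are hypothetical placeholders.
    """
    gold_pairs = set(gold.items())
    predicted_pairs = set(predicted.items())
    # A pair counts as correct only if index and tag both match.
    correct = len(gold_pairs & predicted_pairs)
    precision = correct / len(predicted_pairs) if predicted_pairs else 0.0
    recall = correct / len(gold_pairs) if gold_pairs else 0.0
    return precision, recall


gold = {0: "STREET", 1: "STREET", 2: "NUMBER", 3: "CITY"}
predicted = {0: "STREET", 1: "STREET", 2: "CITY"}
precision, recall = evaluate(gold, predicted)
```

In this toy case two of the three extracted pairs match the manual tags, and two of the four manual tags are recovered.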

The whole “problem” lies in the manual tagging step and how it handles term/tag pairs. Each pair is identified by the term’s position relative to all the terms in the page, starting at position 0. This depends heavily on the parser used to transform the Web page into plain text, which makes it difficult for others to use my test set. My current parser is implemented with BeautifulSoup and some Python code. Also, I don’t like tagging a set of terms without seeing the actual look of the Web page. I would prefer a system where terms are tagged directly on a rendered page. See the screenshot below for an idea of how the current system looks. Each term can be tagged with one of eight different tags, each associated with a different color.
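The position-based scheme can be sketched roughly as follows. This is not my actual parser: it uses the standard library’s `html.parser` as a stand-in for BeautifulSoup, splits on whitespace only, and the tag names are hypothetical.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects text nodes, skipping the contents of script and style."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def page_to_terms(html):
    """Turn a page into a list of terms; the list index is the term position."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks).split()


def tag_terms(terms, tags_by_position):
    """Pair every term with its manual tag (None when untagged)."""
    return [(pos, term, tags_by_position.get(pos)) for pos, term in enumerate(terms)]


page = "<p>Rua Augusta 123</p><script>x=1</script><p>Lisboa</p>"
terms = page_to_terms(page)
tagged = tag_terms(terms, {0: "STREET", 1: "STREET", 2: "NUMBER", 3: "CITY"})
```

Any change in how the parser tokenizes or which elements it skips shifts every later position, which is exactly why the stored pairs are hard to reuse with another parser.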

Since most of the work required for my master’s is done, besides the dissertation defense and a paper, I’ve begun to investigate a proper solution for the problem of manually tagging and storing tag information from a rendered Web page.

So far, I’ve come across microformats. A microformat is “a way to make your Web pages readable by more than just people. The idea is that you put special forms of HTML in your page, around the stuff you already have in your page. This special code lets other computers that happen to be looking at your page make some form of sense out of it.” The microformat information is expected to be encoded by the Web page author, but I don’t see why it could not also be used by my tagging system.

For postal addresses specifically, there is the adr microformat. I must say that I didn’t like it too much. It does not represent the elements of a postal address in much detail: for example, there is no separate property for the street name and another for the building number. I currently require this level of detail. I might end up devising a microformat of my own.
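To illustrate the granularity issue, here is a small sketch that reads adr properties out of a hand-written HTML fragment. The property names are those defined by adr; the sample address and the parser are illustrative only, and the parser is deliberately simplistic (it does not handle void elements such as `<br>` or nested properties).

```python
from html.parser import HTMLParser

# Property class names defined by the adr microformat.
ADR_PROPERTIES = {
    "post-office-box", "extended-address", "street-address",
    "locality", "region", "postal-code", "country-name",
}


class AdrParser(HTMLParser):
    """Collects the text of elements whose class is an adr property."""

    def __init__(self):
        super().__init__()
        self.address = {}
        self._stack = []  # one entry per open element: property name or None

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        prop = next((c for c in classes if c in ADR_PROPERTIES), None)
        self._stack.append(prop)

    def handle_endtag(self, tag):
        if self._stack:
            self._stack.pop()

    def handle_data(self, data):
        prop = self._stack[-1] if self._stack else None
        text = data.strip()
        if prop and text:
            self.address[prop] = (self.address.get(prop, "") + " " + text).strip()


def parse_adr(html):
    parser = AdrParser()
    parser.feed(html)
    parser.close()
    return parser.address


SAMPLE = (
    '<div class="adr">'
    '<span class="street-address">Rua Augusta 123</span>, '
    '<span class="locality">Lisboa</span> '
    '<span class="postal-code">1100-053</span> '
    '<span class="country-name">Portugal</span>'
    '</div>'
)
address = parse_adr(SAMPLE)
```

The street name and the building number both end up inside the single `street-address` value, so separating them would still require extra parsing.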

My current idea is to develop a Firefox plug-in or some JavaScript to be automatically embedded into pages. The final goal is to aid the task of manually tagging terms from a rendered Web page by storing the tags in the page itself using a microformat.
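As a rough idea of what “storing tags in the page itself” could look like, the sketch below wraps tagged terms in `span` elements whose class carries the tag. It is written in Python only for consistency with my parser; the real tool would do this in the browser, and the class names `street-name` and `building-number` are hypothetical, since no such microformat has been defined yet.

```python
from html import escape


def annotate(terms_with_tags):
    """Render (term, tag) pairs as HTML, marking tagged terms with a class.

    A tag of None leaves the term as plain text.
    """
    parts = []
    for term, tag in terms_with_tags:
        text = escape(term)
        if tag is None:
            parts.append(text)
        else:
            parts.append('<span class="{}">{}</span>'.format(escape(tag, quote=True), text))
    return " ".join(parts)


fragment = annotate([
    ("Rua Augusta", "street-name"),       # hypothetical property name
    ("123", "building-number"),           # hypothetical property name
    ("today", None),
])
```

Because the tags travel with the markup, they no longer depend on term positions produced by a particular parser.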


April 2009