Archive for April, 2009

Starting posts with “hi”

Recently, I’ve seen a large number of blog posts starting with “hi.” The authors are obviously greeting their readers before getting down to business. This has always intrigued me; actually, I find it quite weird. I can understand a greeting in the first post of a new blog, but why do people greet their readers in every single post? Maybe they are trying to be courteous or more appealing to the masses? My best guess is that this behavior is a holdover from writing emails. I also think that it is not good practice.


Master’s defended. Actually, it was on Wednesday the 22nd. I spoke for 1 hour, but I should have spoken for about 45 minutes. To me it felt like only 15 minutes. I even lingered on certain parts. I thought it was going to be over very quickly. I have 15 days to make the corrections to the dissertation.

Obviously, the master’s took longer than expected. I know I could and should have finished much earlier, or done more. Working and doing a master’s at the same time is not easy, even more so when you have to put in a 44-hour work week.

I’m not inspired to write much more than this. I just want to thank professors João and Altigran for the opportunity, guidance, and patience.

10 days of vacation

Finally, some photos from the 10 days of vacation:

P1010827 P1010859
P1010896 P1010958

Microformats and a Web page tagger

While preparing an experiment for my master’s degree, I needed to manually tag terms from Web pages and store these term/tag pairs for later processing. This set of term/tag pairs was the basis for evaluating the performance of two extraction methods I worked on. The experiment can be summarized as follows: a page is transformed into a set of terms, which are manually tagged and then processed to extract semi-structured data in the form of postal addresses. Performance is evaluated by comparing the manually tagged set with the resulting extraction.
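The comparison step could be sketched as follows. This is only an illustration, not the evaluation code used in the experiment: it assumes both the manual tagging and the extraction are stored as position-to-tag mappings, and it assumes precision, recall, and F1 as the measures. The function name `evaluate` and the tag names are made up for the example.

```python
def evaluate(gold, predicted):
    """Compare two {position: tag} mappings; return (precision, recall, f1)."""
    gold_pairs = set(gold.items())
    predicted_pairs = set(predicted.items())
    # A pair counts as correct only if both position and tag match.
    correct = len(gold_pairs & predicted_pairs)
    precision = correct / len(predicted_pairs) if predicted_pairs else 0.0
    recall = correct / len(gold_pairs) if gold_pairs else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical data: four manually tagged terms, three extracted ones.
gold = {3: "street", 4: "number", 5: "city", 6: "state"}
predicted = {3: "street", 4: "number", 5: "state"}

precision, recall, f1 = evaluate(gold, predicted)
print(precision, recall, f1)
```

Here two of the three extracted pairs match the manual tags, and two of the four manual tags are recovered.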

The whole “problem” is with the manual tagging step and how it handles term/tag pairs. These pairs are keyed by each term’s position relative to all the terms in the page, starting at position 0. The positions therefore depend heavily on the parser used to transform the Web page into plain text, which makes it difficult for others to use my test set. My current parser is implemented with BeautifulSoup and some Python code. Also, I don’t like tagging a set of terms without seeing how the Web page actually looks. I would prefer a system where terms are tagged directly on a rendered page. See the screenshot below for an idea of how the current system looks. Each term can be tagged with one of eight different tags, each associated with a different color.
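To make the parser dependence concrete, here is a minimal sketch of position-based tagging. It is not my actual parser: it uses the standard library’s `html.parser` as a stand-in for BeautifulSoup, and the page content, the tag names, and the helper names `TextExtractor` and `page_to_terms` are all invented for the example.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text chunks, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def page_to_terms(html):
    """Turn a page into a list of whitespace-separated terms."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks).split()


# Hypothetical page and manual tags, keyed by term position (from 0).
html = "<html><body><p>Visit us at</p><p>Av. Paulista 1000, Sao Paulo</p></body></html>"
terms = page_to_terms(html)
tags = {3: "street", 4: "street", 5: "number", 6: "city", 7: "city"}

# A different tokenizer (punctuation split off) shifts every later position,
# so the same tags would no longer point at the same terms.
alt_terms = re.findall(r"\w+|[^\w\s]", " ".join(terms))
print(terms[4], alt_terms[4])
```

With whitespace splitting, position 4 is “Paulista”; once punctuation becomes its own term, position 4 is the period after “Av”, and every tag stored after it is off.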

Since most of the work required for my master’s is done, besides the dissertation defense and a paper, I’ve begun to investigate a proper solution for the problem of manually tagging and storing tag information from a rendered Web page.

So far, I’ve come across microformats. A microformat is “a way to make your Web pages readable by more than just people. The idea is that you put special forms of HTML in your page, around the stuff you already have in your page. This special code lets other computers that happen to be looking at your page make some form of sense out of it.” The microformat information is expected to be encoded by the Web page author, but I don’t see why it could not also be used by my tagging system.

More specifically, for the case of postal addresses, there is the adr microformat. I must say that I didn’t like it too much. The elements of a postal address are not represented in much detail; for example, there is no separate property for the street name and another for the building number. I currently require this level of detail. I might end up devising a microformat of my own.
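As an illustration of that granularity, the sketch below reads the adr properties out of a small snippet. The property names are the ones adr defines; the address itself, the `AdrExtractor` class, and the simplifying assumption that property elements are never nested inside one another are mine.

```python
from html.parser import HTMLParser

# Properties defined by the adr microformat.
ADR_PROPERTIES = {
    "post-office-box", "extended-address", "street-address",
    "locality", "region", "postal-code", "country-name",
}


class AdrExtractor(HTMLParser):
    """Collect the text of elements whose class names an adr property.

    Simplification: assumes property elements are not nested.
    """

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._current = None

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        for name in classes:
            if name in ADR_PROPERTIES:
                self._current = name

    def handle_data(self, data):
        if self._current and data.strip():
            self.fields[self._current] = data.strip()

    def handle_endtag(self, tag):
        self._current = None


# Made-up address marked up with adr.
snippet = """
<div class="adr">
  <span class="street-address">Av. Paulista 1000</span>,
  <span class="locality">Sao Paulo</span>,
  <span class="region">SP</span>
  <span class="postal-code">00000-000</span>
  <span class="country-name">Brazil</span>
</div>
"""

extractor = AdrExtractor()
extractor.feed(snippet)
extractor.close()
adr = extractor.fields
print(adr["street-address"])
```

The street name and the building number come back together in a single `street-address` value, so separating them would take extra parsing or a more detailed format.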

My current idea is to develop a Firefox plug-in or some JavaScript to be automatically embedded into pages. The final goal is to aid the task of manually tagging terms on a rendered Web page by storing the tags, using a microformat, in the page itself.
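The kind of markup such a tool might leave in the page can be sketched in a few lines. This is Python rather than the plug-in or JavaScript mentioned above, purely to show the output; the `annotate` function and the class names `street-name` and `building-number` are hypothetical and not part of adr or any existing microformat.

```python
from html import escape


def annotate(terms, tags):
    """Wrap each tagged term in a <span> whose class is the tag name."""
    parts = []
    for position, term in enumerate(terms):
        text = escape(term)
        tag = tags.get(position)
        if tag is None:
            parts.append(text)  # untagged terms are emitted unchanged
        else:
            parts.append('<span class="{}">{}</span>'.format(escape(tag), text))
    return " ".join(parts)


# Hypothetical terms and tags; class names are invented for the example.
terms = ["Av.", "Paulista", "1000"]
tags = {1: "street-name", 2: "building-number"}
markup = annotate(terms, tags)
print(markup)
```

Because the tag travels with the term inside the markup, it no longer depends on how a particular parser counts positions.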

The defense was postponed

Well, I don’t have much to write. My master’s defense was postponed. No date has been set.
