Layout Group Extraction from Web Content for Effective Adaptation

Today, people access the Web with a variety of devices and
methods, such as PDAs, cellular phones, and voice-based browsers.
However, most Web content is designed for desktop computers.
Existing Web content therefore needs to be transcoded to suit
each access device and method. For this purpose, several
annotation-based transcoding systems have been developed. An
annotation is additional information about Web content, and it
enables effective adaptation. One of the most difficult
problems with annotation is the cost of annotating Web content. Many
popular sites, such as news sites, have a large number of Web pages
and add new content continually. Hence, it is almost impossible to
annotate all of the content on these sites. To solve this problem, we
introduce a method to extract common layouts from Web pages. We focus
on the structure and characteristics of particular HTML tags that
affect the layout of Web pages. Our method calculates the distance
between Web pages based on these tags. When the distance is below a
threshold, the pages can be considered to have the same layout. With
this method, a given annotation can be applied to every Web page
that has the same layout, which reduces the cost of adaptation.
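
The distance-and-threshold idea described above can be illustrated with a
minimal sketch. The abstract does not specify the tag set, the distance
measure, or the threshold value, so everything named here is an assumption
made for illustration: the LAYOUT_TAGS set, the use of a sequence-similarity
ratio as the distance, and the threshold of 0.2 are hypothetical and are not
the authors' method.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser

# Hypothetical set of layout-affecting tags; the report's actual tag set
# is not given in the abstract.
LAYOUT_TAGS = {"table", "tr", "td", "th", "div", "ul", "ol", "form", "hr"}


class LayoutTagCollector(HTMLParser):
    """Records the layout-affecting start tags of a page in document order."""

    def __init__(self):
        super().__init__()
        self.signature = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this method.
        if tag in LAYOUT_TAGS:
            self.signature.append(tag)


def layout_signature(html):
    """Return the list of layout tags found in the given HTML string."""
    collector = LayoutTagCollector()
    collector.feed(html)
    collector.close()
    return collector.signature


def layout_distance(html_a, html_b):
    """Assumed distance: 1 minus the similarity ratio of the two signatures."""
    sig_a = layout_signature(html_a)
    sig_b = layout_signature(html_b)
    if not sig_a and not sig_b:
        return 0.0
    matcher = SequenceMatcher(None, sig_a, sig_b, autojunk=False)
    return 1.0 - matcher.ratio()


def same_layout(html_a, html_b, threshold=0.2):
    """Treat two pages as sharing a layout when the distance is below threshold."""
    return layout_distance(html_a, html_b) < threshold
```

In this sketch, two pages that differ only in their text produce identical
tag signatures and a distance of zero, so an annotation written for one
could be reused for the other. A practical system would probably also need
to account for nesting depth and tag attributes, which this sketch ignores.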

By: Kentarou Fukuda, Hironobu Takagi, Junji Maeda and Chieko Asakawa

Published in: RT0493 in 2007

LIMITED DISTRIBUTION NOTICE:

This Research Report is available. This report has been submitted for publication outside of IBM and will probably be copyrighted if accepted for publication. It has been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside publication, requests should be filled only by reprints or legally obtained copies of the article (e.g., payment of royalties).
