A Wrapper for Complex HTML Tables

Team 

Günther Sommer, Vienna University of Technology
Katharina Kaiser, Vienna University of Technology, Institute of Software Technology and Interactive Systems

Contact Person: Katharina Kaiser
 
Project 

The main goal of this student project is to develop a wrapper that transforms complex HTML tables into an XML format. A table is considered complex if it contains spanned cells.

Many wrappers now exist that apply Information Extraction (IE) methods to semi-structured data such as HTML files. One drawback of many of these wrappers is their inability to handle complex tables. Complex tables structure information and avoid repeating redundant values, i.e., the table is displayed in a normalized format. The advantage is a more concise layout, as HTML is mainly designed for presentation to human users. To support computer-based processing of the information in these complex tables, however, the table has to be de-normalized so that each record can be accessed faster. Additionally, to enable more efficient processing, the de-normalized table is represented not only in HTML format but also in an XML format.

The procedure can be described as follows: each spanned cell is broken up into individual cells, and its content is stored in every resulting cell. To do this, the application reads the HTML file and parses the full HTML source code, including text and attributes. The focus is on the attributes COLSPAN and ROWSPAN of the table cell tags (<TD> and <TH>), because they indicate spanned cells. The application breaks up a spanned cell into the number of cells defined by COLSPAN and ROWSPAN and clones the content into these cells. Depending on the header cells (configured via command line arguments or detected by the <TH> tag), the cloning can be suppressed for certain cells.
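The breaking up and cloning of spanned cells can be illustrated with a minimal sketch. It is written in Python for brevity and is not DeTable's own code (DeTable runs on Java); the names Cell, expand, and clone_headers are hypothetical, and the input is assumed to be already parsed into rows of cells.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    text: str
    colspan: int = 1
    rowspan: int = 1
    header: bool = False  # True for cells that came from a <TH> tag


def expand(rows, clone_headers=True):
    """Unfold spanned cells into a rectangular grid of strings.

    clone_headers is a hypothetical switch: when False, a spanned header
    cell keeps its text only in its first slot (cloning is suppressed).
    """
    grid = {}  # (row, col) -> text, filled as spans are unfolded
    for r, row in enumerate(rows):
        c = 0
        for cell in row:
            # Skip slots already taken by a ROWSPAN from an earlier row.
            while (r, c) in grid:
                c += 1
            for dr in range(cell.rowspan):
                for dc in range(cell.colspan):
                    is_origin = dr == 0 and dc == 0
                    keep = is_origin or clone_headers or not cell.header
                    grid[(r + dr, c + dc)] = cell.text if keep else ""
            c += cell.colspan
    if not grid:
        return []
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    return [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]


rows = [
    [Cell("cause", colspan=2, header=True), Cell("drug of choice", header=True)],
    [Cell("adults", rowspan=2), Cell("Gonococcus"), Cell("Ceftriaxone")],
    [Cell("Chlamydia"), Cell("Azithromycin")],
]
for line in expand(rows):
    print(line)
# ['cause', 'cause', 'drug of choice']
# ['adults', 'Gonococcus', 'Ceftriaxone']
# ['adults', 'Chlamydia', 'Azithromycin']
```

The sketch does not handle a COLSPAN that runs into a slot already occupied by a ROWSPAN from an earlier row; a real implementation would need a rule for that case.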

The output of the program can be stored in both an HTML and an XML file. The structure of the latter is defined by a DTD, which is included in the package.
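As a rough illustration of the XML output step, the following Python sketch serializes a de-normalized grid. The element names table, row, header, and cell and the index attribute are assumptions made for this example only; the actual structure is defined by the DTD shipped with DeTable, which is not reproduced here.

```python
import xml.etree.ElementTree as ET


def grid_to_xml(grid, header_rows=1):
    """Serialize a rectangular grid of strings to an XML string.

    Element and attribute names are hypothetical, not DeTable's DTD.
    """
    table = ET.Element("table")
    for r, row in enumerate(grid):
        row_el = ET.SubElement(table, "row", index=str(r))
        for text in row:
            # Rows before header_rows are written as header elements.
            tag = "header" if r < header_rows else "cell"
            ET.SubElement(row_el, tag).text = text
    return ET.tostring(table, encoding="unicode")


grid = [
    ["cause", "dosage"],
    ["Children who weigh < 45 kg", "125 mg IM, single dose"],
]
xml_text = grid_to_xml(grid)
print(xml_text)
```

Special characters such as the "<" in the sample cell are escaped by the serializer, so the result stays well-formed.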

The main challenges in this project are as follows:

  • Developing a parser for HTML tables to detect structure information
  • Detection of header cells
  • Detection of vertically and/or horizontally spanned cells
  • Disassembling cells and cloning the contained information, if appropriate
  • Applying a single-pass method, which requires a dynamic table structure
  • Accepting and handling missing tags

Table 1: Normalized complex table. Input for DeTable.
 
cause | drug of choice | dosage
adults | Gonococcus | Ceftriaxone | 1g IM, single dose
lavage infected eye
Chlamydia | Azithromycin | 1g orally single dose
or
Doxycycline | 100 mg orally twice a day for 7 days
children | Gonococcus | Children who weigh < 45 kg | Ceftriaxone | 125 mg IM, single dose
Children who weigh > 45 kg | same treatment as adults
Chlamydia | Children who weigh < 45 kg | Erythromycin base | 50 mg/kg/day orally in 4 divided doses for 10-14 days
Children under 8 years old who weigh > 45 kg | Azithromycin | 1 gm orally, single dose
Children 8 years old or older | Azithromycin | 1 gm orally, single dose
or
Doxycycline | 100 mg orally, twice a day for 7 days
Neonates | Ophthalmia neonatorum (Caused by N. gonorrhoeae) | Ceftriaxone | 25-50 mg/kg IV or IM, single dose, not to exceed 125 mg
Chlamydia | Erythromycin | 50 mg/kg/day orally in 4 divided doses for 10-14 days


Table 2: De-normalized table (in HTML format) including redundant information.
Output of DeTable in HTML format.
 
cause | cause | cause | drug of choice | dosage
adults | Gonococcus | | Ceftriaxone | 1g IM, single dose
adults | Gonococcus | | lavage infected eye |
adults | Chlamydia | | Azithromycin | 1g orally single dose
adults | Chlamydia | | or | or
adults | Chlamydia | | Doxycycline | 100 mg orally twice a day for 7 days
children | Gonococcus | Children who weigh < 45 kg | Ceftriaxone | 125 mg IM, single dose
children | Gonococcus | Children who weigh > 45 kg | same treatment as adults |
children | Chlamydia | Children who weigh < 45 kg | Erythromycin base | 50 mg/kg/day orally in 4 divided doses for 10-14 days
children | Chlamydia | Children under 8 years old who weigh > 45 kg | Azithromycin | 1 gm orally, single dose
children | Chlamydia | Children 8 years old or older | Azithromycin | 1 gm orally, single dose
children | Chlamydia | Children 8 years old or older | or | or
children | Chlamydia | Children 8 years old or older | Doxycycline | 100 mg orally, twice a day for 7 days
Neonates | Ophthalmia neonatorum (Caused by N. gonorrhoeae) | | Ceftriaxone | 25-50 mg/kg IV or IM, single dose, not to exceed 125 mg
Neonates | Chlamydia | | Erythromycin | 50 mg/kg/day orally in 4 divided doses for 10-14 days

Downloads 
>> Download DeTable
Filename: detable.zip         Size: 664 KB
Needed application: Java JRE 1.3 or 1.4.x

Related Work