Information Extraction - Utilizing Table Patterns

Team 

Burcu Yildiz, Vienna University of Technology, Institute of Software Technology and Interactive Systems, Vienna, Austria
Kathi Kaiser, Vienna University of Technology, Institute of Software Technology and Interactive Systems, Vienna, Austria
Silvia Miksch, Vienna University of Technology, Institute of Software Technology and Interactive Systems, Vienna, Austria

Contact Person 

Burcu Yildiz

Project 

Motivation/Research Interest

PDF files are abundant on the web. One may hold the technical details of a five-megapixel digital camera, another a statistic on a company's income over the last two years, and yet another a crime novel by Sir Arthur Conan Doyle. The widespread use of this file format raises the question of how to reuse the data in such files. Much work has already been done in this area; for example, several tools convert PDF files to other formats.

My work focuses solely on extracting table information from PDF files. I searched for tools that extract basic information from PDF files and found one named pdf2html, which can also return its data in XML format. To access this XML output I used the JDOM library.
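As a rough illustration of this step, the sketch below reads positioned text fragments from such an XML file. It is only a sketch under assumptions: the XML layout (`<text>` elements with `top`, `left`, `width` and `height` attributes) is assumed here and may differ from what the converter actually emits, the names `TextElementReader` and `TextElement` are hypothetical, and the JDK's built-in DOM parser is used in place of JDOM so that the example needs no external library.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper: reads <text> elements from converter output.
// The element and attribute names are assumptions, not taken from the tool.
class TextElementReader {

    /** One positioned text fragment on a page (illustrative type). */
    static final class TextElement {
        final int top, left, width, height;
        final String value;

        TextElement(int top, int left, int width, int height, String value) {
            this.top = top;
            this.left = left;
            this.width = width;
            this.height = height;
            this.value = value;
        }
    }

    /** Parses the XML string and returns all text fragments in document order. */
    static List<TextElement> read(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagName("text");
            List<TextElement> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element e = (Element) nodes.item(i);
                result.add(new TextElement(
                        Integer.parseInt(e.getAttribute("top")),
                        Integer.parseInt(e.getAttribute("left")),
                        Integer.parseInt(e.getAttribute("width")),
                        Integer.parseInt(e.getAttribute("height")),
                        e.getTextContent().trim()));
            }
            return result;
        } catch (Exception ex) {
            // Malformed XML or missing attributes end up here.
            throw new IllegalArgumentException("Could not read text elements", ex);
        }
    }
}
```

Calling `TextElementReader.read(xml)` on a file's contents yields a flat list of fragments with their page coordinates, which is the kind of input that position-based table heuristics can work on.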

I developed several heuristics for table detection and decomposition. These heuristics work well on lucid tables (those without spanning columns or rows) and fairly well on complex tables (those with spanning rows or columns).
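To give a feel for what a detection heuristic of this kind can look like, here is a deliberately simple sketch. It is not the set of heuristics developed in this project: it only assumes that text fragments sharing roughly the same vertical position form one line, and that a run of consecutive lines with two or more fragments each is a candidate table. All names (`TableLineSketch`, `Fragment`, `tolerance`, `minRows`) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative heuristic only; names and thresholds are assumptions.
class TableLineSketch {

    /** A text fragment with its top and left page coordinates. */
    static final class Fragment {
        final int top, left;
        final String value;

        Fragment(int top, int left, String value) {
            this.top = top;
            this.left = left;
            this.value = value;
        }
    }

    /**
     * Groups fragments into lines. A fragment joins the current line if its
     * top coordinate is within `tolerance` of the top of the line's first fragment.
     */
    static List<List<Fragment>> groupIntoLines(List<Fragment> fragments, int tolerance) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingInt((Fragment f) -> f.top)
                .thenComparingInt(f -> f.left));
        List<List<Fragment>> lines = new ArrayList<>();
        int lineTop = 0;
        for (Fragment f : sorted) {
            if (lines.isEmpty() || f.top - lineTop > tolerance) {
                lines.add(new ArrayList<>());
                lineTop = f.top;
            }
            lines.get(lines.size() - 1).add(f);
        }
        // Order the fragments of each line from left to right.
        for (List<Fragment> line : lines) {
            line.sort(Comparator.comparingInt(f -> f.left));
        }
        return lines;
    }

    /**
     * Returns {firstLine, lastLine} index pairs for every run of at least
     * `minRows` consecutive lines that contain two or more fragments.
     */
    static List<int[]> candidateTableRegions(List<List<Fragment>> lines, int minRows) {
        List<int[]> regions = new ArrayList<>();
        int start = -1;
        for (int i = 0; i <= lines.size(); i++) {
            boolean multi = i < lines.size() && lines.get(i).size() >= 2;
            if (multi && start < 0) {
                start = i;
            }
            if (!multi && start >= 0) {
                if (i - start >= minRows) {
                    regions.add(new int[] {start, i - 1});
                }
                start = -1;
            }
        }
        return regions;
    }
}
```

A sketch like this has no notion of column boundaries, so it cannot handle spanning rows or columns; it would also mistake multi-column body text for a table. Handling such cases is where additional heuristics become necessary.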

Images 

The following table was cut out of a PDF file.

My tool produces an XML file with datarow information for the tables in a PDF file. With the corresponding style sheet, the extracted information looks as follows:

Publications 

Burcu Yildiz. Information Extraction - Utilizing Table Patterns, Master's thesis, Vienna University of Technology, 2004.

Burcu Yildiz, Katharina Kaiser, Silvia Miksch. pdf2table: A Method to Extract Table Information from PDF Files, In: Proceedings of the 2nd Indian International Conference on Artificial Intelligence (IICAI05), Pune, India, 2005.

Downloads 

For Windows: pdf2table_win.jar v2.0, Source code under the GNU License: GNU_pdf2tableWIN.zip
For MacOS and Linux: pdf2table.jar v2.0, Source code under the GNU License: GNU_pdf2tableMAC.zip

Installation ReadMe: readme.txt

Related Work 

  • Wang Y. Document Analysis: Table Structure Understanding and Zone Content Classification. Doctoral dissertation, University of Washington, 2002.
  • Tupaj S, Shi Z, Chang C.H, Alam H. Extracting Tabular Information From Text Files. Tufts University, 1996.
  • Ramel J.-Y, Crucianu M, Vincent N, Faure C. Detection, Extraction and Representation of Tables. ICDAR’03.
  • Pinto D, McCallum A, Wei X, Croft W.B. Table Extraction Using Conditional Random Fields. SIGIR’03.