Vienna University of Technology, Institute of
Software Technology and Interactive Systems, Vienna, Austria
PDF files are everywhere on the web. One holds the technical details of a five-megapixel digital camera, another a statistic on an enterprise's income over the last two years, and yet another a brilliant crime novel by Sir Arthur Conan Doyle. The widespread use of this file format raises the question of how to reuse the data in such files. Much has already been done in this area; for example, there are several tools that convert PDF files to other formats.
My work focuses solely on extracting table information from PDF files. I searched for tools that extract basic information from PDF files and found one named pdf2html, which can also return its data in XML format. To access this XML output I used the JDOM library.
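To illustrate what reading such XML output might involve, here is a minimal sketch. It assumes that the converter emits `text` elements carrying `top`, `left`, `width` and `height` attributes; these element and attribute names are assumptions for illustration, not taken from the tool's documentation. It also uses the DOM parser bundled with the JDK instead of JDOM so that the example runs without extra libraries. The class name `TextChunkReader` and its `Chunk` type are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

final class TextChunkReader {

    /** One positioned piece of text (hypothetical model of a converter output element). */
    static final class Chunk {
        final int top, left, width, height;
        final String text;

        Chunk(int top, int left, int width, int height, String text) {
            this.top = top;
            this.left = left;
            this.width = width;
            this.height = height;
            this.text = text;
        }
    }

    /** Parses the XML string and returns every <text> element as a Chunk, in document order. */
    static List<Chunk> read(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            // Assumed element name: "text"; assumed attributes: top, left, width, height.
            NodeList nodes = doc.getElementsByTagName("text");
            List<Chunk> chunks = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element e = (Element) nodes.item(i);
                chunks.add(new Chunk(
                        Integer.parseInt(e.getAttribute("top")),
                        Integer.parseInt(e.getAttribute("left")),
                        Integer.parseInt(e.getAttribute("width")),
                        Integer.parseInt(e.getAttribute("height")),
                        e.getTextContent().trim()));
            }
            return chunks;
        } catch (Exception ex) {
            throw new IllegalArgumentException("Could not parse converter output", ex);
        }
    }

    public static void main(String[] args) {
        // Hand-written sample input; real converter output may differ.
        String xml = "<page number=\"1\">"
                + "<text top=\"130\" left=\"50\" width=\"40\" height=\"12\">Year</text>"
                + "<text top=\"130\" left=\"200\" width=\"60\" height=\"12\">Income</text>"
                + "</page>";
        for (Chunk c : read(xml)) {
            System.out.println(c.top + "," + c.left + " : " + c.text);
        }
    }
}
```

Keeping the coordinates alongside the text matters because any later decision about rows and columns has to be made from positions; the XML itself carries no table structure.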
I developed several heuristics for table detection and decomposition. These heuristics work well on simple tables (without spanning columns or rows) and fairly well on complex tables (with spanning rows or columns).
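The heuristics themselves are not spelled out here, so the following is only a sketch of one plausible detection rule, not the tool's actual method: text chunks with nearly equal vertical positions are grouped into visual lines, and a run of consecutive lines that each contain more than one chunk is treated as a table candidate. The class `TableRowHeuristic`, its `Chunk` type, and the `tolerance` and `minRows` parameters are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class TableRowHeuristic {

    /** A positioned piece of text (hypothetical; only the fields this sketch needs). */
    static final class Chunk {
        final int top, left;
        final String text;

        Chunk(int top, int left, String text) {
            this.top = top;
            this.left = left;
            this.text = text;
        }
    }

    /** Groups chunks into visual lines: a chunk joins the current line if its top
     *  lies within `tolerance` of the top of that line's first chunk. */
    static List<List<Chunk>> groupIntoLines(List<Chunk> chunks, int tolerance) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort(Comparator.comparingInt((Chunk c) -> c.top).thenComparingInt(c -> c.left));
        List<List<Chunk>> lines = new ArrayList<>();
        int lineTop = 0;
        for (Chunk c : sorted) {
            if (lines.isEmpty() || c.top - lineTop > tolerance) {
                lines.add(new ArrayList<>());
                lineTop = c.top;
            }
            lines.get(lines.size() - 1).add(c);
        }
        // Tops inside one line may differ slightly, so order each line left to right.
        for (List<Chunk> line : lines) {
            line.sort(Comparator.comparingInt(c -> c.left));
        }
        return lines;
    }

    /** Returns {firstLine, lastLine} index pairs for every run of at least
     *  `minRows` consecutive lines that contain more than one chunk. */
    static List<int[]> findTableCandidates(List<List<Chunk>> lines, int minRows) {
        List<int[]> tables = new ArrayList<>();
        int start = -1;
        for (int i = 0; i <= lines.size(); i++) {
            boolean multi = i < lines.size() && lines.get(i).size() > 1;
            if (multi && start < 0) {
                start = i;
            }
            if (!multi && start >= 0) {
                if (i - start >= minRows) {
                    tables.add(new int[] {start, i - 1});
                }
                start = -1;
            }
        }
        return tables;
    }

    public static void main(String[] args) {
        List<Chunk> chunks = new ArrayList<>();
        chunks.add(new Chunk(100, 50, "Some introductory sentence."));
        chunks.add(new Chunk(130, 50, "Year"));
        chunks.add(new Chunk(130, 200, "Income"));
        chunks.add(new Chunk(150, 50, "2003"));
        chunks.add(new Chunk(150, 200, "120"));
        chunks.add(new Chunk(200, 50, "Some closing sentence."));
        List<List<Chunk>> lines = groupIntoLines(chunks, 3);
        for (int[] t : findTableCandidates(lines, 2)) {
            System.out.println("table candidate: lines " + t[0] + " to " + t[1]);
        }
    }
}
```

A rule this simple suits tables without spanning cells. A cell that spans several rows or columns changes the chunk count of its line, which is one reason complex tables are harder to handle.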
The following table is taken from a PDF file.
My tool produces an XML file with data-row information for the tables
in a PDF file. Rendered with the corresponding style sheet, the
extracted information looks as follows:
Burcu Yildiz. Information Extraction - Utilizing Table Patterns, Master's thesis, Vienna University of Technology, 2004.
Burcu Yildiz, Katharina Kaiser, Silvia Miksch. pdf2table: A Method to Extract Table Information from PDF Files, In: Proceedings of the 2nd Indian International Conference on Artificial Intelligence (IICAI05), Pune, India, 2005.
For Windows: pdf2table_win.jar v2.0, Source code under the GNU License: GNU_pdf2tableWIN.zip
For MacOS and Linux: pdf2table.jar v2.0, Source code under the GNU License: GNU_pdf2tableMAC.zip
Installation ReadMe: readme.txt