Recently I have been working on a lot of data migration to XML of all kinds: PDF documents to XML, Word documents to XML, SGML to XML. My love for the angle brackets is obvious. For the kind of data that I migrate, there aren't any suitable off-the-shelf tools available. The approach is to build a customized migration engine that migrates the input incrementally.
PDF to XML - the rules
- PDF generally doesn't contain any structure information. MS Word and other word processors hold information in objects: paragraphs, pictures, tables, lists. This information is usually accessible through an object model that the application exposes via an API. PDF, on the other hand, is like a canvas, with text and images painted on a flat white page.
- There are tools that help you create a PDF file from an object-based structure consisting of paragraphs, images, etc., but none of them can parse a PDF back into an object structure well enough (some of them, like PDFBox, can extract text, but still no structure). If you extract information from a PDF file, what you get is a dump of all the text with positional (X and Y coordinates on the page) and font information. I have used PDF2HTML for this before; it works well with single-column PDF documents.
- PDF does hold some information in an object model (bookmarks, the TOC, etc.) that can be extracted using some of the available libraries, but that information is rarely of any use.
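To make the "dump of text with positional and font information" concrete, here is a minimal Python sketch. The `TextFragment` record, its field names, and the sample values are assumptions made for illustration; they are not the output format of PDFBox or PDF2HTML. The sketch groups fragments into visual lines by their Y coordinate, which is usually the first thing to do before any structure can be inferred.

```python
from dataclasses import dataclass


@dataclass
class TextFragment:
    """One piece of text as an extractor might report it (hypothetical shape)."""
    page: int
    x: float      # left edge, in points
    y: float      # baseline, assumed to be measured from the top of the page
    font: str
    size: float
    text: str


def group_into_lines(fragments, y_tolerance=2.0):
    """Group fragments whose baselines are within y_tolerance into lines."""
    ordered = sorted(fragments, key=lambda f: (f.page, f.y, f.x))
    lines = []
    for frag in ordered:
        last = lines[-1] if lines else None
        same_line = (
            last is not None
            and last[0].page == frag.page
            and abs(last[0].y - frag.y) <= y_tolerance
        )
        if same_line:
            last.append(frag)
        else:
            lines.append([frag])
    # Within each line, read left to right.
    return [sorted(line, key=lambda f: f.x) for line in lines]


# Made-up sample data: a heading split into two fragments, then a paragraph.
fragments = [
    TextFragment(1, 120.0, 100.5, "Times-Bold", 14.0, "Introduction"),
    TextFragment(1, 72.0, 100.0, "Times-Bold", 14.0, "1."),
    TextFragment(1, 72.0, 130.0, "Times-Roman", 10.0, "PDF is like a canvas."),
]

for line in group_into_lines(fragments):
    print(" ".join(f.text for f in line))
```

Note that this simple grouping assumes a single column; with multi-column pages, fragments from different columns can share a baseline and would need to be separated by X position first.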
PDF to XML - what do you want to do with it?
- The objective is to produce an XML file that is usually hierarchical (sublists within lists, images/paragraphs/tables inside the sublist items), much like an MS Word Outline Numbering document.
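As a rough illustration of that target shape, the Python sketch below turns a flat sequence of (level, text) pairs into nested XML using a stack. The element names `list` and `item` and the sample outline are assumptions for illustration only; a real target DTD would dictate its own names.

```python
import xml.etree.ElementTree as ET


def build_outline(items):
    """Nest (level, text) pairs, level 1 being outermost, into <list>/<item> XML."""
    root = ET.Element("list")
    # stack[i] is the <list> element currently open at level i + 1.
    stack = [root]
    last_item = None
    for level, text in items:
        if level < 1:
            raise ValueError("levels start at 1")
        if level > len(stack):
            # Going deeper: open a sublist inside the previous item.
            if level != len(stack) + 1 or last_item is None:
                raise ValueError("levels may only deepen one step at a time")
            stack.append(ET.SubElement(last_item, "list"))
        else:
            # Same level or shallower: close any deeper lists.
            del stack[level:]
        last_item = ET.SubElement(stack[-1], "item")
        last_item.text = text
    return root


outline = [(1, "A"), (2, "A.1"), (2, "A.2"), (1, "B")]
print(ET.tostring(build_outline(outline), encoding="unicode"))
```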
PDF to XML - approach
- The migration engine we developed at Imfinity was a template-based application, a kind of reverse MS Word. When you write a Word document, you write text and apply formatting to it by selecting a style.
- The same is done on the PDF document, but in the reverse direction: the formatting is used to recover the style. Using a divide-and-conquer approach, the document is first divided into the highest-level templates (sections, or perhaps the lists separated out). This can be done by marking a range of page numbers as one section, or as a top-level item in the target DTD. Top-level lists can be isolated using indentation or heading formatting.
- Going further down the hierarchy, within each high-level element, we mark out a certain combination of font, indentation, and spacing as a template. As soon as one is marked, all text items with the same formatting (based on indentation and font) can be separated and tagged as a particular simpler XML element.
- The above approach incrementally produces a simple structured XML document. This XML document doesn't need to comply with any particular DTD, but it should capture all the fields and text that need to be mapped into the target DTD. A simple structure like the image below might be sufficient.
- Once we have this basic XML document, we write advanced XSLT scripts to transform the data into the target DTD.
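The template-marking step described above can be sketched roughly as follows in Python. The names `Line`, `Template`, and `tag_lines`, the tolerance values, and the element names are all hypothetical; the internals of the original engine are not described here, so this is only one way such matching might look. Each text line is compared against the marked templates by font, size, and left indentation, and emitted as a flat XML element that is not tied to any DTD, leaving the restructuring to the later XSLT pass.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass(frozen=True)
class Line:
    text: str
    font: str
    size: float
    indent: float  # left X coordinate, in points


@dataclass(frozen=True)
class Template:
    element: str   # XML element name to emit for matching lines
    font: str
    size: float
    indent: float

    def matches(self, line, indent_tolerance=3.0):
        """True if the line has the same font and size and a similar indent."""
        return (
            line.font == self.font
            and abs(line.size - self.size) < 0.1
            and abs(line.indent - self.indent) <= indent_tolerance
        )


def tag_lines(lines, templates, fallback="unknown"):
    """Wrap each line in the element of the first matching template."""
    root = ET.Element("document")
    for line in lines:
        name = next((t.element for t in templates if t.matches(line)), fallback)
        ET.SubElement(root, name).text = line.text
    return root


# Made-up templates and lines for illustration.
templates = [
    Template("heading", "Times-Bold", 14.0, 72.0),
    Template("para", "Times-Roman", 10.0, 72.0),
    Template("listitem", "Times-Roman", 10.0, 90.0),
]
lines = [
    Line("1. Scope", "Times-Bold", 14.0, 72.0),
    Line("This section describes the scope.", "Times-Roman", 10.0, 72.5),
    Line("a) first item", "Times-Roman", 10.0, 90.0),
    Line("Page 3", "Helvetica", 8.0, 300.0),
]
print(ET.tostring(tag_lines(lines, templates), encoding="unicode"))
```

Lines that match no template fall through to the `unknown` element, which gives a simple way to see which formatting still needs a template on the next pass.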