Introduction to Docvert
The Drupal Docvert module is a plugin that helps convert proprietary Microsoft Word documents into workable HTML, managed by Drupal as nodes, pages and books.
Important: This Drupal module does not perform the whole conversion on its own. It must integrate with a Docvert service, which needs to be hosted on a nearby machine.
Setting up that stand-alone service in the first place can be quite a job: you need full admin control of the machine it runs on, and if you can't supply or find such a machine, the Drupal module alone will not be much help to you.
Instructions for setting it up are, however, available here and from the Docvert home site.
The Process
- An editor creates content within Drupal, and attaches an MS Word document as a filefield attachment.
- On demand, that attached file is POSTed to the Docvert web service.
- On the service end:
  - An instance of LibreOffice (née OpenOffice) is launched to handle the file.
  - API methods within LibreOffice are called to parse the Word document and export it as structured HTML, including the embedded images and a subset of the formatting.
  - Optional additional 'pipeline' steps use XSLT to process the HTML into tidied, templated results.
  - The entire result is packaged into a ZIP file and returned to the caller.
See the Docvert FAQ for more about how and why this works.
- The Drupal Docvert module unpacks those results and inserts the text into the Drupal node, or into a set of book pages.
- Returned embedded images are saved locally, and re-linked into the resulting pages.
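
The last two steps (unpacking the returned ZIP and re-linking the images) can be sketched roughly as follows. This is an illustration in Python, not the module's own code. It assumes the ZIP holds one HTML file plus image files that the HTML references by their archive path; the real layout of a Docvert result may differ. The function name `unpack_conversion` and the `public_prefix` parameter are hypothetical. The upload itself is left out, because the request fields depend on the Docvert service.

```python
import io
import zipfile
from pathlib import Path

# Assumption: these are the image types worth saving; adjust as needed.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".gif"}


def unpack_conversion(zip_bytes, files_dir, public_prefix):
    """Unpack a conversion result and re-link its images.

    zip_bytes:     raw bytes of the ZIP returned by the service
    files_dir:     local directory where images are saved
    public_prefix: URL prefix under which files_dir is served (hypothetical)

    Returns (html, saved), where saved maps each archive path to its new URL.
    """
    Path(files_dir).mkdir(parents=True, exist_ok=True)
    html = ""
    saved = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            suffix = Path(name).suffix.lower()
            if suffix in (".html", ".htm") and not html:
                # Assumption: the first HTML file is the page body.
                html = archive.read(name).decode("utf-8")
            elif suffix in IMAGE_SUFFIXES:
                # Keep only the base name so archive paths cannot
                # escape files_dir.
                target = Path(files_dir) / Path(name).name
                target.write_bytes(archive.read(name))
                saved[name] = f"{public_prefix}/{target.name}"
    # Point each image reference at its locally saved copy.
    for original, public_url in saved.items():
        html = html.replace(f'src="{original}"', f'src="{public_url}"')
    return html, saved
```

Called with the response body, a writable directory and the URL prefix that directory is served under, this returns HTML that could then be stored as the node body. A real implementation would also need to handle name collisions between images and results that contain several HTML files (for example, one per book page).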