Set up a Docvert Service as an appliance on AWS EC2

The Docvert service runs well as a dedicated web appliance, on a machine that is not used for anything else. It is performance-intensive, but generally used only rarely. As an experiment, I tried creating an Amazon Web Services "Elastic Compute Cloud" (EC2) machine instance to use as a web appliance. The following instructions got me from start to finish, with a working result. YMMV.

Get started with Amazon Web Services

I won't go into detail here; instead, check out a few useful tutorials. The AWS control panel has got heaps better in recent years. It does require a bit of reading to get used to the terminology, but it's cheap to test (around 50c a day). You can do all the virtual server provisioning from their Web UI, and it's a good idea to start there.

The instructions below first summarize how you could do it through the UI, and are followed by a keylog that allows you to do it from a script.

The main step by step guide to AWS/EC2 has screenshots and things.

For the Ubuntu and commandline-minded, the Ubuntu docs EC2StartersGuide is a little old, but still valid. Hidden away a bit is the reference for all the EC2 commandline tools. The inline help for those commands is a bit too brief.

Overview

  1. Get an AWS account (see online guides).
  2. Choose a small Debian-based server image to launch. Today I'm using AMI: ebs/ubuntu-images/ubuntu-natty-11.04-amd64-server-20110426 (ami-60582132)
  3. Launch that instance. It's easiest through the GUI. Refer to the AWS/EC2 docs.
  4. TAKE CARE to ensure that the 'security group' you choose or make has the SSH and HTTP ports open; my 'default' one didn't, and that lost me some time. If you want to use the Python version on port 8080, you should add that port to your security group as well.
  5. You should find yourself with a named *.pem file which is your 'keypair'. Keep that safe, we'll need it later. A good place to store it is inside ~/.ec2/
  6. Connect to your new machine via ssh. If it's running, you can select the instance, and click [Instance Actions][Connect] for the exact ssh connection string. This includes its new web address, which you will need to know!

You are now logged in. Skip down to the steps for installing Docvert.

Setting up an AWS EC2 instance programmatically

Install the AWS developer commandline tools as needed.

https://help.ubuntu.com/community/EC2StartersGuide
http://docs.amazonwebservices.com/AWSEC2/latest/CommandLineReference/index.html?ApiReference-cmd-DescribeInstances.html

Launch an EC2 Instance from the comfort of your own commandline

The keypair ???.pem file you were given has a name as well as a filename. The commandline tools need the raw name, so it's probably best to give the file the same name as the keypair. My example keypair is named dmanAWS and is stored in ~/.ec2/dmanAWS.pem
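
One thing that can bite at this point: ssh refuses to use a private key file that other users can read, so it is worth tightening the permissions on the .pem file before going any further. A minimal sketch, assuming a POSIX-ish shell. The name secure_keypair is made up for this example and is not part of any AWS tool.

```shell
# secure_keypair FILE
# Hypothetical helper (not an AWS command): makes a private key file
# readable and writable by its owner only, which is what ssh expects.
secure_keypair() {
    keyfile="$1"
    if [ ! -f "$keyfile" ]; then
        echo "No such keypair file: $keyfile" >&2
        return 1
    fi
    chmod 600 "$keyfile"
}

# Example, using the keypair location from this page:
# secure_keypair ~/.ec2/dmanAWS.pem
```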

# Breaking the settings out so we can see them ...
export EC2_KEYPAIR=dmanAWS    # THIS will be different for you
export EC2_KEYPAIR_FILE=~/.ec2/${EC2_KEYPAIR}.pem
export EC2_AMI=ami-60582132   # This is an ubuntu natty server base
export EC2_TYPE=t1.micro      # I found this out when choosing the base ami
export EC2_REMOTE_USER=ubuntu # This is specific to the instances, sometimes it's 'root'
export EC2_GROUP=sg-72277a20  # This is your own security group ID. Set it up in the UI first

ec2-run-instances $EC2_AMI --key $EC2_KEYPAIR --instance-type $EC2_TYPE --group $EC2_GROUP > ~/ec2-instance_info.txt

#... And you get a running machine, though we probably should also set a name?

# We have recorded the output of that command into ~/ec2-instance_info.txt
# so that we can work on the info it returned - the new ID we have been assigned.
# Get some info from it so we know which machine we are now talking about
export EC2_INSTANCE=`awk '$1 == "INSTANCE" { print $2 }' ~/ec2-instance_info.txt`
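
If ec2-run-instances fails (a wrong keypair name or security group, say), the awk line above quietly leaves EC2_INSTANCE empty and every later command misbehaves. A small guard can catch that early. This is only a sketch: is_instance_id is a hypothetical helper name, and the pattern ("i-" followed by hex digits) is my assumption about how instance IDs look.

```shell
# is_instance_id VALUE
# Hypothetical helper: succeeds only if VALUE looks like an EC2 instance ID.
# ASSUMPTION: an ID is "i-" followed by one or more lowercase hex digits.
is_instance_id() {
    case "$1" in
        i-*) ;;              # must start with "i-"
        *) return 1 ;;
    esac
    rest="${1#i-}"           # whatever follows the "i-" prefix
    case "$rest" in
        ''|*[!0-9a-f]*) return 1 ;;   # empty, or contains a non-hex character
    esac
    return 0
}

# Example, right after the awk line above:
# is_instance_id "$EC2_INSTANCE" || echo "Launch failed? Check ~/ec2-instance_info.txt" >&2
```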

Now we have a running machine (you can probably see it if you refresh the AWS GUI) but we only know its instance ID, not the public web address. It will be something horrible like http://ec2-122-248-213-221.ap-southeast-1.compute.amazonaws.com/ for now. We need to retrieve that. Though you can get it from the GUI too, here is a commandline way.

# At the time the request was made, we had not actually been assigned an address.
# Wait a bit, and then ask for it. Need to parse the text response for the bit we need.
sleep 10 # force a wait, it's pretty quick
ec2-describe-instances --filter "instance-id=$EC2_INSTANCE" > ~/ec2-instance_info2.txt 
export EC2_ADDRESS=`awk '$1 == "INSTANCE" { print $4 }' ~/ec2-instance_info2.txt`
export EC2_IP=`awk '$1 == "INSTANCE" { print $14 }' ~/ec2-instance_info2.txt`
# With this info, we can now connect to it like this :
ssh -i $EC2_KEYPAIR_FILE $EC2_REMOTE_USER@$EC2_ADDRESS
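
The fixed "sleep 10" above is a guess, and picking fields by position ($4, $14) depends on the exact output layout. A slightly more patient variant is to poll until something that looks like a public hostname shows up. This is a sketch under assumptions: both helper names are made up, the try count and delay are arbitrary, and the hostname pattern is based only on the example address shown earlier.

```shell
# extract_public_dns
# Hypothetical helper: reads text on stdin and prints the first thing that
# looks like an EC2 public hostname, or nothing if there is none yet.
# ASSUMPTION: hostnames look like ec2-<numbers>.<something>.amazonaws.com
extract_public_dns() {
    grep -o 'ec2-[0-9-]*\.[a-z0-9.-]*amazonaws\.com' | head -n 1
}

# retry_until_output TRIES DELAY COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND until it prints something, waiting DELAY
# seconds between attempts. Prints that output and returns 0, or returns 1
# once TRIES attempts have all come back empty.
retry_until_output() {
    tries="$1"; delay="$2"; shift 2
    attempt=1
    while [ "$attempt" -le "$tries" ]; do
        output=$("$@") || true
        if [ -n "$output" ]; then
            printf '%s\n' "$output"
            return 0
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
    done
    return 1
}

# Example with the commands from this page (needs the EC2 tools installed):
# get_address() {
#     ec2-describe-instances --filter "instance-id=$EC2_INSTANCE" | extract_public_dns
# }
# export EC2_ADDRESS=`retry_until_output 12 5 get_address`
```

Matching on the hostname pattern means the loop keeps waiting while the instance is still pending, rather than picking up whatever happens to sit in the fourth column.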

Turn it off when you are done

ec2-terminate-instances $EC2_INSTANCE

Installing Docvert to run as an appliance

This is mostly the same as the main manual instructions, just run together a bit.

The above code was run from your terminal. The below code should be run on the target server, after ssh-ing into it.

sudo apt-get update
sudo apt-get dist-upgrade

# Note that in recent versions you need to list libreoffice explicitly, and first:
# the docvert dpkg doesn't pull in everything it really needs.
sudo apt-get install -y  libreoffice libreoffice-java-common docvert 

# This is an extra manual step required to enable the actual web service
sudo sed -i s/#Alias/Alias/  /etc/apache2/conf.d/docvert 

# Also some settings so the web service can use temp files
sudo apache2ctl stop
sudo usermod --home /tmp www-data
sudo apache2ctl start

# The above instructions WORK. 2011-07
# You should be able to access http://${DOCVERT_SERVER}/docvert and see stuff
# but an alternative option is also recommended by Matthew. Unfortunately, it is unfinished.

# Additionally, try the Python version.
# This should co-exist with version 4, so both methods are available

# Now we install version 5 of Docvert 

# The docvert 5 version does not check dependencies! I've found we need (at least) the following
sudo apt-get install -y libreoffice python-lxml  python-imaging 
# maybe these too, if things don't work first time?
# sudo apt-get install -y python-bottle pdf2svg python-rsvg

# Get the package
wget http://holloway.co.nz/docvert/docvert-5.tar.gz
tar -xzf docvert-5.tar.gz 
# Put it nearby, but not in the same place as the other; they are incompatible
sudo cp -r holloway-docvert-*/ /usr/share/docvert-5/

# Optional - take over port 80
#sudo sed -i s/port=8080/port=80/ /usr/share/docvert-5/docvert-web.py

# The 'listen' only listens on known hostnames? Change that to wildcard
sudo sed -i s/host=\'localhost\'/host=\'0.0.0.0\'/ /usr/share/docvert-5/docvert-web.py

# Docvert expects you to start a daemon to manage office application calls
# Starting this manually is optional if we are building a hybrid appliance: it will already have the correct startup script installed.
# sudo /usr/bin/soffice -headless -norestore -nologo -nofirststartwizard -accept="socket,port=2002;urp;" &
 
# Start it
cd /usr/share/docvert-5/
sudo python docvert-web.py &

About now you should be able to visit http://{DOCVERT_SERVER}/docvert, where {DOCVERT_SERVER} is the public address of your instance, and see some action. And if you are really lucky, http://{DOCVERT_SERVER}:8080 will also work for the new alternative.
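
If neither address responds, one thing worth ruling out is that the sed edits did nothing: sed exits happily even when its pattern never matched. A read-only check can be run on the server afterwards. This is a sketch, assuming the config files contain the strings that the sed commands on this page target. expect_in_file is a made-up helper name.

```shell
# expect_in_file FILE REGEX
# Hypothetical helper: reports whether FILE has a line matching REGEX
# (a basic grep regular expression). Returns non-zero when it does not.
expect_in_file() {
    if grep -q -- "$2" "$1"; then
        echo "ok: $1 matches $2"
    else
        echo "MISSING: $1 has no line matching $2" >&2
        return 1
    fi
}

# On the server, after the sed commands on this page:
# expect_in_file /etc/apache2/conf.d/docvert '^[[:space:]]*Alias'
# expect_in_file /usr/share/docvert-5/docvert-web.py "host='0\.0\.0\.0'"
```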