In this post I talk about how you can use Django, Dramatiq and h2p to create a simple HTTP API that can turn any URL into a PDF.
What is Dramatiq?
Dramatiq is a distributed task processing library for Python 3 that I've been working on as an alternative to Celery. Using Dramatiq, you can transparently run functions in the background across a large number of machines. In this post I'm going to use it to offload the work of generating PDFs from the web server onto a background processing server.
Why use a task queue?
Long-running or computationally-intensive tasks in the middle of the request-response cycle of a web server can severily impact the latency and throughput of that server. A common pattern to work around this issue is to use a task queue to offload the parts of the request that can be done later and in the background off to a different fleet of servers known as workers. This has other advantages, too: tasks may easily be retried later in case there's an error and you can run tasks completely outside of the request-response cycle (eg. using a cron job).
Generating PDFs from web pages is a slow process so we want to take that out of the request and give the requester a way to poll for the result of the operation.
Setup
First things first, we're going to need a message broker. Dramatiq currently works with Redis and RabbitMQ, but for this post I'm going to use RabbitMQ. To install it on macOS, you can run:
$ brew install rabbitmq
Run it with rabbitmq-server
.
Next, we're going to create a new virtual environment and, inside of that environment, use pipenv to install all the prerequisite libraries:
$ pipenv install django djangorestframework django_dramatiq "dramatiq[rabbitmq, watch]" h2p
django_dramatiq
is a small Django app that makes integrating Dramatiq and Django easy.
After that's done, we're going to create a Django project called
pdfapi
:
$ django-admin.py startproject pdfapi .
Finally, we need to configure django_dramatiq
to use RabbitMQ. In
pdfapi/settings.py
, add django_dramatiq
and rest_framework
to your
INSTALLED_APPS
:
INSTALLED_APPS = [
'django.contrib.admin',
'django.contrib.auth',
'django.contrib.contenttypes',
'django.contrib.sessions',
'django.contrib.messages',
'django.contrib.staticfiles',
'django_dramatiq',
'rest_framework',
]
And configure the broker in the same file:
DRAMATIQ_BROKER = {
"BROKER": "dramatiq.brokers.rabbitmq.RabbitmqBroker",
"OPTIONS": {
"url": "amqp://localhost:5672",
},
"MIDDLEWARE": [
"dramatiq.middleware.Prometheus",
"dramatiq.middleware.AgeLimit",
"dramatiq.middleware.TimeLimit",
"dramatiq.middleware.Retries",
"django_dramatiq.middleware.AdminMiddleware",
"django_dramatiq.middleware.DbConnectionsMiddleware",
]
}
Let's run the migrations and then the server to make sure everything's working so far:
$ python manage.py migrate
$ python manage.py runserver
If you visit http://127.0.0.1:8000, you should now see the familiar
"Congratulations on your first Django-powered page" view. Kill the
server and create a new app called pdfs
:
$ python manage.py startapp pdfs
The API
The API we're going to define is going to be very simple. It will
accept POST
requests to /v1/pdfs
containing the url
we're
expected to convert into a PDF, these requests will submit a task to
generate the PDF and immediately return a JSON object with an id
and
a status
field that the caller can then use to keep track of the
job.
Using the id
field from the response, the caller will be able to
poll /v1/pdfs/{id}
to find out what the status of the task is.
The PDF model
In pdfs/models.py
declare the following model:
class PDF(models.Model):
STATUS_PENDING = "pending"
STATUS_FAILED = "failed"
STATUS_DONE = "done"
STATUSES = [
(STATUS_PENDING, "Pending"),
(STATUS_FAILED, "Failed"),
(STATUS_DONE, "Done"),
]
source_url = models.CharField(max_length=512)
status = models.CharField(
max_length=10,
default=STATUS_PENDING,
choices=STATUSES,
)
@property
def filename(self):
raise NotImplementedError
@property
def pdf_url(self):
raise NotImplementedError
We're going to skip the implementations of the filename
and
pdf_url
properties for now.
Build and run the migrations:
$ python manage.py makemigrations
$ python manage.py migrate
Then add a serializer for that model in pdfs/serializers.py
:
from rest_framework import serializers
from .models import PDF
class PDFSerializer(serializers.ModelSerializer):
source_url = serializers.URLField(max_length=512)
pdf_url = serializers.URLField(read_only=True)
class Meta:
model = PDF
fields = ("id", "source_url", "pdf_url", "status")
read_only_fields = ("status",)
We're going to use this serializer to render PDF models as JSON and to validate incoming requests.
The Views
In pdfs/views.py
add the following views:
from django.views.decorators.csrf import csrf_exempt
from rest_framework import status
from rest_framework.decorators import api_view
from rest_framework.response import Response
from .models import PDF
from .serializers import PDFSerializer
@csrf_exempt
@api_view(["POST"])
def create_pdf(request):
serializer = PDFSerializer(data=request.data)
if serializer.is_valid():
serializer.save()
return Response(serializer.data)
return Response(serializer.errors, status=status.HTTP_400_BAD_REQUEST)
@api_view(["GET"])
def view_pdf(request, pk):
try:
pdf = PDF.objects.get(pk=pk)
serializer = PDFSerializer(pdf)
return Response(serializer.data)
except PDF.DoesNotExist:
return Response(status=status.HTTP_404_NOT_FOUND)
And then hook them up in pdfs/urls.py
:
from django.conf.urls import url
from . import views
app_name = "pdfs"
urlpatterns = [
url(r"^$", views.create_pdf, name="create_pdf"),
url(r"^(?P<pk>\d+)$", views.view_pdf, name="view_pdf"),
]
Add the pdfs
app to pdfapi/settings.py
:
INSTALLED_APPS = [
'django.contrib.admin',
'django.contrib.auth',
'django.contrib.contenttypes',
'django.contrib.sessions',
'django.contrib.messages',
'django.contrib.staticfiles',
'django_dramatiq',
'rest_framework',
'pdfs',
]
Finally, include the pdfs
urls in pdfapi/urls.py
:
from django.conf.urls import url, include
from django.contrib import admin
urlpatterns = [
url(r'^admin/', admin.site.urls),
url(r'^v1/pdfs/', include("pdfs.urls")),
]
At this point if you run the development server and visit /v1/pdfs
you should be able to interact with the API.
The Task
So far we've declared the model and an API that lets us interact with
it, but we haven't done anything to actually generate PDFs so every
PDF we create using the API is going to be in a perpetual pending
state. Let's fix that.
In pdfs/tasks.py
add the following task:
import dramatiq
import h2p
from .models import PDF
@dramatiq.actor
def generate_pdf(pk):
pdf = PDF.objects.get(pk=pk)
try:
h2p.generate_pdf(
pdf.filename,
source_uri=pdf.source_url,
).result()
pdf.status = PDF.STATUS_DONE
except h2p.ConversionError:
pdf.status = PDF.STATUS_FAILED
pdf.save()
Let's break this down a little bit. generate_pdf
is just a normal
Python function that we've decorated with @dramatiq.actor
. This
makes it possible to run the function asynchronously.
generate_pdf
takes a pk
parameter representing the id of a PDF
,
this is important because tasks are distributed and we wouldn't want
to send entire PDF
objects over the network. It delegates the work
of actually creating the PDF to h2p
and updates the PDF
object's
status based on the result of that operation.
We're passing the filename
property of PDF
to h2p.generate_pdf
but we haven't implemented it yet so let's fill it and pdf_url
in on
the PDF
model in pdfs/models.py
:
@property
def filename(self):
return f"{settings.MEDIA_ROOT}{self.pk}.pdf"
@property
def pdf_url(self):
return f"{settings.MEDIA_URL}{self.pk}.pdf"
Don't forget to add MEDIA_ROOT
and MEDIA_URL
to
pdfapi/settings.py
:
MEDIA_ROOT = os.path.join(BASE_DIR, "files/")
MEDIA_URL = "/media"
Create the files
folder and then add a static handler to
pdfapi/urls.py
:
from django.conf import settings
from django.conf.urls import url, include
from django.conf.urls.static import static
from django.contrib import admin
urlpatterns = [
url(r'^admin/', admin.site.urls),
url(r'^v1/pdfs/', include("pdfs.urls")),
] + static(settings.MEDIA_URL, document_root=settings.MEDIA_ROOT)
Hooking 'em up
At this point we've created a task that can generate PDF files and an API that can submit and keep track of that work. Let's hook them up!
In the create_pdf
view from pdfs/views.py
change the
serializer.save()
call to:
if serializer.is_valid():
pdf = serializer.save()
generate_pdf.send(pdf.pk)
return Response(serializer.data)
Now every time someone creates a PDF
object using the API, we'll
enqueue a generate_pdf
task. Spin up the API server and some
Dramatiq workers and test it out.
$ python manage.py runserver
$ python manage.py rundramatiq # in a separate terminal window
To test it out, send a create request using curl:
$ curl -d"source_url=http://example.com" http://127.0.0.1:8000/v1/pdfs/
{"id":1,"source_url":"http://example.com","pdf_url":"/media/1.pdf","status":"pending"}
Then poll using GET requests until it's ready:
$ curl http://127.0.0.1:8000/v1/pdfs/1
{"id":1,"source_url":"http://example.com","pdf_url":"/media/1.pdf","status":"done"}
Finally, visit http://127.0.0.1:8000/media/1.pdf to view the generated PDF.
Next Steps
You can find the full code on GitHub. If you want to learn more about Dramatiq (and hopefully you do!) head on to the docs. I've put a lot of work into making them as accessible as possible.
Happy coding!