MongoDB Aggregation Pipeline and Pagination

One of the most common problems in web development is paginating a set of data.
It’s common that the set of data we need to show our users is pretty big and we might want to show only a part of it retrieving the next slices only when requested.

Skip/Limit Pagination

The most common way to achieve this is usually to count the number of items and split them in pages, each of a fixed set of items. This is usually achieved through the limit, skip and count operators.

For the purpose of this post I’ll be relying on a collection containing tickets for a project management tool, each document looks like:

{u'_id': ObjectId('4eca327260fc00346500000f'),
 u'_sprint': ObjectId('4eca318460fc00346500000b'),
 u'complexity': u'days',
 u'description': u'This is the ticket description',
 u'priority': u'low',
 u'status': u'new',
 u'title': u'This is the ticket main title'}

stored in the projects.ticket collection:

import pymongo
con = pymongo.MongoClient()
db = con.projects
tickets = db.ticket

then we can know the total amount of pages by counting the entries in the collection and dividing them for the number of entries we want to show in each page. In this case we are going to paginate over the list of tickets in status done (that have been completed by developer):

import math
pagecount = math.ceil(float(tickets.find({'status': 'done'}).count()) / ITEMS_PER_PAGE)
>>> pagecount

Now we know that to show all the items in our set we need 471 pages.
Then we just need to actually get the items in each page through limit and skip

page1 = list(tickets.find({'status': 'done'}).skip(0).limit(ITEMS_PER_PAGE))
page2 = list(tickets.find({'status': 'done'}).skip(1 * ITEMS_PER_PAGE).limit(ITEMS_PER_PAGE))
page3 = list(tickets.find({'status': 'done'}).skip(2 * ITEMS_PER_PAGE).limit(ITEMS_PER_PAGE))

and so on…

While this is not the most efficient paradigm (as skip is actually a pretty slow function), it’s one of the most common solutions to the pagination problem. It’s so common that it is usually the one you find in pagination support provided by libraries or frameworks.

For the pure purpose of comparison with other techniques, I’m going to time how long it takes to get number of pages and retrieve a page:

>>> def get_page():
...   pagecount = math.ceil(tickets.find({'status': 'done'}).count() / ITEMS_PER_PAGE)
...   page = list(tickets.find({'status': 'done'}).skip(1 * ITEMS_PER_PAGE).limit(ITEMS_PER_PAGE))
...   return pagecount, page

>>> import timeit
>>> timeit.timeit(get_page, number=1000)

So retrieving 1000 times a page for a set of 4703 items (totally there are 6313 items in the collection) with this approach required 2.3 seconds.

Aggregation based Pagination before 3.2

Trying to improve over this solution one might notice that to render each page we are required to perform two queries: one to get the total amount of items and one to retrieve the page items themselves. So we might try to achieve the same result using a single query to the database.

If there is a tool that allows us to perform multiple operations in a single command is the MongoDB aggregation pipeline, so might might try to see if there is a way to retrieve the items count and a page with a single aggregation pipeline.

First of all we know that we are only looking for the tickets in status done, so the first step of our pipeline will be a $match stage for those tickets:

pipeline = [
   {'$match': {'status': 'done'}}

Then we want to fetch those while actually counting them, which we can achieve through the $group stage which will put all the items in an array of which we can then ask the size through a $project stage:

pipeline = [
    {'$match': {'status': 'done'}},
    {'$group': {'_id': 'results', 'result': {'$push': '$$CURRENT'}}},
    {'$project': {'_id': 0, 'result': 1, 'pages': {'$divide': [{'$size': '$result'}, ITEMS_PER_PAGE]}}},

This is already enough the give us all the entries with their total, but we want to avoid having to send them all from the database to the client, so we can already slice them for the page we are looking for through the $limit and $skip stages. The only side-effect is that before being able to apply the $limit stage we must $unwind our array to get back a list of documents we can then limit:

pipeline = [
    {'$match': {'status': 'done'}},
    {'$group': {'_id': 'results', 'result': {'$push': '$$CURRENT'}}},
    {'$project': {'_id': 0, 'result': 1, 'pages': {'$divide': [{'$size': '$result'}, ITEMS_PER_PAGE]}}},
    {'$unwind': '$result'},
    {'$skip': 1 * ITEMS_PER_PAGE},
    {'$limit': ITEMS_PER_PAGE}

Now if we run our pipeline we will actually get 10 results (ITEMS_PER_PAGE is 10) with the total number of pages:

>>> r = list(tickets.aggregate(pipeline))
>>> len(r)
>>> r[0]
{u'pages': 470.3, u'result': {u'status': u'done', u'description': u"TICKET_DESCRIPTION", u'title': u'TICKET_TITLE', u'priority': u'HIGH', u'complexity': u'hour', u'_sprint': ObjectId('4eca331460fc00358d000005'), u'_id': ObjectId('4ecce02760fc0009fe00000d')}}

We will have to apply math.ceil to pages but most of the work is already done by mongodb, so we actually achieved our target.

Let’s see if this approach is actually faster for our date than the previous one:

>>> def get_page():
...   r = list(tickets.aggregate(pipeline))
...   return math.ceil(r[0]['pages']), r

>>> import timeit
>>> timeit.timeit(get_page, number=1000)

Sadly this approach is actually slower.

I wanted to show it empirically, but it was pretty clear that is would have been slower, because we are actually retrieving the whole set of data to store it inside an array we then have to unwind. First of all we retrieve far more data than previously, then we even push it inside an array and MongoDB arrays are not actually really performant for big amounts of data. As far as I remember they are implemented on memory arrays to support random indexing, so they have to be reallocated when growing, with the consequent cost of copying all the data each time.

Aggregation based Pagination on 3.2

One of the reason why the aggregation based approach is pretty slow is that it has to pass through the whole array of data twice, the second time just to unwind it. To our help in mongodb 3.2 a new array operator has been introduced, the $slice operator, which would remove the need to use $unwind, $skip and $limit to retrieve our page.

Let’s build a new pipeline based on the new operator and see if it can help us:

pipeline = [
    {'$match': {'status': 'done'}},
    {'$group': {'_id': 'results', 'result': {'$push': '$$CURRENT'}}},
    {'$project': {'_id': 0, 
                  'result': {'$slice': ['$result', 1 * ITEMS_PER_PAGE, ITEMS_PER_PAGE]}, 
                  'pages': {'$divide': [{'$size': '$result'}, ITEMS_PER_PAGE]}}},

Now the result will be a single document with pages and result values:

>>> r = next(tickets.aggregate(pipeline), None)
>>> r['pages']
>>> len(r['result'])
>>> r['result'][0]
{u'status': u'done', u'description': u'TICKET_DESCRIPTION', u'title': TICKET_TITLE', u'priority': u'HIGH', u'complexity': u'days', u'_sprint': ObjectId('4ece227d60fc003675000009'), u'_id': ObjectId('4ecbc52060fc0074bb00000d')}

so we are actually getting the same data as before with far fewer operations…
Let’s check if this approach is faster than the previous one:

>>> def get_page():
...   r = next(tickets.aggregate(pipeline), None)
...   return math.ceil(r['pages']), r['result']
>>> import timeit
>>> timeit.timeit(get_page, number=1000)

Well, we actually gained a 25% speed up related to the previous try, but still retrieving all the data and pushing it inside an array costs to much to follow this road. So far if we want to display the total amount of pages in pagination it seems that doing two separate queries is still the fastest solution for MongoDB.

There is actually another technique for pagination, which is called Range Based Pagination, it’s usually the way to achieve best performances when performing pagination and I plan to write another blog-post to show how to do it with MongoDB.

Leave a Reply

Your email address will not be published. Required fields are marked *