MongoDB Aggregation Pipeline and Pagination

One of the most common problems in web development is paginating a set of data.
The set of data we need to show our users is often pretty big, so we might want to show only a part of it, retrieving the next slices only when requested.

Skip/Limit Pagination

The most common way to achieve this is to count the number of items and split them into pages, each containing a fixed number of items. This is usually achieved through the limit, skip and count operators.

For the purpose of this post I’ll be relying on a collection containing tickets for a project management tool, where each document looks like:

{u'_id': ObjectId('4eca327260fc00346500000f'),
 u'_sprint': ObjectId('4eca318460fc00346500000b'),
 u'complexity': u'days',
 u'description': u'This is the ticket description',
 u'priority': u'low',
 u'status': u'new',
 u'title': u'This is the ticket main title'}

stored in the projects.ticket collection:

import pymongo
con = pymongo.MongoClient()
db = con.projects
tickets = db.ticket

then we can know the total number of pages by counting the entries in the collection and dividing them by the number of entries we want to show on each page. In this case we are going to paginate over the list of tickets in status done (those that have been completed by a developer):

import math
 
ITEMS_PER_PAGE = 10
 
pagecount = math.ceil(float(tickets.find({'status': 'done'}).count()) / ITEMS_PER_PAGE)
>>> pagecount
471.0

Now we know that to show all the items in our set we need 471 pages.
Then we just need to actually fetch the items of each page through limit and skip:

page1 = list(tickets.find({'status': 'done'}).skip(0).limit(ITEMS_PER_PAGE))
page2 = list(tickets.find({'status': 'done'}).skip(1 * ITEMS_PER_PAGE).limit(ITEMS_PER_PAGE))
page3 = list(tickets.find({'status': 'done'}).skip(2 * ITEMS_PER_PAGE).limit(ITEMS_PER_PAGE))

and so on…

While this is not the most efficient approach (skip is actually a pretty slow operation), it’s one of the most common solutions to the pagination problem. It’s so common that it is usually the one you find in the pagination support provided by libraries and frameworks.

For the pure purpose of comparison with other techniques, I’m going to time how long it takes to get the number of pages and retrieve a page:

>>> def get_page():
...   pagecount = math.ceil(float(tickets.find({'status': 'done'}).count()) / ITEMS_PER_PAGE)
...   page = list(tickets.find({'status': 'done'}).skip(1 * ITEMS_PER_PAGE).limit(ITEMS_PER_PAGE))
...   return pagecount, page

>>> import timeit
>>> timeit.timeit(get_page, number=1000)
2.3415567874908447

So retrieving a page 1000 times for a set of 4703 items (out of 6313 total items in the collection) with this approach required 2.3 seconds.

Aggregation based Pagination before 3.2

Trying to improve over this solution, one might notice that to render each page we are required to perform two queries: one to get the total number of items and one to retrieve the page items themselves. So we might try to achieve the same result using a single query to the database.

If there is a tool that allows us to perform multiple operations in a single command, it is the MongoDB aggregation pipeline, so we might try to see if there is a way to retrieve the items count and a page with a single aggregation pipeline.

First of all we know that we are only looking for the tickets in status done, so the first step of our pipeline will be a $match stage for those tickets:

pipeline = [
   {'$match': {'status': 'done'}}
]

Then we want to fetch those tickets while actually counting them. We can achieve this through the $group stage, which will push all the items into an array whose size we can then ask for in a $project stage:

pipeline = [
    {'$match': {'status': 'done'}},
    {'$group': {'_id': 'results', 'result': {'$push': '$$CURRENT'}}},
    {'$project': {'_id': 0, 'result': 1, 'pages': {'$divide': [{'$size': '$result'}, ITEMS_PER_PAGE]}}},
]

This is already enough to give us all the entries with their total, but we want to avoid having to send them all from the database to the client, so we can already slice them down to the page we are looking for through the $limit and $skip stages. The only side effect is that before being able to apply the $limit stage we must $unwind our array to get back a list of documents we can then limit:

pipeline = [
    {'$match': {'status': 'done'}},
    {'$group': {'_id': 'results', 'result': {'$push': '$$CURRENT'}}},
    {'$project': {'_id': 0, 'result': 1, 'pages': {'$divide': [{'$size': '$result'}, ITEMS_PER_PAGE]}}},
    {'$unwind': '$result'},
    {'$skip': 1 * ITEMS_PER_PAGE},
    {'$limit': ITEMS_PER_PAGE}
]

Now if we run our pipeline we will actually get 10 results (ITEMS_PER_PAGE is 10) with the total number of pages:

>>> r = list(tickets.aggregate(pipeline))
>>> len(r)
10
>>> r[0]
{u'pages': 470.3, u'result': {u'status': u'done', u'description': u"TICKET_DESCRIPTION", u'title': u'TICKET_TITLE', u'priority': u'HIGH', u'complexity': u'hour', u'_sprint': ObjectId('4eca331460fc00358d000005'), u'_id': ObjectId('4ecce02760fc0009fe00000d')}}

We still have to apply math.ceil to pages, but most of the work is already done by MongoDB, so we actually achieved our target.

Let’s see if this approach is actually faster for our data than the previous one:

>>> def get_page():
...   r = list(tickets.aggregate(pipeline))
...   return math.ceil(r[0]['pages']), r

>>> import timeit
>>> timeit.timeit(get_page, number=1000)
33.202540159225464

Sadly this approach is actually slower.

I wanted to show it empirically, but it was pretty clear that it would have been slower, because we are actually retrieving the whole set of data to store it inside an array we then have to unwind. First of all we retrieve far more data than before, then we even push it inside an array, and MongoDB arrays are not really performant for big amounts of data. As far as I remember they are implemented as in-memory arrays to support random indexing, so they have to be reallocated when growing, with the consequent cost of copying all the data each time.

Aggregation based Pagination on 3.2

One of the reasons why the aggregation based approach is pretty slow is that it has to pass through the whole array of data twice, the second time just to unwind it. To our help, MongoDB 3.2 introduced a new array operator, the $slice operator, which removes the need to use $unwind, $skip and $limit to retrieve our page.

Let’s build a new pipeline based on the new operator and see if it can help us:

pipeline = [
    {'$match': {'status': 'done'}},
    {'$group': {'_id': 'results', 'result': {'$push': '$$CURRENT'}}},
    {'$project': {'_id': 0, 
                  'result': {'$slice': ['$result', 1 * ITEMS_PER_PAGE, ITEMS_PER_PAGE]}, 
                  'pages': {'$divide': [{'$size': '$result'}, ITEMS_PER_PAGE]}}},
]

Now the result will be a single document with pages and result values:

>>> r = next(tickets.aggregate(pipeline), None)
>>> r['pages']
470.3
>>> len(r['result'])
10
>>> r['result'][0]
{u'status': u'done', u'description': u'TICKET_DESCRIPTION', u'title': u'TICKET_TITLE', u'priority': u'HIGH', u'complexity': u'days', u'_sprint': ObjectId('4ece227d60fc003675000009'), u'_id': ObjectId('4ecbc52060fc0074bb00000d')}

so we are actually getting the same data as before with far fewer operations…
Let’s check if this approach is faster than the previous one:

>>> def get_page():
...   r = next(tickets.aggregate(pipeline), None)
...   return math.ceil(r['pages']), r['result']
... 
>>> import timeit
>>> timeit.timeit(get_page, number=1000)
26.79308009147644

Well, we actually gained a 25% speedup over the previous attempt, but retrieving all the data and pushing it into an array still costs too much to follow this road. So far, if we want to display the total number of pages in a pagination, it seems that doing two separate queries is still the fastest solution for MongoDB.

There is actually another technique for pagination, called Range Based Pagination; it’s usually the way to achieve the best performance when paginating, and I plan to write another blog post showing how to do it with MongoDB.
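
Just to give an idea in the meantime, here is a minimal sketch of what range-based pagination looks like over the same collection, paging on _id (the get_next_page helper is illustrative and not part of the benchmarks above):

def get_next_page(last_id=None):
    # Filter by the last seen _id instead of skipping documents,
    # so MongoDB can jump straight to the right place through the index
    query = {'status': 'done'}
    if last_id is not None:
        query['_id'] = {'$gt': last_id}
    page = list(tickets.find(query).sort('_id', 1).limit(ITEMS_PER_PAGE))
    last_id = page[-1]['_id'] if page else None
    return page, last_id  # feed last_id back in to get the following page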

MongoDB and UnitOfWork: love or hate?

One of the most common sources of issues for MongoDB newcomers is the lack of transactions: people have been used to working with transactions for the past 10 years, and their web frameworks probably start, commit and roll back transactions for them whenever something goes wrong. So we are pretty used to web development environments where the problem of writing only a part of our changes is usually solved out of the box.

I noticed that when people first approach MongoDB this behaviour is often taken for granted, and messed up data might arise from code that crashes while creating or updating entities on the database.

No transaction, no party?

To showcase the issue, I’ll try to come up with an example. Suppose you must create users and put them in the Group3, Group5 or Group10 groups randomly. For the sake of the example we came up with the idea of dividing 10 by a random number up to 3, which actually leads to 3, 5 and 10, and to code that randomly fails whenever randint returns 0:

import pymongo, random
c = pymongo.MongoClient()
user_id = c.test.users.insert({'user_name': 'User1'})
group_id = c.test.groups.insert({'user_id': user_id, 'group_name': "Group {}".format(10 / random.randint(0, 3))})
  

I know that this is both a terrible schema design and that using random.choice((3, 5, 10)) would prevent the crash, but it perfectly showcases the issue as it randomly crashes from time to time:

>>> group_id = c.test.groups.insert({'user_id': user_id, 'group_name': "Group {}".format(10 / random.randint(0, 3))})
>>> group_id = c.test.groups.insert({'user_id': user_id, 'group_name': "Group {}".format(10 / random.randint(0, 3))})
>>> group_id = c.test.groups.insert({'user_id': user_id, 'group_name': "Group {}".format(10 / random.randint(0, 3))})
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ZeroDivisionError: integer division or modulo by zero

Now, whenever our group creation fails we end up with a user on the database which is unrelated to any group, which might even cause crashes in other parts of our code that take for granted that each user has a group.

That can be verified by the fact that we should have the same number of users and groups when everything works correctly, which is not the case when our code breaks:

>>> c.test.users.count()
3
>>> c.test.groups.count()
2

As we create the groups after the users, it’s easy to see that whenever the code fails we end up with more users than groups.

This is an issue that is rare to face when working with classic database engines, as you would usually run inside a transaction that gets rolled back for you whenever the code crashes (at least this is how it works on TurboGears when the transaction manager is enabled), and so neither the user nor the group would ever have existed.

Working with a Unit Of Work

The good news is that a similar behaviour can actually be achieved through the UnitOfWork design pattern, which the Ming library provides for MongoDB in Python. When working with a unit of work, all the changes to the database happen together when we flush the unit of work; when something fails we just clear the unit of work and nothing has happened.

To start working with the UnitOfWork we need to declare a unit-of-work-aware database session:

from ming import create_datastore
from ming.odm import ThreadLocalODMSession

session = ThreadLocalODMSession(bind=create_datastore('test'))

Then we can create the models for our data, which represent the User and the Group:

from ming import schema
from ming.odm import FieldProperty
from ming.odm.declarative import MappedClass

class User(MappedClass):
    class __mongometa__:
        session = session
        name = 'users'

    _id = FieldProperty(schema.ObjectId)
    user_name = FieldProperty(schema.String(required=True))


class Group(MappedClass):
    class __mongometa__:
        session = session
        name = 'groups'

    _id = FieldProperty(schema.ObjectId)
    user_id = FieldProperty(schema.ObjectId)
    group_name = FieldProperty(schema.String(required=True))

Now we can finally create our users and groups like we did before; the only major change is that we flush the session at the end and clear it in case of crashes:

import random
try:
    u = User(user_name='User1')
    g = Group(user_id=u._id, group_name="Group {}".format(10 / random.randint(0, 3)))
    session.flush()
finally:
    session.clear()

Now running the same code three times leads to a crash like before:

Traceback (most recent call last):
  File "test.py", line 33, in 
    g = Group(user_id=u._id, group_name="Group {}".format(10 / random.randint(0, 3)))
ZeroDivisionError: integer division or modulo by zero

The major difference is that now, if we look at the count of groups and users, they always coincide:

>>> c.test.users.count()
2
>>> c.test.groups.count()
2

because in case of a failure we clear the unit of work and so we never created the user.

Uh?! But the relation?!

A really interesting thing to note is that we actually created a relation between the User and the Group before either of them even existed. That is explicitly visible in Group(user_id=u._id): we are storing the id of a user that doesn’t even exist yet.

How is that even possible?

The answer is actually in the ObjectId generation algorithm. On MongoDB you would expect object ids to be generated by the database, like in any other database management system, but due to the distributed nature of MongoDB that is actually not required at all. The way the object id is generated ensures that it never collides, even when it is generated on different machines by different processes at different times. That is because the ObjectId itself contains the machine, process and time that generated it.

This allows MongoDB clients to generate the object ids themselves, so when we created the User it actually already had an ObjectId, one provided by Ming itself, even though the object didn’t yet exist on the database.
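
This is easy to verify with the bson library that comes with pymongo, as an ObjectId can be built entirely on the client with no database involved:

from bson import ObjectId

oid = ObjectId()           # generated locally, no round trip to the database
print oid.generation_time  # the creation timestamp embedded inside the id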

This is what makes it possible to fully leverage the power of the unit of work, as otherwise it would be really hard (if not impossible) to handle relations between different objects inside the same unit of work.

UnitOfWork looks great, but it might play bad tricks

Thanks to the convenience of working with such a pattern, it’s easy to see everything as black or white: we get used to thinking that, as we didn’t flush the unit of work yet, nothing happened on the database and so we are safe.

While this is usually true, there are cases where it’s not, and they are actually pretty common cases when leveraging the full power of MongoDB.

MongoDB provides a really powerful feature, the update operators: whenever using them through update or findAndModify we can change the object atomically and in relation to its current state. They are the way to go to avoid race conditions and implement some common patterns in MongoDB, but they do not cope well with a unit of work.

Whenever we issue an update operator we must instantly contact the database and perform the operation, as its result might change depending on the time it’s performed; we cannot queue an update operator in the unit of work and flush it later.
The general rule is to flush the unit of work before performing an update operator, or to perform the update operators before starting the work inside the unit of work, but never to mix the two, otherwise unexpected behaviour is what you are looking for.

What happens if we create a user with count=1 in the unit of work, then we perform an $inc on count, and then we flush the unit of work? You would expect the user to have count=2, but you actually end up with a user with count=1. Why?

If we think for a moment about it, it’s easy to see why.

When we perform the $inc, the unit of work has not been flushed yet, so our user doesn’t exist yet, nor does it have 1 as its count.
Then when we flush the unit of work, the $inc has already happened outside of it, so the unit of work knows nothing about it and actually creates a user with count=1.
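
To make the failure concrete, here is a minimal sketch of that interleaving; it assumes the User model above gained a count = FieldProperty(schema.Int) field and issues the $inc through the raw pymongo client c from the first snippet, since that is what performing an operation outside the unit of work means:

u = User(user_name='User1', count=1)   # only queued in the unit of work

# Raw update outside the unit of work: the document doesn't exist yet,
# so nothing matches and nothing gets incremented.
c.test.users.update({'_id': u._id}, {'$inc': {'count': 1}})

session.flush()  # the user is created only now, with count=1

# Flushing *before* issuing the $inc would have produced count=2.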

UnitOfWork or Not?

In the end, UnitOfWork is a very powerful and convenient pattern when working with MongoDB.

In 99% of the cases it will solve data consistency problems and make our life easier, but pay attention to that 1% that shows up whenever you accidentally start performing operations outside the unit of work, mixing them with operations inside it. That will be the origin of unexpected behaviours that are pretty hard to trace back to their real cause.

As soon as you realise this and start paying attention to what’s going to happen inside your unit of work the few times you need to perform operations outside of it, I’m sure you will live a happy life and enjoy the convenience provided by frameworks like Ming and the UnitOfWork pattern.

TurboGears 2.3 Hidden Gems #3 – Minimal Mode

Recently it was pointed out to me that the Wikipedia TurboGears page was pretty outdated, so I ended up reading and updating a few sections. As that page reported a “2.x Components” section where Pylons was listed, I wasn’t sure what to do, as Pylons has not been a component of TurboGears anymore since 2.3. Should I create a new “2.3+ Components” section? Should I explain that it was used before? Then I realized that many people probably don’t even know that it got replaced, and that gave me the idea of writing a blog post about the TurboGears2 Minimal Mode, which is what replaced most of the Pylons functionalities in 2.3.

For people that didn’t follow TurboGears’ recent evolution, Minimal Mode is a micro-framework-like configuration of TurboGears2 which has been available since version 2.3.

Minimal Mode was actually created as a side effect of TurboGears becoming independent from Pylons while the team was also working on a major framework speed-up. Apart from the speed gain and opening the way to Python 3 through the Pylons removal, Minimal Mode proved itself quite convenient for rapidly prototyping apps, HTTP APIs or small showcase examples. This is particularly visible through the various snippets that got created in the Runnable.io TurboGears category.

The impact of the refactoring that led to minimal mode is clear when comparing the dependencies of a recent TurboGears2 release

[dependency graph of a recent TurboGears2 release]

to the more than forty dependencies of one of the earliest 2.x releases:

[dependency graph of an early TurboGears 2.x release]

The new internal core made it possible to greatly reduce the dependencies to those that were really needed, removing a few and moving some that were only needed when special features were enabled into the application itself.

By default TurboGears2 starts in full-stack mode, expecting the application to reside in a Python package and enabling all the features that are commonly available to TurboGears2 applications. For backward compatibility reasons, minimal mode must be enabled explicitly through the minimal=True option. This ensures that all the apps created before 2.3 continue to work, while the framework can be used as a micro-framework with minimum effort.

Another requirement is that the root controller must be passed explicitly: TurboGears looks for it in the application package, and as we have no package configured it won’t be able to find any.

config = AppConfig(minimal=True, root_controller=RootController())

While tg.devtools and the gearbox quickstart command can continue to be used to create full-stack applications, creating a minimal mode app is as easy as installing the TurboGears2 package itself and creating an application with any root controller (see the TurboGears Documentation for a complete tutorial):

from tg import expose, TGController, AppConfig

class RootController(TGController):
    @expose()
    def index(self):
        return 'Hello World'

config = AppConfig(minimal=True, root_controller=RootController())

application = config.make_wsgi_app()

then the application can be served with any WSGI compatible server:

from wsgiref.simple_server import make_server

print "Serving on port 8080..."
httpd = make_server('', 8080, application)
httpd.serve_forever()

If people need additional features that TurboGears usually provides, like sessions, caching, authentication, database, helpers and so on, they can enable them through a bunch of options available on the AppConfig object, and only in that case are additional dependencies like Beaker or SQLAlchemy required.

The recently released 2.3.7 explicitly worked on making minimal mode easier to use, especially when a database is involved. Before 2.3.7 you were required to have a package for your application from which the model could be imported, but now the model can actually be any object that exposes an init_model function. The TurboGears tutorial actually uses a Bunch object, which is in fact a dictionary.
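
For example, wiring SQLAlchemy into a minimal mode application now looks roughly like the following sketch, which reuses the RootController above (a sketch along the lines of the tutorial, with the Bunch standing in for the model package):

from tg import AppConfig
from tg.util import Bunch
from sqlalchemy.orm import scoped_session, sessionmaker

DBSession = scoped_session(sessionmaker())

def init_model(engine):
    # TurboGears calls this with the engine built from sqlalchemy.url
    DBSession.configure(bind=engine)

config = AppConfig(minimal=True, root_controller=RootController())
config.use_sqlalchemy = True
config['sqlalchemy.url'] = 'sqlite:///devdata.db'
config.model = Bunch(DBSession=DBSession, init_model=init_model)
config.DBSession = DBSession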

After playing around with minimal mode for a bunch of small projects, I realized that while many projects start small, they quickly tend to become big, and being able to switch from a micro-framework to a full-stack framework when needed, without actually changing framework at all, proved to be a valuable feature. One that totally adheres to the TurboGears mission of being “the web-framework that scales with you” 🙂

As not many people, even in the TurboGears community, were comfortable with minimal mode, I hope this post clarified a bit what minimal mode is and how you can benefit from it.

DEPOT 0.1.0 new Aliases Feature for Easy Storage switch

DEPOT is a file storage and retrieval framework we created to solve the need of switching between different storage systems when deploying in different environments. We wanted a unique and cohesive API that made it possible to keep storing files the same way independently from where they were actually stored.

As systems evolve and change over time, we also wanted the ability to switch those storages whenever required without breaking past files or changing any code. That led to various DEPOT features that specifically target this problem, the latest of which is the new Storage Aliases support.

If you used a single storage, myfiles, registered as the default one, switching from local storage to S3 was easy: you could register a new mys3files storage on S3 and make that the new default. Your old files would continue to be served from the old storage and your new files would be uploaded to and served from the new one, because the system knew that some files were saved in myfiles and some on mys3files.
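
In code, that pre-aliases flow looked more or less like this sketch (DepotManager.set_default switches where new files go):

from depot.manager import DepotManager

DepotManager.configure('myfiles', {'depot.storage_path': '/var/www/lfs'})
DepotManager.configure('mys3files', {'depot.backend': 'depot.io.awss3.S3Storage',
                                     'depot.access_key_id': '...'})

# New files get saved on S3, old ones keep being served from disk
DepotManager.set_default('mys3files')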

Now, suppose you want to keep your user avatars on a storage separated from their uploaded content. You could already declare an avatars storage and just force the upload of all your avatar files onto that specific storage:

from sqlalchemy import Column, Integer, Unicode
from depot.fields.sqlalchemy import UploadedFileField

class User(Base):
    __tablename__ = 'users'

    uid = Column(Integer, autoincrement=True, primary_key=True)
    name = Column(Unicode(16), unique=True)

    avatar = Column(UploadedFileField(upload_storage='avatars'))

This would correctly store all your user avatars on the avatars storage. But what happened when you wanted to switch your avatars storage from saving files on disk to saving them on S3?

Because in this case DEPOT knew that all your avatars were on the avatars storage, you had to put your system into maintenance, manually move all the files to S3, switch the avatars storage configuration and then restart your application. This led to downtime and wasn’t very convenient.

With the new Aliases feature you can now declare two different storages:

DepotManager.configure('local_avatars', {'depot.storage_path': '/var/www/lfs'})
DepotManager.configure('s3_avatars', {'depot.backend': 'depot.io.awss3.S3Storage', 'depot.access_key_id': ...})

tell DEPOT that avatars is just an alias for local_avatars:

DepotManager.alias('avatars', 'local_avatars')

and whenever you store a file on avatars it will actually be stored on local_avatars.
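
Storing a file through the alias then works exactly like storing it on a concrete storage; a sketch, assuming a local avatar.png file:

depot = DepotManager.get('avatars')  # the alias resolves to local_avatars
with open('avatar.png', 'rb') as f:
    file_id = depot.create(f, 'avatar.png')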

Now you want to switch to storing your files on S3? Just switch the alias configuration:

DepotManager.alias('avatars', 's3_avatars')

And all your avatars will now be stored on S3, while the old ones continue to be served from disk, as DEPOT knows they are actually on local_avatars.

This is a pretty simple and convenient solution that perfectly solved our need, introducing the ability to evolve your storages forever, as long as you only store files on aliases and never directly on the storages themselves.

TurboGears 2.3 Hidden Gems #2 – Application Wrappers

One of the lesser known features introduced in TurboGears 2.3 are application wrappers.
Application wrappers are much like controller wrappers (available since 2.2), but instead of wrapping controllers they wrap the whole application, providing an easier way to implement what in plain WSGI is done through middleware.

The advantage of application wrappers over middleware is that they have full access to the TurboGears stack: they can access the current request, the database, the session and the caches just as the application itself would.

The great part is that, as they run between TGApp and TGController, they can also replace the TurboGears Context and the TurboGears Response, providing a great way to hijack requests and responses or even replace entire features of the framework, like the cache layer. A very similar concept is available in other frameworks, for example Pyramid Tweens.

A very simple application wrapper that intercepts exceptions and logs them without messing with the standard TurboGears error handling might look like:

import logging
log = logging.getLogger(__name__)

class ErrorLoggingWrapper(object):
    def __init__(self, handler, config):
        self.handler = handler
        self.config = config

    def __call__(self, controller, environ, context):
        path = context.request.path
        try:
            return self.handler(controller, environ, context)
        except:
            log.exception('Error while handling %s', path)
            raise

The wrapper can then be enabled by calling

base_config.register_wrapper(ErrorLoggingWrapper)

inside config/app_cfg.py

Now that we have an application wrapper able to log exceptions, we can decide, for example, to add another one that suppresses exceptions and prints “Something went wrong!”. As it is possible to specify the order of execution for application wrappers, we can register a SuppressErrorsWrapper that executes after the ErrorLoggingWrapper:

from webob import Response

class SuppressErrorsWrapper(object):
    def __init__(self, handler, config):
        self.handler = handler
        self.config = config

    def __call__(self, controller, environ, context):
        try:
            return self.handler(controller, environ, context)
        except:
            return Response('Oh! Oh! Something went wrong!', status=500, content_type='text/plain')

Then it can be registered after the ErrorLoggingWrapper using:

base_config.register_wrapper(SuppressErrorsWrapper, after=ErrorLoggingWrapper)

While application wrappers are a powerful feature, most of their power comes from the new response management refactoring, which makes it possible to access the current context and replace the outgoing response while working with high level objects instead of having to manually cope with WSGI.

TurboGears 2.3 Hidden Gems #1 – New Response Management

TurboGears 2.3 has been a major improvement for the framework: most of its code got rewritten to achieve fewer dependencies, a cleaner codebase, a cleaner API and a faster framework. This resulted in a reduction to only 7 dependencies in minimal mode and a 3x faster codebase.

While those are the core changes of the release, there are a lot of side effects that users can exploit to their benefit. This is the reason why I decided to start this set of posts describing some of those hidden gems and explaining to users how to get the best from the new release.

The first change I’m going to talk about is how the response management got refactored and simplified. While this has some direct benefits, it also provides some interesting side effects that are worth exploring.

How TurboGears on Pylons did it

TurboGears tried to abstract a lot of the response complexity through the tg.response object, and as there were not many reasons to override TGController.__call__, the response object body was in practice always set by TurboGears itself.

Due to the fact that Pylons controllers were somewhat compliant with WSGI itself, the TGController was in charge of calling the start_response function, providing all the headers the user set into tg.response:

response = self._dispatch_call()
 
# Here the response body got set, removed for brevity

if hasattr(response, 'wsgi_response'):
  # Copy the response object into the testing vars if we're testing
  if 'paste.testing_variables' in environ:
      environ['paste.testing_variables']['response'] = response
  if log_debug:
      log.debug("Calling Response object to return WSGI data")

  return response(environ, self.start_response)

While this made sense for Pylons, where you are expected to subclass the controller to perform advanced customizations, it was actually something unexposed to TurboGears users.

TurboGears made it possible to change application behaviour using hooks and controller_wrappers, so subclassing the TGController was strictly related to custom dispatching methods, a need usually better solved by specializing the TGController._dispatch method (tgext.routes is a simple enough example of this).

Cleaning Up Things

This led to a curious situation where the TGController needed to speak with TGApp through WSGI to make Pylons happy, so it needed to call start_response and return the response iterator itself. TGApp was supposed to be the WSGI application, but in fact most of the real work was happening in TGController; in the end we had two WSGI applications, as both TGController and TGApp were callables that spoke WSGI.

The 2.3 rewrite has been a great occasion to solve this ambiguity by providing a clear communication channel between TGController and TGApp and assigning each one a specific responsibility.

Communication Channel

In TG2.3 only the TGApp is now in charge of exposing the WSGI application interface. The TGController is expected to get a TurboGears Request Context object and provide back a TurboGears Response object. The TGApp will then use the provided response object to submit headers and response body.

The TGController code got much more straightforward and the whole testing and call response part was moved to the TGApp itself:

try:
    response = self._perform_call(context)
except HTTPException as httpe:
    response = httpe

# Here the response body got set, removed for brevity

return response

This has been possible without breaking backward compatibility thanks to the fact that the only TGController subclassing common in the TurboGears world was the BaseController class implemented by most applications.

The BaseController usually acts just as a pass-through between TGApp and TGController, setting up some shortcuts to authentication data and other helpers for each request. So the fact that the parameters received by BaseController.__call__ changed didn’t cause a huge issue, as they were just forwarded to TGController.__call__.
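
For reference, the typical BaseController generated by the quickstart is just a few lines (sketched here in simplified form):

from tg import TGController, request, tmpl_context

class BaseController(TGController):
    def __call__(self, environ, context):
        # Shortcuts to the authentication data, set up for each request
        request.identity = request.environ.get('repoze.who.identity')
        tmpl_context.identity = request.identity
        return TGController.__call__(self, environ, context)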

A little side effect

One of the interesting effects of this change is that your controllers are now enabled to return any instance of webob.Response.

In previous versions it was practically only possible to return webob WSGIHTTPException subclasses (as they exposed a wsgi_response property which was consumed by Pylons), so it was possible to return an HTTPFound instance to force a redirect, but it was not possible to return a plain response.

A consequence of the new change is that your controllers are now able to call third party WSGI applications by using tg.request.get_response with a given application. The returned response can be directly provided as the return value of your controller.

This behaviour also makes it easier to write reusable components that don’t need to rely on tg.response and change it: your application can forward the request to them and proxy back the response they return.
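
Here is a short sketch of both ideas; inner_wsgi_app is a stand-in for any third party WSGI application:

from webob import Response
from tg import expose, request, TGController

def inner_wsgi_app(environ, start_response):
    # A trivial third party WSGI application used for the example
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return ['Hello from plain WSGI!']

class RootController(TGController):
    @expose()
    def plain(self):
        # Any webob.Response instance can now be returned directly
        return Response('Hello!', content_type='text/plain')

    @expose()
    def proxied(self):
        # Forward the current request to the inner application
        # and return its response as our own
        return request.get_response(inner_wsgi_app)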

Part #2 will cover Application Wrappers, which greatly benefit from the new response management.

It’s a Pluggable World

One of the new additions in TG2.1.4 has been the support for so-called pluggable applications; this is a really powerful and convenient feature that probably not enough TurboGears users have started embracing.

For people that have never used them, pluggable applications provide a Python package that can be installed and “plugged” inside any existing TurboGears application to add new features. Django has probably been the first framework to bring this feature to the Python world, and the TurboGears implementation tries to be just as convenient by making pluggable applications identical to plain TurboGears applications and by providing a “quickstart-pluggable” command that creates the application skeleton for you. Pluggable applications can be installed using easy_install or pip, and they can of course depend on any other pluggable application they need.

This year, at EuroPython 2012, I had the pleasure of presenting a talk about using TurboGears for rapid prototyping (both in Italian and English; you should be able to find the videos on the EuroPython YouTube channel), so I decided to dedicate a part of it to pluggable applications, as they are actually the fastest way to rapidly prototype a project. To my surprise, most of the questions I received were about the EasyCrudRestController and not about pluggable applications.

While the EasyCrudRestController is definitely a powerful tool, it’s far from being the answer to all web development needs. In most of the applications you are going to develop, users will probably prefer consuming content from something more engaging than an administration table of database entries.

This month, to create a set of utilities that can help people with their everyday needs, I asked the guys that work with me to write every part of the web sites they were developing as pluggable applications. The result of this experiment has been that most of the pluggable apps I wrote in my spare time (tgapp-smallpress, tgapp-photos, tgapp-tgcomments, tgext.tagging and so on) ended up being used in real world projects and started to improve, exposing hooks and ways to customize their behavior for the projects they were going to be used in.

After a few weeks, new pluggables like tgapp-fbauth, tgapp-userprofile, tgapp-calendarevents, tgapp-fbcontest and tgapp-youtubevideo have seen the light, and developing the target application started becoming blazing fast: just plug what you need and customize it.

Embracing this philosophy the last project I’m working on has an app_cfg.py file that looks like:

from tgext.pluggable import plug, replace_template

plug(base_config, 'tgext.debugbar', inventing=True)
plug(base_config, 'tgext.scss')
plug(base_config, 'tgext.browserlimit')
plug(base_config, 'registration')
plug(base_config, 'photos')
plug(base_config, 'smallpress', 'press', form='XXX.lib.forms.ArticleForm')
plug(base_config, 'tgcomments', allow_anonymous=False)
from XXX.lib.matches import MyKindOfEvent
plug(base_config, 'calendarevents', 'eventi', event_types=[MyKindOfEvent()])
replace_template(base_config, 'smallpress.templates.article', 
                              'XXX.templates.press.article')
replace_template(base_config, 'smallpress.templates.excerpt', 
                              'XXX.templates.press.excerpt')

Thanks to this, our development process has really improved: whenever a developer finds a bug he just has to propose a patch for the target pluggable, and whenever someone notices a missing index on a query he just has to add it to the given pluggable. All the websites under development improved as if people were working on the same project.

While existing pluggables might be limited, buggy or slow, I’m getting confident that they will continue to improve and that some day they will surpass whatever custom implementation I could think of. I think I’m going to heavily rely on pluggable applications for any future project, sticking to only one rule: “make it opensource”. This way, apart from probably helping other people, I’m also improving my own projects through other people’s feedback, bug reports and patches to the pluggables I used.

So, next time you have to start a new project, take a look at the TurboGears CogBin and check if there is a pluggable application that looks like what you need. If you find any issue or see space for improvement, just fork it and send a pull request, or send an email to the TurboGears Mailing List; I’ll do my best to address any reported issue, thanking you for your feedback, as you are actually improving any past and future project that relies on that pluggable.

What’s new about Sprox 0.8

Today Sprox 0.8 got released; it is the first release to add ToscaWidgets2 support. Depending on which version of ToscaWidgets is available inside your environment, Sprox will use either TW1 or TW2 to generate its forms.

Being a mostly TW2 oriented release, it might seem that not a lot changed since the previous version, but a little gem is hidden among all the TW2 changes: Sprox now supports setting the default behavior for models themselves using the __sprox__ attribute inside the model declaration.

from sqlalchemy import Column, Integer, String, ForeignKey
from sqlalchemy.orm import relation

class Parent(DeclarativeBase):
    __tablename__ = 'parents'

    uid = Column(Integer, primary_key=True)
    data = Column(String(100))

class Child(DeclarativeBase):
    __tablename__ = 'children'

    class __sprox__(object):
        dropdown_field_names = {'owner': ['data']}

    uid = Column(Integer, primary_key=True)
    data = Column(String(100))

    owner_id = Column(Integer, ForeignKey(Parent.uid))
    owner = relation('Parent')

The previous code example makes Sprox use the Parent data field for selection fields when choosing the parent of Child entities.

Apart from making it easier to share options between your AddRecordForm and EditableForm, the __sprox__ attribute opens a great way to customize the TurboGears admin.

By adding a __sprox__ attribute inside your models you will be able to change the TurboGears admin behavior without having to create a custom admin configuration. Setting the __sprox__ attribute makes it possible to change most Sprox properties, and through them the CrudRestController behavior: the same properties that are documented on sprox.org can be specified inside the __sprox__ attribute by simply removing the underscores.
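
For example, a hypothetical Tag model could tune its admin forms directly from the declaration; omit_fields and require_fields below mirror Sprox’s documented __omit_fields__ and __require_fields__ properties:

from sqlalchemy import Column, Integer, String

class Tag(DeclarativeBase):
    __tablename__ = 'tags'

    class __sprox__(object):
        # Sprox __omit_fields__ / __require_fields__ without the underscores
        omit_fields = ['uid']
        require_fields = ['label']

    uid = Column(Integer, primary_key=True)
    label = Column(String(50))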

TurboGears future performance comparison

Recently, out of curiosity, I decided to run a quick benchmark on the upcoming branches of TurboGears2.

I quickstarted a simple Genshi based application (plain turbogears2 quickstart) and then created a plain controller method without a template, to avoid counting the template generation overhead.

The application has been installed in three virtual environments: one with TG2.1.4, one with the development branch which is going to be TG2.2 and one with the development branch which is going to be TG2.3

The following graph reports the resulting requests/second that my PC has been able to serve on each TurboGears version.

[graph: requests/second served by TG2.1.4, TG2.2 and TG2.3]

I have to admit that I’m quite happy with the results: the growth is steady and TG2.3 seems to be three times faster than the current TurboGears while still being backward compatible (the benchmark application was quickstarted with TG2.1.4 and ran without issues in all three environments).

Mastering the TurboGears EasyCrudRestController

One of the key features of TurboGears2 is the great CRUD extension. Mastering the CRUD extension can really make the difference between spending hours or just a few minutes writing a web app prototype or even a full application.

The CRUD extension provides two main features: the CrudRestController, which is meant to help create totally custom CRUDs, and the EasyCrudRestController, which provides a quick and easy way to create CRUD interfaces.

I’ll focus on the EasyCrudRestController as it is the easiest and most productive one; moving forward to the CrudRestController is quite straightforward once you feel confident with the Easy one.

The target will be to create, in no more than 40 lines of controller code, a full featured photo gallery application with:

  • Multiple Albums
  • Uploads with Thumbnails Generation
  • Authenticated Access, only users in group “photos” will be able to manage photos
  • Contextual Management, manage photos of one album at a time instead of having all photos mixed together in a generic management section

If you don’t already know how to create a new TurboGears project, start by taking a look at the TurboGears Installation for the Impatient guide. Just remember to add tgext.datahelpers to the dependencies inside your project setup.py before running the setup.py develop command.

I’ll start by providing a Gallery and a Photo model. To store the images I’ll use tgext.datahelpers to avoid having to manage the attachments myself. Using datahelpers also provides the advantage of having thumbnail support for free.

from sqlalchemy import Column, Integer, Unicode, ForeignKey
from sqlalchemy.orm import relation, backref
from tgext.datahelpers.fields import Attachment, AttachedImage

class Gallery(DeclarativeBase):
    __tablename__ = 'galleries'

    uid = Column(Integer, autoincrement=True, primary_key=True)
    name = Column(Unicode(100), nullable=False)

class Photo(DeclarativeBase):
    __tablename__ = 'photos'

    uid = Column(Integer, autoincrement=True, primary_key=True)
    name = Column(Unicode(100), nullable=False)
    description = Column(Unicode(2048), nullable=False)
    image = Column(Attachment(AttachedImage))

    author_id = Column(Integer, ForeignKey(model.User.user_id))
    author = relation(model.User, backref=backref('photos'))

    gallery_id = Column(Integer, ForeignKey(Gallery.uid))
    gallery = relation(Gallery, backref=backref('photos', cascade='all, delete-orphan'))

Now, to be able to start using our galleries, we have to provide a place where to view them and a management controller to create and manage them. Viewing them should be quite straightforward: I’ll just retrieve the galleries from the database inside my index method and render them. To access a single gallery I’ll rely on the datahelpers SQLAEntityConverter, which will retrieve the gallery for us, ensuring it exists and is valid. For the management part I’ll create an EasyCrudRestController mounted at /manage_galleries:

from tgext.crud import EasyCrudRestController

class GalleriesController(EasyCrudRestController):
    allow_only = predicates.in_group('photos')
    title = "Manage Galleries"
    model = model.Gallery

    __form_options__ = {
        '__hide_fields__' : ['uid'],
        '__omit_fields__' : ['photos']
    }

class RootController(BaseController):
    manage_galleries = GalleriesController(DBSession)

    @expose('photos.templates.index')
    def index(self, *args, **kw):
        galleries = DBSession.query(Gallery).order_by(Gallery.uid.desc()).all()
        return dict(galleries=galleries)

    @expose('photos.templates.gallery')
    @validate(dict(gallery=SQLAEntityConverter(Gallery)), error_handler=index)
    def gallery(self, gallery):
        return dict(gallery=gallery)

Logging in with a user inside the photos group and accessing the /manage_galleries URL, we will be able to create a new gallery and manage the existing ones.

To configure how the CRUD controller forms should appear and behave, the __form_options__ property of the EasyCrudRestController can be used. This property relies on the same options as Sprox FormBase and customizes both the Edit and Add forms.
The next step is to be able to upload some photos inside our newly created galleries. To do this we will create a new EasyCrudRestController for gallery photos management:

from tgext.crud import EasyCrudRestController
from tw.forms import FileField
from tw.forms.validators import FieldStorageUploadConverter
from webhelpers import html

class PhotosController(EasyCrudRestController):
    allow_only = predicates.in_group('photos')
    title = "Manage Photos"
    model = model.Photo
    keep_params = ['gallery']

    __form_options__ = {
        '__hide_fields__' : ['uid', 'author', 'gallery'],
        '__field_widget_types__' : {'image':FileField},
        '__field_validator_types__' : {'image':FieldStorageUploadConverter},
        '__field_widget_args__' : {'author':{'default':lambda:request.identity['user'].user_id}}
    }

    __table_options__ = {
        '__omit_fields__' : ['uid', 'author_id', 'gallery_id', 'gallery'],
        '__xml_fields__' : ['image'],
        'image': lambda filler,row: html.literal('<img src="%s"/>' % row.image.thumb_url)
    }

Mounting this inside the RootController as manage_photos = PhotosController(DBSession), it will be possible to upload new photos to any gallery. To manage the photos inside the first gallery, for example, we will have to access the /manage_photos?gallery=1 URL.

Each parameter passed to the EasyCrudRestController is used to filter the entries shown in the management table, and the keep_params option provides a way to keep the filter around. This makes it possible to edit the photos of only one gallery at a time instead of having all the photos mixed together. Also, when a new photo is created, it will be created in the current gallery.

The PhotosController got more customization than the GalleriesController: through the __field_widget_types__ and __field_validator_types__ options we force the image field to be a file field, and using __field_widget_args__ we ensure that the newly uploaded photos have the current user as their author.

__table_options__ provides a way to customize the management table. The available options are the same as those of the Sprox TableBase and Sprox TableFiller objects. In this case we hide the database row indexes and the gallery itself: as we are managing the photos of a specific gallery, we probably don’t need to know which gallery the photos belong to. Using __xml_fields__ we also specify that the image field provides HTML and so doesn’t have to be escaped. The image entry forces the table to show the image thumbnail for the image column instead of printing the AttachedImage.__repr__ as it would by default.

At first sight it might look a bit complex, but once you start feeling confident, the CRUD extension makes it possible to create entire applications in just a bunch of lines of code. With just a few lines we created a photo gallery with multiple album support, and we can now focus on the index and gallery templates to make the gallery as pleasant as possible for our visitors.

The complete implementation of the photo gallery is available as a pluggable application on Bitbucket; feel free to use it in your TurboGears projects.