Wednesday, August 28, 2013

Two years with: MongoDB

In this blog series I will discuss some of the languages, frameworks and components that we used building PulseOn in the past two years. 

PulseOn is an educational system that enables personalized learning. It's used by students and teachers in the classroom. PulseOn runs in the browser on a wide range of devices, and has to support large scale deployments. From a technology standpoint, PulseOn is a textbook example of a modern, cloud based, web application.

While we started building PulseOn with a combination of MySQL and a semantic triple store, we now store most of our data in MongoDB. After many years of working with relational databases and object-relational mapping frameworks in Java, this was quite a conceptual leap for us, but one that I would recommend everyone to make.

Why MongoDB
There are a number of reasons why we use MongoDB.
  • Document oriented storage fits a RESTful architecture well
  • Schemaless
  • Reasonably scalable
  • Ease-of-use
Storage in a RESTful architecture
The User Interface of PulseOn is built with HTML5 / JavaScript. Our backend exposes data to clients using RESTful web services. This is interesting because if you look at the data produced and accepted by these services, these are in fact documents. Take a look at the following example; a JSON structure that represents a simple curriculum with learning objectives.
{
  published: true,
  previewLocation: "37289365.jpg",
  definition: "Example",
  learningObjectives: [
    {
      prerequisites: [ ],
      autoComplete: true,
      subObjectives: [
        {
          prerequisites: [ ],
          autoComplete: true,
          subObjectives: [ ],
          title: "example"
        }
      ],
      title: "example"
    },
    {
      prerequisites: [ ],
      autoComplete: true,
      subObjectives: [ ],
      title: "example 2"
    }
  ]
}

The structure is nested: the curriculum contains learning objectives, which may in turn contain sub objectives. From a Java perspective this is fine, and it can easily be represented with classes. In a relational database, however, not so much. Although we could easily map this structure using JPA, we would need multiple joins to actually retrieve it from the database. This is where the performance of even the most trivial selects becomes tricky. And we haven't even started thinking about queries within these nested structures yet.

When using MongoDB instead, we can save this structure without any mapping. In general our Java code defines the data structure as part of a service API. In this example that would be a class Curriculum and a class LearningObjective. These API classes are used internally by our OSGi services, and stored directly (without any mapping) in Mongo. The RESTful resources, based on JAX-RS, produce and consume the exact same types (again without mapping). This makes it extremely natural to work with these "documents".
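As a rough illustration of why no mapping layer is needed, the sketch below walks a curriculum document as a plain JavaScript object and collects every objective title, however deeply nested. The helper name `collectTitles` is made up for this example; the field names come from the JSON structure shown earlier.

```javascript
// Hypothetical helper: collect the titles of all (sub) objectives, depth-first.
function collectTitles(objectives) {
  var titles = [];
  objectives.forEach(function (objective) {
    titles.push(objective.title);
    // subObjectives has the same shape, so the function can simply recurse.
    titles = titles.concat(collectTitles(objective.subObjectives || []));
  });
  return titles;
}

// The curriculum document, in the same shape as it is stored and served.
var curriculum = {
  published: true,
  previewLocation: "37289365.jpg",
  definition: "Example",
  learningObjectives: [
    {
      prerequisites: [], autoComplete: true, title: "example",
      subObjectives: [
        { prerequisites: [], autoComplete: true, subObjectives: [], title: "example" }
      ]
    },
    { prerequisites: [], autoComplete: true, subObjectives: [], title: "example 2" }
  ]
};

console.log(collectTitles(curriculum.learningObjectives));
```

The same object can be handed to a REST client or stored as a document without reshaping it first.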

Schemaless flexibility
A characteristic of MongoDB that has both upsides and downsides is its schemaless nature. Basically, a single collection can contain documents with totally different properties. You have to be careful with this: while MongoDB is happy to work with any document you put in a collection, your Java code probably isn't.

For some specific scenarios, however, it's extremely useful. For example, we store profiling events for anything interesting a user is doing in the system. These events always have a set of fixed properties like the user id, a timestamp and a type. To make these events really useful, most events also store some context information. For example, if a learning object is opened, we want to know which object, the objective it is related to, and the score from our content recommender. Each profiling event type has a different set of context properties. In MongoDB we can still store these different event types in a single collection. This way, we can retrieve different types of events with a single query. Of course, our Java code has to work with untyped data in this case.
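A minimal sketch of what such events might look like, and of one query that spans event types. The event type names, the field names (`userId`, `timestamp`, `context`) and the collection name `events` are assumptions made for this example. The in-memory filter only imitates the single query shape mentioned in the comment.

```javascript
// Two event types sharing the fixed fields but carrying different context.
var events = [
  { userId: "u1", timestamp: 1377684000, type: "objectOpened",
    context: { objectId: "lo-1", objectiveId: "obj-7", recommenderScore: 0.82 } },
  { userId: "u1", timestamp: 1377684060, type: "login",
    context: { device: "tablet" } },
  { userId: "u2", timestamp: 1377684120, type: "objectOpened",
    context: { objectId: "lo-3", objectiveId: "obj-2", recommenderScore: 0.41 } }
];

// In-memory stand-in for: db.events.find({userId: userId, type: {$in: types}})
function findEvents(allEvents, userId, types) {
  return allEvents.filter(function (event) {
    return event.userId === userId && types.indexOf(event.type) !== -1;
  });
}

var found = findEvents(events, "u1", ["objectOpened", "login"]);
found.forEach(function (event) {
  // context is untyped: the reader has to check which properties exist.
  console.log(event.type, Object.keys(event.context));
});
```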

Scalability and failover
MongoDB is not the fastest, nor the most scalable datastore in the world. It is, however, a very good tradeoff between functionality, speed and scalability. Scalability-wise it can't compete with some of the key-value stores out there, but this is really comparing apples and oranges. The biggest difference is that MongoDB offers queries on documents, while most key-value stores don't.

Scaling MongoDB is quite easy. You start by setting up a replica set. Even for someone new to MongoDB this should be quite straightforward. You can horizontally scale reads by just adding more nodes to the replica set. There is a big caveat to be aware of! By default, both read and write operations go through the primary node, which effectively kills horizontal scalability. Writes always go to the primary, but you can define, either at the driver level or per operation, whether reads may use secondaries. This is something you have to think about while developing code: is it OK for a query to read potentially stale data? When this is used correctly it is quite easy to scale. At least it's a lot easier than scaling a relational database.
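One way to make the read preference explicit at the driver level is the connection string. The sketch below only assembles such a string; the host names, database name and replica set name are placeholders, and `buildMongoUri` is a made-up helper. `replicaSet` and `readPreference` are standard MongoDB connection string options, but check the documentation of your driver version to see how it handles them.

```javascript
// The read preference modes MongoDB defines.
var READ_PREFERENCES = ["primary", "primaryPreferred", "secondary",
                        "secondaryPreferred", "nearest"];

// Hypothetical helper that assembles a replica set connection string.
function buildMongoUri(hosts, database, replicaSet, readPreference) {
  if (READ_PREFERENCES.indexOf(readPreference) === -1) {
    throw new Error("Unknown read preference: " + readPreference);
  }
  return "mongodb://" + hosts.join(",") + "/" + database +
         "?replicaSet=" + replicaSet +
         "&readPreference=" + readPreference;
}

// Queries that tolerate slightly stale data may read from secondaries.
var reportingUri = buildMongoUri(
  ["db1.example.com:27017", "db2.example.com:27017", "db3.example.com:27017"],
  "pulseon", "rs0", "secondaryPreferred");

console.log(reportingUri);
```

Code that must always see its own writes would keep the default (`primary`), so in practice an application may end up with more than one configuration.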

Failover is achieved with replica sets as well. We have had several incidents where a server died for some reason (mostly not Mongo's fault), but the replica sets never went down.

Queries and data aggregation
Query functionality is probably the strongest point of MongoDB. Within a single collection, you can do most of what you can do with SQL. In contrast to a relational database there is no concept of relations or joins between collections. In some cases this is a downside (more about this later), but the document oriented structure eliminates most of the reasons for joins. Not having joins makes queries a lot simpler. Queries can use nested properties as well, which is extremely useful in practice.

For data aggregation (e.g. counting and calculating values) you can either use map/reduce or the aggregation framework. The aggregation framework is very easy to use, and it makes a lot more sense than fiddling with HAVING clauses in a relational database. Map/reduce is slightly less convenient to work with, but is a powerful fallback. Every time I work on a complex aggregation question, I'm surprised how easy it is to solve.
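To give an impression of the aggregation framework, the sketch below shows a pipeline that counts events per type for one user, written as plain data the way it could be passed to `db.events.aggregate(...)`. The collection and field names are assumptions. The function below it is only an in-memory imitation of the same three stages, so the example can run without a database.

```javascript
// Aggregation pipeline: filter, group and count, then sort by count.
var eventsPerType = [
  { $match: { userId: "u1" } },
  { $group: { _id: "$type", count: { $sum: 1 } } },
  { $sort: { count: -1 } }
];

// In-memory imitation of the pipeline above, for illustration only.
function countEventsPerType(events, userId) {
  var counts = {};
  events.forEach(function (event) {
    if (event.userId === userId) {
      counts[event.type] = (counts[event.type] || 0) + 1;
    }
  });
  return Object.keys(counts)
    .map(function (type) { return { _id: type, count: counts[type] }; })
    .sort(function (a, b) { return b.count - a.count; });
}

var sample = [
  { userId: "u1", type: "objectOpened" },
  { userId: "u1", type: "objectOpened" },
  { userId: "u1", type: "login" },
  { userId: "u2", type: "login" }
];

console.log(JSON.stringify(countEventsPerType(sample, "u1")));
```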

The lack of joins
MongoDB doesn't support the concept of relations between documents in different collections, and with that, the concept of joins. In many cases where you would use joins in a relational database, you don't need them in MongoDB. Instead of joining data from collections together, you can often simply embed the data in another document. Documents can contain arrays of complex structures, so this can get you quite far, because many of the joins in relational databases are necessary just to work around the lack of this functionality. This is not a solution for all one-to-many or many-to-many problems however. De-normalized data is easy and fast to query, but you need extra code to keep the different copies in sync, which can be quite difficult. In scenarios where multiple documents relate to the same data, we generally choose to link documents from different collections instead of embedding the data.

Links between documents can be created manually, by simply adding a field to a document that contains the id of another document (potentially in another collection). The linked document always has to be retrieved manually (a query can only be performed on a single collection), and you also have to take care of updating documents yourself when their related documents are deleted. The latter is not difficult, but easy to forget. We have had several bugs caused by this, simply because you don't run into it during development.
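The cleanup step that is so easy to forget can be sketched as follows. The collections are in-memory arrays and all names are hypothetical. Whether a link should be cleared, or the linking document deleted as well, depends on your data model.

```javascript
// Hypothetical in-memory collections; posts link to authors by id.
var authors = [{ _id: 1, name: "Alice" }, { _id: 2, name: "Bob" }];
var blogposts = [
  { _id: 10, title: "First", authorId: 1 },
  { _id: 11, title: "Second", authorId: 2 },
  { _id: 12, title: "Third", authorId: 1 }
];

// Delete an author and clear every link that pointed at it.
// A shell equivalent of the second step could be (an assumption):
//   db.blogposts.update({authorId: id}, {$unset: {authorId: ""}}, {multi: true})
function deleteAuthor(authorId) {
  authors = authors.filter(function (author) { return author._id !== authorId; });
  blogposts.forEach(function (post) {
    if (post.authorId === authorId) {
      delete post.authorId;
    }
  });
}

// Report links that point at documents that no longer exist.
function findDanglingPosts() {
  var knownIds = authors.map(function (author) { return author._id; });
  return blogposts.filter(function (post) {
    return post.authorId !== undefined && knownIds.indexOf(post.authorId) === -1;
  });
}

deleteAuthor(1);
console.log(findDanglingPosts().length);
```

A check like `findDanglingPosts` is also useful as a periodic script, since this kind of bug rarely shows up during development.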

When retrieving related documents it is important to do this in an efficient way. Specifically avoid executing too many queries. For example do not do the following (pseudo code):

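var posts = db.blogposts.find();

while (posts.hasNext()) {
   var post = posts.next();
   // One extra query per blog post
   var author = db.authors.findOne({_id: post.authorId});
   // use author data
}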

This would execute a query to retrieve the author data for each blog post, which is obviously very expensive. Instead do something like the following (pseudo code):

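var posts = db.blogposts.find().toArray();
var authorIds = [];

posts.forEach(function (post) {
   authorIds.push(post.authorId);
});
// A single query retrieves all authors at once
var authors = db.authors.find({_id: {$in: authorIds}}).toArray();
// Use author data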

The last example uses only two queries, but you will need to "join" the blogs and authors after retrieving both of them. This sometimes requires a little bit more code, but is mostly very straightforward.
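The "join" step itself can be sketched like this. The arrays stand in for the results of the two queries, and the function name and sample data are made up for the example. Ids are turned into strings so they can serve as object keys, which should also work for ObjectId values.

```javascript
// Index the authors by id once, then attach each author to its posts.
function joinPostsWithAuthors(posts, authors) {
  var authorsById = {};
  authors.forEach(function (author) {
    authorsById[String(author._id)] = author;
  });
  return posts.map(function (post) {
    return {
      title: post.title,
      // null when the linked author no longer exists
      author: authorsById[String(post.authorId)] || null
    };
  });
}

// Stand-ins for the results of the two queries.
var posts = [
  { _id: 10, title: "First", authorId: 1 },
  { _id: 11, title: "Second", authorId: 2 },
  { _id: 12, title: "Orphan", authorId: 3 }
];
var authors = [{ _id: 1, name: "Alice" }, { _id: 2, name: "Bob" }];

joinPostsWithAuthors(posts, authors).forEach(function (row) {
  console.log(row.title, row.author ? row.author.name : "(unknown author)");
});
```

Building the lookup table first keeps the join linear in the number of posts and authors, instead of scanning the author list for every post.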

The lack of transactions
Another big difference between MongoDB and relational databases is that MongoDB doesn't support transactions that span multiple documents; only an update to a single document is atomic. This feels strange in the beginning after using transactions for many years. In practice, it's really not that big of a problem. Think about how many actions in your system really need transactions. In most cases there is really only one document to update at a time (again, the nested document structure helps us here). Of course there are scenarios where transactions really apply; for example the classical banking example where money is transferred from one account to another. In these rare cases you could consider a polyglot storage solution, where a relational database is used for storing the data where transactions apply.

Data migration and scripting
In the MongoDB shell you use JavaScript to work with the database. The great thing about this is that you can script it. For example store a query result in a variable, iterate over it, and perform some operation for each document in the result. This makes it very easy to perform some data migration or some ad-hoc data analysis. In most relational databases you can do the same using proprietary SQL extensions, but using JavaScript just makes things so much easier (if you're a programmer). 
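A small migration could look like the sketch below. The migration itself (adding a default `published` flag and moving a legacy `name` field to `definition`) is invented for the example, as is the collection name `curricula` in the comment. An in-memory array stands in for the query result, so the function can be tried outside the shell.

```javascript
// Hypothetical migration step: give every curriculum an explicit
// "published" flag and move a legacy "name" field to "definition".
function migrateCurriculum(doc) {
  if (doc.published === undefined) {
    doc.published = false;
  }
  if (doc.name !== undefined && doc.definition === undefined) {
    doc.definition = doc.name;
    delete doc.name;
  }
  return doc;
}

// In a recent shell the same function could be applied to a query result:
//   db.curricula.find({published: {$exists: false}}).forEach(function (doc) {
//     db.curricula.replaceOne({_id: doc._id}, migrateCurriculum(doc));
//   });
// Here an in-memory array stands in for the cursor.
var oldDocs = [
  { _id: 1, name: "Math" },
  { _id: 2, definition: "Biology", published: true }
];
var migrated = oldDocs.map(migrateCurriculum);
console.log(JSON.stringify(migrated));
```

Keeping the transformation in a plain function makes it easy to test on a few sample documents before running it against real data.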

Indexes in MongoDB are very similar to indexes in a relational database. You can create indexes on properties and sets of properties. Just like in a relational database, indexes are crucial for query performance. Forget to create them, and performance will be bad.

To figure out if any indexes are missing, MongoDB comes with built-in profiling. When you turn this on, it records slow queries. We often use this in combination with load tests to find any problem areas.

Using MongoDB with OSGi
Our stack is entirely based on OSGi. Using MongoDB from OSGi is trivial. The Java driver is already an OSGi bundle, so you can use it directly. To make things even better we created a MongoDB OSGi component in Amdatu. This component lets you configure a MongoDB driver using Configuration Admin and inject it as an OSGi service. For object mapping we use the MongoJack library, which uses Jackson for the actual mapping. This works well because we also use Jackson for JSON mapping in JAX-RS. You should be able to use most other mapping libraries too.

MongoDB is certainly not a silver bullet that solves every data related problem. Choosing a solution will always be a trade off between capabilities. MongoDB does come out very strong for general (web) use.

Will I use it again?
Definitely, it's a great fit for general use in web applications.