Java, OSGi, Amdatu and a bit of JavaScript: 2013

Saturday, November 30, 2013

Building bndtools projects with Gradle

Headless builds in bnd and bndtools

Bndtools is by far the easiest way to develop OSGi applications. The extremely fast code-save-test development cycle would be reason enough. If you're new to Bndtools, take a look at this video first to see it in action.

Bndtools always generated ANT build files for each project to support "headless" builds, for example on a build server. While ANT seems to be a long-forgotten build tool for some, it actually works quite nicely because the build files are generated and in most cases you don't need to look at them much. That is until you want to add some custom steps in your build. Think about code coverage tools, JavaScript minification etc.

Selecting a better build tool

To start looking at an alternative build tool we first need to define what we need from a build tool:

Building bnd projects without additional setup
Running bnd integration tests
Easily adding custom build steps/plugins/scripts
Integration with bnd dependency management

That last point is important. Bnd already manages our build dependencies in the .bnd files; we don't want to repeat this in a build tool specific way! This makes our selection process easier; we are only looking for a build tool, not for a dependency management tool.

The two most obvious candidates in the Java ecosystem are Maven and Gradle. Let's evaluate both.

Maven is both a build tool and a dependency management tool. Dependency Management works well (although it has many flaws) but as a build tool Maven isn't exactly perfect. Building a standard project with Maven is easy, but adding your own build steps is a quite horrible experience. Even the most trivial script would require you to write (or use) a Maven plugin, which requires a lot of boilerplate steps. This is also the reason that many (non-OSGi) projects are currently migrating from Maven to Gradle. We don't need Maven's dependency management. Bnd already does this for us in a much more natural way for OSGi, using OSGi metadata instead of POM files.

Gradle is a more generic build tool with optional dependency management similar to Maven's. Gradle build files are Groovy scripts with a powerful DSL. It's trivial to add your own build steps and hook into the default life-cycle. Groovy turns out to be an extremely powerful language for writing build scripts, especially when compared to writing ANT scripts in XML or Maven's completely broken plugin development. Another nice bonus is that we can re-use existing ANT plugins directly from Gradle.

Gradle obviously matches our requirements better, so let's take a look at how we can integrate it with bnd!

Integrating Gradle with bnd

Bnd is not only a tool (as most people know it), but also has a powerful Java API. Using this API we can perform builds and test runs from code. This is exactly what Bndtools (which is built on top of bnd) is doing as well. Because Gradle build scripts are Groovy code, we can use the bnd API directly from our build script, instead of launching bnd as an external process.

As an example for this post I have taken the modularity-showcase which is part of our book, and set up a Gradle build. Let's take a look at the build file and walk through it step by step.

Setting up dependencies
Because the build script is using bnd, we need the bnd jar to be on the build script classpath.

Generating settings.gradle
A multi project build in Gradle requires a settings.gradle file, which lists the projects that are part of the build. Since each project in the workspace is a bnd project, we can create a task that creates the settings.gradle file for us. This is exactly what the generatesettings task is doing. If you want to exclude certain projects from the build, you could filter those right there.

Building projects
Because our build script is Groovy code, we can use the bnd API directly. Before we can build anything we have to initialise the workspace.

Next we have to build each individual project. Gradle's DSL uses the subprojects syntax to declare tasks that run on each project. For each project we enable the Java plugin; this gives us a Java compiler. Next we configure our source and target directories to match bndtools defaults.

The next step is an important one. We declare that each compileJava task (the default compile task) depends on the bndbuild task of all its project dependencies. A project dependency is a dependency on another bundle built by another bndtools project.

Next we add junit as a default dependency to the testCompile step, in case someone added junit only as an Eclipse library instead of adding it to the build path. This is common because Eclipse does this automatically.

In the bndbuild step we finally perform the build itself. This is as easy as calling the bndProject.build() method and making sure that errors and warnings are reported correctly.

Testing projects
Bnd's excellent integration testing support can also be used headless, generating junit XML output which can be parsed by most build servers. We can recognise a test project by checking the Test-Cases header in the bnd files. For each test project we simply invoke bndProject.test() which will run the tests and generate the XML files.

There is one thing to be careful with! Do not add an OSGi shell in your integration tests. While this works within the IDE, it will terminate the tests when running in headless mode.

Packaging a release
After performing a headless build it's useful to collect generated jar files and external dependencies, so that we can install them on a server, for example by using Apache ACE. Collecting generated bundles is easy, but how do we collect external dependencies that we need for a deployment?

In bndtools we use .bndrun files to run projects. In these files we specify which bundles (both from the workspace and externally) we want to install. A while ago I created an ANT task in bnd that parses these files and copies all bundles to a directory. From that directory you can easily copy the files to whatever deployment environment you use.

This ANT task is reused in the release task. You could create multiple run configurations and export all of them.

Generating a Gradle wrapper
A convenient Gradle feature is the possibility to generate a wrapper. This will generate a gradlew script (both for Windows and MacOS/Linux systems) that downloads Gradle and runs the build. This way you can run Gradle builds even when a build server doesn't "support" Gradle.

Running the build

Before we can run a build, we first need to generate settings.gradle. Remember to repeat this when you add new projects to the workspace!

> gradle generatesettings

Now we can run our build. Let's run a clean build and integration test.

> gradle clean bndbuild bndtest

Finally we can collect bundles for a deployment.

> gradle release

Multi project and single project builds

With the build file that we have seen we can run both multi project and single project builds. This is a standard Gradle feature and works for our builds as well. For a single project build you simply invoke gradle bndbuild within the project's directory instead of at the workspace level.

Gradle builds in the real world

Setting up a build for a small project is always easy. How does this scale to large projects? For the last few months the PulseOn project that I'm working on has been running on Gradle builds. To give you an idea of size, we generate ~300 bundles and run ~1500 integration tests each build. This runs on a Bamboo server for each feature branch in Git. Our code base includes both Java and Groovy code.

Our builds also include code coverage metrics based on Clover, JavaScript optimisation using Require.js and the Closure compiler and Sonar metrics. This shows that Gradle works perfectly together with bnd to perform large, real-life builds.

Sunday, November 3, 2013

Visualizing OSGi services with D3.js

Because I couldn't resist writing a bit of code during my vacation, I started playing with D3.js, a data visualization library for JavaScript, and used it to visualize dependencies between OSGi services. If you are already familiar with designing systems based on OSGi services you might just want to take a look at the video directly. If you need a little more introduction, continue reading.

We are using OSGi services for pretty much everything; all our code lives within services. Services are the primary tool when implementing a modular architecture. By just using services, your code isn't necessarily modular yet however. A lot of thought has to go into the design of service interfaces and dependencies between services.

In a services based architecture it's obviously a good thing to re-use existing services when implementing new functionality. At the same time it's important to be careful with creating too many outgoing dependencies. If everything depends on everything else, it becomes very difficult to make any changes to the code (also known as "not modular"...). When implementing a system you will start to notice that some services are re-used by many other services. Although not an official name at all, I often call them core services: services that play a central role in your system. These services must be kept stable because major changes to their interfaces require a lot of work. Outgoing dependencies from these services should also be used with care. To guarantee stability of the service, it's good to keep the number of dependencies low. This prevents a ripple effect of changes when touching something. In practice, a core service should only depend on other services which are very stable. Do not depend on services that are likely to change often.

Many other services, however, might not be used by any other services at all, for example a scheduled job or a RESTful webservice that only serves a specific part of a user interface. These services can easily be replaced or even be discarded when no longer needed. In an agile environment this happens all the time. For these services it's not really a problem to depend on other services, and certainly not on the core services of the system.

If your architecture is sound, you probably have a very clear idea about which services are your core services. Still, it's useful to actually visualize this to identify any services that have more incoming dependencies than you expected, or at least see which other services have a dependency on a certain service. And that's exactly what I did for this experiment.

We use Apache Felix Dependency Manager to create services and dependencies between them. Because of this I used the Apache Felix Dependency Manager API to create the list of dependencies between services. Note that this will not show services that are not created by Dependency Manager. The visual graph itself is created by D3.js based on this example.

The code is available on BitBucket: https://bitbucket.org/paul_bakker/dependency-graph/overview

Saturday, September 7, 2013

Modularity at JavaOne

JavaOne is just a few weeks away, so it's time for a preview post! The road to JavaOne started last week for me with an interview by Stephen Chin in his Night Hacking show. We have been chatting about my sessions and the book by Bert Ertman and me that will be published right before JavaOne.

This year I will have four sessions, all around the modularity theme.

Building Modular Cloud Applications in Java: Lessons Learned [CON2020]

The first talk, together with Bert Ertman, is about our experiences with modularity in the past few years with building "cloud" applications. The talk is not so much about the cloud itself, but about building modern (web) applications using a modular approach in the "cloud era". We will discuss modular architecture based on OSGi, practical examples of leveraging OSGi services and cloud deployments. If you're interested in how modularity works in practice, this talk is for you.

Tutorial: Building Modular Enterprise Applications in the Cloud Age [TUT1978]
In this two hour talk, Bert Ertman and I are not just talking about building OSGi based cloud applications, we are going to actually show you. OSGi has had a name of being complex and hard to use, but with today's tools and frameworks this is far from true. During this talk we will show how to build modular applications based on dynamic OSGi services by implementing an application from scratch. If you are relatively new to OSGi, this is the perfect introduction to developing with OSGi. You will learn about modular architecture, package imports/exports, versioning and dynamic services, and we will show how to interact with MongoDB and create RESTful web services. Although the application that we will build during the talk is obviously small, it uses the exact same architecture and patterns that we use in large scale applications.

Modularity in the Cloud: A Case Study [CON2775]
While the previous two talks are mostly about building modular applications, this talk is focused on deploying modular applications. "Cloud" features such as auto scaling and automated failover sound great, but actually setting this up for large deployments isn't trivial. In this talk, Marcel Offermans and I will share how we have been deploying a large scale OSGi application to the cloud in the past two years. We will also cover baselining and versioning of OSGi bundles, and applying modular deployments. The central component in our approach is Apache ACE, a provisioning server that fits cloud deployments perfectly. If you are either using OSGi and wonder about ways to deploy applications, or are facing the general challenge of deploying to the cloud, this talk is for you.

Modular JavaScript [CON2959]
Me giving a talk about JavaScript at JavaOne. I didn't see that one coming! A few years ago I would probably have laughed at the idea, but a lot has changed since that time. JavaScript is now a major part of the development I'm involved in. With that, I also started to notice the same problems with JavaScript that we have with Java: how do you keep a large code base maintainable? In this talk, Sander Mak and I will discuss possibilities to introduce modularity in JavaScript. We will be looking at module systems available in JavaScript, and will give detailed examples of RequireJS. Next we will discuss modularity at a higher level that requires services and dependency injection. We will give some examples based on AngularJS. We will also show how all of this fits a modular Java backend, so that we achieve modularity end-to-end.

Book signing session
Bert and I will have a book signing session during the JavaOne week. Rumour has it that there will be some free book giveaways... I will tweet the exact time and location when I have the details.

Wednesday, August 28, 2013

Two years with: MongoDB

In this blog series I will discuss some of the languages, frameworks and components that we used building PulseOn in the past two years.

PulseOn is an educational system that enables personalized learning. It's used by students and teachers in the classroom. PulseOn runs in the browser on a wide range of devices, and has to support large scale deployments. From a technology standpoint, PulseOn is a textbook example of a modern, cloud based, web application.

Introduction
While we started building PulseOn with a combination of MySQL and a semantic triple store, we now store most of our data in MongoDB. Having many years of experience with relational databases and object relational mapping frameworks in Java, this was quite a conceptual leap, but one that I would recommend everyone to make.

Why MongoDB
There are a number of reasons why we use MongoDB.

Document oriented storage fits a RESTful architecture well
Schemaless
Reasonably scalable
Ease-of-use

Storage in a RESTful architecture

The User Interface of PulseOn is built with HTML5 / JavaScript. Our backend exposes data to clients using RESTful web services. This is interesting because if you look at the data produced and accepted by these services, these are in fact documents. Take a look at the following example; a JSON structure that represents a simple curriculum with learning objectives.

{
  "published": true,
  "previewLocation": "37289365.jpg",
  "definition": "Example",
  "learningObjectives": [
    {
      "prerequisites": [],
      "autoComplete": true,
      "subObjectives": [
        {
          "prerequisites": [],
          "autoComplete": true,
          "subObjectives": [],
          "title": "example"
        }
      ],
      "title": "example"
    },
    {
      "prerequisites": [],
      "autoComplete": true,
      "subObjectives": [],
      "title": "example 2"
    }
  ]
}

The structure is nested. The curriculum contains learning objectives, which may contain sub objectives. From a Java perspective, this is fine, and can easily be represented with classes. In a relational database however, not so much. Although we could easily map this structure using JPA, we would require multiple joins to actually retrieve the structure from the database. This is where performance of even the most trivial selects becomes tricky. And we haven't even started thinking about queries in these nested structures yet...

When using MongoDB instead, we can save this structure without any mapping. In general our Java code defines the data structure as part of a service API. In this example that would be a class Curriculum and a class LearningObjective. These API classes are used internally by our OSGi services, and stored directly (without any mapping) in Mongo. The RESTful resources, based on JAX-RS, produce and consume the exact same types (again without mapping). This makes it extremely natural to work with these "documents".
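Because the document keeps its nested shape all the way from storage to the client, code that works on it can simply follow the nesting. Below is a minimal sketch in plain JavaScript, not our actual code: the function name collectTitles and the trimmed-down sample curriculum are made up for illustration. It walks the learning objectives and their sub objectives recursively.

```javascript
// Hypothetical helper: collects the titles of all learning objectives,
// including nested sub objectives, in depth-first order.
function collectTitles(objectives) {
  let titles = [];
  objectives.forEach(function (objective) {
    titles.push(objective.title);
    // Recurse into the nested sub objectives (if any)
    titles = titles.concat(collectTitles(objective.subObjectives || []));
  });
  return titles;
}

// A trimmed-down curriculum document with made-up titles
const sampleCurriculum = {
  definition: "Example",
  learningObjectives: [
    {
      title: "example",
      subObjectives: [{ title: "nested example", subObjectives: [] }]
    },
    { title: "example 2", subObjectives: [] }
  ]
};

const allTitles = collectTitles(sampleCurriculum.learningObjectives);
console.log(allTitles);
```

No joins or mapping layer are involved; the recursion mirrors the structure of the document itself.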

Schemaless flexibility
A characteristic of MongoDB that has both upsides and downsides is its schemaless nature. Basically a single collection can contain documents with totally different properties. First of all, you have to be careful with this. While MongoDB is happy to work with any document you put in a collection, your Java code probably isn't.

This is a characteristic that you should use with great care. At the same time, for some specific scenarios it's extremely useful. For example we store profiling events for anything interesting a user is doing in the system. These events always have a set of fixed properties like the user id, a timestamp and type. To make these events really useful, most events also store some context information. For example, if a learning object is opened, we want to know which object, the objective it is related to, and the score from our content recommender. Each profiling event type has a different set of context properties. In MongoDB we can still store these different event types in a single collection. This way, we can query different types of events within a single query. Of course, our Java code has to work with untyped data in this case.
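Since events of different types share one collection, the code that reads them has to treat the context as untyped data and check for properties before using them. The following is a minimal sketch in plain JavaScript over an in-memory array; the event type names, the context property names and both helper functions are made up for illustration and are not taken from PulseOn.

```javascript
// Fixed properties every profiling event is expected to have
const FIXED_PROPERTIES = ["userId", "timestamp", "type"];

// Returns true when all fixed properties are present on the event
function hasFixedProperties(event) {
  return FIXED_PROPERTIES.every(function (name) {
    return event[name] !== undefined;
  });
}

// Reads a context property defensively, falling back to a default
// when this event type does not carry the property
function contextValue(event, name, fallback) {
  const context = event.context || {};
  return Object.prototype.hasOwnProperty.call(context, name) ? context[name] : fallback;
}

// Two event types with different context properties (hypothetical data)
const storedEvents = [
  { userId: "u1", timestamp: 1377700000000, type: "OBJECT_OPENED",
    context: { objectId: "o1", objectiveId: "lo1", recommenderScore: 0.8 } },
  { userId: "u1", timestamp: 1377700060000, type: "LOGIN",
    context: { device: "tablet" } }
];

// One pass over mixed event types, similar to a single query on one collection
const eventsOfUser = storedEvents.filter(function (event) {
  return hasFixedProperties(event) && event.userId === "u1";
});
const recommenderScores = eventsOfUser.map(function (event) {
  return contextValue(event, "recommenderScore", null);
});
console.log(recommenderScores);
```

The fixed properties can be relied on for filtering, while anything in the context is read with a fallback.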

Scalability and failover
MongoDB is not the fastest, nor the most scalable datastore in the world. It is, however, a very good tradeoff between functionality, speed and scalability. Scalability wise it can't compete with some of the key-value stores out there, but this is really comparing apples and oranges. The biggest difference is that MongoDB offers queries on documents, while key-value stores don't support this in most cases.

Scaling MongoDB is quite easy. You start by setting up a replica set. Even for someone new to MongoDB this should be quite straightforward. You can horizontally scale the cluster by just adding more nodes to the replica set. There is a big caveat to be aware of! By default, both read and write operations have to go through the primary node. This effectively kills horizontal scalability. You can define, either at the driver level or at the operation level, whether secondaries may be used for reads. This is something you have to think about while developing code: is it OK for a query to read potentially stale data? When this is used correctly it is quite easy to scale. At least it's a lot easier than scaling a relational database.

Failover is achieved with replica sets as well. We have had several incidents where a server died for some reason (mostly not Mongo's fault), but the replica sets never went down.

Queries and data aggregation
Query functionality is probably the strongest point of MongoDB. Basically, within a single collection you can do most of what you can do with SQL. In contrast to a relational database there is no concept of relations or joins between collections. In some cases this is a downside (more about this later), but the document oriented structure eliminates most of the reasons for joins. Not having joins makes queries a lot simpler. Queries can use nested properties as well, which is extremely useful in practice.

For data aggregation (e.g. counting and calculating values) you can either use map/reduce or the aggregation framework. The aggregation framework is very easy to use, and it makes a lot more sense than fiddling with HAVING clauses in a relational database. Map/reduce is slightly less convenient to work with, but is a powerful fallback. Every time I'm working with complex aggregation questions I'm surprised how easy it is to solve those.
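To make "counting and calculating values" concrete, here is what a simple group-and-count does, written as a plain JavaScript sketch over an in-memory array. In MongoDB itself the same kind of result would come from the aggregation framework's $group stage with a $sum accumulator, executed on the server. The function countByField and the sample events are hypothetical.

```javascript
// Hypothetical helper: groups documents by the value of one field and
// counts how many documents fall into each group.
function countByField(documents, field) {
  const counts = {};
  documents.forEach(function (doc) {
    const key = doc[field];
    counts[key] = (counts[key] || 0) + 1;
  });
  return counts;
}

// Made-up profiling events
const sampleEvents = [
  { userId: "u1", type: "OBJECT_OPENED" },
  { userId: "u2", type: "OBJECT_OPENED" },
  { userId: "u1", type: "LOGIN" }
];

const countsByType = countByField(sampleEvents, "type");
console.log(countsByType);
```

Doing this in application code means loading every document first, which is exactly the work that server-side aggregation avoids.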

The lack of joins
MongoDB doesn't support the concept of relations between documents in different collections, and with that, the concept of joins. In many cases where you would use joins in a relational database, you don't need them in MongoDB. Instead of joining data from collections together, you can often simply embed the data in another document. Documents can contain arrays of complex structures, so this can get you quite far, because many of the joins in relational databases are necessary just to work around the lack of this functionality. This is not a solution for all one-to-many or many-to-many problems however. The de-normalization of data is easy and fast while performing queries, but you need extra code to keep different documents in sync, which can be quite difficult. In scenarios where multiple documents relate to the same data, we generally choose to link documents from different collections instead of embedding the data.

Links between documents can be created manually. This can be done by simply adding a field to a document that contains the id of another document (potentially in another collection). The linked document always has to be retrieved manually (a query can only be performed on a single collection), and you also have to update documents yourself when the documents they link to are deleted. The latter is not difficult, but easy to forget. We have had several bugs that were caused by this, simply because you don't run into it during development.
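Because nothing in the database enforces these links, a check for dangling references can help catch forgotten cleanups before users do. This is a minimal sketch in plain JavaScript over in-memory arrays, using hypothetical blog post and author documents; the function name findDanglingLinks and the field name authorId are assumptions for illustration, and a real check would first load the documents from both collections.

```javascript
// Hypothetical helper: returns the documents whose link field points to
// an id that no longer exists among the target documents.
function findDanglingLinks(documents, linkField, targets) {
  // Ids are compared in string form, so the same code works for
  // plain string ids and for id objects with a stable string form
  const existingIds = new Set(targets.map(function (target) {
    return String(target._id);
  }));
  return documents.filter(function (doc) {
    return !existingIds.has(String(doc[linkField]));
  });
}

const linkedPosts = [
  { _id: "p1", authorId: "a1" },
  { _id: "p2", authorId: "a2" }
];
// Author "a2" was deleted, but post "p2" still links to it
const remainingAuthors = [{ _id: "a1", name: "Paul" }];

const danglingPosts = findDanglingLinks(linkedPosts, "authorId", remainingAuthors);
console.log(danglingPosts.map(function (doc) { return doc._id; }));
```

Running a check like this as part of a test suite or a maintenance script makes the problem visible during development, rather than only in production data.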

When retrieving related documents it is important to do this in an efficient way. Specifically, avoid executing too many queries. For example, do not do the following (pseudo code):

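var posts = db.blogposts.find();

while (posts.hasNext()) {
   var post = posts.next();

   // One extra query for every single blog post
   var author = db.authors.findOne({_id: post.authorId});

   // use author data
}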

This would execute a query to retrieve the author data for each blog post, which is obviously very expensive. Instead do something like the following (pseudo code):

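// First query: all blog posts, kept in an array for later use
var posts = db.blogposts.find().toArray();
var authorIds = [];

posts.forEach(function(post) {
   // Only collect the ids here; no query inside the loop
   authorIds.push(post.authorId);
});

// Second query: all referenced authors at once
var authors = db.authors.find({_id: {$in: authorIds}}).toArray();
// Use author data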

The last example uses only two queries, but you will need to "join" the blogs and authors after retrieving both of them. This sometimes requires a little bit more code, but is mostly very straightforward.
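The in-memory "join" can be sketched in plain JavaScript as follows. This is a minimal sketch, assuming both query results have already been turned into arrays; the function name joinPostsWithAuthors and the sample documents are hypothetical, and ids are shown as plain strings for readability.

```javascript
// Hypothetical helper: attaches the matching author document to each post.
// Assumes `posts` and `authors` are plain arrays of documents.
function joinPostsWithAuthors(posts, authors) {
  // Index the authors by id once, so each lookup is a single map access.
  // The string form of the id is used as key, because a Map compares
  // object keys by identity rather than by value.
  const authorsById = new Map();
  authors.forEach(function (author) {
    authorsById.set(String(author._id), author);
  });

  return posts.map(function (post) {
    const key = String(post.authorId);
    // A post whose author no longer exists ends up with author: null
    const author = authorsById.has(key) ? authorsById.get(key) : null;
    return { title: post.title, author: author };
  });
}

// Sample data (made up for illustration)
const samplePosts = [
  { _id: "p1", title: "Gradle and bnd", authorId: "a1" },
  { _id: "p2", title: "Two years with MongoDB", authorId: "a2" },
  { _id: "p3", title: "Orphaned post", authorId: "a9" }
];
const sampleAuthors = [
  { _id: "a1", name: "Paul" },
  { _id: "a2", name: "Bert" }
];

const joinedPosts = joinPostsWithAuthors(samplePosts, sampleAuthors);
console.log(joinedPosts);
```

Building the lookup map first keeps the join linear in the number of posts and authors, instead of searching the author list again for every post.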

The lack of transactions
Another big difference between MongoDB and relational databases is the fact that MongoDB doesn't support transactions. This feels strange in the beginning after using transactions for many years. In practice, it's really not that big of a problem. Think about how many actions in your system really need transactions. In most cases there is really only one document to update at a time (again, the nested document structure helps us here). Of course there are scenarios where transactions really apply; for example the classical banking example where money is transferred from one account to another. In these rare cases you could consider a polyglot storage solution, where a relational database is used for storing the data where transactions apply.

Data migration and scripting
In the MongoDB shell you use JavaScript to work with the database. The great thing about this is that you can script it. For example store a query result in a variable, iterate over it, and perform some operation for each document in the result. This makes it very easy to perform some data migration or some ad-hoc data analysis. In most relational databases you can do the same using proprietary SQL extensions, but using JavaScript just makes things so much easier (if you're a programmer).

Indexes
Indexes in MongoDB are very similar to indexes in a relational database. You can create indexes on properties and sets of properties. Just like in a relational database, indexes are crucial for query performance. Forget to create them, and performance will be bad.

To figure out if any indexes are missing, MongoDB comes with built in profiling. When you turn this on, it collects slow queries. We often use this in combination with load tests to find any problem areas.

Using MongoDB with OSGi
Our stack is entirely based on OSGi. Using MongoDB from OSGi is trivial. The Java driver is already an OSGi bundle, so you can use it directly. To make things even better we created a MongoDB OSGi component in Amdatu. This component lets you configure a MongoDB driver using Configuration Admin and inject it as an OSGi service. For object mapping we use the MongoJack library, which uses Jackson for the actual mapping. This works well because we use Jackson for JSON mapping in JAX-RS as well. You should be able to use most other mapping libraries as well.

Conclusion
MongoDB is certainly not a silver bullet that solves every data related problem. Choosing a solution will always be a trade off between capabilities. MongoDB does come out very strong for general (web) use.

Will I use it again?
Definitely, it's a great fit for general use in web applications.
