Scaling up the Continuous Integration infrastructure for Eclipse Foundation’s projects

Mikaël Barbero

2018-04-27

TL;DR
Projects hosted by the Eclipse Foundation will soon benefit from a brand new enterprise-grade continuous integration (CI) infrastructure. The expected improvements are resiliency, scalability and nimbleness. We are making this move with tremendous support from our friends at CloudBees and Red Hat, with their respective products CloudBees Jenkins Enterprise and OpenShift Container Platform.

>>Servers — https://www.flickr.com/photos/efandorin/ CC BY-NC-ND 2.0

A decade ago or so, the Eclipse Foundation started the continuous integration (CI) as a Service adventure by providing a single, shared Hudson instance to its projects. It was an immediate success. It helped projects get more frequent integration builds and more stable releases. Despite its success, the solution had a lot of drawbacks. First, we had to find a set of plugins that suited everybody and worked well together. We ended up installing only the common denominator, which was frustrating for projects that wanted to use additional plugins. Second, finding a time window for maintenance operations became harder and harder. With so many projects, utilization was close to 24/7. Updating or installing new projects required a lot of coordination overhead. Third, a victim of its own success, the shared instance was sometimes unstable, and any downtime affected all projects. Finally, while this model scaled pretty well in terms of computing resources (we could add more agents easily), resources were shared and it was easy for a project to (unintentionally) starve others.

In late 2011, the Common Build Infrastructure (CBI) initiative ramped up. It had 3 goals:

Make it really easy to copy and modify the source
Make it really easy to build and test the modifications
Make it really easy to contribute a change

With these goals set, it soon became clear that the single Hudson instance was not a good fit. To reach these goals, a project needs to build and test each and every Gerrit review / pull request it gets from the community. As such, build jobs run more often, automatically, and not necessarily under the control of the project team (a build can be triggered by a new contribution from someone external to the team). Finding a time window for maintenance would become impossible. Projects also wanted to deploy their build results automatically. This was not possible on the shared Hudson instance, as it would have meant sharing project-specific credentials with all other projects: highly undesirable.

Hence, the Eclipse Foundation started to deliver one Hudson Instance Per Project (HIPP). It started slowly, with a couple of projects migrated away from the crowded shared instance. But this again became a big success. By the end of 2016, about 150 instances were running. The setup was more stable, gave projects the freedom to install the plugins they wanted, and let them run isolated (from a credential point of view) from other projects. In the same year, Hudson development had stalled and it was not wise to continue using it. Thus, we initiated a huge effort to migrate all of our Hudson instances to Jenkins, turning each HIPP into a JIPP. Kudos to Frederic Gurr, who led this effort, which ended in March 2018.

Despite this migration, there was still something fishy about the solution: all instances run colocated on a dozen beefy bare metal servers. This was not an issue at the beginning, but the more JIPPs we added to the farm, the more the builds of one project were affected by others on the same machine. For instance, a build could take up to 5 times longer depending on the global load. Moreover, we now have about 200 JIPPs. Maintaining all of this requires a lot of time. With the creation of Jakarta EE and the move of all Java EE reference implementations, this number will skyrocket…

It’s now time to scale up this setup and make it more efficient. We need to make better use of our hardware and be able to add interim cloud resources when needed. We need something where each project’s resource consumption is isolated from that of the others. We need to be able to update Jenkins masters and to install/update Jenkins plugins in batch. We need to give projects more flexibility by letting them build their code in containers, so that they control the build environment. We need a solution where resilience is built-in.

We’ve studied a couple of options that would offer all of this. The first conclusion of this study was that we need to run our system on top of a Kubernetes cluster. There are a couple of cluster orchestration systems out there, some more mature than Kubernetes. But we can’t ignore the momentum Kubernetes has these days, and we bet that it’s a future-proof solution to build on. Kubernetes offers everything we need for scalability and resiliency. The downside is that we need to run it on-premises, and that can be quite overwhelming. That’s why we decided to run this cluster with Red Hat’s distribution of Kubernetes: Red Hat OpenShift Container Platform. Among other things, it provides a rock-solid, opinionated setup of Kubernetes, which is very reassuring when you are just starting with these technologies.

The second conclusion of our study was that we needed an orchestrator for all our Jenkins masters. We currently manage our 200 JIPPs with a lot of heterogeneous scripts and tools. We need a more integrated solution. CloudBees Jenkins Enterprise (CJE) provides exactly that in the form of the so-called CloudBees Jenkins Operations Center. After some testing and demos from the CloudBees team, and given that the new version 2.0 of CJE runs on top of Kubernetes, it was a no-brainer: we needed this tool.

>>Touching the sky — by Samuel Zeller on Unsplash

The good news is that setting up this whole new environment has already started. OpenShift is already running on our hardware, and we plan to have CJE running by the end of May. We don’t expect much disruption, and most projects won’t need to change anything in their build settings.

Starting in a couple of weeks, all new projects will get a CJE JIPP instead of a regular JIPP. Soon after, we will start migrating existing JIPPs by calling for volunteer guinea pig projects. Once this is done and we are confident in the process, we will gradually ramp up the migration and move all remaining projects over to CJE. There is no set timeline, but we aim to move most projects to CJE before the end of the year.

We are starting a FAQ about the migration process and it will be shared very soon on the Eclipse Foundation cross-projects mailing list as well as on the CBI mailing list. We will also announce progress and milestones on these lists. Stay tuned!

Originally published at mikael-barbero.medium.com