In this video, I'll share how my team at Philips scaled our GitHub self-hosted runners in the cloud to enable CI/CD for the enterprise (similar to how GitHub-hosted runners are available for open source projects). We chose GitHub as our standard tool for software development due to its versatility and support for innersource collaboration. However, to support different operating systems and architectures, and connect to our internal network, we needed self-hosted runners.
To address this, we built a cloud-based solution that automatically scales up and down based on workload. Compared to maintaining physical servers, this approach proved both cost-effective and sustainable. The solution is open source and easy to deploy using Terraform and GitHub Actions for automation.
While we faced some challenges during scaling, the benefits of self-hosted runners in the cloud far outweighed them. If you need control over your network, hardware, or software, self-hosted runners are a great choice. However, for simpler workflows, hosted runners might be more practical. In the end, our developers found the solution seamless, allowing them to focus on coding rather than worrying about infrastructure.
In this video Guide, you will learn:
How Philips' software development process operates and why control over systems is crucial.
How self-hosted runners, along with GitHub Actions, helped to transform and streamline our processes.
How to create scalable, cloud-based solutions that can efficiently manage resources and control costs.
Niek Palm: [00:00:07]
Hi, my name is Niek Palm. I work for Philips as a principal engineer in the Software Center of Excellence, improving the lives of our developers by giving them better tools to work faster together.
One of the things we introduced over the last year was GitHub, with GitHub Actions. To be able to use GitHub Actions, we need self-hosted runners. Today, I will explain how we make self-hosted runners work for our company, on demand and at scale.
Philips, a company that is already older than [inaudible] years, has made a lot of things over the years. Maybe you know the company from lights, televisions, or maybe even the air fryer. But the company has changed over the last years. We've become a health tech company, a company building health technology to improve the lives of people. And an important part of that technology is, of course, software.
Software is typically at the heart of all those kinds of technologies. And the software is built by over 8,000 software professionals across the globe. They're sitting in many different business units and coming from acquired companies, so that already creates a very diverse landscape. They build all kinds of technologies for all kinds of target platforms. It could be running in the cloud, or maybe it's an embedded AI algorithm. It's very diverse, and that's all in a regulated environment. Our software is medical, so we have to adhere to regulations like laws and standards and so on. And it's not always easy. Those regulations also apply to the tools that we are using, so it typically ends up not being easy to introduce new tooling. But we have proven with the introduction of GitHub that it's certainly possible to use modern tools in a regulated environment.
So that brings me to the last point here: we have already built quite a lot of code. We are not new in this field. We have been doing this for 30 years already. So we have a whole lot of lines of code, sitting in all kinds of code systems, and that makes it hard to work together. And this is what we are changing. That's the reason why we chose GitHub.
So GitHub is today our standard tool for doing software development. And we are moving a lot of code out of our legacy systems to GitHub. And we do that in a model that we call InnerSource. InnerSource is something like open source, but open source is out in the public and InnerSource is safe inside the boundaries of your company. It's a way to work together.
When you do software development, of course, I think we all know that we need CI/CD, and we are choosing GitHub Actions to build, test, and deploy our software.
So GitHub Actions is a nice way of doing your CI/CD, and maybe important to note here is that you can do much more with GitHub Actions. You can automate anything in the GitHub ecosystem.
But at the moment a task is running a CI/CD job, it needs to run somewhere. You need some kind of compute for that, and for that compute you can use GitHub-hosted runners. There's nothing that you have to do; you only say that it runs over there. But there's a catch with that. If you use a GitHub-hosted runner, you cannot hook into your company network, for example. This is important for us. We have a lot of systems running inside our own network that need to be reachable from CI/CD. You could think about code quality systems, security scans, document generation, and so on. So that's the reason why we chose self-hosted runners.
And in the model of self-hosted runners, you run the GitHub agent somewhere on your systems, or in our case, many self-hosted runners somewhere on our systems. There are a lot of good reasons to choose self-hosted runners. It could be that you want to choose your own hardware, that you want to define something for GPUs, or define software. In general, you could say that if you need some level of control, for whatever it is (cost or security, hardware, software), self-hosted runners are maybe the way to go.
So we chose self-hosted runners. And then we started thinking: okay, how can we make that easy for our developers? Can we make it as simple as what they are used to with the GitHub-hosted runner? For the GitHub-hosted runner, the only thing that developers have to do is define a tag in the workflow, or label, as it is called.
And then the job will run somewhere on a GitHub-hosted runner, and it's free for open source. If you have private repositories or internal ones, then you pay for it, of course. You can also use self-hosted runners, and then you define a label, something like self-hosted, but then you have to have a runner somewhere yourself.
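To illustrate how labels route a job, here is a minimal sketch of the matching rule: a job is eligible for a self-hosted runner that carries every label the job asks for. The `Runner` type, the function names, and the runner names are hypothetical and exist only for illustration; they are not part of the GitHub API or of the Philips solution.

```typescript
// Hypothetical types for illustration only; not part of any GitHub API.
interface Runner {
  name: string;
  labels: string[];
}

// A runner is eligible when it carries every label requested by the job.
function canRun(runner: Runner, requestedLabels: string[]): boolean {
  const available = new Set(runner.labels);
  return requestedLabels.every((label) => available.has(label));
}

// Returns the names of all runners in the fleet that could take the job.
function eligibleRunners(fleet: Runner[], requestedLabels: string[]): string[] {
  return fleet
    .filter((runner) => canRun(runner, requestedLabels))
    .map((runner) => runner.name);
}

// Example fleet with made-up runner names.
const fleet: Runner[] = [
  { name: "runner-a", labels: ["self-hosted", "linux", "x64"] },
  { name: "runner-b", labels: ["self-hosted", "linux", "arm64"] },
];

console.log(eligibleRunners(fleet, ["self-hosted", "linux"]));
console.log(eligibleRunners(fleet, ["self-hosted", "arm64"]));
```

From the developer's point of view, only the requested labels in the workflow matter; which machine ends up satisfying them is an infrastructure concern.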
Should your developers spin up all those kinds of instances and manage their own hardware? Or can we make it easy for them? Can we make it as easy as using hosted runners, so that they come automatically?
One way to do that is, of course, to buy a lot of computers and install all the runners. And if you buy enough computers you can certainly reach a big scale, but it's not scalable. It's not scaling automatically up and down, and it's certainly not sustainable.
Most likely, all those computers are running all the time, so you waste a lot of energy. And today, with energy prices rising, it's also a very expensive operation. And I think I don't have to explain that running an operation like this is, maintenance-wise, a big hell.
But today we can use the cloud. So what we did here is we built a cloud-based solution to scale the self-hosted runners up and down. And in our case, that is on Amazon. At the moment we get an event for a job, we scale up, and when there's no workload to process, we scale down. That gives us a scalable solution. We utilize the elasticity of the cloud, and it also gives us a sustainable solution, because we only have computers running at the moment we need them.
So with this solution, which you can define yourself for Amazon Cloud, you get control over network, software, hardware, and also your cost. And this solution that we've built is out there, open source, on GitHub.
So let's have a closer look at the solution that we have built. It all starts with defining a GitHub App, and the GitHub App starts sending events to the cloud. It sends one every time a workflow event is triggered. We have a serverless control plane running in the cloud, and that serverless control plane catches the event and makes the decision to scale up or not. So every time it receives an event, it checks: "Can I run this? Do I have enough space?" or whatever. And if it is needed, it creates a self-hosted runner. That serverless control plane is also keeping an eye on the fleet of runners that you have.
So at the moment there's nothing to do, it scales them down. It removes idle instances from the fleet. So the solution is scaling up and down, and is therefore sustainable.
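The scale-up and scale-down decisions described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the actual logic of the open source solution: the `FleetState` and `ScalePolicy` types, the `maxRunners` and `keepIdle` settings, and both function names are hypothetical.

```typescript
// Hypothetical snapshot of the runner fleet; names are illustrative only.
interface FleetState {
  queuedJobs: number;  // jobs waiting for a runner
  idleRunners: number; // runners registered but not running a job
  busyRunners: number; // runners currently running a job
}

// Hypothetical policy knobs, assumed for this sketch.
interface ScalePolicy {
  maxRunners: number; // upper bound on the fleet size
  keepIdle: number;   // idle runners to keep around when scaling down
}

// Scale up: create runners for jobs that idle runners cannot absorb,
// without exceeding the configured maximum fleet size.
function runnersToCreate(state: FleetState, policy: ScalePolicy): number {
  const total = state.idleRunners + state.busyRunners;
  const needed = Math.max(0, state.queuedJobs - state.idleRunners);
  const headroom = Math.max(0, policy.maxRunners - total);
  return Math.min(needed, headroom);
}

// Scale down: when nothing is queued, remove idle runners beyond keepIdle.
function runnersToRemove(state: FleetState, policy: ScalePolicy): number {
  if (state.queuedJobs > 0) {
    return 0;
  }
  return Math.max(0, state.idleRunners - policy.keepIdle);
}

const policy: ScalePolicy = { maxRunners: 6, keepIdle: 1 };
const busyWeekday: FleetState = { queuedJobs: 5, idleRunners: 1, busyRunners: 2 };
const quietWeekend: FleetState = { queuedJobs: 0, idleRunners: 4, busyRunners: 1 };

console.log("create:", runnersToCreate(busyWeekday, policy));
console.log("remove:", runnersToRemove(quietWeekend, policy));
```

Keeping both decisions as pure functions of a fleet snapshot makes them easy to test without touching any cloud API.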
For the virtual machines where the self-hosted runners are running, you can choose Amazon on-demand instances, or be even cheaper with spot instances. So then you really have the lowest price that you can pay for running your CI/CD jobs.
And there's another option that you have. You can also make the runners ephemeral. In that case, at the moment the job is done, the runner terminates, and you get your scale-down by design. This solution you can also deploy yourself today. The solution that we have built, with TypeScript for the serverless functions and Terraform for the deployment, is quite simple to deploy. You define a simple Terraform script, you deploy it to Amazon Cloud, you define a GitHub App in your org and connect it to Amazon Cloud, and that's actually all that you have to do.
Doing that from scratch takes, I would say, 10 minutes or so, so it's not even hard. And once the solution is deployed, you can run the operation for one repository, for 10 repositories, or, as in our case, for thousands of repos and thousands of jobs. And that works nicely.
But deploying it manually is also not a good idea, so we automate the deployment as well. The deployment is running in GitHub Actions, on the same runners. So we have Inception here: the runners are deploying themselves when we deploy a new version or changes. And deploying automatically also brings repeatability and predictability to your systems. Something good to know if you start automating deployments to Amazon Cloud with GitHub Actions: you may be used to entering an access key into your CI as a secret. That's not needed anymore today.
When you use typical cloud providers with GitHub Actions, please use OpenID Connect. That saves you the hassle of defining all kinds of secrets in your workflows. The solution that we have built is open source, and we were very happy that we were able to open source it and are very thankful to our community. We got a lot back from the community. It's the community that built the Windows runners, made it possible to use the solution with GitHub Enterprise Server, added ARM support, made security fixes, solved bugs, and wrote better documentation. That's all what we got back from the community.
And in Philips, we use it today. We have actually used it for two years already, and it has become very simple for developers. They are no longer aware that there is something like runners that need to run somewhere. The only thing they have to do is define the label, the Philips label that you see in the workflow, and then the system starts scaling up and down during the week.
And then on the weekend it also scales, but not as hard as during the week. We do maybe other things on the weekend. So, on enterprise scale: we are now on GitHub with, I would say, roughly 4,000 developers. They develop in over 6,000 repositories and generate over 15,000 jobs a day on our Linux runners alone. So there are most likely more jobs, and it works perfectly most of the time, but we also found some problems. When we started, we started small with a few repositories and began scaling, and at some point we started hitting limits. There were limits in our cloud accounts that were set by default. You can typically increase them. Not all of them, but most of them. And you may have API limits. The system makes API calls to GitHub and also to Amazon.
If you make them too frequently or too fast, you get rate limited and your system is not scaling. So running at scale is nice, but it's hard. But in the end, it works. And our developers are very happy. But our developers are also binary: it works or it is not working; the system is up or down. And it is not so simple. The system that we have works well most of the time because it's serverless, so there's nothing to keep up and running, but it's dependent on all the other systems. And if one of those systems in the whole chain is down, we have a problem. Our jobs are not processing and developers are coming to us. So you have to have a team able to handle those kinds of situations. You will find problems when you start running this at scale.
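One common way to cope with the rate limiting mentioned above is to retry with exponential backoff. The sketch below is a generic illustration of that technique, not the approach the Philips solution necessarily takes: `backoffDelayMs`, `withBackoff`, the `isRateLimited` predicate, the default delays, and the 429 status check are all assumptions made for the example.

```typescript
// Delay doubles with each attempt and is capped; the defaults are arbitrary.
function backoffDelayMs(attempt: number, baseMs = 500, capMs = 30000): number {
  return Math.min(capMs, baseMs * Math.pow(2, attempt));
}

const defaultSleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Retries `call` while `isRateLimited(error)` is true, waiting longer each time.
// Any other error, or exhausting maxAttempts, rethrows the last error.
async function withBackoff<T>(
  call: () => Promise<T>,
  isRateLimited: (error: unknown) => boolean,
  maxAttempts = 5,
  sleep: (ms: number) => Promise<void> = defaultSleep,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await call();
    } catch (error) {
      if (!isRateLimited(error) || attempt + 1 >= maxAttempts) {
        throw error;
      }
      await sleep(backoffDelayMs(attempt));
    }
  }
}

// Usage with a fake API call that is rate limited twice before succeeding.
// The `status` field on the error is a made-up convention for this example.
let calls = 0;
const flakyCall = async (): Promise<string> => {
  calls += 1;
  if (calls < 3) {
    throw Object.assign(new Error("rate limited"), { status: 429 });
  }
  return "ok";
};
const looksRateLimited = (error: unknown): boolean =>
  (error as { status?: number }).status === 429;

// A no-op sleep is injected so the example finishes immediately.
withBackoff(flakyCall, looksRateLimited, 5, async () => {}).then((result) =>
  console.log(result, "after", calls, "calls"),
);
```

Backoff only spreads calls out over time; it does not raise the limit itself, so account limits that are too low for the workload still need to be increased with the provider where that is possible.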
That brings me to my final question. Should I use self-hosted runners or hosted runners? It's something that I cannot answer for you. It's something that you have to answer yourself.
Think about this. You use self-hosted runners when you need a good level of control. You want to control your software, your hardware, your network, or all other kinds of things. If that is the case, you should choose self-hosted runners. But be aware: if you are only running one, maybe do it manually. If you need multiple of them, think wisely. Choose a cloud solution, make it scalable, let it scale up and down by nature. And there are several solutions out there. Ours is on GitHub. There's an alternative for Kubernetes. Some of them are also mentioned on the GitHub documentation pages. So go out there and have a look.
If you don't need all those kinds of control, I guess it is much simpler to stick to hosted runners. Everything is managed for you and there's nothing that you have to take care of.
Thank you for attending the session today. I hope you learned how you can use GitHub self-hosted runners on demand at scale. Everything I've shown today is open source and is mentioned on the GitHub documentation page. Go to our repository, submit an issue, make a PR, make the community even happier than it is today, or reach out to me in person. Thank you. Thank you very much. Have a nice day.