My OPW project has several parts (I talked about them in my previous post here). I started my work with the first part, which is called the “drain now” mechanism. This mechanism aims to drain the slaves when a SIGUSR1 signal is sent to them.
‘Draining’ in this context means killing an entity (a process) and everything that is running underneath it (all the processes that were started by that entity). So ‘draining the slave’ means killing the slave process and all the jobs/tasks the slave is running.
For the first part of the mechanism, I implemented a signal handler which is triggered when a SIGUSR1 signal is received. The handler kills everything that runs underneath the slave and then shuts down the slave.
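To make the idea more concrete, here is a minimal sketch of such a handler in Python. It is not the actual patch (the project itself is not written in Python), and the names `Slave`, `launch`, `drain` and `install_drain_handler` are made up for illustration. It assumes a POSIX system and that each task is started in its own process group.

```python
import os
import signal
import subprocess


class Slave:
    """Toy stand-in for a slave process that tracks the tasks it started."""

    def __init__(self):
        self.tasks = []      # Popen handles of the tasks started by this slave
        self.running = True  # a main loop would exit once this turns False

    def launch(self, argv):
        # start_new_session=True makes the task the leader of its own process
        # group, so whatever it spawns (and stays in that group) can be
        # signalled together with it.
        proc = subprocess.Popen(argv, start_new_session=True)
        self.tasks.append(proc)
        return proc

    def drain(self, signum=None, frame=None):
        """Kill every task together with its process group, then stop the slave."""
        for proc in self.tasks:
            try:
                os.killpg(proc.pid, signal.SIGKILL)
            except ProcessLookupError:
                pass  # the whole group has already exited
            proc.wait()  # reap the task so it does not linger as a zombie
        self.tasks.clear()
        self.running = False


def install_drain_handler(slave):
    # Must be called from the main thread; Python only allows signal
    # handlers to be installed there.
    signal.signal(signal.SIGUSR1, slave.drain)
```

With this in place, running `kill -USR1 <pid>` against the process would trigger `drain`, and a main loop of the form `while slave.running: ...` would then exit.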
For the second part of the mechanism, I adjusted the first part so that the slave sends an unregister request message to the master before shutting down. This is because the master normally waits up to the health checking delay (~75 seconds) before it considers the tasks that were running on the slave as lost. With this change, the master marks the tasks as lost (and does everything that is necessary when a task becomes lost) as soon as it receives the unregister request message. In addition, the master removes the slave from its lists.
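The difference between the two paths can be shown with a small toy model. This is only a sketch under my own assumptions: the `Master` class and its methods are hypothetical and are not the real master's API or message format. Only the ~75 second delay comes from the description above.

```python
from dataclasses import dataclass, field

HEALTH_CHECK_DELAY = 75.0  # seconds; the approximate figure mentioned above


@dataclass
class Master:
    """Toy master: remembers each slave's tasks and when it was last seen."""

    slaves: dict = field(default_factory=dict)     # slave_id -> set of task ids
    last_seen: dict = field(default_factory=dict)  # slave_id -> timestamp
    lost_tasks: list = field(default_factory=list)

    def register(self, slave_id, tasks, now):
        self.slaves[slave_id] = set(tasks)
        self.last_seen[slave_id] = now

    def _remove(self, slave_id):
        # Mark every task of the slave as lost and forget the slave.
        for task in sorted(self.slaves.pop(slave_id)):
            self.lost_tasks.append(task)
        del self.last_seen[slave_id]

    def unregister(self, slave_id):
        # Fast path: the slave announced that it is going away.
        if slave_id in self.slaves:
            self._remove(slave_id)

    def health_check(self, now):
        # Slow path: only give up on a slave after the delay has passed.
        for slave_id, seen in list(self.last_seen.items()):
            if now - seen >= HEALTH_CHECK_DELAY:
                self._remove(slave_id)
```

In this model, a slave that calls `unregister` has its tasks marked as lost right away, while a slave that just disappears is only noticed once `health_check` runs at least 75 seconds after it was last seen.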
At first glance all this seems pretty easy to do, but things tend to get a bit complicated when you’re working on a big project that lots of companies rely on for their infrastructure (including Twitter!!). Every piece of code that you add has to be perfect so that the application remains efficient and bug-free. So my patch went through a couple of review iterations before it was ready to be committed. My mentor (Ben Mahler) helped me a lot by reviewing all my code and giving me tips.
So, the lesson learned: ‘keep it simple’. When you have something to do, even if it looks easy, think about it twice; maybe there’s an even easier way to do it.
Thanks for reading and have a great day 🙂 !