A Halloween developer story

Here is a scary bedtime story for developers.

Once upon a time in the land of POCIP (Proof Of Concept In Production) a developer was woken by a scary phone call: "The data nodes are failing one by one! The Hadoop cluster is going down!"

When an ops person lifted the lid and looked inside, they found an HDFS drowning in blocks. Each data node had lots and lots of blocks. Really a lot, and in each block, a teeny tiny file.

Each time the master asked the nodes for their manifest, they would scurry around and try to prepare a list. It took so long that the master would lose patience and declare a node deceased before it could answer.

The master would then wade through the swamp of blocks to ensure that redundant copies were stored on the remaining nodes, which only made matters worse! In the end it brought down the entire cluster. The developers had been slain by the beast they created.


The moral of the tale?

Be careful with silver bullets

Modern NoSQL DBs and other highly specialized frameworks are very powerful tools when used correctly. They take us farther than traditional technologies by making strong design choices and trade-offs. Disregard those choices when designing your system and you end up like Superman holding a stick of kryptonite. HDFS makes one such choice: be good at managing a limited number of large files. Ask it to manage a lot of small files and Bad Things happen.
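To get a feel for how fast the beast grows, here is a rough back-of-the-envelope sketch. It assumes a 128 MiB block size (check your own cluster's setting) and that every non-empty file occupies at least one block. The function names and the 1 TiB dataset are made up for illustration.

```python
# Assumed block size (128 MiB); your cluster's configuration may differ.
BLOCK_SIZE = 128 * 1024 * 1024


def blocks_for_file(size_bytes: int, block_size: int = BLOCK_SIZE) -> int:
    """Blocks one file occupies: ceiling division, so a tiny file still takes one block."""
    return -(-size_bytes // block_size)


def blocks_for_dataset(total_bytes: int, file_size: int,
                       block_size: int = BLOCK_SIZE) -> int:
    """Approximate blocks needed when total_bytes is stored as equal files of file_size bytes."""
    file_count = -(-total_bytes // file_size)
    return file_count * blocks_for_file(file_size, block_size)


ONE_TIB = 1024 ** 4

# The same 1 TiB of data, stored two different ways (illustrative sizes).
large = blocks_for_dataset(ONE_TIB, 128 * 1024 * 1024)  # 128 MiB files
small = blocks_for_dataset(ONE_TIB, 64 * 1024)          # 64 KiB files

print(f"large files: {large} blocks")
print(f"small files: {small} blocks ({small // large}x more)")
```

Same data, same disks, but the small-file layout leaves the cluster with thousands of times more blocks to track and report.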

Cast a protective spell

Monitoring. It's the key to keeping your nights peaceful. In this case, simply monitoring the number of blocks would have raised an alert: the cluster already had the hiccups due to high block counts before the complete failure. One could argue that we were in the land of Proof Of Concept and we didn't know what to monitor. Or worse, that we lacked the time to build proper monitoring on this new piece of technology. But...

Keep in mind that the design choices of the technology will guide you in what to monitor, so you know when you are breaking the rules. And good monitoring is the fastest path to understanding new tech.
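A protective spell for this particular story could be as small as the sketch below. It assumes you already collect a block count per data node from whatever tooling your cluster offers; the node names and both thresholds are hypothetical and would need tuning against your own hardware.

```python
from typing import Dict, List

# Hypothetical thresholds, for illustration only; tune them for your cluster.
WARN_BLOCKS = 500_000
CRIT_BLOCKS = 1_000_000


def check_block_counts(counts: Dict[str, int],
                       warn: int = WARN_BLOCKS,
                       crit: int = CRIT_BLOCKS) -> List[str]:
    """Return one alert line per data node whose block count crosses a threshold."""
    alerts = []
    for node, blocks in sorted(counts.items()):
        if blocks >= crit:
            alerts.append(f"CRITICAL {node}: {blocks} blocks")
        elif blocks >= warn:
            alerts.append(f"WARNING {node}: {blocks} blocks")
    return alerts


# Made-up sample data standing in for real measurements.
sample = {"datanode-1": 120_000, "datanode-2": 640_000, "datanode-3": 1_250_000}
for line in check_block_counts(sample):
    print(line)
```

The point is less the code than the choice of metric: the technology's design told us that block count is the rule we must not break, so that is the number to watch.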

Clean your room

Last, but certainly not least, you should clean your system on a regular basis. No matter how good a system is, it will crumble under the dust you let settle on it. In this story, if you had followed the first two pieces of advice, you would already know that the number of blocks can be an issue and would be keeping an eye on it. From there, keeping the number of files under control by running regular purges should be obvious.
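The selection half of such a purge can be sketched as follows, assuming a simple age-based retention rule. The paths, dates, and 30-day window are invented for the example, and the actual listing and deletion are left to whatever tooling your cluster provides.

```python
from datetime import datetime, timedelta
from typing import Iterable, List, Tuple


def select_expired(files: Iterable[Tuple[str, datetime]],
                   now: datetime,
                   max_age: timedelta) -> List[str]:
    """Return the paths whose last-modified time is older than the retention window."""
    cutoff = now - max_age
    return [path for path, modified in files if modified < cutoff]


# Made-up listing: (path, last-modified time).
listing = [
    ("/poc/events/part-0001", datetime(2023, 9, 1)),
    ("/poc/events/part-0002", datetime(2023, 10, 20)),
    ("/poc/events/part-0003", datetime(2023, 10, 30)),
]

# Hypothetical retention rule: keep 30 days of files.
to_purge = select_expired(listing, now=datetime(2023, 10, 31), max_age=timedelta(days=30))
print(to_purge)
```

Keeping selection separate from deletion makes it easy to dry-run the purge and review the list before anything is removed.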

Happy Halloween!