Hadoop: An introduction

Posted on Updated on

This past week I attended a data conference where the functional piece behind a lot of the software that was being presented was Hadoop.”Hadoop is an open-source software framework for storing data and running applications” (1)

What makes Hadoop special is that it can be run on generic hardware, and the scalability is based on the amount of hardware you have, not the speed of your processor, or disk. Now to describe this in some details in layman’s terms: Imaging you are in your office with 100 employees. You have a laptop as does everyone else. If you want to processing something, you can only process as fast as your processor will go. OK this won’t cut it. You buy a huge Dell server with 16 processors, and each processor has 16 cores, giving you 256 threads. This set you back $250,000. Now you can really process a lot of data. But imagine you took all the laptops in your office and were able to cluster or chain them together. Dell XPS 13 laptops come with 2 cores, so if you chained all of them together, you could have a virtual “system: of 200 threads. At under $1000 each thats $200,000. At this point you might be thinking, ok where is the savings? The other system has more cores for not much more money. But now that we have our thinking going, what if we instead were looking at desktops, which we can buy at $399. Suddenly we have a “200 core” system at $40,000. (100 desktops x 399). $40,000 vs $250,000. Looking a lot better. Now lets scale this this up to a data center’s worth of hardware, and suddenly we are getting supercomputer performance and datacenter prices. This is just a generic example I thought of, but it shows why in scale Hadoop seems like a good choice. Like any infomercial I feel like I need to say “But wait! There more!”

Fault tolerance is built into Hadoop. FT is having the ability to not be affected by hardware failures. If a node, name given to a server or computer point, fails it does not mean that the jobs that were on that node are lost. Data and compute is distributed so jobs are redirected and data is stored on multiple points so the data on that node is also available elsewhere. An example that might make sense would be your iPhoto pictures. If you lose your phone, via iCloud they are still available on your PC, and even in the cloud if you paid for that plan. Your photos are fault tolerant, they don’t depend on your phone alone to exist.

Those are some of the ways that Hadoop makes sense to process big data, and why it exists. As I learn more, I hope I can teach you too.


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s