Top 3 Hadoop distributions: which is right for you?
By Marco Shaw
One of the biggest buzzwords of the past year or so has been “big data.” Many organizations have been struggling with how to approach it, and that likely won’t change any time soon. Big data can be challenging to work with because of the processing power required to handle it, and it often requires a completely different data management solution. If you’ve tried to get acquainted with big data at all, you’ve probably come across Apache Hadoop (more commonly referred to simply as “Hadoop”). It’s a popular framework for processing large data sets, and some of its biggest strengths lie in its flexibility and cost-effectiveness. However, Hadoop is available in several distributions, and each one is different, so it’s important to compare your options before deciding which you’re going to use.
There are several distributions available, such as ones provided by EMC and Intel, as well as those provided by hardware vendors like IBM, which are typically all-in-one solutions that include hardware. But the three biggest and most prevalent Hadoop distributions that exist today are Cloudera, MapR and Hortonworks. If you’re anxious to test things out, all three vendors offer free versions. As is to be expected with free versions, each has some level of restriction, based either on functionality or on the number of nodes that can be added to a cluster. And if you need to get up and running really quickly, each vendor offers VM images with Linux and Hadoop already installed.
Which is better? It depends entirely on what you’re seeking. Because Hadoop is licensed under the Apache License, a free software license, the core is open to everyone, and the patches and updates these vendors provide to the core Hadoop distribution are something that everyone benefits from. So it’s best to turn your attention instead to the strengths and weaknesses of each product and to the add-ons available for it.
Here are a few things that make each of the top three vendors stand out from each other:
- For tutorials, I slightly prefer Hortonworks because of how they’re presented online. That said, when I tried to go through the tutorials using the “Hortonworks Sandbox” (based on 2.0), some of the examples failed to run. Hopefully this isn’t a widespread problem.
- From a training perspective, Cloudera seems to have the most complete and professional training program of the three. But with that comes a bigger price tag — Cloudera’s training program and exams are typically the costliest.
- One thing that makes Hortonworks stand out quite a bit is that it supports the Microsoft Windows operating system, whereas the other vendors support only Linux. (Microsoft has also taken Hortonworks’ product and packaged it into its own service called HDInsight, which can be used for on-premises Hadoop installations or run in the Windows Azure cloud service.)
- Cloudera and Hortonworks both use the NameNode and DataNode architecture, which splits up where the metadata is saved and where data processing is done, and both depend on HDFS. MapR takes a more distributed approach, saving the metadata on the processing nodes, and it depends on a different distributed file system architecture.
- Hadoop 2 was released recently, and if immediate upgrade offerings are important to you, Hortonworks was the first to release a complete production-ready Hadoop distribution based on version two. Cloudera did have Hadoop 2 features in an earlier version, but some of the components weren’t considered production-ready.
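To make the NameNode and DataNode split from the list above a little more concrete, here is a toy Python sketch of the idea: one object keeps only metadata (which blocks make up a file and where they live), while others hold the actual block contents. Every class, function and parameter name here is hypothetical and chosen for illustration; this is not Hadoop's API, and real HDFS also replicates blocks and uses far larger block sizes. MapR's approach of spreading metadata across the processing nodes is not modeled.

```python
# Toy in-memory model of an HDFS-style metadata/data split.
# All names are hypothetical; this does not use or mirror Hadoop's real API.

class DataNode:
    """Holds raw block contents, keyed by block ID."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def store(self, block_id, data):
        self.blocks[block_id] = data

    def read(self, block_id):
        return self.blocks[block_id]


class NameNode:
    """Holds only metadata: file path -> list of (block ID, DataNode name)."""

    def __init__(self, datanode_names):
        self.datanode_names = list(datanode_names)
        self.files = {}

    def allocate(self, path, num_blocks):
        # Simple round-robin placement (an assumption; real HDFS also replicates).
        placements = [
            (f"{path}#{i}", self.datanode_names[i % len(self.datanode_names)])
            for i in range(num_blocks)
        ]
        self.files[path] = placements
        return placements

    def locate(self, path):
        return self.files[path]


def write_file(namenode, datanodes, path, data, block_size=4):
    """Client asks the NameNode where blocks go, then sends data to DataNodes."""
    chunks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    placements = namenode.allocate(path, len(chunks))
    for (block_id, node_name), chunk in zip(placements, chunks):
        datanodes[node_name].store(block_id, chunk)


def read_file(namenode, datanodes, path):
    """Client looks up block locations, then reads each block from its DataNode."""
    return b"".join(
        datanodes[node_name].read(block_id)
        for block_id, node_name in namenode.locate(path)
    )


if __name__ == "__main__":
    nodes = {"dn1": DataNode("dn1"), "dn2": DataNode("dn2")}
    namenode = NameNode(nodes)
    write_file(namenode, nodes, "/logs/a.txt", b"hello hadoop")
    print(namenode.locate("/logs/a.txt"))   # metadata only
    print(read_file(namenode, nodes, "/logs/a.txt"))
```

Note that the `NameNode` object in this sketch never touches file contents, which is the essence of the split: losing it would lose the map to the data even though every block is still sitting on a `DataNode`.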
These are three companies that have been very strong in the past year, and have received quite a bit of venture capital funding. In this regard, MapR is even more interesting because late last year news broke that they were planning on going public, which means they could raise even more money for their products and development. This is an exciting time as big data is gearing up to really take off, and if MapR’s IPO plan is any indicator, this next year is going to be very interesting to watch.
About the Author
Marco Shaw is an IT consultant working in Canada. He has been working in the IT industry for over 12 years. He was awarded the Microsoft MVP award for his contributions to the Windows PowerShell community for 5 consecutive years (2007-2011). He has co-authored a book on Windows PowerShell, contributed to Microsoft Press and Microsoft TechNet magazine, and also contributed chapters for other books such as Microsoft System Center Operations Manager and Microsoft SQL Server. He has spoken at Microsoft TechDays in Canada and at TechMentor in the United States. He currently holds the GIAC GSEC and RHCE certifications, and is actively working on others.