Hadoop Maturity & Mainstream Acceptance


The last two years have seen considerable interest, and perhaps hype, in all things 'Big Data'. While I do not wish to fall into the trap of arguing over definitions, I'll paraphrase the Wikipedia and Gartner definitions:

"Big Data refers to large data sets (in terms of volume of data), being produced at a high rate and variety (i.e. structured, semi-structured, unstructured) which require new cost-effective means to process and analyse such data to aid decision making."

In other words the 3 Vs of volume, velocity and variety.

Apache Hadoop, an open source framework for the distributed storage and processing of large data sets across clusters of commodity computers, together with the ecosystem of related projects around it, is seen as one way of dealing with big data.
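To make that a little more concrete, here is a minimal sketch of the classic word-count job written against the Hadoop MapReduce Java API (the newer org.apache.hadoop.mapreduce API): the mapper emits a (word, 1) pair for every token it sees, and the reducer sums the counts for each word across the whole data set. The input and output paths are just placeholder command-line arguments.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Map phase: runs in parallel over each input split, emitting (word, 1) pairs.
      public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // Reduce phase: receives all the counts emitted for a given word and sums them.
      public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on each mapper
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory on HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory on HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

The framework takes care of splitting the input across the cluster, shuffling the intermediate pairs to the reducers and re-running failed tasks, so the same job runs unchanged on one node or on thousands.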

Over the last few months I've heard a lot of debate around two points:
  • Why Hadoop? - What does the Hadoop ecosystem bring to the table that I cannot already achieve with existing tools such as RDBMSs, data warehouses, dedicated appliances etc.? When would I use Hadoop?
  • Hadoop Maturity - Hadoop is too immature, it's not production ready, I need real support for such complicated technology, and only a few large web companies such as Facebook and Yahoo are using it.
In this post I'll be discussing my views on these points.

Why Hadoop?

Perhaps the question should be "when is Hadoop an appropriate solution?" because I feel some of the debate is due to a misunderstanding of the use cases that Hadoop is aligned to.

Aside from a few over-zealous marketing departments, I think most people in the Hadoop community agree that Hadoop is not the solution for all types of problems. I think Microsoft's paper 'Nobody ever got fired for buying a cluster' mostly gets it right: if your data sets are in the multi-gigabyte to low-terabyte range and don't tend to grow much, existing tools and architectures will probably suffice.

Hadoop is optimised for batch processing of the entire data set (or at least a large subset of it) and for large streaming reads and writes. If this matches your requirements, then Hadoop may be a good fit where the 3 Vs of volume, velocity and variety preclude the use of existing technologies, i.e. because:
  • You cannot cost-effectively store the volume of data you have on traditional SAN storage systems: the cost of SAN hardware (fabric switches, HBAs etc.), management software and licensing is too high. Perhaps the database software cannot scale to those volumes, and the licensing may be prohibitive, especially when factoring in high availability. The sheer volume of data may also make processing challenging, given the need to shuttle data between the SAN and the compute resources.
  • Even if you can store it on a SAN today, will it cost-effectively scale with the growth rate of your data?
  • You now have to deal with a wide variety of data types from different sources that traditional tools cannot handle effectively.
Hadoop is not suited to processing many small files or to random reads that only touch a small proportion of the data. It is not designed for low-latency queries over relatively small data sets. Yes, offerings such as Cloudera's Impala are changing this, but it really depends on what 'real time' means to you.
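To illustrate the kind of access pattern Hadoop does favour, here is a rough sketch that streams a single large file sequentially from HDFS using the FileSystem API. The file path is made up for the example; the point is that HDFS serves up large blocks (64 or 128 MB by default, depending on the version) for end-to-end scans, whereas millions of tiny files or scattered random reads work against it.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SequentialScan {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Hypothetical file; in practice this would be one of many multi-GB inputs.
        Path logFile = new Path("/data/weblogs/2013-01-01.log");

        long lines = 0;
        try (FSDataInputStream in = fs.open(logFile);
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
          // One long sequential pass over large blocks: the pattern HDFS is built for.
          while (reader.readLine() != null) {
            lines++;
          }
        }
        System.out.println("Lines scanned: " + lines);

        // FSDataInputStream does support seek(), but issuing many small, scattered
        // reads (or storing millions of tiny files) is where HDFS performs poorly.
      }
    }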

Hadoop Maturity & Mainstream Acceptance

This brings me on to the second point: Hadoop's maturity. The core components of Hadoop (HDFS and MapReduce) have been around in some form since 2004-2006 as part of the Nutch web crawler, before being spun out into a separate project that became a top-level Apache project in 2008. However, until recently many people felt the technology was too bleeding edge, immature in some respects and not enterprise ready. I think in the last few years this has changed considerably through the addition of features which have addressed common enterprise concerns:
  • Stability: The core of Hadoop has for the most part stabilised (okay, aside from the fact that we are now in a transition period from MRv1 to the new YARN-based MRv2 architecture).
  • Availability & Scalability: The lack of high availability and scalability in the NameNode and JobTracker components of Hadoop is being addressed through the NameNode HA and Federation features, and through the YARN architecture, which removes the reliance on a single JobTracker. To be honest, though, you'd need a pretty large cluster or a very large number of jobs for NameNode or JobTracker scalability to become an issue.
  • Security: Improved security through support for Kerberos authentication.
  • Management: The Apache Ambari project, although currently in the Apache Incubator, will simplify the management and deployment of Hadoop clusters without locking you into a particular vendor's management toolset.

Arguably you do require more technical expertise to run a Hadoop cluster than a traditional data warehouse platform. However, this is where some of the commercial Hadoop vendors come in. It's also true that projects such as Pig and Hive lower the barrier to entry into Hadoop.

For me, the growing number of new commercial vendors that have materialised, and of existing vendors that are now buying into Hadoop, demonstrates the value that organisations see in it. The overview below covers a selection of the players in the market.



Cloudera
  • Product/offering: Cloudera Distribution Including Hadoop (CDH); Cloudera Enterprise Core, RTD & RTQ.
  • Value-add on top of Hadoop: Impala for interactive querying/analytics; Cloudera Manager for cluster management.
  • Apache Hadoop (core) committers or PMC members: Yes. Uses Hadoop internally: ?
  • Notes: Paid support services and pre-tested/integrated versions of Hadoop ecosystem projects. One of the first vendors to bring their own distribution to market. Relatively few proprietary components (only really Cloudera Manager).

Hortonworks
  • Product/offering: The Hortonworks Data Platform (HDP).
  • Value-add on top of Hadoop: First to include the Apache Ambari project in the distribution for management.
  • Apache Hadoop (core) committers or PMC members: Yes. Uses Hadoop internally: Yes, since they are partly owned by Yahoo.
  • Notes: A venture spun out from Yahoo. Also offers paid support and pre-tested/integrated versions of Hadoop ecosystem projects. No proprietary components.

MapR
  • Product/offering: Multiple distributions: the MapR M3, M5 and M7 editions, built on proprietary extensions/modifications to the Hadoop architecture.
  • Value-add on top of Hadoop: Real-time analytics; HA solutions; provisioning, monitoring and management.
  • Apache Hadoop (core) committers or PMC members: Yes. Uses Hadoop internally: ?
  • Notes: Provides support services and a pre-tested/configured Hadoop distribution. On paper the MapR distribution provides some significant enhancements to Hadoop; however, this potential USP has been eroded by MRv2/YARN, NameNode HA and Federation, not to mention other vendors providing their own solutions in the same areas that MapR promotes as its value-add. Risk of vendor lock-in: the proprietary nature of the solution would make it difficult to migrate to another distribution.

IBM
  • Product/offering: InfoSphere BigInsights platform.
  • Value-add on top of Hadoop: Integration with Cognos (bundled with BigInsights), Netezza, R, and BigSheets (for data visualisation).
  • Apache Hadoop (core) committers or PMC members: Yes. Uses Hadoop internally: ?
  • Notes: Provides the option of running CDH or IBM's own Apache Hadoop distribution. Provides support services and a pre-tested/configured Hadoop distribution.

Amazon
  • Product/offering: Elastic MapReduce (EMR).
  • Value-add on top of Hadoop: The ability to run MapReduce jobs on top of their EC2 compute and S3 storage resources.
  • Apache Hadoop (core) committers or PMC members: No. Uses Hadoop internally: ?
  • Notes: Potentially a good way of running a PoC or getting to grips with the MapReduce framework. The shared nature of the infrastructure requires careful planning for production use: virtualisation is not recommended for worker nodes in a Hadoop cluster because resources are shared, which may impact performance, and Hadoop leverages data locality to improve performance when processing large data sets, which is not possible with S3. Likely to be cost prohibitive for large data sets due to Amazon's pricing model for data storage and retrieval.

Datameer
  • Product/offering: Datameer Personal, Workgroup and Enterprise.
  • Value-add on top of Hadoop: An analytics platform that combines open source and proprietary technology.
  • Apache Hadoop (core) committers or PMC members: Yes, in the past. Uses Hadoop internally: ?
  • Notes: Provides support services and a pre-tested/configured Hadoop distribution.

EMC
  • Product/offering: Resells MapR's M5 distribution as Greenplum MR; also offers Greenplum HD.
  • Value-add on top of Hadoop: Integration with the Greenplum appliance and Isilon NAS.
  • Apache Hadoop (core) committers or PMC members: No. Uses Hadoop internally: ?
  • Notes: Provides support services and a pre-tested/configured Hadoop distribution.

Microsoft
  • Product/offering: HDInsight, their 'Apache compatible' Hadoop distribution that runs on Windows Server and Azure.
  • Value-add on top of Hadoop: Integration with AD, System Center 2012, the Microsoft BI toolsets and Excel.
  • Apache Hadoop (core) committers or PMC members: Yes. Uses Hadoop internally: ?
  • Notes: Some of the potential pitfalls of running Hadoop on Amazon EMR also apply to HDInsight on Azure. While Microsoft offers support services, the majority of other vendors are pushing Hadoop on Linux; the fact that Hadoop will have had far more 'burn-in' time on Linux than on Windows may be an important consideration from a support perspective.

Hadapt
  • Product/offering: In their own words, Hadapt's Adaptive Analytical Platform brings a native implementation of SQL to Apache Hadoop.
  • Apache Hadoop (core) committers or PMC members: No. Uses Hadoop internally: ?
  • Notes: Their offering appears to be in a similar space to Cloudera's Impala.

HP
  • Product/offering: HP has partnered with MapR, Cloudera and Hortonworks to develop pre-configured platforms for Hadoop.
  • Value-add on top of Hadoop: The biggest value-add is that the stack (servers, storage, networking) is pre-configured, integrated and tested.
  • Apache Hadoop (core) committers or PMC members: No. Uses Hadoop internally: ?
  • Notes: Choice of distributions to deploy. Paid support services.

Oracle
  • Product/offering: Oracle Big Data Appliance.
  • Value-add on top of Hadoop: Like HP, a pre-integrated hardware stack, plus connectors for Oracle DB and integration with Enterprise Manager for management.
  • Apache Hadoop (core) committers or PMC members: No. Uses Hadoop internally: ?
  • Notes: Support services. Runs CDH.


Note that I've only considered whether vendors commit to Hadoop Core, not whether they commit to other ecosystem projects (nor how many LOC each vendor has contributed[1]). This is by no means a comprehensive assessment; it is intended to reinforce my point. In reality there are a number of other factors you would need to consider in assessing vendors, such as financial stability, reference sites, regional support capability, support models, product and patch release cycles, and the product roadmap.

One interesting trend is that a number of vendors have introduced products/features to allow 'real time'/interactive analytics on Hadoop. I've already mentioned Cloudera's Impala, which is based on Google's Dremel, as is the Apache Drill project (currently in the Apache Incubator). Hortonworks, the Yahoo! spin-off, recently announced the Stinger initiative, which aims to provide similar capabilities, not by implementing a Dremel clone but by enhancing Hive.

[1] There seems to be a bit of a debate between Cloudera and Hortonworks around who has contributed the most lines of code or patches etc. To be honest I think using LOC as a metric is of questionable value, as is this debate in general.

Update 14th April 2014: Thanks to Harsh J for pointing out my error in stating that Impala was proprietary; it's actually open source. In addition, it's not only Cloudera that supports Impala; I believe it is now also supported on AWS EMR.

Comments

  1. This is a pretty nice post, loved reading your take on the state :-)

    I notice you have mentioned Impala as being proprietary but it has never been so, ever. Right from its first ever release, it has been purely free and open source under the APLv2: http://github.com/cloudera/impala. Would be great if you can correct that in the comparison table!
