Hadoop is a collection of open source software for building a distributed computing environment. A relatively new concept, it is being hailed as one of the best ways of extracting more value from Big Data. However, its growing popularity has brought with it a large number of misconceptions about its capabilities, and these are coloring the vision of DBA experts and IT managers. Here are some common myths, debunked:
Hadoop Is Large and Singular
While many people discuss Hadoop as if it were a single massive system, it is in fact a collection of multiple products. The brand name applies to an entire family of open source products that are incubated and overseen by the Apache Software Foundation. People thinking of Hadoop may really be thinking only of the Hadoop Distributed File System (HDFS), which is the foundation that underlies other components such as MapReduce.
Hadoop Is Open Source
Hadoop is available from Apache as open source software that can be downloaded and used free of charge. However, a large number of vendors, such as IBM, EMC Greenplum, and Cloudera, offer their own distributions with value-added features for administration, maintenance, and support that are not available in the Apache distribution. While many may question the need to buy a version of something that is available free, the branded versions can offer more capabilities for businesses that have established and competent IT departments.
Hadoop Is a Single Product
In reality, Hadoop is an ecosystem that many mistake for a single product. Vendors and the open source community are continually developing technology extensions, and these new products make Hadoop more structured and relational. Many providers also offer platforms for data integration and reporting, as well as interfaces that make Hadoop more intuitive and easier to use.
Hadoop Is a Database Management System
Contrary to what most people think, Hadoop is not a database management system but a file management system. Even though Hadoop can manage data collections, it lacks certain attributes of database management systems. For example, Hadoop does not provide query indexes that allow data to be accessed randomly. The data stored in Hadoop also typically lacks the structure that a database expects.
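To make the point about query indexes concrete, here is a minimal sketch in plain Python. It does not use Hadoop at all; the sample records and the function names (scan_lookup, build_index, index_lookup) are hypothetical and exist only for illustration. It contrasts finding a record by reading everything in order, which is roughly what a file-oriented store offers, with the keyed lookup that a database index makes possible.

```python
from typing import Dict, List, Optional

Record = Dict[str, str]

# Hypothetical sample data: in a file-oriented store, records are
# simply kept one after another with no lookup structure.
RECORDS: List[Record] = [
    {"id": "u1", "city": "Austin"},
    {"id": "u2", "city": "Boston"},
    {"id": "u3", "city": "Denver"},
]


def scan_lookup(records: List[Record], key: str) -> Optional[Record]:
    """Find a record by reading the records in order until one matches."""
    for record in records:
        if record["id"] == key:
            return record
    return None


def build_index(records: List[Record]) -> Dict[str, int]:
    """Build a simple index that maps each id to its position in the list."""
    return {record["id"]: position for position, record in enumerate(records)}


def index_lookup(
    records: List[Record], index: Dict[str, int], key: str
) -> Optional[Record]:
    """Jump straight to a record using the index instead of scanning."""
    position = index.get(key)
    return None if position is None else records[position]


if __name__ == "__main__":
    index = build_index(RECORDS)
    print(scan_lookup(RECORDS, "u3"))
    print(index_lookup(RECORDS, index, "u3"))
```

Both lookups return the same record. The difference is the work involved: the scan touches every record ahead of the match, while the index goes directly to it. On a few records this does not matter, but on the data volumes Hadoop is meant for, the absence of an index is what makes random access impractical.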
Hive Is SQL-Compatible
While HiveQL, the query language of Apache Hive, is quite similar to SQL, it is not the standard SQL that most database managers know. This lack of compatibility can be difficult for many businesses to manage because the tools they use to access data are typically based on SQL. Many people dismiss the language problem because HiveQL is very easy to learn, but ease of learning does not solve the real issue of compatibility with SQL-based tools. As of now, this compatibility issue is a barrier to Hadoop becoming a mainstream system.
Hadoop and MapReduce Need To Be Used Together
While Hadoop and MapReduce are often used together, they do not need to be. Google developed MapReduce before HDFS came into existence. Additionally, a number of vendors offer MapReduce variations that do not require HDFS at all. However, industry experts consider HDFS and MapReduce to be a good combination. In fact, most of the value derived from HDFS lies in the many tools that can be layered over it.
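As a rough illustration that the MapReduce model does not depend on HDFS, the sketch below runs a word count entirely in memory using plain Python. It is not Hadoop's API and not any vendor's implementation; the function names (map_phase, shuffle_phase, reduce_phase, word_count) are assumptions chosen for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one line."""
    for word in line.lower().split():
        yield word, 1


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle step: group all emitted values by their key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce step: collapse the values for one key into a total."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce over in-memory lines of text."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "big cluster"]))
```

The input here is just a list of strings, so no file system of any kind is involved. What HDFS adds in practice is a way to spread very large inputs across many machines so that the map and reduce steps can run close to the data.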