Hadoop has been widely adopted into many companies’ IT infrastructure. For experienced big data handlers with strong engineering teams, adoption is fairly straightforward: design a target system, select a technology stack and begin implementation. Hadoop beginners, however, face a host of challenges right from the start. Below is a list of these challenges:
- Choosing vendors
There are many distributions available: Apache, Cloudera, Hortonworks and MapR, just to mention a few. The most important lesson for a Hadoop beginner is that very few companies can use the original Apache Hadoop release in their production environment ‘as is’, i.e. without further modification. Some vendors, like Oracle, even offer their own hardware, and things can get more complex once you actually start speaking with the vendors. Even experienced Hadoop users may have trouble selecting the right distribution, given that different vendors ship different configuration managers and components.
- SQL on Hadoop – popular but unclear
Hadoop clusters store large amounts of data. Beyond running predetermined data processing pipelines, enterprises want more value from that data by giving their business analysis and data science teams interactive access to it. Online marketing buzz reinforces this demand, with claims that an enterprise data warehouse improves competitiveness.
However, there are many frameworks offering interactive SQL on Hadoop, which creates a selection problem similar to the one above. The challenge is not just in selecting the best one: none of them is a worthy replacement for a conventional OLAP database. Although they have a host of strategic upsides, they also come with downsides in ease of support, performance and SQL compliance.
- Availability of big data engineers
Every IT enterprise needs a strong team of engineers, and nowhere is this more important than in big data. Remote DBA experts who provide outsourced data management should not rely solely on their long-standing engineers, no matter how good they are at C++, Python, Java and the like. Otherwise, in a few years, you may find yourself with unsupportable, unstable and over-engineered scripts and jars surrounded by a plethora of frameworks. If key developers leave your company, you may find yourself in even more dire straits. You need personnel with specific experience in big data technology stacks, including developers who will find ways to keep the system simple and able to evolve in future.
- Secure Hadoop environments
Hadoop is increasingly used to store sensitive data, which raises a few technical issues regarding compliance. If only MapReduce and HDFS are used, the situation is simpler, as they support at-rest and in-motion data encryption, Kerberos for authentication and file system permissions for authorization. However, if other frameworks are used, particularly those that execute requests under their own system user, trouble looms. To begin with, not all of these frameworks support Kerberos. In addition, not all of them have their own authorization features and, finally, many lack in-motion data encryption. Each of these gaps is a potential loophole, especially when requests are submitted from outside the cluster.
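As a rough illustration of the core Hadoop security settings mentioned above, the sketch below reads a Hadoop-style configuration file and reports settings that differ from a hardened baseline. The property names are standard Hadoop settings, but the baseline values, the helper names (`parse_hadoop_conf`, `audit_security`) and the sample configuration are assumptions made for this example. It covers only core Hadoop and HDFS; other frameworks would need their own checks.

```python
import xml.etree.ElementTree as ET

# Baseline for a hardened cluster. The property names are standard Hadoop
# settings; treating these exact values as required is an assumption.
EXPECTED = {
    "hadoop.security.authentication": "kerberos",  # authentication
    "hadoop.security.authorization": "true",       # service-level authorization
    "hadoop.rpc.protection": "privacy",            # in-motion encryption for RPC
    "dfs.encrypt.data.transfer": "true",           # in-motion encryption for HDFS blocks
}


def parse_hadoop_conf(xml_text):
    """Return {name: value} from a Hadoop-style <configuration> document."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            props[name.strip()] = (value or "").strip()
    return props


def audit_security(props, expected=EXPECTED):
    """Return (name, actual, wanted) for each setting that is missing or differs."""
    findings = []
    for name, wanted in expected.items():
        actual = props.get(name)
        if actual is None or actual.lower() != wanted:
            findings.append((name, actual, wanted))
    return findings


# Hypothetical configuration: Kerberos is enabled, but RPC traffic is not
# encrypted and HDFS data transfer encryption is not set at all.
SAMPLE = """
<configuration>
  <property><name>hadoop.security.authentication</name><value>kerberos</value></property>
  <property><name>hadoop.security.authorization</name><value>true</value></property>
  <property><name>hadoop.rpc.protection</name><value>authentication</value></property>
</configuration>
"""

if __name__ == "__main__":
    for name, actual, wanted in audit_security(parse_hadoop_conf(SAMPLE)):
        print(f"{name}: found {actual!r}, expected {wanted!r}")
```

In a real cluster these properties are usually split across `core-site.xml` and `hdfs-site.xml`, so the parsed dictionaries would need to be merged before auditing. The exact string comparison is a simplification as well, since some properties accept more than one value.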
These are just some of the challenges of Hadoop deployment, and they should in no way scare a beginner away from the revolutionary big data technology it presents. There is a host of benefits to be reaped from a proper deployment of Hadoop within an enterprise. Rather, our aim is to encourage beginners to carefully oversee the selection of both distribution and staff, which will determine what kind of story your Hadoop deployment will be.