Hadoop :
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Overview of Hadoop :
Hadoop wasn't the first solution to this problem. Honestly, Google was the grandfather of all of these.
GFS (Google File System) inspired Hadoop's distributed storage (HDFS), and Google's MapReduce inspired Hadoop's distributed processing.
What is Hadoop used for ?
Hadoop is a popular open-source framework used for distributed storage and processing of large volumes of data. It is primarily designed to handle big data applications, where traditional databases and data processing tools may fall short.
Hadoop's core components include:
Hadoop Distributed File System (HDFS): A distributed file system designed to store large datasets across multiple commodity hardware nodes. It provides fault tolerance by replicating data across different nodes.
MapReduce: A programming model for processing and analyzing large datasets in parallel across a Hadoop cluster. It divides the data into smaller chunks and processes them in parallel, making it suitable for batch processing and data-intensive tasks.
Hadoop is used for various purposes, including:
Big Data Storage: Hadoop's HDFS allows organizations to store vast amounts of structured and unstructured data across a cluster of machines.
Data Processing: Hadoop's MapReduce paradigm allows for the distributed processing of data, making it easier to process and analyze large datasets efficiently.
Data Analytics: Hadoop is widely used for data analytics tasks, such as data mining, machine learning, and statistical analysis, due to its ability to handle massive datasets and parallel processing capabilities.
Log Processing: Hadoop can process and analyze logs generated by various applications and systems, providing valuable insights and detecting anomalies.
ETL (Extract, Transform, Load) Pipelines: Hadoop is used to build and execute ETL pipelines for ingesting, transforming, and loading data from various sources into a data warehouse or other data storage systems.
Recommendation Systems: Hadoop's ability to handle large datasets and perform distributed computations makes it suitable for building recommendation systems used in various domains, such as e-commerce and content platforms.
IoT (Internet of Things) Data Processing: With the rise of IoT devices generating massive amounts of data, Hadoop can be used to process and analyze this data at scale.
It's worth noting that the big data ecosystem evolves rapidly, so there may be newer technologies or shifts in trends related to big data and data processing.
Overview of the Hadoop Ecosystem :
Major components in Hadoop :
Core Hadoop Ecosystem :
HDFS : Hadoop Distributed File System : Remember we spoke about GFS; HDFS is the Hadoop version of that. It is the system that allows us to distribute the storage of big data across our cluster of computers, so that all of our hard drives look like one giant file system.
Not only that, it also maintains redundant copies of that data, so a failed node does not mean lost data.
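To make that idea concrete, here is a toy sketch of block splitting and replication. This is not HDFS code; the tiny block size, the node names, and the round-robin placement are simplified assumptions (real HDFS defaults to 128 MB blocks and a replication factor of 3, and places replicas rack-aware):

```python
# Toy sketch (NOT real HDFS code): split a file into fixed-size blocks
# and place each block's replicas on several different nodes.
BLOCK_SIZE = 8          # real HDFS uses 128 MB blocks
REPLICATION = 3         # HDFS default replication factor
nodes = ["node1", "node2", "node3", "node4"]  # hypothetical cluster

def place_blocks(data: bytes):
    placements = {}
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    for idx, _block in enumerate(blocks):
        # simple round-robin placement; real HDFS is rack-aware
        replicas = [nodes[(idx + r) % len(nodes)] for r in range(REPLICATION)]
        placements[idx] = replicas
    return placements

print(place_blocks(b"some big data file content!"))
```

Because every block lives on three nodes, losing any single machine still leaves two copies of each block, which is the essence of HDFS fault tolerance.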
Now, sitting on top of HDFS, we have YARN :
YARN : Yet Another Resource Negotiator
YARN is where the data-processing part of Hadoop lives. It is the system that manages the resources on your computing cluster: it decides which tasks get to run and when, which nodes are available for extra work, and which ones are not. It is like the heartbeat that keeps your cluster going.
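As a rough illustration of what a resource negotiator does, here is a toy Python sketch. It is not real YARN code, and the node names and slot counts are made up; it only shows the core job of matching pending tasks to nodes that still have free capacity:

```python
# Toy sketch (NOT real YARN code) of resource negotiation:
# assign each pending task to the node with the most free slots.
nodes = {"node1": 2, "node2": 1, "node3": 0}  # free task slots per node
tasks = ["map-1", "map-2", "reduce-1"]

assignments = {}
for task in tasks:
    # pick the node with the most free slots
    node = max(nodes, key=nodes.get)
    if nodes[node] == 0:
        break  # cluster is full; the task waits, just as YARN queues work
    assignments[task] = node
    nodes[node] -= 1

print(assignments)
```

Real YARN of course also tracks node health, handles failures, and arbitrates between competing applications, but the "who runs what, where, and when" decision above is the heart of it.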
Given that we have this resource negotiator, we can build interesting applications on top of it. One of them is MapReduce, which again is a piece of Hadoop proper. MapReduce, at a very high level, is a programming model that allows you to process your data across an entire cluster.
It consists of mappers and reducers. These are different scripts that you might write, or maybe different functions within a single MapReduce program. Mappers can transform your data in parallel across your entire computing cluster in a very efficient manner, and reducers aggregate that data back together.
It may seem like a simple model, but it is very versatile.
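Here is a minimal local simulation of that model in plain Python: a mapper emits (key, value) pairs, a shuffle step groups values by key, and a reducer aggregates each group. This is a sketch of the flow, not actual Hadoop code; on a real cluster the framework would run mappers and reducers on different nodes:

```python
from collections import defaultdict

# Local simulation (NOT real Hadoop code) of the classic word-count job.

def mapper(line):
    # the map phase: emit (word, 1) for every word in the input line
    for word in line.lower().split():
        yield word, 1

def reducer(word, counts):
    # the reduce phase: aggregate all the values emitted for one key
    return word, sum(counts)

lines = ["the quick brown fox", "the lazy dog", "the fox"]

# the shuffle phase: group every mapper output by its key
groups = defaultdict(list)
for line in lines:
    for word, count in mapper(line):
        groups[word].append(count)

result = dict(reducer(w, c) for w, c in groups.items())
print(result["the"])  # -> 3
```

On Hadoop, the same mapper and reducer logic would run in parallel on many machines, with the framework handling the shuffle over the network.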
Originally, MapReduce and YARN were a single component in Hadoop. They were split apart in Hadoop 2, and that has enabled other applications to be built on top of YARN that solve the same problems as MapReduce, but more efficiently.
And sitting on top of MapReduce, we have technologies such as Pig.
Pig :
If you do not want to write Java or Python MapReduce code, and you are more familiar with scripting languages or a SQL-style syntax, Pig is for you.
Pig is a very high-level programming API that allows you to write simple scripts, which in some cases look a lot like SQL, to chain together queries and get complex answers without writing Java or Python code in the process.
Pig transforms your script into something that runs on MapReduce, which in turn goes through YARN and HDFS to process the data it needs, and you get the answer you want.
That's Pig: a high-level language that sits on top of MapReduce.
Hive :
We will also speak about Hive, which sits on top of MapReduce as well. It solves a similar problem to Pig, but it looks even more directly like a SQL database. Hive is a way of taking SQL queries and making the distributed data sitting on your file system look like a SQL database. You can connect to it via a shell client or ODBC and execute SQL queries on the data stored on your Hadoop cluster, even though it is not really a relational database under the hood.
So if you are familiar with SQL, Hive might be a very useful interface for you.
Apache Ambari :
Apache Ambari sits on top of everything and gives you a view of your cluster: it lets you visualize what's running on the cluster, which systems you are using, and how many resources they consume. It also has views that allow you to do things like execute Hive queries, import databases into Hive, or run Pig queries.
So Ambari sits on top of all of those and gives you a view into the actual state of your cluster and the applications running on it. There are other technologies that do this as well; Ambari is what Hortonworks uses.
MESOS : Mesos is not really part of Hadoop proper, but I am adding it here because it is basically an alternative to YARN. It too is a resource negotiator, and the two solve much the same problems. There are pros and cons to using either, and there are ways to make Mesos and YARN work together.
And we bring up Mesos because we are going to talk about Spark, which is one of the most exciting technologies in the Hadoop ecosystem. Spark sits at the same level as MapReduce and runs on top of either YARN or Mesos; you can go either way. It actually runs queries on your data.
Like MapReduce, it requires some programming. You write your Spark scripts using Python, Java, or the Scala programming language, with Scala being preferred. Spark is extremely fast and is under very active development.
So if you need to reliably and efficiently process data on a Hadoop cluster, Spark is a really good choice. It is also very versatile: it can handle SQL queries, do machine learning across an entire cluster of information, and handle streaming data in real time.
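To give a flavor of that style, here is a pure-Python sketch of a Spark-like chain of transformations over some made-up log lines. This is not PySpark (a real job would create a SparkSession and distribute the work across the cluster); it only mirrors the shape of filter/flatMap/count operations locally:

```python
from collections import Counter

# Pure-Python sketch of the Spark programming style (NOT actual PySpark).
# Hypothetical input data:
log_lines = [
    "INFO starting job",
    "ERROR disk failure on node3",
    "INFO job finished",
    "ERROR disk failure on node7",
]

errors = filter(lambda line: line.startswith("ERROR"), log_lines)  # like rdd.filter(...)
words = (word for line in errors for word in line.split())         # like rdd.flatMap(...)
counts = Counter(words)                                            # like rdd.countByValue()

print(counts["disk"])  # -> 2
```

In real Spark, each step in the chain is lazy and is only executed, in parallel across the cluster, when a final action such as a count is requested.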
TEZ :
Tez is a data processing framework that runs on top of Apache Hadoop. It is designed to improve the performance and efficiency of big data processing tasks by providing a more flexible and optimized execution engine. Tez replaces the traditional MapReduce engine with a more expressive and powerful computational model.
Here are some key features and concepts related to Tez:
Directed Acyclic Graph (DAG): Tez represents data processing tasks as a DAG, where vertices represent processing tasks, and edges represent data flow between tasks. This allows for complex data processing workflows with multiple stages and dependencies.
Data Locality Optimization: Tez leverages Hadoop's data locality feature to minimize data movement across the network. It aims to execute tasks on nodes where the data resides, reducing the time and network bandwidth required for data transfer.
Fine-Grained Task Execution: Unlike MapReduce, which processes data in fixed-size chunks, Tez allows for fine-grained task execution. It enables tasks to process smaller units of data, leading to better resource utilization and improved performance.
Reusability and Dynamic Task Scheduling: Tez provides a framework for reusable components called "processors." These processors can be combined to create complex data processing workflows. Additionally, Tez supports dynamic task scheduling, allowing it to adapt to changing workloads and allocate resources efficiently.
Performance Optimization: Tez optimizes the execution of DAGs by analyzing the data flow and applying various optimization techniques. It includes features like pipelining, data fusion, and dynamic partitioning, which can significantly improve processing performance.
Compatibility with Existing Hadoop Ecosystem: Tez is designed to integrate seamlessly with other Hadoop components and tools. It can work with Hive, Pig, and other data processing frameworks, enabling them to take advantage of Tez's performance optimizations.
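As a minimal illustration of the DAG idea (using Python's standard library, not Tez's actual API), here is a hypothetical query pipeline whose stages run in dependency order, with each vertex a processing step and each edge a data dependency:

```python
from graphlib import TopologicalSorter

# Toy DAG sketch (NOT Tez's API): each stage lists the stages it depends on.
dag = {
    "read_a": set(),
    "read_b": set(),
    "filter_a": {"read_a"},
    "filter_b": {"read_b"},
    "join": {"filter_a", "filter_b"},   # join needs both filtered inputs
    "aggregate": {"join"},
}

# An execution engine like Tez schedules stages so that every stage's
# dependencies finish first; a topological order captures exactly that.
order = list(TopologicalSorter(dag).static_order())
print(order)  # reads first, then filters, then join, then aggregate
```

Because the whole workflow is one DAG rather than a chain of separate MapReduce jobs, an engine can skip writing intermediate results to disk between stages, which is a large part of where Tez's speedup comes from.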
Tez simplifies the development of complex data processing workflows and provides significant performance improvements over traditional MapReduce. It has been widely adopted in the Hadoop ecosystem and is used by various organizations for large-scale data processing and analytics tasks.
Apache HBASE:
HBase is a NoSQL database that sits on top of HDFS. It is a way of exposing the data on your cluster to transactional platforms, allowing very fast reads and writes at large scale.




