To start, let's learn in simple terms what these two big words mean and how they relate to each other. First of all, big data consists of a mix of millions of numbers and letters that contain a lot of information; by itself, it doesn't mean anything, but with the help of some software, that information can be analyzed and taken advantage of by businesses and companies, creating content and experiences much more personalized than ever before. This big data comes from text, audio, video, and images, from using social media or doing almost anything online, like shopping.

Because this information is so big and heavy (millions of gigabytes), keeping it on a single flash drive is impossible, and even thousands of flash drives wouldn't help, because new data arrives in tons every second of the day and it would be impossible to keep up. Here is where terms like Hadoop come into play. Hadoop is an open-source software framework, whose name came from a toy elephant, that provides distributed storage: it divides the files from the big data into smaller blocks and keeps them spread across a big online warehouse of machines. The software has different layers through which it divides and analyzes the information: HDFS (Hadoop Distributed File System), where the divided information is stored; MapReduce, the processing engine, which divides the task submitted by the user into several independent subtasks; and YARN, which provides resource management.
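To make the storage part more concrete, here is a minimal sketch of writing a file into HDFS with Hadoop's Java FileSystem API. It assumes a running cluster whose NameNode address is already configured in core-site.xml on the classpath; the file path used here is made up for illustration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        // Loads the default Hadoop configuration (core-site.xml, hdfs-site.xml);
        // the fs.defaultFS setting there points at the cluster's NameNode.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Write a file; HDFS transparently splits large files into blocks
        // (128 MB by default) and spreads the blocks across DataNodes.
        Path path = new Path("/user/demo/hello.txt"); // hypothetical path
        try (FSDataOutputStream out = fs.create(path)) {
            out.writeUTF("Hello, HDFS!");
        }
        fs.close();
    }
}
```

Notice that the client never talks to one big warehouse machine: the NameNode only hands out block locations, and the actual bytes flow straight to the DataNodes.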
Another part that is important to understand is the ecosystem of the software. There are four crucial processes called daemons, which all run in the background of Hadoop: the NameNode, located on the master node of HDFS; the DataNode, located on the slave nodes of HDFS; the Resource Manager, which runs on the YARN master node for MapReduce; and the Node Manager, which runs on the YARN slave nodes for MapReduce.
Now, when you put all of these puzzle pieces together, they form a huge system that can be kind of complex, but here are the four main steps that you need to know, summarized:
- Input data is broken into blocks of 128 MB (the default block size), and the blocks are moved to different nodes.
- Once all the blocks of the data are stored on DataNodes, the user can process the data (a small example program follows this list).
- The Resource Manager then schedules the program that was submitted by the user on individual nodes.
- Once all the nodes process the data, the output is written back to HDFS.
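Here is what the user-submitted program from those steps can look like: the classic WordCount job, a minimal sketch in Java close to the example in the Hadoop documentation. The mapper runs next to the stored blocks and emits a (word, 1) pair for every word it sees; the reducer adds the counts up; the input and output HDFS paths are passed in on the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Runs on the nodes that hold the input blocks: emits (word, 1) per word.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Receives all the counts for one word and sums them.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on each node
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input already in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output written back to HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

You would package this into a JAR and submit it with something like `hadoop jar wordcount.jar WordCount /input /output` (the paths are hypothetical); YARN's Resource Manager then hands the map and reduce tasks to Node Managers across the cluster, exactly as in the steps above.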
Now that you've understood the basic concept of how big data and Hadoop relate and how these two work together, you'll be more capable of understanding their importance in today's world and why, each day, more and more companies are turning to this kind of software to make their businesses even more successful.