Modern Solution for Big Data: A Distributed Approach

Yashraj Oswal
5 min read · Sep 17, 2020
Picture reference: NL Dalmia www.nldalmia.in

Big data refers to the large, diverse sets of information generated from many different sources; we can also think of it as a modern-day problem faced by almost every organization. To paint a small picture, take social media: the number of users keeps growing, and with that growth the amount of data they generate grows too, through posted text, photos, and videos. There is therefore a tremendous need for a way to handle this data smartly, and as the phrase goes, "a modern-day problem requires a modern solution."

Big data is commonly described by three characteristics, known as the 3 V's: Volume, Variety, and Velocity. In short, these are the large Volume of data in many environments, the wide Variety of data (text, pictures, videos), and the Velocity, i.e., the speed at which data is generated.

1) Volume — The name Big Data itself refers to an enormous size. The size of data plays a crucial role in determining the value of data, and whether a particular dataset can actually be considered Big Data depends on its volume. Hence, volume is one characteristic that must be considered when dealing with Big Data.

2) Variety — The next aspect of Big Data is its variety. Variety refers to the heterogeneous sources and nature of data, both structured and unstructured. Data in the form of emails, photos, videos, monitoring-device feeds, PDFs, audio, etc. is now considered in analysis applications. This variety of unstructured data poses certain issues for storing, mining, and analyzing data.

3) Velocity — The term velocity refers to the speed at which data is generated. How fast data is generated and processed to meet demand determines the real potential in the data. Big Data velocity deals with the speed at which data flows in from sources such as business processes, application logs, networks, social media sites, sensors, and mobile devices. The flow of data is massive and continuous.

Where there is a problem, there is always a solution.

The solution for Big Data is a distributed approach, popularly known as Hadoop. Hadoop is developed by the Apache Software Foundation, a community that offers open-source software to users. Hadoop is written in Java, so at installation time you need to install Java, i.e., the JDK (Java Development Kit), so that Hadoop can run properly on the system without installation issues.

Basically, Hadoop is an open-source framework used to store data and run applications on clusters. It provides huge data storage, enormous processing power, and the ability to handle a virtually limitless number of concurrent tasks.
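
To make this concrete, here is a minimal sketch of storing a file in HDFS and reading it back through Hadoop's Java API. It assumes the Hadoop client libraries are on the classpath and a single-node cluster listening at hdfs://localhost:9000; both that address and the /user/yashraj/hello.txt path are illustrative assumptions, not details from this article.

    import java.nio.charset.StandardCharsets;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsHello {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Assumed address of a local single-node cluster; adjust to your setup.
            conf.set("fs.defaultFS", "hdfs://localhost:9000");

            try (FileSystem fs = FileSystem.get(conf)) {
                Path file = new Path("/user/yashraj/hello.txt"); // hypothetical path

                // Write: the client asks the NameNode where to place blocks,
                // then streams the bytes to DataNodes.
                try (FSDataOutputStream out = fs.create(file, true)) {
                    out.write("Hello, HDFS!".getBytes(StandardCharsets.UTF_8));
                }

                // Read the file back through the same API.
                try (FSDataInputStream in = fs.open(file)) {
                    byte[] buf = new byte[64];
                    int n = in.read(buf);
                    System.out.println(new String(buf, 0, n, StandardCharsets.UTF_8));
                }
            }
        }
    }

Note that the create and open calls talk to the NameNode first; the bytes themselves flow to and from the DataNodes, which is exactly the division of labor described in the architecture below.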

Architecture of Hadoop:

Hadoop follows a master-slave architecture comprising one NameNode, considered the master node, and several DataNodes, known as slave nodes. The master node is connected to the DataNodes through the network. The master directs the slave nodes to perform the specified functions, and the nodes work together through HDFS, the Hadoop Distributed File System.

Role of NameNode:

  • It is the master daemon that maintains and manages the DataNodes (slave nodes).
  • It records the metadata of all the files stored in the cluster, e.g. the location of stored blocks, the size of the files, permissions, hierarchy, etc. (the sketch after this list shows this metadata being queried).
  • It records every change made to the metadata; e.g. if a file is deleted from HDFS (Hadoop Distributed File System), it immediately records the change in the EditLog.
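
As a hedged illustration of the metadata the NameNode serves, the sketch below asks for a file's status and block locations through the HDFS Java API; the fs.defaultFS address and the file path are the same assumptions carried over from the earlier example.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsMetadata {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://localhost:9000"); // assumed address

            try (FileSystem fs = FileSystem.get(conf)) {
                Path file = new Path("/user/yashraj/hello.txt"); // hypothetical path

                // All of this information comes from the NameNode's metadata,
                // not from the DataNodes that hold the actual bytes.
                FileStatus status = fs.getFileStatus(file);
                System.out.println("size: " + status.getLen() + " bytes");
                System.out.println("permissions: " + status.getPermission());
                System.out.println("replication: " + status.getReplication());

                for (BlockLocation block
                        : fs.getFileBlockLocations(status, 0, status.getLen())) {
                    System.out.println("block at offset " + block.getOffset()
                            + " on hosts " + String.join(", ", block.getHosts()));
                }
            }
        }
    }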

Role of DataNode:

  • These are slave daemons, or processes, that run on each slave machine.
  • The actual data is stored on the DataNodes.
  • The DataNodes serve the low-level read and write requests from the file system’s clients.
  • They send heartbeats to the NameNode periodically to report the overall health of HDFS; by default, this interval is set to 3 seconds (see the configuration sketch after this list).
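
The 3-second figure is the default of the dfs.heartbeat.interval setting, normally configured in hdfs-site.xml. As a minimal sketch, the snippet below reads that setting through Hadoop's Configuration API, falling back to the default when it is not set explicitly.

    import org.apache.hadoop.conf.Configuration;

    public class HeartbeatInterval {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // "dfs.heartbeat.interval" controls how often a DataNode heartbeats
            // the NameNode, in seconds; 3 is the shipped default.
            long seconds = conf.getLong("dfs.heartbeat.interval", 3);
            System.out.println("DataNode heartbeat interval: " + seconds + "s");
        }
    }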

Organizations that have adopted the distributed approach as their solution:

  1. Facebook:

An amazing fact about Facebook: it generates around 4 petabytes of data per day, which is equivalent to 4,000 TB. Calculated on a monthly basis, that is around 120 petabytes of data, which is really huge, or as we say, Big Data. According to surveys, only 15 to 20% of this data is directly useful; the rest is unstructured data such as photos and videos. To handle such big data, Facebook has also adopted this modern solution, the distributed approach, which can handle large data easily. Following is the approach used by Facebook, a popularly known architecture in the Hadoop ecosystem.

MapReduce Architecture

MapReduce is a programming model applied to big data so that unstructured data can be turned into structured results. Conceptually, the map phase groups data of similar kinds into blocks, and it keeps on mapping until all data of similar kinds has been gathered; the reduce phase then collapses each group of similar data into a single final block. By repeatedly mapping and reducing, the final output is structured data with a shared meaning. Mapping and reducing are, in short, how the data gets processed.
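
The canonical illustration of this is word count: the map step emits a (word, 1) pair for every word it sees, and the reduce step sums the counts collected for each word. The following is a minimal sketch against Hadoop's org.apache.hadoop.mapreduce API, adapted from the standard example; the input and output paths come from the command line.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map: emit (word, 1) for every word in the input line.
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce: sum the counts collected for each word.
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            @Override
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class); // combine locally before shuffling
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

A hypothetical invocation would look like: hadoop jar wordcount.jar WordCount /input /output.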

Consider a real-life example:

We all use this idea in our day-to-day lives. When we save a contact number on our mobile phones, we see two options: create a new contact, or add to an existing contact. When we create a new contact and save it, we have mapped a number to a particular block; when we save the number under an existing contact, we are applying the reduce step to the previously created, mapped block. The conclusion: we map new data into a new block, and when data belongs to a similar block, we reduce it by merging it into the existing block.
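
To mirror the analogy in code, here is a tiny in-memory sketch (plain Java, no Hadoop) where a new name maps a number into a new block and an existing name reduces the number into the block already created; the names and numbers are made up for illustration.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class ContactBlocks {
        public static void main(String[] args) {
            // Each entry is one "save contact" action: name + number.
            String[][] saves = {
                {"Alice", "555-0100"}, {"Bob", "555-0101"}, {"Alice", "555-0102"},
            };

            Map<String, List<String>> blocks = new HashMap<>();
            for (String[] save : saves) {
                // New name -> map into a new block; existing name -> reduce into it.
                blocks.computeIfAbsent(save[0], k -> new ArrayList<>()).add(save[1]);
            }
            // e.g. {Bob=[555-0101], Alice=[555-0100, 555-0102]} (order not guaranteed)
            System.out.println(blocks);
        }
    }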

  2. Google:

Google currently processes over 20 petabytes of data per day through an average of 100,000 MapReduce jobs spread across its massive computing clusters, so it too has made use of the distributed approach to handle huge amounts of data.

Today, almost every large organization has adopted this approach in some form.

Facts about Data generated globally:

Over 2.5 quintillion bytes of data are created every single day, and that number is only going to grow. By 2020, an estimated 1.7 MB of data was being created every second for every person on earth.

In the end, I just want to say that every day, every hour, every minute, every second, data is generated from numerous sources. To handle such data and extract as much structured, meaningful information as possible from this huge mass of data, the modern-day solution, the distributed approach, will clearly help in managing it.
