Saturday, 11 March 2017

Hadoop - Storage

Hadoop - Storage, i.e. HDFS (Hadoop Distributed File System)


Hadoop supports a distributed file system, which means your data can be stored across multiple machines in the cluster.


These are a few things you need to consider while working with HDFS:


1. Block size : This is the size of the pieces your data is split into. You decide it when configuring Hadoop on your cluster. The default is 128 MB (Hadoop version 2).

If the block size were set to less than 64 MB, there would be a huge number of blocks throughout the cluster.
This forces the master machine to manage an enormous amount of metadata.


Configuration parameter & value :
dfs.blocksize = 134217728
The value above is in bytes (equivalent to 128 MB).
Configuration file :
hdfs-site.xml

<property>
  <name>dfs.blocksize</name>
  <value>134217728</value>
</property>
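
To double-check the byte value before putting it into hdfs-site.xml, a quick conversion can help. This is only a minimal sketch; the helper name mb_to_bytes is hypothetical and not part of Hadoop.

```python
BYTES_PER_MB = 1024 * 1024  # 1 MB taken as 1024 * 1024 bytes, as in this post


def mb_to_bytes(megabytes: int) -> int:
    """Convert a size in MB to bytes (hypothetical helper, not a Hadoop API)."""
    return megabytes * BYTES_PER_MB


# 128 MB expressed in bytes, the value used for dfs.blocksize above
print(mb_to_bytes(128))
```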

Example :
1. Block size : 128 MB (the default)
Suppose you have 1 GB of data = 1024 MB.
With a block size of 128 MB, 1024 / 128 = 8 blocks are created.
The master machine stores the information about your data, such as which slave machine holds each block.
2. Block size : 32 MB
Suppose you have 1 GB of data = 1024 MB.
With a block size of 32 MB, 1024 / 32 = 32 blocks are created.
As you can see, the number of blocks increases, so the master machine has to keep information about 32 blocks instead of 8.
Note : It is advisable to use a block size of 128 MB or more.
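
The arithmetic in the two examples above can be sketched in a few lines. This is an illustration only; block_count is a hypothetical helper, and it rounds up because a file that does not divide evenly still needs one extra, partially filled block.

```python
import math


def block_count(file_size_mb: int, block_size_mb: int) -> int:
    """Number of blocks needed for a file (hypothetical helper).

    Rounds up, since the last block may be only partially filled.
    """
    return math.ceil(file_size_mb / block_size_mb)


# 1 GB of data with the two block sizes discussed above
for block_size_mb in (128, 32):
    print(block_size_mb, "MB blocks ->", block_count(1024, block_size_mb), "blocks")
```

A 300 MB file with 128 MB blocks would need block_count(300, 128) = 3 blocks, the last one holding only 44 MB.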


2. Block Replication : 


Using this we can keep replicas (backup copies) of each block.
Configuration parameter & value :
dfs.replication = 3
This is the default value of dfs.replication if none is provided.
Configuration file :
hdfs-site.xml

<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>


Example :
When it is useful :
If the data on any slave machine gets corrupted, or the machine goes down (i.e. it is unable to provide data), the replication parameter plays an important role.
Let's see how the process works :
Suppose we have 256 MB of data. With a block size of 128 MB, it is split into 2 blocks (as we learnt above).
With a replication factor of 3, each of the 2 blocks is stored on 3 different machines.
If one slave machine fails, the data is still available on the other 2 slave machines.
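
The replication example above can be worked through in code. This is a rough sketch under stated assumptions: replica_summary is a hypothetical helper, every replica of a block is assumed to sit on a different slave machine, and "tolerated failures" simply counts how many machines holding a block can be lost while one copy remains.

```python
import math


def replica_summary(file_size_mb: int, block_size_mb: int = 128, replication: int = 3) -> dict:
    """Summarise blocks and copies for one file (hypothetical helper)."""
    blocks = math.ceil(file_size_mb / block_size_mb)
    return {
        "blocks": blocks,                          # logical blocks of the file
        "block_replicas": blocks * replication,    # physical copies across the cluster
        "raw_storage_mb": file_size_mb * replication,
        # assumption: each replica is on a different machine
        "tolerated_failures": replication - 1,
    }


# 256 MB of data, 128 MB blocks, replication factor 3
print(replica_summary(256))
```

The trade-off is visible in raw_storage_mb: a replication factor of 3 means the cluster stores three times the size of the original data.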

3. File System Permissions :

In HDFS you need to set file permissions. To give full access to Owner and only read and execute access to Group and Other, set the umask value to 022.

<property>
  <name>fs.permissions.umask-mode</name>
  <value>022</value>
</property>

As we can see, it is a 3-digit number (022), where the first digit (0) is for Owner, the second (2) is for Group and the third (2) is for Other.


Each digit is a number from 0 to 7, representing the type of permission as listed below :

0 : read, write and execute
1 : read and write
2 : read and execute
3 : read only
4 : write and execute
5 : write only
6 : execute only
7 : no permissions

Suppose we set fs.permissions.umask-mode to 077. Then it means :
0 - read, write and execute for Owner
7 - no permissions for Group
7 - no permissions for Other
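
The umask works by removing bits from a base permission, which is why 0 leaves everything and 7 leaves nothing. Here is a minimal sketch of that calculation; apply_umask is a hypothetical helper, and the base modes (777 for directories, 666 for files) are assumed to be the usual defaults rather than something stated in this post.

```python
def apply_umask(base_mode: int, umask: int) -> str:
    """Return the effective permission as a 3-digit octal string (hypothetical helper)."""
    effective = base_mode & ~umask  # clear every bit that is set in the umask
    return format(effective, "03o")


# umask 022: full access for Owner, read and execute for Group and Other
print("directory with 022:", apply_umask(0o777, 0o022))
# assumption: files start from 666, so execute is never granted by default
print("file with 022:     ", apply_umask(0o666, 0o022))
# umask 077: everything for Owner, nothing for Group and Other
print("directory with 077:", apply_umask(0o777, 0o077))
```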





For more parameters, please refer to Apache Hadoop - hdfs-site.xml parameters.





