The fundamental difference between traditional databases such as Oracle, SQL Server, and DB2 and Hadoop is Schema on Write vs. Schema on Read.
Schema on Write
The steps are as below:
Step 1: The first step is to create the schema, i.e. define the table structure. For example:
CREATE TABLE EMP
(
  Ename  STRING,
  EmpID  INT,
  Salary FLOAT,
  ...
);
Step 2: Only once the table exists can we load data into it. For example, bulk load data into the EMP table from the emp.txt file:
BULK INSERT EMP
FROM 'C:\EMPDATA\emp.txt'
WITH (FIELDTERMINATOR = ',');
Step 3: Once the data is loaded, we can query it using a SELECT statement. For example:
SELECT Ename,Salary,... FROM EMP;
The above three steps demonstrate the Schema on Write approach that traditional databases follow. It is important to note that we cannot add data to the table unless the schema has already been declared.
If the data for a given column changes, say its data type changes from INT to VARCHAR2, or a new column is added to the table, then the existing data for that column has to be deleted and re-loaded. This is manageable when we have a small data set and no foreign keys, but when the table holds terabytes of data and has foreign keys, it becomes a genuinely challenging problem, as the sketch below illustrates.
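To make this concrete, here is a minimal sketch in Oracle syntax (assuming the EMP table above already contains rows; the specific statements and the VARCHAR2(10) length are illustrative, not part of the original example):
-- Changing the data type of a populated column is rejected by Oracle:
ALTER TABLE EMP MODIFY (EmpID VARCHAR2(10));
-- ORA-01439: column to be modified must be empty to change datatype

-- The column data has to be copied aside, the column emptied, the type changed,
-- and the data re-loaded, which is painful for terabytes of data or foreign-key dependencies.
UPDATE EMP SET EmpID = NULL;
ALTER TABLE EMP MODIFY (EmpID VARCHAR2(10));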
Hadoop and other big data technologies generally use Schema on Read, which follows a different sequence.
Schema on Read
Step 1: Load the data onto the Hadoop cluster.
hdfs dfs -copyFromLocal /tmp/EMP.txt /usr/hadoop/emp
Step 2: Query the data using a Python script, a Hive query, or any other means. For example:
hive> SELECT * FROM EMP;
or
hadoop jar Hadoop-Emp.jar -mapper emp-map.py -reducer emp-red.py -input /usr/hadoop/emp/*.txt -output /usr/hadoop/output/query1A
Here, the data structure is interpreted as the data is read, through the Python script or Hive query shown above. If a column is added to the table or the data type of a column changes, we simply adjust the script (or the table definition) to read the data; we do not need to reload anything, as sketched below.
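As a minimal sketch of this idea (assuming Hive and an external table over the /usr/hadoop/emp directory; the Dept column and the exact DDL are illustrative assumptions, not the article's setup), the table definition can be replaced without touching the files already stored in HDFS:
-- An EXTERNAL table only describes how to read the files already in HDFS.
CREATE EXTERNAL TABLE EMP (Ename STRING, EmpID INT, Salary FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/usr/hadoop/emp';

-- Suppose a new Dept column starts appearing in the data files. Dropping an external table
-- does not delete the underlying files, so we redefine the schema and query again:
DROP TABLE EMP;
CREATE EXTERNAL TABLE EMP (Ename STRING, EmpID INT, Salary FLOAT, Dept STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/usr/hadoop/emp';

SELECT Ename, Dept FROM EMP;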
Let us understand the above theory with the help of the example below.
Consider a USER table with two columns, NAME (VARCHAR) and AGE (INT), where the sample data contains a row with the value 123 for NAME and the value 'XYZ' for AGE.
When we try to write this sample data into the USER table in a traditional database, it throws an error: NAME is a VARCHAR column and we are trying to insert integer data into it, while AGE is an integer column and we are trying to insert the VARCHAR value 'XYZ' into it. The schema is therefore verified while writing the data. A traditional database has total control over its storage, which gives it the ability to enforce the schema as data is written. This is called Schema on Write.
When it comes to Hadoop or other big data technologies, there is no such control over storage. When we try to load the above data into a Hive table on HDFS, the load succeeds. Only while reading the data does Hive apply the schema: a value that cannot be interpreted as the declared type, such as 'XYZ' in the integer AGE column, is returned as NULL. The data is verified while querying it, hence Schema on Read: in Hadoop and other big data technologies the schema is not checked when data is loaded; the schema check happens while the data is read.
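As a minimal sketch of this behaviour (assuming Hive, a comma-delimited file /tmp/user.txt containing the row 123,XYZ, and the illustrative table name user_tbl):
-- Define the table; Hive records the schema but does not validate incoming data against it.
CREATE TABLE user_tbl (name STRING, age INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- The load succeeds even though 'XYZ' is not a valid integer; there is no schema check at write time.
LOAD DATA LOCAL INPATH '/tmp/user.txt' INTO TABLE user_tbl;

-- The schema is applied only when the data is read: 'XYZ' cannot be interpreted as INT,
-- so AGE comes back as NULL for that row.
SELECT name, age FROM user_tbl;
-- 123    NULL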