Saturday, June 30, 2018

A coworker was telling me a bit about Hadoop yesterday.

You might use Hadoop in big data scenarios. Replacing an ETL process or running "heavy" queries are pretty good use cases. Queries can sure be slow when you are crawling millions of records, and especially so when the query also has to crawl several racks of computers spread across several data centers and look through all of the machines. One of two things Hadoop does to make this fast is to keep metadata records of ranges (say, if you were searching people by last name in data as big as Facebook's) at a metadata server sitting logically "next to" the master name node that runs Hadoop itself and does the querying. The metadata will tell you things like... I dunno... the people with last names that start with Dw are on this computer, in this rack, in this datacenter over here. It would be impossible to find someone you are not connected to on Facebook if Facebook did not have this kind of technology (the querying piece is called Hive) to look someone up. In the chart below, the data nodes are the individual computers and the slaves are the racks in the datacenters. I'll get to what the secondary name node is in a minute.
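Before we get to the chart, here is a toy sketch in Python of that range-lookup idea. To be clear, this is not real HDFS or name node code, and real name node metadata tracks file blocks rather than anything this tidy; the prefixes, racks, and datacenter names below are all made up for illustration.

```python
# Toy illustration only: a master keeping metadata about WHERE data lives,
# so a lookup never has to crawl every machine in every datacenter.
# All ranges and location names here are invented.

# Map of last-name prefix ranges to physical locations.
METADATA = {
    ("Aa", "Dv"): {"datacenter": "dc1", "rack": "rack07", "node": "node3"},
    ("Dw", "Kz"): {"datacenter": "dc1", "rack": "rack12", "node": "node9"},
    ("La", "Zz"): {"datacenter": "dc2", "rack": "rack02", "node": "node1"},
}

def locate(last_name: str) -> dict:
    """Return the one location whose prefix range covers this last name."""
    key = last_name[:2].capitalize()
    for (low, high), location in METADATA.items():
        if low <= key <= high:
            return location
    raise KeyError(f"no range covers {last_name!r}")

print(locate("Dwyer"))
# -> {'datacenter': 'dc1', 'rack': 'rack12', 'node': 'node9'}
```

The point is just that answering "where do the Dw names live?" is a cheap metadata lookup at one server instead of a scan of the whole fleet.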

 
[Diagram: the name node (master) sits at the top with the secondary name node and the metadata server beside it; below them hang eight slaves (the racks), each holding eight data nodes]

HQL is the Hive Query Language, and a rival/alternative to Hive is Impala. Instead of letting the metadata server's cache grow too fat and cause performance problems, which can happen with Hive, Impala pushes query work out to the nodes holding the data being queried. I would assume there has to be some outside indexing at the metadata server too, but maybe I'm wrong. I don't really know what I'm talking about. Ha ha. Cloudera provides distributions of Apache Hadoop and Apache Spark. (The community puts all of this stuff under the Apache Software Foundation by convention.) Hortonworks is a company like Cloudera, and it has ties to IBM. Apparently IBM has a big data university where one may take online classes, and this is where my colleague got his own ramp-up on this sphere of things.

Alright, let me circle back to a couple of things I have name-dropped: Spark and the secondary name node. The other way the Hadoop approach may make an ETL process faster is with Spark, which allows parallel processing for batch processing of records to be written back to the databases at the data nodes, with machine learning to boot. Data will be split and stored on numerous computers, and since we have the performance gain of the parallel processing anyway, we can afford to write duplicate records to different machines to have backups. But what if the master dies? How should THAT be backed up? The secondary name node is a periodic backup of the name node (the Hadoop server) and is never used in any other capacity unless there is indeed a problem one dark day.

At the master we will have a job tracker paired with task trackers out at the nodes, and when a task comes in there is a judgment call made about whether or not it should be approved, with the job running upon approval. This machinery can apply to the write side with Spark and not just the read side with Hive, and this arena of things (task trackers and job trackers) is called the MapReduce level. The chart above, in contrast, shows the structure level, which is the layout of server interactions. Pig and Oozie are two other tools that come into play at the data node granularity, of which my teacher for the moment restrained himself from speaking, suggesting that everything else he had just mentioned was more than enough for an elementary introduction.
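To make the read side and the write side a little more concrete, here is a minimal sketch using PySpark. The table name, columns, and paths are hypothetical, and it assumes a Spark session that has been pointed at a Hive metastore and an HDFS cluster, so take it as a sketch of the shape of things rather than something to paste into production.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hadoop-notes-sketch")
    .enableHiveSupport()   # lets spark.sql() run HiveQL against Hive tables
    .getOrCreate()
)

# Read side: an HQL-style query. Hive (via its metastore) works out which
# files on which data nodes actually hold the relevant rows. The "people"
# table and its columns are made up for this example.
dwyers = spark.sql("""
    SELECT first_name, last_name
    FROM people
    WHERE last_name LIKE 'Dw%'
""")

# Write side: the transformation runs in parallel across the cluster, one
# task per partition of the data, and the result is written back to HDFS.
cleaned = dwyers.dropDuplicates(["first_name", "last_name"])
cleaned.write.mode("overwrite").parquet("hdfs:///warehouse/people_dw")
```

The duplicate records mentioned above do not have to be written by hand, as far as I can tell: HDFS itself copies each block onto multiple machines (the dfs.replication setting, three copies by default), so the backups come along for free with the write.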
