Course Summary
ExpoNential Inc. (host of CloudCon Expo & Conference) is offering this extensive 3-day class on Hadoop platforms. Our experienced instructors have worked extensively with the Hadoop and Cassandra platforms and have deployed various clustering software packages internationally for Fortune 500 clients.
This is a fast-paced, vendor-agnostic technical overview of the Hadoop landscape. No prior knowledge of databases or programming is assumed. This survey course is aimed at both technical and non-technical people who want to understand the emerging world of Big Data, with a specific focus on Hadoop. In each sub-topic, the instructor will provide links and resource recommendations for students who want to explore that area further (for example, YouTube videos, books, blog posts). Students will be given a slide deck that can be used as reference material after the course.
Students will work with real Hadoop clusters and the latest Hadoop distributions. By default, we use Cloudera’s latest Hadoop distribution. However, based on demand, we can also use Hortonworks, MapR, and Hadoop on Windows Azure.
Duration
March 19 - 21, 2013 (9am - 5pm)
Location
The Domain Hotel
1085 El Camino Real, Sunnyvale, CA 94087
Instructor
Avkash Chauhan, Cloud Solution Architect, Microsoft
Cost
One Day | $699
Two Days | $1199
All Three Days | $1499
Save 15% (Use discount code save15now)
Audience
Engineers, Programmers, Networking specialists, Managers, Executives
Software Covered
HDFS, MapReduce, Pig, Hive, HBase
Objectives
- Introduce students to the core concepts of Hadoop
- Deep dive into the critical architecture paths of HDFS, MapReduce and HBase
- Teach the basics of how to effectively write Pig and Hive scripts
- Explain how to choose the correct use cases for Hadoop
- Give each student access to an individual 1-node Hadoop cluster in Rackspace to run through some hands-on labs for the 5 software components: HDFS, MapReduce, Pig, Hive, HBase
- Provide links to the best books, blog posts and videos for students to learn more about Hadoop on their own
Course Outline
Day 1: Introduction to Hadoop Platform
Hadoop
- Parallel Computing vs. Distributed Computing
- Brief history of Hadoop
- Scaling with Hadoop
- Hadoop clusters at Yahoo! and Facebook
- RDBMS/SQL vs. Hadoop
- Hadoop Daemons introduction: NameNode, DataNode, JobTracker, TaskTracker
- Intro to the Hadoop ecosystem: HDFS, MapReduce, Pig, Hive, HBase, ZooKeeper
- Vendor Comparison (Cloudera vs. Hortonworks vs. Amazon EMR vs. HDInsight)
- Hardware + Software recommendations for Hadoop
LAB #1: Hadoop installation, cluster-specific operations, and sample job execution
Introduction to Interactive Console for Hadoop
HDFS
- Linux File system options
- Sample HDFS commands
- HDFS sample architecture at Yahoo!
- Data Locality
- Rack Awareness
- Write Pipeline
- Read Pipeline
- NameNode architecture (EditLog, FsImage, location of replicas, safe mode)
- Secondary NameNode architecture
- DataNode architecture
- Heartbeats
- Block Scanner
- Fsck Health Check + file breakdown
- Balancer
LAB #2: HDFS-specific operations and deep dive
MapReduce
- MapReduce Architecture
- JobTracker/TaskTracker
- Combiner
- Partitioner (shuffle)
- Thinking in the MapReduce way (examples of Mappers & Reducers)
- Counters
- Hadoop Streaming (with Python)
- Hadoop Java example
- Input/output formats
- Speculative Execution
- Distributed Cache
- Job Scheduling (FIFO, Fair Scheduler, Capacity Scheduler)
LAB #3: Understanding MapReduce jobs through execution, and creating your own MapReduce application using Java and JavaScript
Day 2: Data processing in Hadoop
Pig
- Pig philosophy and architecture
- Pig Latin and the Grunt shell
- Loading data
- Data types and schemas
- Pig Latin details: structure, functions, expressions, relational operators
- Intro to User Defined Functions and Scripts
LAB #4: Exploring Pig Latin commands and analyzing larger data sets with Pig
Hive
- Hive philosophy and architecture
- Hive vs. RDBMS
- HiveQL and Hive Shell
- Managing tables
- Data types and schemas
- Querying data
- HiveODBC
LAB #5: Analyzing real-world data using Hive
Sqoop
LAB #6: Using Sqoop with Excel and PowerPivot to Perform Data Analysis
Data Visualization
LAB #7: Creating attractive data visualizations from social network data
Day 3: HBase, Hadoop Streaming and Machine Learning
Real-time I/O with HBase
- HBase versions and origins
- HBase architecture
- HBase core concepts
- HBase vs. RDBMS
- HBase Master and Region Servers
- Data Modeling
- Column Families and Regions
- HBase Internals: Bloom Filters and Block Indexes
- Write Pipeline / Read Pipeline
- Compactions
LAB #8: Exploring HBase commands
Hadoop Streaming
LAB #9: Writing and executing .NET Hadoop Streaming jobs
Machine Learning through Mahout
LAB #10: Exploring Mahout through a sample application
Next-gen Hadoop
- HDFS improvements: HDFS Federation, NameNode HA, Snapshots
- MapReduce improvements: YARN, Performance
- Core concepts of HDInsight
- Understanding Google Bigtable
Reserve Your Space Today!