Course Features
- 4-Day Workshop
- Completion Certificate awarded by GKK
Course Outline
Module 1: Big Data – History, Overview, and Characteristics
- History
- Big Data Definition
- Big Data Benefits
- Big Data Characteristics
- Volume
- Velocity
- Variety
Big Data Technologies – Overview
- Big Data Success Stories
Big Data – Privacy and Ethics
- Privacy – Compliance
- Privacy – Challenges
- Privacy – Approach
- Ethics
Big Data Projects
- Who Should Be Involved?
- What Is Involved?
Module 2: Big Data Sources
2.1 Enterprise Data Sources
- Enterprise Systems
- Oracle
- SAP
- Microsoft
- Data Warehouses
- Unstructured Data – Introduction
- Unstructured Data – Metadata
2.2 Social Media Data Source
- Introduction
- Facebook – Introduction
- Facebook – Public Feed API
- Facebook – Keyword Insights API
- Facebook – Graph API
- Twitter – Introduction
- Twitter – Streaming APIs
- Twitter – REST APIs
- Other Social Media
2.3 Public Data Sources
- Introduction
- Weather
- Economics
- Finance
- Regulatory Bodies
Module 3: Data Mining – Concepts and Tools
3.1 Data Mining – Introduction
- Introduction
- Types of Data Mining – Overview
- Types of Data Mining – Classification
- Types of Data Mining – Association
- Types of Data Mining – Clustering
3.2 Data Mining – Tools
- Introduction
- Weka
- Modules of Weka Applications
- KNIME
- KNIME – Example
- R Language
Module 4: The Hadoop Distributed File System (HDFS)
4.1 Hadoop Fundamentals
- Introduction
- Main Components of Hadoop
- Additional Components of Hadoop
4.2 The Hadoop Distributed File System (HDFS)
- Overview of HDFS
- Launching HDFS in Pseudo-Distributed Mode
- Core HDFS Services
- Installing and Configuring HDFS
- HDFS Commands
- HDFS Safe Mode
- Checkpointing HDFS
- Federated and High Availability HDFS
- Running a Fully-Distributed HDFS Cluster with Docker
4.3 MapReduce with Hadoop
- MapReduce from the Linux Command Line
- Scaling MapReduce on a Cluster
- Introducing Apache Hadoop
- Overview of YARN
- Launching YARN in Pseudo-Distributed Mode
- Demonstration of the Hadoop Streaming API
- Demonstration of MapReduce with Java
Module 5: Apache Spark, Hive, HBase, and Storm
5.1 Introduction to Apache Spark
- Why Spark?
- Spark Architecture
- Spark Drivers and Executors
- Spark on YARN
- Spark and the Hive Metastore
- Structured APIs, DataFrames, and Datasets
- The Core API and Resilient Distributed Datasets (RDDs)
- Overview of Functional Programming
- MapReduce with Python
5.2 Apache Hive
- Hive as a Data Warehouse
- Hive Architecture
- Understanding the Hive Metastore and HCatalog
- Interacting with Hive using the Beeline Interface
- Creating Hive Tables
- Loading Text Data Files into Hive
- Exploring the Hive Query Language
- Partitions and Buckets
- Built-in and Aggregation Functions
- Invoking MapReduce Scripts from Hive
- Common File Formats for Big Data Processing
- Creating Avro and Parquet Files with Hive
- Creating Hive Tables from Pig
- Accessing Hive Tables with the Spark SQL Shell
5.3 Persisting Data with Apache HBase
- Features and Use Cases
- HBase Architecture
- The Data Model
- Command Line Shell
- Schema Creation
- Considerations for Row Key Design
5.4 Apache Storm
- Processing Real-Time Streaming Data
- Storm Architecture: Nimbus, Supervisors, and ZooKeeper
- Application Design: Topologies, Spouts, and Bolts
Module 6: Data Modelling with Document Databases
6.1 MongoDB Fundamentals
- Introduction
- Replication
- Sharding
- Sharding and Replication
- MongoDB Ecosystem – Languages and Drivers
- MongoDB Ecosystem – Hadoop Integration
- MongoDB Ecosystem – Tools
6.2 Install and Configure
- Download
- How to Install and Configure
6.3 Document Databases
- Introduction
- Documents
- Document Design Considerations
- Fields
6.4 Data Modelling with Document Databases
- Introduction
- Twitter Sentiment Analysis
- Twitter Sentiment Analysis – Algorithm
- Network Log Analysis
- Network Log Analysis – Algorithm
FAQ
All trainees should have the following:
i) Required knowledge
- Familiarity with an imperative programming language such as C
- Knowledge of SQL queries
ii) Hardware requirements
Minimum laptop configuration:
- Memory (RAM): 8 GB
- Free disk space: 30 GB
- 4 CPU cores
iii) Software requirements
- Windows or Mac
- Oracle VirtualBox (https://www.virtualbox.org/wiki/Downloads)
Who Should Attend
- Software developers
- IT managers
- Service management professionals
- Technology Managers
Payment Methods
- Cash
- HRDF Claimable
- Maybank Ezpay (Up to 24 months @ 0% Interest)
- CIMB Easy Pay (Up to 24 months @ 0% Interest)
- Cash Installment (Case-by-case basis)