Big data usually refers to data sets whose size is beyond the ability of commonly used software tools to capture, curate, manage, and process within a tolerable elapsed time. The big data philosophy encompasses unstructured, semi-structured, and structured data; however, the main focus is on unstructured data. Big data "size" is a constantly moving target, as of 2012 ranging from a few dozen terabytes to many exabytes of data.

Overview of Big Data Technologies and their role in Analytics

Big Data challenges & solutions

Data Science vs Data Engineering

Four V's of Big Data given by Google

Unix & Java

  • Introduction to the UNIX shell.
  • Basic UNIX commands: create, copy, move, delete, etc.
  • Basics of the Java programming language
  • Java architecture: JVM, JRE, JIT
  • Control Structures
  • OOP concepts in Java
  • String classes, arrays, and exception handling
  • Collection Classes

Apache HDFS

  • Understanding the problem statement and the challenges posed by such large data, to see the need for a distributed file system
  • Understanding how the HDFS architecture solves these problems
  • Understanding configuration and creating a directory structure to solve the given problem statement
  • Setting up appropriate permissions to secure data for the appropriate users
  • Setting up Java Development with HDFS libraries to use HDFS Java APIs
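
As a small illustration of the HDFS architecture topic above, the sketch below works through block arithmetic in plain Java. It uses no HDFS libraries; the class and method names are hypothetical, and the 128 MB block size and replication factor of 3 are assumed to be the cluster defaults (both are configurable).

```java
// Plain-Java sketch of HDFS block arithmetic (no Hadoop classes are used).
// Assumptions: 128 MB block size and replication factor 3, both configurable per cluster.
class HdfsBlockSketch {
    static final long MB = 1024L * 1024L;

    // Number of blocks a file is split into; the last block may be only partly filled.
    static long blockCount(long fileSizeBytes, long blockSizeBytes) {
        if (fileSizeBytes == 0) {
            return 0;
        }
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes; // ceiling division
    }

    // Raw bytes stored across the cluster: every byte is kept once per replica.
    static long rawStorageBytes(long fileSizeBytes, int replication) {
        return fileSizeBytes * replication;
    }

    public static void main(String[] args) {
        long blockSize = 128 * MB;
        long fileSize = 300 * MB;
        System.out.println("blocks: " + blockCount(fileSize, blockSize));           // blocks: 3
        System.out.println("raw MB: " + rawStorageBytes(fileSize, 3) / MB);         // raw MB: 900
    }
}
```

A 300 MB file therefore occupies two full blocks plus one partial block, and each block is stored on three different nodes under the assumed replication factor.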

    Apache MapReduce

    • What is Map Reduce.
    • Input and output formats.
    • Data Types in Map Reduce.
    • Flow of Map Reduce Jobs.
    • Wordcount In Map Reduce.
    • How to use Custom Input Formats
    • Use case for Structured Data Sets.
    • Writing Custom Classes.
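
The word-count flow listed above can be sketched in plain Java. This is only an illustration of the map, shuffle, and reduce phases, not Hadoop code: it uses no Hadoop classes, runs in a single JVM, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of the MapReduce word-count flow (no Hadoop classes are used).
class WordCountSketch {

    // Map phase: emit a (word, 1) pair for every word in one input line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(Map.entry(word, 1));
            }
        }
        return pairs;
    }

    // Shuffle phase: group all emitted values by key, with keys in sorted order.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum the list of values collected for one key.
    static int reduce(List<Integer> counts) {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return sum;
    }

    // Drives the three phases over all input lines.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String line : lines) {
            emitted.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffle(emitted).entrySet()) {
            result.put(e.getKey(), reduce(e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints {big=2, data=2, everywhere=1, is=2}
        System.out.println(wordCount(List.of("big data is big", "data is everywhere")));
    }
}
```

In a real job the map and reduce steps run on different nodes and the framework performs the shuffle; the sketch keeps the three phases as separate methods only to make that division visible.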


    Apache Hive

    • What is HIVE.
    • Architecture of HIVE.
    • Tables in Hive with Load Functions.
    • Query Optimization.
    • Partitioning and Bucketing.
    • Joins in HIVE.
    • Indexing In HIVE.
    • File Formats in HIVE.
    • How to read JSON files in HIVE.


    Apache Sqoop

    • What is Sqoop.
    • Relation between SQL & Hadoop.
    • Performing Sqoop Import.
    • Incremental and Conditional Imports.
    • Performing Sqoop Export.


    Apache Pig

    • What is PIG & ETL.
    • Introduction to PIG Architecture.
    • Introduction to PIG Latin.
    • How to Perform ETL on any kind of data (PIG Eats Everything)
    • Use cases of PIG.
    • Joins in PIG.
    • Co-grouping In PIG.

    Introduction to NoSQL Database & Oozie

    • What is HBASE.
    • Architecture of HBASE.
    • CRUD operations in HBASE
    • Retrieval of HBASE Data.
    • Introduction to Apache Oozie (scheduler tool)

    Introduction to Programming in Scala

    • Basic data types and literals used
    • List the operators and methods used in Scala
    • Classes in Scala
    • Traits in Scala.
    • Control Structures in Scala.
    • Collections in Scala.
    • Libraries of Scala.

    Introduction to Spark

    • Limitations of MapReduce in Hadoop
    • Batch vs. Real-time analytics
    • Application of stream processing
    • Spark vs. Hadoop Eco-system

    Using RDD for Creating Applications in Spark

    • Features of RDDs
    • How to create RDDs
    • RDD operations and methods
    • Explain RDD functions and describe how to write code that uses them in Scala

    Running SQL Queries Using Spark SQL

    • Explain the importance and features of Spark SQL
    • Describe methods to convert RDDs to DataFrames
    • Explain concepts of Spark SQL
    • Describe the concept of hive integration

    Spark ML Programming

    • Explain the use cases and techniques of Machine Learning (ML)
    • Describe the key concepts of Spark ML
    • Explain the concepts of an ML dataset, an ML algorithm, and model selection via cross-validation
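
To make the cross-validation topic above concrete, here is a minimal plain-Java sketch of how k-fold cross-validation partitions a dataset. It is not Spark ML code: the class and method names are hypothetical, and the folds are contiguous index ranges for readability, whereas real tooling typically shuffles or samples rows at random.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch of k-fold splitting (no Spark classes are used).
// Assumption: folds are contiguous index ranges; real libraries usually randomize.
class KFoldSketch {

    // Split row indices 0..n-1 into k folds whose sizes differ by at most one.
    static List<List<Integer>> folds(int n, int k) {
        List<List<Integer>> result = new ArrayList<>();
        int start = 0;
        for (int f = 0; f < k; f++) {
            int size = n / k + (f < n % k ? 1 : 0); // the first n % k folds get one extra row
            List<Integer> fold = new ArrayList<>();
            for (int i = start; i < start + size; i++) {
                fold.add(i);
            }
            result.add(fold);
            start += size;
        }
        return result;
    }

    public static void main(String[] args) {
        int n = 10;
        List<List<Integer>> all = folds(n, 3);
        // Each fold serves once as the validation set; the remaining rows form the training set.
        for (List<Integer> validation : all) {
            System.out.println("validate on " + validation
                    + ", train on " + (n - validation.size()) + " rows");
        }
    }
}
```

Model selection then means training each candidate model once per fold, averaging its validation metric over the k runs, and keeping the candidate with the best average.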