Apache Spark Use Cases

Apache Spark is an analytics engine that can process huge data volumes much faster than MapReduce, because data is kept in memory within Spark's own processing framework. That is why it has been catching the attention of both professionals and the press. It was first developed at the AMPLab of the University of California, …

Role and Responsibilities

I have seen many Hadoop professionals searching the internet for Big Data Architect roles and responsibilities, so here I have tried to help by putting together the main points. These are the main tasks an Architect needs to perform, and the skill set one should have, to become a Big Data Architect. Should be able to design, …

Spark RDD : groupByKey VS reduceByKey

Let's look at two different ways to compute word counts, one using reduceByKey and the other using groupByKey:

```scala
val words = Array(
  "a","b","c","a","a","b","c","a","a","b","c","a","a","b","c","a",
  "a","b","c","a","a","b","c","a","a","b","c","a","a","b","c","a",
  "a","b","c","a","a","b","c","a","a","b","c","a","a","b","c","a",
  "a","b","c","a","a","b","c","a","a","b","c","a","a","b","c","a",
  "a","b","c","a","a","b","c","a","a","b","c","a","a","b","c","a",
  "a","b","c","a","a","b","c","a")

val pairs = sc.parallelize(words).map(line => (line, 1))

val wordCountsWithGroup  = pairs.groupByKey().map(t => (t._1, t._2.sum)).collect()
val wordCountsWithReduce = pairs.reduceByKey(_ + _).collect()
```

While both of these functions will produce the correct answer, the …

Spark RDD : distinct(), union() & more…

In this post, we will look at the following pseudo-set transformations: distinct(), union(), intersection(), subtract() and cartesian().

distinct(): returns the distinct elements in the RDD. Warning: involves shuffling of data over the network.

union(): returns an RDD containing the data from both sources. Note: unlike the mathematical union, duplicates are not removed. Also the types should …
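A minimal sketch of these transformations, assuming a local SparkSession (the app name and sample data are illustrative; in spark-shell, `sc` already exists and the setup lines can be skipped):

```scala
import org.apache.spark.sql.SparkSession

// Local SparkSession so the snippet is self-contained
val spark = SparkSession.builder().appName("PseudoSetDemo").master("local[*]").getOrCreate()
val sc = spark.sparkContext

val a = sc.parallelize(Seq(1, 2, 2, 3))
val b = sc.parallelize(Seq(3, 4))

val distinctA = a.distinct().collect().sorted.toSeq   // Seq(1, 2, 3)
val unionAB   = a.union(b).collect().sorted.toSeq     // Seq(1, 2, 2, 3, 3, 4) -- duplicates kept
val interAB   = a.intersection(b).collect().toSeq     // Seq(3)
val subAB     = a.subtract(b).collect().sorted.toSeq  // Seq(1, 2, 2) -- elements of a not in b
val cartCount = a.cartesian(b).count()                // 4 * 2 = 8 pairs

spark.stop()
```

Note that union() keeps duplicates while intersection() removes them, which is why the two are not inverses of each other.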

Components of Spark

Following are some important components of Spark:

Cluster Manager: used to run the Spark application in cluster mode.

Application: the user program built on Spark. It consists of:

Driver Program: the program that holds the SparkContext; acts as a coordinator for the application.

Executors: run computations and store application data; they are launched at the beginning of an …
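These pieces can be sketched in a few lines, assuming local mode so the executor threads live inside the driver JVM (the app name is illustrative):

```scala
import org.apache.spark.sql.SparkSession

// Driver program: owns the SparkContext and coordinates the application.
// "local[2]" stands in for a cluster manager: 2 executor threads in this JVM.
val spark = SparkSession.builder().appName("ComponentsDemo").master("local[2]").getOrCreate()
val sc = spark.sparkContext

// The driver ships this computation to the executors as tasks;
// the executors run it and the result comes back to the driver.
val total = sc.parallelize(1 to 100).reduce(_ + _)

spark.stop()
```

In a real cluster, the only change is the master URL (e.g. a YARN or standalone master instead of "local[2]"); the driver/executor split stays the same.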

Executor, Job, Stage, Task & RDD Persistence

Let us start an application. For this demo, the Scala shell acts as the driver (application):

$ spark-shell

Connect to the web app (localhost:4040) and explore all the tabs. Except for the Environment & Executors tabs, all other tabs are empty. That clearly indicates we have an Executor running in the background to support our application.

The First Run: let us check …

RDD: foreach() Action

foreach() is an action. Unlike other actions, foreach() does not return any value; it simply operates on all the elements in the RDD. foreach() can be used in situations where we do not want to return any result but do want to initiate a computation. A good example is inserting the elements of an RDD into a database. …
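A small sketch of the idea, assuming a local SparkSession; an accumulator stands in for the real side effect (such as a database insert), since foreach itself returns nothing:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("ForeachDemo").master("local[*]").getOrCreate()
val sc = spark.sparkContext

// Accumulator as a stand-in for a side effect like a DB insert
val acc = sc.longAccumulator("sum")

// foreach is an action: it runs on the executors and returns Unit
sc.parallelize(Seq(1, 2, 3, 4)).foreach(x => acc.add(x))

val processedSum = acc.value  // 10

spark.stop()
```

Because the function runs on the executors, any connection it needs (e.g. to a database) must be created inside the closure, not on the driver.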

RDD : Lazy Evaluation

Lazy evaluation helps to optimize disk and memory usage in Spark. Consider this example. Based on the code above, we would infer that the file 'words.txt' will be read during the execution of transformation operation (1). But this never happens in Spark. Instead, the file will only be read during the execution of the action operation …
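Laziness can be observed directly with an accumulator, a sketch assuming a local SparkSession (the in-memory data replaces the post's 'words.txt' file so the snippet is self-contained):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("LazyDemo").master("local[*]").getOrCreate()
val sc = spark.sparkContext

val mapCalls = sc.longAccumulator("mapCalls")

// (1) Transformation: only the lineage is recorded, nothing runs yet
val doubled = sc.parallelize(Seq(1, 2, 3)).map { x => mapCalls.add(1); x * 2 }
val callsBeforeAction = mapCalls.value  // 0 -- the map has not executed

// (2) Action: triggers execution of the whole lineage
val result = doubled.collect().sorted.toSeq
val callsAfterAction = mapCalls.value   // 3 -- one map call per element

spark.stop()
```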

RDD Caching and Persistence

Introduction

Caching or persistence are optimisation techniques for (iterative and interactive) Spark computations. They help save interim partial results so they can be reused in subsequent stages. These interim results are kept as RDDs in memory (the default) or on more solid storage such as disk, and/or replicated.

RDDs are Recomputed: by default, an RDD is recomputed each …
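The recompute-vs-cache behaviour can be sketched as follows, assuming a local SparkSession with enough memory for the cached partitions (the accumulator just counts how many times the map function actually runs):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().appName("CacheDemo").master("local[*]").getOrCreate()
val sc = spark.sparkContext

val evals = sc.longAccumulator("evaluations")
val rdd = sc.parallelize(1 to 10).map { x => evals.add(1); x * 2 }

rdd.cache()                   // same as persist(StorageLevel.MEMORY_ONLY)
val first  = rdd.count()      // materialises and caches the partitions (10 map calls)
val second = rdd.count()      // served from the cache: no recomputation
val totalEvals = evals.value  // 10, not 20

// Other storage levels spill to disk and/or replicate
val onDisk = sc.parallelize(1 to 10).persist(StorageLevel.MEMORY_AND_DISK)

spark.stop()
```

Without the cache() call, totalEvals would be 20, since each count() would recompute the whole lineage from scratch.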