Spark RDD : groupByKey VS reduceByKey


Let’s look at two different ways to compute word counts, one using reduceByKey and the other using groupByKey:

val words = Array(
  "a","b","c","a", "a","b","c","a", "a","b","c","a", "a","b","c","a",
  "a","b","c","a", "a","b","c","a", "a","b","c","a", "a","b","c","a",
  "a","b","c","a", "a","b","c","a", "a","b","c","a", "a","b","c","a",
  "a","b","c","a", "a","b","c","a", "a","b","c","a", "a","b","c","a",
  "a","b","c","a", "a","b","c","a", "a","b","c","a", "a","b","c","a",
  "a","b","c","a")
val pairs = sc.parallelize(words).map(word => (word, 1))
val wordCountsWithGroup = pairs.groupByKey().map(t => (t._1, t._2.sum)).collect()
val wordCountsWithReduce = pairs.reduceByKey(_ + _).collect()

While both of these functions will produce the correct answer, the … Continue reading Spark RDD : groupByKey VS reduceByKey

Best practices to avoid NullPointerException


Table of Contents
1 Call equals() and equalsIgnoreCase() methods on a known String
2 Prefer valueOf() over toString() where both return the same result
3 Use null-safe methods and libraries
4 Avoid returning null from a method; return an empty collection or empty array instead

Call equals() and equalsIgnoreCase() methods on a known String
Always call the equals() method on a known String, which is … Continue reading Best practices to avoid NullPointerException
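As a minimal Scala sketch of the first two practices (the input value and names are illustrative, not from the post):

object NpeSafety {
  def main(args: Array[String]): Unit = {
    val input: String = null   // hypothetical value that might be null

    // Unsafe: input.equals("admin") would throw a NullPointerException here.
    // Safe: calling equals() on the known, non-null literal simply returns false.
    println("admin".equals(input))              // false, no NPE
    println("Admin".equalsIgnoreCase("ADMIN"))  // true

    // Prefer the null-safe String.valueOf over calling toString on the value:
    println(String.valueOf(input))              // prints "null"
    // println(input.toString)                  // would throw a NullPointerException

    // Avoid returning null: return an empty collection instead.
    def findUsers(): Seq[String] = Seq.empty
    println(findUsers().size)                   // 0, safe to iterate
  }
}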

Spark RDD : distinct(), union() & more…


In this post, we will look at the following pseudo-set transformations: distinct(), union(), intersection(), subtract(), cartesian()

Table of Contents
1 Distinct
2 Union

Distinct
distinct(): Returns the distinct elements in the RDD. Warning: involves shuffling of data over the network.

Union
union(): Returns an RDD containing data from both sources. Note: unlike the mathematical union, duplicates are … Continue reading Spark RDD : distinct(), union() & more…
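A small spark-shell sketch of these transformations (the sample data is illustrative; element order in the collect() results may vary):

val a = sc.parallelize(Seq(1, 2, 2, 3))
val b = sc.parallelize(Seq(3, 4))

a.distinct().collect()      // Array(1, 2, 3)          -- deduplicates, shuffles over the network
a.union(b).collect()        // Array(1, 2, 2, 3, 3, 4) -- keeps duplicates, no shuffle
a.intersection(b).collect() // Array(3)                -- deduplicates, shuffles
a.subtract(b).collect()     // Array(1, 2, 2)          -- elements of a not present in b
a.cartesian(b).collect()    // every (x, y) pair with x from a and y from b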

Components of Spark


Following are some important components of Spark:

Cluster Manager
Used to run the Spark application in cluster mode.

Application
User program built on Spark. Consists of:

Driver Program
The program that has the SparkContext. Acts as a coordinator for the application.

Executors
Run computations and store application data. Are launched at the beginning of an … Continue reading Components of Spark
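As a rough sketch of where these pieces show up in code (the app name and master URL are placeholders, not from the post):

import org.apache.spark.{SparkConf, SparkContext}

object MyApp {
  def main(args: Array[String]): Unit = {
    // The driver program: owns the SparkContext and coordinates the application.
    val conf = new SparkConf()
      .setAppName("my-app")                   // placeholder application name
      .setMaster("spark://master-host:7077")  // placeholder cluster manager URL
    val sc = new SparkContext(conf)

    // The work described here is carried out by executors on the cluster.
    val count = sc.parallelize(1 to 1000).map(_ * 2).count()
    println(count)

    sc.stop()
  }
}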

Executor, Job, Stage, Task & RDD Persistence


Let us start an application. For this demo, the Scala shell acts as a driver (application):

$ spark-shell

Connect to the web UI (localhost:4040) and explore all the tabs. Except for the Environment & Executors tabs, all other tabs are empty. That clearly indicates we have an executor running in the background to support our application.

The First Run
Let us check … Continue reading Executor, Job, Stage, Task & RDD Persistence
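Any first action will populate the remaining tabs; for example (illustrative, not the exact code from the post):

// spark-shell provides sc (the SparkContext of this driver) automatically
val numbers = sc.parallelize(1 to 100000)
val doubled = numbers.map(_ * 2)  // transformation: no job is launched yet

doubled.count()  // action: triggers Job 0, whose stages and tasks appear in the web UI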

RDD: foreach() Action


foreach() is an action. Unlike other actions, foreach() does not return any value; it simply operates on all the elements in the RDD. foreach() can be used in situations where we do not want to return any result, but want to initiate a computation. A good example is inserting elements of an RDD into a database. … Continue reading RDD: foreach() Action
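A minimal sketch of that pattern (saveToDb is a hypothetical stand-in for real database code):

val records = sc.parallelize(Seq("alice", "bob", "carol"))

// hypothetical helper that writes one record to a database
def saveToDb(record: String): Unit =
  println(s"INSERT INTO users VALUES ('$record')")

// foreach returns no value; it just runs the side effect on every element,
// on the executors where the partitions live
records.foreach(record => saveToDb(record))

Since opening a connection per element is expensive in practice, foreachPartition() is commonly used instead, so that one database connection is opened per partition.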

RDD : Lazy Evaluation


Lazy evaluation helps to optimize disk and memory usage in Spark. Consider this example. Based on the code above, we would infer that the file 'words.txt' will be read during the execution of transformation operation (1). But this never happens in Spark. Instead, the file will only be read during the execution of an action operation … Continue reading RDD : Lazy Evaluation
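A plausible sketch of the kind of code described (reconstructed; only the file name 'words.txt' comes from the excerpt):

// (1) transformation: nothing is read from disk yet; Spark only records the lineage
val lines = sc.textFile("words.txt")
// (2) transformation: still lazy, still no I/O
val longLines = lines.filter(_.length > 10)
// (3) action: only now is words.txt actually read and the computation executed
println(longLines.count())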

RDD Caching and Persistence


Table of Contents
1 Introduction
2 RDDs are Recomputed
3 Persisting RDDs

Introduction
Caching and persistence are optimisation techniques for (iterative and interactive) Spark computations. They help save interim partial results so they can be reused in subsequent stages. These interim results are kept as RDDs in memory (the default) or in more durable storage such as disk, and/or replicated. … Continue reading RDD Caching and Persistence
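A brief sketch of the API involved (the pipeline and chosen storage level are illustrative):

import org.apache.spark.storage.StorageLevel

val data = sc.textFile("words.txt").map(_.toUpperCase)  // placeholder pipeline

data.cache()  // shorthand for persist(StorageLevel.MEMORY_ONLY)
// or pick an explicit storage level, e.g. spill to disk when memory is short:
// data.persist(StorageLevel.MEMORY_AND_DISK)

data.count()  // first action: computes the partitions and caches them
data.count()  // second action: reuses the cached partitions instead of recomputing

data.unpersist()  // release the cached partitions when they are no longer needed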