Spark textFile() vs wholeTextFiles()


textFile()

def textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String]

  • Reads a text file from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI, and returns it as an RDD of Strings.
  • For example, sc.textFile("/home/hdadmin/wc-data.txt") creates an RDD in which each line of the file is a separate element.
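Running sc.textFile() requires a Spark installation, so as a dependency-free sketch, the plain Scala below mimics the element shape textFile() produces: one String per line. The file path and contents are hypothetical stand-ins for the wc-data.txt example above.

```scala
import java.nio.file.{Files, Path}
import scala.jdk.CollectionConverters._

object TextFileShape {
  // Mimics the element shape of sc.textFile(path): an RDD[String]
  // with one element per line of the file.
  def linesOf(path: Path): List[String] =
    Files.readAllLines(path).asScala.toList

  def main(args: Array[String]): Unit = {
    // Hypothetical sample file standing in for /home/hdadmin/wc-data.txt
    val path = Files.createTempFile("wc-data", ".txt")
    Files.write(path, "hello spark\nhello world".getBytes("UTF-8"))

    println(linesOf(path)) // List(hello spark, hello world)
  }
}
```

In real Spark the same shape comes back lazily and partitioned across the cluster, but each element is still a single line.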

wholeTextFiles()

def wholeTextFiles(path: String, minPartitions: Int = defaultMinPartitions): RDD[(String, String)]

  • Reads a directory of text files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI.
  • Rather than a basic RDD of lines, wholeTextFiles() returns a pair RDD.

For example, if a directory contains several files, wholeTextFiles() creates a pair RDD in which the key is the file path and the value is the entire content of that file as a single string.
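To make the contrast concrete without a Spark installation, here is a plain-Scala sketch of the pair shape wholeTextFiles() produces: (file path, whole file content) tuples rather than individual lines. The directory and file names are hypothetical.

```scala
import java.nio.file.{Files, Path}
import scala.jdk.CollectionConverters._

object WholeTextFilesShape {
  // Mimics the pair-RDD shape of sc.wholeTextFiles(dir):
  // key = file path, value = the entire file content as one String.
  def pairsOf(dir: Path): List[(String, String)] =
    Files.list(dir).iterator().asScala.toList.sortBy(_.toString).map { p =>
      (p.toString, new String(Files.readAllBytes(p), "UTF-8"))
    }

  def main(args: Array[String]): Unit = {
    // Hypothetical directory with two small text files
    val dir = Files.createTempDirectory("whole-text")
    Files.write(dir.resolve("a.txt"), "first file\nline two".getBytes("UTF-8"))
    Files.write(dir.resolve("b.txt"), "second file".getBytes("UTF-8"))

    // One pair per file, not one element per line
    pairsOf(dir).foreach { case (name, content) =>
      println(s"$name -> ${content.length} chars")
    }
  }
}
```

Because each value holds a whole file, wholeTextFiles() is best suited to directories of many small files; a single huge file as one string can exhaust executor memory.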


