Wednesday, December 28, 2016

Spark - RDD


First, RDD stands for Resilient Distributed Dataset.

A Spark RDD is a distributed collection of data. An RDD is usually created in one of two ways: from external data (a file, or data from HDFS) or by distributing a collection of objects (e.g., a List or Set) in the driver program.

Scala code to create RDD:

  1. External data RDD: val lines = sc.textFile("input.txt")
  2. Distributed collection RDD: val nums = sc.parallelize(List(1, 2, 3, 4))
*sc is the SparkContext object
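
In the Spark shell, sc is created for you automatically. For a standalone Scala application, here is a minimal sketch of how sc might be obtained (the object name, app name, and local master setting are illustrative assumptions, not part of the original examples):

  import org.apache.spark.{SparkConf, SparkContext}

  object RddDemo {
    def main(args: Array[String]): Unit = {
      // Run locally using all available cores; the app name is arbitrary
      val conf = new SparkConf().setMaster("local[*]").setAppName("rdd-demo")
      val sc = new SparkContext(conf)

      val lines = sc.textFile("input.txt")          // RDD from external data
      val nums = sc.parallelize(List(1, 2, 3, 4))   // RDD from a driver-side collection

      sc.stop()
    }
  }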

Now we have RDDs created in our driver program. Once an RDD is created, we can perform computations on it.
Two kinds of operations can be performed on an RDD:

  • Transformations: A transformation produces a new RDD from an existing one (see the sketch after this list). Commonly used transformations:
    • flatMap(): applies a function to each element in the RDD and returns the contents of the returned iterators as a new RDD
    • filter(): returns an RDD that contains only the elements that pass the filter condition
    • map(): returns an RDD produced by applying a function to each element in the RDD
    • distinct(): removes duplicate elements
    • union(): produces an RDD containing the elements from both RDDs
  • Actions: An action returns a value to the driver or writes data to storage (see the second sketch below). Commonly used actions:
    • collect(): returns all elements in the RDD
    • count(): returns the number of elements in the RDD
    • foreach(): applies a function to each element in the RDD
    • top(num): returns the top num elements from the RDD
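
To make the transformations concrete, here is a small sketch chaining a few of them on a text file (the file contents and the extra words are illustrative):

  val lines = sc.textFile("input.txt")
  val words = lines.flatMap(line => line.split(" "))   // flatMap: one line yields many words
  val nonEmpty = words.filter(word => word.nonEmpty)   // filter: keep elements passing the predicate
  val lengths = nonEmpty.map(word => word.length)      // map: one element in, one element out
  val unique = nonEmpty.distinct()                     // distinct: drop duplicate words
  val extra = sc.parallelize(List("spark", "rdd"))
  val combined = nonEmpty.union(extra)                 // union: elements from both RDDs

Note that transformations are lazy: Spark only records the lineage of each new RDD, and nothing is actually computed until an action is called.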
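
And a sketch of the actions on a small RDD (the values are chosen just for illustration):

  val nums = sc.parallelize(List(4, 1, 3, 2, 3))
  val all = nums.collect()          // Array(4, 1, 3, 2, 3): all elements returned to the driver
  val n = nums.count()              // 5
  nums.foreach(x => println(x))     // the println runs on the executors, not the driver
  val top2 = nums.top(2)            // Array(4, 3): largest elements in descending order

Since collect() brings the entire RDD into the driver's memory, it should only be used when the result is known to be small.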
