Using shared variables and key-value pairs
In this blog post, I’ll talk about Spark’s shared variables and the types of operations you can do on key-value pairs. Spark provides two limited types of shared variables for common usage patterns: broadcast variables and accumulators. Normally, when a function is passed from the driver to a worker, a separate copy of the variables it uses is shipped to each worker. Broadcast variables instead allow each machine to work with a read-only variable cached locally, and Spark attempts to distribute them using efficient broadcast algorithms. As an example, broadcast variables can be used to give every node a copy of a large dataset efficiently.
The other kind of shared variable is the accumulator. Accumulators are used for counters and sums that work well in parallel, and they can only be added to through an associative operation. Only the driver can read an accumulator’s value; the tasks cannot read it, they can only add to it. Spark supports numeric accumulators out of the box, but programmers can add support for new types. As an example, you can use accumulators to implement counters or sums, as in MapReduce.
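To make this concrete, here is a minimal spark-shell sketch of both kinds of shared variable (the lookup map, country codes and variable names are made up for illustration; sc.accumulator is the classic API, and newer Spark versions also offer sc.longAccumulator). The broadcast variable ships a small read-only map to every worker, and the accumulator counts the codes that were not found in it:
scala> val lookup = sc.broadcast(Map("BR" -> "Brazil", "US" -> "United States")) // read-only, cached on each machine
scala> val misses = sc.accumulator(0) // numeric accumulator; tasks can only add to it
scala> val codes = sc.parallelize(Seq("BR", "US", "XX"))
scala> val names = codes.map(code => { if (!lookup.value.contains(code)) misses += 1; lookup.value.getOrElse(code, "unknown") })
scala> names.collect() // the action runs the tasks on the workers
scala> misses.value // read back on the driver: 1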
Last, but not least, key-value pairs are available in Scala, Python and Java. In Scala, you create a key-value pair (a tuple) by typing:
scala> val kvp = ("a", "b")
To access each element, use the “._” notation. This is not zero-indexed, so “._1” returns the value in the first position and “._2” returns the value in the second position.
scala> kvp._1 // will return "a"
scala> kvp._2 // will return "b"
In Python, it is zero-indexed, so the value in the first position resides at index “0” and the second at index “1”.
>>> kvp = ("a", "b")
>>> kvp[0] # will return "a"
>>> kvp[1] # will return "b"
There are special operations available on RDDs of key-value pairs. In an application, you must remember to import the SparkContext package so that the PairRDDFunctions, such as reduceByKey(), are available. The most common ones are those that perform grouping or aggregation by key. If you have custom objects as the key inside your key-value pair, remember that you will need to provide your own equals() method to do the comparison, as well as a matching hashCode() method (a small sketch of such a key class follows the word-count example below).
So in the example below, you have a text file read as a normal RDD. You then perform some transformations on it, which produce a pair RDD, and that allows you to invoke the reduceByKey() method from the PairRDDFunctions API.
scala> val lines = sc.textFile("/home/df/Temp/alerts_new.csv")
scala> val wordCounts = lines.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_+_)
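As an aside, here is a minimal sketch of what a custom key class might look like (the SensorKey name and its fields are made up for illustration). A Scala case class would generate equals() and hashCode() for you, but a plain class used as a key needs them written by hand, and the two must agree:
scala> class SensorKey(val id: String, val region: String) {
     |   override def equals(other: Any): Boolean = other match {
     |     case that: SensorKey => id == that.id && region == that.region
     |     case _ => false
     |   }
     |   override def hashCode(): Int = (id, region).hashCode() // must be consistent with equals()
     | }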
Pay attention to the reduceByKey() call in the word-count example above: it simply means that, for all values with the same key, add them up together.
Another thing I want to point out is that, for the sake of brevity, all the functions are chained on one line. When you code it yourself, you may want to split the functions up. For example, do the flatMap operation and assign the result to an RDD; then, from that RDD, do the map operation to create another RDD; and finally, from that last RDD, invoke the reduceByKey() method. That yields more lines, but it lets you test each transformation to see whether it worked properly, as shown below.
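Here is a rough sketch of that step-by-step style (same file path as above; the intermediate RDD names are my own). Each transformation gets its own RDD, so you can inspect it with an action such as take() before moving on:
scala> val lines = sc.textFile("/home/df/Temp/alerts_new.csv")
scala> val words = lines.flatMap(line => line.split(" ")) // one word per element
scala> val pairs = words.map(word => (word, 1))           // key-value pairs
scala> val wordCounts = pairs.reduceByKey(_ + _)          // sum the 1s for each key
scala> pairs.take(5)                                      // inspect an intermediate result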
In the next post, let’s understand the purpose and usage of the SparkContext.
Good luck!
If you want, you can contact me to clear up some doubts, make suggestions or criticisms, or to learn about my services - please send me an e-mail, or visit my website for more details.