Week 5 investigated lists, a ‘fundamental’ data structure in functional programming. The recursive nature of lists in Scala makes them quite different from arrays. This, tied in with the power of pattern matching and the recursive nature of functional programming, gives credibility to the ‘fundamental’ label.
Sorting of lists was covered in good detail, and in addition to list1.head and list1.tail more functions were revealed:
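For example, a merge sort over a List can be written with pattern matching and splitAt; this is a rough sketch in the style of the lectures, not my assignment code:

// Merge sort on a List[Int], sketched in the style of the lectures
def msort(xs: List[Int]): List[Int] = {
  val n = xs.length / 2
  if (n == 0) xs // a list of zero or one elements is already sorted
  else {
    // Merge two sorted lists by repeatedly taking the smaller head
    def merge(xs: List[Int], ys: List[Int]): List[Int] = (xs, ys) match {
      case (Nil, _) => ys
      case (_, Nil) => xs
      case (x :: xs1, y :: ys1) =>
        if (x < y) x :: merge(xs1, ys) else y :: merge(xs, ys1)
    }
    val (fst, snd) = xs.splitAt(n)
    merge(msort(fst), msort(snd))
  }
}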
Week 4 continued to delve into polymorphism with a focus on Types and Pattern Matching. Generalizing functions to accept multiple parameter types is a clear highlight of Scala.
Pattern matching was a topic of week 4 that was used in every subsequent week. Case statements with pattern matching appear to be another staple of Scala.
Number(1) match {
  case Number(n)   => n
  case Sum(e1, e2) => eval(e1) + eval(e2)
} + eval(Number(2))
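For context, the Number and Sum cases above assume the small expression hierarchy used in the lectures, roughly:

// A minimal expression hierarchy with pattern-matching evaluation
trait Expr
case class Number(n: Int) extends Expr
case class Sum(e1: Expr, e2: Expr) extends Expr

// eval works out the value of an expression by matching on its shape
def eval(e: Expr): Int = e match {
  case Number(n)   => n
  case Sum(e1, e2) => eval(e1) + eval(e2)
}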
Week 4’s assignment, implementing the Huffman coding compression algorithm, was a good practical example of Scala’s power. Note that my source has a number of errors in it.
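From memory, the assignment is built around a tree of character weights along these lines (the actual skeleton may differ slightly):

// Huffman trees, roughly as in the week 4 assignment skeleton
abstract class CodeTree
case class Fork(left: CodeTree, right: CodeTree, chars: List[Char], weight: Int) extends CodeTree
case class Leaf(char: Char, weight: Int) extends CodeTree

// Pattern matching keeps queries over the tree very compact
def weight(tree: CodeTree): Int = tree match {
  case Leaf(_, w)       => w
  case Fork(_, _, _, w) => w
}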
Week 2 delved more into tail recursion but, more broadly, reviewed higher-order functions. Looking at perhaps the key characteristic of functional programming languages, we found that functions are treated as first-class values, meaning that, like any other value, a function can be passed as a parameter and returned as a result. This becomes important later in the course and in real-world applications, where the resolution of values can be deferred until the optimal time.
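A small illustration of my own (not course code) of a function that both takes a function as a parameter and returns a new function as its result:

// A higher-order function: takes a function and returns a new function
def twice(f: Int => Int): Int => Int =
  (x: Int) => f(f(x))

val addFour = twice(_ + 2) // a function value built from another function
println(addFour(10))       // prints 14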
Another key point of functional programming was introduced: immutable values. How immutable values relate to distributed and parallel programming was touched on briefly. Odersky elaborates on this in the following presentation:
The concept of currying was also explained: in essence, rewriting a function that takes several parameters as a chain of functions that each take a single parameter, in the interest of simplifying and generalizing code. These concepts and some of the different function types are details that have not stuck with me very well since the course. I guess that happens when you don’t write a lot of code in a language and then leave it for a few months.
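A quick sketch of what currying looks like in Scala (my own example, not from the assignments):

// A curried function has multiple parameter lists
def addCurried(a: Int)(b: Int): Int = a + b

// Supplying only the first argument list yields a new function
val addFive = addCurried(5) _
println(addFive(3)) // prints 8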
Week 2’s assignment was relatively straightforward and followed the content of the week’s lectures.
I started and completed this course in the second half of 2012 and thought revisiting the material and uploading the weekly assignments would be a good idea. Week 1 looked at basic functions and evaluations. The recursive nature of functional programming was alluded to, particularly in the assignment.
The basics of (x <- 0 to y) and other Scala language specifics such as scoping and structure can all be reviewed in the week’s source code.
I signed up for this course after watching a presentation by Rich Hickey, the creator of Clojure (another functional language for the JVM).
One of the most important concepts I took from week 1 was that of tail recursion:
/**
* Exercise 1
*/
def pascal(c: Int, r: Int): Int = {
  // Factorial helper, implemented with tail recursion
  def tFactorial(number: Int): Int = {
    // The accumulator carries the running product so the recursive call is in tail position
    def tfactorialWithAccumulator(accumulator: Int, number: Int): Int = {
      if (number == 1) accumulator
      else tfactorialWithAccumulator(accumulator * number, number - 1)
    }
    // Start with an accumulator of 1
    tfactorialWithAccumulator(1, number)
  }
  // Element value is calculated using r!/(c!(r-c)!)
  if (c == 0 || c == r || r == 0 || r == 1) 1
  else tFactorial(r) / (tFactorial(c) * tFactorial(r - c))
}
Running similar methods with and without allowing the Scala compiler to optimize demonstrated the performance improvements gained by this optimization.
// This function will be tail-recursion optimized
def tFactorial(number: Long): Long = {
  def tfactorialWithAccumulator(accumulator: Long, number: Long): Long = {
    if (number == 1) accumulator
    else tfactorialWithAccumulator(accumulator * number, number - 1)
  }
  tfactorialWithAccumulator(1, number)
}
// Non-tail-recursive function: the multiplication happens after the recursive call returns
def ntFactorial(number: Long): Long = {
  if (number == 1) 1
  else number * ntFactorial(number - 1)
}
For an explanation of these functions and why they are (or are not) tail recursive, see the source above.
Results:
Non-Tail recursive average response time: approx 7 ms
Tail recursive average response time: approx 0.7 ms
The number of tests conducted was limited to about 10 each, and the test environment was an Ubuntu VM. The results will not be highly accurate, but with a difference this significant it is clear that tail-optimized recursion is very different from normal recursion (on a JVM). Looking at the PurePath for each method reveals tail-recursion optimization working as expected and saving a lot of execution time:
The next example looks at what the compiler does if there is extra code within a tail-optimized function:
// Non-Tail recursive version
def execTest3nt(loopCount: Int, cur: Int): String = {
  println("Hello World " + cur + "\n")
  if (cur >= loopCount)
    return "Completed"
  else
    execTest3nt(loopCount, cur + 1)
}
The output of the non-tail-recursive function was as expected, printing every line. The Scala compiler, however, optimized the tail-recursive version beyond the behaviour I (perhaps incorrectly) expected. The output was:
enter num of iterations:1000
Hello World 1
Hello World 1000
Completed
As the function was optimized not to run through all of the loops, the println calls simply did not occur. The PurePath comparison:
After some initial JVM monitoring yesterday, I looked at response times today. First thing I had to do was modify the respective classes slightly to get a non-constructor method doing all the work. This made it much easier to gather and chart the results within dynaTrace (though I did later find that there was a check box in dynaTrace to monitor constructors).
The tests today were again on 3 implementations of the Fibonacci sequence:
I spent most of the day playing with dynaTrace’s monitoring and reporting tools, so I did not actually do that much testing. The tests I did run were quite short: the first executed the fib sequence up to the 10,000th value, the second stretched to 100,000. The charts below show the time results for each test; if more than one test was executed for an implementation in a 30-second cycle, the chart shows the average.
*Update 16-JUL: Retested on the host machine (not running in a VM); results were much more consistent despite a number of other services operating at the same time. There are some clear results this time:
From the results it appears that the variations in response times are most likely caused by external factors (i.e. the OS or other services taking resources, garbage collections, etc.). I will really need to increase the sample size to get a better understanding of what is influencing response times.
Trying to make some comparisons between Scala and Java. Using dynaTrace I can get very useful and granular insight into the behaviours of each. To keep things as comparable as possible the tests will be on simple, short algorithms. Large iterations will be conducted so garbage collection and space complexity can be analysed.
With dynaTrace I can get information about the JVM behaviour but also about the behaviour of specific classes and methods. This will not only allow me to make some comparison between Java and Scala implementations but also understand more about what happens when Scala is compiled to byte code.
I am running the tests from a controller class with a main method. From there I can instantiate the classes above. Read more about dynaTrace to see how it uses sensors in the JVM to grab a great deal of useful data about the JVM itself and about specific classes and methods.
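As a rough, hypothetical sketch of the shape of that harness (the class and method names here are mine, not the actual test classes):

// Hypothetical sketch of the controller and one Fibonacci implementation
object TestController {
  def main(args: Array[String]): Unit = {
    val fibScala = new FibScala()
    // The work happens in a normal method rather than the constructor,
    // which makes it easy for dynaTrace to instrument and chart
    val result = fibScala.fib(100000)
    println("Digits in result: " + result.toString.length)
  }
}

class FibScala {
  // Iterative Fibonacci using BigInt so large values do not overflow
  def fib(n: Int): BigInt = {
    var a = BigInt(0)
    var b = BigInt(1)
    for (_ <- 0 until n) {
      val next = a + b
      a = b
      b = next
    }
    a
  }
}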
Just to get things rolling, I checked out how running the Fibonacci sequence up to the 10 millionth number would affect the JVM. These tests would have taken a long time to complete, so after I saw stable behaviour (i.e. two sequences) I killed the JVM and moved on to the next implementation.
The charts below have been modified for human readability. For example, Garbage Collection in the charts was factored up. The charts are also stacked for readability. Within the dynaTrace client this is much easier to read with mouse over pop-ups.
What can be taken from this fairly simple example is that code may work and appear to be performing well. In reality it may be wreaking havoc on the JVM. Next post will look at the response times of these implementations.
After hearing the praises of Hadoop I had a brief look at it and the Map/Reduce paradigm. Most info was read from Wikipedia and references in the wiki articles; in particular, this paper from Jeffrey Dean and Sanjay Ghemawat.
Hadoop is an open-source software framework that implements MapReduce and a distributed file system modelled on the Google File System. The aim is to enable easy and robust deployment of highly parallel data set processing programs. It is important to note that the MapReduce model is applicable to embarrassingly parallel problems. Processing can occur on data stored in a database or filesystem.
Map/Reduce refers to the steps in the model:
Map: A master node takes an input problem and divides it into sub-problems, passing them to worker nodes. Worker nodes can further divide sub-problems. For example, in the problem of counting word occurrences in a document, the Map function will output a key/value pair every time it sees a specified word, e.g. (“searchterm”, 1).
Reduce: The reduce function takes the list of word (key)/value pairs and sums the occurrences for each word:
(foo, 1)
(bar, 1)
(foo, 1)
(bar, 1)
(foo, 1)
The output of reduce in this case would be (foo, 3) and (bar, 2).
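To make the model concrete, here is a small word-count sketch using plain Scala collections rather than the Hadoop API (illustrative only):

// Word count expressed as map and reduce steps over plain collections
object WordCountSketch {
  def main(args: Array[String]): Unit = {
    val document = "foo bar foo bar foo"

    // Map: emit a (word, 1) pair for every word seen
    val mapped = document.split("\\s+").toList.map(word => (word, 1))

    // Reduce: group the pairs by word and sum the counts for each group
    val counts = mapped.groupBy(_._1).map { case (word, pairs) => (word, pairs.map(_._2).sum) }

    println(counts) // e.g. Map(foo -> 3, bar -> 2)
  }
}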
The MapReduce model becomes powerful when you consider that giant datasets can be processed very efficiently across large clusters using this model.
The Hadoop run-time takes care of:
Parallelization
Partitioning of the input data
Scheduling program execution across machines
Handling machine failures
Managing inter-machine communication
Hadoop aims to enable developers with little distributed programming experience to utilize compute resources such as EC2.
With the emergence of ‘big data’ and the apparent value that can be extracted from massive databases/datastores, many organisations have found the limits of traditional relational databases. Hadoop has such a big buzz because it can push past the processing boundaries of relational database software and enable that value to be extracted. The video below is a decent explanation of this point by data scientists at SalesForce.com.
Came across an interesting scenario this week where I decided to try using Python for parsing HTML. It turns out to be quite straightforward and something I could imagine using more often in the future.
The link we were after was the most viewed channel at any particular time:
The first attempt was to load up that page in an <iframe>, allowing our own JavaScript to grab the relevant URL and then forward. Twitch.tv, however, forces a redirect when iframes are detected, presumably for proprietary reasons, so JavaScript was not really an option.
After experimenting with Python’s lxml (http://lxml.de/) it turned out to be really straightforward and effective. It is apparently quite efficient too, thanks to its C core (http://stackoverflow.com/a/6494811/692180). The module I used is documented quite well here: http://lxml.de/lxmlhtml.html. With a little bit of trial and error, a brief Python script successfully grabs the relevant information.
Then, using PHP’s popen() function, I can simply call the Python script and use the return value as a PHP variable for a header redirect.
import urllib
from lxml import html

def getElement():
    # Fetch the directory page, storing its contents in 's'
    f = urllib.urlopen("http://www.twitch.tv/directory/StarCraft%20II:%20Wings%20of%20Liberty")
    s = f.read()
    f.close()
    # Parse the page and narrow it down to the channel listing
    doc = html.document_fromstring(s)
    doc = doc.get_element_by_id('directory_channels')
    doc = doc.find_class('thumb')
    # The third whitespace-separated token of the first 'thumb' element holds the channel href
    rVal = html.tostring(doc[0]).split()[2]
    return makeUrl(rVal)

def makeUrl(thumbString):
    # Strip the href="..." wrapper and prepend the domain
    return "http://twitch.tv" + thumbString[6:-2]

if __name__ == "__main__":
    print getElement()