Categories: Functional Programming - Scala

Functional Programming in Scala – Week 5

Week 5, Lists, investigated a ‘fundamental’ data structure in functional programming. The recursive nature of lists in Scala makes them quite different from arrays. This, tied in with the power of pattern matching and the recursive nature of functional programming, gives credibility to the ‘fundamental’ label.

[Image: Pattern matching on lists is a powerful tool]
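A minimal sketch of the kind of list pattern matching the lectures covered (an assumed example, not the course's code):

def describe(xs: List[Int]): String = xs match {
  case Nil         => "empty list"
  case x :: Nil    => "single element: " + x
  case x :: y :: _ => "starts with " + x + " and " + y
}

describe(List())        // "empty list"
describe(List(1, 2, 3)) // "starts with 1 and 2"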

Sorting of lists was covered in good detail, and in addition to list1.head and list1.tail more functions were revealed:

[Image: Some key functions for lists]
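To give a flavour of those functions together with pattern matching, a hedged sketch of insertion sort in the style of the lectures (Int lists assumed):

def isort(xs: List[Int]): List[Int] = xs match {
  case List()  => List()
  case y :: ys => insert(y, isort(ys))
}

def insert(x: Int, xs: List[Int]): List[Int] = xs match {
  case List()  => List(x)
  case y :: ys => if (x <= y) x :: xs else y :: insert(x, ys)
}

isort(List(7, 3, 9, 2)) // List(2, 3, 7, 9)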

Week 5 Lectures

Week 4 and Week 6 assignments were considered a higher workload than the others, so there was no assignment for week 5.

Categories: Functional Programming - Scala

Functional Programming in Scala – Week 4

Week 4 continued to delve into polymorphism with a focus on Types and Pattern Matching. Generalizing functions to accept multiple parameter types is a clear highlight of Scala.

[Image: Generic functions in Scala]
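A minimal sketch of a generic function (an assumed example): the type parameter T lets one definition work for any element type:

def last[T](xs: List[T]): T = xs match {
  case List()  => throw new Error("last of empty list")
  case List(x) => x
  case _ :: ys => last(ys)
}

last(List(1, 2, 3))   // 3, with T inferred as Int
last(List("a", "b"))  // "b", with T inferred as String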

Pattern matching was a topic of week 4 that was used in every subsequent week. Case statements with pattern matching appear to be another staple of Scala.

Number(1) match {
  case Number(n) => n
  case Sum(e1, e2) => eval(e1) + eval(e2)
} + eval(Number(2))
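That snippet assumes the lectures’ expression hierarchy; a minimal self-contained sketch of the definitions it relies on:

trait Expr
case class Number(n: Int) extends Expr
case class Sum(e1: Expr, e2: Expr) extends Expr

def eval(e: Expr): Int = e match {
  case Number(n)   => n
  case Sum(e1, e2) => eval(e1) + eval(e2)
}

eval(Sum(Number(1), Number(2))) // 3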

Week 4 lectures

Week 4’s assignment of implementing the Huffman coding compression algorithm was a good practical example of Scala’s power. Note that my source has a number of errors in it.

Week 4 Assignment source

Categories: Functional Programming - Scala

Functional Programming in Scala – Week 3

Week 3 looked at the object-oriented aspects of Scala. The evaluation of encapsulated data was the most interesting part of the week.

The lectures went into detail about some more of Scala’s ‘syntactic sugar’.

Abstraction of classes in Scala was explained clearly in the lectures too.

The assignment for week 3 demonstrated Scala’s ability to implement sort, filtering, and union operations in just a few lines.

Polymorphism in Scala was also described. The use of types and traits was shown to enable different ‘flavors’ of functions.
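As a hedged sketch of how these pieces fit together (an assumed IntSet example in the style of the lectures, with an abstract base class and union expressed in a single line):

abstract class IntSet {
  def incl(x: Int): IntSet
  def contains(x: Int): Boolean
  def union(other: IntSet): IntSet
}

object Empty extends IntSet {
  def incl(x: Int): IntSet = new NonEmpty(x, Empty, Empty)
  def contains(x: Int): Boolean = false
  def union(other: IntSet): IntSet = other
}

class NonEmpty(elem: Int, left: IntSet, right: IntSet) extends IntSet {
  def incl(x: Int): IntSet =
    if (x < elem) new NonEmpty(elem, left.incl(x), right)
    else if (x > elem) new NonEmpty(elem, left, right.incl(x))
    else this
  def contains(x: Int): Boolean =
    if (x < elem) left.contains(x)
    else if (x > elem) right.contains(x)
    else true
  def union(other: IntSet): IntSet =
    left.union(right.union(other)).incl(elem)
}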

Week 3 lectures

Week 3 assignment source

[Image: Class hierarchy in Scala]
Categories: Functional Programming - Scala

Functional Programming in Scala – Week 2

Week 2 delved more into tail recursion but, more broadly, reviewed higher-order functions. Looking at perhaps the key characteristic of functional programming languages, we found that functions are treated as first-class values: like any other value, a function can be passed as a parameter and returned as a result. This becomes important later in the course and in real-world applications, where the resolution of values can be deferred until the optimal time.
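A minimal sketch of a function taking a function as a parameter (an assumed example in the style of the lectures):

// sum applies f to every integer between a and b and adds the results
def sum(f: Int => Int, a: Int, b: Int): Int =
  if (a > b) 0 else f(a) + sum(f, a + 1, b)

sum(x => x, 1, 10)     // sum of the integers 1..10
sum(x => x * x, 1, 10) // sum of the squares 1..10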

Another key point of functional programming was introduced: immutable values. How immutable values relate to distributed and parallel programming was touched on briefly. Odersky elaborates on this in the following presentation:

 

The concept of currying was also explained: in essence, rewriting a function that takes multiple arguments as a chain of functions that each take a single argument, in the interest of simplifying and generalizing code. These concepts and some of the different function types are details that have not stuck with me very well since the course. I guess that happens when you don’t write a lot of code in a language and then leave it for a few months.
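A hedged sketch of the currying idea (an assumed example): writing sum with multiple parameter lists means supplying just the function yields a new, specialised function:

def sum(f: Int => Int)(a: Int, b: Int): Int =
  if (a > b) 0 else f(a) + sum(f)(a + 1, b)

val sumOfSquares = sum(x => x * x) _ // partially applied: (Int, Int) => Int
sumOfSquares(1, 10)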

Week 2’s assignment was relatively straightforward and followed the contents of the week’s lectures.

Week 2 Assignment code

Week 2 Lectures 

 

Categories: Functional Programming - Scala

Functional Programming in Scala – Week 1

Started and completed this course in the second half of 2012. Thought revisiting the material and uploading the weekly assignments would be a good idea. Week 1 looked at basic functions and evaluations. The recursive nature of functional programming was alluded to, particularly in the assignment.

The basics of (x <- 0 to y) and other Scala language specifics such as scoping and structure can all be reviewed in the week’s source code.
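For reference, a trivial sketch of that syntax (an assumed example):

for (x <- 0 to 3) println(x) // prints 0, 1, 2, 3
(0 to 3).map(x => x * 2)     // Vector(0, 2, 4, 6)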

I signed up for this course after watching a presentation by Rich Hickey, the creator of Clojure (another functional language for the JVM).

http://www.infoq.com/presentations/Are-We-There-Yet-Rich-Hickey

Week’s homework

Week 1 lecture videos: https://class.coursera.org/progfun-2012-001/lecture/8

One of the most important concepts I took from week 1 was that of tail recursion:

/**
 * Exercise 1
 */
def pascal(c: Int, r: Int): Int = {
  // Calculate factorial with tail recursion
  def tFactorial(number: Int): Int = {
    def tfactorialWithAccumulator(accumulator: Int, number: Int): Int =
      if (number == 1) accumulator
      else tfactorialWithAccumulator(accumulator * number, number - 1)
    // start from the start!
    tfactorialWithAccumulator(1, number)
  }

  // element value is calculated using r!/(c!(r-c)!)
  if (c == 0 || c == r || r == 0 || r == 1) 1
  else tFactorial(r) / (tFactorial(c) * tFactorial(r - c))
}
Categories: ITOps

Scala tail recursion vs non-tail recursion performance testing with dynaTrace

The Scala compiler has partial tail-call optimization (http://blog.richdougherty.com/2009/04/tail-calls-tailrec-and-trampolines.html).

Running similar methods with and without allowing the Scala compiler to optimize demonstrated the performance improvements gained by this optimization.

First up, simple factorial functions (source: http://myadventuresincoding.wordpress.com/2010/06/20/tail-recursion-in-scala-a-simple-example/):

// This function will be tail-recursion optimized
def tFactorial(number: Long): Long = {
  def tfactorialWithAccumulator(accumulator: Long, number: Long): Long =
    if (number == 1) accumulator
    else tfactorialWithAccumulator(accumulator * number, number - 1)
  tfactorialWithAccumulator(1, number)
}

// Non-tail recursive function
def ntFactorial(number: Long): Long =
  if (number == 1) 1
  else number * ntFactorial(number - 1)

For an explanation of these functions and why they are or are not tail recursive, see the source above.
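A quick way to confirm which of the two the compiler can optimize is the @tailrec annotation, which makes compilation fail when the marked method is not tail recursive:

import scala.annotation.tailrec

// Compiles: the recursive call is the last action performed
@tailrec
def loop(accumulator: Long, number: Long): Long =
  if (number == 1) accumulator
  else loop(accumulator * number, number - 1)

// Would not compile if annotated: the multiply happens after the recursive call
// @tailrec
// def ntFactorial(number: Long): Long =
//   if (number == 1) 1 else number * ntFactorial(number - 1)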

Results:

Non-Tail recursive average response time: approx 7 ms

Tail recursive average response time: approx 0.7 ms

The number of tests conducted was limited to about 10 each, and the test environment was an Ubuntu VM, so the results will not be highly accurate. But with a difference this significant it is clear that tail-optimized recursion is very different from normal recursion (on a JVM). Looking at the purepath for each method reveals tail-recursion optimization working as expected and saving a lot of execution time:

[Image: Comparing the two purepaths, the 10x increase in response time is clearly caused by the lack of tail recursion on the left.]

The next example looks at what happens if there is extra code within a tail-optimized function:

// Tail recursive version
def execTest3Tail(loopCount: Int): String = {
  def test2WithAccumulator(accumulator: Int, number: Int): String = {
    println("Hello World " + accumulator + "\n")
    if (accumulator >= loopCount) "Completed"
    // Note: loopCount, not accumulator + 1, is passed as the accumulator here
    else test2WithAccumulator(loopCount, accumulator + 1)
  }
  test2WithAccumulator(1, loopCount)
}

// Non-tail recursive version
def execTest3nt(loopCount: Int, cur: Int): String = {
  println("Hello World " + cur + "\n")
  if (cur >= loopCount) "Completed"
  else execTest3nt(loopCount, cur + 1)
}

The output of the non-tail recursive function was as expected, printing a line for every iteration. The tail-recursive version, however, did not behave as (perhaps incorrectly) expected. The output was:

enter num of iterations:1000
Hello World 1
Hello World 1000
Completed

The missing printlns are not the compiler optimizing the loops away; side effects are never elided by tail-call optimization. Rather, the recursive call passes loopCount as the accumulator, so the accumulator jumps straight to the limit and the function completes after the second iteration. The purepath comparison:

[Image: Results are similar to the first test. The right (tail-optimized) purepath call hierarchy shows why there were only two screen outputs.]

 

Categories: ITOps

Scala vs Java performance – dynaTrace Analysis – Test 1 Fibonacci Sequence cont’

After some initial JVM monitoring yesterday, I looked at response times today. First thing I had to do was modify the respective classes slightly to get a non-constructor method doing all the work. This made it much easier to gather and chart the results within dynaTrace (though I did later find that there was a check box in dynaTrace to monitor constructors).

The tests today were again on 3 implementations of the Fibonacci sequence:

Java – see source

Scala Simple – see source

Scala Advanced – see source

Test environment info

I spent most of the day playing with dynaTrace’s monitoring and reporting tools, so did not actually do that much testing. The tests I did run were quite short: the first executed the Fibonacci sequence up to the 10,000th value, the second stretched to the 100,000th. The charts below show the time results for each test; if more than one test was executed for an implementation in a 30-second cycle, the chart shows the average.

Update 16-JUL: Retested on the host machine (not running in a VM); results were much more consistent despite a number of other services operating at the same time. There are some clear results this time:

[Image: Running on the host machine, the Java implementation has a slight performance edge over Scala simple. The Scala advanced implementation is slower. These tests were to the 100,000th Fibonacci value.]

 

[Image: Stretching the test to the 1,000,000th Fibonacci value meant that Scala Adv overran the heap, but the difference between Scala simple and Java was negligible.]

From the results it appears that the variations in response times are most likely caused by external factors (i.e. the OS or other services taking resources, garbage collections, etc.). I will really need to increase the sample size to get a better understanding of what is influencing response times.

[Image: Fibonacci sequence to 10,000th value]
[Image: Fibonacci sequence to 100,000th value]
Categories: ITOps

Scala vs Java performance – dynaTrace Analysis – Test 1 Fibonacci Sequence

Trying to make some comparisons between Scala and Java. Using dynaTrace I can get very useful and granular insight into the behaviours of each.  To keep things as comparable as possible the tests will be on simple, short algorithms. Large iterations will be conducted so garbage collection and space complexity can be analysed.

Thanks to Julio Faerman for his post and the implementations I have used from that post – see: http://www.infoq.com/articles/scala-java-myths-facts

With dynaTrace I can get information about the JVM behaviour but also about the behaviour of specific classes and methods. This will not only allow me to make some comparison between Java and Scala implementations but also understand more about what happens when Scala is compiled to byte code.

To start with, a nice and simple test – the Fibonacci sequence.

Three implementations will be tested:

Java Fibonacci  – see code

Scala Simple Fibonacci  – see code

Scala Advanced Fibonacci  – see code
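The linked sources are the actual code under test; purely as an illustration of the shape of a ‘simple’ iterative variant (an assumed sketch, not the tested source, using BigInt to survive large indices):

def fib(n: Int): BigInt = {
  var (a, b) = (BigInt(0), BigInt(1))
  for (_ <- 0 until n) { val next = a + b; a = b; b = next }
  a
}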

I am running the tests from a controller class with a main method. From there I can instantiate the classes above. Read more about dynaTrace to see how it uses sensors in the JVM to grab a great deal of useful data about the JVM and about classes and methods.

Just to get things rolling, I checked out how running the Fibonacci sequence to the 10 millionth number in the sequence would affect the JVM. These tests would have taken a long time to complete, so after I saw stable behaviour (i.e. two sequences) I killed the JVM and moved on to the next sequence.

The charts below have been modified for human readability. For example, Garbage Collection in the charts was factored up. The charts are also stacked for readability. Within the dynaTrace client this is much easier to read with mouse over pop-ups.

What can be taken from this fairly simple example is that code may work and appear to be performing well. In reality it may be wreaking havoc on the JVM. Next post will look at the response times of these implementations.

[Image: JVM heap usage and garbage collection utilization]
[Image: Removing the Scala Adv sequence allows for better comparison of Java and Scala Simple]
[Image: Garbage collection utilization and garbage-collection-caused suspensions (time)]
[Image: GC utilization, CPU utilization and suspension time caused by GC, without Scala Advanced]

 

Test details:

Environment:

Virtual Machine: Ubuntu 12.04 LTS 32 bit - 3072MB, 2 Processors
Physical Host: Windows 7 64-bit - 8GB, i7-2620M @ 2.70 GHz

Versions:

java version "1.6.0_24"
OpenJDK Runtime Environment (IcedTea6 1.11.1) (6b24-1.11.1-4ubuntu3)
OpenJDK Server VM (build 20.0-b12, mixed mode)
Scala compiler version 2.9.2

Run script:

#!/bin/bash
scala_path=/usr/share/java/scala-library.jar
java -Xmx2048m -cp $scala_path:. -agentpath:/home/mark/Desktop/performance_testing/dynatrace/agent/lib/libdtagent.so=name=TestJVM_scala_vs_java,server=192.168.242.60:9998 PerformanceController
Categories: Random

MapReduce Programming Model and Hadoop

After hearing the praises of Hadoop I had a brief look at it and the MapReduce paradigm. Most info was read from Wikipedia and references in wiki articles; in particular, this paper from Jeffrey Dean and Sanjay Ghemawat.

Hadoop is an open-source software framework that implements MapReduce and a distributed file system modelled on the Google File System. The aim is to enable easy and robust deployment of highly parallel data set processing programs. It is important to note that the MapReduce model is applicable to embarrassingly parallel problems. Processing can occur on data that is stored in a database or filesystem.

Map/Reduce refers to the steps in the model:

Map: A master node takes an input problem and divides it into sub-problems, passing them to worker nodes. Worker nodes can further divide sub-problems. For example, in the problem of counting word occurrences in a document, the Map function will output a key/value pair every time it sees a specified word, i.e. (“searchterm”, 1).

Reduce: The reduce function takes the list of word(key)/values and sums the occurrences:

(foo, 1)
(bar, 1)
(foo, 1)
(bar, 1)
(foo, 1)

The output of reduce in this case would be (foo, 3) for the foo key, and (bar, 2) for bar.
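A toy sketch of that word-count flow using plain Scala collections (an illustration of the model only; Hadoop’s actual API works through Mapper and Reducer classes):

val docs = List("foo bar foo", "bar foo")

// Map step: emit (word, 1) for every word occurrence
val mapped = docs.flatMap(_.split("\\s+").map(word => (word, 1)))

// Reduce step: group the pairs by word and sum the counts per key
val reduced = mapped.groupBy(_._1).mapValues(_.map(_._2).sum)
// reduced: Map(foo -> 3, bar -> 2)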

The MapReduce model becomes powerful when considering giant datasets can be processed in large clusters very efficiently using this model.

The Hadoop run-time takes care of:

  • Parallelization
  • Partitioning of input data
  • Scheduling program execution across machines
  • Handling machine failures
  • Managing inter-machine communication

Hadoop aims to enable developers with little distributed programming experience to utilize compute resources such as EC2.

With the emergence of ‘big data’ and the apparent value that can be extrapolated from massive databases/datastores, many organisations have found the limits of traditional relational databases. Hadoop has such a big buzz because it can push past the processing boundaries of relational database software and enable that extrapolation of value. The video below is a decent explanation of this point by data scientists at SalesForce.com.


Categories: Random

Parsing HTML with Python

Came across an interesting scenario this week where I decided to try using Python for parsing HTML. It turned out to be quite straightforward and something I could imagine using more often in the future.

We were trying to grab a dynamic link from this page at Twitch: http://www.twitch.tv/directory/StarCraft%20II:%20Wings%20of%20Liberty.

The link we were after was the most viewed channel at any particular time:

[Image: The most viewed channel is always in this location/element]

The first attempt was to load that page in an <iframe>, allowing our own JavaScript to grab the relevant URL and then forward. Twitch.tv, however, forces forwarding when iframes are detected, presumably for proprietary reasons, so JavaScript was not really an option.

After experimenting with Python’s lxml (http://lxml.de/), it proved really straightforward and effective. It is apparently quite efficient too, thanks to its C core (http://stackoverflow.com/a/6494811/692180). The module I used is documented quite well here: http://lxml.de/lxmlhtml.html. With a little bit of trial and error, a brief Python script successfully grabs the relevant information.

Then, using PHP’s popen() function, I can simply call the Python script and use the return value as a PHP variable for a header redirect.

Links to source:

get_link.py – Python script

get_link.php – PHP calling Python and using the return value:

 

<?php
error_reporting(E_ALL);
// Run the Python script and capture its stdout (the channel URL)
$handle = popen('python -i ./get_link.py 2>&1', 'r');
$read = fread($handle, 2096);
pclose($handle);
// trim() strips the trailing newline so the Location header is valid
header('Location: ' . trim($read));

 

Python source:

import urllib
from lxml import html

def getElement():
    # Fetch the directory page, storing its contents in 's'
    f = urllib.urlopen("http://www.twitch.tv/directory/StarCraft%20II:%20Wings%20of%20Liberty")
    s = f.read()
    f.close()
    # Parse the page and narrow down to the channel listing
    doc = html.document_fromstring(s)
    doc = doc.get_element_by_id('directory_channels')
    # The first element with class 'thumb' is the most viewed channel
    doc = doc.find_class('thumb')
    # Extract the href attribute from the serialized anchor tag
    rVal = html.tostring(doc[0]).split()[2]
    return makeUrl(rVal)

def makeUrl(thumbString):
    # Strip the surrounding href="..." text to leave the channel path
    return "http://twitch.tv" + thumbString[6:-2]

if __name__ == "__main__":
    print getElement()