Categories
Random

Setting secure, httpOnly and cache control headers using ModSecurity

Many older web applications do not apply headers/tags that are now considered standard information security practices. For example:

  • Pragma: no-cache
  • Cache-Control: no-cache
  • httpOnly and secure flags

Adding these controls can be achieved using ModSecurity without any needs to modify the application code.

In the case where I needed to modify the cookie headers to include these now controls I added the following to core rule set file: modsecurity_crs_16_session_hijacking.conf.

#
# This rule will identify the outbound Set-Cookie SessionID data and capture it in a setsid
#
#addding httpOnly
Header edit Set-Cookie "(?i)^(JSESSIONID=(?:(?!httponly).)+)$" "$1; httpOnly"
Header set Cache-Control "no-cache, no-store, must-revalidate"
Header set Pragma "no-cache"
Header set Expires "0"

 

This adds the cookie controls we were after – Depending on your web application you may need to change ‘JSESSIONID’ to the name of the relevant cookie.

You can find the cookie name simply using browser tools such as Chrome’s Developer Tools (hit F12 in chrome). Load the page you want to check cookies for, click on the Resources tab:

ChromeCookies

After setting the HTTPOnly and Secure flags you can check the effectiveness using the Console table and listing the document cookies… which should now return nothing.

document.cookie

Categories
Random

Migrating to EJBCA from OpenSSL and TinyCA

Install and configure EJBCA

EJBCA 6.0.3 – http://www.ejbca.org/download.html

JBoss AS 7.1.1 Final – http://download.jboss.org/jbossas/7.1/jboss-as-7.1.1.Final/jboss-as-7.1.1.Final.zip

Prereqs:

Ref:

Detailed deployment guide: http://majic.rs/book/free-software-x509-cookbook/setting-up-ejbca-as-certification-authority

EJBCA doc: http://wiki.ejbca.org/

Architecture

Recommended architecture (source: http://ejbca.org/architecture.html)

Import existing OpenSSL CA

Step 1 – Export the OpenSSL priv key and cert to a PKCS#12 keystore:

openssl pkcs12 -export -out exitingCA1.p12 -inkey  \
        -in  -name existingCA1

Step 2 – Import the PKCS#12 keystore to EJBCA CA

/bin/ejbca.sh ca importca  existingCA1.p12

Step 3 – Verify import

/bin/ejbca.sh ra adduser

### IMPORTANT ###

Distinguished name order of openssl may be opposite of ejbca default configuration – http://www.csita.unige.it/software/free/ejbca/ … If so, this ordering must changed in ejbca configuration prior to deploying (can’t be set on a per CA basis)

Have not been able to replicate this issue in testing.

Import existing TinyCA CA

Basic Admin and User operations

Create and end entity profile for server/client entities

Step 1 – Create a Certificate Profile (http://wiki.ejbca.org/certificateprofiles)

Step 2 – Create and End Entity Profile (http://wiki.ejbca.org/endentityprofiles)

* EndEntities can be deleted using:

/bin/ejbca.sh ra delendentity 

Issuing certificates from CSRs

End entities need to be created for clients/servers that require certificates signed by our CA.

Step 1 – Create and End Entity (http://ejbca.org/userguide.html#Issue a new server certificate from a CSR)

Step 2 – Sign CSR using the End Entity which is associated with a CA

Importing existing certificates

EJBCA can create endentities and import their existing certificate one-by-one or in bulk (http://www.ejbca.org/docs/adminguide.html#Importing Certificates). Bulk inserts import all certificates under a single user which may not be desirable. Below is a script to import all certs in a directory one by one under a new endentity which will take the name of the certificate CN.

#!/bin/sh

# for each certificate in the directory
#       create and enduserentity
#       enduserentity username = certificate CN
#       enduserentity token/pwrd = certificate CN

EJBCA_HOME="/usr/share/ejbca"
IMPORT_DIR=$1
CA=$2
ENDENTITYPROFILE=$3
SSLCERTPROFILE=$4
AP="_OTE"

if [ $# -lt 4 ]; then
        echo "usage: import_existing_certs.sh    "
        exit 1
fi
for X in $IMPORT_DIR*.pem
do
        echo "######################################################"
        echo "Importing: " $X
        CN=$(openssl x509 -in $X -noout -text | grep Subject: | sed -n 's/^.*CN=\(.*\),*/\1/p')
        echo "CN: " $CN
        printf "Running import: %s ca importcert '%s' '%s' '%s' ACTIVE NULL '%s' '%s' '%s'\n" "$EJBCA_HOME/bin/ejbca.sh" "$CN" "$CN" "$CA" "$X" "$ENDENTITYPROFILE" "$SSLCERTPROFILE"
        $EJBCA_HOME/bin/ejbca.sh ca importcert "$CN$AP" "$CN$AP" "$CA" ACTIVE null $X $ENDENTITYPROFILE $SSLCERTPROFILE
        echo "######################################################"
done

Creating administrators

Create administrators that can sign CSR and revoke certificates: http://ejbca.org/userguide.html#Administrator%20roles

Revoking certificates

#Generate CRL via command line
# List CAs
/usr/share/ejbca/bin/ejbca.sh CA listcas
# Create new CRLs:
/usr/share/ejbca/bin/ejbca.sh CA createcrl "" -pem 
# Export CRL to file
/usr/share/ejbca/bin/ejbca.sh CA getcrl "" -pem .pem

Checking certificate validity/revoke status via OSCP

openssl ocsp -issuer gtld_CA_cert.pem -CAfile gtld_CA_cert.pem \
-cert gtld_registrar5.pem -req_text -url http://localhost:8080/ejbca/publicweb/status/ocsp

Monitoring expiring certs

/bin/ejbca.sh listexpired 100

 

Categories
Random

Getting started with ModSecurity

XSS, CSRF and similar types of web application attacks have overtaken SQL injections as the most commonly seen attacks on the internet (https://info.cenzic.com/2013-Application-Security-Trends-Report.html). A very large number of web application were written and deployed prior to the trend up in likelihood and awareness of XSS attacks. Thus, it is extremely important to have an effective method of testing for XSS vulnerabilities and mitigating them.

Changes to production code bases can be slow, costly and can miss unreported vulnerabilities quite easily. The use of application firewalls such as ModSecurity (https://github.com/SpiderLabs/ModSecurity/) become an increasingly attractive solution when faced with a decision on how to mitigate current and future XSS vulnerabilities.

Mod Security can be embedded with Apache, NGINX and IIS which is relativity straight forward. In cases where alternative web severs are being used ModSecurity can still be a viable option by creating a reverse proxy (using Apache of NGINX).

How can ModSecurity be used?

  • Alerting
  • Transforming
  • Blocking
  • and more

These functions can be enacted by rules.

A default action can be created for a group of rules using the configuration directive “SecDefaultAction

Using the following SecDefaultAction at the top of rule set that we want enable blocking and transforming on is a blunt method of protection. Redirection can also be used as a method of blocking.

A powerful web application firewall - free software!
A powerful web application firewall – free software!

Example of a default action to be applied by ruleset (note defaults cascade through the ruleset files):

SecDefaultAction “phase:2,log,auditlog,deny,status:403,tag:’Unspecified usage'”

Rulesets have been created by OWASP.

Using optional rulesets,  modsecurity_crs_16_session_hijacking.conf and modsecurity_crs_43_csrf_protection.conf ModSecurity can provide protection against Cross Site Request Forgeries [CSRF]. The @rsub operators can inject a token on every form (and/or other html elements). ModSecurity can store the expected token value as a variable which is compared to the value posted via forms or other html elements. ModSecurity rules can be based on request methods and URIs etc – alongside the ability to chain rules there are a huge number of options for mitigating XSS and CSRF without impacting normal applicatioin usage.

@rsub

Requirements:

  • SecRuleEngine On
  • SecRequestBodyAccess On
  • SecResponseBodyAccess On

## To enable @rsub

  • SecStreamOutBodyInspection On
  • SecStreamInBodyInspection On
  • SecContentInjection On

Injecting unique request id from mod_unique_id into forms:

SecRule STREAM_OUTPUT_BODY "@rsub s/<\/form>/<input type=\"hidden\" name=\"rv_token\" value=\"%{unique_id}\"><\/form>/" \
"phase:4,t:none,nolog,pass"

Some simple rules:

<LocationMatch ".*\/<directory requiring authentication>\/.*">
# All requests submitted using POST require a token - not the validation of the token can only be completed if that variable is stored from a previous response
SecRule REQUEST_METHOD "^(?:POST)$" "chain,phase:2,id:'1234',t:none,block,msg:'CSRF Attack Detected - Missing CSRF Token when using POST method - ',redirect:/" SecRule &ARGS:token "!@eq 1" "setvar:'tx.msg=%{rule.msg}',setvar:tx.anomaly_score=+%{tx.critical_anomaly_score},setvar:tx.%{rule.id}-WEB_ATTACK/CSRF-%{matched_var_name}=%{matched_var}"
# Check referrer is valid for an authenticated area of the application

SecRule REQUEST_HEADERS:Referer "!@contains <my website>" "block,phase:2,id:'2345',t:none,block,msg:'CSRF Attack Detected - No external referers allowed to internal portal pages',redirect:/" SecRule REQUEST_URI "@contains confirmUpdate" "chain,phase:2,id:'3456',t:none,block,msg:'CSRF Attack Detected - Missing CSRF Token. Confirmation button - ',redirect:/" SecRule &ARGS:rv_token "!@eq 1" "setvar:'tx.msg=%{rule.msg}',setvar:tx.anomaly_score=+%{tx.critical_anomaly_score},setvar:tx.%{rule.id}-WEB_ATTACK/CSRF-%{matched_var_name}=%{matched_var}" </LocationMatch>

Pros:

  • Wide capabilities for logging, alerts, blocking, redirecting, transforming
  • Parses everything coming into your web server over HTTP
  • Virtual patching – if a vulnerability is made public that affects your web application you can write and deploy a rule to mitigate the vulnerability much faster than re-release of application code patched
  • Extended uses – the capabilities of ModSecurity can be applied to applications outside the scope of application security

Cons:

  • Added complexity to your application delivery chain – another point for maintenance and failure
  • Performance costs? – Though I have not had the opportunity to test the performance costs holding session information in memory and inspecting every byte of HTTP traffic can’t be free from performance cost
  • Hardware costs – Particularly if using ModSecurity’s BodyAccess and BodyInspection features, memory usage will be significant

Improving deployments:

  • Starting off being aggressive on warnings and very light on action is a necessity to ensure no impact on normal application usage
  • From this point rules and actions need to be refined
  • Understanding how the applications works allows the use of ModSecuirtys header and body inspection in effective ways

Some other notes extracted from the ModSecurity Handbook – If you decide to use ModSecurity I strongly recommend buying the handbook. It is not expensive and saves a lot of time.

### RULE STRUCTURE ###
SecRule VARIABLES OPERATOR [TRANSFORMATION_FUNCTIONS, ACTIONS]

### VARIABLES ###
REQUEST_URI Request URI, convert to exclude hostname
REQUEST_METHOD Request method
ARGS Request parameters (read-only collection)
ARGS_NAMES Request parameters’ names (collection)
ARGS_GET Query string parameters (read-only collection)
ARGS_GET_NAMES Query string parameters’ names (read-only collection)
ARGS_POST Request body parameters (read-only collection)
ARGS_POST_NAMES Request body parameters’ names (read-only collection)
### STRING MATCHING OPERATORS ###
@beginsWith Input begins with parameter
@contains Input contains parameter
@endsWith Input ends with parameter
@rsub Manipulation of request and response bodies
@rx Regular pattern match in input
@pm Parallel pattern matching
@pmFromFile (also @pmf as of 2.6) Parallel patterns matching, with patterns read from a file
@streq Input equal to parameter
@within Parameter contains input
### NUMBER MATCHING OPERATORS ###
@eq Equal
@ge Greater or equal
@gt Greater than
@le Less or equal
@lt Less than
### ACTIONS ###
# DISRUPTIVE
allow Stop processing of one or more remaining phases
block Indicate that a rule wants to block
deny Block transaction with an error page
drop Close network connection
pass Do not block, go to the next rule
pause Pause for a period of time, then execute allow.
proxy Proxy request to a backend web server
redirect Redirect request to some other web server
# FLOW
chain Connect two or more rules into a single logical rule
skip Skip over one or more rules that follow
skipAfter Skip after the rule or marker with the provided ID
Others..
#METADATA, #VARIABLE, #LOGGING, #SPECIAL, #MISC
Categories
Random

SSL Review part 1

Most of us use and rely on SSL everyday. The mathematical workings of the RSA [Rivest, Shamir, Adleman] algorithm are not overly complex but mapping everything back to what happens in reality requires detailed understanding. Skipping over the need for SSL (for confidential and authenticated exchange of a symmetric key over and insecure medium) I will review the mathematical workings then how they are applied in real world examples.

There are also details in previous posts – RSA1, RSA2

Mathematics 

Step
Components
1. public key – e
standard practice to choose: 65537 
2. random primes p,q
Let’s use:

579810099525248565010050509754571001027,

6989752565699505597485398979958574481969

3. key modulus – n
 n = pq:

4052729130775091849638047446256554071699019514021047339267026030072286291982163

4. φ(n) = (p – 1)(q – 1)
 φ(n):

4052729130775091849638047446256554071691449951355822585104530580582573146499168

5. find e that is co-prime with φ(n)
 already using a prime,which will be co- prime.. – e = 65537 (in binary 10000000000000001)
6. (ed) mod φ(n) = 1d = e–1 mod φ(n) –  Modular multiplicative inverseMore than one answer
using Extended Euclidean algorithm:

944402082567056818708092537028397604145319798848072425038015030084640082599681,

4997131213342148668346139983284951675836769750203895010142545610667213229098849,

..+ 2φ(n), +3φ(n))

7. private key – d
 de–1 mod φ(n):

944402082567056818708092537028397604145319798848072425038015030084640082599681

With a public key (e), a key modulus (n) and a private key (d) we can apply the RSA algorithm.

Message (mess) = 911

RSA encrypt -> mess ^ e mod n  = 911 ^65537 mod 4052729130775091849638047446256554071699019514021047339267026030072286291982163

RSA encrypted message (ciph) = 3095021178047041558314072884014000324030086129008597834642883051983162360819331

RSA decrypt -> ciph ^ d mod n = 3095021178047041558314072884014000324030086129008597834642883051983162360819331 ^ 944402082567056818708092537028397604145319798848072425038015030084640082599681 mod

4052729130775091849638047446256554071699019514021047339267026030072286291982163

= 911

How is that secure?

When Alice encrypts using Bob’s public key (e) along with the key modulus (n) the output is a protected cipher.

An eavesdropper does not know the private key so decryption is very difficult:

Attacker must solve:

(unknown val, x) ^ e mod n = ciph

x ^65537 mod 4052729130775091849638047446256554071699019514021047339267026030072286291982163 = 3095021178047041558314072884014000324030086129008597834642883051983162360819331

OR, easier – try to determine the private key:

The attacker knows e and n (which = pq). When we created the private key (step 6 above) we conducted:  e–1 mod φ(n) – Modular multiplicative inverse which is relatively fast for us to calculate.

The attacked does not know  e–1 mod φ(n) though. φ(n) = (p – 1)(q – 1). The attacker knows that n is a composite prime = pq (where p and q are both primes).

So… if the attacker can solve p * q = n (where they know n) then RSA is insecure.

Thankfully the process of Integer factorization is so much harder than the process of creating p,q,nφ(n), e and d that online business and confidentiality can be maintained to acceptable levels.

Threats to RSA

It would be extremely valuable to malicious individuals/groups and  (more importantly) intelligence organizations make large integer factorization efficient enough to break RSA.

However the theoretical aspects of RSA are not generally recognized as the main source of vulnerability

Categories
Random

downloading a youtube playlist and converting to mp3 (osx/linux(ubuntu))

Downloading youtube audio and converting mp3 is very simple now thanks to these guys: https://rg3.github.io/youtube-dl/

Installing:

# on Ubuntu
sudo apt-get install youtube-dl avconv
curl https://raw.githubusercontent.com/SecurityShift/tools/master/scripts/random/GetYoutubePlaylist.sh -o GetYoutubePlaylist.sh
chmod 755 GetYoutubeAudio.sh

https://raw.githubusercontent.com/SecurityShift/tools/master/backup_scripts/google_drive_backup.py

# on OSX
brew install youtube-dl libav
curl https://raw.githubusercontent.com/SecurityShift/tools/master/scripts/random/GetYoutubePlaylist.sh -o GetYoutubePlaylist.sh
chmod 755 GetYoutubeAudio.s

Contents:


Usage: sh GetPlaylist.sh [youtube playlist URL] [output_directory]

Example:

sh ./GetYoutubePlaylist.sh http://www.youtube.com/playlist?list=PL702CAF4AD2AED35B ~/Music

To get the youtube playlist URL view the playlist by clicking on its name, no videos will be playing, then click on share and the url will be highlighted:

youtube-playlist
click on share and the URL will appear
Categories
Random

content private

content private

Categories
Functional Programming - Scala Random

Virtual Enigma Machine Implementation

After watching a video on numberphile thought it would be interesting to implement a virtual version on the Enigma cipher machine.

Step 1: Install Scala and Eclipse IDE for Scala.

http://www.scala-lang.org/downloadshttp://scala-ide.org/

Step 2: Review the numberphile video and how the Enigma machine works

Step 3: Implement each function of the machine

  1. Rotors
    • Rotor choices, rotors have different circuitry (0-24) – static value for each message
    • Initial rotor combination (0-25) – static value for each message
    • Rotor increment – Faster rotor, middle rotor, slow rotor
  2. Plug board
    • 10 wires
    • swapping 2 letters
    • static value for each message

Step 4: Write an interface between the functions and a basic interface (text?)

Step 5: Test!

Conclusion: Using Scala (trying anyway) the implementation is not very complex. What is more difficult is replicating the original machines where the output character could not equal the input character. The limitation Turing used to crack the machines. Might see if I can implement that later then test out the cracking method. Not sure how easy it would be to break the current implementation…

CODE (EnigmaMachine.scala, EnigmaMachine.class):

 

 

// Virtual reconstruction of what the enigma machines did
// The replication is of the improved version where the input could result in the same output

object EnigmaMachine {
//////////////// ENCRYPTION/DECRYPTION FUNCTIONS \\\\\\\\\\\\\\\\\\\\\
    //to replicate the simple machine circuit chars are converter to 0 - 25
    def convertMessageToSimpleInt(rawInput: String):List[Int] = {
        val userMessage = rawInput.toUpperCase
      (for (x <- userMessage) yield if (x != 32) (x.toInt - 65) else 26).toList
    }
    
    //for all integers in the list, convert
    def convertMessageToString(input: List[Int]): String = {
      (for (x <- input) yield if (x == 26) " " else (x + 65).toChar).mkString
    } 
    
    def enigmaGo(rotorChoices: List[Int], rotorPositions: List[Int], inputMessage: List[Int], encrypt: Boolean): List[Int] = {    
      // incrementing the rotors is dont based on the number of previous chars.. instead of the previous position
      def incrementRots(nth: Int): List[Int] = {
        val rot0 = (rotorPositions(0) + nth) % 26 //fast rotor
        val rot1 = (rotorPositions(1) + (nth / 26)) % 26 //medium rotor
        val rot2 = (rotorPositions(2) + ((nth / 26) /26)) % 26 //slow rotor
        List(rot0, rot1, rot2)
      }
      //applies the rotor encryption then the scramble board
      def transmuteElement(element: Int, rotPosApp: List[Int]): Int = {
        val rotorTotal = rotorList(rotorChoices(0))(rotPosApp(0)) +
              rotorList(rotorChoices(1))(rotPosApp(1)) +
              rotorList(rotorChoices(2))(rotPosApp(2)) 
        
      def scramble(inputVal:Int):Int = plugMap1.find(x => (x._1 == inputVal) || (x._2 == inputVal)) match {
            case Some(y) => if (y._1 == inputVal) y._2 else y._1 
          case None => inputVal
      }
        if (element == 26) element
        else if (encrypt) (scramble(element) + rotorTotal) % 26 else { 
            //decrypt is a bit messy atm :( 
            if (element - (rotorTotal % 26) < 0) scramble(element - (rotorTotal % 26) + 26) else scramble(element - (rotorTotal % 26))
          }     
      }
      
      //Accumulator hold the input and output lists, when input list is empty, return output.. also tail recursive
      def enigmaAccu(remInput: List[Int], rotPos: List[Int], accumOutput: List[Int]): List[Int] = {
              if (remInput.isEmpty) accumOutput
            else enigmaAccu(remInput.tail, incrementRots(accumOutput.length), accumOutput:+(transmuteElement(remInput.head, rotPos)))
      }
      
      enigmaAccu(inputMessage, rotorPositions, Nil) //start the process
    }
////////////////////   \\\\\\\\\\\\\\\\\\\\\\\\\
    
//////////////// UI STUFF \\\\\\\\\\\\\\\\\\\\\
    def useMachine(rotorChoices: List[Int], rotorStartVals: List[Int], activity: String) = {
            spamLine
            val inputMessage = convertMessageToSimpleInt(askInput("Please enter a message to be " + activity + ", use only a-z: "))
            val resultingMessage = enigmaGo(rotorChoices, rotorStartVals, inputMessage, if (activity == "encrypted") true else false)
            spamLine
            println("Your message has been " + activity + ": ")
        println(convertMessageToString(resultingMessage))
            spamLine
            exit(0)
    }

    def getRotorChoices(s1:String, p: String => Boolean):List[Int] = {
        val slot0 = getUserInput(s1 + "fast slot: ", "Entry Invalid.", p).mkString
        val slot1 = getUserInput(s1 + "medium slot: ", "Entry Invalid.", p).mkString
        val slot2 = getUserInput(s1 + "slow slot: ", "Entry Invalid.", p).mkString
        List(slot0.toInt,slot1.toInt,slot2.toInt)
    }
    
    def spamLine = println("##############################################")
    
    def askInput(statement: String) = { 
      val input = readLine(statement)
      if (input == "") "99" else input
    }
    
    def askAgain(statement: String, f: => String) = {
        println(statement)
        f
    } 
    
    def inputVals(f1: => String, f2: => String) = Stream.cons( f1, Stream.continually(f2))
    
    def getUserInput(s1: String, s2: String, p:String => Boolean) = inputVals(askInput(s1), askAgain(s2, askInput(s1))).find(p)
////////////////////   \\\\\\\\\\\\\\\\\\\\\\\\\
    
//////////////// MAIN \\\\\\\\\\\\\\\\\\\\\ 
        def main(args: Array[String]) {
            println("### WELCOME TO THE ENIGMA MACHINE! ###" )
            // GET INITIAL MACHINE SETTINGS FROM USER 
            // ROTOR PLACEMENT CONFIG
            val rotorChoices = getRotorChoices("Choose rotor (0 - 4) for ", x => (x.toInt >= 0 && x.toInt < 5)) 
            // ROTOR START VALS 
            val rotorStartVals =  getRotorChoices("Choose rotor START VAL (0 - 25) for ", x => (x.toInt >= 0 && x.toInt < 25))
            // PLUG MAP
            println("There is only 1 plug map option at present:") 
            println(plugMap1.toString.drop(4))
            //ENCRYPT OF DECRYPT - ONLY RELEVENT FOR GUI TEXT.. bleh not quite need to reverse circuit manually
            val userChoice = getUserInput("Press 1 to encrypt, 2 to decrypt: ", "Entry Invalid.", x => (x == "2" || x == "1"))
            userChoice.mkString match {
              case "1" => useMachine(rotorChoices, rotorStartVals, "encrypted") 
              case "2" => useMachine(rotorChoices, rotorStartVals, "decrypted")
              case default => exit(0)
            }
    }
////////////////////   \\\\\\\\\\\\\\\\\\\\\\\\\

//////////////// CONSTANTS \\\\\\\\\\\\\\\\\\\\\
    val rotorList = List(List(6, 17, 4, 3, 23, 22, 21, 5, 0, 19, 20, 11, 16, 9, 13, 24, 18, 1, 8, 2, 7, 10, 12, 14, 25, 15)
    ,List(2, 7, 23, 12, 13, 15, 18, 1, 8, 11, 0, 3, 10, 5, 24, 4, 17, 19, 6, 9, 20, 14, 25, 21, 16, 22)
    ,List(23, 1, 19, 18, 0, 10, 3, 8, 9, 21, 2, 12, 13, 14, 24, 6, 20, 16, 11, 17, 15, 7, 4, 25, 22, 5)
    ,List(21, 22, 16, 2, 18, 20, 15, 9, 5, 1, 17, 10, 3, 6, 0, 13, 11, 23, 4, 12, 19, 8, 7, 25, 24, 14)
    ,List(25, 3, 11, 7, 5, 15, 24, 10, 21, 6, 12, 17, 8, 23, 2, 18, 19, 14, 4, 0, 20, 9, 22, 1, 16, 13))
    
    val plugMap1 = List((3,7),(11,16),(18,24),(17,9),(21,19),(14,13),(0,12),(1,25),(22,20),(4,6))
////////////////////   \\\\\\\\\\\\\\\\\\\\\\\\\
}
Categories
Random

MapReduce Programming Model and Hadoop

After hearing the praises of Hadoop I had a brief look it and the Map/Reduce paradigm. Most info was read from Wikipedia and references in wiki articles. In particular, this paper from Jeffrey Dean and Sanjay Ghemawat.

Hadoop is an opensource software framework that implements MapReduce and Google File System. The aim is to enable easy and robust deployment of highly parallel data set processing programs. It is important to note that the MapReduce model is applicable to embarrassingly parallel problems. Processing can occur on data that is stored in a database or filesystem.

Map/Reduce refers to the steps in the model:

Map: A master node takes an input problem and divides it into sub-problems, passing them to worker nodes. Worker nodes can further divide sub-problems.  For example, in the problem of counting word occurrence in a document the Map function will output a key/value pair every time it sees a a specified work – ie: (“searchterm”, 1).

Reduce: The reduce function takes the list of word(key)/values and sums the occurrences:

(foo, (1)

(bar, 1)

(foo, 1)

(bar, 1)

(foo, 1)

The output of reduce in this case could be: (foo, 3).

The MapReduce model becomes powerful when considering giant datasets can be processed in large clusters very efficiently using this model.

Hadoop Run-time Takes care of:

  • Parallelalization
  • Partitioning of input data
  • Scheduling programs execution across machines
  • Handling machine failures
  • Managing inter-machine communication

Hadoop aims to enable developers with little distributed programming experience to utilize compute resources such as EC2.

With the emergence of ‘big data’ and the apparent value that can be extrapolated from massive databases/datastores, many organisations have found the limits of traditional relational databases. Hadoop has such a big buzz because it can pass the processing boundaries of relational database software and enable the extrapolation of value. The video below is a decent explanation of this point by data scientists at SalesForce.com


Categories
Random

Parsing HTML with Python

Came across an interesting scenario this week where I decided to try and use Python for parsing HTML. Turns out to quite straight forward and something I could imagine using more often in the future.

We were trying to grab the a dynamic link from this page at twitch http://www.twitch.tv/directory/StarCraft%20II:%20Wings%20of%20Liberty.

The link we were after was the most viewed channel at any particular time:

The most viewed channel is always in this location/element

First attempt was to load up that page in an <iframe> allowing our own JavaScript to grab the relevant URL then forward. Twitch.tv however forces forwarding when iframes are detected, presumably for proprietary reasons. So javascript was not really an option.

After experimenting with Python’s lxml http://lxml.de/ it was really straight forward and effective. It is apparently quite efficient using C core (http://stackoverflow.com/a/6494811/692180) too. The module I used is documented quite well here: http://lxml.de/lxmlhtml.html. With a little bit of trial an error a brief python script successfully grabs the relevant information.

Then using PHP’s popen() method I can simply call the python script and use the return value as a php variable for a header redirect.

Links to source:

get_link.py – Python script

get_link.php – PHP calling python and using return value :

 

error_reporting(E_ALL);
$handle = popen('python -i ./get_link.py 2>&1', 'r');
$read = fread($handle, 2096);
pclose($handle);
header('Location: '.$read);

 

Python source:

import urllib
from lxml import html

def getElement():
        f = urllib.urlopen("http://www.twitch.tv/directory/StarCraft%20II:%20Wings%20of%20Liberty")
        # Read from the object, storing the page's contents in 's'.
        s = f.read()
        f.close()# Read from the object, storing the page's contents in 's'.
        doc = html.document_fromstring(s)
        doc = doc.get_element_by_id('directory_channels')
        #rVal = trimDoc.text_content()
        doc = doc.find_class('thumb')
        rVal = html.tostring(doc[0]).split()[2]
        return makeUrl(rVal)

def makeUrl(thumbString):
        return "http://twitch.tv" +  thumbString[6:-2]

if __name__ == "__main__":
        print getElement()

 

Categories
Random

Pascals Triangle – python

Wrote a quick script that attempts to complete pascals triangle recursively, thanks Feng.
** Note, after getting some feedback made a number of changes, thanks Fry**
see source: pascals.py

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
#########################################################
# Python Script: Pascals triangle #
# Enter How many rows you want... #
# Mark Culhane 2011, straightIT.com #
#########################################################

#Initialize
tri =[[1],[1]]
n = 0

def pascals():
if len(tri) rowNum = len(tri) - 1
colCnt = len(tri[rowNum])

if colCnt == rowNum:
tri[rowNum].extend([1])
tri.append([1])
else:
p1 = tri[len(tri)-2][len(tri[len(tri)-1])-1]
p2 = tri[len(tri)-2][len(tri[len(tri)-1])]
tri[rowNum].extend([p1+p2])
pascals()
return

def displaySol():
i = 0
for x in tri:
if i != n:
print(tri[i])
i = i + 1

if __name__ == "__main__":
print "Pascals Triangle"
n = long(raw_input("How many rows do you want?:"))
print "Result: "
pascals()
displaySol()