
Data Mining Project – Week 1

Started off with a bunch of data in CSV format from the World Bank. I did some initial exploration with Weka but found that everything was too complex to just plug in data and run searches. Despite doing some uni subjects on the topic, I still do not have a strong understanding of (or have forgotten) the low-level details of pre-processing and structuring data in the best way for Weka’s algorithms.

Using Python scripts is an easy way to work programmatically with CSV files, particularly with the Python CSV library.

An example of a very simple script for dealing with missing values is here: http://mchost/sourcecode/python/datamining/csv_nn_missing.py
Note that in that implementation the replacement value is just zero. That can be changed to nearest neighbor or another preferred approximation.
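For reference, the core of that approach fits in a few lines. A minimal sketch, assuming a CSV with a header row (the file names here are made up, and the zero fill can be swapped for a nearest neighbor estimate):

    import csv

    # Copy a CSV, replacing empty cells with a placeholder value.
    with open("worldbank.csv", newline="") as src, \
         open("worldbank_filled.csv", "w", newline="") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        writer.writerow(next(reader))          # pass the header row through
        for row in reader:
            # an empty string means a missing value; fill it with zero
            writer.writerow([cell if cell.strip() else "0" for cell in row])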

I will use the ARFF file format for my own implementations as it seems to be a good standard and will mean I won’t have to keep modifying things if I want to use Weka in place of my own code.
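Writing a numeric table out as ARFF is simple enough to do by hand. A minimal sketch (relation and attribute names are invented; Weka’s own converters handle nominal types, missing values and the rest):

    # Write rows of numeric data as a bare-bones ARFF file.
    rows = [[1.2, 3.4], [5.6, 7.8]]
    attributes = ["gdp_growth", "inflation"]

    with open("economy.arff", "w") as f:
        f.write("@relation economy\n\n")
        for name in attributes:
            f.write(f"@attribute {name} numeric\n")
        f.write("\n@data\n")
        for row in rows:
            f.write(",".join(str(v) for v in row) + "\n")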


So I have started working through the text book written by Weka’s creators:

Ian H. Witten, Eibe Frank, Mark A. Hall, 2011, Data Mining: Practical Machine Learning Tools and Techniques, 3rd Edition


I am breezing through the initial chapters which introduce the basic data mining concepts. There was a particularly interesting section on how difficult it is to maintain anonymity of data.

over 85% of Americans can be identified from publicly available records using just three pieces of information: five-digit zip code, birth date, and sex. Over half of Americans can be identified from just city, birth date, and sex.

In 2006, an Internet services company released to the research community the records of 20 million user searches. The New York Times were able to identify the actual person corresponding to user number 4417749 (they sought her permission before exposing her). They did so by analyzing the search terms she used, which included queries for landscapers in her hometown and for several people with the same last name as hers, which reporters correlated with public databases.

Netflix released 100 million records of movie ratings (from 1 to 5) with their dates. To their surprise, it turned out to be quite easy to identify people in the database and thus discover all the movies they had rated. For example, if you know approximately when (give or take two weeks) a person in the database rated six movies and you know the ratings, you can identify 99% of the people in the database.
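The linkage idea behind that result is easy to sketch: given a few (movie, rating, approximate date) observations about a person, keep only the users consistent with all of them. A toy version with entirely made-up data:

    from datetime import date, timedelta

    WINDOW = timedelta(days=14)        # "give or take two weeks"

    # toy anonymised database: user id -> [(movie, rating, date), ...]
    db = {
        "u1": [("Movie A", 4, date(2005, 3, 1)), ("Movie B", 2, date(2005, 5, 20))],
        "u2": [("Movie A", 4, date(2005, 9, 9))],
    }

    def candidates(observations):
        """Users whose ratings match every observation within the date window."""
        return [user for user, ratings in db.items()
                if all(any(m == om and r == orat and abs(d - od) <= WINDOW
                           for m, r, d in ratings)
                       for om, orat, od in observations)]

    print(candidates([("Movie A", 4, date(2005, 3, 5))]))   # ['u1']

With six observations instead of one, almost every user in a realistically sized database is pinned down uniquely, which is the book’s point.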

Trying to get through the first of the book’s three sections this week so I can start implementing the practical algorithms. Hoping to do my own implementations rather than just using Weka, as it will probably end up being quicker and I will definitely learn more.


Data Mining Project – Plan

Decided to try to apply the data mining techniques learnt in the Intelligent Systems course to publicly available economic data. The two sources of data that I will start off with are:

I will run a mix of supervised and unsupervised techniques. When conducting supervised analysis I will look for relationships between economic indicators to provide inference on discussion topics such as:

  • The value of high equality in an economy
  • Benefits of non-livestock or livestock agriculture
  • Gains through geographic decentralization of population
  • Estimates on sustainable housing price ranges
  • The value of debt
  • Productivity of information technology
  • Costs and benefits of lax/tight immigration policies
  • Costs and benefits of free-market/regulated/centralized economic governance

Techniques used for quantitative analysis will be varied and dependent on subsequent success. To start with, I plan on using the raw data sources in conjunction with some simplistic Python and Java scripts. If that turns out to be ineffective I will work with MATLAB, Weka and Netica. Google and the World Bank also have numerous interfaces for exploring data. This will be an ongoing project, so I will use these posts to help keep track of progress.




FIT5037 – Advanced Network Security Week 9

‘Network security and performance’ marked the ninth week of FIT5037. This is a logical extension of the previous week’s lecture on organizational-level network security. There has traditionally been a trade-off between speed and security. This is most definitely a sore spot for many organizations, particularly when they find a degradation in performance after investing money! The lecture looked at common techniques that should be used to ensure convenience is not disproportionately affected by security efforts. The notes outlined four key topics for the week:

  • Load balancing and firewalls
  • VPN and network performance
  • Network address translation [NAT] and load balancing
  • Network security architecture

Key awareness issues that were recurring through the lecture:

  • Security! – Does a software/hardware/architecture solution, or a combination of these, provide sufficient security?
  • Speed and availability – Do security solutions allow for the required level of service availability for operational requirements? Is service speed affected to an unacceptable extent?
  • Robustness – If one component fails, what are the repercussions for the rest of the network in terms of the previous issues?
[Figure: Example of adjustments to network design in consideration of organisational concerns (source: notes10)]

The diagram above illustrates how the adoption of load balancers and multiple parallel firewalls satisfies speed and robustness requirements.
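A toy way to see the robustness half of that: a round-robin balancer that skips unhealthy units keeps traffic flowing when a firewall dies (names and health flags below are invented):

    from itertools import cycle

    firewalls = {"fw1": True, "fw2": True, "fw3": True}   # name -> healthy?
    ring = cycle(firewalls)

    def pick_firewall():
        """Round-robin over the pool, skipping anything marked unhealthy."""
        for _ in range(len(firewalls)):
            fw = next(ring)
            if firewalls[fw]:
                return fw
        raise RuntimeError("no healthy firewall available")

    firewalls["fw2"] = False                     # simulate a failed unit
    print([pick_firewall() for _ in range(4)])   # fw2 never appears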

The lecture went on to introduce the topics of protocol security and certain VPN solutions.


FIT5185 – IT Research Methods Week 9

The final lecture on quantitative data analysis covered five specific statistical tests:

  • Binomial – Given a weighted coin, how many heads will probably result from 30 tosses?
  • Median – Checks that the medians of two populations are not significantly different
  • Mood’s median test – Checks for significant similarity between unrelated samples (non-parametric)
  • Kolmogorov-Smirnov – Measures the cumulative difference between data; are the data sets different?
  • Friedman – Tests for significant differences across testing intervals on a sample population
The lecture slides included clear examples of these tests. The tutorial followed up with some practical examples using SPSS (the same tests are sketched in SciPy below). After the 4 weeks of quantitative data analysis we now have a decent toolbox specifically for non-parametric data analysis. Our assignment requires application of these tools. I imagine that the assignment will surface some of the ambiguities that arise when reasoning from quantitative analysis.
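We use SPSS in the tutorials, but the same tests exist in SciPy, which makes for a quick sanity check. A sketch with made-up data:

    from scipy import stats

    # binomial: p-value for 20 heads out of 30 tosses of a fair coin
    print(stats.binomtest(20, n=30, p=0.5).pvalue)

    # Mood's median test: do independent samples share a common median?
    a, b, c = [1, 2, 3, 4], [2, 3, 4, 5], [5, 6, 7, 8]
    stat, p, grand_median, table = stats.median_test(a, b, c)
    print(p)

    # Kolmogorov-Smirnov: cumulative difference between two samples
    print(stats.ks_2samp(a, c).pvalue)

    # Friedman: differences across repeated measurements on a sample
    print(stats.friedmanchisquare([1, 2, 3], [2, 3, 4], [3, 4, 5])[1])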
[Figure: An example of non-parametric data (source: http://perclass.com/doc/kb/15.html)]

FIT5108 – DoS Reading Unit Part 8

The final post on attack reviews will delve into physical denial of service attacks via network intrusion. Physical attacks can also be carried out by attackers gaining access to the locations where systems are housed, but that attack method is beyond the scope of this reading unit. Physical attacks via a network generally involve maliciously modifying vulnerable firmware to create further vulnerabilities, render hardware temporarily unavailable, or permanently disable (brick) the targeted hardware.

This type of attack can be referred to by a few different names:

  • Phlashing
  • Permanent DoS / PDoS
  • Bricking
  • Firmware attack

Rich Smith of HP Labs outlined this vulnerability in his 2008 presentation of a tool called PhlashDance.

In the presentation, Smith looks at:

  • Achieving PDoS remotely
  • Possibility of generic attacks – Which would significantly increase the likelihood of attackers creating tools, allowing almost anyone to exploit a firmware vulnerability.
  • Mitigation

Taking an abstract look at firmware development in industry, we can see that it generally lags behind system software. For example, it is not uncommon to patch drivers; in fact, Windows does this quite regularly. Updating firmware is much less common. Thus there is a great deal more legacy code, and code that was not developed with security in mind. Given these facts, the chances of vulnerabilities are high. Smith goes on to highlight the lack of auditing for firmware vulnerabilities and the fact that most security policies overlook firmware as a system component. This is compounded by the emergence of network-connected devices that update their firmware automatically.

Another great point that Smith makes is the very weak access control on many devices’ firmware when weighed against the power that re-flash access provides. The introduction to firmware is closed with definitions of the two major firmware update mechanisms:

  • Push – Firmware is sent to the device
  • Pull – A firmware update is signaled to the device, which then connects to a designated location to collect the new binary

These update mechanisms are the main target for attackers who wish to maliciously modify a device’s firmware.

[Figure: PhlashDance, an automated firmware vulnerability tester (Rich Smith, 2008)]

Smith identifies the lack of cryptographic data verification as the primary weakness in automatic firmware update packages. He implements a fuzzer to get past the cyclic redundancy checks implemented by most vendors.
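The weakness is easy to demonstrate: a CRC is error detection, not authentication, so anyone who mutates the image can simply recompute the checksum. A sketch assuming a payload-plus-trailing-CRC32 layout (invented for illustration):

    import random
    import zlib

    def fuzz_update(package: bytes) -> bytes:
        """Flip one firmware byte, then fix the trailing CRC32 so the
        package still passes the vendor's integrity check."""
        payload = bytearray(package[:-4])
        payload[random.randrange(len(payload))] ^= 0xFF
        return bytes(payload) + zlib.crc32(bytes(payload)).to_bytes(4, "big")

    image = b"FIRMWARE-PAYLOAD"
    package = image + zlib.crc32(image).to_bytes(4, "big")
    evil = fuzz_update(package)
    # the image changed, yet the CRC still verifies
    print(evil != package,
          zlib.crc32(evil[:-4]) == int.from_bytes(evil[-4:], "big"))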

Mitigation

The presentation recommends the following mitigation efforts by developers:

  • Remote updates off by default
  • Physical presence required to flash firmware
  • Crypto signatures required to flash (see the sketch after this list)
  • Validation in firmware, not client application
  • Design with attack tolerance not fault tolerance
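To illustrate the signature point: the vendor signs the image, and the device refuses to re-flash unless the signature verifies. The choice of Ed25519 via the cryptography package is mine for illustration, not something from the talk:

    from cryptography.exceptions import InvalidSignature
    from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

    vendor_key = Ed25519PrivateKey.generate()   # held only by the vendor
    device_pubkey = vendor_key.public_key()     # baked into the device at manufacture

    image = b"\x7fFIRMWARE"                     # stand-in firmware binary
    signature = vendor_key.sign(image)          # shipped alongside the update

    def flash(image: bytes, signature: bytes) -> bool:
        """Device-side gate: refuse to re-flash on a bad signature."""
        try:
            device_pubkey.verify(signature, image)
        except InvalidSignature:
            return False                        # tampered or forged image rejected
        # ... write image to flash here ...
        return True

    print(flash(image, signature), flash(image + b"X", signature))   # True False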

The following is also recommended for users:

  • Patch firmware
  • Lock down devices
  • Understand the full capabilities of devices and take their security seriously

For an administrator of a large network, implementing intrusion detection rules that can identify malicious firmware updates would also be an ideal solution. Taking note of the ports that firmware updates use will also allow devices behind the firewall to be locked down.


FIT5037 – Advanced Network Security Week 8

Taking a more abstract view on computer security, week 8’s topic was computer security for large networks. The first part of the lecture discussed risk analysis. Some key steps in conducting risk analysis:

  • Value of assets being protected – If attackers break into our network, what is the worst-case scenario? This value is constantly rising in today’s business environment. This step will also establish a budget range for system security; there is no point spending 1 million protecting a system that contains information and assets worth one hundred thousand.
  • Threat identification – What are the known threats to our system? This could include likely attackers, the types of known exploits and an understanding of what possible unknown exploits may be capable of.
  • Identification of key system components:
[Figure: Some key components (source: Week 9 lecture notes)]
  • Define each step in the security life cycle – Prevention -> Detection -> Response -> Recovery
  • Specifying policy areas for People, Processes and Tools
  • Begin development of security policy using a logical framework: Organizational -> Security Architecture -> Technical
  • Design, implementation and testing of chosen security tools:
[Figure: Some security tools (source: Week 9 lecture notes)]
  • Audit any security systems in place at set time periods (e.g. once a year)
  • Understand that organizational requirements can change quickly and that the security policy is in place to protect organizations whilst allowing them to operate as unhindered as possible; there is no point having a completely secure system that takes employees 2 hours to gain access to.

Design of system-wide security policies may come off as a more managerial, less technical operation. However, to implement a good security policy, decision makers must be aware of and have an in-depth understanding of the available tools, the threats from attackers and the organizational requirements. I would be very surprised if most vulnerabilities were a direct result of technical issues rather than holes left by poorly designed and implemented security policies.


FIT5185 – IT Research Methods Week 8

Probability, hypothesis testing and regression analysis continued the topic of quantitative analysis in week 8. Our discussion of the statistical techniques that we are using with the SPSS package focuses on the interpretation of outputs rather than the mathematics behind them. This seems reasonable given the limited time we have assigned to such a large area.

The first points covered were definitions of probability:

  • Marginal (simple) probability – rolling 3 sixes in a row with a standard die => (1/6) x (1/6) x (1/6)
  • Joint probability – P(AB) => P(A) x P(B) (for independent events)
  • Conditional probability – I would stick with Bayes’ theorem => see below
[Figure: Bayes’ theorem – P(A|B) = (P(B|A) x P(A)) / P(B)]
  • Binomial distribution – probability of the number of times an event occurs, given a true/false outcome and n trials, e.g. how many times will heads appear in 20 tosses of a coin? (checked in the sketch after this list)
  • Normal (Gaussian) distribution – requires continuous random variables (e.g. age), see below
[Figure: Normal distribution curve showing the percentage of values within each standard deviation interval]
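The worked examples above are quick to verify in Python (SciPy for the distributions):

    from scipy.stats import binom, norm

    # marginal: three sixes in a row with a fair die
    print((1 / 6) ** 3)                 # ~0.00463

    # binomial: P(exactly 10 heads in 20 tosses of a fair coin)
    print(binom.pmf(10, 20, 0.5))       # ~0.176

    # normal: fraction of values within one standard deviation of the mean
    print(norm.cdf(1) - norm.cdf(-1))   # ~0.683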

Hypothesis testing and regression analysis followed. The recurring theme is the significance value of less than 0.05 required for hypothesis support.

SPSS seems like a great tool for statistical analysis, with all of the widely used statistical methods available and relatively simple to use.


FIT5108 – DoS Reading Unit Part 7

This week’s DoS attack review will focus on wireless vulnerabilities, specifically those resulting from replay attacks. The simple definition of which is:

A network attack whereby valid data transmission is maliciously or fraudulently repeated or delayed

[Figure: Replay attacks are simple but in many cases very effective (source: Feng et al., 2007)]

A key article used in this post is: Feng, Z., Ning, J., Broustis, I., Pelechrinis, K., Krishnamurthy, S. V., Faloutsos, M., 2008?, Coping with Packet Replay Attacks in Wireless Networks, US Army Research Office

Replay attacks are particularly effective against wireless networks, as the capture and injection of packets is much easier to accomplish than on a wired network. Aireplay-ng is a Linux tool that enables replay attacks to be conducted against unprotected wireless networks very simply. It is used in conjunction with packetforge-ng, which allows attackers to easily create new or forged packets for injection. Feng et al. cite network degradation of up to 61% via one terminal against an access point. That degradation is achieved through unintelligent packet spamming. Also mentioned is the straightforward mitigation strategy of using public key encryption to digitally sign packets, although this is indeed a slow process for data comms.

Using packet replay, there are a number of attacks that can be launched:

  • Simplistic packet replay to increase network congestion.
  • De-authentication – This attack sends disassociate packets to one or more clients which are currently associated with a particular access point.

Mitigation strategies:

  • One time passwords
  • Session tokens
  • Random check numbers
  • Timestamping
  • RADIUS [Remote Authentication Dial In User Service] server
  • EAP [Extensible Authentication Protocol]

As per the advanced network security lectures, this post will focus on analyzing how RADIUS and EAP prevent replay attacks. The RADIUS protocol documentation lists a Digest-Nonce-Count attribute, as does the EAP protocol specification.

Through the handshake process nonce values are used by both the AP and the supplicant to protect against replay attacks:

[Figure: When using EAP, nonce values are used to establish session keys that are safe from replay attacks]

I need to do further reading on the process after the key handshake. I would imagine that an encrypted counter could be used to render replay attacks ineffective.
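To make the counter idea concrete, here is a toy protocol where each packet carries a counter covered by the MAC and the receiver drops anything not strictly newer. This is an invented layout, not the 802.11i format:

    import hashlib
    import hmac
    import struct

    KEY = b"session-key-from-handshake"   # placeholder for the negotiated key

    def make_packet(counter: int, payload: bytes) -> bytes:
        header = struct.pack(">Q", counter)
        tag = hmac.new(KEY, header + payload, hashlib.sha256).digest()
        return header + payload + tag

    last_seen = -1

    def accept(packet: bytes) -> bool:
        global last_seen
        header, payload, tag = packet[:8], packet[8:-32], packet[-32:]
        expected = hmac.new(KEY, header + payload, hashlib.sha256).digest()
        if not hmac.compare_digest(tag, expected):
            return False                  # forged or corrupted packet
        counter = struct.unpack(">Q", header)[0]
        if counter <= last_seen:
            return False                  # replayed packet: counter not fresh
        last_seen = counter
        return True

    p = make_packet(1, b"hello")
    print(accept(p), accept(p))           # True False: the replay is rejected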


FIT5037 – Advanced Network Security Week 7

Week 7 jumped away from Snort and on to wireless communications. The lecture slides were particularly detailed; the key enhancements to be covered:

  • TKIP – Temporal Key Integrity Protocol
  • LEAP – Lightweight Extensible Authentication Protocol (according to most sources, being superseded by EAP-FAST)
  • EAP-TLS – Extensible Authentication Protocol – Transport Layer Security (a public key system for wireless LANs using a RADIUS server)
  • PEAP – Protected Extensible Authentication Protocol – “PEAP is similar in design to EAP-TTLS, requiring only a server-side PKI certificate to create a secure TLS tunnel to protect user authentication”
  • RADIUS – Remote Authentication Dial In User Service
  • 802.11 – (a,b,g,n) IEEE standardized wireless protocols 😀
  • 802.16 – The IEEE-standardized WiMAX [Worldwide Interoperability for Microwave Access] family.

So, to start with there is a bag full of acronyms which are all interlinked.

There seem to be a few fundamental problems when securing wireless networks:

  1. Devices connecting may have low computational power, e.g. smart phones (this is relative to desktops and servers, so it will most likely always be the case)
  2. Incoming and outgoing packets are broadcast and thus easy to intercept
  3. Users can be moving between access points
  4. Performance requirements are high; people expect wireless connections not to be slower than wired connections

Combined, these points force a situation of weaker security.

The detail of the lecture was in covering the different forms of handshakes and authentication that are floating around at the moment… and all of their flaws. It will take a fair bit of time to really become familiar with these.

I get the feeling that wireless security is always going to be an issue simply because of the computing power mismatch between mobile and fixed devices in addition to the broadcast nature of the communications. The advancement over the past 5 years does however show that the band-aid approach is sufficient to facilitate most of the world adopting wireless networks.

[Image: WiMAX - the way of the future!]

FIT5185 – IT Research Methods Week 7

A short week for IT research methods in terms of new material. Due to the literature review presentations we did not have a tutorial and had only half a lecture. The topic of the lecture was ‘Correlation Analysis’, presented by Joze Kuzic.

Let’s start with the simple definition of correlation analysis: ‘A statistical investigation of the relationship between one factor and one or more other factors’.

One point that I need reminding on was correlation vs regression (source: http://www.psych.utoronto.ca/courses/c1/chap9/chap9.html):

Correlation – 1) both variables are random variables, and 2) the end goal is simply to find a number that expresses the relation between the variables
Regression – 1) one of the variables is a fixed variable, and 2) the end goal is to use the measure of relation to predict values of the random variable based on values of the fixed variable

The topic of causality and correlation was approached quite carefully in the lecture notes, which state that correlation can be used to look for causality but does not imply causality.

Methods of correlations:

Pearson’s correlation coefficient – for parametric (randomized, normally distributed) data.

Spearman rank order correlation coefficient – for non-parametric data; values range over [-1.0, 1.0].
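Both coefficients are a single call in SciPy, which is handy for checking intuitions against SPSS output (data below is made up):

    from scipy.stats import pearsonr, spearmanr

    x = [1, 2, 3, 4, 5, 6]
    y = [2, 1, 4, 3, 7, 9]

    r, p = pearsonr(x, y)       # parametric: assumes interval, roughly normal data
    print(f"Pearson r = {r:.2f} (p = {p:.3f})")

    rho, p = spearmanr(x, y)    # non-parametric: works on ranks only
    print(f"Spearman rho = {rho:.2f} (p = {p:.3f})")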

Significance of correlations was the next logical point covered; not much mathematical reasoning was given apart from p < 0.05 being good :).