Lab 5 – Helping Santa

Calculate the Haversine Distance around the earth in order to help Santa:
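The original script is not reproduced here, so the following is only a minimal sketch of the haversine formula. It assumes coordinates in decimal degrees and a mean Earth radius of 6371 km; the name `haversine_km` and the radius constant are illustrative choices, not taken from the lab.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius, in kilometres


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    # Haversine of the central angle between the two points
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))
```

For example, `haversine_km(0, 0, 0, 90)` covers a quarter of the equator, which under the assumed radius is roughly 10,007.5 km.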

Python script to direct the reindeer and the sled:

Lab 4 – Amazon EMR

Amazon EMR is based on Hadoop, a Java-based programming framework that supports the processing of large data sets in a distributed computing environment. MapReduce is a software framework that allows developers to write programs that process massive amounts of unstructured data in parallel across a distributed cluster of processors or stand-alone computers.
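To make the MapReduce idea concrete, here is a small single-process sketch of the three stages (map, shuffle, reduce) applied to a word count. It is not Hadoop or EMR code; the function names are hypothetical and everything runs in memory, whereas a real cluster distributes each stage across machines.

```python
from collections import defaultdict


def map_phase(record):
    """Emit a (word, 1) pair for every word in one input record."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse the values for one key into a single result."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle and reduce in sequence over an iterable of records."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

Calling `run_job(["a b a", "b c"])` returns the per-word totals as a dictionary.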

EC2 (Elastic Compute Cloud) and S3 (Simple Storage Service) are also used in this lab.

Elastic Map Reduce:

Getting Started Tutorial:

Lab 3 – Map Reduce Streaming Data With Python

This tutorial demonstrates mapper and reducer Python scripts that convert data from one input format to a required output format.
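The tutorial's own scripts are not reproduced here, so the following is only a sketch of what a streaming-style mapper and reducer might look like. It assumes tab-separated input of the form ngram, year, match count, volume count (the layout used by the Google Books 1-gram files), and it assumes the goal is a total match count per year; both function names are hypothetical.

```python
from itertools import groupby


def mapper(lines):
    """Emit 'year<TAB>match_count' for each well-formed input record."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip malformed records
        year, match_count = fields[1], fields[2]
        yield f"{year}\t{match_count}"


def reducer(lines):
    """Sum the counts for each key; the input must already be sorted by key."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in lines)
    for year, group in groupby(parsed, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{year}\t{total}"
```

In a streaming setup each function would sit in its own script, reading from `sys.stdin` and printing each yielded line. Hadoop sorts the mapper output by key before the reducer sees it; in a local pipe that sort has to be done explicitly.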

Follow the steps from the tutorial below:

Create the mapper and reducer script files, and keep all files in the same Windows directory.

Then run ->

type googlebooks-eng-all-1gram-20120701-x | |

set PATH="C:\Program Files\R\R-3.2.2\bin\x64";%PATH%

type googlebooks-eng-all-1gram-20120701-x | | | rscript grapher.r

Create a graph based on the data via R:

Create a graph with R – piping data from standard input:

Lab 2 – Spell Checker

The Python code to spell check:
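The lab's code is not reproduced here. As a minimal sketch of the idea, assuming a spell checker that simply flags words absent from a known word list, something like the following would work; `load_words` and `misspelled` are hypothetical names.

```python
import re

WORD_PATTERN = re.compile(r"[a-z']+")


def load_words(text):
    """Build a dictionary (a set of lowercase words) from reference text."""
    return set(WORD_PATTERN.findall(text.lower()))


def misspelled(text, dictionary):
    """Return the words in text, in order, that are not in the dictionary."""
    return [w for w in WORD_PATTERN.findall(text.lower()) if w not in dictionary]
```

A set is used for the dictionary so that each lookup takes constant time on average, which matters once the word list is large.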

Testing the Spell Checker:

Lab 1 – Dataset to Process

Based on the lessons we’ve learned in Week 1, interrogate the data set to find the number of elements.

Once the number of elements is known, choose an algorithm and the data types that might be required to interrogate the dataset.

Implement the program to read the data, including any test suites, classes, or functions that are required.
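Since the dataset itself is not shown here, the sketch below makes a loud assumption: the data is text made of whitespace-separated elements. Under that assumption it counts the elements and tallies which basic type (`int`, `float` or `str`) each one parses as; `classify` and `interrogate` are hypothetical names.

```python
from collections import Counter


def classify(token):
    """Return the name of the narrowest type that can parse the token."""
    for caster in (int, float):
        try:
            caster(token)
            return caster.__name__
        except ValueError:
            pass
    return "str"


def interrogate(lines):
    """Return (number of elements, Counter of inferred type names)."""
    tokens = [t for line in lines for t in line.split()]
    return len(tokens), Counter(classify(t) for t in tokens)
```

Passing an open file object as `lines` works, because iterating over a file yields one line at a time.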

The dataset:


The Python interrogation: