Lab 4 – Python for Data Analytics

Python is a must for any data scientist, but Python alone will not go very far. Pandas is the data analytics library that gives Python the functionality which comes out of the box in R.

Setting up Python & Pandas is now very easy with Anaconda, and running Python can be made very intuitive with Jupyter Notebook.

Steps:

  • Download & install Python (Python 2.7 was used here)
  • Download & install Anaconda
  • Open Command Prompt after installation
  • set PATH=%PATH%;c:\Python27;
  • conda --version
  • conda install pandas
  • conda install ipython
  • conda install pip
  • jupyter notebook
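
Once the steps above are done, a quick check that the packages are visible to Python can save some head-scratching. The following is a minimal sketch only; the helper name check_packages is hypothetical, and it assumes Python 3.4 or later (for importlib.util.find_spec) rather than the Python 2.7 used in this lab.

```python
import importlib.util


def check_packages(names):
    """Return {package name: True if Python can locate it, else False}."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


def report(names):
    """Print one OK/MISSING line per package."""
    for name, found in check_packages(names).items():
        print("%-10s %s" % (name, "OK" if found else "MISSING"))
```

Calling report(["pandas", "IPython", "notebook"]) from the Anaconda prompt should list each package as OK if the conda installs succeeded.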

For more details, please see the full tutorial to install Pandas here.

A simple, intuitive, and powerful introduction to Pandas can be found here.

The graphics matplotlib library is discussed here.

Statistical analysis made easy in Python with SciPy and Pandas DataFrames.

5 Questions which can teach you Multiple Regressions (with R and Python).

Data files useful to run analysis on:

Iris Data

Parasite Data

Lab 5 – Helping Santa

Calculate the Haversine Distance around the earth in order to help Santa:

Python script to direct the reindeer and the sled:
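
As an illustration of the calculation, here is a minimal sketch of the haversine formula. It assumes a perfectly spherical earth with a mean radius of 6371 km, and the function name haversine_km is chosen for this example rather than taken from the lab script.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumption: mean earth radius, spherical model


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)

    # Haversine of the central angle between the two points
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    a = min(1.0, a)  # guard against floating-point overshoot

    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))
```

For a route with several stops, the distances between consecutive stops can be summed, for example sum(haversine_km(*a, *b) for a, b in zip(stops, stops[1:])) where stops is a list of (latitude, longitude) tuples.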

Lab 4 – Amazon EMR

Amazon EMR is based on Hadoop, a Java-based programming framework that supports the processing of large data sets in a distributed computing environment. MapReduce is a software framework that allows developers to write programs that process massive amounts of unstructured data in parallel across a distributed cluster of processors or stand-alone computers.
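
The MapReduce idea can be sketched on a single machine. The example below is a toy word count, not EMR or Hadoop code: the function names are illustrative, and the "shuffle" step stands in for the grouping by key that the framework performs between the map and reduce phases on a real cluster.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(records):
    """Run map, shuffle and reduce in sequence over an iterable of records."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

On a cluster, each record can be mapped on a different node and each key reduced on a different node, which is where the parallelism comes from.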

EC2 (Elastic Compute Cloud) and S3 (Simple Storage Service) will also be employed in this lab.

Elastic Map Reduce:

https://aws.amazon.com/elasticmapreduce

Getting Started Tutorial:

http://docs.aws.amazon.com/ElasticMapReduce/latest/ManagementGuide/emr-gs.html

Lab 3 – Map Reduce Streaming Data With Python

This tutorial demonstrates mapper and reducer Python scripts that convert data from one input format to a required output format.

Follow the steps from the tutorial below:

https://dbaumgartel.wordpress.com/2014/04/10/an-elastic-mapreduce-streaming-example-with-python-and-ngrams-on-aws/

Create the mapper.py and reducer.py files and keep all files in the same Windows directory.
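
To give a feel for the shape of such scripts, here is a minimal sketch. These are not the tutorial's scripts. It assumes each line of the 1-gram file has four tab-separated fields (ngram, year, match_count, volume_count) and, purely as an example, totals match_count per year. Both steps are written as functions over lines so they can be tried without Hadoop.

```python
from collections import defaultdict


def mapper(lines):
    """Emit 'year<TAB>match_count' for every well-formed input line."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 4:
            continue  # skip lines that do not match the assumed format
        _ngram, year, match_count, _volume_count = fields
        if not match_count.isdigit():
            continue
        yield "%s\t%s" % (year, match_count)


def reducer(lines):
    """Sum the counts per year and emit 'year<TAB>total', ordered by year."""
    totals = defaultdict(int)
    for line in lines:
        year, count = line.rstrip("\n").split("\t")
        totals[year] += int(count)
    for year in sorted(totals):
        yield "%s\t%d" % (year, totals[year])
```

In a real mapper.py or reducer.py, the function would be fed sys.stdin and each result printed to standard output, which is what lets the scripts be chained with pipes. The reducer here keeps its totals in a dictionary so that it also works when the input has not been sorted by key.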

Then run:

type googlebooks-eng-all-1gram-20120701-x | mapper.py | reducer.py

set PATH="C:\Program Files\R\R-3.2.2\bin\x64";%PATH%

type googlebooks-eng-all-1gram-20120701-x | mapper.py | reducer.py | rscript grapher.r

Create a graph based on the data via R:

Create a graph with r – piping data from standard input:

Lab 2 – Spell Checker


The Python code to spell check:
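
As a rough illustration of how a spell checker can work, here is a minimal sketch based on single-edit candidates and word frequencies. It is not necessarily the approach used in this lab: the function names are illustrative, and it assumes lowercase English letters and a training text supplied by the caller.

```python
import re
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def train(text):
    """Count how often each word occurs in the training text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def edits1(word):
    """All strings one delete, transpose, replace or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in ALPHABET]
    inserts = [l + c + r for l, r in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)


def correct(word, counts):
    """Return the most frequent known word within one edit, else the input."""
    word = word.lower()
    if word in counts:
        return word
    candidates = [w for w in edits1(word) if w in counts]
    if not candidates:
        return word
    # Break frequency ties alphabetically so the result is deterministic
    return max(candidates, key=lambda w: (counts[w], w))
```

A typical use is counts = train(open("corpus.txt").read()) followed by correct("teh", counts), where corpus.txt is a placeholder for any large plain-text file.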

Testing the Spell Checker: