Thedatafugee

Saturday, April 26, 2014

Comma separated numerical values as strings.

[Rtip] In my continuing series of 'Rtips', I came across this annoying set of numerical values that are expressed as strings with commas, for example "20,000,123", which should read as 20000123. This is common when working with monetary sums or totals. So how would you convert these to proper numbers or integers? I found that the gsub() function is really good at reformatting this kind of character vector.

       

> budgets_amounts <- c("20,929,200","782,000,000","100,000,000","14,111,122")
> as.numeric(gsub(",", "", budgets_amounts))
[1]  20929200 782000000 100000000  14111122
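
If you do this often, the same idea can be wrapped in a small helper. This is only a sketch: the function name parse_amount and the sample values are made up for illustration, and fixed = TRUE simply tells gsub() to treat the comma as a literal string rather than a regular expression.

```r
# Hypothetical helper: strip thousands separators, then convert to numeric.
parse_amount <- function(x) {
  as.numeric(gsub(",", "", x, fixed = TRUE))
}

# Made-up sample values, with and without separators.
amounts <- c("20,929,200", "1,000", "42")
parsed <- parse_amount(amounts)

print(parsed)
print(sum(parsed))
```

Anything that is still not a number after the commas are removed (say "N/A") comes back as NA with a warning, so it is worth checking for NAs afterwards.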

Factor to Integer/numeric [R Tip]

This always annoys me and I have to look it up every time: when I try to convert a factor to numeric, the values change to rank values (the internal integer codes). #damn

So to transform a factor f back to approximately its original numeric values, with no "bull shit" ranks, the command below will save you.

       
as.numeric(levels(f))[f]
The ?factor help page also mentions that this is recommended and slightly more efficient than
as.numeric(as.character(f))
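
A quick sketch of the difference, on a made-up factor (the values and variable names here are only for illustration):

```r
# A made-up factor whose labels look like numbers.
f <- factor(c("10", "200", "10", "35"))

# Calling as.numeric() directly returns the internal integer codes (the "ranks").
codes <- as.numeric(f)

# Index the numeric levels by the factor to recover the original values.
via_levels <- as.numeric(levels(f))[f]

# Going through character gives the same values.
via_char <- as.numeric(as.character(f))

print(codes)
print(via_levels)
print(identical(via_levels, via_char))
```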

Now backed up, bring on any other disturbing factors.

Friday, April 25, 2014

Making maps with R.

Talk: 25.04.2014

Location: Mountbatten Offices, Kampala - Uganda.

Topic: Making simple Maps With R (PLE data example)

twitter: #oddkampala #ddj @datadotug

github: https://github.com/ngamita/spatialR/blob/master/uneb.R

Displaying spatial data is a must-have skill for most journalists. There are many tools out there that can make maps, but only a few, like R, cope well with large datasets. One of the best ways of visualizing spatial data is through a map, and one has to play around with the colours to match the context. Finally, a map is not complete without a legend, title, scale bar and north arrow. In this simple step-by-step tutorial on making maps in R, I will walk you through merging shapefiles with Primary Leaving Exams data to find out which districts / regions of Uganda are performing better than the others.

       

# Visualization of how students performed across
# different districts in Uganda Primary Leaving 
# Examinations 2013/2014. 
# Plotting the Primary leaving Exams data on a color-coded map,
# in less than 100 lines of R code. 

# Author: Richard Ngamita 'ngamita@gmail.com'


# Disclaimer: These methods here may not be the best solutions,
# but seemed the easiest for getting started with spatial data 
# in R. For any feedback: ngamita@gmail.com


So let’s get started with loading the libraries to read our shapefiles data. 
# Load the rgdal library.
# If you don't have it 
# run this command: install.packages('rgdal')
library(rgdal)


# Set working directory
# Check if there, else create one.
if(!file.exists('data')){
  dir.create('data')
}

# Set wd to data
setwd('./data')

# Download Uganda district shape files into districts.zip from the web
# Simple google search of data.ug district shapefiles will pull this up. 
download.file("http://maps.data.ug/geoserver/wfs?format_options=charset%3AUTF-8&typename=geonode%3Adistricts_2013_112_web_wgs84&outputFormat=SHAPE-ZIP&version=1.0.0&service=WFS&request=GetFeature", "districts.zip")

# Unzip the shapefile.
unzip("districts.zip")

# Read in shape files. 
# ?readOGR() to find out more. 
districts <- readOGR(".", "districts_2013_112_web_wgs84")

# Check the quality of the loaded data.
# Plot districts to check.
#plot(districts)
#slotNames(districts)
#class(districts)
#head(districts@data)
# Note: the @ sign accesses a slot of the "SpatialPolygonsDataFrame" object.


## Pull in Primary Leaving Exams results
# CSV data file. A little searching for data.ug PLE
# data files will turn up the link.

# Download the UNEB PLE data from the data.ug website.
# Use method='wget' on *nix machines; Mac machines should
# use method='curl'. On Windows, I don't think any method is needed.

# download the file to local directory
download.file('http://catalog.data.ug/dataset/a4a1ef8b-afa4-4b8f-b9f3-d4ef9b783eee/resource/1ba956e0-e8b5-42f8-9655-c464715ec065/download/ple.csv', destfile='ple.csv', method='wget')

# Read the UNEB CSV data, comma separated, and include the header/column names.
# as.is = TRUE keeps text columns as character instead of converting them to factors.
ple <- read.csv('ple.csv', sep=',', header=TRUE, as.is = TRUE)

# Check if loaded well
#head(ple)
#str(ple) # Make sure data types are right.

# Drop the last column, which is useless to us.
ple <- ple[,1:4]

# Convert Division1 to numeric.
# suppressWarnings() because NA or missing values are present.
ple$Division1 <- suppressWarnings(as.numeric(ple$Division1))

# RUN: install.packages('plyr')
# in case you don't have it installed.
library(plyr)

# We want to slice and aggregate the counts.
# Get the sum (total) of Division1 counts per district.
ple_division1 <- aggregate(Division1 ~ District, data = ple, sum)

# Creates a dataframe with division1 totals per district, check if loaded well. 
# head(ple_division1)

# Use the match() function to line up the two data frames and append the totals to the SpatialPolygonsDataFrame.
districts@data <- data.frame(districts@data,
                             ple_division1[match(districts@data[, "DNAME_2011"],
                                                 ple_division1[, "District"]), ])

# Remove the repeated columns, specifically "District". 
districts@data$District <- NULL

# Re-name the colname to make sense. 
colnames(districts@data)[1] <- 'Districts_2013'
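
The aggregate() and match() steps above can be tried on a toy example without any shapefiles. This is only a sketch in base R: the two small data frames and their values are made up, and shape_attrs merely stands in for districts@data.

```r
# Made-up exam results: one row per school.
results <- data.frame(
  District  = c("KAMPALA", "GULU", "KAMPALA", "MBALE"),
  Division1 = c(120, 15, 80, NA)
)

# Sum per district. With the formula interface, rows with NA are dropped first,
# so MBALE does not appear in the totals at all.
totals <- aggregate(Division1 ~ District, data = results, sum)

# Made-up attribute table standing in for districts@data.
shape_attrs <- data.frame(DNAME = c("GULU", "KAMPALA", "MBALE", "ARUA"))

# match() returns, for each district name in the attribute table,
# the row position of that name in totals (or NA when there is no match).
idx <- match(shape_attrs$DNAME, totals$District)
shape_attrs$Division1 <- totals$Division1[idx]

print(totals)
print(shape_attrs)
```

The key point is that match() keeps the row order of the attribute table, which is what keeps the data lined up with the polygons. Districts without results end up as NA instead of being dropped.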



# Now the shapefile or SpatialPolygonsDataFrame contains our added field called 'Division1',
# which contains the count of first grades per district.
# We can use this to create a choropleth map.

# First remove incomplete rows/NA values (districts without results).
# Subset the whole object so polygons and attribute rows stay aligned.
districts <- districts[!is.na(districts$Division1), ]

# Using Basic plot() function
# Load the mapping and color packages.
# RUN: install.packages('package name')
# if you don't have it installed.

library(maptools) 
library(RColorBrewer) 
library(classInt) 

# select a colour palette and 
# the number of colours you wish to display
# Could be 4, 5 or many more.
colours <- brewer.pal(4, "Blues")

# We need to set breaks; we
# can use the classIntervals function
# in the classInt package we just loaded.
brks <- classIntervals(districts$Division1, n=4, style="quantile")

# With the plot function, let's plot the distribution
# of the data and view the colours assigned to each class.
plot(brks, pal=colours)

# extract brks values from the brks object above.
brks<- brks$brks

# Finally, I've got a map to show you.
plot(districts, col=colours[findInterval(districts$Division1, brks,
                                         all.inside=TRUE)], axes=FALSE)


# Go ahead and add a title, legend, scale etc.

# Save the map to a local file.
png(filename="your/file/location/name.png")
plot(districts, col=colours[findInterval(districts$Division1, brks,
                                         all.inside=TRUE)], axes=FALSE)
dev.off()
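
To see what the breaks and findInterval() are doing without the map, here is a rough sketch in base R only. The counts are made up, the four hex codes are light-to-dark blues picked by hand, and quantile() stands in for classIntervals(), so the breaks may not be identical to what classInt gives in every case.

```r
# Made-up Division1 counts for eight districts.
values <- c(3, 8, 15, 22, 40, 41, 90, 120)

# Four classes need five break points (min, three cut points, max).
n_classes <- 4
brks <- quantile(values,
                 probs = seq(0, 1, length.out = n_classes + 1),
                 names = FALSE)

# Hand-picked light-to-dark blues, one per class (illustrative values).
colours <- c("#EFF3FF", "#BDD7E7", "#6BAED6", "#2171B5")

# findInterval() returns the class number for each value;
# all.inside = TRUE puts the maximum value in the last class instead of a fifth one.
class_id <- findInterval(values, brks, all.inside = TRUE)
fill <- colours[class_id]

print(brks)
print(data.frame(values, class_id, fill))
```

The fill vector has one colour per district, in the same order as the data, which is exactly what the col argument of plot() expects.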


#Part 2:

library(GISTools)
library(RColorBrewer)

# Clear the missing values issue.
# Subset the whole object (not only @data) so polygons and rows stay aligned.
districts <- districts[complete.cases(districts@data), ]

# Use choropleth function show performing districts. 
choropleth(districts, districts$Division1)

# The map looks fine, but let's make it better with a few extra commands.

# Set colour and number of classes
shades <- auto.shading(districts$Division1, n = 9, cols = brewer.pal(9, "Blues"))

# Draw the map
choropleth(districts, districts$Division1, shades)

# Add a legend
choro.legend(26.64793, 1.674763, shades, fmt = "%g", title = "Count of Division 1", cex = 1.0)

# Add a title to the map
title("Count of Division 1's in PLE, 2013")

# Add a north arrow
north.arrow(27.92452, 3.30194, 10)


# Further reading. 
# Working with GoogleMaps and OpenStreetMap.
library(ggmap)

# Further reading: check out these solutions by Rodriguez
# https://sites.google.com/site/rodriguezsanchezf/news/usingrasagis


# Goal is to have a visual map below of Uganda districts and an overlay of data from Division1s.

Wednesday, April 23, 2014

Install rgdal issues.

install.packages("rgdal") - this command throws some annoying errors.

I tried to install rgdal on my Ubuntu 12.04 instance, but it is not as straightforward as installing "sp" or most other packages. It requires GDAL and PROJ.4 to be installed first.

      
$ sudo apt-get install libgdal1-dev libproj-dev
$ sudo R
> install.packages("rgdal")

Tuesday, April 22, 2014

Predict traffic from count of cars.

A friend recently shared a data set with counts of cars coming in on specific roads at specific times. I wanted to predict the count of cars on future dates based on the small data set that was shared. It turns out we can forecast this using time series.

Feel free to follow along the steps to achieve this with the basic xts and forecast packages. This is my initial stab at it, and I'm sure there are better models to use for this. I'll follow up with better models soon.

       
# Simple script to use
# the default ts() class to forecast traffic count.
# Dataset: traffic_data.csv

# Author: 'Richard Ngamita', 'ngamita@gmail.com'

# Check for the directory, else create one.
if(!file.exists('traffic')){
  #print('hello')
  dir.create('traffic')
}

# Load the forecasting and time series packages. 
require(forecast)
require(xts)

# Set wd to 'traffic'
setwd('traffic/')

# Download the traffic_data.csv from dropbox.
fileUrl <- 'https://www.dropbox.com/s/cbufz4f0rd11tl1/traffic_data.csv'
download.file(fileUrl, destfile='traffic_data.csv', method='wget') # Use method='curl' on non-*nix systems

# Load the data into an R data frame (df == data frame); the file is a CSV with a header row.
traffic_df <- read.csv('traffic_data.csv', sep=',', header=TRUE)

# Check if loaded fine. 
head(traffic_df)

# Format the date, to R readable date and
# not Char strings.
traffic_df$datetime = as.Date(traffic_df$datetime,format="%Y-%m-%d")

# Convert to xts
traffic_df_xts = xts(x=traffic_df$count, order.by=traffic_df$datetime)

# We need the start position within the year from the data; I got 68.
# How did I do that? Check next:
# > head(traffic_df)
# > as.POSIXlt("2014-03-10")$yday
##    [1] 68
# Add one, since yday starts at 0, and convert to a normal ts().
traffic_df_ts = ts(traffic_df_xts, freq=365, start=c(2014, 69))
png('Traffic_forecast.png')
plot(forecast(ets(traffic_df_ts), 1), main="Traffic Forecast")
dev.off()


# Ignore the frequency warnings.
# Check the saved plot in the same folder/dir.
# Note from the graph that 2014.4 means roughly 365 * 0.4 = 146 days into the year.
# So for the actual date run > as.Date(146, origin="2014-01-01")
# Answer: "2014-05-27"
# This is just the default method; other methods like ARIMA can work better.
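
The date arithmetic around a daily ts() is easy to get wrong, so here is a small base R sketch of it. The counts are made up and decimal_to_date is a hypothetical helper name, not part of any package. It assumes a 365-day year, the same as the frequency used for the series.

```r
# First observation date, as in the traffic data.
first_obs <- as.Date("2014-03-10")

# yday is 0-based (1 January is 0), while the second element of ts(start = ...)
# is a 1-based position within the year, so add one.
yday0 <- as.POSIXlt(first_obs)$yday
start_pos <- yday0 + 1

# Made-up daily counts, only to have something to index.
counts <- c(120, 135, 128, 150, 142)
traffic_ts <- ts(counts, frequency = 365, start = c(2014, start_pos))

# Hypothetical helper: turn the decimal years on a plot axis back into dates.
decimal_to_date <- function(t, days_in_year = 365) {
  year <- floor(t)
  offset <- round((t - year) * days_in_year)  # whole days after 1 January
  as.Date(paste0(year, "-01-01")) + offset
}

dates <- decimal_to_date(as.numeric(time(traffic_ts)))
print(start_pos)
print(dates)
print(decimal_to_date(2014.4))
```

Converting the time index back and checking that the first date matches the first row of the data is a cheap sanity check on the start argument.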

Saturday, April 19, 2014

R joins - revisited (OpenDataKit merge problem)

I've been hacking around #Opendatakit, and after pulling data with its ODK Briefcase tool you find yourself with a number of .csv files. The first file is the main flat file, while the rest come from the loops that users added to the forms. I won't go further into the ODK tool, but what I do know is that the R merge() function allows us to take two data sets and combine them into one, based on a common variable. To test this, import the following data by running these commands:

       

# Set your working directory first.
# Read data from the web, but download the .csv files first.
download.file('https://www.dropbox.com/s/w12qrdkbp6gpsg2/survey_jan.csv', destfile='survey_jan.csv', method='wget')
download.file('https://www.dropbox.com/s/w12qrdkbp6gpsg2/survey_jan_member.csv', destfile='survey_jan_member.csv', method='wget')

# Load the data in R.
data <- read.csv('survey_jan.csv', sep=',', header=TRUE)
member <- read.csv('survey_jan_member.csv', sep=',', header=TRUE)

And to check that it has imported correctly, which is always a good idea, run:

       

# Check the loaded data.
head(data)
head(member)

We are now tackling a JOIN problem, and I always find myself falling back on the "JOIN explained" page on SO.
For me this is the best part: we now have two data sets. The first is data, which contains the list of survey entries, and the second is member, which contains a list including those people as well as additional people who are members of the specific households.
The next step is to combine the two. What we are going to do is select the rows in the "member" data frame whose PARENT_KEY also appears as a KEY in the main "data" data frame, and copy their details into a new data frame, along with all the survey information.

We will refer to the two data frames as x and y. The x data frame is data and the y is member. In x, the column containing the list of ids is called "KEY", and in y it is called "PARENT_KEY". The merge function first accepts the two data frame names, and then the lookup columns as by.x and by.y. You should also include all.x=TRUE as a final parameter. This tells the function to keep all the records in x, but only those in y that match.

       

main_survey <- merge(data, member, by.x = "KEY", by.y="PARENT_KEY", all.x = TRUE)
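
If you don't have the ODK files at hand, the same call can be tried on two tiny made-up data frames. This is only a sketch: the keys, names and column values below are invented, and only the KEY / PARENT_KEY column names follow the ODK export.

```r
# Made-up main survey table: one row per submission.
surveys <- data.frame(
  KEY        = c("uuid:a", "uuid:b", "uuid:c"),
  respondent = c("Respondent A", "Respondent B", "Respondent C"),
  stringsAsFactors = FALSE
)

# Made-up repeat table: zero or more household members per submission.
members <- data.frame(
  PARENT_KEY  = c("uuid:a", "uuid:a", "uuid:b"),
  member_name = c("Member 1", "Member 2", "Member 3"),
  stringsAsFactors = FALSE
)

# Left join: keep every survey, repeat it once per matching member.
joined <- merge(surveys, members, by.x = "KEY", by.y = "PARENT_KEY", all.x = TRUE)

# Without all.x = TRUE, surveys that have no members are dropped.
inner <- merge(surveys, members, by.x = "KEY", by.y = "PARENT_KEY")

print(joined)
print(nrow(inner))
```

Note how the survey with two members shows up twice in the result, and the survey with no members is kept with NA in the member column. That repetition is the same thing you see in the real output for the first KEY.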

To see what this command has done, type head(main_survey) to show the first rows of the new data frame. This should look like:

> head(main_survey)
                                        KEY X.x           SubmissionDate                    sstart
1 uuid:68ec3f5d-078d-4ce9-a197-c3377eee720b   1 Apr 19, 2014 12:07:09 PM Apr 19, 2014 12:05:12 PM
2 uuid:68ec3f5d-078d-4ce9-a197-c3377eee720b   1 Apr 19, 2014 12:07:09 PM Apr 19, 2014 12:05:12 PM
3 uuid:68ec3f5d-078d-4ce9-a197-c3377eee720b   1 Apr 19, 2014 12:07:09 PM Apr 19, 2014 12:05:12 PM
4 uuid:68ec3f5d-078d-4ce9-a197-c3377eee720b   1 Apr 19, 2014 12:07:09 PM Apr 19, 2014 12:05:12 PM
5 uuid:a513043a-0450-47fa-8495-4c2611ece384   2 Apr 19, 2014 12:04:22 PM Apr 19, 2014 12:02:05 PM
6 uuid:a513043a-0450-47fa-8495-4c2611ece384   2 Apr 19, 2014 12:04:22 PM Apr 19, 2014 12:02:05 PM
                       end        today respondent.r_name respondent.position
1 Apr 19, 2014 12:07:03 PM Apr 19, 2014      Ngamita mary           Head food
2 Apr 19, 2014 12:07:03 PM Apr 19, 2014      Ngamita mary           Head food
3 Apr 19, 2014 12:07:03 PM Apr 19, 2014      Ngamita mary           Head food
4 Apr 19, 2014 12:07:03 PM Apr 19, 2014      Ngamita mary           Head food
5 Apr 19, 2014 12:04:18 PM Apr 19, 2014       Jona okello             Prefect
6 Apr 19, 2014 12:04:18 PM Apr 19, 2014       Jona okello             Prefect

Finally, a very important note: if the by columns were named the same in both x and y (e.g. both called "KEY"), we could specify this more simply with by="column name" rather than by.x and by.y. And finally, a critical issue when making any join is ensuring that the "by" columns are in the same format.
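
A small sketch of both points, again on made-up data. Here the key column has the same name on both sides, and the keys in y carry stray whitespace and different letter case; this is just one kind of format mismatch that can silently break a join.

```r
# Made-up tables that share a column called KEY.
x <- data.frame(
  KEY     = c("uuid:a", "uuid:b"),
  village = c("Village 1", "Village 2"),
  stringsAsFactors = FALSE
)
y <- data.frame(
  KEY = c("uuid:a ", "UUID:B"),   # trailing space and upper case
  age = c(30, 12),
  stringsAsFactors = FALSE
)

# Same column name on both sides, so by = "KEY" is enough.
raw_join <- merge(x, y, by = "KEY")

# Bring the keys into one format, then join again.
y$KEY <- tolower(trimws(y$KEY))
clean_join <- merge(x, y, by = "KEY")

print(nrow(raw_join))
print(clean_join)
```

Comparing the row count before and after the merge is a quick way to notice that keys are not matching the way you expect.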

I hope this helps someone out there working with normal joins and also ODK data.

Monday, March 31, 2014

Weekend Hacks: Fabrication and Furniture workshop.

You might have wondered what I was up to last Christmas and what I have been working on over the weekends for the past 3 months. I was putting together a fabrication, papyrus weaving and wood workshop for some cool ideas to come, like cheap home CCTV, automatic gate fabrication and wiring, auto-light sensors, motion detection sensors and just some cool home-fabricated and woven furniture made from papyrus and water hyacinth from Lake Victoria.

Now it's been 3 months, but I don't have any exciting NEWS to share :( and I can say the main activity that has really picked up has been the wood and weaving part. It's been really hard to get a few skilled folks (graduate electrical and mechanical engineers) who can buy into my idea and work pro bono with me as we try to find out what works and what doesn't. In addition to that, the whole fabrication industry has so many challenges that I hadn't foreseen: things like battling with the local electricity company to provide good connections (recently our welding machine short-circuited and I lost ~$1k USD instantly), and challenges with the City Council Authority and Revenue, who won't understand that this is a trial process. I won't give up yet, just slow down on this path as I go back to the table for plan B. Watch this space for our weekend playground and hacker space area in #Kampala. For now, please pass by Ntinda in Kampala and check out our furniture at www.facebook.com/wtfworkshop