Saturday, April 19, 2014

R joins - revisited (OpenDataKit merge problem)

I've been hacking around #Opendatakit and after pulling data with it's ODKBriefcase tool, you find yourself with a number of .csv files. The first file is the main flat file while the rest are due to the loops that users were adding to the forms. I won't go further into ODK tool but what i know is that the R merge() function allows us to take two data sets and combine them into one, based on a common
variable. To test this, import the following data by running this command:


       

            # Set working directory
# Read data from the web, but download the files first in .csv
download.file('https://www.dropbox.com/s/w12qrdkbp6gpsg2/survey_jan.csv', destfile='survery.csv', method='wget')
download.file('https://www.dropbox.com/s/w12qrdkbp6gpsg2/survey_jan_member.csv', destfile='survery_jan_member.csv', method='wget')

#Load the data in R.  
data <- read.csv('survey_jan.csv', sep=',', header=TRUE)
member <- read.csv('survey_jan_member.csv', sep=',', header=TRUE)


And to check that it has imported correctly, which is always a good idea, run:
       

# Check the loaded data.
head(data)
head(member)
     
 


We are now tackling a JOIN problem and i always find my self falling back to this JOIN explained page on SO. HERE
For me this is the best part, we now have two data sets; data, which contains a list of survey entries called data, and members, which contains a list including those people as well as additional people who are members of the specific households.
The next step is to combine the two. What we are going to do is select the unique KEYS in the "member" data frame who also appear in the main "data" data frame, and copy their details into a new data frame, along with all the information.

We will refer to the two data frames as x and y. The x data frame is data; and the y is member. In x, the
column containing the list of id's  is called “KEY”, and in y, it is called “PARENT_KEY”. The parameters of the merge function first accept the two table names, and then the lookup columns as by.x or by.y. You should also include all.x=TRUE as a final parameter. This tells the function to keep all the records in x, but only those in y that match.

       

main_survey <- merge(data, member, by.x = "KEY", by.y="PARENT_KEY", all.x = TRUE) 
 



To see what this command has done, type main_survey to show the content of the new data
frame. This should look like:


> head(main_survey)
                                        KEY X.x           SubmissionDate                    sstart
1 uuid:68ec3f5d-078d-4ce9-a197-c3377eee720b   1 Apr 19, 2014 12:07:09 PM Apr 19, 2014 12:05:12 PM
2 uuid:68ec3f5d-078d-4ce9-a197-c3377eee720b   1 Apr 19, 2014 12:07:09 PM Apr 19, 2014 12:05:12 PM
3 uuid:68ec3f5d-078d-4ce9-a197-c3377eee720b   1 Apr 19, 2014 12:07:09 PM Apr 19, 2014 12:05:12 PM
4 uuid:68ec3f5d-078d-4ce9-a197-c3377eee720b   1 Apr 19, 2014 12:07:09 PM Apr 19, 2014 12:05:12 PM
5 uuid:a513043a-0450-47fa-8495-4c2611ece384   2 Apr 19, 2014 12:04:22 PM Apr 19, 2014 12:02:05 PM
6 uuid:a513043a-0450-47fa-8495-4c2611ece384   2 Apr 19, 2014 12:04:22 PM Apr 19, 2014 12:02:05 PM
                       end        today respondent.r_name respondent.position
1 Apr 19, 2014 12:07:03 PM Apr 19, 2014      Ngamita mary           Head food
2 Apr 19, 2014 12:07:03 PM Apr 19, 2014      Ngamita mary           Head food
3 Apr 19, 2014 12:07:03 PM Apr 19, 2014      Ngamita mary           Head food
4 Apr 19, 2014 12:07:03 PM Apr 19, 2014      Ngamita mary           Head food
5 Apr 19, 2014 12:04:18 PM Apr 19, 2014       Jona okello             Prefect
6 Apr 19, 2014 12:04:18 PM Apr 19, 2014       Jona okello             Prefect

Finally, this is a very important note and don't forget that if the by column names were named the same in both x and y (e.g. both called "KEY”), we could specify this
more simply with by="column name" rather than by.x and by.y; and finally, a critical issue when making any join is assuring that the “by” columns are in the same format.

I hope this helps someone out there working with normal joins and also ODK data

No comments:

Post a Comment

Add any comments if it helped :)