Monday, December 9, 2013

How to Count, Sum and aggregate in R.

I recently had a chat with a friend who works for an SMS aggregator company and he told me about the challenges he has with analyzing some SMS data they log and that it takes him weeks to put together with Excel. I showed him how to do this in minutes with R and start taking coffee at work :).

The Problem: He frequently wants to to count(SUM) and aggregate things in data frames. For example, he wants to know how much revenue the different musicians made from the ringtones and under what categories etc 

A snap shot of the raw data frame is below. You might want to download the zipped file here and load into your R session with below.

> sms <- read.csv(“sms.csv”, sep=”,”, header=T )  

> dim(sms)


[1] 624  10


You must have 10 columns and 624 rows to make sure you loaded the right data and well :).


customer_name report_date       content_name cms_provider                artist mrp
1      AIRTELUG  11/22/2012              AYAKA   AA_SMSDONE               NANJEGO  30
2      AIRTELUG  11/26/2012       CHOICE YANGE   AA_SMSDONE        GOEFREY LUTAYA  30
3      AIRTELUG   11/5/2012      OJANGA NOSABA   AA_SMSDONE BUGEMBE AND BOBI WINE  30
4      AIRTELUG  11/14/2012         ENSI EKUBA   AA_SMSDONE           CHRIS EVANS  30
5      AIRTELUG  11/15/2012         MR KATAALA   AA_SMSDONE             BOBI WINE  30
6      AIRTELUG   11/3/2012 ABAKUBI BA PULAANI   AA_SMSDONE       GEOFREY LUTAAYA  30
 cms_clip_promo_id    trans_desc revenue count
1           2579072 Song Download     120     4
2           2579000  Song Renewal   10140   338
3           2579196  Song Renewal     960    32
4           2517666  Song Renewal     120     4
5           2517527  Song Renewal     330    11
6           2517665  Song Renewal     360    12



How many times does each artist name occur in our data? There is a guy called Hadley Wickham who came up with an excellent plyr package and that’s what we shall use through out this blog post.
> library(plyr)

> ?count


> count(sms, "artist")
                              artist freq
1                  2 STARS├â┬§├é┬┐├é┬╜    3
2                                 AK47    4
3              AKAYO CHOIR├â┬§├é┬┐├é┬╜    2
4         ALICIOS FT JULIANA KANYOMOZI    1
5                AMBASSADORS OF CHRIST    2
6                       ANGELLA KALULE    1

How many times did each artist appear on specific promo clip? In case you want to tally things up by more than one column use the c function to combine things into a vector:
> count(sms, c("artist", "cms_clip_promo_id")) 
                              artist cms_clip_promo_id freq
1                  2 STARS├â┬§├é┬┐├é┬╜           2517483    2
2                  2 STARS├â┬§├é┬┐├é┬╜           2517589    1
3                                 AK47           2579164    2
4                                 AK47           2579166    2
5              AKAYO CHOIR├â┬§├é┬┐├é┬╜           2517454    1
6              AKAYO CHOIR├â┬§├é┬┐├é┬╜           2517590    1
7         ALICIOS FT JULIANA KANYOMOZI           2599168    1
8                AMBASSADORS OF CHRIST           2517492    2
9                       ANGELLA KALULE           2517507    1
10                        BAINE VIOLLA           2585894    1


R just makes life easy. All i did was just tell count which data frame i was using, then which columns i want to tally by, and it does the counting very quickly and efficiently even on millions of rows.
How much did each artist song gain in revenue total? How much revenue did I it gain in relation to promo clip ? Now i came across the awesome part of aggregate that will do the job for this kind of figuring.
> aggregate(revenue ~ artist  + cms_clip_promo_id, data = sms, sum)
                              artist cms_clip_promo_id revenue
1        BOBI WINE PREV PHILLY LUTAAYA           2517453      90
2              AKAYO CHOIR├â┬§├é┬┐├é┬╜           2517454     990
3                       JULIUS MUHOOZI           2517455     750
4             FRED SEBATTA├â┬§├é┬┐├é┬╜           2517456    1020
5                       RONALD MAYINJA           2517457    1560
6                            BOBI WINE           2517458     150
7                           LADY DIANA           2517460      30
8                      SOPHIE NANTONGO           2517462      30
9                       WILSON BUGEMBE           2517464    2610


To a lay man this command can be interpreted as "I want to apply the sum function to the revenue column while aggregating rows based on unique values in the artist and cmc_clip_promo_id columns." #damn this is so easy and cool right?
How much did each artist gain in revenue  total? Forget about aggregating by clip_promo_id, and just aggregate by artist name:
> aggregate(revenue ~ artist, data = sms, sum)
                              artist revenue
1                  2 STARS├â┬§├é┬┐├é┬╜     720
2                                 AK47   16320
3              AKAYO CHOIR├â┬§├é┬┐├é┬╜    1080
4         ALICIOS FT JULIANA KANYOMOZI     810
5                AMBASSADORS OF CHRIST    1170
6                       ANGELLA KALULE      30
7                         BAINE VIOLLA      30
8                   BEBE COOL FT ALPHA     120
9                            BOBI WINE    3360
10       BOBI WINE PREV PHILLY LUTAAYA      90

What was the mean revenue that the artists gained ? Change sum to mean in the formula:
> aggregate(revenue ~ artist, data = sms, mean)
                              artist     revenue
1                  2 STARS├â┬§├é┬┐├é┬╜   240.00000
2                                 AK47  4080.00000
3              AKAYO CHOIR├â┬§├é┬┐├é┬╜   540.00000
4         ALICIOS FT JULIANA KANYOMOZI   810.00000
5                AMBASSADORS OF CHRIST   585.00000
6                       ANGELLA KALULE    30.00000
7                         BAINE VIOLLA    30.00000
8                   BEBE COOL FT ALPHA    40.00000
9                            BOBI WINE   240.00000
10       BOBI WINE PREV PHILLY LUTAAYA    90.00000
11                           BODDO S S    30.00000

No comments:

Post a Comment

Add any comments if it helped :)