Exploring ICANN new gTLD Data with R

I've covered ICANN's new gTLD programme in a previous post and provided links to a very slick infographic that I came across. I have recently been getting to grips with R, a free software package and programming language for statistical computing. R is comparable to packages such as SAS and SPSS among others.

R is popular in academia, financial services and the life sciences, and is also used by a number of large organisations such as Google and Facebook. With the increasing interest in "Big Data" it is gaining traction as a means of analysing and deriving insight from data. Although I have just mentioned R and "Big Data" in the same breath, R cannot easily handle large volumes of data, because it is intended to be used with a sample of your data. This is changing, as there are parallel implementations of R and various projects that allow you to use R with Hadoop, such as RHIPE, RHadoop and R+ Streaming. The Register has a good article on the origins of R and its evolution.

Back to the topic at hand: I've been working with R to get a better understanding of its features and capabilities, and in doing so I've been analysing the data that ICANN has published. Yes, I know you don't need R for what I am about to show you, and it can fairly easily be achieved with Excel and Pivot Tables, but the point is to show you the versatility of R (that, and I'm just starting out with R!).

You can obtain R from the R project website, and I would recommend using the RStudio IDE as it brings the console, graphics viewer and help information together (as opposed to the rather confusing, disjointed experience you get from the interface that's shipped with R).

Before you import the data from ICANN, which is supplied as a CSV file, you'll need to convert it from UCS-2 Little Endian format to UTF-8, otherwise the data will not be imported correctly. You can use a text editor such as Notepad++ to do this.
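
As an alternative to converting the file by hand, read.csv accepts a fileEncoding argument, so R can do the re-encoding while it reads. The sketch below is not run against the ICANN file: it builds a small stand-in file in UTF-16LE (the column names and rows are made up for illustration) and reads it back. Whether "UTF-16LE" or "UCS-2LE" is the right name for the real file on your platform is an assumption to check.

```r
# Build a tiny tab-delimited stand-in file encoded as UTF-16LE.
# The columns and values are invented for this example only.
sample_text <- paste("String\tRegion\tIDN",
                     "EXAMPLE\tEUR\tNo",
                     "TEST\tAP\tYes",
                     "", sep = "\n")
sample_path <- tempfile(fileext = ".csv")
writeBin(iconv(sample_text, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]],
         sample_path)

# Ask read.csv to re-encode on the fly instead of converting the file first.
sample_tlds <- read.csv(sample_path, header = TRUE, sep = "\t", quote = "\"",
                        fileEncoding = "UTF-16LE", stringsAsFactors = FALSE)
print(sample_tlds)
```

If the column names come back garbled or the data frame has a single column, the encoding name is probably wrong for that file.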

Here's how you can import the CSV file into a data.frame (a type of object in R that can be logically thought of as a table):
tlds <- read.csv("Reference Library/Blog/TLDs/New-gTLDs-Applications-2012-06-13 - ORIGINAL.csv", header=TRUE, sep="\t", quote="\"")

The header=TRUE parameter, as you might have guessed, means the first row in the CSV file contains the column names; sep="\t" specifies that the values are tab-delimited (despite the .csv extension); and quote specifies the character used for enclosing strings.
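
To see the three parameters working together, here is a minimal sketch that parses an in-memory string instead of a file, using read.csv's text argument. The single row is hypothetical; it contains a tab inside a quoted field to show why the quote setting matters.

```r
# One header row and one data row, tab-delimited.
# The applicant name contains a tab, so it is wrapped in double quotes.
sample_text <- "Applicant\tString\n\"Example\tRegistry Ltd\"\tEXAMPLE\n"

parsed <- read.csv(text = sample_text, header = TRUE, sep = "\t", quote = "\"",
                   stringsAsFactors = FALSE)

# The quoted tab stays inside the Applicant value rather than splitting the row.
str(parsed)
```

Running str() on the real tlds data frame straight after the import is a quick way to confirm that the columns were split as expected.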

Once you have the data imported you can quickly start generating statistics and graphs:

# The sum function can be used to count the number of Geographic TLD applications
> sum(tlds$Geographic == 'Yes')
[1] 66
# The number of applications for IDNs
> sum(tlds$IDN == 'Yes')
[1] 116
# The number of applications by script (a frequency table of the script column). The twelve named scripts add up to the 116 IDN applications above
  -- Arab Cyrl Deva Hang Hani Hans Hant Hebr Jpan Kana Latn Thai 
1814   15    8    3    3   12   57    4    1    2    8    2    1 
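
A frequency table like the one above is what the table function produces. The following is a minimal sketch on made-up values, not the ICANN data: the vector name and its contents are hypothetical, and "--" is assumed here to mark an application with no script code.

```r
# Hypothetical script codes for six applications.
scripts <- c("--", "--", "Arab", "Hani", "Hani", "--")

# table() counts how often each distinct value occurs.
script_counts <- table(scripts)
print(script_counts)

# Total for the named scripts only, leaving out the "--" entry.
idn_total <- sum(script_counts[names(script_counts) != "--"])
print(idn_total)
```
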
# Convert the Region column from a factor to character, then replace NA with AMER. "NA" stands for North America here, but R reads it as a missing value
tlds$Region <- as.character(tlds$Region)
tlds$Region[is.na(tlds$Region)] <- "AMER"
# Then convert back to factor so we can graph the data
tlds$Region <- as.factor(tlds$Region)
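
Another way to deal with the "NA" problem is to stop read.csv from treating that string as missing in the first place, using its na.strings argument. This is a sketch on a two-row, made-up sample, not the code used for the figures in this post.

```r
# Two hypothetical rows; the first region is the literal string "NA".
sample_text <- "String\tRegion\nEXAMPLE\tNA\nTEST\tEUR\n"

# Default behaviour: "NA" is read as a missing value.
default_read <- read.csv(text = sample_text, sep = "\t", stringsAsFactors = FALSE)

# With na.strings = "", only empty fields count as missing, so "NA" survives as text.
literal_read <- read.csv(text = sample_text, sep = "\t", stringsAsFactors = FALSE,
                         na.strings = "")

print(is.na(default_read$Region))
print(literal_read$Region)
```

The trade-off is that any genuinely missing values recorded as "NA" in other columns would then also be kept as text.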

# Displays a bar chart of the regions and the number of applications from each of them (plotting a factor gives a bar chart, not a histogram)
> plot(tlds$Region, main="Applications by Region", col=c("red","green","yellow","blue"), xlab="Region", ylab="No. of Applications")

# The same data but as a pie chart
region_table <- table(tlds$Region)
pie(region_table, main="Applications by Region", col=c("red","green","yellow","blue"))
# Work out each region's share of the total as a percentage, rounded to one decimal place
region_labels <- round(region_table / sum(region_table) * 100, 1)
pie(region_table, main="Applications by Region", col=rainbow(length(region_table)))
# Concatenate a '%' char after each value
region_labels <- paste(region_labels, "%", sep="")
colours5 <- rainbow(5)
# Create a pie chart with defined heading and custom colors and labels
pie(region_table, main="Applications by Region", col=colours5, labels=region_labels, cex=0.8)
# Create a legend at the right   
legend(1.5, 0.5, c("Africa","NA","ASIA PAC","EUR","LAC"), cex=0.8, fill=colours5)
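
The percentage labels can also be built with prop.table, which divides each count by the total. The sketch below uses a handful of made-up region codes, and it puts the region name into each label so that labels and slices cannot get out of step.

```r
# Hypothetical region codes for eight applications.
region_counts <- table(c("EUR", "EUR", "AP", "AMER", "AMER", "AMER", "AF", "LAC"))

# prop.table() turns counts into proportions; scale to percentages and round.
region_pct <- round(prop.table(region_counts) * 100, 1)

# Build labels such as "AMER (37.5%)" from the table's own names.
pct_labels <- paste0(names(region_pct), " (", as.vector(region_pct), "%)")
print(pct_labels)
```

These labels can then be passed to pie through its labels argument, in which case a separate legend is optional.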

# Applications by two-letter ISO country codes. Use the table function to generate frequency count and then convert to a data.frame
locs <- data.frame(table(tlds$Location))
# Only consider those countries with 10 or more applications
sub_locs <- subset(locs,subset=(locs$Freq > 9))
# Change the default column names to something more meaningful than Var1
colnames(sub_locs) <- c("Country","Freq")
# create a bar chart of the data
barplot(sub_locs$Freq, main="Applications by Country", col=rainbow(length(sub_locs$Country)), xlab="Country", ylab="Freq", names.arg=sub_locs$Country, cex.axis=0.7, cex=0.5)
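
The bars above appear in whatever order table returns the countries. If you would rather see the largest first, the data frame can be sorted before plotting. A minimal sketch with made-up country codes:

```r
# Hypothetical country codes for six applications.
locs <- data.frame(table(c("US", "US", "US", "DE", "DE", "GB")))
colnames(locs) <- c("Country", "Freq")

# order() on the negated counts gives a descending sort.
locs <- locs[order(-locs$Freq), ]

# Convert the factor to plain text so it can be used directly as bar labels.
bar_names <- as.character(locs$Country)
print(locs)
```

The sorted Freq column and bar_names can then be given to barplot in the same way as above.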

# Organisations and frequency of applications
applicants <- data.frame(table(tlds$Applicant))
# Stats on the number of applications by organisation: on average (arithmetic mean) there are fewer than 2 applications per organisation
> summary(applicants$Freq)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.000   1.000   1.000   1.672   1.000 101.000 
# Change the column names to something meaningful
colnames(applicants) <- c("Applicant","Freq")
# Sort in descending order
applicants <- applicants[order(-applicants$Freq),]
# Let's see which organisations submitted ten or more applications
applicants[applicants$Freq > 9,]
                             Applicant Freq
188      Charleston Road Registry Inc.  101
43                  Amazon EU S.à r.l.   76
1058 Top Level Domain Holdings Limited   70
1084                Uniregistry, Corp.   54
30                     Afilias Limited   26
1085            United TLD Holdco Ltd.   26
640                           L'Oréal   14
869                 Richemont DNS Inc.   14
268               Dish DBS Corporation   13
653    Lifestyle Domain Holdings, Inc.   13
762                      NU DOT CO LLC   13
1088                     VeriSign Sarl   12
697              Microsoft Corporation   11
1057             Top Level Design, LLC   10
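
Filtering on Freq > 9 returns every organisation with ten or more applications, which is fourteen rows here. If what you want is exactly the ten largest applicants, head on the sorted data frame does that. A sketch with twelve made-up organisations:

```r
# Twelve hypothetical organisations with invented application counts.
applicants <- data.frame(Applicant = paste("Org", 1:12),
                         Freq = c(5, 1, 9, 3, 12, 7, 2, 8, 4, 10, 6, 11),
                         stringsAsFactors = FALSE)

# Sort in descending order of applications, then keep the first ten rows.
applicants <- applicants[order(-applicants$Freq), ]
top_ten <- head(applicants, 10)
print(top_ten)
```

Note that head cuts off at ten rows regardless of ties, so organisations with the same count as the tenth row may be left out.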

Although I've only covered very simple functions, R is actually extremely powerful, as this blog post from Revolution Analytics shows. In case you were wondering I used the Pretty R syntax highlighter for the code in this blog post.

