Exploring ICANN new gTLD Data with R
I've covered ICANN's new gTLD programme in a previous post and provided links to a very slick infographic that I came across. I have recently been getting to grips with R, a free software package and programming language for statistical computing. R is comparable to packages such as SAS and SPSS among others.
R is popular in academia, financial services organisations and the life sciences, and is also used by a number of large organisations such as Google and Facebook. With the increasing interest in "Big Data" it is gaining traction as a means of analysing and deriving insight from data. Although I just mentioned R and "Big Data" in the same breath, R cannot easily handle large volumes of data, but that's because it's intended to be used with a sample of your data. This is changing, as there are parallel implementations of R and various projects that allow you to use R with Hadoop, such as RHIPE, RHadoop and Hadoop Streaming. The Register has a good article on the origins of R and its evolution.
Back to the topic at hand: I've been working with R to get a better understanding of its features and capabilities. In doing so I've been analysing the data that ICANN has published. Yes, I know you don't need R for what I'm about to show you, and it could fairly easily be achieved with Excel and Pivot Tables, but the point is to show the versatility of R (that, and I'm just starting out with R!).
You can obtain R from the R project website, and I would recommend using the RStudio IDE as it brings the console, graphics viewer and help information together (as opposed to the rather confusing, disjointed experience you get from the interface that ships with R).
Before you actually import the data from ICANN, which is supplied as a CSV file, you'll need to convert it from UCS-2 Little Endian to UTF-8, otherwise the data will not be imported correctly. You can use a text editor such as Notepad++ to do this.
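If you'd rather not rely on a text editor, the conversion can also be done from within R by reading the file through a connection with the source encoding declared and writing it back out as UTF-8. This is only a minimal sketch: the file names here are my own placeholders, and depending on your platform you may need to tweak the encoding name (e.g. "UTF-16LE") or strip a leading byte-order mark.

# Open the original file, declaring its encoding so R converts it on read
con_in <- file("gtld-applications-ucs2.csv", open = "rt", encoding = "UCS-2LE")
lines <- readLines(con_in)
close(con_in)
# Write the same lines back out as UTF-8
con_out <- file("gtld-applications.csv", open = "wt", encoding = "UTF-8")
writeLines(lines, con_out)
close(con_out)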
Here's how you can import the CSV file into a data.frame (a type of object in R that can be logically thought of as a table):
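The snippet from the original post isn't reproduced here, but a minimal sketch, assuming the converted file has been saved as gtld-applications.csv (the file name is an assumption), would look something like this:

# Read the converted CSV file into a data.frame
tlds <- read.csv("gtld-applications.csv", header = TRUE, sep = ",", quote = "\"")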
The header=TRUE parameter, as you might have guessed, means the first row in the CSV file contains the column names, sep specifies the character used to separate the values (a comma), and quote specifies the character used for enclosing strings.
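As a quick sanity check (not part of the original post), a couple of standard functions will confirm the import worked as expected:

# Number of rows (applications) and columns read in
dim(tlds)
# First few rows and the inferred column types
head(tlds)
str(tlds)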
Once you have the data imported you can quickly start generating statistics and graphs:
# The sum function can be used to determine the number of geographic TLD applications
> sum(tlds$Geographic == 'Yes')
[1] 66

# The number of applications for IDNs
> sum(tlds$IDN == 'Yes')
[1] 116

# The number of applications for strings that are non-Latin
> summary(tlds$Script.Code)
      Arab Cyrl Deva Hang Hani Hans Hant Hebr Jpan Kana Latn Thai
1814    15    8    3    3   12   57    4    1    2    8    2    1

# Convert the Region column from a factor to character, then replace NA
# (which stands for North America) with AMER
tlds$Region <- as.character(tlds$Region)
tlds$Region[is.na(tlds$Region)] <- "AMER"
# Then convert back to a factor so we can graph the data
tlds$Region <- as.factor(tlds$Region)
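The next snippet refers to plotting "the same data" as a pie chart, which suggests the original post first showed the regional breakdown as a bar chart. A minimal sketch of what that chart might have looked like (the title and colours are my own assumptions):

# Frequency count of applications by region
region_table <- table(tlds$Region)
# Plot the regional breakdown as a bar chart
barplot(region_table, main="Applications by Region",
        col=rainbow(length(region_table)), xlab="Region", ylab="Freq")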
# The same data but as a pie chart
region_table <- table(tlds$Region)
pie(region_table, main="Applications by Region", col=c("red","green","yellow","blue"))

# Calculate each region's share as a percentage of all applications
region_labels <- round(region_table / sum(region_table) * 100, 1)
# The same chart again, letting rainbow() generate one colour per region
pie(region_table, main="Applications by Region", col=rainbow(length(region_table)))

# Concatenate a '%' char after each value
region_labels <- paste(region_labels, "%", sep="")
colours5 <- rainbow(5)
# Create a pie chart with a defined heading and custom colours and labels
pie(region_table, main="Applications by Region", col=colours5, labels=region_labels, cex=0.8)
# Create a legend at the right
legend(1.5, 0.5, c("Africa","NA","ASIA PAC","EUR","LAC"), cex=0.8, fill=colours5)
# Applications by two-letter ISO country code. Use the table function to generate
# frequency counts and then convert to a data.frame
locs <- data.frame(table(tlds$Location))
# Only consider those countries with 10 or more applications
sub_locs <- subset(locs, subset=(locs$Freq > 9))
# Change the default column names to something more meaningful than Var1
colnames(sub_locs) <- c("Country","Freq")
# Create a bar chart of the data
barplot(sub_locs$Freq, main="Applications by Country",
        col=rainbow(length(sub_locs$Country)), xlab="Country", ylab="Freq",
        names.arg=sub_locs$Country, cex.axis=0.7, cex=0.5)
# Organisations and frequency of applications
applicants <- data.frame(table(tlds$Applicant))

# Stats on the number of applications by organisation: on average (arithmetic mean)
# there are only ~2 applications per organisation
> summary(applicants$Freq)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  1.000   1.000   1.000   1.672   1.000 101.000

# Change the column names to something meaningful
colnames(applicants) <- c("Applicant","Freq")
# Sort in descending order of the number of applications
applicants <- applicants[order(-applicants$Freq),]
# Let's see which organisations made ten or more applications
> applicants[applicants$Freq > 9,]
                             Applicant Freq
188      Charleston Road Registry Inc.  101
43                  Amazon EU S.à r.l.   76
1058 Top Level Domain Holdings Limited   70
1084                Uniregistry, Corp.   54
30                     Afilias Limited   26
1085            United TLD Holdco Ltd.   26
640                            L'Oréal   14
869                 Richemont DNS Inc.   14
268               Dish DBS Corporation   13
653    Lifestyle Domain Holdings, Inc.   13
762                      NU DOT CO LLC   13
1088                     VeriSign Sarl   12
697              Microsoft Corporation   11
1057             Top Level Design, LLC   10
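As a small aside that isn't in the original post: if you want the ten busiest applicants specifically, rather than everyone with ten or more applications, head() on the sorted data frame does the job.

# The ten organisations with the most applications
head(applicants, 10)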
Although I've only covered very simple functions, R is actually extremely powerful, as this blog post from Revolution Analytics shows. In case you were wondering, I used the Pretty R syntax highlighter for the code in this blog post.