Thursday, August 9, 2012

How Many Data Scientists Are There?


I've seen a lot of articles lately about “Big Data” and the looming “talent gap.” This article from the Wall Street Journal is a good example. It cites a McKinsey estimate that we will need 1.5 million more managers and analysts who are conversant with “big data.” Of course, some of this is the media latching onto the next “big thing” (data), but some of it is true. Even anecdotal evidence, such as the number of job postings you find when you search for “data science,” indicates that there is significant unmet demand for data analysis skills.

This led me to wonder how we could quantify this gap and, once we can quantify it, whether there has been a commensurate increase in the number of people with the skills to work with big data.

As someone who works with data, I find this interesting simply because I want to know the state of the field. I am a pretty recent entrant into the area, and I would like to see more people get into it.

Potential Ways to Quantify Data Analysis Supply and Demand

There are a few different ways we could approach this:
  1. We can use Google Search trends to find the trend for the term “big data.”
    -Pros: Easy. Look, I already did it.
    -Cons: Trivial, doesn't really give us a way to disambiguate supply (number of data scientists) and demand (companies looking for analytical skills).
  2. We can crawl job websites looking for jobs that mention data in some way.
    -Pros: Probably a pretty comprehensive way to look at the “demand” side of the equation. We can use number of days a job posting is up as a proxy for supply.
    -Cons: Our proxy for supply will be pretty noisy because some job listings stay up forever, and when they are taken down doesn't necessarily correlate with when they are filled. Also, this will give us very unstructured data, and will be very dependent on which job sites we crawl.
  3. We can look at Kaggle.com, which is a website that hosts data science competitions, to see how many people enter each competition, and how that count changes over time.
    -Pros: Relatively structured data (competitor count). Gives us a window into both sides of the market (supply is number of competitors, demand is number of companies hosting competitions).
    -Cons: Not all companies have heard of Kaggle, and not all people who work with data have heard of Kaggle. The numbers will be biased because competitor count is dependent on two things: people hearing about Kaggle, and people being interested in/having the skills to work with data. We are only really interested in the second part, but it will be affected by the first part.
Ultimately, I chose to go with option 3, for a few reasons:

-It is simpler than creating a crawler, and much less trivial than relying on Google Trends.
-I am familiar with the platform.
-While the competitor count is dependent to some extent on who has heard of Kaggle, I think that it is fair to say that as of a few months ago, most people who work with data had heard of Kaggle in some form (this is strictly anecdotal, though).

Defining and Solving the Problem

So now, we still want to look at the supply of data scientists, and the demand for data scientists, but we want to do it within the context of Kaggle.com.

If you are unfamiliar with Kaggle, it is a crowdsourcing platform that allows companies to host competitions on various aspects of their data. For example, one company wanted individuals to predict bond prices in the future. Competitors register for these competitions, and are ultimately awarded prize money based on their final position in the standings.

So, this gives us an easy way to define supply and demand. Supply can be defined as the number of competitors that are actively engaged in the Kaggle platform at any one time. As each competition has an end date, a competitor will only be counted as “active” if he or she has an entry in a competition that has not ended yet.

Demand can simply be counted as the number of competitions that are currently active. This is a bit tricky, because it seems that demand for the Kaggle platform from the company side might be increasing faster than the overall demand for Big Data (given that Kaggle is relatively new and it takes time for word to circulate).

Now, to define our procedure:

Each Kaggle competition has a leaderboard associated with it. The leaderboard lists all active participants, along with their rank in the competition. Additionally, Kaggle allows for old leaderboards to be seen.

Because we can see the leaderboard at various points in time, we can easily figure out how many active participants each competition had at any given time point. Adding the unique users for each active competition will give us a count of active users at that time point. If we do this for multiple time points across all competitions, we can figure out how the number of active users changed over time.
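The counting procedure above can be sketched in a few lines of R. This is only an illustration: the snapshots and competitions data frames, their column names, and every value in them are made up, and the real scraped data may be shaped differently.

```r
# Hypothetical weekly leaderboard snapshots: one row per (week, competition, user).
snapshots <- data.frame(
  week = as.Date(c("2012-06-04", "2012-06-04", "2012-06-04",
                   "2012-06-11", "2012-06-11")),
  competition = c("A", "B", "A", "A", "B"),
  user = c("bob", "bob", "alice", "carol", "bob"),
  stringsAsFactors = FALSE
)

# Hypothetical competition end dates.
competitions <- data.frame(
  competition = c("A", "B"),
  end_date = as.Date(c("2012-06-30", "2012-06-08")),
  stringsAsFactors = FALSE
)

# Keep only rows where the competition had not yet ended in that week.
merged <- merge(snapshots, competitions, by = "competition")
active <- merged[merged$week <= merged$end_date, ]

# Count each user (and each competition) once per week.
week_key <- format(active$week)
active_users <- tapply(active$user, week_key, function(u) length(unique(u)))
active_comps <- tapply(active$competition, week_key, function(x) length(unique(x)))
```

Here bob is counted once in the first week even though he appears on two leaderboards, and his entry in competition B stops counting once B has ended.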

Active Participants by Competition

This leads us to a (fairly messy) chart that shows how each competition gained active participants over time. As data was only scraped on a weekly basis, the competition start times shown may be off by up to a week. Once a competition's line becomes flat, the competition has closed. Generally, competitions that are higher up in the legend are more recent.
This is interesting within the context of Kaggle, but it doesn't really tell us much about the overall supply and demand for data scientists. We can see that some of the older competitions gained participants faster than some of the newer ones, which may indicate that demand is outstripping supply, but we will need to look at the aggregate numbers across competitions to make judgements there.

Active Participants Overall

Now, we can aggregate the numbers, and look at unique active users across all competitions. So if “Bob” is active in competition A and competition B at the same time, he is only counted once.
As participants are no longer counted as “active” when a competition ends, we see some rather dramatic oscillations. We can better understand these oscillations if we graph number of active competitions alongside number of unique participants. I am scaling the number of competitions by a factor of 100 to make them appear legibly on the same chart as user count.

We can see how closely related the number of active competitions is to the number of unique competitors. In fact, the linear correlation between the two is 0.71.
We can fit a linear model to the data and figure out what the expected number of users based solely on the number of active competitions should be.
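A minimal sketch of the correlation and the linear fit described above, using base R. The weekly data frame and its numbers are invented for illustration and are not the scraped Kaggle figures.

```r
# Hypothetical weekly counts of active competitions and unique active users.
weekly <- data.frame(
  competitions = c(2, 3, 5, 4, 6, 8),
  users = c(250, 310, 520, 400, 640, 790)
)

# Pearson correlation between the two series.
r <- cor(weekly$competitions, weekly$users)

# Expected number of users given only the number of active competitions.
fit <- lm(users ~ competitions, data = weekly)
weekly$expected <- predict(fit)

# Positive gap: more users than the competition count alone would suggest.
weekly$gap <- weekly$users - weekly$expected
```

Plotting weekly$users against weekly$expected over time is one way to spot periods where actual participation falls below what the competition count predicts.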


We can see that they are fairly well correlated. A correlation of 0.71 means that roughly half of the variance in the number of users is explained by the number of competitions (0.71² ≈ 0.50), although we do see that when competitions are close to ending, there appears to be a large rush of participants. Also, very recently (the month of July), we see that the actual number of unique active users is far below the expected number (ignore the very end of the chart, where all numbers drop to zero).

How Quickly Competitions Attract Participants

Another way to look at the supply of data scientists versus demand is to see if more recent competitions (that have come about in a time when there are more active competitions overall) gain participants more slowly or more quickly than previous competitions.


This plot shows us how many users per day each competition attracts, and how that has changed over time. Although it may look like there is a trend in this plot (particularly towards the very end, when slopes are small), there is no significant correlation between date of competition launch and number of users gained per day.
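One way to check for such a correlation is cor.test from base R. The sketch below uses a made-up launches data frame; the dates and rates are placeholders, not the real competition data.

```r
# Hypothetical launch dates and average users gained per day.
launches <- data.frame(
  launch_date = as.Date(c("2011-01-10", "2011-04-02", "2011-08-15",
                          "2012-01-20", "2012-05-05")),
  users_per_day = c(6.1, 3.4, 7.9, 4.2, 5.5)
)

# Dates are converted to days since 1970-01-01 so they can be correlated.
ct <- cor.test(as.numeric(launches$launch_date), launches$users_per_day)

ct$estimate  # Pearson correlation coefficient
ct$p.value   # a large p-value means no significant linear trend
```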

Total Participants Over Time

In previous charts, we looked at unique active users or unique participants over time. We can also look at aggregate number of unique users over time–the total number of unique individuals who have submitted an entry to any Kaggle competition. This shows us how the platform is growing.

 

New Users Over Time

We can figure out new users at each weekly time period (users who have submitted entries in the week who did not submit any previous entries).
Graphing this allows us to see how the community is growing and expanding.
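As a rough sketch, new users per week and the running total of unique users can both be derived from each user's first appearance. The entries data frame below is hypothetical.

```r
# Hypothetical submissions: one row per (week, user) with at least one entry.
entries <- data.frame(
  week = c(1, 1, 2, 2, 3, 3, 3),
  user = c("ann", "bob", "bob", "cat", "ann", "dan", "eve"),
  stringsAsFactors = FALSE
)

# Sort by week, then keep only each user's first appearance.
entries <- entries[order(entries$week), ]
first_seen <- entries[!duplicated(entries$user), ]

# New users in each week, and the cumulative number of unique users.
new_per_week <- table(first_seen$week)
cumulative <- cumsum(new_per_week)
```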

We can see that the number of new users each week is somewhat correlated with the competition count, and has remained somewhat steady over the past few months.

Some Random Observations

Okay, so we have some data and some (maybe) pretty pictures, but what does it tell us? We can gain some insight from this.
  1. There are clearly some competitions that are favored strongly over others. As someone who has participated in a lot of Kaggle competitions, I can safely say that these are competitions that have an interesting premise, the potential for an interesting opportunity (KDD Cup/Facebook Recruiting), or have data that is easy to work with in terms of format and presentation (Bio response).
  2. It seems like the average number of competitors per competition has been pretty constant, even as Kaggle has ramped up their number of concurrent competitions. Some recent competitions have not seen much uptake, but that could be the combination of several relatively insignificant factors (summer, uninteresting competitions, etc), rather than signalling that we have reached capacity.
  3. As we can see by the number of people who have submitted an entry overall vs the number of people who are active at any given time, there is a large data science community that only uses Kaggle when they see something interesting. As fresh people begin to compete, it seems that older users stop using the platform, whether it be from boredom, lack of time, or another issue. This keeps the number of active unique users much more constant (slow growth) than the aggregate number of users.
    This points to there being constant new entrants into the data science world, at least within the context of Kaggle, but it is hard to figure out if these entrants are new to data science entirely, or simply new to the Kaggle platform. The Kaggle forums suggest a mixture of the two.
  4. Along with a rising overall supply of data scientists, we have also seen rising demand, as the number of competitions has been steadily increasing. This could simply reflect rising interest in the Kaggle platform, but it might also point to a rising interest in data science at the corporate level.

Inconclusive Conclusions

It's hard to generalize from this data, as it all came within the context of a platform. You can think of Kaggle as a fisherman that has gradually invested in better technology and better bait. Over time, more data scientists and more companies have been “caught,” but whether that reflects the better bait, or the fact that the number of fish in the ocean is increasing, it is hard to say.
So what can we conclude? A few key items jump off the page:
  1. People will enter the field of data science, but only if they can find something interesting/rewarding to work on. We see a lot of active unique entrants in a few competitions that have low barriers to entry or offer commensurately high rewards. We also see a rising number of new users surrounding particularly interesting competitions.
  2. Problems that are less exciting, or perhaps less accessible, may need to be reformulated to appeal to the mainstream data community, and crossovers from other fields. If a company wants to attract high quality talent, they need to interest and engage them. We see a lot of competitions get very little traction.
  3. The number of new users on Kaggle seems fairly steady. This may indicate that demand will soon outstrip supply, as more competitions are run without a commensurate increase in the number of participants, although the number of participants and the competition count do appear to be fairly well correlated.
    The fact that there is a constant stream of new users is also encouraging, because, anecdotally, most people in the data community heard about Kaggle months ago. This indicates both that existing data scientists are always looking for interesting problems to tackle, and that new people are moving into data science as they see interesting problems.
  4. Corporate interest in data science overall seems to be increasing more quickly than the supply of new data scientists.
None of these are definitive, and the method used for analysis constrains the interpretability of the results. Nonetheless, I think that there are some interesting threads here, and would love to hear anyone's thoughts on this.

Monday, June 18, 2012

Tracking US Sentiments Over Time In Wikileaks

Introduction

I recently posted about using the Wikileaks cable corpus to find word use patterns, both over time, and in secret cables vs unclassified cables.

I received a lot of good suggestions for further topics to pursue with the corpus, and probably the most interesting was the idea to do sentiment analysis over time on a variety of named entities.

Sentiment analysis is the process of discovering whether a writer feels negatively or positively about a topic. Named entities in this case would be country names such as China and India, and the names of important world figures, such as Saddam Hussein or Tony Blair.

So, in essence, we are seeing how US diplomats, and by extension the US, felt about a variety of topics, and how those feelings changed over time, from the first available cables (the 1980s) to the present.

The goal is to get a chart like this one:

US Sentiment over Time

How will we do this?

Useful sentiment analysis can be extremely complex at times, requiring a corpus of sentences to be mapped to sentiment scores.

In order to make this exercise simpler, I traded off some accuracy and used a word list instead (the AFINN list). This word list assigns a “sentiment score” of -5 to 5 to 2477 English words. For example, the word adore has a score of 3, denoting a positive sentiment, whereas the word abhorred has a sentiment of -3, indicating negative sentiment.
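To make the word-list idea concrete, here is a minimal sketch of scoring a piece of text by summing the scores of the listed words it contains. The two-word lexicon holds only the two examples quoted above, and score_text is a hypothetical helper, not part of the AFINN distribution.

```r
# Tiny stand-in lexicon using the two example scores mentioned above.
lexicon <- c(adore = 3, abhorred = -3)

# Hypothetical helper: lower-case the text, split it into words,
# and add up the scores of any words found in the lexicon.
score_text <- function(text, lexicon) {
  words <- strsplit(tolower(text), "[^a-z']+")[[1]]
  sum(lexicon[words], na.rm = TRUE)  # words not in the lexicon contribute nothing
}

score_text("They adore the new plan", lexicon)
score_text("The policy was abhorred", lexicon)
```

Text containing no listed words scores 0, which is one reason a plain word list trades away accuracy.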

Our next task is named entity recognition. We will use the AFINN word list in conjunction with a list of named entities. Named entities in this case would be important topics from the news, so we will use the JRC-Names word list, which pulls out important keywords from news articles. We will use these keywords to define our topics. For example, “China” is a keyword, as is “India”. These are the topics that we will analyze sentiment for.

Now, in order to find the sentiment for a given topic, we will need to find out whether it appears in conjunction with negative or positive words. For example, the phrase “China abandoned an environmental project” would indicate negative sentiment, whereas “China is building partnerships” would indicate positive sentiment. In order to do this, we will need to find out when our topic words (named entities) and our words that indicate sentiment appear together in a sentence.

To accomplish this, we can use a technique called random indexing which allows us to build up a matrix that shows how topic words and sentiment words occur together. I opted to use random indexing because it builds a relatively small matrix in terms of dimensionality, and it allows us to capture information on a fairly granular level. The optimal method would be to create a full Term-Document matrix and decompose it to find relations, but it is impractical in this case due to the high sentence count.

Our plan

Now that we have all the preliminaries, here is a high-level look at what we will do:

  1. Get cables for multiple time periods from the database
    • Because there are more cables from 2000 onwards than from pre-2000, we will define 5 year time periods from 1985 to 2000, and 1 year time periods after.
  2. Split the cables into sentences.
  3. Build up matrices using random indexing that contain the topic words from JRC-Names and the sentiment words from AFINN.
  4. Use cosine similarity measures to see how often topic words occur with negative/positive words.
  5. Assign a final “sentiment score” to each topic for each time range.

This plan will give us reasonable results. Because of the way that we are doing sentiment analysis, it won't be perfect (far from it), but it will show some interesting patterns, at least.
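Steps 3 to 5 of the plan can be illustrated with a toy example. This is a simplified sketch of random indexing under my own assumptions, not the code used for the results in this post: the sentences, the sentiment scores, and the way the final score is weighted are all invented for illustration.

```r
set.seed(1)

# Toy inputs (all hypothetical).
sentences <- c("china abandoned an environmental project",
               "china is building partnerships",
               "india welcomed the agreement")
topics <- c("china", "india")
sentiment <- c(abandoned = -2, building = 1, welcomed = 2)  # illustrative scores

ri_cols <- 50  # dimensionality of the random index vectors
nonzero <- 4   # number of +1/-1 entries in each sentence's index vector

terms <- c(topics, names(sentiment))
term_mat <- matrix(0, nrow = length(terms), ncol = ri_cols,
                   dimnames = list(terms, NULL))

# Each sentence gets a sparse random vector; every tracked term appearing
# in the sentence accumulates that vector.
for (s in sentences) {
  index_vec <- numeric(ri_cols)
  index_vec[sample(ri_cols, nonzero)] <- sample(c(-1, 1), nonzero, replace = TRUE)
  words <- intersect(strsplit(s, " ", fixed = TRUE)[[1]], terms)
  for (w in words) term_mat[w, ] <- term_mat[w, ] + index_vec
}

cosine <- function(a, b) {
  denom <- sqrt(sum(a^2)) * sqrt(sum(b^2))
  if (denom == 0) 0 else sum(a * b) / denom
}

# Topic score: sentiment scores weighted by similarity to each sentiment word.
topic_scores <- sapply(topics, function(tp) {
  sims <- sapply(names(sentiment), function(sw) cosine(term_mat[tp, ], term_mat[sw, ]))
  sum(sims * sentiment)
})
```

Terms that always occur in the same sentences end up with identical vectors (cosine similarity 1), while terms that never co-occur are close to orthogonal when ri_cols is large relative to nonzero.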

Formatting JRC-Names and AFINN

JRC-Names and AFINN are not in the best format for this (you will see when you download them), so we need to reformat them to get a character vector of topics. The reformatting also needs to be done because cables frequently refer to people by only their last name, whereas JRC-Names contains full names. We need to make everything into 1-grams.

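# Column 4 of the JRC-Names file is taken as the entity name (words joined by "+")
jrc_names <- read.delim(file = "entities.txt", stringsAsFactors = FALSE)[, 4]
# Drop names containing anything other than word characters or "+"
# (grepl, unlike -grep, does not empty the vector when nothing matches)
jrc_names <- jrc_names[!grepl("[^\\w+]", jrc_names, perl = TRUE)]
# Split each name on "+" and count how often each lower-cased word occurs
jrc_names <- strsplit(jrc_names, "+", fixed = TRUE)
jrc_tab <- sort(table(tolower(unlist(jrc_names))), decreasing = TRUE)
# Keep words that occur more than twice and are 3 to 14 characters long
jrc_names <- names(jrc_tab)[jrc_tab > 2]
jrc_names <- jrc_names[nchar(jrc_names) < 15 & nchar(jrc_names) > 2]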

afinn_list <- read.delim(file = "AFINN-111.txt", header = FALSE, 
    stringsAsFactors = FALSE)
names(afinn_list) <- c("word", "score")
afinn_list$word <- tolower(afinn_list$word)

full_term_list <- c(jrc_names, afinn_list$word)

This code drops names from JRC-Names that contain anything other than word characters or the + separator (which filters out most non-English names), splits each remaining name on the + sign, and builds a vector of the words that appear more than twice and are between 3 and 14 characters long.

Defining Date Ranges

We now need to define what date ranges we want our cables to come from. Because there aren't many cables available pre-2000, we will select 5 years at a time from 1985-2000.

date_min_list <- c("1985", "1990", "1995", "2000", "2001", "2002", 
    "2003", "2004", "2005", "2006", "2007", "2008", "2009")
date_max_list <- c("1990", "1995", "2000", "2001", "2002", "2003", 
    "2004", "2005", "2006", "2007", "2008", "2009", "2010")

Generating Sentiment Scores

Now, we need to follow our plan from above and have the code that generates our final sentiment scores. The load or install function is documented here.

This code is very inefficient, so please feel free to improve it. To get it to run on low-memory systems, you can lower the ri_cols or max_cables_to_sample settings. Higher values use more memory but should give more accurate results.

You can find the code for this here, as sentiment_score_generation.R.

This is a very long piece of code, but it is basically doing what our plan stated. It is getting cables for each time period, splitting them into sentences, and finding out which sentiment words and topic words occur together. It is then finding out which topic is associated with negative sentiment, and which is associated with positive sentiment, and then assigning a final score to each topic on that basis.
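As a small illustration of the first step (getting cables for each time period), the date lists defined earlier can be turned into one query per period. This sketch assumes a table named cable with a date column and half-open periods starting on January 1st; it only builds the query strings and does not touch the database.

```r
# Same date ranges as defined earlier, written more compactly.
date_min_list <- c("1985", "1990", "1995", as.character(2000:2009))
date_max_list <- c("1990", "1995", as.character(2000:2010))

# One query per period; table and column names are assumptions.
queries <- sprintf(
  "SELECT * from cable WHERE date > '%s-01-01' AND date < '%s-01-01'",
  date_min_list, date_max_list
)

queries[1]
```

Each element of queries could then be passed to sqlQuery() inside a loop over the periods.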

Plotting the results

Now, we are ready to make plots indicating sentiment over time.

You can find the plotting code here, as sentiment_plot.R.

This generates the following plot:

US Sentiments-Middle East

The black line indicates the mean sentiment by year. You can see that the average US sentiment dips around 2003 (the year on the x-axis is the ending year for the gathered cables, so 2010 would be cables from January 1st, 2009 to January 1st, 2010, for example). This is likely due to countries not supporting the US war effort in Iraq. If you have a better interpretation, I would love to hear it.

More country plots

Here are US sentiments towards the English-speaking world. “New Zealand” becomes “Zealand” because we are only dealing with 1-grams:

US Sentiments-English World

You can see that sentiment towards the English-speaking world seems to be much better overall.

Here are US Sentiments towards some of the countries with recent protests/overthrows. Tripoli is a proxy for Libya, and Tunis is a proxy for Tunisia, because those terms did not seem to make it into the JRC-names list that we constructed:

US Sentiments-Arab Spring

US Sentiments - Europe

US Sentiments - Asia

Country Interpretation

The US seems to have slightly negative sentiment towards every country, particularly after 2003. This could be due to many factors:

  • Countries not supporting the Iraq war.
  • A change from Madeleine Albright (1997-2001) to Colin Powell (2001-2005) to Condoleezza Rice (2005-2009). Perhaps their attitudes shaped the attitudes of the cable writers.
  • Changes in administration from Bill Clinton (1993-2001) to George W. Bush (2001-2009) to Barack Obama (2009-). The attitude of the President can definitely impact cable writing, as I can attest, and you can see some upticks in sentiment from 2009-2010, when Obama took office.

Personally, I think that the war may have been the biggest factor in the changing cable language, but this is just speculation, so I would love to hear any ideas on this.

World Figure Plots

Now, we can also plot major world figures:

US Sentiments- Dictators

The above are some of the ex-dictators that have been in the news lately. You can see some very interesting patterns (Hussein becomes associated with very negative sentiment right when the second Iraq war starts, for example).

Here are US Sentiments towards some world leaders:

US Sentiments- World Leaders

World figure interpretation

The US seems to have some strange sentiments towards world figures/leaders.

  • The dictators do not seem to have been universally reviled prior to their ousters.
  • Sentiment seems to be improving from 2009-2010 (perhaps due to Obama taking office).

Any more interpretation/thoughts would be appreciated!

Conclusion

This has been a very interesting post for me, and I hope that it can be built upon. Please let me know your thoughts, and/or if you would like to see any different analyses done.

Tuesday, June 12, 2012

NBA Predictions -- Finals

Now we are on to the finals! The algorithm enters the finals with a 6-4 record so far. Here is what we have for tonight:



So, let's see if OKC wins this one.

Finding word use patterns in Wikileaks cables

6/18: A follow-up to this post is now available here.

Recent Discoveries

When I was a diplomat, I was always interested in the Wikileaks cables and what could be done with them. Unfortunately, I never got a chance to look at the site in depth, due to security policies. Now that the ex- is firmly prepended to diplomat in my resume, I think that I am finally ready to take that step.
I recently realized that the Wikileaks cables are available in a handy .sql file online. This of course allowed me to download all 250,000 of them and import them into a database table (I used psql and the \i command).
If you are interested in obtaining the cables for yourself, you will need to download the torrent from here.
Let me just clarify here that I will not be printing the text of any of these cables (which has been done in several newspapers), and that I will not be using any data that is not readily publicly available online.

That's great, but what can we do with them?

After I had the cables, I brainstormed to see what I could actually do with them that would be interesting. I came up with a few ideas:
  1. Find how topics have changed over time.
    • It's reasonable to assume that the focus of the cables would have shifted from “Soviet Union” this and “USSR” that to the Middle East.
  2. Find out what words typify State Department writers.
    • Anyone who has read cables knows that while they are (mostly) in English, it's a strange kind of English.
  3. Find out what words/topics typify secret/classified vs unclassified cables.
    • What topics are more likely to be classified? Does word choice change in classified vs unclassified cables?
I will get into these topics and more as we continue on through this post.

Starting to work with the data

The first thing we need to do is read the data from a database. I interfaced with my PostgreSQL database via ODBC.

library(RODBC)  # provides odbcConnect() and sqlQuery()
channel <- odbcConnect(db_name, uid = "", pwd = "")  # db_name: your ODBC data source name
 
Now, let's get all the cables from 2010 onwards:

cable_frame <- sqlQuery(channel, "SELECT * from cable WHERE date > '2010-01-01'", 
    stringsAsFactors = FALSE, errors = TRUE)
 
We can make a plot of which senders sent the most cables from 2010 onwards:

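library(ggplot2)
last_10 <- tail(sort(table(cable_frame$origin)), 10)
# opts()/theme_blank()/theme_text() are gone from current ggplot2; use theme() instead
plot_frame <- data.frame(origin = names(last_10), cables = as.numeric(last_10))
ggplot(plot_frame, aes(origin, cables)) + geom_col() +
    theme(axis.title = element_blank(), axis.text.x = element_text(size = 8))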
 

We can see that the Secretary of State and Embassy Baghdad are the two biggest offenders.

Now, we can get all of the cables in the database and see how cable traffic changed over time (or perhaps Wikileaks had a biased sample):

all_cables <- sqlQuery(channel, "SELECT * from cable", stringsAsFactors = FALSE, 
    errors = TRUE)
date_tab <- table(as.POSIXlt(all_cables$date)$year + 1900)
# Bar chart of cable counts per year (theme() replaces the old opts() calls)
year_frame <- data.frame(year = names(date_tab), cables = as.numeric(date_tab))
ggplot(year_frame, aes(year, cables)) + geom_col() +
    theme(axis.title = element_blank(), axis.text.x = element_text(size = 8))
 

The number of cables rises almost exponentially from 2000 until 2009. I'm assuming that only some of the cables for 2010 were leaked, explaining the low count there.

We can get rid of the all_cables file, as we won't need it going forward:
rm(all_cables)
gc()
 
Comparing word usage in the 80's and 90's to word usage today

Now, we can get to something interesting: we can compare how word usage/topics shifted from 1980-1995 to today. Because there are relatively few cables from early on, we have to specify a 15 year range, which nets us only around 675 cables.

cable_present <- sqlQuery(channel, "SELECT * from cable WHERE date > '2010-02-15'", 
    stringsAsFactors = FALSE, errors = TRUE)
cable_past <- sqlQuery(channel, "SELECT * from cable WHERE date > '1980-01-01' AND date < '1995-01-01'", 
    stringsAsFactors = FALSE, errors = TRUE)
 
Now, we have two challenges. The cables all have line breaks and returns (\r and \n), and a lot of the older cables are in all caps. We will get rid of these issues by removing the breaks/returns and converting everything to all lower case.

# Replace line breaks and carriage returns with a space, so that words on
# adjacent lines are not glued together, then convert everything to lower case
combined <- tolower(gsub("[\\r\\n]+", " ", c(cable_past$content, cable_present$content)))
 
Now, we can construct a term document matrix which counts the number of times each term occurs in each document:

corpus <- Corpus(VectorSource(combined))
corpus <- tm_map(corpus, stripWhitespace)
cable_mat <- as.matrix(TermDocumentMatrix(corpus, control = list(weighting = weightTf, 
    removePunctuation = TRUE, removeNumbers = TRUE, wordLengths = c(4, 15))))
cable_mat <- cable_mat[rowSums(cable_mat) > 3, ]
 
We remove any words that are under 4 characters or over 15 characters, and additionally remove any terms that appear 3 times or fewer in the whole group of cables.
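For readers who have not used the tm package, the same idea can be sketched in base R on a toy corpus. The documents are made up, and the frequency cut-off is lowered to suit such a small example.

```r
# Three tiny hypothetical documents.
docs <- c("soviet talks resume",
          "talks on trade and training",
          "training program talks")

tokens <- strsplit(docs, " ", fixed = TRUE)
vocab <- sort(unique(unlist(tokens)))

# Rows are terms, columns are documents, cells are raw term counts.
tdm <- vapply(tokens,
              function(tk) as.integer(table(factor(tk, levels = vocab))),
              integer(length(vocab)))
rownames(tdm) <- vocab

# Keep terms of 4 to 15 characters, then drop terms seen only once overall.
keep <- nchar(vocab) >= 4 & nchar(vocab) <= 15
tdm <- tdm[keep, , drop = FALSE]
tdm <- tdm[rowSums(tdm) > 1, , drop = FALSE]
```

With these toy documents only talks and training survive both filters.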

For convenience, we can split the matrix into one containing past cables and one containing current cables:

present_mat <- cable_mat[, (nrow(cable_past) + 1):ncol(cable_mat)]
past_mat <- cable_mat[, 1:nrow(cable_past)]
rm(cable_mat)
gc()
 
Now we can get to the good stuff and find differential word usage between the two sets of cables:

chisq_vals <- chisq(rowSums(past_mat), ncol(past_mat) * 100, rowSums(present_mat), 
    ncol(present_mat) * 100)
chisq_direction <- rep(-1, length(chisq_vals))
mean_frame <- data.frame(past_mean = rowSums(past_mat)/ncol(past_mat), 
    present_mean = rowSums(present_mat)/ncol(present_mat))
chisq_direction[mean_frame[, 2] > mean_frame[, 1]] <- 1
chisq_vals <- chisq_vals * chisq_direction
cloud_frame <- data.frame(word = rownames(present_mat), chisq = chisq_vals, 
    past_sum = rowSums(past_mat), present_sum = rowSums(present_mat))
pal <- brewer.pal(9, "Set1")
 
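The above code calculates a chi-squared statistic (chisq) for each term, measuring how differently it is used in the first set of cables (1980-1995) and the second set (cables from February 2010). The sign is then set to positive for terms that are more common in the 2010 cables and negative for terms that are more common in the older cables.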
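The chisq() helper called above is not shown in this post. The sketch below is a hypothetical stand-in with the same argument order, assuming it computes a Pearson chi-squared statistic from a 2x2 table of term count versus all other words in each set; the real helper may differ.

```r
# Hypothetical stand-in for chisq(); an assumption, not the original helper.
# count_*: occurrences of each term in a set; total_*: total words in that set.
chisq_sketch <- function(count_one, total_one, count_two, total_two) {
  a <- count_one              # term occurrences in set one
  b <- total_one - count_one  # all other words in set one
  c2 <- count_two             # term occurrences in set two
  d <- total_two - count_two  # all other words in set two
  n <- total_one + total_two
  # Pearson chi-squared for a 2x2 table, without continuity correction
  n * (a * d - b * c2)^2 / ((a + b) * (c2 + d) * (a + c2) * (b + d))
}

# Made-up counts for two terms in two sets of 1000 words each.
counts_one <- c(soviet = 30, training = 5)
counts_two <- c(soviet = 10, training = 40)
stats <- chisq_sketch(counts_one, 1000, counts_two, 1000)
```

The function is vectorized over terms, so it can take the rowSums() of two term-document matrices directly. It returns NaN for a term that appears in neither set, and it is only meaningful when the totals are at least as large as the counts.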

Now we can make some word clouds. This first cloud contains words that are significantly more common in the 2010 cables than in the 1980-1995 cables. The larger the word, the more strongly it is associated with the 2010 cables:

wordcloud(cloud_frame$word, cloud_frame$chisq, scale = c(8, 0.3), 
    min.freq = 2, max.words = 100, random.order = T, rot.per = 0.15, colors = pal, 
    vfont = c("sans serif", "plain"))


This second cloud indicates the words that appear in a significant way in the 1980-1995 cables, but not in the 2010 cables:

wordcloud(cloud_frame$word, -cloud_frame$chisq, scale = c(8, 0.3), 
    min.freq = 2, max.words = 100, random.order = T, rot.per = 0.15, colors = pal, 
    vfont = c("sans serif", "plain"))
 

As we can see, february is very significant in the first plot, which is to be expected, because all of the cables are from February. But, we can also see interesting patterns, like trafficking becoming very important in 2010 vs 1980-1995, and words like development and training gaining prominence. In the second plot, we see more interest in topics like zagreb, soviet, saudi, and croatia.

Find out what words typify secret/classified cables vs unclassified in 2010

Let's take a look at what words/topics are more prevalent in secret or classified cables. Let's first look at how many cables of each type are in our cable_present data frame:

table(cable_present$classification)
## 
##                        CONFIDENTIAL                CONFIDENTIAL//NOFORN 
##                                 719                                  67 
##                              SECRET                      SECRET//NOFORN 
##                                 188                                  51 
##                        UNCLASSIFIED UNCLASSIFIED//FOR OFFICIAL USE ONLY 
##                                 643                                 756 

Now, we will do something similar to what we did above, where the data was split into 2 chunks and the words in each chunk were compared to generate clouds. I have made the code generic by changing the names to set one and set two.

cable_set_one <- cable_present[cable_present$classification %in% 
    c("SECRET", "SECRET//NOFORN"), ]
cable_set_two <- cable_present[cable_present$classification %in% 
    c("UNCLASSIFIED", "UNCLASSIFIED//FOR OFFICIAL USE ONLY"), ]
# Replace line breaks and carriage returns with a space, so that words on
# adjacent lines are not glued together, then convert everything to lower case
combined <- tolower(gsub("[\\r\\n]+", " ", c(cable_set_one$content, cable_set_two$content)))
corpus <- Corpus(VectorSource(combined))
corpus <- tm_map(corpus, stripWhitespace)
cable_mat <- as.matrix(TermDocumentMatrix(corpus, control = list(weighting = weightTf, 
    removePunctuation = TRUE, removeNumbers = TRUE, wordLengths = c(4, 15))))
cable_mat <- cable_mat[rowSums(cable_mat) > 3, ]
one_mat <- cable_mat[, 1:nrow(cable_set_one)]
two_mat <- cable_mat[, (nrow(cable_set_one) + 1):ncol(cable_mat)]
rm(cable_mat)
gc()
chisq_vals <- chisq(rowSums(one_mat), ncol(one_mat) * 100, rowSums(two_mat), 
    ncol(two_mat) * 100)
chisq_direction <- rep(-1, length(chisq_vals))
mean_frame <- data.frame(one_mean = rowSums(one_mat)/ncol(one_mat), 
    two_mean = rowSums(two_mat)/ncol(two_mat))
chisq_direction[mean_frame[, 2] > mean_frame[, 1]] <- 1
chisq_vals <- chisq_vals * chisq_direction
cloud_frame <- data.frame(word = rownames(one_mat), chisq = chisq_vals, 
    one_sum = rowSums(one_mat), two_sum = rowSums(two_mat))
pal <- brewer.pal(9, "Set1")
 
We are now ready to plot these new word clouds. Here are words that are typical of set 1 (secret cables) that separate it from set 2 (unclassified cables):

wordcloud(cloud_frame$word, -cloud_frame$chisq, scale = c(8, 0.3), 
    min.freq = 2, max.words = 100, random.order = T, rot.per = 0.15, colors = pal, 
    vfont = c("sans serif", "plain"))


And here are words that are typical of set 2 (unclassified cables) that separate it from set 1 (secret cables):

wordcloud(cloud_frame$word, cloud_frame$chisq, scale = c(8, 0.3), 
    min.freq = 2, max.words = 100, random.order = T, rot.per = 0.15, colors = pal, 
    vfont = c("sans serif", "plain"))


This makes sense, as the first cloud has words like icbms and bombers, whereas the second has words like labor and victims, which would be typical of the trafficking in persons/human rights reports.
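
One way to check the clouds against the numbers is to list the top-scoring words directly. The sketch below uses a made-up cloud_frame (the words come from the clouds above, the scores are invented), and top_words is a hypothetical helper. It mimics the sign flip passed to wordcloud: negating chisq ranks set one words first, and, as I understand the min.freq argument, words scoring below the cutoff are simply not drawn.

```r
# Invented stand-in for cloud_frame; negative chisq = typical of set one.
cloud_frame <- data.frame(
  word = c("icbms", "bombers", "labor", "victims", "meeting"),
  chisq = c(-41.2, -18.7, 35.5, 22.1, 0.4),
  stringsAsFactors = FALSE
)

# Hypothetical helper: top n words for one set, dropping scores below
# min_score (playing the role of min.freq in the wordcloud calls).
top_words <- function(frame, set = c("one", "two"), n = 3, min_score = 2) {
  set <- match.arg(set)
  score <- if (set == "one") -frame$chisq else frame$chisq
  keep <- score >= min_score
  out <- data.frame(word = frame$word[keep], score = score[keep],
                    stringsAsFactors = FALSE)
  head(out[order(-out$score), ], n)
}

print(top_words(cloud_frame, "one"))
print(top_words(cloud_frame, "two"))
```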

Find out what words typify secret/classified cables vs unclassified from 1960-2000

Now, we can look at what words differentiated secret cables from unclassified cables from 1960 to 2000.

Here is the cloud that shows what words appear significantly in the secret cables, but not in the unclassified cables:

wordcloud(cloud_frame$word, -cloud_frame$chisq, scale = c(8, 0.3), 
    min.freq = 2, max.words = 100, random.order = T, rot.per = 0.15, colors = pal, 
    vfont = c("sans serif", "plain"))


And here is the cloud that shows what words appear significantly in the unclassified cables, but not in the secret cables:

wordcloud(cloud_frame$word, cloud_frame$chisq, scale = c(8, 0.3), 
    min.freq = 2, max.words = 100, random.order = T, rot.per = 0.15, colors = pal, 
    vfont = c("sans serif", "plain"))


Conclusion

It's very interesting to see how these patterns change over time. In particular, it is interesting to compare the classified topics from 1960-2000 with the unclassified ones. I really wanted to see how State Department writers differ from normal English writers, but I don't have the time to do it right now. It will have to wait for the next post.

Saturday, June 9, 2012

NBA Playoffs Update 5 (5-4)

This is the sixth post in my series on predicting the NBA playoffs with an algorithm. After the Boston loss in their last game, the algorithm is now 5-4 in the playoffs. Hopefully it is correct tonight!

Open Sourcing the Code

I have had a couple of requests to open source the code, which I had planned to do at the end of this series of posts. However, there is one stumbling block in that the data I am scraping cannot be redistributed (I think). If anyone has access to box score data for the 2010-2011 and 2011-2012 seasons that has a public license, please let me know. You can contact me via the email in my profile, or in the comments section.

Being able to get the data would simplify things a lot, but I would still have to clean up and comment the code a bit. Expect to see it out in a week or so.

Predictions

The algorithm likes Miami tonight:

Thursday, June 7, 2012

NBA Playoff Predictions Update 4 (5-3)

This is update 4 to my original post about predicting the NBA playoffs with R. With the Thunder beating the Spurs and the Heat losing to the Celtics, the algorithm went 1-1 on predictions, making it 5-3 so far.

Making some improvements

I have been posting for some time about incorporating more data into the models, and I finally got around to it. It is a truism in data science that more (high-quality) data almost always leads to a better model, and this case is no exception. The fact that the 2011-2012 season was shortened by the lockout also meant that relying only on data from this season really limited the potential of the algorithm.

I decided to start slowly and incorporate data from both the 2010-2011 season and the 2011-2012 season. Due to the aforementioned lockout, this actually increases the data available by 128% overall. I made no other tweaks to the algorithm in the meantime, so this is a good test of how much value additional data can add on its own.

The new accuracy value across both seasons is 65.6%, which means that the algorithm makes about 1.91 times as many correct predictions as incorrect ones. Here is the confusion matrix:
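
As a side check on the ratio quoted above: it is just the odds implied by the accuracy, correct predictions divided by incorrect ones. A quick sketch in R, using only the 65.6% figure from the text:

```r
# Accuracy reported in the post
accuracy <- 0.656

# Odds of a correct prediction: correct / incorrect
odds <- accuracy / (1 - accuracy)
print(round(odds, 2))
```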

Differences between seasons

I thought it would be interesting to look at p-values between different season statistics to see if there were any significant differences between the 2010-2011 season and the 2011-2012 season. A big deal is always made about how the lockout affected different statistics, but I haven't seen any analysis on it yet.

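We can easily do a t-test on each column of the data frame with all of the per-team statistics:
library(foreach)
# Welch t-test on each statistic column (column 7 onward); returns a list of p-values
p_vals <- foreach(i = 7:ncol(frame_2011)) %do% {
   t.test(frame_2011[, i], frame_2012[, i])$p.value
}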
We end up with a table of each calculated statistic and the p-value associated with it. A p-value is the probability of seeing a difference at least this large if the two seasons actually had the same mean, so a small p-value suggests a statistically significant difference. In this case, we might be looking to see whether there is a statistically significant difference between rebounding in the 2010-2011 season and the 2011-2012 season. The test gave back some interesting results. Here are some of them:

1. There was a very significant difference in the number of players who fouled out between the two seasons. In 2012, far fewer players fouled out than in 2011. Fewer personal fouls were assessed overall as well, which also reduced the number of free throws attempted.

2. Starters played significantly more unique positions in 2012. For example, if the starting lineup consists of a C, a PF, an SF, an SG/SF, and a PG, there are 5 unique positions. On the other hand, if it consists of a PF, a PF, an SF/SG, an SF/SG, and a PG, that is only 3 unique positions. I am not sure why this increased between seasons, but maybe it indicates the rise of more true centers? Maybe a different way of keeping track of positions?

3. Three-point percentage went down significantly from 2010-2011 to 2011-2012. Less practice time? It is not clear why this happened.

4. Rebounding went up overall, as did defensive rebounding.

5. Starters played fewer minutes in 2011-2012 than in 2010-2011.
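
The per-column t-test described above can be sketched on toy data. Everything in the block below is an assumption for illustration: the two data frames are tiny, deterministic stand-ins (not real NBA numbers), the column names are invented, and sapply is swapped in for foreach so the example needs only base R. The p.adjust step is an addition of mine, not part of the original analysis; with many columns tested at once, some small p-values will appear by chance, and a correction such as Holm's is one way to guard against that.

```r
# Toy per-team statistics for 30 teams (invented, deterministic values).
spread <- seq(-2, 2, length.out = 30)
frame_2011 <- data.frame(team = paste0("T", 1:30),
                         fouls = 21 + spread,
                         rebounds = 41 + spread)
frame_2012 <- data.frame(team = paste0("T", 1:30),
                         fouls = 19 + spread,      # shifted down by 2
                         rebounds = 41 + spread)   # unchanged

# Welch t-test on every statistic column, keeping the column names.
stat_cols <- setdiff(names(frame_2011), "team")
p_vals <- sapply(stat_cols, function(col) {
  t.test(frame_2011[[col]], frame_2012[[col]])$p.value
})

# Table of raw and Holm-adjusted p-values, smallest first.
p_table <- data.frame(statistic = stat_cols,
                      p_value = p_vals,
                      p_adjusted = p.adjust(p_vals, method = "holm"))
print(p_table[order(p_table$p_value), ])
```

In this toy setup the fouls column comes out clearly significant and the rebounds column does not, which matches how the data was constructed.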

Predictions for Tonight

And finally, the algorithm is predicting Boston to win tonight:

Tuesday, June 5, 2012

NBA Playoff Predictions Update 3 (4-2)

This is my third update to my original post on predicting the NBA playoffs with an algorithm. Here are updates 1 and 2.

The algorithm correctly predicted a Boston win, but missed on the Spurs/Thunder game, so it is currently 4-2. I haven't had any time to make updates yet, so unfortunately I can only give you predictions for the next games: a Miami win and an Oklahoma City win.