<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:georss='http://www.georss.org/georss' xmlns:gd='http://schemas.google.com/g/2005' xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-8712623323474199769</id><updated>2012-02-16T07:38:48.416-08:00</updated><category term='Kaggle'/><category term='regression'/><category term='randomForest'/><category term='loop'/><category term='doSnow'/><category term='Sweave'/><category term='Bailout'/><category term='foreach'/><category term='data analysis'/><category term='Ruby'/><category term='Linux'/><category term='Statistics'/><category term='Machine Learning'/><category term='spending'/><category term='parallel'/><category term='government'/><category term='diagnostics'/><category term='Windows'/><category term='Banking'/><category term='Algorithms'/><category term='LaTeX'/><category term='R'/><category term='Finance'/><title type='text'>R, Ruby, and Finance</title><subtitle type='html'></subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://viksalgorithms.blogspot.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8712623323474199769/posts/default?max-results=100'/><link rel='alternate' type='text/html' href='http://viksalgorithms.blogspot.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><author><name>Vik Paruchuri</name><uri>http://www.blogger.com/profile/14436550620869189803</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>16</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>100</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-8712623323474199769.post-7097400540850195138</id><published>2012-02-09T09:12:00.000-08:00</published><updated>2012-02-09T13:48:26.681-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='foreach'/><category scheme='http://www.blogger.com/atom/ns#' term='randomForest'/><category scheme='http://www.blogger.com/atom/ns#' term='doSnow'/><category scheme='http://www.blogger.com/atom/ns#' term='parallel'/><category scheme='http://www.blogger.com/atom/ns#' term='R'/><title type='text'>Monitoring Progress Inside a Foreach Loop</title><content type='html'>The foreach package for R is excellent, and allows for code to easily be run in parallel.  One problem with foreach is that it creates new RScript instances for each iteration of the loop, which prevents status messages from being logged to the console output.  This is particularly frustrating during long-running tasks, when we are often unsure how much longer we need to wait, or even if the code is doing what it is intended to.The solution to this can be found in the sink() function.  This function redirects output to a file.  I will show you a simple example of this using the iris data set.The code below will execute without printing any status messages, even though do.trace is enabled, which periodically displays the status of the randomForest. The random forest code is slightly adapted from one of the foreach package examples.&lt;pre&gt;&lt;br /&gt;library(foreach)&lt;br /&gt;library(doSNOW)&lt;br /&gt;library(randomForest)&lt;br /&gt;data(iris)&lt;br /&gt;&lt;br /&gt;cores&lt;-2&lt;br /&gt;&lt;br /&gt;num_trees&lt;-ceiling(2000/cores)&lt;br /&gt;c1&lt;-makeCluster(cores)&lt;br /&gt;registerDoSNOW(c1)&lt;br /&gt;&lt;br /&gt;rf_fit&lt;-foreach(ntree=rep(num_trees,cores),.combine=combine,&lt;br /&gt;.packages=c("randomForest")) %dopar% {&lt;br /&gt;randomForest(iris[,-5],y=iris[,5],ntree=ntree,do.trace=100) &lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;stopCluster(c1)  &lt;br /&gt;&lt;br /&gt;&lt;/pre&gt;We can easily correct this with the sink function:&lt;pre&gt;&lt;br /&gt;library(foreach)&lt;br /&gt;library(doSNOW)&lt;br /&gt;library(randomForest)&lt;br /&gt;data(iris)&lt;br /&gt;&lt;br /&gt;cores&lt;-2&lt;br /&gt;&lt;br /&gt;num_trees&lt;-ceiling(2000/cores)&lt;br /&gt;c1&lt;-makeCluster(cores)&lt;br /&gt;registerDoSNOW(c1)&lt;br /&gt;&lt;br /&gt;writeLines(c(""), "log.txt")&lt;br /&gt;&lt;br /&gt;rf_fit&lt;-foreach(ntree=rep(num_trees,cores),.combine=combine,&lt;br /&gt;.packages=c("randomForest")) %dopar% {&lt;br /&gt;sink("log.txt", append=TRUE)&lt;br /&gt;randomForest(iris[,-5],y=iris[,5],ntree=ntree,do.trace=100) &lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;stopCluster(c1)&lt;br /&gt;&lt;/pre&gt;The writeLines function will clear out the file "log.txt" before the loop is run, ensuring that only output that is relevant to the current run is displayed when the file is opened.  The log file can be opened at any time during the run, and the progress can be checked.  We can even print the number of the iteration as we go through:&lt;pre&gt;&lt;br /&gt;rf_fit&lt;-foreach(iteration=1:cores,ntree=rep(num_trees,cores),&lt;br /&gt;.combine=combine,.packages=c("randomForest")) %dopar% {&lt;br /&gt;sink("log.txt", append=TRUE)&lt;br /&gt;cat(paste("Starting iteration",iteration,"\n"))&lt;br /&gt;randomForest(iris[,-5],y=iris[,5],ntree=ntree,do.trace=100) &lt;br /&gt;}&lt;br /&gt;&lt;/pre&gt;This is useful when you have significantly more iterations than processor cores, and you want to know how far along you are.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8712623323474199769-7097400540850195138?l=viksalgorithms.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://viksalgorithms.blogspot.com/feeds/7097400540850195138/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://viksalgorithms.blogspot.com/2012/02/monitoring-progress-inside-foreach-loop.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8712623323474199769/posts/default/7097400540850195138'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8712623323474199769/posts/default/7097400540850195138'/><link rel='alternate' type='text/html' href='http://viksalgorithms.blogspot.com/2012/02/monitoring-progress-inside-foreach-loop.html' title='Monitoring Progress Inside a Foreach Loop'/><author><name>Vik Paruchuri</name><uri>http://www.blogger.com/profile/14436550620869189803</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8712623323474199769.post-5897587368307715451</id><published>2012-01-30T18:24:00.000-08:00</published><updated>2012-01-31T02:16:53.941-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Windows'/><category scheme='http://www.blogger.com/atom/ns#' term='LaTeX'/><category scheme='http://www.blogger.com/atom/ns#' term='Sweave'/><category scheme='http://www.blogger.com/atom/ns#' term='R'/><title type='text'>Using LaTeX, R, and Sweave to Create Reports in Windows</title><content type='html'>&lt;a href="http://www.latex-project.org/"&gt;LaTeX&lt;/a&gt; is a typesetting system that can easily be used to create reports and scientific articles, and has excellent formatting options for displaying code and mathematical formulas.  &lt;a href="http://www.statistik.lmu.de/%7Eleisch/Sweave/"&gt;Sweave&lt;/a&gt; is a package in base R that can execute R code embedded in LaTeX files and display the output.  This can be used to generate reports and quickly fix errors when needed.&lt;br /&gt;&lt;br /&gt;There are some barriers to entry with LaTeX that seem much steeper than they actually are.  In this article, I will show you how to setup LaTeX and an IDE for it in Windows, and how to start developing for it.  Note that you will at least R version 2.14 to follow this entire article, because the command line sweave pdf export option was added then.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Install MiKTeX and TeXnicCenter&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;LaTeX documents are edited in a text editor, then compiled by a compiler, and finally are displayed in a PDF or postscript viewer.  We will begin by installing a LaTeX compiler for Windows, &lt;a href="http://miktex.org/"&gt;MiKTeX&lt;/a&gt;.  Once you grab the installer, you can go ahead and install it.  Please see the bottom of the post for the full edit, but installing LaTeX in paths with spaces in them can cause issues.  As such, I would recommend installing to a directory without them.&lt;br /&gt;&lt;br /&gt;Our next step will be to install &lt;a href="http://www.texniccenter.org/"&gt;TeXnicCenter&lt;/a&gt;, an IDE for LaTeX.  After installing TeXnicCenter and starting it for the first time, you will be asked to find your LaTeX executable file in the configuration wizard. This can generally be found in the folder C:\Program Files\MiKTeX 2.9\miktex\bin by default, or a custom path if you did not use spaces in your path.  You can leave the fields for PostScript viewer blank, unless you need the functionality.  You should now be finished with the installation.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Setup TeXnicCenter to Work with Sweave&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Now, we can setup TeXnicCenter to use Sweave directly.  This will require at least R 2.14.  To do this, click on the Build menu and go to Define Output Profiles.  Hit the "Add" button in the bottom left to create a new output profile.  You can name the profile anything you like.  I named mine "Sweave".  &lt;br /&gt;&lt;br /&gt;Now we need to configure the new output profile.  The first tab should look similar to mine if you use 64-bit R, but your R path will be different if you use 32-bit or used a non-default installation directory.  The directory will need to match yours.  The command line arguments box should read: CMD Sweave --pdf %nm .&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/-jCXziuoRBIM/TydIiHrghJI/AAAAAAAAADQ/dusNBwWVyiY/s1600/sweave2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://2.bp.blogspot.com/-jCXziuoRBIM/TydIiHrghJI/AAAAAAAAADQ/dusNBwWVyiY/s1600/sweave2.png" /&gt;&lt;/a&gt;&lt;/div&gt;Now, we only need to worry about the viewer tab.  I use Foxit Reader, so my settings are below.  I suggest using the same settings as you find in your LaTeX=&amp;gt;PDF output profile.&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/-CczH5S2fDhE/TydMaScDCSI/AAAAAAAAADc/KiA8QIxil2k/s1600/sweave3.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://2.bp.blogspot.com/-CczH5S2fDhE/TydMaScDCSI/AAAAAAAAADc/KiA8QIxil2k/s1600/sweave3.png" /&gt;&lt;/a&gt;&lt;/div&gt;Once we have all that set, you can go ahead and write your first document.  You will need to name the document something.Rnw. To build the file, you will have to select "Sweave"(or whatever you named your profile" in the drop down box below the build item on the menu bar, and then select Build-&amp;gt;Current File-&amp;gt;Build and View to build your first file.&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/-I4bxBXMUbss/TydN7mQEvVI/AAAAAAAAADo/u8TUvXex2lQ/s1600/sweave4.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://4.bp.blogspot.com/-I4bxBXMUbss/TydN7mQEvVI/AAAAAAAAADo/u8TUvXex2lQ/s1600/sweave4.png" /&gt;&lt;/a&gt;&lt;/div&gt;The above shows a trivial R script.  &amp;lt;&amp;lt;&amp;gt;&amp;gt;= signifies the start of an R script, and @ signifies the end.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Further Reading&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;To learn more about Sweave, I suggest reading the &lt;a href="http://www.stat.uni-muenchen.de/%7Eleisch/Sweave/Sweave-manual.pdf"&gt;user manual&lt;/a&gt;. &lt;a href="http://stackoverflow.com/questions/8366193/writing-big-documents-with-sweave-is-it-possible-to-do-as-with-latex"&gt;Here&lt;/a&gt; is a good discussion of workflow with big documents and Sweave.  To learn more about LaTeX, I suggest the &lt;a href="http://en.wikibooks.org/wiki/LaTeX"&gt;LaTeX wikibook&lt;/a&gt;, and &lt;a href="http://tobi.oetiker.ch/lshort/lshort.pdf"&gt;The Not So Short Introduction to LaTeX&lt;/a&gt;.  &lt;a href="http://amath.colorado.edu/documentation/LaTeX/basics/example.html"&gt;This page&lt;/a&gt; shows how to make a basic document in LaTeX, and can be a good template to work off of.This &lt;a href="http://sachaem47.fortyseven.versio.nl/latexcourse/lec4/SweaveInstall.pdf"&gt;link&lt;/a&gt; gave me some of the concepts used above, and also has instructions for how to use Stangle with TeXnicCenter.&lt;br/&gt;&lt;br/&gt;&lt;b&gt;Edit&lt;/b&gt;A commenter has pointed out that using a directory path with spaces in it can create bugs in LaTeX.  You can find more information on this &lt;a href="http://tex.stackexchange.com/questions/4315/include-image-with-spaces-in-path-directory-to-be-processed-with-dvips"&gt;here&lt;/a&gt;, and &lt;a href="http://groups.google.com/group/latexusersgroup/browse_thread/thread/f78aab8bdaedcdf7?pli=1"&gt;here&lt;/a&gt;.  I have updated my post.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8712623323474199769-5897587368307715451?l=viksalgorithms.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://viksalgorithms.blogspot.com/feeds/5897587368307715451/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://viksalgorithms.blogspot.com/2012/01/using-latex-r-and-sweave-to-create.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8712623323474199769/posts/default/5897587368307715451'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8712623323474199769/posts/default/5897587368307715451'/><link rel='alternate' type='text/html' href='http://viksalgorithms.blogspot.com/2012/01/using-latex-r-and-sweave-to-create.html' title='Using LaTeX, R, and Sweave to Create Reports in Windows'/><author><name>Vik Paruchuri</name><uri>http://www.blogger.com/profile/14436550620869189803</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/-jCXziuoRBIM/TydIiHrghJI/AAAAAAAAADQ/dusNBwWVyiY/s72-c/sweave2.png' height='72' width='72'/><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8712623323474199769.post-3736701907267312270</id><published>2012-01-26T19:13:00.000-08:00</published><updated>2012-01-26T19:13:19.281-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='foreach'/><category scheme='http://www.blogger.com/atom/ns#' term='loop'/><category scheme='http://www.blogger.com/atom/ns#' term='parallel'/><category scheme='http://www.blogger.com/atom/ns#' term='R'/><title type='text'>Parallel R Model Prediction Building and Analytics</title><content type='html'>Modifying R code to run in parallel can lead to huge performance gains.  Although a significant amount of code can easily be run in parallel, there are some learning techniques, such as the Support Vector Machine, that cannot be easily parallelized.  However, there is an often overlooked way to speed up these and other models.  It involves executing the code that generates predictions and other analytics in parallel, instead of executing the model building phase in parallel, which is sometimes impossible.  I will show you how this can be done in this post.&lt;br/&gt;&lt;br/&gt;First, we will set up our variables.  The setup is fairly similar to the one I have used in other posts, but note that the length of the vectors has been increased by a magnitude of 100 to more easily show how much time can be saved by parallelizing the prediction building phase.&lt;pre&gt;&lt;br /&gt;set.seed(10)&lt;br /&gt;y&lt;-c(1:100000)&lt;br /&gt;x1&lt;-c(1:100000)*runif(100000,min=0,max=2)&lt;br /&gt;x2&lt;-c(1:100000)*runif(100000,min=0,max=2)&lt;br /&gt;x3&lt;-c(1:100000)*runif(100000,min=0,max=2)&lt;br /&gt;&lt;br /&gt;all_data&lt;-data.frame(y,x1,x2,x3)&lt;br /&gt;positions &lt;- sample(nrow(all_data),size=floor((nrow(all_data)/4)*3))&lt;br /&gt;training&lt;- all_data[positions,]&lt;br /&gt;testing&lt;- all_data[-positions,]&lt;br /&gt;&lt;/pre&gt;We now have a testing set of 25,000 rows, and a training set of 75,000 rows, which is somewhat linear.  We will train an SVM on this data.  Note that it may take more than 10 minutes to train an SVM on this data, particularly if you have an older computer.  If you do have an older computer, feel free to reduce the number of rows in the data frame as needed.&lt;pre&gt;&lt;br /&gt;library(e1071)&lt;br /&gt;svm_fit&lt;-svm(y~x1+x2+x3,data=training)&lt;br /&gt;svm_predictions&lt;-predict(svm_fit,newdata=testing)&lt;br /&gt;error&lt;-sqrt((sum((testing$y-svm_predictions)^2))/nrow(testing))&lt;br /&gt;error&lt;br /&gt;&lt;/pre&gt;After we have built the model, we generate predictions for it, which yields an error of 13045.9 for me(although this may be different for your data).  Our next step is timing the prediction phase to see how long it takes.&lt;pre&gt;&lt;br /&gt;system.time(predict(svm_fit,newdata=testing))&lt;br /&gt;&lt;/pre&gt;On my system, this took 35.1 seconds.&lt;br/&gt;&lt;br/&gt;We are now ready to set up our parallel infrastructure.  I run Windows, and will use foreach and doSNOW here, although you can certainly use other parallel packages here if you prefer.  You can read &lt;a href="http://viksalgorithms.blogspot.com/2012/01/parallel-r-loops-for-windows-and-linux.html"&gt;this post&lt;/a&gt; if you want an introduction to foreach, doSNOW, and doMC.  If you do not elect to use doSNOW, you will not need to use the stopCluster() function that appears in some of the code below.&lt;pre&gt;&lt;br /&gt;library(foreach)&lt;br /&gt;library(doSNOW)&lt;br /&gt;cl&lt;-makeCluster(4) #change the 4 to your number of CPU cores&lt;br /&gt;registerDoSNOW(cl)  &lt;br /&gt;&lt;/pre&gt;Now, we have the groundwork for our parallel foreach loop, but we need to find a way to split the data up in order to perform predictions on small sets of data in parallel.  &lt;pre&gt;&lt;br /&gt;num_splits&lt;-4&lt;br /&gt;split_testing&lt;-sort(rank(1:nrow(testing))%%4)&lt;br /&gt;&lt;/pre&gt;This will create a numeric vector that can be used to split the testing data frame into 4 parts.  I suggest setting num_splits to some multiple of your number of CPU cores in order to execute the below foreach loop as quickly as possible.  Now that we have a way to split the data up, we can go ahead and create a loop that will generate predictions in parallel.&lt;pre&gt;&lt;br /&gt;svm_predictions&lt;-foreach(i=unique(split_testing),&lt;br /&gt;.combine=c,.packages=c("e1071")) %dopar% {&lt;br /&gt;as.numeric(predict(svm_fit,newdata=testing[split_testing==i,]))&lt;br /&gt;}&lt;br /&gt;stopCluster(c1)&lt;br /&gt;&lt;/pre&gt;It is very important that the .packages argument be used to load the package that corresponds to the prediction function you are going to use in the loop, or R will get confused about which prediction function to use and generate an error.  The .combine argument tells the foreach loop to combine the outputs of the foreach loop into a vector.  A hidden argument that defaults to true ensures that all the outputs remain in order.&lt;br/&gt;&lt;br/&gt;Now, we test to make sure that everything is okay by checking what the error value is:&lt;pre&gt;&lt;br /&gt;error&lt;-sqrt((sum((testing$y-svm_predictions)^2))/nrow(testing))&lt;br /&gt;error&lt;br /&gt;&lt;/pre&gt;I got 13045.9, which matches the value I got before, and confirms that both the parallel and non-parallel prediction routines return the exact same results.&lt;br/&gt;&lt;br/&gt;Now, we can create a function and time it to see how fast the parallel technique is:&lt;pre&gt;&lt;br /&gt;parallel_predictions&lt;-function(fit,testing)&lt;br /&gt;{&lt;br /&gt;cl&lt;-makeCluster(4)&lt;br /&gt;registerDoSNOW(cl)&lt;br /&gt;num_splits&lt;-4&lt;br /&gt;split_testing&lt;-sort(rank(1:nrow(testing))%%4)&lt;br /&gt;predictions&lt;-foreach(i=unique(split_testing),&lt;br /&gt;.combine=c,.packages=c("e1071")) %dopar% {&lt;br /&gt;as.numeric(predict(fit,newdata=testing[split_testing==i,]))&lt;br /&gt;}&lt;br /&gt;stopCluster(cl)&lt;br /&gt;predictions&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;system.time(parallel_predictions(svm_fit,testing))&lt;br /&gt;&lt;/pre&gt;This takes 12.76 seconds on my system, which is significantly faster than the non-parallel implementation.&lt;br/&gt;&lt;br/&gt;This technique can be extended to other analytics functions that can be run after the model is built, and it can generate predictions for any model, not just for the svm that the example uses.  While creating the model can take up much more time than generating predictions, it is not always feasible to parallelize model creation.  Running the prediction phase in parallel, particularly on high dimensional data, can save significant time.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8712623323474199769-3736701907267312270?l=viksalgorithms.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://viksalgorithms.blogspot.com/feeds/3736701907267312270/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://viksalgorithms.blogspot.com/2012/01/parallel-r-model-prediction-building.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8712623323474199769/posts/default/3736701907267312270'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8712623323474199769/posts/default/3736701907267312270'/><link rel='alternate' type='text/html' href='http://viksalgorithms.blogspot.com/2012/01/parallel-r-model-prediction-building.html' title='Parallel R Model Prediction Building and Analytics'/><author><name>Vik Paruchuri</name><uri>http://www.blogger.com/profile/14436550620869189803</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8712623323474199769.post-7517490718162120166</id><published>2012-01-23T18:47:00.000-08:00</published><updated>2012-01-23T20:49:34.138-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='data analysis'/><category scheme='http://www.blogger.com/atom/ns#' term='government'/><category scheme='http://www.blogger.com/atom/ns#' term='spending'/><category scheme='http://www.blogger.com/atom/ns#' term='R'/><title type='text'>Analyzing US Government Contract Awards in R</title><content type='html'>As I was exploring open data sources, I came across &lt;a href="http://usaspending.gov/"&gt;USA spending&lt;/a&gt;.  This site contains information on US government contract awards and other disbursements, such as grants and loans.  In this post, we will look at data on contracts awarded in the state of Maryland in the fiscal year 2011, which is available by selecting "Maryland" as the state where the contract was received and awarded &lt;a href="http://usaspending.gov/data?carryfilters=on"&gt;here&lt;/a&gt;.  I will use Maryland as a proxy for the nation, as the data set for the whole nation will be a bit more unwieldy to analyze, and the USA spending site appears to need a significant amount of time to generate the data file for it.  We may take a look at the data for the whole nation later on. &lt;br /&gt;&lt;br /&gt;First, we download the data file(leave all the options on the usa spending site the same, except select Maryland as the state where the contracts were received and performed when you download the file if you want to follow along), and read it into R:&lt;br /&gt;&lt;pre&gt;spend&amp;lt;-read.csv(file="marylandspendingbasic.csv",&lt;br /&gt;head=TRUE,sep=",",stringsAsFactors=FALSE,strip.white=TRUE)&lt;br /&gt;&lt;/pre&gt;By using the names() function, we can see that the data file has a lot of interesting columns.  One column is called extentcompeted, and has values indicating how much competition was involved in the bidding process for the contracts.  Understandably, having an open bidding process is in the public interest as it can reduce taxpayer costs.  We will start by looking at competition in the bidding process for contracts:&lt;br /&gt;&lt;pre&gt;library(ggplot2)&lt;br /&gt;competition&amp;lt;-tapply(spend$ExtentCompeted,spend$ExtentCompeted,&lt;br /&gt;function(x) length(x)/length(spend$ExtentCompeted))&lt;br /&gt;&lt;br /&gt;qplot(names(competition),weight=competition*100,xlab="Competition Type",&lt;br /&gt;ylab="Percent of Contracts",fill=names(competition)) &lt;br /&gt;+ opts(axis.text.x = theme_blank()) &lt;br /&gt;+ scale_fill_discrete("Competition Type")&lt;br /&gt;&lt;/pre&gt;This lets us see what percentage of the data falls into each competition category, and then graphs it using the excellent ggplot2 package.&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/-juBIDA_U5jo/Tx4UTUZZmVI/AAAAAAAAACs/JYhrg-HsobA/s1600/competition.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://3.bp.blogspot.com/-juBIDA_U5jo/Tx4UTUZZmVI/AAAAAAAAACs/JYhrg-HsobA/s1600/competition.png" /&gt;&lt;/a&gt;&lt;/div&gt;This shows us that about 58% of contracts went through a full competitive bidding process, whereas only approximately 15% of the contracts did not undergo any kind of competition process.&lt;br /&gt;&lt;br /&gt;Now, lets see how many dollars of spending fall into each competition category:&lt;br /&gt;&lt;pre&gt;comp_dollars&amp;lt;-tapply(spend$DollarsObligated,spend$ExtentCompeted,&lt;br /&gt;function(x) sum(x)/sum(spend$DollarsObligated,na.rm=TRUE))&lt;br /&gt;qplot(names(comp_dollars),weight=comp_dollars*100,xlab="Competition Type",&lt;br /&gt;ylab="Percentage of Dollars Obligated",fill=names(comp_dollars)) &lt;br /&gt;+ opts(axis.text.x = theme_blank()) &lt;br /&gt;+ scale_fill_discrete("Competition Type")&lt;br /&gt;&lt;/pre&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/-qS7ggV--ErM/Tx4OGl6XebI/AAAAAAAAACY/Pd4_DX-GWRM/s1600/comp_dollars.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://2.bp.blogspot.com/-qS7ggV--ErM/Tx4OGl6XebI/AAAAAAAAACY/Pd4_DX-GWRM/s1600/comp_dollars.png" /&gt;&lt;/a&gt;&lt;/div&gt;This plot tells a very different story, showing that about 31% of all dollars that were spent by the government went to contracts that did not involve competition. Further, only 44.5% of all dollars that were obligated went to contracts that were bid on under a fair and open competitive bidding process.&lt;br /&gt;&lt;br /&gt;This indicates that large contracts tend to receive less bidding than small contracts.  This is strange, as large contracts are the ones that are more likely to have reduced costs as a result of competitive bidding, simply because saving 5% of a larger sum is preferable to saving 5% of a smaller sum.  Let's take a look at where these large no-bid contracts are going.&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;companies&amp;lt;-sort(tapply(spend$DollarsObligated,spend$RecipientName,sum))&lt;br /&gt;top_companies&amp;lt;-tail(companies/sum(spend$DollarsObligated),10)*100&lt;br /&gt;sum(spend$DollarsObligated/1e9)&lt;br /&gt;qplot(names(top_companies),weight=top_companies,xlab="Company",&lt;br /&gt;ylab="Percentage of Total Dollars Obligated",fill=names(top_companies)) &lt;br /&gt;+ opts(axis.text.x = theme_blank()) &lt;br /&gt;+ scale_fill_discrete("Company Name") &lt;br /&gt;+ opts(legend.text = theme_text(size = 8))&lt;br /&gt;&lt;/pre&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/-LOlrhnotsDA/Tx4VdLZhxOI/AAAAAAAAAC4/W8eG1lTJBLs/s1600/companies.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://2.bp.blogspot.com/-LOlrhnotsDA/Tx4VdLZhxOI/AAAAAAAAAC4/W8eG1lTJBLs/s1600/companies.png" /&gt;&lt;/a&gt;&lt;/div&gt;First, we note that 15.7 billion dollars in federal contracts were obligated to vendors in the state of Maryland in the fiscal year 2011.  Next, we note that about 11.5% of the total federal money allocated in contracts went to Lockheed Martin!  In fact, the top 10 companies in terms of dollar value of contracts received were given 41% of all the contracted dollars, which amounts to about 6.5 billion dollars.  The top 100 contractors received 72% of all contracted dollars.  Now, it is becoming clearer where the large no-bid contracts are going.  We will look more in depth at the contracts that did not involve competitive bidding to zero in on the issue:&lt;pre&gt;&lt;br /&gt;nc_spend&lt;-spend[spend$ExtentCompeted=="Not Competed under SAP" &lt;br /&gt;| spend$ExtentCompeted=="Not Competed" &lt;br /&gt;| spend$ExtentCompeted=="Not Available for Competition" &lt;br /&gt;| spend$ExtentCompeted=="Full and Open Competition after exclusion of sources",]&lt;br /&gt;nc_companies&lt;-sort(tapply(nc_spend$DollarsObligated,nc_spend$RecipientName,sum))&lt;br /&gt;nc_top_companies&lt;-tail(nc_companies/sum(nc_spend$DollarsObligated),10)*100&lt;br /&gt;sum(nc_spend$DollarsObligated)/1e9&lt;br /&gt;sum(tail(nc_companies,10))/sum(nc_spend$DollarsObligated)&lt;br /&gt;&lt;/pre&gt;This tells us that out of the 7.6 billion dollars in government contracts that were disbursed with limited or no competition, 55% of these went to the top 10 contractors who received contracts with limited or no competition.  This shows that large contractors tend to receive a much greater share of no compete contracts than other contractors, as the top 10 contractors only received 41% of all federal dollars, yet received 55% of contracts with limited competition.&lt;pre&gt;&lt;br /&gt;mean(nc_spend$DollarsObligated)&lt;br /&gt;mean(spend[spend$ExtentCompeted=="Full and Open Competition",]$DollarsObligated)&lt;br /&gt;&lt;/pre&gt;This shows that the average contract size with no or limited competition was $253507.1, whereas the average contract size with full and open competition was $113936.3, meaning that, on average, contracts twice as large are given out with no competition.&lt;br/&gt;&lt;br/&gt; I will leave this analysis here for now, but please feel free to continue on if you wish.  This is a very interesting data set, and I may come back to it if I have time down the line.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8712623323474199769-7517490718162120166?l=viksalgorithms.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://viksalgorithms.blogspot.com/feeds/7517490718162120166/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://viksalgorithms.blogspot.com/2012/01/analyzing-us-government-contract-awards.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8712623323474199769/posts/default/7517490718162120166'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8712623323474199769/posts/default/7517490718162120166'/><link rel='alternate' type='text/html' href='http://viksalgorithms.blogspot.com/2012/01/analyzing-us-government-contract-awards.html' title='Analyzing US Government Contract Awards in R'/><author><name>Vik Paruchuri</name><uri>http://www.blogger.com/profile/14436550620869189803</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/-juBIDA_U5jo/Tx4UTUZZmVI/AAAAAAAAACs/JYhrg-HsobA/s72-c/competition.png' height='72' width='72'/><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8712623323474199769.post-5452396106797238718</id><published>2012-01-20T12:02:00.000-08:00</published><updated>2012-01-20T12:02:19.249-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='diagnostics'/><category scheme='http://www.blogger.com/atom/ns#' term='regression'/><category scheme='http://www.blogger.com/atom/ns#' term='R'/><title type='text'>R Regression Diagnostics Part 1</title><content type='html'>Linear regression can be a fast and powerful tool to model complex phenomena.  However, it makes several assumptions about your data, and quickly breaks down when these assumptions, such as the assumption that a linear relationship exists between the predictors and the dependent variable, break down.  In this post, I will introduce some diagnostics that you can perform to ensure that your regression does not violate these basic assumptions.  To begin with, I highly suggest reading &lt;a href="http://www.duke.edu/~rnau/testing.htm"&gt;this article&lt;/a&gt; on the major assumptions that linear regression is predicated on.&lt;br/&gt;&lt;br/&gt;We will make use of the same variables as we did in the &lt;a href="http://viksalgorithms.blogspot.com/2012/01/intro-to-ensemble-learning-in-r.html"&gt;intro to ensemble learning&lt;/a&gt; post:&lt;br/&gt;&lt;br/&gt;&lt;pre&gt;&lt;br /&gt;set.seed(10)&lt;br /&gt;y&lt;-c(1:1000)&lt;br /&gt;x1&lt;-c(1:1000)*runif(1000,min=0,max=2)&lt;br /&gt;x2&lt;-(c(1:1000)*runif(1000,min=0,max=2))^2&lt;br /&gt;x3&lt;-log(c(1:1000)*runif(1000,min=0,max=2))&lt;br /&gt;&lt;/pre&gt;Note that x2 and x3 are significantly nonlinear by design(due to the squared and log modifications), and will cause linear regression to make spurious estimates.&lt;br/&gt;&lt;br/&gt;Component residual plots, an extension of partial residual plots, are a good way to see if the predictors have a linear relationship to the dependent variable.  A partial residual plot essentially attempts to model the residuals of one predictor against the dependent variable.  A component residual plot adds a line indicating where the line of best fit lies.  A significant difference between the residual line and the component line indicates that the predictor does not have a linear relationship with the dependent variable.  A good way to generate these plots in R is the car package.&lt;br/&gt;&lt;br/&gt;&lt;pre&gt;&lt;br /&gt;library(car)&lt;br /&gt;lm_fit&lt;-lm(y~x1+x2+x3)&lt;br /&gt;crPlots(lm_fit)&lt;br /&gt;&lt;/pre&gt;Your plot should look like this:&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/-CclCuFhXsuA/TxnDKGAXCxI/AAAAAAAAABw/Glk3q3Mc9fM/s1600/crplots.png" imageanchor="1" style="margin-left:1em; margin-right:1em"&gt;&lt;img border="0" height="400" width="400" src="http://3.bp.blogspot.com/-CclCuFhXsuA/TxnDKGAXCxI/AAAAAAAAABw/Glk3q3Mc9fM/s400/crplots.png" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br/&gt;Looking at the plot, we see that 2 of the predictors(x2 and x3) are significantly non-normal, based on the differences between the component and the residual lines.  In order to "correct" these differences, we can attempt to alter the predictors.  Typical alterations are sqrt(x), 1/x, log(x), and n^x.  In this situation, we already know how the data are nonlinear, so we will be able to easily apply the appropriate transformation.  In a "real-world" situation, it may take trial and error to come up with an appropriate transformation to make the predictor appear more linear.  If none of the transformations work, you may have to consider not using the predictor, or switching to a nonlinear model.&lt;br/&gt;&lt;br/&gt;&lt;pre&gt;&lt;br /&gt;library(car)&lt;br /&gt;lm_fit&lt;-lm(y~x1+sqrt(x2)+exp(x3))&lt;br /&gt;crPlots(lm_fit)&lt;br /&gt;&lt;/pre&gt;The above code modifies the linear model by using the square root of x2 as a predictor, and e^x3 as a predictor, which cancel out the squared transformation on x2 and the natural log transformation on x3, respectively.  Your plot should look like this:&lt;br/&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/-uvVRMzNsE_s/TxnGQYv8jJI/AAAAAAAAAB8/S1ThQ0oft5o/s1600/crplots2.png" imageanchor="1" style="margin-left:1em; margin-right:1em"&gt;&lt;img border="0" height="400" width="400" src="http://1.bp.blogspot.com/-uvVRMzNsE_s/TxnGQYv8jJI/AAAAAAAAAB8/S1ThQ0oft5o/s400/crplots2.png" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br/&gt;As you can see, this plot shows a much more linear relationship between x2 and y and x3 and y.  However, the component and residual lines for the predictors x1, x2, and x3 do not show a perfect overlap.  We will look more into this in a later post.&lt;br/&gt;&lt;br/&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8712623323474199769-5452396106797238718?l=viksalgorithms.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://viksalgorithms.blogspot.com/feeds/5452396106797238718/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://viksalgorithms.blogspot.com/2012/01/r-regression-diagnostics-part-1.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8712623323474199769/posts/default/5452396106797238718'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8712623323474199769/posts/default/5452396106797238718'/><link rel='alternate' type='text/html' href='http://viksalgorithms.blogspot.com/2012/01/r-regression-diagnostics-part-1.html' title='R Regression Diagnostics Part 1'/><author><name>Vik Paruchuri</name><uri>http://www.blogger.com/profile/14436550620869189803</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/-CclCuFhXsuA/TxnDKGAXCxI/AAAAAAAAABw/Glk3q3Mc9fM/s72-c/crplots.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8712623323474199769.post-8388541914380065480</id><published>2012-01-19T12:30:00.000-08:00</published><updated>2012-01-19T17:59:29.384-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Finance'/><category scheme='http://www.blogger.com/atom/ns#' term='Bailout'/><category scheme='http://www.blogger.com/atom/ns#' term='Banking'/><category scheme='http://www.blogger.com/atom/ns#' term='R'/><title type='text'>Analyzing Federal Government Bailout Recipients in R</title><content type='html'>I was searching for open data recently, and stumbled on &lt;a href="http://opendata.socrata.com/"&gt;Socrata&lt;/a&gt;. Socrata has a lot of interesting data sets, and while I was browsing around, I found a data set on federal bailout recipients.  &lt;a href="http://opendata.socrata.com/Government/Bailout-Recipients/gbdy-vjgr"&gt;Here&lt;/a&gt; is the data set.  However, data sets on Socrata are not always the most recent versions, so I followed a link to the data source at &lt;a href="http://projects.propublica.org/bailout/list/index"&gt;Propublica&lt;/a&gt;, where I was able to find a data set that was last updated on January 17, 2012.  I downloaded the data in csv format.  In the rest of this post, I will perform basic analysis on this data, and show that R can be used to do the same analysis as Excel in a much simpler and more powerful way.&lt;br /&gt;&lt;br /&gt;The data contains several columns, the most salient of which are the name of the company, its location, its industry, the amount of bailout funds received, and the amount repaid.&lt;br /&gt;&lt;br /&gt;First, we read in the data:&lt;br /&gt;&lt;pre&gt;bailout&amp;lt;-read.csv(file="bailout.csv",head=TRUE,sep=",",&lt;br /&gt;stringsAsFactors=FALSE,strip.white=TRUE)&lt;br /&gt;&lt;/pre&gt;Next, we generate a variable called owed that calculates how much the company owes the government, or how much the government has made in profit from the company, and create a new data frame that is sorted based on how much is owed the government:&lt;br /&gt;&lt;pre&gt;bailout&amp;lt;-transform(bailout,owed=bailout$total_disbursed&lt;br /&gt;-bailout$total_revenue-bailout$total_payback)&lt;br /&gt;ordered_bailout&amp;lt;-bailout[with(bailout,order(-owed)),]&lt;br /&gt;&lt;/pre&gt;Now, we can use ggplot2 to create a nice looking bar chart, and use dev.copy() to export it(you do not need to run the last two lines, as they are solely for exporting the chart):&lt;br /&gt;&lt;pre&gt;library(ggplot2)&lt;br /&gt;qplot(name,weight=owed/1e9, data = ordered_bailout[1:4,], &lt;br /&gt;geom = "bar", xlab = "Institution Name",ylab="Amount Owed in Billions")&lt;br /&gt;dev.copy(png,"most_owed.png")&lt;br /&gt;dev.off()&lt;br /&gt;&lt;/pre&gt;This results in this chart:&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/-1QVSKEGphIY/Txhyu30riRI/AAAAAAAAABA/pvoOwBkzqI0/s1600/most_owed.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="400" src="http://1.bp.blogspot.com/-1QVSKEGphIY/Txhyu30riRI/AAAAAAAAABA/pvoOwBkzqI0/s400/most_owed.png" width="400" /&gt;&lt;/a&gt;&lt;/div&gt;As you can see, Fannie Mae, Freddie Mac, GM, and AIG all still owe huge amounts of money to the government.  Now, we look at the top companies that the government made a profit from:&lt;br /&gt;&lt;pre&gt;qplot(name,weight=abs(owed/1e9), data = tail(ordered_bailout,4), &lt;br /&gt;geom = "bar", xlab = "Institution Name",&lt;br /&gt;ylab="Government Profit in Billions")&lt;br /&gt;&lt;/pre&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/-hxMrlpXHUjk/TxhzakvX0mI/AAAAAAAAABM/A1-VNtMG60o/s1600/most_repaid.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="400" src="http://4.bp.blogspot.com/-hxMrlpXHUjk/TxhzakvX0mI/AAAAAAAAABM/A1-VNtMG60o/s400/most_repaid.png" width="400" /&gt;&lt;/a&gt;&lt;/div&gt;As we can see, banks have given the government a significant profit. &lt;br /&gt;&lt;br /&gt;When we look at the sum of the money owed to the government, we see that &lt;b&gt;242.81 billion&lt;/b&gt; dollars have yet to be repaid:&lt;br /&gt;&lt;pre&gt;sum(ordered_bailout$owed)/1e9&lt;br /&gt;&lt;/pre&gt;Using the tapply function reveals the following(each heading is a sector name, and the number below is how much they owe the government in billions):&lt;br /&gt;&lt;pre&gt;tapply(bailout$owed/1e9,bailout$category,sum)&lt;/pre&gt;&lt;pre&gt;&amp;nbsp;&lt;/pre&gt;&lt;pre&gt;                      Bank              FHA Refinance Fund &lt;br /&gt;                   -11.36362217                      0.05000000 &lt;br /&gt;         SBA Security Purchases              State Housing Orgs &lt;br /&gt;                     0.06002858                      0.65537461 &lt;br /&gt;              Mortgage Servicer      Financial Services Company &lt;br /&gt;                     1.93507502                     10.47640782 &lt;br /&gt;                Investment Fund                    Auto Company &lt;br /&gt;                    13.32980009                     28.53374026 &lt;br /&gt;              Insurance Company  Government-Sponsored Enterprise &lt;br /&gt;                    48.36100458                    150.77000000 &lt;br /&gt;&lt;/pre&gt;As you can see, the government sponsored enterprises(Fannie Mae and Freddie Mac), and the insurance companies(primarily AIG), owe the most while the banks have made a profit for the government.  I will not inject any analysis into this, as we are only looking at the numbers here.&lt;br/&gt;&lt;br/&gt;Now, we will look at how much is owed by the top 10 states:&lt;pre&gt;&lt;br /&gt;states&lt;-sort(tapply(bailout$owed/1e9,bailout$state,sum))&lt;br /&gt;qplot(tail(names(states),10),weight=tail(states,10))&lt;br /&gt;&lt;/pre&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/-_4tMgivu_48/Txh3zd4e2YI/AAAAAAAAABk/IA9wjCIP5I8/s1600/owed_by_state.png" imageanchor="1" style="margin-left:1em; margin-right:1em"&gt;&lt;img border="0" height="400" width="400" src="http://2.bp.blogspot.com/-_4tMgivu_48/Txh3zd4e2YI/AAAAAAAAABk/IA9wjCIP5I8/s400/owed_by_state.png" /&gt;&lt;/a&gt;&lt;/div&gt;Not surprisingly, considering that Fannie Mae and Freddie Mac are based in DC, DC owes the most, followed by Virginia.  Puerto Rico is a very interesting addition to the top 10 states/territories that owe the government, and further inspection reveals this:&lt;pre&gt;&lt;br /&gt;bailout[bailout$state=="PR",]$name&lt;br /&gt;[1] "Popular, Inc."                "First BanCorp"               &lt;br /&gt;[3] "RG Mortgage Corporation"      "Scotiabank de Puerto Rico"   &lt;br /&gt;[5] "Banco Popular de Puerto Rico"&lt;br /&gt;&lt;/pre&gt;Apparently the banking sector in Puerto Rico did not do well in the financial crisis!&lt;br/&gt;&lt;br/&gt;  I noticed an interesting column in the data called "is_stress_tested", which took on a false value, a true value, or a NULL value.  I am not an expert on banking, and if someone can please shed more light on the stress test, I would appreciate it, but I believe that stress testing is a method that discovers how prone to failure the institution is.&lt;br/&gt;&lt;br/&gt;Now, when we see how much money the institutions that have had stress testing performed versus those that have not owe, we get the following:&lt;pre&gt;&lt;br /&gt;tapply(bailout$owed/1e9,bailout$is_stress_tested,sum)&lt;br /&gt;              false      true &lt;br /&gt; 13.31646 242.71883 -13.22748 &lt;br /&gt;&lt;/pre&gt;This tells us that institutions that have not been stress tested owe 242 billion to the government, whereas those that have been stress tested have made a profit of 13.2 billion for the government.  I am not sure if stress testing can be performed on an institution like Fannie Mae, but it seems odd that the testing has not been done on those institutions that collectively owe 242 billion dollars.&lt;br/&gt;&lt;br/&gt;Finally, looking at the percentage of disbursed funds that are still outstanding by sector reveals the following(each heading is a sector name, and the number below is the percentage of the bailout funds that they still owe):&lt;pre&gt;&lt;br /&gt;sort(tapply(bailout$owed,bailout$category,sum)/&lt;br /&gt;tapply(bailout$total_disbursed,bailout$category,sum))&lt;br /&gt;                           Bank          SBA Security Purchases &lt;br /&gt;                    -0.04811175                      0.16305670 &lt;br /&gt;                   Auto Company      Financial Services Company &lt;br /&gt;                     0.46090662                      0.46762480 &lt;br /&gt;              Insurance Company Government-Sponsored Enterprise &lt;br /&gt;                     0.66995920                      0.82478118 &lt;br /&gt;                Investment Fund              FHA Refinance Fund &lt;br /&gt;                     0.84152859                      1.00000000 &lt;br /&gt;              Mortgage Servicer              State Housing Orgs &lt;br /&gt;                     1.00000000                      1.00000000 &lt;br /&gt;&lt;/pre&gt;In general, it appears that Investment Funds, Government-Sponsored Enterprises, and Insurance Companies have been poor at repaying the money they were lent.  The high values for Government Sponsored Enterprises and Insurance Companies are explained by the high outstanding amounts from Fannie Mae, Freddie Mac, and AIG, but the trend in Investment Funds in more interesting.  Despite relatively small bailouts(and I emphasize the relatively, because 15.8 billion dollars were disbursed to Investment Funds), it appears that money has been extremely slow to come back to the Government, with 84% of the funds still outstanding.&lt;br/&gt;&lt;br/&gt;With that, I conclude this post.  I hope that this has shown you how simple data analysis can be done in R in an quick and efficient manner, and I hope that I have also been able to do interesting analyses and draw interesting facts from this data.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8712623323474199769-8388541914380065480?l=viksalgorithms.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://viksalgorithms.blogspot.com/feeds/8388541914380065480/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://viksalgorithms.blogspot.com/2012/01/analyzing-federal-bailout-recipients-in.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8712623323474199769/posts/default/8388541914380065480'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8712623323474199769/posts/default/8388541914380065480'/><link rel='alternate' type='text/html' href='http://viksalgorithms.blogspot.com/2012/01/analyzing-federal-bailout-recipients-in.html' title='Analyzing Federal Government Bailout Recipients in R'/><author><name>Vik Paruchuri</name><uri>http://www.blogger.com/profile/14436550620869189803</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/-1QVSKEGphIY/Txhyu30riRI/AAAAAAAAABA/pvoOwBkzqI0/s72-c/most_owed.png' height='72' width='72'/><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8712623323474199769.post-4026568941847141175</id><published>2012-01-19T09:31:00.000-08:00</published><updated>2012-01-19T09:35:07.132-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Machine Learning'/><category scheme='http://www.blogger.com/atom/ns#' term='R'/><title type='text'>An Intro to Ensemble Learning in R</title><content type='html'>&lt;b&gt;Introduction&lt;/b&gt;&lt;br/&gt;&lt;br/&gt;This post incorporates parts of &lt;a href="http://viksalgorithms.blogspot.com/2012/01/build-your-own-bagging-function-in-r.html"&gt;yesterday's post&lt;/a&gt; about bagging.  If you are unfamiliar with bagging, I suggest that you read it before continuing with this article.&lt;br/&gt;&lt;br/&gt;I would like to give a basic overview of ensemble learning.  Ensemble learning involves combining multiple predictions derived by different techniques in order to create a stronger overall prediction.  For example, the predictions of a random forest, a support vector machine, and a simple linear model may be combined to create a stronger final prediction set.  The key to creating a powerful ensemble is model diversity.  An ensemble with two techniques that are very similar in nature will perform more poorly than a more diverse model set.&lt;br/&gt;&lt;br/&gt;  Some ensemble learning techniques, such as Bayesian model combination and stacking, attempt to weight the models prior to combining them.  I will save the discussion of how to use these techniques for another day, and will instead focus on simple model combination.&lt;br/&gt;&lt;br/&gt;&lt;b&gt;Initial Setup&lt;/b&gt;&lt;br/&gt;&lt;br/&gt;We will begin with similar variables to those that I used yesterday in the introduction to bagging.  However, I have altered x2 and x3 to introduce distinct nonlinear tendencies, in order to evaluate the performance of nonlinear learning techniques.&lt;br/&gt;&lt;pre&gt;&lt;br /&gt;set.seed(10)&lt;br /&gt;y&lt;-c(1:1000)&lt;br /&gt;x1&lt;-c(1:1000)*runif(1000,min=0,max=2)&lt;br /&gt;x2&lt;-(c(1:1000)*runif(1000,min=0,max=2))^2&lt;br /&gt;x3&lt;-log(c(1:1000)*runif(1000,min=0,max=2))&lt;br /&gt;&lt;br /&gt;lm_fit&lt;-lm(y~x1+x2+x3)&lt;br /&gt;summary(lm_fit)&lt;br /&gt;&lt;/pre&gt;As you can see, the R-squared value is now .6658, indicating that the predictors have less of a linear correlation to the dependent variable than the predictors from the bagging intro yesterday.&lt;pre&gt;&lt;br /&gt;set.seed(10)&lt;br /&gt;all_data&lt;-data.frame(y,x1,x2,x3)&lt;br /&gt;positions &lt;- sample(nrow(all_data),size=floor((nrow(all_data)/4)*3))&lt;br /&gt;training&lt;- all_data[positions,]&lt;br /&gt;testing&lt;- all_data[-positions,]&lt;br /&gt;&lt;br /&gt;lm_fit&lt;-lm(y~x1+x2+x3,data=training)&lt;br /&gt;predictions&lt;-predict(lm_fit,newdata=testing)&lt;br /&gt;error&lt;-sqrt((sum((testing$y-predictions)^2))/nrow(testing))&lt;br /&gt;error&lt;br /&gt;&lt;/pre&gt;Dividing the data into training and testing sets and using a simple linear model to make predictions about the testing set yields a root mean squared error of 177.36.&lt;br/&gt;&lt;pre&gt;&lt;br /&gt;library(foreach)&lt;br /&gt;length_divisor&lt;-6&lt;br /&gt;iterations&lt;-5000&lt;br /&gt;predictions&lt;-foreach(m=1:iterations,.combine=cbind) %do% {&lt;br /&gt;training_positions &lt;- sample(nrow(training), size=floor((nrow(training)/length_divisor)))&lt;br /&gt;train_pos&lt;-1:nrow(training) %in% training_positions&lt;br /&gt;lm_fit&lt;-lm(y~x1+x2+x3,data=training[train_pos,])&lt;br /&gt;predict(lm_fit,newdata=testing)&lt;br /&gt;}&lt;br /&gt;predictions&lt;-rowMeans(predictions)&lt;br /&gt;error&lt;-sqrt((sum((testing$y-predictions)^2))/nrow(testing))&lt;br /&gt;error&lt;br /&gt;&lt;/pre&gt;Creating 5000 simple linear models and averaging the results creates a prediction error of 177.2591, which is marginally superior to the initial results.&lt;br/&gt;&lt;br/&gt;&lt;b&gt;Creating the First Ensemble&lt;/b&gt;&lt;br/&gt;&lt;br/&gt;To create our initial ensemble, we will use the linear model, and a random forest.  A random forest can be a powerful learning technique.  It creates multiple decision trees from randomized sets of predictors and observations.  It will be less useful here, as we only have three predictors, but will still illustrate the general point.&lt;br/&gt;&lt;br/&gt;Now, we will test the efficacy of a random forest on our data:&lt;pre&gt;&lt;br /&gt;library(randomForest)&lt;br /&gt;rf_fit&lt;-randomForest(y~x1+x2+x3,data=training,ntree=500)&lt;br /&gt;predictions&lt;-predict(rf_fit,newdata=testing)&lt;br /&gt;error&lt;-sqrt((sum((testing$y-predictions)^2))/nrow(testing))&lt;br /&gt;error&lt;br /&gt;&lt;/pre&gt;My error value came to 134.98 with the above code.  As you can see, random forests can make much better predictions than linear models.  In this case, the linear model could not deal with the nonlinear predictors, whereas the random forest could.  One should beware the trap of overfitting, however, as while random forests are much less prone to it than a single decision tree, it can still occur in very noisy data.&lt;br/&gt;&lt;br/&gt;Note that a random forest already incorporates the idea of bagging into the basic algorithm, so we will gain little to nothing by running a random forest through our bagging function.  What we will do instead is this:&lt;pre&gt;&lt;br /&gt;length_divisor&lt;-6&lt;br /&gt;iterations&lt;-5000&lt;br /&gt;predictions&lt;-foreach(m=1:iterations,.combine=cbind) %do% {&lt;br /&gt;training_positions &lt;- sample(nrow(training), size=floor((nrow(training)/length_divisor)))&lt;br /&gt;train_pos&lt;-1:nrow(training) %in% training_positions&lt;br /&gt;lm_fit&lt;-lm(y~x1+x2+x3,data=training[train_pos,])&lt;br /&gt;predict(lm_fit,newdata=testing)&lt;br /&gt;}&lt;br /&gt;lm_predictions&lt;-rowMeans(predictions)&lt;br /&gt;&lt;br /&gt;library(randomForest)&lt;br /&gt;rf_fit&lt;-randomForest(y~x1+x2+x3,data=training,ntree=500)&lt;br /&gt;rf_predictions&lt;-predict(rf_fit,newdata=testing)&lt;br /&gt;predictions&lt;-(lm_predictions+rf_predictions)/2&lt;br /&gt;error&lt;-sqrt((sum((testing$y-predictions)^2))/nrow(testing))&lt;br /&gt;error&lt;br /&gt;&lt;/pre&gt;This combines the results of the linear model and the random forest using equal weights.  Not surprisingly, considering that the linear model is weaker than the random forest, we end up with an error of 147.97.  While this is not a fantastic result, it does illustrate ensemble learning.&lt;br/&gt;&lt;br/&gt;&lt;b&gt;Improving the Performance of Our Ensemble&lt;/b&gt;&lt;br/&gt;&lt;br/&gt;From here, we have two options.  We can either combine the predictions of the linear model and the random forest in different ratios until we achieve a better result(I would do this with caution in the real world, and use one hold out set to find the optimal weights, and test the weightings on another hold out set, as this can lead to overfitting if done improperly), or we can replace the linear model with a better one.  We will try both, but first we will try combining the results in different ratios.  Our prior knowledge of the error rates tells us to use a small ratio for the linear model.&lt;br/&gt;&lt;br/&gt;&lt;pre&gt;&lt;br /&gt;predictions&lt;-(lm_predictions+rf_predictions*9)/10&lt;br /&gt;error&lt;-sqrt((sum((testing$y-predictions)^2))/nrow(testing))&lt;br /&gt;error&lt;br /&gt;&lt;/pre&gt;This results in an error rate of 136.23, which is not an improvement on the random Forest alone.  Next, we replace the linear model with a support vector machine(svm) from the e1071 package, which provides an R interface to libSVM.  A support vector machine is based on some complicated mathematics, but is basically a stronger machine learning technique that can pick up nonlinear tendencies in the data, depending on what kind of kernel is used.&lt;br/&gt;&lt;pre&gt;&lt;br /&gt;library(e1071)&lt;br /&gt;svm_fit&lt;-svm(y~x1+x2+x3,data=training)&lt;br /&gt;svm_predictions&lt;-predict(svm_fit,newdata=testing)&lt;br /&gt;error&lt;-sqrt((sum((testing$y-svm_predictions)^2))/nrow(testing))&lt;br /&gt;error&lt;br /&gt;&lt;/pre&gt;The error when using an svm is 129.87, which is superior to both the random forest and the linear models.  Next we will try using the svm with our bagging function.&lt;pre&gt;&lt;br /&gt;length_divisor&lt;-6&lt;br /&gt;iterations&lt;-5000&lt;br /&gt;predictions&lt;-foreach(m=1:iterations,.combine=cbind) %do% {&lt;br /&gt;training_positions &lt;- sample(nrow(training), size=floor((nrow(training)/length_divisor)))&lt;br /&gt;train_pos&lt;-1:nrow(training) %in% training_positions&lt;br /&gt;svm_fit&lt;-svm(y~x1+x2+x3,data=training[train_pos,])&lt;br /&gt;predict(svm_fit,newdata=testing)&lt;br /&gt;}&lt;br /&gt;svm2_predictions&lt;-rowMeans(predictions)&lt;br /&gt;error&lt;-sqrt((sum((testing$y-svm2_predictions)^2))/nrow(testing))&lt;br /&gt;error&lt;br /&gt;&lt;/pre&gt;The error with 5000 iterations of an svm model is 141.14. In this case, it appears that the svm performs better without bagging techniques.  It may be that there is too little data for it to be effective when used like this.  However, the time it takes an svm increases exponentially(I believe) with more observations, so sometimes various techniques, including reduction in the number of observations or features, will need to be performed to improve svm performance to tolerable levels.  Going forward, we will use the results of the single svm for the rest of this article.&lt;pre&gt;&lt;br /&gt;predictions&lt;-(svm_predictions+rf_predictions)/2&lt;br /&gt;error&lt;-sqrt((sum((testing$y-predictions)^2))/nrow(testing))&lt;br /&gt;error&lt;br /&gt;&lt;/pre&gt;When we equally combine the svm predictions from the single model with the random forest predictions, we get an error rate of 128.8, which is superior to either the svm model alone, or the random forest model alone.&lt;pre&gt;&lt;br /&gt;predictions&lt;-(svm_predictions*2+rf_predictions)/3&lt;br /&gt;error&lt;-sqrt((sum((testing$y-predictions)^2))/nrow(testing))&lt;br /&gt;error&lt;br /&gt;&lt;/pre&gt;If we tweak the ratios to emphasize the stronger svm model, we lower our error to 128.34.&lt;br/&gt;&lt;br/&gt;&lt;b&gt;Conclusion&lt;/b&gt;&lt;br/&gt;&lt;br/&gt;As we have seen, ensemble learning can outperform any one single model when used properly.  A good exercise to continue on from here would be to see if other models can improve performance when added to the ensemble.  Please feel free to email me or leave a comment if you see something incorrect or have any questions.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8712623323474199769-4026568941847141175?l=viksalgorithms.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://viksalgorithms.blogspot.com/feeds/4026568941847141175/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://viksalgorithms.blogspot.com/2012/01/intro-to-ensemble-learning-in-r.html#comment-form' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8712623323474199769/posts/default/4026568941847141175'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8712623323474199769/posts/default/4026568941847141175'/><link rel='alternate' type='text/html' href='http://viksalgorithms.blogspot.com/2012/01/intro-to-ensemble-learning-in-r.html' title='An Intro to Ensemble Learning in R'/><author><name>Vik Paruchuri</name><uri>http://www.blogger.com/profile/14436550620869189803</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8712623323474199769.post-7265748392976191068</id><published>2012-01-18T07:49:00.000-08:00</published><updated>2012-01-18T15:17:23.985-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='R'/><title type='text'>Improve Predictive Performance in R with Bagging</title><content type='html'>Bagging, aka bootstrap aggregation, is a relatively simple way to increase the power of a predictive statistical model by taking multiple random samples(with replacement) from your training data set, and using each of these samples to construct a separate model and separate predictions for your test set.  These predictions are then averaged to create a, hopefully more accurate, final prediction value.&lt;br/&gt;&lt;br/&gt; One can quickly intuit that this technique will be more useful when the predictors are more unstable.  In other words, if the random samples that you draw from your training set are very different, they will generally lead to very different sets of predictions.  This greater variability will lead to a stronger final result.  When the samples are extremely similar, all of the predictions derived from the samples will likewise be extremely similar, making bagging a bit superfluous.&lt;br/&gt;&lt;br/&gt;   Taking smaller samples from your training set will induce greater instability, but taking samples that are too small will result in useless models.  Generally, some fraction of your training set between 1/50th and 1/2 will be useful for bagging purposes(of course, this greatly depends on how many observations are in your training set).  The smaller your bagging samples, the more samples you will need to collect and the more models you will need to generate to create more stability in the final predictors.&lt;br/&gt;&lt;br/&gt; While there are some libraries that will take care of bagging for you in R, writing your own function allows you a greater degree of control and understanding over the process, and it is a relatively quick exercise.&lt;br/&gt;&lt;br/&gt; Okay, enough theoretical framework.  Lets jump into the code.  Anyone who wants more theory can consult &lt;a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.121.7654&amp;rep=rep1&amp;type=pdf"&gt;this paper&lt;/a&gt; about bagging.&lt;br/&gt;&lt;br/&gt;   I'm going to construct a relatively trivial example where a dependent variable, y, can be predicted by some combination of the independent variables x1, x2, and x3.&lt;br/&gt;&lt;br/&gt; &lt;pre&gt;&lt;br /&gt;set.seed(10)&lt;br /&gt;y&lt;-c(1:1000)&lt;br /&gt;x1&lt;-c(1:1000)*runif(1000,min=0,max=2)&lt;br /&gt;x2&lt;-c(1:1000)*runif(1000,min=0,max=2)&lt;br /&gt;x3&lt;-c(1:1000)*runif(1000,min=0,max=2)&lt;br /&gt;&lt;/pre&gt;As you can see, y is a sequence of the values from 1 to 1000.  x1, x2, and x3 are permutations of y, but with random errors added.  runif generates a specified number of random numbers from 0 to 1, unless a min and max are specified, in which case the numbers fall between those values.  Each of the x sequences will roughly approximate y, but with random errors thrown in.  The set.seed function is simply to ensure that the subsequent random number generation proceeds in a predictable fashion, so that your results match mine.&lt;br/&gt;&lt;br/&gt; Fitting a linear model to the variables results in an R squared of .7042:&lt;pre&gt;&lt;br /&gt;lm_fit&lt;-lm(y~x1+x2+x3)&lt;br /&gt;summary(lm_fit)&lt;br /&gt;&lt;/pre&gt;Now we will see how well the x values predict y.  First, we designate a random sample of y to be our "test" set.  The rest will be the training set.&lt;pre&gt;&lt;br /&gt;set.seed(10)&lt;br /&gt;all_data&lt;-data.frame(y,x1,x2,x3)&lt;br /&gt;positions &lt;- sample(nrow(all_data),size=floor((nrow(all_data)/4)*3))&lt;br /&gt;training&lt;- all_data[positions,]&lt;br /&gt;testing&lt;- all_data[-positions,]&lt;br /&gt;&lt;/pre&gt;The above code places all of our variables into a data frame, then randomly selects 3/4 of the data to be the training set, and places the rest into the testing set.&lt;br/&gt;&lt;br/&gt;We are now able to generate predictions for the testing set by creating a linear model on the training set and applying it to the testing set.  We are also able to calculate the prediction error by subtracting the actual values from the predicted values (the error calculation here is root mean squared error):&lt;pre&gt;&lt;br /&gt;lm_fit&lt;-lm(y~x1+x2+x3,data=training)&lt;br /&gt;predictions&lt;-predict(lm_fit,newdata=testing)&lt;br /&gt;error&lt;-sqrt((sum((testing$y-predictions)^2))/nrow(testing))&lt;br /&gt;&lt;/pre&gt;The calculated error should be 161.15.&lt;br/&gt;&lt;br/&gt;The next step is to run a function that implements bagging.  In order to do this, I will be using the foreach package.  Although I will not use it in parallel mode, this code is designed for parallel execution, and I highly recommend reading &lt;a href="http://viksalgorithms.blogspot.com/2012/01/parallel-r-loops-for-windows-and-linux.html"&gt;my post&lt;/a&gt; about how to do it if you do not know how.&lt;pre&gt;&lt;br /&gt;library(foreach)&lt;br /&gt;length_divisor&lt;-4&lt;br /&gt;iterations&lt;-1000&lt;br /&gt;predictions&lt;-foreach(m=1:iterations,.combine=cbind) %do% {&lt;br /&gt;training_positions &lt;- sample(nrow(training), size=floor((nrow(training)/length_divisor)))&lt;br /&gt;train_pos&lt;-1:nrow(training) %in% training_positions&lt;br /&gt;lm_fit&lt;-lm(y~x1+x2+x3,data=training[train_pos,])&lt;br /&gt;predict(lm_fit,newdata=testing)&lt;br /&gt;}&lt;br /&gt;predictions&lt;-rowMeans(predictions)&lt;br /&gt;error&lt;-sqrt((sum((testing$y-predictions)^2))/nrow(testing))&lt;br /&gt;&lt;/pre&gt;The above code randomly samples 1/4 of the training set in each iteration, and generates predictions for the testing set based the sample.  It will execute the number of time specified by iterations.  When iterations was set to 10, I received an error value of 161.10.  At 300 iterations, error went to 161.12, at 500 iterations, error went to 161.19, at 1000 iterations, error went to 161.13, and at 5000 iterations, error went to 161.07.  Eventually, bagging will converge, and more iterations will not help any further.  However, the potential for improvement in results exists.  You should be extremely cautious and assess the stability of the results before deploying this approach, however, as too few iterations or too large a length divisor can cause extremely unstable results.  This example is trivial, but this can lead to better results in a more "real-world" application.&lt;br/&gt;&lt;br/&gt;Finally, we can place this code into a function to wrap it up nicely:&lt;pre&gt;&lt;br /&gt;bagging&lt;-function(training,testing,length_divisor=4,iterations=1000)&lt;br /&gt;{&lt;br /&gt;predictions&lt;-foreach(m=1:iterations,.combine=cbind) %do% {&lt;br /&gt;training_positions &lt;- sample(nrow(training), size=floor((nrow(training)/length_divisor)))&lt;br /&gt;train_pos&lt;-1:nrow(training) %in% training_positions&lt;br /&gt;lm_fit&lt;-lm(y~x1+x2+x3,data=training[train_pos,])&lt;br /&gt;predict(lm_fit,newdata=testing)&lt;br /&gt;}&lt;br /&gt;rowMeans(predictions)&lt;br /&gt;}&lt;br /&gt;&lt;/pre&gt;As you can see, bagging can be a useful tool when used correctly.  Although this is a trivial example, you can replace the data and even replace the simple linear model with more powerful models to make this function more useful.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8712623323474199769-7265748392976191068?l=viksalgorithms.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://viksalgorithms.blogspot.com/feeds/7265748392976191068/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://viksalgorithms.blogspot.com/2012/01/build-your-own-bagging-function-in-r.html#comment-form' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8712623323474199769/posts/default/7265748392976191068'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8712623323474199769/posts/default/7265748392976191068'/><link rel='alternate' type='text/html' href='http://viksalgorithms.blogspot.com/2012/01/build-your-own-bagging-function-in-r.html' title='Improve Predictive Performance in R with Bagging'/><author><name>Vik Paruchuri</name><uri>http://www.blogger.com/profile/14436550620869189803</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8712623323474199769.post-4707394308354424625</id><published>2012-01-17T16:59:00.000-08:00</published><updated>2012-01-18T07:51:32.634-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Finance'/><category scheme='http://www.blogger.com/atom/ns#' term='R'/><title type='text'>Time Based Arbitrage Opportunities in Tick Data</title><content type='html'>I recently posted an &lt;a href="http://viksalgorithms.blogspot.com/2012/01/introduction-to-kaggle-algorithmic.html"&gt;introduction&lt;/a&gt; to the &lt;a href="http://www.kaggle.com/c/AlgorithmicTradingChallenge"&gt;Kaggle Algorithmic Trading Challenge&lt;/a&gt;, which I competed in.&lt;br /&gt;&lt;br /&gt;I said that I would post about my experiences, and this is hopefully the first of a series.  We were given tick data from the London Stock Exchange(specifically, the FTSE 100) over random time intervals during parts of 37 days.  Each data row that we were given corresponded to a liquidity shock event and the surrounding bid/ask prices.  Given 50 bid prices and 50 ask prices prior to the liquidity shock event, we had to predict the next 50 bid and ask prices.&lt;br /&gt;&lt;br /&gt;I, and others, noticed that there were distinct areas of high volatility in the tick data at certain times of day.  Specifically, 10:15, 13:30, and 15:00 (london time) all showed abnormal volatility.&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/-pGmuLzJO6kQ/TxYNiAenKRI/AAAAAAAAAAc/Vtxg4XvRBBE/s1600/trade_time_vs_residuals.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="287" src="http://4.bp.blogspot.com/-pGmuLzJO6kQ/TxYNiAenKRI/AAAAAAAAAAc/Vtxg4XvRBBE/s400/trade_time_vs_residuals.png" width="400" /&gt;&lt;/a&gt;&lt;/div&gt;The above chart plots the residuals of a simple linear model that predicts the average of the first five bid and ask prices after the liquidity shock event at t=50 against the time the trade was made.  The x-axis(time the trade was made) is in the units of hours after 8:00, so a value of 2 on the x-axis corresponds to 10:00.  The residuals of the linear fit indicate how "hard" the values of the bid-ask time series were to predict, which is a proxy for volatility.The below R code generated the plot:&lt;br /&gt;&lt;pre&gt;lm_fit&amp;lt;-lm(current_formula,data=cbind(extracted_training_dependent_variable,&lt;/pre&gt;&lt;pre&gt;extracted_training_data))&lt;br /&gt;plot(x=extracted_training_data$trade_time_50,y=lm_fit$residuals)&lt;br /&gt;&lt;/pre&gt;This can also be observed when plotting the normalized standard deviation of the bid series against trade time, although the spike at 10:15 is harder to see:&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/-fqF2vP7nCK0/TxYRpfUN_dI/AAAAAAAAAAk/7R90oy4Gtx0/s1600/sd_vs_trade_time.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="286" src="http://1.bp.blogspot.com/-fqF2vP7nCK0/TxYRpfUN_dI/AAAAAAAAAAk/7R90oy4Gtx0/s400/sd_vs_trade_time.png" width="400" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;Although the spike in volatility at 13:30 appears to be caused by the opening of the NASDAQ and NYSE, it is unclear what the other two spikes might represent, although other markets may be opening at those times.  Specifically, the spikes appear to be caused by other markets/traders valuing the stocks differently, which leads to an opportunity for profit taking.&lt;br /&gt;&lt;br /&gt;Here is a typical series in which a spike occurs:&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/-X8K_opJvEEc/TxYVhrxgenI/AAAAAAAAAAs/XcVRj5x3Ck0/s1600/example_time_series.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="287" src="http://1.bp.blogspot.com/-X8K_opJvEEc/TxYVhrxgenI/AAAAAAAAAAs/XcVRj5x3Ck0/s400/example_time_series.png" width="400" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;In this series, a liquidity shock occurs when x=50(although there are several shocks prior to this).  The whole sequence from x=1 to x=50 takes place over 13 seconds, and each x value is a distinct trade or quote event.  The red points are ask prices, and the blue points are bid prices.  As you can see, there appears to be a significant arbitrage opportunity over a fairly long time scale as the ask prices lower, perhaps in response to the volatile bid prices(which may be a result of traders from other areas eating up the available liquidity on the bid side), then rise again after the shock at t=50. &lt;br /&gt;&lt;br /&gt;This same pattern exists in other tick data time series that were given to competitors in this challenge, which could make for potentially interesting arbitrage opportunities if it is able to be exploited.&lt;br /&gt;&lt;br /&gt;Another interesting feature of the tick data when plotted over time is the fact that there is much higher volatility earlier in the day.&amp;nbsp; This corresponds with several papers that noted the same phenomenon.&amp;nbsp; When training a predictive model on tick data, a significant portion of the days error(30-40%) can occur in the first hour of trading.&amp;nbsp; This points to the need for specialized models to be developed for the first hour of trading, but it also points to the fact that the first hour, is, in some ways, relatively unpredictable.&amp;nbsp; A model that does not attempt to make predictions in the first hour may make significantly more "correct" guesses and thus significantly more profit than one that does not.&lt;br /&gt;&lt;br /&gt;The below plot is an easier way to look at this.&amp;nbsp; It uses the tapply function to find the mean normalized residual for each one-minute time slice in the trading day:&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/-k5dKWN6LjPM/TxYY4-CcZEI/AAAAAAAAAA0/m6KGaa0tG3s/s1600/tapply_error_plot.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="286" src="http://3.bp.blogspot.com/-k5dKWN6LjPM/TxYY4-CcZEI/AAAAAAAAAA0/m6KGaa0tG3s/s400/tapply_error_plot.png" width="400" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;As you can see, there is more than 4 times as much prediction error(remember, a proxy for volatility) in the first minute as there is in the last.&lt;br /&gt;&lt;br /&gt;I will wrap this post up at this point.&amp;nbsp; I will post more of my findings soon.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8712623323474199769-4707394308354424625?l=viksalgorithms.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://viksalgorithms.blogspot.com/feeds/4707394308354424625/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://viksalgorithms.blogspot.com/2012/01/time-based-arbitrage-opportunities-in.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8712623323474199769/posts/default/4707394308354424625'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8712623323474199769/posts/default/4707394308354424625'/><link rel='alternate' type='text/html' href='http://viksalgorithms.blogspot.com/2012/01/time-based-arbitrage-opportunities-in.html' title='Time Based Arbitrage Opportunities in Tick Data'/><author><name>Vik Paruchuri</name><uri>http://www.blogger.com/profile/14436550620869189803</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/-pGmuLzJO6kQ/TxYNiAenKRI/AAAAAAAAAAc/Vtxg4XvRBBE/s72-c/trade_time_vs_residuals.png' height='72' width='72'/><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8712623323474199769.post-2026240887464657682</id><published>2012-01-17T14:41:00.000-08:00</published><updated>2012-01-18T07:51:22.273-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Linux'/><category scheme='http://www.blogger.com/atom/ns#' term='Windows'/><category scheme='http://www.blogger.com/atom/ns#' term='parallel'/><category scheme='http://www.blogger.com/atom/ns#' term='R'/><title type='text'>Parallel R Loops for Windows and Linux</title><content type='html'>Parallel computation may seem difficult to implement and a pain to use, but it is actually quite simple to use.  The foreach package provides the basic loop structure, which can utilize various parallel backends to execute the loop in parallel.First, let's go over the basic structure of a foreach loop.  To get the foreach package, run the following command:&lt;br /&gt;&lt;pre&gt;install.packages("foreach")&lt;/pre&gt;Then, initialize the library:&lt;br /&gt;&lt;pre&gt;library(foreach)&lt;/pre&gt;A basic, nonparallel, foreach loop looks like this:&lt;br /&gt;&lt;pre&gt;foreach(i=1:10) %do% {&lt;br /&gt;&lt;br /&gt;#loop contents here&lt;br /&gt;&lt;br /&gt;}&lt;/pre&gt;To execute the loop in parallel, the %do% command must be replaced with %dopar%:&lt;br /&gt;&lt;pre&gt;foreach(i=1:10) %dopar% {&lt;br /&gt;&lt;br /&gt;#loop contents here&lt;br /&gt;&lt;br /&gt;}&lt;/pre&gt;To capture the return values of the loop:&lt;br /&gt;&lt;pre&gt;list&amp;lt;-foreach(i=1:10) %do% {&lt;br /&gt;i&lt;br /&gt;}&lt;/pre&gt;Note that the foreach loop returns a list of values by default.  The foreach package will always return a result with the items in the same order as the counter, even when running in parallel.  For example, the above loop will return a list with indices 1 through 10, each containing the same value as their index(1 to 10).&lt;br /&gt;&lt;br /&gt;In order to return the results as a matrix, you will need to alter the .combine behavior of the foreach loop.  This is done in the following code:&lt;br /&gt;&lt;pre&gt;matrix&amp;lt;-foreach(i=1:10,.combine=cbind) %do% {&lt;br /&gt;i&lt;br /&gt;}&lt;/pre&gt;This will return a matrix with 10 columns, with values in order from 1 to 10.&lt;br /&gt;&lt;br /&gt;Likewise, this will return a matrix with 10 rows:&lt;br /&gt;&lt;pre&gt;matrix&amp;lt;-foreach(i=1:10,.combine=rbind) %do% {&lt;br /&gt;i&lt;br /&gt;}&lt;/pre&gt;This can be done with multiple return values to create n x k matrices.  For example, this will return a 10 x 2 matrix:&lt;br /&gt;&lt;pre&gt;matrix&amp;lt;-foreach(i=1:10,.combine=rbind) %do% {&lt;br /&gt;c(i,i)&lt;br /&gt;}&lt;/pre&gt;&lt;br /&gt;&lt;b&gt;Parallel Backends&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;In order to run the foreach loop in parallel(using the %dopar% command), you will need to install and register a parallel backend.  Because windows does not support forking, the same backend that works a linux or an OS X environment will not work for windows.Under linux, the doMC package provides a convenient parallel backend.&lt;br /&gt;&lt;br /&gt;Here is how to use the package(of course, you need to install doMC first):&lt;br /&gt;&lt;pre&gt;library(foreach)&lt;br /&gt;library(doMC)&lt;br /&gt;registerDoMC(2)  #change the 2 to your number of CPU cores  &lt;br /&gt;&lt;br /&gt;foreach(i=1:10) %dopar% {&lt;br /&gt;&lt;br /&gt;#loop contents here&lt;br /&gt;&lt;br /&gt;}&lt;/pre&gt;Under windows, the doSNOW package is very convenient, although it has some issues.  I do not recommend the doSMP package, as it has significant issues.&lt;br /&gt;&lt;pre&gt;library(doSNOW)&lt;br /&gt;library(foreach)&lt;br /&gt;cl&amp;lt;-makeCluster(2) #change the 2 to your number of CPU cores&lt;br /&gt;registerDoSNOW(cl)&lt;br /&gt;&lt;br /&gt;foreach(i=1:10) %dopar% {&lt;br /&gt;&lt;br /&gt;#loop contents here&lt;br /&gt;&lt;br /&gt;} &lt;/pre&gt;&lt;br /&gt;Edit:&amp;nbsp; Thanks to an alert reader, I noticed that I neglected to add in the code to stop the clusters.&amp;nbsp; This will need to be run after you finish executing all of your parallel code if you are using doSNOW.&lt;pre&gt;&lt;br /&gt;stopCluster(cl)&lt;br /&gt;&lt;/pre&gt; Also please note that you will need to set the parameter in the makeCluster and registerDoMC functions to the number of CPU cores that your computer possesses, or less if you do not want to use all of your CPU cores.&lt;br/&gt;&lt;br/&gt;I hope that this has been a good introduction to parallel loops in R.  The new version of R(2.14), also includes the parallel package, which I will discuss further in a later post.  You can find more information on the packages mentioned in this article on CRAN.  &lt;a href="http://cran.r-project.org/web/packages/foreach/index.html"&gt;Foreach&lt;/a&gt;, &lt;a href="http://cran.r-project.org/web/packages/doSNOW/index.html"&gt;doSNOW&lt;/a&gt;, and &lt;a href="http://cran.r-project.org/web/packages/doMC/index.html"&gt;doMC&lt;/a&gt; can all be found there.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8712623323474199769-2026240887464657682?l=viksalgorithms.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://viksalgorithms.blogspot.com/feeds/2026240887464657682/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://viksalgorithms.blogspot.com/2012/01/parallel-r-loops-for-windows-and-linux.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8712623323474199769/posts/default/2026240887464657682'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8712623323474199769/posts/default/2026240887464657682'/><link rel='alternate' type='text/html' href='http://viksalgorithms.blogspot.com/2012/01/parallel-r-loops-for-windows-and-linux.html' title='Parallel R Loops for Windows and Linux'/><author><name>Vik Paruchuri</name><uri>http://www.blogger.com/profile/14436550620869189803</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8712623323474199769.post-218547080807202456</id><published>2012-01-16T15:52:00.000-08:00</published><updated>2012-01-18T07:51:09.882-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Machine Learning'/><category scheme='http://www.blogger.com/atom/ns#' term='Statistics'/><title type='text'>Resources for Learning Statistics and Data Mining</title><content type='html'>There are an abundance of resources on the web to help novices teach themselves statistics and machine learning, but they can be hard to track down.  Below are a few resources that have helped me in the past.&lt;br&gt;&lt;br&gt;1. &lt;b&gt;&lt;a href="http://www.khanacademy.org/#statistics"&gt;Khan Academy&lt;/a&gt;&lt;/b&gt; has a series of excellent videos on the basics of statistics and probability.  They range from 5 minutes to 20 minutes in length.  You can create an account to track your progress if you desire. Although good for the basics, Khan Academy does not cover advanced topics.&lt;br&gt;&lt;br&gt;2. &lt;b&gt;&lt;a href="http://www-stat.stanford.edu/~tibs/ElemStatLearn/"&gt;The Elements of Statistical Learning&lt;/a&gt;&lt;/b&gt;, a free ebook, is a relatively in-depth overview of the major concepts and techniques involved in machine learning.  You will need to learn inferential statistics before reading this book, as it assumes that the reader understands the basics.&lt;br&gt;&lt;br&gt; 3. &lt;b&gt;&lt;a href="http://www.statsoft.com/textbook/"&gt;The Statsoft Online Statistics Textbook&lt;/a&gt;&lt;/b&gt; is a good resource on statistics and machine learning.  A reader will need a basic understanding of statistics and probability before reading this text.  It is not particularly in depth, but is good for an overview of major concepts, or as a refresher.&lt;br&gt;&lt;br&gt;4. &lt;b&gt;&lt;a href="http://www.youtube.com/playlist?list=PLD0F06AA0D2E8FFBA&amp;feature=plcp"&gt;Machine Learning Videos from mathematicalmonk&lt;/a&gt;&lt;/b&gt; can be found on youtube(the link is to the full playlist).  Although I have not viewed all of these videos personally, they appear to be a good overview of machine learning techniques.&lt;br&gt;&lt;br&gt;5. &lt;b&gt;&lt;a href="http://www.ml-class.org/"&gt;Andrew Ng's Online Machine Learning Class&lt;/a&gt;&lt;/b&gt; is a simplified version of the Stanford class CS229.  It glosses over most of the mathematics involved, but is a very good introductory resource for machine learning.  Mlclass also features quizzes and programming exercises that create additional engagement.  I would recommend that the reader study basic statistics before attempting this class.&lt;br&gt;&lt;br&gt;6. &lt;b&gt;&lt;a href="http://faculty.vassar.edu/lowry/webtext.html"&gt;Concepts and Applications of Inferential Statistics&lt;/a&gt;&lt;/b&gt; is written by a professor at Vassar, and is a good introduction to basic statistics.&lt;br&gt;&lt;br&gt;7. &lt;b&gt;&lt;a href="http://onlinestatbook.com/2/index.html"&gt;Online Stat Book&lt;/a&gt;&lt;/b&gt;  is a good resource for basic statistical concepts.  It has lots of interactive exercises interspersed throughout the text.&lt;br&gt;&lt;br&gt;8. &lt;b&gt;&lt;a href="http://www.math.umass.edu/~lavine/Book/book.pdf"&gt;Introduction to Statistical Thought&lt;/a&gt;&lt;/b&gt; covers basic probability and statistics, and covers some more advanced topics such as time series and survival analysis. It also has R code that the reader can implement.&lt;br&gt;&lt;br&gt;9.  &lt;b&gt;&lt;a href="http://joshua.smcvt.edu/linearalgebra/"&gt;Linear Algebra&lt;/a&gt;&lt;/b&gt;, a free to download ebook, provides an overview of linear algebra equivalent to an initial undergraduate course.  I have not read through it yet.&lt;br&gt;&lt;br&gt;10. &lt;b&gt;&lt;a href="http://ocw.mit.edu/index.htm"&gt;MIT Opencourseware&lt;/a&gt;&lt;/b&gt; is an excellent site that has full video lecture series and problem sets for hundreds of classes including linear algebra, calculus, and statistics.&lt;br&gt;&lt;br&gt;&lt;b&gt;Further Reading&lt;/b&gt;&lt;br&gt;&lt;br&gt;&lt;a href="http://steve-yegge.blogspot.com/2006/03/math-for-programmers.html"&gt;This&lt;/a&gt; is an interesting blog post on how to learn mathematics as a programmer.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8712623323474199769-218547080807202456?l=viksalgorithms.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://viksalgorithms.blogspot.com/feeds/218547080807202456/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://viksalgorithms.blogspot.com/2012/01/resources-for-learning-statistics-and.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8712623323474199769/posts/default/218547080807202456'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8712623323474199769/posts/default/218547080807202456'/><link rel='alternate' type='text/html' href='http://viksalgorithms.blogspot.com/2012/01/resources-for-learning-statistics-and.html' title='Resources for Learning Statistics and Data Mining'/><author><name>Vik Paruchuri</name><uri>http://www.blogger.com/profile/14436550620869189803</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8712623323474199769.post-4281820141342452496</id><published>2012-01-10T19:00:00.000-08:00</published><updated>2012-01-18T07:50:55.439-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Finance'/><category scheme='http://www.blogger.com/atom/ns#' term='R'/><title type='text'>Time Series Cointegration in R</title><content type='html'>Cointegration can be a valuable tool in determining the mean reverting properties of 2 time series.  A full description of cointegration can be found on &lt;a href="http://en.wikipedia.org/wiki/Cointegration"&gt;Wikipedia&lt;/a&gt;.  Essentially, it seeks to find stationary linear combinations of the two vectors.&lt;br&gt;&lt;br&gt;The below R code, which has been modified from &lt;a href="http://quanttrader.info/public/testForCoint.html"&gt;here&lt;/a&gt;, will test two series for integration and return the p-value indicating the likelihood of correlation.  It runs significantly faster than the original code, however.  I used this for relatively short time series(50 observations), and while it functioned relatively quickly for small numbers of series, it became cumbersome to use when attempting to serially cointegrate over 100k pairs of bid-ask price series when using it with an mapply function.  So scaling up may be an issue.&lt;pre class="ruby"&gt;&lt;br /&gt;library(tseries)&lt;br /&gt;cointegration&lt;-function(x,y)&lt;br /&gt;{&lt;br /&gt;vals&lt;-data.frame(x,y)&lt;br /&gt;beta&lt;-coef(lm(vals[,2]~vals[,1]+0,data=vals))[1]&lt;br /&gt;(adf.test(vals[,2]-beta*vals[,1], alternative="stationary", k=0))$p.value&lt;br /&gt;}&lt;br /&gt;&lt;/pre&gt;This runs an &lt;a href="http://en.wikipedia.org/wiki/Augmented_Dickey%E2%80%93Fuller_test"&gt;augmented Dickey-Fuller test&lt;/a&gt; and will return a p-value indicating whether the series are mean-reverting or not.  You can use the typical p-value as a test of significance if you like(ie, a p-value below .05 indicates a mean-reverting spread), or you can use an alternate value. This assumes that your two series were observed at the same time points.  The &lt;a href="http://quanttrader.info/public/testForCoint.html"&gt;original post&lt;/a&gt; that this code as modified from contains a further description of cointegration, along with more time series data type handling.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8712623323474199769-4281820141342452496?l=viksalgorithms.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://viksalgorithms.blogspot.com/feeds/4281820141342452496/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://viksalgorithms.blogspot.com/2012/01/time-series-cointegration-in-r.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8712623323474199769/posts/default/4281820141342452496'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8712623323474199769/posts/default/4281820141342452496'/><link rel='alternate' type='text/html' href='http://viksalgorithms.blogspot.com/2012/01/time-series-cointegration-in-r.html' title='Time Series Cointegration in R'/><author><name>Vik Paruchuri</name><uri>http://www.blogger.com/profile/14436550620869189803</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8712623323474199769.post-8113942018418819016</id><published>2012-01-10T18:38:00.000-08:00</published><updated>2012-01-18T07:50:47.496-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Ruby'/><category scheme='http://www.blogger.com/atom/ns#' term='Windows'/><title type='text'>Playing MP3s in Ruby on Windows</title><content type='html'>I recently needed to play music in Ruby in order to create an alarm clock of sorts.  The code to do this was (surprise surprise) fairly simple, but I want to post it for posterity.&lt;pre class="ruby"&gt;&lt;br /&gt;require 'win32ole'&lt;br /&gt;player = WIN32OLE.new('WMPlayer.OCX')&lt;br /&gt;player.OpenPlayer('C:\alarm.mp3')&lt;br /&gt;&lt;/pre&gt;This will open a new instance of Windows Media Player that plays the selected song.&lt;br&gt;&lt;br&gt;Note that win32ole should be installed by default, and does not need to be installed as a gem.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8712623323474199769-8113942018418819016?l=viksalgorithms.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://viksalgorithms.blogspot.com/feeds/8113942018418819016/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://viksalgorithms.blogspot.com/2012/01/playing-mp3s-in-ruby-on-windows.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8712623323474199769/posts/default/8113942018418819016'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8712623323474199769/posts/default/8113942018418819016'/><link rel='alternate' type='text/html' href='http://viksalgorithms.blogspot.com/2012/01/playing-mp3s-in-ruby-on-windows.html' title='Playing MP3s in Ruby on Windows'/><author><name>Vik Paruchuri</name><uri>http://www.blogger.com/profile/14436550620869189803</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8712623323474199769.post-4664167274065059563</id><published>2012-01-10T18:35:00.000-08:00</published><updated>2012-01-18T07:50:37.283-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Finance'/><category scheme='http://www.blogger.com/atom/ns#' term='Kaggle'/><category scheme='http://www.blogger.com/atom/ns#' term='Algorithms'/><category scheme='http://www.blogger.com/atom/ns#' term='R'/><title type='text'>Introduction to Kaggle Algorithmic Trading Challenge</title><content type='html'>I recently participated in the Kaggle Algorithmic Trading Competition under the username VikP.  For those who do not know what &lt;a href="http://kaggle.com"&gt;Kaggle&lt;/a&gt; is, it is a web site where individuals and corporations can host data analysis competitions.  This particular &lt;a href="http://www.kaggle.com/c/AlgorithmicTradingChallenge"&gt;competition&lt;/a&gt; involved the prediction of how the prices of 50,000 observations of 102 different securities at the tick level recovered after both buyer and seller initiated liquidity shocks.&lt;br&gt;&lt;br&gt;Each competitor was provided with approximately 750,000 rows of training data, each of which corresponded to a separate liquidity shock event. Each row contained observations of the bid and ask prices and an event indicator(trade event or quote event) for the 50 time points immediately preceding the liquidity shock, and the bid and ask prices alone for the 50 events immediately following the event.  There was also metadata provided, such as the number of shares that were traded to create the liquidity shock event and whether the event was buyer or seller initiated.&lt;br&gt;&lt;br&gt;There were some limitations to what predictions could be made from the data, notably because volume information was missing for each trade.  Because the evaluation metric was RMSE, which valued higher prices stocks much more highly lower priced ones, the competition became heavily dependent on filtering outliers.&lt;br&gt;&lt;br&gt;I had a great time participating, enjoyed the high level of competition, and highly recommend Kaggle competitions to aspiring data miners.  During the competition, I gained many insights into tick data and into analysis of financial data.  I will share some of these insights in the next week or so, but I just wanted to introduce the competition first!&lt;br&gt;&lt;br&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8712623323474199769-4664167274065059563?l=viksalgorithms.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://viksalgorithms.blogspot.com/feeds/4664167274065059563/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://viksalgorithms.blogspot.com/2012/01/introduction-to-kaggle-algorithmic.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8712623323474199769/posts/default/4664167274065059563'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8712623323474199769/posts/default/4664167274065059563'/><link rel='alternate' type='text/html' href='http://viksalgorithms.blogspot.com/2012/01/introduction-to-kaggle-algorithmic.html' title='Introduction to Kaggle Algorithmic Trading Challenge'/><author><name>Vik Paruchuri</name><uri>http://www.blogger.com/profile/14436550620869189803</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8712623323474199769.post-4542393457123998720</id><published>2012-01-10T18:20:00.001-08:00</published><updated>2012-01-18T07:50:23.032-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Ruby'/><category scheme='http://www.blogger.com/atom/ns#' term='Windows'/><category scheme='http://www.blogger.com/atom/ns#' term='parallel'/><title type='text'>Create a New Ruby Process in Windows</title><content type='html'>I recently had a problem whereby I needed one Ruby program to spawn another Ruby program, but I did not need or want the two programs to interact after the second program was instantiated.  I solved this issue by using the system function in Ruby and the Windows start command.&lt;br&gt;&lt;pre class="ruby"&gt;system('start ruby.exe C:\script.rb')&lt;/pre&gt;This will create a new Ruby window, which will run the specified script, and close when it is complete.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8712623323474199769-4542393457123998720?l=viksalgorithms.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://viksalgorithms.blogspot.com/feeds/4542393457123998720/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://viksalgorithms.blogspot.com/2012/01/create-new-ruby-process-in-windows.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8712623323474199769/posts/default/4542393457123998720'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8712623323474199769/posts/default/4542393457123998720'/><link rel='alternate' type='text/html' href='http://viksalgorithms.blogspot.com/2012/01/create-new-ruby-process-in-windows.html' title='Create a New Ruby Process in Windows'/><author><name>Vik Paruchuri</name><uri>http://www.blogger.com/profile/14436550620869189803</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8712623323474199769.post-3086433290133376275</id><published>2012-01-10T18:08:00.000-08:00</published><updated>2012-01-18T07:50:06.463-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Ruby'/><category scheme='http://www.blogger.com/atom/ns#' term='Windows'/><category scheme='http://www.blogger.com/atom/ns#' term='R'/><title type='text'>Using R in Ruby</title><content type='html'>Integrating R into more traditional programming languages can be incredibly rewarding due to R's powerful built-in statistical tools, but it can also be extremely frustrating at times.  Thankfully, like much else to do with Ruby, integrating R and Ruby is quite a simple process.To begin, install the gem rinruby and require it in your script.&lt;br&gt;&lt;br&gt;&lt;pre class="ruby"&gt;gem install rinruby&lt;/pre&gt;&lt;pre class="ruby"&gt;&lt;br /&gt;require 'rubygems'&lt;br /&gt;require 'rinruby'&lt;br /&gt;&lt;/pre&gt;There is no further installation or configuration required.  To evaluate an R expression, use the R.eval command.&lt;br&gt;&lt;pre class="ruby"&gt; R.eval "test=1*1"&lt;/pre&gt;To get a value from R, use the R.pull command.&lt;pre class="ruby"&gt; test=R.pull "test" &lt;/pre&gt;If you are running large code blocks, it is easy to use the R.eval command with the R source function.  This will evaluate the R instructions in a given text file.&lt;pre class="ruby"&gt;R.eval "source('C:/randomscript.R')"&lt;/pre&gt;If you are receiving the below error when attempting to load packages, your R library path is not properly set in your rinruby created instance of R.  &lt;pre class="ruby"&gt;Error in library(x) : there is no package called 'x'&lt;/pre&gt; You can correct this using the R .libPaths() function.&lt;pre class="ruby"&gt;R.eval('.libPaths("C:/Users/R library path")')&lt;/pre&gt;To find out what your R library paths are, you should use the Sys.getenv function in R.&lt;pre class="ruby"&gt;Sys.getenv(c("R_LIBS", "R_LIBS_USER"))&lt;/pre&gt;You should make sure that you set your library paths in the R instance created by rinruby to the same paths that appear when you run the Sys.getenv command in your R GUI.  A list of other useful environment variables can be found &lt;a href="http://stat.ethz.ch/R-manual/R-patched/library/base/html/EnvVar.html"&gt;here&lt;/a&gt;.Another useful function to use in conjunction with rinruby is the setwd command.&lt;pre class="ruby"&gt;R.eval('setwd("C:/Users/Documents")')&lt;/pre&gt;If you replace the file path above with your R working directory path, rinruby will save files in the right spot.&lt;br&gt;&lt;br&gt;&lt;b&gt;Notes:&lt;/b&gt;&lt;br&gt;&lt;br&gt;When typing file paths into Ruby for use in R, one should always use the forward slash(/), or four backslashes(\\\\), as single and double backslashes will not escape properly.&lt;br&gt;&lt;br&gt;Note that using the R.pull command with multiple R instructions(or using it with the source() function) will often cause long runtimes, or will not complete.  Using the R.eval command to evaluate scripts, and then using the R.pull command to pull variables out after the script has finished executing will solve this issue.&lt;br&gt;&lt;br&gt;You can find more information on this at the &lt;a href="https://sites.google.com/a/ddahl.org/rinruby-users/"&gt;rinruby&lt;/a&gt; site.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8712623323474199769-3086433290133376275?l=viksalgorithms.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://viksalgorithms.blogspot.com/feeds/3086433290133376275/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://viksalgorithms.blogspot.com/2012/01/using-r-in-ruby.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8712623323474199769/posts/default/3086433290133376275'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8712623323474199769/posts/default/3086433290133376275'/><link rel='alternate' type='text/html' href='http://viksalgorithms.blogspot.com/2012/01/using-r-in-ruby.html' title='Using R in Ruby'/><author><name>Vik Paruchuri</name><uri>http://www.blogger.com/profile/14436550620869189803</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry></feed>
