\ How-to | TechnoTaste

How-to


More random R tips that I had a hard time Googling.

When you're writing SQL queries you will often want to do something like SELECT * FROM table1 WHERE id IN (SELECT id FROM table2). Of course you don't need to do it with a sub-query. But the point is, often you will want to pull out only those rows that have an id which appears in another result set.

I need to do this all the time in R. For example, I have a dataframe in which each row has a unique ID. I do a bunch of processing and end up with a vector if ids that meet my criteria, and I would like to pull out only those rows from the original dataframe which are in this new vector of ids. Of course there are many ways to do this, but I think the SQL-like way is straightforward, and it turns out it's easy to do it in R, if only you can find the damned documentation! Well, here you go: use the %in% operator. It works just like it does in a SQL query:

result <- data[data$id %in% subset.ids,]

This says select all columns from data where the data$id field is in the vector called subset.ids. I find this to be super useful, hopefully you will too!

File this one under "Oh my god I will seriously jam this pen in my eye."

About a week ago I noticed a warning from Symantec Endpoint Protection telling me virus definitions were out of date. (I am running Windows 7 64-bit, incidentally.) Running LiveUpdate manually threw an error. Err, what?

So, suspecting some form of abstract corruption, I used the Symantec Cleanwipe Tool to remove all traces and reinstalled. No dice. The installation would get only so far and then roll-back with an annoyingly non-specific error.

I tried many fixes, finally found the right one. The problem with the aborted installation and the original failure of LiveUpdate were related to a DCOM conflict. Solution is detailed here. I have a feeling if I had known I could have implemented this solution without re-installing SEP to begin with. Argh.

I've been on the verge of seppuku for an hour now because of problems trying to install Eclipse plugins on Win 7. I use the same automatic install feature I always have, installing from an update site. Everything seems to work fine, but when Eclipse restarts, there's no plugin. It appears in the list of install software, but no plugin. Checking out the eclipse directories, I see that the proper plugin files aren't there. Argh!

Well, it appears that on my machine one needs administrator permission to change the Program Files directory. I'm not sure why it became necessary, but the fix is simple: Run Eclipse as an administrator. To do that, just navigate to the directory, right click on the file, and click "Run as Administrator." You can also set it to run that way permanently by right-clicking on the file, choosing "Properties", then the "Compatibility" tab, and checking "Run this program as an administrator".

Bleargh!

Here's something I never thought I'd have to do, but a recent journal submission requires that tables be submitted in .doc format — despite the fact that they fire everything back into LaTeX when they're typesetting the final journal anyway! Well, I guess there's just no arguing, so I had to figure out how to convert a LaTeX table to Word format. This process turned out to be surprisingly easy. Here's what I've figured out to be the easiest way to do this:

  1. Get Latex2RTF, which is an open source converter for both Windows and Linux. Install it.
  2. The GUI front-end that came with the software didn't work, so I skipped directly to the command line. I navigated to the proper directory – for me that's C:\Program Files\latex2rtf. The basic command worked great for me: "latex2rt {path to table}\table1.tex" Remember that on Windows, if you have spaces in your directory names you may need to put the whole path to the file in quotes. It'll do this for you automatically if you use the tab autocomplete when typing the path.
  3. Latex2RTF seems pretty complete, but there are still one or two latex commands that it just won't handle. For example, I had to remove the @{} format from my table declaration, which was no big deal. Finding the commands that bonk is easy, just use the debugging option at level 4, like "latex2rt -d4 {path to table}\table1.tex"
  4. I got an almost perfect RTF out of this process, but make sure to check the table carefully. I found, for example, that it didn't convert my text daggers, so I had to add them back. After a minimum of fiddling with font sizes, row spacing, etc., save as .doc, CELEBRATE!

Another case of a lost day and some failure of Google. Here's what went down (going back a bit):

  1. A few weeks ago I installed the latest Vista service pack. The installation apparently reset my preferences for automatic updates. I always choose 'Download updates and notify me' rather than the 'automatic install' option. But, SP2 reverted it to the auto. install option, with the time set for 3am.
  2. Vista wakes my laptop from sleep in the middle of the night, installs updates that require reboot. The nag comes up, counts down, then force quits all of my open programs, notably MS Word, and reboots the computer.
  3. I wake up to find that my copy of Word is completely and totally shazonked. There are a variety of bizarre symptoms. It starts, then hangs. It starts, doesn't crash, but I can't move the mouse. It starts, lets me access the options menu, but crashes when I try to change anything. You name it, I saw it.

Here's the failure of Google. What do I search for? At first, I think it's about the EndNote add-in. I remove it, no dice. I try to repair the install. I try to repair the install from the original DVD. Nothing. I'm trying to disable all the templates and add-ins, but it crashes when I try to change anything. And here's the kicker… you can start Word from the command line with the '/a' operator to load it without any templates and add-ins if you think one is corrupted and causing a problem. But when you do that, it's not like booting in safe mode where you can view and change the normal configuration. You go look at the options and find out that no add-ins or templates are there. Argh!

Anyway, enough back story. To the solution: create a system restore point, and then delete the registry key at HKEY_CURRENT_USER\Software\Microsoft\Office\version number\Word\Data. Word will regenerate it automatically the next time you start it. Apparently, force quitting can cause the data stored in that key to be corrupted and cause all manner of crazy problems. Sheesh.

I just wanted to share the solution to a problem with Dreamhost that I just ran in to. A site I designed for a client a few months back stopped working all of a sudden. No one had touched the code. I can access the root page, but none of the interior pages are working. Clicking on the link returns a 'No Input File Specified' error in Firefox, and a 404 in IE.

The problem was with the .htaccess redirect I use to create SEO friendly URLs. I use the method that I wrote about here, and which works on many, many other sites I host on Dreamhost. So, I'm thinking WTF?

Of course, the Dreamhost support people were no help. I figure there must have been some kind of configuration change on the server-side… no other way to explain this. And I notice that the server we're on was recently upgraded to Apache 2.2.11. Then I find a forum thread started by someone who's having my exact problem. And based on a tip in one of the posts, I made a one line change in my .htaccess that solves the problem:

Before: RewriteRule ^(.*)$ index.php/$1 [L]

After: RewriteRule ^(.*)$ index.php?/$1 [L]

Adding that question mark did the trick. I'm still not sure why it's now required, or why it changed suddenly. I'd be grateful to anyone who can explain that to me!

R is a wonderful platform for statistics, not least of all because of the large number of resources out there. There are packages and notes on just about anything you'd want to do. But given the volume of stuff, it's sometimes very hard to find exactly what you need.

For example, yesterday I was looking for the best way to randomize a vector. I basically wanted the equivalent of the shuffle function in PHP. I had a data set, and I want to sample randomly from the observed distribution N number of times. This is a common thing if you're doing any kind of bootstrapping.

Anyway, Google totally let me down. Although I did eventually figure out the best way to create a random vector. The first thing I tried was generating an array of N random indexes using the runif command. This was fine, except to apply them I had to create a huge for-loop, which is very, very slow. It would have taken 10 days to finish, literally.

Then I took a cue from the description above, and figured out the best (and by far the fastest) way is to use the sample command. Very, very fast. Usage is:

sample(x, size, replace = FALSE, prob = NULL)

x: vector to sample from
size: # of times to sample
replace: sample with or without replacement (sampling with replacement means a value is NOT removed from the pool once it's been randomly selected, so it could be selected many times)
prob: a vector of probability weights for x (I didn't use this)

Hopefully I've filled this post with enough keywords so others can find it!

I have to deal with fairly large database imports from time to time, which can be a bit of a challenge. phpmyadmin fails pretty miserably when dealing with large database dumps or database imports. Either the SQL files are too large, or the import process takes too long. There are lots of ways around this, but after trying most of them I've found what I think is the easiest way to handle large database imports into MySQL on Windows.

All credit due to Greg at dittio.net who posted a tutorial on this. My only contribution is to say that this works on Windows as well. Here's how I did it:

  1. To make things easy, I moved the SQL file I was importing to a root directory – C:\ or similar.
  2. Run mysql. I don't have it in my path, so I had to navigate to the proper directory to find the .exe. I use the excellent XAMPP, so for me that was C:\xampp\mysql\bin\. On the laptop that I use for development I have no password set for the root user (security alert! OMG!), so the command was just 'mysql -u root'. If you had a password set the command would be 'mysql -u root -p'.
  3. This tutorial assumes there's already a database you want to use. If not, look here for a tutorial on how to create one. Otherwise, switch to that database using the command 'USE {databasename}'.
  4. Then just identify the source SQL file using 'SOURCE {pathtofile};' So, for me that was 'SOURCE C:\dbDump.sql'. Don't forget that semi-colon!

That's it. The import of a 3mb SQL file took about a second. Swoosh.

Here's another issue I couldn't find a simple answer to. I'm building a website in PHP, and I want a simple front controller. In other words, I want every bit of traffic to run through a central script that farms out the traffic by parsing the URL. And I want it to be silent – I don't want the user to know the redirects are happening. Now, I could do this with query strings – that's like http://mysite.com/index.php?page=foo&action=bar But that's ugly, and it gets complicated fast.

Instead, I want to do it in the user-friendly, semantically meaningful, SEO-ok way, like: http://mysite.com/foo/bar I want to use the path structure instead of the query string. This is a common pattern for Ruby and Python, but maybe less so in PHP. So, here's how I did it.

(Note: This isn't rocket science. Yet it was still hard to find. And I'm sure there are 1000 ways this could be done better, cleaner, faster, more. But I'm a functional coder. I don't care if it's the most elegant, I care if it works. And this works.)

There are two parts to the controller. The first is the .htaccess file. I'm using XAMPP on Windows Vista as my development environment. If you need help getting .htaccess to work there, check out my earlier post.

RewriteEngine on
RewriteCond $1 !^(index\.php|lib|parts|pages|images|robots\.txt)
RewriteRule ^(.*)$ /your/webroot/index.php/$1 [L]

Here's what this does. (Full disclosure: I adapted this from CodeIgniter's model…) The first line enables mod_rewrite. It's a must. The second line sets the conditions for the rewrite. It says, rewrite everything except what's listed in the parenthesis. If you have additional directories that you keep images, css files, or other things in, you just add them to the list in parens with a '|' between. Finally, the last line sends everything through the main controller, which is index.php, but without actually changing the URL in the address bar. This part is what makes the whole thing transparent to the user. Good stuff.

Ok, so now, we can type something like 'http://mysite.com/foo/bar' and it will silently redirect to 'http://mysite/com/index.php/foo/bar'. So now all we have to do is set up index.php to handle the incoming request. Now, there are lots of more object oriented ways to do this, but for my purposes, the simple procedural way works best: a switch statement.

< ?
//A file with all the common configuration, like web roots, security,
//database, language, etc.
require_once('lib/config.php')

//Your header, the same regardless of the page
require_once('parts/header.php')

//This is the content area that will change based on the URL. 
//My div is called 'textarea'. Yours might be called something else.
//echo "
"; //The switch statement $url = substr($_SERVER['REQUEST_URI'], strlen(URLROOT)); switch($url){ case (''): require_once('pages/root.php'); break; case ('foo'): require_once('pages/foo.php'); break; case ('bar'): require_once('pages/bar.php'); break; default: //the default is an error! require_once('pages/error.php'); break; } //End your content div here //echo ''; //Your footer, the same regardless of the page require_once('parts/footer.php') ?>

That's it. No mystery. In that first line of PHP ($url = …), I get the path string that's in the address bar – so this will show what the user typed before we did the redirect. Then, in my config.php I've set a global variable called 'URLROOT' that corresponds to the path in my local environment. Using substr, I snip out only the part of the path that comes after the root, and feed that to my switch statement.

Like I said, I'm sure a more experienced coder has 1000 better ways to do this. But, give it a try. It worked for me!

Recently I was trying to get .htaccess control working under Windows and XAMPP. I found it surprisingly hard to find the answer, even though it's a simple one. You've got to make two changes to the httpd.conf file, which for me was found in C:/xampp/apache/conf.

  1. Search the httpd.conf file for 'mod_rewrite'. You'll find the statement that loads that module is commented out. Remove the '#' at the start of the line to un-comment it.
  2. By default, XAMPP on Windows does not allow .htaccess files to override the httpd.conf file. Search the httpd.conf file for 'AllowOverride'. You should find one or more statements that say 'AllowOverride None'. Wherever you find it (or more selectively, if you're savvy like that), change it to 'AllowOverride All'.

Don't forget to restart Apache, and you're done!