February 2011


More random R tips that I had a hard time Googling.

When you're writing SQL queries you will often want to do something like SELECT * FROM table1 WHERE id IN (SELECT id FROM table2). Of course you don't need to do it with a sub-query. But the point is, often you will want to pull out only those rows that have an id which appears in another result set.

I need to do this all the time in R. For example, I have a dataframe in which each row has a unique ID. I do a bunch of processing and end up with a vector if ids that meet my criteria, and I would like to pull out only those rows from the original dataframe which are in this new vector of ids. Of course there are many ways to do this, but I think the SQL-like way is straightforward, and it turns out it's easy to do it in R, if only you can find the damned documentation! Well, here you go: use the %in% operator. It works just like it does in a SQL query:

result <- data[data$id %in% subset.ids,]

This says select all columns from data where the data$id field is in the vector called subset.ids. I find this to be super useful, hopefully you will too!

File this one under "Oh my god I will seriously jam this pen in my eye."

About a week ago I noticed a warning from Symantec Endpoint Protection telling me virus definitions were out of date. (I am running Windows 7 64-bit, incidentally.) Running LiveUpdate manually threw an error. Err, what?

So, suspecting some form of abstract corruption, I used the Symantec Cleanwipe Tool to remove all traces and reinstalled. No dice. The installation would get only so far and then roll-back with an annoyingly non-specific error.

I tried many fixes, finally found the right one. The problem with the aborted installation and the original failure of LiveUpdate were related to a DCOM conflict. Solution is detailed here. I have a feeling if I had known I could have implemented this solution without re-installing SEP to begin with. Argh.