Research Methods | TechnoTaste

Research Methods

The Personas Project out of the MIT Media Lab's Sociable Media Group has been making the rounds lately. It uses data mining and natural language processing (NLP) to gather data about you from the web and distill it into a pretty infographic that's supposed to represent 'how the web sees you.' Here's mine:

My Results from the Personas Project



Like any art project, this one seems intended to make us think about how we're portrayed on the web and about the data that's out there floating around. A project like this is intended to inspire an 'OMG, is this how the web sees me?' reaction. It's meant to shock and awe us by getting things right, and perturb us by getting things wrong. It's one of many projects that try to do this sort of thing, and I think it's getting attention right now partly because it comes from MIT and partly because it's very pretty. The authors are due a lot of credit, however, for recognizing the limits of their tool:

[The Personas Project] is meant for the viewer to reflect on our current and future world, where digital histories are as important if not more important than oral histories, and computational methods of condensing our digital traces are opaque and socially ignorant.

I love that last clause. Well put. I think data mining is particularly opaque and socially ignorant when it's employed for abstract purposes. There are lots of tightly wound, well-scoped questions that we can answer with data-mining techniques. I use data mining as a tool myself, but I use it to gather evidence of behavior in support of very narrow, specific claims. As tools for telling us how we're viewed on the web, though, these techniques stink. I'd imagine that most researchers know they stink. But we're still using them as a primary tool to talk about social processes, even though social processes require context and timelines that data mining can't begin to capture. Why is that? Why haven't we seen more backlash against these socially blind methods? Why haven't we seen more studies that mix data mining with qualitative methods, for example, to lend context where it's lacking? I suppose it's because data mining is easy: computation and storage are cheap, and talking to people is hard.

I hope that this sort of tool will reveal the danger of these methods, and encourage researchers to advance the state of the field. We've dwelled too long on the digital traces / privacy meme. At this point it's just getting tired and exploitative of the digital paranoia that's rampant in the media right now. We need to get past it.

I just read through Oded Nov's paper from Communications of the ACM:

Nov, O. (2007). What motivates Wikipedians? Commun. ACM, 50(11), 60-64. (link)

Two things occur to me. First, Nov explains away the potential influence of social desirability in about two sentences, but I'm not buying it. When you ask people why they do something, a huge number of social factors come into play. In the case of Wikipedia I also think there are likely to be lots of soft and implicit attitudes. Soft attitudes are expressions that don't reflect beliefs, but rather answers to questions someone might not have thought about previously. For example, if I asked you "How do you feel about Kobe Bryant elbowing Ron Artest in the neck last night?", you might respond by saying it's abhorrent. If I took that at face value, I'd be ignoring the fact that many people don't know about basketball, don't know who Bryant or Artest are, don't know the context, or don't care. Implicit attitudes, on the other hand, are attitudes that we hold and act on but can't express. To me, neither of these issues makes survey research of this type invalid – I do similar surveys myself! But they're important issues, too often left out of discussions.

The second issue, maybe more important, is about scope. There are a fair number of studies now about motivations for contributing to various online collective actions, but they almost always focus on people who contribute a lot. These papers, like Nov's, usually don't make that distinction explicit; they make claims about motivations for all contributors. In reality, the motivations of casual or infrequent contributors are likely to be very, very different. Harder to study, though! By studying the heavy contributors we capture the motivations behind the majority of the work that gets done, but we do that at the expense of attention to the vast majority of people who contribute.
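That trade-off is easy to see with a toy simulation. This is a minimal sketch with invented numbers – a generic heavy-tailed (Pareto-like) edit distribution, not Wikipedia's actual data – just to show how the top slice of contributors can produce most of the work while the typical contributor does almost none:

```python
import random

random.seed(42)

# Hypothetical population of 10,000 contributors whose edit counts follow
# a heavy-tailed (Pareto-like) distribution. Illustrative only.
edits = sorted((int(random.paretovariate(1.2)) for _ in range(10_000)),
               reverse=True)

total = sum(edits)
top_10pct = sum(edits[: len(edits) // 10])

print(f"Top 10% of contributors produce {top_10pct / total:.0%} of all edits")
print(f"Median contributor made {edits[len(edits) // 2]} edit(s)")
```

Survey the top decile and you've covered most of the edits, but you've said nothing about the median contributor, who barely edits at all – and whose motivations may look nothing like a power user's.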

In sum: Social desirability, soft attitudes, etc. need more consideration when we talk about motivation. Studies that focus on heavy contributors should say as much, and more studies should look at casual contributors' motivations.

Slashdot has syndicated a story about some research claiming that Facebook use is correlated with getting worse grades in college. Apparently:

…Facebook user GPAs were in the 3.0 to 3.5 range on average, compared to 3.5 to 4.0 for non-users. Facebook users also studied anywhere from one to five hours per week, compared to non-users who studied 11 to 15 or more hours per week.

If this seems fishy to you, you're not alone. At least the author of the press piece communicates the researcher's note that correlation isn't causation. The researcher herself (a doctoral student from Ohio State named Aryn Karpinski) seems convinced that what she's seeing is an unobserved-variable problem. I think that's likely to be true, but I'd also guess there's a huge bias in this type of self-report data. I'm guessing that, on average, college students use Facebook about the same regardless of their GPA. But – and this is a big but – if you're a person who's getting good grades, you probably also carry around a set of social norms about what you should be doing with your time. So when someone asks how much time you spend on a distraction, you're likely to under-report your time on Facebook and over-report the time you spend studying. That would be especially true for something like Facebook, which increasingly carries a stigma as a frivolous time-sink.
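My conjecture is easy to demonstrate in miniature. The sketch below uses entirely made-up numbers (not Karpinski's data): every student has the same true Facebook use regardless of GPA, but high-GPA students shave more off their reported hours. The reported data then shows a correlation that the true data doesn't have:

```python
import random

random.seed(0)

n = 1_000
# Invented population: GPA and true Facebook hours are independent.
gpa = [random.uniform(2.0, 4.0) for _ in range(n)]
true_hours = [random.gauss(10, 2) for _ in range(n)]  # same for everyone

# Reporting bias: the higher your GPA, the more you under-report.
reported = [h * (1 - 0.15 * (g - 2.0)) for h, g in zip(true_hours, gpa)]

def pearson(xs, ys):
    """Pearson correlation coefficient, computed from scratch."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

print(f"GPA vs. true hours:     r = {pearson(gpa, true_hours):+.2f}")
print(f"GPA vs. reported hours: r = {pearson(gpa, reported):+.2f}")
```

The true correlation hovers near zero while the reported one comes out clearly negative – the same pattern the survey found, produced entirely by who's willing to admit their hours.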

In fairness, the researcher in this case seems to have worked as hard as possible to communicate her findings, and had her story twisted through the popular press. Blogger Ted Shelton wrote a fairly snarky piece about her (OSU Researcher Discovers Dorks), to which she responded directly, and Ted posted the response (and an apology):

The main thing to remember is that this research is correlational, which the media does not seem to understand (no surprise). I am not saying that Facebook CAUSES poor academic performance. I am saying that the research shows that there is a RELATIONSHIP between Facebook use and academic performance. There are a host of third variables that need to be examined that are potentially influencing this relationship such as personality, work, extracurricular involvement, other distractions, etc. Also, I'm sure that if it wasn't Facebook it would be another distraction. See how they twisted my words? Fun fun…

Amazon Web Services is making a variety of large data sets available in the cloud. This is great news, as these giant data sets are often difficult to find, compile, and host.

So far the list of data sets includes some biological and chemical data, census info, and labor reports. I'd love to see this list grow to include the complete history of the GSS, for example. Amazon should also keep a complete, unpacked, current dump of Wikipedia in the cloud. The complete XML dump of the English-language Wikipedia with all revisions is in the tens of terabytes, I think.

I know many people who shy away from doing statistics in R, for various good reasons. R is hard. Many of the things that make it great are things that only experienced coders can take advantage of. If you've never programmed before, learning stats in R means learning statistics and learning to code at the same time. That's nuts.

But if you're already comfortable with basic coding, R is wonderful. Still, the syntax is a lot to remember, and I have a hard time keeping it in my brain when it's been a few weeks since my last R analysis session. I use two wonderful resources to refresh my memory and learn new things:

  1. Quick-R – While the site is branded as a way for SPSS/SAS/Stata users to learn R, it's really just the best all-around resource for 90% of the analysis and visualization you'll want to do on a daily basis. The site provides great, simple explanations, sample code, and pointers to packages that have lots of shortcut methods.
  2. R Wiki – The R project itself has a great wiki with lots of detailed info and code samples on many R functions. If it's not at Quick-R, this is where I'm going next.