getting a feel for what’s out there in the world

There are lots of bits and pieces of information around that seem to tell you about what’s going on in the world, but even the best of these are, finally, someone’s subjective judgement call about what they think is happening. They may be professionals, but I want to get a feel for myself of what’s happening, not just accept the opinions of others. For example, there is usually a pretty serious gap between the politics of Berkeley, California and the rest of the U.S. There’s no easy way to get a sense of how big the gulf is, and no easy way to figure out why anyone would want someone like Sarah Palin as Vice President.

Of course, I can immerse myself in firehoses of data of all kinds: populate my RSS reader with Republicans, watch 500 channels, visit 10,000 folksy web sites. But I really just want broad themes, changes, and trends. I want an engine that analyzes and summarizes lots of raw data into a few groups of things that are similar.

scratching down the data

And no, this isn’t some smirky internet-age bullshit! Based on Ben Fry’s recommendation in his book, I got a copy of Exploratory Data Analysis via inter-library loan. It was published in 1977, before most people knew what a PC was, and waaay before anything like the Internet. In the very first chapter, John Tukey starts by talking about making simple “stem and leaf” graphs with paper and pencil of the megawatts generated by hydroelectric dams in the U.S. Why? He wants to show you how to “write down a bunch of numbers in such a way as to give a general feel of ‘what they are like.’” What a great thing! I don’t care too much about the numbers, I just want to know if there’s any rough pattern there, and there is: most of these dams either generate about 30 kilowatts, or around 1,100, two groups that tell me where to start if I want to know more about that (I don’t).
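A stem-and-leaf display is as easy to sketch in code as it is on paper. The Python below is a rough illustration, not Tukey’s procedure verbatim: the function name `stem_and_leaf`, the `leaf_unit` parameter, and the sample numbers are all made up for this example (they are not the dam figures from the book), and it assumes non-negative numbers.

```python
from collections import defaultdict

def stem_and_leaf(values, leaf_unit=1):
    """Return the rows of a stem-and-leaf display as strings.

    Assumes non-negative numbers. Each value is cut down to a whole
    number of leaf_unit; its last digit becomes the leaf and the
    remaining digits become the stem.
    """
    rows = defaultdict(list)
    for v in sorted(values):
        scaled = int(v // leaf_unit)
        stem, leaf = divmod(scaled, 10)
        rows[stem].append(leaf)
    lines = []
    # Print empty stems too, so gaps in the data stay visible.
    for stem in range(min(rows), max(rows) + 1):
        leaves = "".join(str(d) for d in rows.get(stem, []))
        lines.append(f"{stem:>3} | {leaves}")
    return lines

# Made-up numbers, for illustration only.
sample = [12, 15, 21, 23, 23, 28, 31, 47, 49, 52]
for line in stem_and_leaf(sample):
    print(line)
```

Each row is one stem (the tens digit here) followed by its leaves (the ones digits), so a long row means a lot of values landed in that range. That is the whole trick: the shape of the display gives you the rough pattern before you read a single number closely.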

In information retrieval land, this is known as clustering (one of my hobby horses that I usually talk about with anyone I know for longer than 30 minutes). Instead of a laundry list of thousands of things, I get a set of groups of the words that are used together often. I can get a feel for what’s happening in the data without sorting through it myself.

For example: I vaguely know who Kenzaburō Ōe is, but I don’t really know what his books are about. I just want a feel for that, with some idea of what the themes of his books were. I can either read a page of Google search results, or I could see derived clusters for those documents. To me, seeing the groups is a much better interface to start exploring if I want to know about him (and I do).
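To make the idea of grouping by shared words a bit more concrete, here is a minimal sketch in Python. It is a toy stand-in, not how any real clustering engine works: it uses a greedy single-pass grouping with Jaccard similarity on word sets, and the function names (`tokens`, `jaccard`, `cluster`), the `threshold` value, and the four sample "documents" are all assumptions made up for this example.

```python
def tokens(text):
    """Lowercase a document and return its set of distinct words."""
    return set(text.lower().split())

def jaccard(a, b):
    """Fraction of words two sets share: 0.0 (nothing) to 1.0 (identical)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def cluster(docs, threshold=0.2):
    """Greedy single-pass grouping.

    Each document joins the first cluster whose first member is
    similar enough; otherwise it starts a new cluster. Returns a
    list of clusters, each a list of document indices.
    """
    clusters = []
    for i, doc in enumerate(docs):
        words = tokens(doc)
        for group in clusters:
            if jaccard(words, tokens(docs[group[0]])) >= threshold:
                group.append(i)
                break
        else:
            # No existing cluster was close enough.
            clusters.append([i])
    return clusters

# Made-up documents, for illustration only.
docs = [
    "cats sleeping on the sofa",
    "election polls and politics",
    "my cats love the sofa",
    "politics and election news",
]
print(cluster(docs))
```

On these four lines it puts the two cat documents in one group and the two politics documents in another. Real text is much messier than this, and a serious attempt would at least drop common words like "the" and weight rare words more heavily, but the basic move is the same: compare which words show up together and let the groups fall out.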

The motivation to get a sense of what’s out there in the world is an old one; it goes back farther than 1977, even. Now that more and more of the stuff of everyday life (cats, politics, hobbies, cancer, errands, religion, etc.) is in the form of data online, I can imagine doing that. I should stress that clustering techniques applied to social data still produce very uneven results, and a lot more work is needed. But it’s becoming possible to think of a better interface to information than a search box.
