posted by Andrew Mahon on Jan 31st, 2010
VoxPop is designed to run on one or many Amazon EC2 instances. Here I will outline the steps to start an EC2 instance, and get VoxPop up and running.
Running VoxPop requires Python 2.5 with NLTK, CouchDB, Beanstalkd, and Memcached.
Assumptions
In walking through my process, I will make a couple of assumptions:
1. I assume that the reader is on an OS X machine. While most of the directions should translate pretty smoothly to any Unix-based system, they are written for an OS X environment.
2. I assume the reader has a cursory understanding of how to use Terminal. While I will try to provide comprehensive directions, some experience would probably be good.
3. I assume that the reader will use TextMate. TextMate is a great (and highly recommended) text editor for OS X. TextMate provides a shell command, ‘mate’, that opens files from Terminal. If you are not using TextMate, please replace ‘mate’ with the command for the text editor of your choice.
Start EC2 Instance
Amazon Web Services EC2 makes it easy to start a remote server instance with very short lead time.
To get started with EC2, you will need to start an account with Amazon Web Services, which you can do at http://aws.amazon.com/.
Once you have AWS Credentials you should download and install the Elasticfox plugin for Firefox. Elasticfox provides a GUI for starting and managing AWS EC2 instances.
The first thing we need to do in Elasticfox is enter your AWS credentials in the ‘Credentials’ panel, accessed with a button at the top of the Elasticfox interface. Elasticfox allows you to maintain multiple sets of credentials. AWS Credentials can be found on the AWS site under Account->Security Credentials.
Before we can actually use EC2 instances, we need to set up an SSH key to identify your local machine. In Elasticfox, open the ‘KeyPairs’ tab, and create a new key by pressing the green key button. You will be prompted to enter a name for the new key. I chose ‘voxpop_awmpro,’ describing both the instance purpose and my local machine.

Once you have entered a name, a download will begin; I suggest you save it to your home directory. The file will be named id_ plus the name you selected, so in my case the file is named ‘id_voxpop_awmpro’. Let’s open Terminal and move the file into the .ssh directory in your home directory. Once the key file is in the .ssh directory, we need to restrict its permissions, because the SSH client requires that private keys be accessible only by their owner. We will change the file permissions using ‘chmod’. After changing the private key’s permissions, we need to open (or create) the SSH config file and add a reference to the new key.
Open Terminal and type the following, replacing ‘id_voxpop_awmpro’ with the name of your key file:
mv ~/id_voxpop_awmpro ~/.ssh/id_voxpop_awmpro
chmod 600 ~/.ssh/id_voxpop_awmpro
mate ~/.ssh/config
Once ~/.ssh/config is open in your text editor, add the line (again, replacing id_voxpop_awmpro with your key filename):
IdentityFile ~/.ssh/id_voxpop_awmpro
posted by Andrew Mahon on Jan 27th, 2010

The above diagram represents the VoxPop server architecture for the first iteration of design.
One backend instance of VoxPop consists of two processes, the Producer and the Worker. The Producer is responsible for serving pages and data, receiving requests, and creating tasks. The Worker is then responsible for executing the tasks. The two processes are linked by the Beanstalkd task queue and by the Memcached and CouchDB persistence layers.
As the Producer receives requests, it adds tasks to the Beanstalkd task queue. Beanstalkd is a lightweight task queue originally developed for Causes on Facebook. Beanstalkd allows tasks to exist in multiple ‘tubes,’ each of which can be prioritized and watched separately.
The Worker process is multi-threaded, with at least one thread per Beanstalkd tube. As soon as a thread is ready to accept a task, it pulls one off the queue and executes it. Each task produces either new tasks to be added to Beanstalkd, or results to be persisted into Memcached and CouchDB.
As results are produced, they are persisted in both Memcached and CouchDB. Persisting into Memcached makes results available to the Producer almost instantly, so it can serve them as soon as they are produced. Persisting into CouchDB is a little bit slower, so results are stored there as soon as possible after they are produced. CouchDB allows results to be retrieved indefinitely.
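To make the flow above a little more concrete, here is a minimal sketch of the Producer/Worker pattern. It is not the VoxPop code: it uses only the Python standard library, with queue.Queue standing in for a Beanstalkd tube and plain dictionaries standing in for Memcached and CouchDB. The tube names, the function names, and the placeholder "analysis" step are all assumptions made for illustration, and the sketch targets a current Python 3 rather than the Python 2.5 the project ran on.

```python
import queue
import threading

# Stand-ins (assumptions): queue.Queue plays the role of a Beanstalkd tube,
# and plain dicts play the roles of Memcached and CouchDB.
tubes = {"fetch": queue.Queue(), "analyze": queue.Queue()}
cache = {}      # fast store (Memcached's role)
database = {}   # permanent store (CouchDB's role)
store_lock = threading.Lock()


def produce(query):
    """Producer: turn an incoming request into a task on a tube."""
    tubes["fetch"].put(query)


def handle(tube_name, payload):
    """Run one task: either enqueue a follow-up task or persist a result."""
    if tube_name == "fetch":
        # A hypothetical fetch step produces a follow-up analysis task.
        tubes["analyze"].put(payload)
    else:
        result = payload.upper()  # placeholder for the real analysis
        with store_lock:
            cache[payload] = result     # quickly available to the producer
            database[payload] = result  # kept indefinitely


def work(tube_name):
    """Worker thread: one per tube, pulling tasks until told to stop."""
    tube = tubes[tube_name]
    while True:
        payload = tube.get()
        if payload is None:  # sentinel value: shut this thread down
            tube.task_done()
            break
        handle(tube_name, payload)
        tube.task_done()


threads = [threading.Thread(target=work, args=(name,)) for name in tubes]
for thread in threads:
    thread.start()

produce("obama")
produce("bush")

# Wait for each stage to drain, in pipeline order, then stop the workers.
tubes["fetch"].join()
tubes["analyze"].join()
for tube in tubes.values():
    tube.put(None)
for thread in threads:
    thread.join()

print(sorted(cache.items()))
```

The point of the sketch is only the shape of the system: tasks can spawn further tasks on other tubes, and finished results are written to a fast store and a durable store at the same time.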
posted by Zeke Shore on Jan 27th, 2010
Back in November we had the honor of presenting our progress to the R&D department at the New York Times, followed by a similar presentation to Dave Carroll, our upcoming thesis advisor for the final semester of the project. Feel free to check out the slide deck; however, links to the working prototypes may be down (running servers is expensive!).
VoxPop Midpoint Presentation
posted by Zeke Shore on Jan 26th, 2010
While we have been good about posting research progress as it comes, progress on the design front has been a bit too quiet. Here are some early iterations of the User Interface design process, and what we are learning as we go.
Working off of the data that we were beginning to generate, our starting point was a collection of user comments for all of the New York Times articles that would be returned for any given query. By parsing through the comments, we could match words against the General Inquirer Dictionary across the three axes of Charles Osgood’s Semantic Differential theory. You can read more about the first version of our emotive analysis process in my previous post on the subject.

So the initial output we decided to shoot for was essentially six lists of words from the comments for each article that is retrieved for a given query. Along the evaluative axis we would have a list of ‘positive’ words (shown above in green) and ‘negative’ words (in red), along the activity axis we would have a list of ‘active’ words (in orange) and ‘passive’ words (in brown), and along the potency axis we would have a list of ‘strong’ words (in blue) and ‘weak’ words (in gray).

Fleshing out the design of this model included four “states.” Collapsing the emotive word lists for each article would yield colored bars extending above and below a baseline of articles. Theoretically, this would reveal trends in the quantities of these emotively charged words over time for discussions surrounding any keyword. Clicking on a specific article would reveal the actual list of words that are being described by the colored bars of the collapsed view. Extending the idea, hovering over any word could potentially show the sentence from which that word was retrieved, hovering over the article title could reveal the abstract of the article, and clicking either would bring the user through to the article or the specific comment on the New York Times website, all in an effort to provide easy contextual access as a validation tool.
So we built a prototype of this visualization. We did not build out all of the interaction levels spec’ed in the initial mockups, but even just getting a list of articles with their corresponding lists of emotively classified words from the discussions surrounding them seemed like a good starting point for exploring the data.

This prototype revealed a lot. The first obvious conclusion is that we are dealing with way more data than could be meaningfully expressed as ‘lists of words’. Even scaling the text size down below legibility did not allow most lists to be viewed in their entirety in a normal web browser window.
Another problem is that the data is really hard to read if you don’t already have a strong understanding of what is going on behind the scenes. This organization does not show the three clear axes that the discussions are being mapped against. Furthermore, this model gives equal weight to all of our emotive axes, despite Osgood’s conclusion that evaluative distinction (positive/negative) carries the most emotive weight, followed by activity and potency as the next two most significant factors.
One more problem that this prototype revealed is the homogenizing effect that results from extracting lists of words at the level of the entire conversation rather than specific comments. One really long nasty comment could skew the negative word count for an entire conversation when looking at the data at this level of abstraction, and that sort of misrepresentation could be a serious cause for concern. The project is called VoxPop, stemming from the Latin term Vox Populi, meaning “voice of the people.” This visualization attempt was not yet showing the voices of any ‘people’… rather averaging out the ebbs and flows of entire conversations.
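To make the skew concrete, here is a small worked example. The numbers are made up for illustration and are not project data; the comparison with a per-comment median is just one possible way to look at it, not something the prototype did.

```python
# Hypothetical negative-word counts, one entry per comment in a conversation.
negative_counts = [1, 0, 2, 1, 40, 0, 1]  # one very long, hostile comment

# Conversation-level view: everything is pooled into one number.
total = sum(negative_counts)
outlier_share = max(negative_counts) / total

# Comment-level view: the median comment is barely negative at all.
ordered = sorted(negative_counts)
median = ordered[len(ordered) // 2]

print("conversation total:", total)
print("share contributed by the single longest comment: %.0f%%" % (100 * outlier_share))
print("median per comment:", median)
```

In this toy conversation, one comment accounts for most of the pooled count, while the typical comment contains about one negative word.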
More to come on our newer design iterations soon.
posted by Zeke Shore on Nov 15th, 2009
We have been trying to hunt down more information about the General Inquirer Dictionary, since it is currently serving as our primary reference table for emotively evaluating the words within the discussions of New York Times articles. We were able to get in contact with Roger Hurwitz, a research scientist at MIT’s Artificial Intelligence Lab, and one of the GI dictionary’s moderators, who was able to shed some light:
The General Inquirer scores sentiment in texts on the basis of surface text words whose root forms and contextually disambiguated senses mark negative or positive attitudes, per the General Inquirer dictionary. I realize that sounds circular, but there are many such words in the dictionary, so that coverage has proved adequate and results have acceptable inter-coder reliability with scoring of the same texts by human coders. The GI also scores texts in just over 200 other fields or any subset thereof per users’ desires. these fields include expressions of the eight social values that political scientist Harold Lasswell found basic to human social activity. Namenwirth and Weber using the GI and Lasswell values dictionaries to code American political party platforms and speeches from the British throne, respectively, found long and short value cycles in American and English society (following a relative attention paradigm, as measured by frequency of mention.) The book Dynamics of Culture (Boston: Allen & Unwin, 1987) may be out of print. However, an article by Namenwirth lays out the theory and is available online.
So I found and reviewed the J. Zvi Namenwirth study that was published in the Journal of Interdisciplinary History (MIT Press) in 1973. Namenwirth is mostly mapping public values through the content of presidential campaign transcriptions from 1844 to 1964.
The following two graphs show the frequency of the word ‘wealth’ over time, normalized for transcript length, and begin to reveal some interesting cyclical patterns over the 120-year stretch.


These early natural language processing studies are interesting to look at, partially because of how much was accomplished with so few computational resources available. While word count may be a relatively trivial metric by the standards of today’s NLP capabilities, it does reveal interesting patterns over longer timelines.
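As a rough illustration of what a length-normalized word count looks like, here is a minimal sketch. The per-1,000-words scale, the tokenizing rule, and the two tiny "transcripts" are assumptions made for the example; they are not taken from Namenwirth's study.

```python
import re


def relative_frequency(text, word):
    """Occurrences of `word` per 1,000 words of `text` (case-insensitive)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return 1000.0 * tokens.count(word.lower()) / len(tokens)


# Made-up snippets of different lengths, for illustration only.
transcripts = {
    1844: "Wealth and labor. The wealth of the nation grows.",
    1964: "We seek peace, opportunity, and wealth for all of our people today.",
}

rates = {year: relative_frequency(text, "wealth")
         for year, text in transcripts.items()}

for year in sorted(rates):
    print(year, round(rates[year], 1))
```

Dividing by the transcript length is what lets a short document and a long one be compared on the same scale.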
This seems to validate our efforts to develop a lens through which the pre-aggregated corpus of the web can be analyzed through more rigorous NLP systems, revisiting what the General Inquirer Dictionary might be able to reveal.
The study is not openly published, so I cannot post the PDF on the site, but here is the citation and JSTOR link:
J.Z. Namenwirth, “The Wheels of Time and the Interdependence of Value Change,” J. Interdisciplinary History, 3 (1973): 649-683
Stable URL: http://www.jstor.org/stable/202687
posted by Zeke Shore on Nov 11th, 2009

I wrote previously about Jonathan Harris’ project We Feel Fine from 2006, but it appears that the project has not grown dormant since its initial buzz. Harris has recently completed a book documenting his process of emotive exploration, and has compiled some interesting data over the three years that the project has existed. It’s exciting to see Harris return to printed work, since so many of his projects have lived within the digital realm. That said, the sample pages that he has up on the book’s website are very interesting.

Harris’ playful aesthetic appears to carry through to print form elegantly (I’m excited to see these spreads in the actual context of the book), while still managing to take a refreshingly academic departure from the original project. Reflecting back on the project after three years also adds the notion of time, which was frustratingly absent from its original manifestation.

The book will be available December 1st, published by Simon and Schuster, and it will be finding its way to my bookshelf shortly thereafter.
posted by Zeke Shore on Nov 11th, 2009
Here is a quick update on how our emotive analysis engine is playing out. The end to end process (for this initial prototype) will work as follows:
First, the user provides a search query, and we pull (and cache) all of the NY Times articles that are related to that query that have comments using the Article Search API and the Community API (this will be made more efficient in the near future… more to come on that later).
Once article and comment results for a query are returned, whether from the cache or from a new API call, the input to the Natural Language Processing (NLP) side of the equation will be the comments, in the form of text strings.
Using NLTK in Python, there is an information extraction architecture that is structured as follows:

For our purposes, one of the more difficult challenges that we have is knowing what words we care about. If we are trying to visualize the emotional or affective characteristics of the discourse surrounding a keyword, we cannot just look at the full thread of comments for an article that was returned for a given keyword, and log every word that holds emotive weight. The NY Times article Bipartisan Spirit, at Least for a Moment is a perfect example of why not. The article is about a meeting between President Obama and George Bush Sr. So as one may guess, that article would have been returned when querying either ‘Bush’ or ‘Obama,’ and the 38-comment discussion that follows the article contains references to both.
So before any sort of emotive analysis can occur, we must parse the text down to the words that we care about. This first involves identifying instances of our keyword within each comment, and extracting the sentences that contain the keyword.
For further coverage, and also to account for the fact that web-based comments are often less verbose and less refined than other forms of discourse, if our keyword is a proper noun, we might also look at sentences with pronouns that immediately precede or follow sentences with our keyword.
Ultimately, we will need to develop a comprehensive weighted dependency grammar, so that we can efficiently parse the sentences that we care about into relatively accurate dependency structures. This will allow us to know (with far more precision) what words are referring to or modifying our keyword, and should therefore be emotively classified.
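The first two steps above (keeping sentences that contain the keyword, plus neighboring sentences that contain a pronoun) could be prototyped along the following lines. This is only a sketch under stated assumptions: it uses a naive regular-expression sentence splitter in place of NLTK's tokenizers, the pronoun list is a small hand-picked set, and the function names and sample comment are made up for illustration. It targets a current Python 3.

```python
import re

# A small, hand-picked pronoun list (an assumption, not a complete set).
PRONOUNS = {"he", "she", "him", "her", "his", "hers", "they", "them", "their"}


def split_sentences(comment):
    """Naive splitter on '.', '!' and '?'; a real tokenizer would do better."""
    parts = re.split(r"(?<=[.!?])\s+", comment.strip())
    return [part for part in parts if part]


def words(sentence):
    """Lowercased word tokens of a sentence."""
    return re.findall(r"[a-z']+", sentence.lower())


def relevant_sentences(comment, keyword, include_pronoun_neighbors=True):
    """Sentences mentioning the keyword, plus adjacent pronoun sentences."""
    sentences = split_sentences(comment)
    hits = {i for i, s in enumerate(sentences) if keyword.lower() in words(s)}
    selected = set(hits)
    if include_pronoun_neighbors:
        for i in hits:
            for j in (i - 1, i + 1):
                if 0 <= j < len(sentences) and PRONOUNS & set(words(sentences[j])):
                    selected.add(j)
    return [sentences[i] for i in sorted(selected)]


comment = ("Obama handled the meeting well. He seemed gracious. "
           "Bush was quiet. The lunch menu looked good.")

print(relevant_sentences(comment, "Obama"))
print(relevant_sentences(comment, "Bush"))
```

Note that querying ‘Bush’ on this sample also pulls in "He seemed gracious.", even though that sentence reads as being about Obama. That kind of over-matching is exactly why a proper dependency parse is the longer-term goal.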

So now the fun part. Once we know what words we care about in relation to our keyword, we will go back to Charles Osgood’s Semantic Differential Theory, which maps words along three main axes: the Evaluative (good/bad), the Potency (strong/weak) and the Activity (active/passive), which I have discussed in a previous post. We can do this using the General Inquirer Dictionary, including the Lasswell Value Dictionary and the Harvard IV-4 dictionary, which maps about 12,000 words across Osgood’s semantic differential axes (among other classifications).
To make the process more efficient, since we have tagged the part of speech of every word, we can throw out words that we know should have neutral affective values, such as any determiners (‘the,’ ‘a,’ etc.) or any proper nouns, and map every other word against our three axes. For each axis, we will give a word a value of 1, 0, or -1. On the evaluative (EVA) axis, for example, any word living at the ‘positive’ or ‘good’ end of the axis would hold a value of 1, a word living at the ‘negative’ or ‘bad’ end would hold a value of -1, and words that are neutral on the evaluative scale would hold a value of 0. This system would carry across the activity (ACT) and potency (POT) axes as well, in the form of
affectiveValue(word) = [EVA, ACT, POT]
affectiveValue(respect) = [1, -1, 0]
Here the word “respect” holds an evaluative value of ‘positive’ or ‘good,’ an activity value of ‘passive,’ and a potency value of ‘neutral’ (neither ‘strong’ nor ‘weak’).
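As a sketch of how this lookup might work in code: the tiny lexicon below stands in for the General Inquirer Dictionary, and only the entry for “respect” comes from this post; the other entries, the function name, and the use of Penn Treebank-style part-of-speech tags are assumptions made for illustration.

```python
# Stand-in lexicon: only "respect" is taken from the post; the other
# entries are invented for illustration and are not General Inquirer data.
LEXICON = {
    "respect": (1, -1, 0),   # (EVA, ACT, POT)
    "attack": (-1, 1, 1),
    "gentle": (1, -1, -1),
}

# Penn Treebank-style tags for determiners and proper nouns (an assumption
# about the tag set the part-of-speech tagger produces).
SKIPPED_TAGS = {"DT", "NNP", "NNPS"}


def affective_value(word, tag):
    """Return (EVA, ACT, POT) for a tagged word, or None if it is skipped."""
    if tag in SKIPPED_TAGS:
        return None
    # Words missing from the lexicon are treated as neutral on every axis.
    return LEXICON.get(word.lower(), (0, 0, 0))


tagged = [("The", "DT"), ("respect", "NN"), ("Obama", "NNP"), ("table", "NN")]
for word, tag in tagged:
    print(word, affective_value(word, tag))
```

Returning None for determiners and proper nouns keeps "thrown out" words distinct from words that were looked up and found to be neutral.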
So ultimately this will leave us with six lists of words for each article in relationship to a given keyword, which we can then use as metrics for our data visualization.
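For illustration, here is one way the six lists could be assembled from words that have already been scored. The list names, the function name, and the sample words (other than “respect”) are assumptions for the sake of the example.

```python
# One (high end, low end) pair of list names per axis: EVA, ACT, POT.
AXES = (("positive", "negative"), ("active", "passive"), ("strong", "weak"))


def six_lists(scored_words):
    """Group (word, (EVA, ACT, POT)) pairs into six named word lists."""
    lists = {name: [] for pair in AXES for name in pair}
    for word, values in scored_words:
        for (high, low), value in zip(AXES, values):
            if value > 0:
                lists[high].append(word)
            elif value < 0:
                lists[low].append(word)
            # A value of 0 is neutral on this axis and lands in neither list.
    return lists


# Only "respect" and its values come from the post; the rest are made up.
scored = [
    ("respect", (1, -1, 0)),
    ("attack", (-1, 1, 1)),
    ("gentle", (1, -1, -1)),
]

lists = six_lists(scored)
for name in ("positive", "negative", "active", "passive", "strong", "weak"):
    print(name, lists[name])
```

A word can appear in up to three lists (one per axis), which is why the lists are counted separately rather than treated as a single classification.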
posted by Zeke Shore on Nov 11th, 2009

Christian Swinehart, an MFA student at RISD, recently completed a project that explores the narrative paths of those Choose Your Own Adventure books that were popular in the 1980s. While the topic of exploration might be a bit trivial, the visualization solutions and execution are definitely noteworthy.
Specifically, Swinehart achieves a surprisingly sophisticated aesthetic utilizing a dark background, which can be difficult to pull off successfully in web-based contexts. The color palette is both diverse and cohesive, with points of saturation used sparingly within primarily light gray structural forms.

The Flash-based animations that Swinehart uses to demonstrate narrative flows are also quite beautiful, unfortunately at the expense of removing themselves from any sort of informative context. However, this does serve as an intriguing precedent for visualizing flows of connection that exist within a parenting organization system (in this case, the timeline of the story). This is an idea we may explore if we end up trying to visualize how users react to each other’s comments within a discourse.
posted by Andrew Mahon on Nov 7th, 2009
Took this screen capture last night – nothing TOO special, just the beginnings of my Beanstalkd Dashboard. I like watching the NLP jobs jump up, and quickly get processed.

posted by Andrew Mahon on Nov 7th, 2009
Here is a quick tutorial on how to download and run a Python file from within Terminal. I will provide a quick overview of each command’s functionality. This is nowhere near comprehensive, but should help those inexperienced with the Terminal CLI get a cursory understanding.
Upon opening Terminal, you will be located in your home folder. On your main drive, it is located at /Users/username, or symbolically at ~/. If you don’t already have one, let’s create a folder to keep your Python projects in.
mkdir ~/python
The mkdir command makes a directory. Since you are in your home folder, your user has permissions to make and delete folders here. Check out Wikipedia for more info on mkdir.
Now that we have created the folder ~/python, let’s move to it. Type this one into Terminal instead of copying and pasting it. As you begin typing python, hit Tab. It should autocomplete the remainder of the path for you. This is a great thing to remember when using Terminal.
cd ~/python
The cd command changes the directory that you are in. Its most basic usage is ‘cd path’. After entering the above command, you should have moved to the ~/python folder. Check out Wikipedia for more info on cd.
Now, let’s download our source file. I have posted it to the VoxPop site, and it is compressed as a .tar.gz. We will download it using cURL.
curl -O http://blog.typeslashcode.com/voxpop/files/chunker.tar.gz
cURL is a popular project that includes curl, a command-line tool for transferring files using URL syntax. The above command uses the -O flag, which tells curl to write the response to a file named after the remote file instead of printing it to the terminal. More information on cURL can be found on Wikipedia.
Let’s make sure that our cURL request was successful and that there is a file named ‘chunker.tar.gz’ in our current directory. We will use the ls command.
ls
The ls command lists the contents of a directory. If the file was correctly downloaded, we should see ‘chunker.tar.gz’ in the above command’s output. Without any flags, ls lists all non-hidden files in the current directory. More on ls can be found on Wikipedia.
Since the file is compressed as a tarball, we need to extract it. We will use the tar command. While entering this command, remember to try autocomplete.
tar -xvzf chunker.tar.gz
Tar is a popular utility for creating and extracting tarballs. By default, tar will extract into the current directory. The flags above represent e(x)tract, (v)erbose, g(z)ip and (f)ilename. Generally, this flag set should work to extract most .tar.gz files you encounter. For more details on tar, you should check out Wikipedia.
The tarball you just untarred should have extracted to the folder ./chunker. Try ls to confirm this, and then switch to the new folder using cd.
ls
cd chunker
We will now be in ~/python/chunker, where chunker.py should have been extracted. Let’s confirm with ls, and then try to run the file through Python.
ls
python chunker.py
If all works out, the script should execute, and output its results!
This was just a very brief rundown of some vital Terminal commands. This is by no means comprehensive, and in fact excludes other vitals, such as sudo, rm &c.
For help on individual commands, you can try checking the BSD General Commands Manual for some details. For example, the command below will open the manual page for ls.
man ls
Most commands also have help built in. Usually built-in help is a cursory overview of possible flags, but occasionally it includes a more detailed guide. The command below will open help for tar.
tar --help
For a fairly comprehensive list of OS X Terminal commands, check out SS64. Wikipedia is also a good resource; just search for the commands you are curious about.