New York Times Innovation Portfolio


posted by Zeke Shore on Mar 2nd, 2010

nyt_innovation

The New York Times online consistently delivers interesting data visualizations to help enrich the stories surrounding popular news topics. The New York Times Innovation Portfolio provides a beautiful overview of all of these interactive explorations, organized by topic with project overviews, documentation, and links to the actual interactive pieces.

Since VoxPop is working with New York Times data, this collection of existing data visualizations is a treasure trove of strong precedents, several of which relate very closely to our project. Here are a handful that live within the realm of reader sentiment.

Health Care Debate

nyt_healthcare

Health Care Debate is a conversation platform that allows users to discuss various issues within the health care debate. The most interesting aspect of this tool is how the relevance of specific sub-topics within the debate can be comprehended at a glance, with the surface of the tool depicting multiple “rooms” that are scaled relative to the number of comments relating to that sub-topic.

Obama’s Address In Cairo

nyt_interactive_video

This interactive video of Obama’s speech to the Muslim world allows users to provide comments along the timeline of the speech, allowing a global discussion to unfold in the context of the time-based content that is seeding the discussion.

Election Word Train

nyt_election_words

Election Word Train asked New York Times readers to share one word that describes their current state of mind on the day of the 2008 presidential election. Much like a tag cloud, words are scaled relative to the number of people sharing the sentiment, and can be filtered to show words shared by Obama or McCain supporters. Leveraging scale and letting these words ‘speak for themselves’ does effectively provide a general glimpse of reader sentiment, even if the forum is somewhat contrived: the goal of reducing group sentiment to a few dozen words possibly hinders truly organic sentiment visualization.

Inaugural Words

nyt_inaugural

Inaugural Words ranks the frequency of words used by presidents in Inaugural Addresses, showing what words each president used the most. While it does not really reflect reader sentiment, it does show an interesting breakdown of word frequency across time and political position.

Twitter Bowl

nyt_superbow_tweets

The Twitter Bowl interactive visualization maps Twitter chatter over the course of the 2009 Super Bowl, according to key topic mentions. This hits an interesting cross section of communicating time, space, and group sentiment, even if it is somewhat cryptic in what is actually being communicated. There is something very satisfying about seeing topics grow and shrink geographically over time, although it does not reveal what specifically about “steelers” or “ads” or “springsteen” people are sharing.

These projects all have several aspects that are worth analyzing and building upon. As we begin to re-think how people engage with the news, it’s exciting to see major players like the New York Times continuing to push the envelope, and continuing to keep their data open so that others can do the same.

Visualizing NYT Discourse – Design Iteration 3


posted by Zeke Shore on Mar 1st, 2010

v2.0_crop

The last design iteration I wrote about a couple weeks ago started to depart from earlier iterations by exploring the idea of representing every comment on the New York Times website (that relates to any given topic) as its own entity, and visually describing its sentiment or personality.

After reflecting back on our original reasons for wanting to visualize online discussions, our thesis question really centers around how the ‘Vox Populi’ can still be heard as reader participation in the journalistic process scales to hundreds of thousands of comments spread across hundreds of articles and blog posts for even just one news source.

So this resulted in a design prototype that involved rendering comments for a given topic as balls swarming around the article that seeded the conversation, representing sentiment with color, opacity, and speed of movement, describing each comment’s polarity (how positive or negative), strength (strong or weak) and activity (active or passive) respectively.

v2_screen_grab

While this iteration was both readable and interesting to look at, it suffered in terms of scalability. We could realistically only look at a couple conversations at a time for any given topic. So the next phase of the design process involved trying to pull some of the more successful aspects of this iteration into a more real-estate-friendly composition. The logical progression of this involved breaking conversation into a linear organization (none of the following mockups visualize real data; they serve as design explorations).

v1.5.2

Of course horizontal flows of information are rarely web-friendly, despite it being a logical way to organize content chronologically. So this quickly evolved into a vertical orientation, and opened the door for exploring the concept of possibly showing when commenters reference each other within a conversation.


Midpoint Presentation


posted by Zeke Shore on Jan 27th, 2010

Back in November we had the honor of presenting our progress to the R&D department at the New York Times, followed by a similar presentation to Dave Carroll, our upcoming thesis advisor for the final semester of the project. Feel free to check out the slide deck, though links to the working prototypes may be down (running servers is expensive!).

VoxPop Midpoint Presentation

Design Iteration 1


posted by Zeke Shore on Jan 26th, 2010

While we have been good about posting research progress as it comes, progress on the design front has been a bit too quiet. Here are some early iterations of the User Interface design process, and what we are learning as we go.

Working off of the data that we were beginning to generate, our starting point was a collection of user comments for all of the New York Times articles that would be returned for any given query. By parsing through the comments, we could match words against the General Inquirer Dictionary across the three axes of Charles Osgood’s Semantic Differential theory. You can read more about the first version of our emotive analysis process in my previous post on the subject.

VP_GUI_1.6_f2_detail

So the initial output we decided to shoot for was essentially six lists of words from the comments for each article that is retrieved for a given query. Along the evaluative axis we would have a list of ‘positive’ words (shown above in green) and ‘negative’ words (in red), along the activity axis we would have a list of ‘active’ words (in orange) and ‘passive’ words (in brown), and along the potency axis we would have a list of ‘strong’ words (in blue) and ‘weak’ words (in gray).

VP_GUI_1_Spread_sm

Fleshing out the design of this model included four “states.” Collapsing the emotive word lists for each article would yield colored bars extending above and below a base line of articles. Theoretically, this would reveal trends in the quantities of these emotively charged words over time for discussions surrounding any keyword. Clicking on a specific article would reveal the actual list of words that are being described by the colored bars of the collapsed view. Extending the idea, hovering over any word could potentially show the sentence from which that word was retrieved, hovering over the article title could reveal the abstract of the article, and clicking either would bring the user through to the article or the specific comment on the New York Times website, all in an effort to provide easy contextual access as a validation tool.

So we built a prototype of this visualization. We did not build out all of the interaction levels spec’ed in the initial mockups, but even just getting a list of articles with their corresponding lists of emotively classified words from the discussions surrounding them seemed like a good starting point for exploring the data.

VP_GUI_1_live

This prototype revealed a lot. The first obvious conclusion is that we are dealing with way more data than could be meaningfully expressed as ‘lists of words’. Even scaling the text size down below legibility did not allow most lists to be viewed in their entirety in a normal web browser window.

Another problem is that the data is really hard to read if you don’t already have a strong understanding of what is going on behind the scenes. This organization does not show the three clear axes that the discussions are being mapped against. Furthermore, this model gives equal weight to all of our emotive axes, despite Osgood’s conclusion that evaluative distinction (positive/negative) carries the most emotive weight, which is then supported by activity and potency as the second two most significant factors.

One more problem that this prototype revealed is the homogenizing effect that results from extracting lists of words at the level of the entire conversation rather than specific comments. One really long nasty comment could skew the negative word count for an entire conversation when looking at the data at this level of abstraction, and that sort of misrepresentation could be a serious cause for concern. The project is called VoxPop, stemming from the Latin term Vox Populi, meaning “voice of the people.” This visualization attempt was not yet showing the voices of any ‘people’… rather averaging out the ebbs and flows of entire conversations.
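To make the skew concrete, here is a minimal sketch with made-up data. The comments and the `NEGATIVE` word set are purely hypothetical, not real New York Times data or General Inquirer entries. It compares counting negative words across the pooled conversation with counting how many individual comments contain any negative word.

```python
# Hypothetical stand-in for a 'negative' word list.
NEGATIVE = {"terrible", "awful", "horrible", "disgusting", "shameful"}

# Toy conversation: one long nasty comment, three short neutral ones.
comments = [
    ["terrible", "awful", "horrible", "disgusting", "shameful"],
    ["good", "point"],
    ["fair", "enough"],
    ["nice", "article"],
]

def pooled_negative_share(comments):
    """Share of negative words when the whole conversation is one bag of words."""
    words = [w for comment in comments for w in comment]
    return sum(w in NEGATIVE for w in words) / len(words)

def negative_comment_share(comments):
    """Share of comments that contain at least one negative word."""
    flagged = sum(any(w in NEGATIVE for w in comment) for comment in comments)
    return flagged / len(comments)

print(pooled_negative_share(comments))   # conversation-level view
print(negative_comment_share(comments))  # comment-level view
```

In this toy case the pooled view makes the conversation look far more negative than the comment-level view, even though only one of four commenters was negative.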

More to come on our newer design iterations soon.

Emotive Analysis Process V1.1


posted by Zeke Shore on Nov 11th, 2009

Here is a quick update on how our emotive analysis engine is playing out. The end-to-end process (for this initial prototype) will work as follows:

First, the user provides a search query, and we pull (and cache) all of the NY Times articles that are related to that query that have comments using the Article Search API and the Community API (this will be made more efficient in the near future… more to come on that later).
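The pull-and-cache step could look something like the sketch below. This is only an illustration of the caching idea: `QueryCache` and `fake_fetch` are hypothetical names, and the real calls to the Article Search API and the Community API are deliberately left out and replaced by a stand-in function.

```python
class QueryCache:
    """Remember results per query so repeated searches skip the API call."""

    def __init__(self, fetch):
        self._fetch = fetch   # callable that performs the real lookup
        self._store = {}      # normalized query -> cached results

    def get(self, query):
        key = query.strip().lower()
        if key not in self._store:
            self._store[key] = self._fetch(key)
        return self._store[key]

calls = []

def fake_fetch(query):
    """Stand-in for the API requests; records each call it receives."""
    calls.append(query)
    return [{"query": query, "comments": ["placeholder comment"]}]

cache = QueryCache(fake_fetch)
cache.get("Obama")
cache.get("obama ")   # served from the cache, no second fetch
print(len(calls))
```

An in-memory dictionary is the simplest possible store; a prototype that needs results to survive restarts would swap it for a file or database.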

After article or comment results to a query are returned from either the cache or a new API call, what we will need to deal with initially on the Natural Language Processing (NLP) side of the equation will be comments, in the form of text strings.

Using NLTK in Python, there is an information extraction architecture that is structured as follows:

ie-architecture

For our purposes, one of the more difficult challenges that we have is knowing what words we care about. If we are trying to visualize the emotional or affective characteristics of the discourse surrounding a keyword, we cannot just look at the full thread of comments for an article that was returned for a given keyword, and log every word that holds emotive weight. The NY Times article Bipartisan Spirit, at Least for a Moment is a perfect example as to why not. The article is about a meeting between President Obama and George Bush Sr. So as one may guess, that article would have been returned when querying either ‘Bush’ or ‘Obama,’ and the 38-comment discussion that follows the article contains references to both.

So before any sort of emotive analysis can occur, we must parse the text down to the words that we care about. This first involves identifying instances of our keyword within each comment, and extracting the sentences that contain the keyword.
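As a rough sketch of that first step, the snippet below pulls out the sentences of a comment that mention the keyword. It uses a naive regular-expression sentence splitter to stay self-contained, which is an assumption on my part; a proper sentence tokenizer such as the one in NLTK would handle abbreviations like “Sr.” far better. The function names and the sample comment are made up for illustration.

```python
import re

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def sentences_with_keyword(comment, keyword):
    """Return the sentences that mention the keyword as a whole word."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    return [s for s in split_sentences(comment) if pattern.search(s)]

comment = ("It was good to see Obama reach out. Bush was gracious too. "
           "I hope Obama keeps this up!")
print(sentences_with_keyword(comment, "Obama"))
# ['It was good to see Obama reach out.', 'I hope Obama keeps this up!']
```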

For further coverage, and also to account for the fact that web-based comments are often less verbose and less refined than other forms of discourse, if our keyword is a proper noun, we might also look at sentences with pronouns that immediately precede or follow sentences with our keyword.

Ultimately, we will need to develop a comprehensive weighted dependency grammar, so that we can efficiently parse the sentences that we care about into relatively accurate dependency structures. This will allow us to know (with far more precision) what words are referring to or modifying our keyword, and should therefore be emotively classified.
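The pronoun idea might be sketched like this. The `PRONOUNS` set, the function names and the sample comment are all illustrative assumptions, the sentence splitter is the same naive regular expression rather than a real tokenizer, and the sketch assumes the caller already knows the keyword is a proper noun.

```python
import re

# Small, hand-picked pronoun list; a real version would rely on POS tags.
PRONOUNS = {"he", "him", "his", "she", "her", "hers", "they", "them", "their"}

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def has_word(sentence, words):
    """True if any lowercase token of the sentence is in the given set."""
    return any(t in words for t in re.findall(r"[a-z]+", sentence.lower()))

def keyword_context(comment, keyword):
    """Keyword sentences plus adjacent sentences that contain a pronoun."""
    sentences = split_sentences(comment)
    hits = {i for i, s in enumerate(sentences) if has_word(s, {keyword.lower()})}
    keep = set(hits)
    for i in hits:
        for j in (i - 1, i + 1):
            if 0 <= j < len(sentences) and has_word(sentences[j], PRONOUNS):
                keep.add(j)
    return [sentences[i] for i in sorted(keep)]

comment = "Obama spoke well. He seemed sincere. The weather was awful."
print(keyword_context(comment, "Obama"))
# ['Obama spoke well.', 'He seemed sincere.']
```

Note that this does not check whether the pronoun actually refers to the keyword, which is the gap a dependency grammar would need to close.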

depgraph0

So now the fun part. Once we know what words we care about in relation to our keyword, we will go back to Charles Osgood’s Semantic Differential Theory (which I have discussed in a previous post), which maps words along three main axes: the Evaluative (good/bad), the Potency (strong/weak) and the Activity (active/passive). We can do this using the General Inquirer Dictionary, including the Lasswell Value Dictionary and the Harvard IV-4 dictionary, which maps about 12,000 words across Osgood’s semantic differential axes (among other classifications).

To make the process more efficient, since we have tagged the part of speech of every word, we can throw out words that we know should have neutral affective values, such as any determiners (‘the,’ ‘a,’ etc.) or any proper nouns, and map every other word against our three axes. For each axis, we will give a word a value of 1, 0, or -1, so on the evaluative (EVA) axis, for example, any word living at the ‘positive’ or ‘good’ end of the axis would hold a value of 1, whereas a word living at the ‘negative’ or ‘bad’ end of the axis would hold a value of -1, and of course words that are neutral on the evaluative scale would hold a value of 0. This system would carry out across the activity (ACT) and potency (POT) axes as well in the form of:

affectiveValue(word) = [EVA,  ACT, POT]

affectiveValue(respect) = [1, -1, 0]

Where the word “respect” holds an evaluative value of ‘positive’ or ‘good,’ an activity value of ‘passive’ and a potency value of ‘neutral’ (neither ‘strong’ nor ‘weak’).
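Here is a small Python sketch of that scoring step, combined with the part-of-speech filter described above. `TOY_LEXICON` is a hypothetical stand-in for the General Inquirer lookup: only the values for “respect” come from this post, and the other entries are invented for illustration. The input is assumed to be a list of (word, tag) pairs with Penn Treebank tags, which is the shape a tagger such as NLTK’s returns.

```python
# Penn Treebank tags treated as affectively neutral:
# determiners (DT) and proper nouns (NNP, NNPS).
NEUTRAL_TAGS = {"DT", "NNP", "NNPS"}

# word -> (EVA, ACT, POT). Only "respect" is taken from the post.
TOY_LEXICON = {
    "respect": (1, -1, 0),
    "attack": (-1, 1, 1),    # hypothetical entry
    "weary": (-1, -1, -1),   # hypothetical entry
}

def affective_value(word):
    """Return [EVA, ACT, POT]; unknown words are neutral on every axis."""
    return list(TOY_LEXICON.get(word.lower(), (0, 0, 0)))

def score_tagged(tagged_tokens):
    """Drop neutral tags and punctuation, then score the remaining words."""
    return {
        word.lower(): affective_value(word)
        for word, tag in tagged_tokens
        if tag not in NEUTRAL_TAGS and word.isalpha()
    }

tagged = [("The", "DT"), ("senator", "NN"), ("showed", "VBD"),
          ("respect", "NN"), ("for", "IN"), ("Bush", "NNP"), (".", ".")]
print(affective_value("respect"))  # [1, -1, 0]
print(score_tagged(tagged))
```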

So ultimately this will leave us with six lists of words for each article in relationship to a given keyword, which we can then use as metrics for our data visualization.
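One way to arrive at those six lists from per-word [EVA, ACT, POT] values is sketched below. The list names and the tiny lexicon are assumptions made for the example, not the engine’s actual data structures.

```python
# One (plus, minus) pair of list names per axis, in [EVA, ACT, POT] order.
AXES = (("positive", "negative"), ("active", "passive"), ("strong", "weak"))

def six_lists(words, lexicon):
    """Sort words into six lists according to the sign of each axis value."""
    lists = {name: [] for pair in AXES for name in pair}
    for word in words:
        scores = lexicon.get(word, (0, 0, 0))
        for (plus, minus), score in zip(AXES, scores):
            if score > 0:
                lists[plus].append(word)
            elif score < 0:
                lists[minus].append(word)
    return lists

# "respect" follows the example above; "attack" is a hypothetical entry.
lexicon = {"respect": (1, -1, 0), "attack": (-1, 1, 1)}
result = six_lists(["respect", "attack", "table"], lexicon)
print(result["positive"], result["passive"], result["weak"])
# ['respect'] ['respect'] []
```

A word that is neutral on an axis lands in neither list for that axis, so “table” appears nowhere.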

Research Progress Presentation


posted by Zeke Shore on Oct 24th, 2009

We recently presented our research progress, mostly focusing on proof-of-concept results of using the NYT API as an effective corpus, and exploring the work of Charles Osgood, Shortest Path Distance mapping with WordNet in the NLTK, and mapping words against the Lasswell Value Dictionary. We show some initial emotive analysis on a NYT article comment (which was part of a 38-comment discourse) to show what our WordNet + Lasswell engine might reveal.

nyt_lasswel_analysis_2

Download the full presentation below:

research_presentation

Jer Thorp: blprnt


posted by Zeke Shore on Oct 13th, 2009

Canadian artist Jer Thorp over at blog.blprnt does some pretty interesting computational information design pieces (primarily with Processing). Recently he has been doing some projects using the NYT API. One of his first experiments with the API maps the frequency of the words ‘internet’, ‘web’ and ‘twitter’ in the New York Times from 1990 to 2008:

Is Twitter the New Internet?

blprnt_nyt1

Thorp’s newest visualization again explores the NYT API, this time looking at a map of tag relationships over the past year:

NYT 365/360

blprnt_nyt2

In addition to interesting work, Thorp also provides several comprehensive data processing development tutorials, and releases many of his projects as open Processing libraries, allowing the information design community to evolve his concepts and push development efforts forward.