posted by Zeke Shore on Feb 17th, 2010
While exploring existing sentiment analysis processes, we stumbled across what looks like a fully integrated open source solution to several issues identified in our recent round of research.
OpinionFinder appears to be hosted and primarily developed at the University of Pittsburgh with contributions from Cornell University and University of Utah. While the OpinionFinder system was only mentioned off hand in Bo Pang’s article Opinion Mining and Sentiment Analysis, it appears to include some of the best solutions available for a lot of the common challenges that accompany effective sentiment analysis.
OpinionFinder, which was initially released in 2006, employs a multi-stage NLP process. As stated in the project’s extended abstract,
“OpinionFinder aims to identify subjective sentences and to mark various aspects of subjectivity in these sentences, including the source (holder) of the subjectivity and words that are included in phrases expressing positive or negative sentiments.”
Working in “batch” mode as more of a back-end pipe, OpinionFinder works as follows:
Document Processing
Taking any incoming text source, HTML or XML meta info is removed, and sentences are split and POS tagged using OpenNLP. Next, stemming is accomplished using Steven Abney’s SCOL v1K stemmer program. SUNDANCE (Sentence UNDerstanding And Concept Extraction), a partial parser from the NLP laboratory at the University of Utah, is used by Autoslog-TS to identify extraction patterns needed by the sentence classifiers and the SourceFinder (which identifies the source of subjective content, distinguishing author statements from related or quoted statements). A final parse in batch mode establishes constituency parse trees which are converted to dependency parse trees for Named Entity and subject detection.
Subjectivity and Sentiment Analysis
At this point a Naive Bayes classifier identifies subjective sentences. The specs seem to indicate that the classifier is trained against subjective and objective sentences generated by two additional “rule-based” (unsupervised?) classifiers drawing from “a large corpus.” This point in the process will require some exploration and validation.
Next a direct subjective expression and speech event classifier, built by Eric Breck, tags the direct subjective expressions and speech events found within the document using WordNet.
The final step applies actual sentiment analysis to sentences that have been identified as subjective. This is accomplished with two classifiers that were developed using the BoosTexter machine learning program and trained on the MPQA Corpus.
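To make the bootstrapping idea above more concrete (rule-based classifiers label the sentences they are sure about, and those labels then train a Naive Bayes classifier), here is a rough sketch. It is a toy illustration and not OpinionFinder's code: the clue list, the labelling thresholds, and the function names `rule_label`, `train_naive_bayes` and `classify` are all assumptions invented for this example.

```python
import math
from collections import Counter

# Hypothetical subjectivity clues; a real system would use a large lexicon.
STRONG_CLUES = {"terrible", "wonderful", "hate", "love", "outrageous", "brilliant"}
PUNCT = ".,!?;:\"'"


def tokenize(sentence):
    """Lowercase the sentence and strip surrounding punctuation from each word."""
    return [w.strip(PUNCT).lower() for w in sentence.split() if w.strip(PUNCT)]


def rule_label(sentence):
    """Toy rule-based labeller: only label sentences that look clear-cut."""
    hits = sum(1 for w in tokenize(sentence) if w in STRONG_CLUES)
    if hits >= 2:
        return "subjective"
    if hits == 0:
        return "objective"
    return None  # uncertain, so leave the sentence unlabelled


def train_naive_bayes(labeled):
    """labeled is a list of (sentence, label) pairs produced by the rules."""
    word_counts = {"subjective": Counter(), "objective": Counter()}
    class_counts = Counter()
    for sentence, label in labeled:
        class_counts[label] += 1
        word_counts[label].update(tokenize(sentence))
    vocab = set(word_counts["subjective"]) | set(word_counts["objective"])
    return word_counts, class_counts, vocab


def classify(sentence, model):
    """Multinomial Naive Bayes with add-one smoothing over the known vocabulary."""
    word_counts, class_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label in class_counts:
        score = math.log(class_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in tokenize(sentence):
            if w in vocab:
                score += math.log((word_counts[label][w] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


corpus = [
    "I love this wonderful plan.",
    "I hate this terrible idea.",
    "The meeting starts at noon.",
    "The senate voted on Tuesday.",
    "What a brilliant move.",
]
# Keep only the sentences the rules were confident about, then train on them.
labeled = [(s, rule_label(s)) for s in corpus if rule_label(s) is not None]
model = train_naive_bayes(labeled)
```

The point of the design is that the trained classifier can then label sentences the rules skipped, such as the last one in `corpus`, which contains only one clue word.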
Evaluation
While we still need to rigorously explore the source code, this system appears to be a gold mine of solutions to both previously unresolved and newly discovered issues in our sentiment analysis process. Named Entity detection along with dependency parse trees will help us filter content to only include sentiment regarding the actual topic being explored (rather than visualizing all subjective content in a comment) as well as helping to reveal popular related topics that exist within any given topic of discussion.
Subjectivity detection and Speech Event Classification are challenges that are acknowledged in a lot of research on the topic of sentiment analysis, but comprehensive solutions have been much more difficult to come by. This system seems to combine a few processes towards those goals (including leveraging WordNet in a new way), and again could really help us filter down our corpus to relevant statements of sentiment for a given topic.
Finally, the actual positive/negative sentiment analysis that is applied to subjective sentences is different from any other process I have read about (most of which involve WordNet and trained classifiers, or our original ad hoc method of matching against the General Inquirer Dictionary). We might want to experiment a bit with this phase to see how effective the different methods are.
One process that is surprisingly absent from the OpinionFinder system is any sort of negation detection. We may want to explore possibly integrating the algorithm Bruno Ohana experimented with in his dissertation on sentiment analysis, or investigate other solutions.
It may also be interesting to see how things change if we begin to stack some of the processes used by OpinionFinder with systems that we already have in place, such as our GI Osgood Emotive Assignments.
You can download OpinionFinder for free from the project’s website under an open academic license, or download a PDF of the extended abstract/description of the project here:
OpinionFinder-Extended Abstract
posted by Zeke Shore on Feb 16th, 2010
One issue of accurate sentiment analysis identified in a recent round of research is the problem of negation detection. This is the process by which a negating word (such as ‘not’) inverts the evaluative value of an affective word (for example, “not good” is similar to saying “bad”). This can be resolved in natural language processing by identifying negating words, and then inverting the value of any positive or negative word within n words of the negating word, where n is the window of potential negation.
In Bruno Ohana’s 2009 dissertation “Opinion Mining with the SentiWordNet Lexical Resource” (Dublin Institute of Technology), a Python algorithm is presented to perform this task. While Ohana’s tests with this negation detection algorithm only yielded accuracy improvements of about 0.5%, this might be a good starting point for further exploration.
#
# Populates an array of negation flags based on the document terms:
# negation[i] indicates whether the term in doc[i] is negated.
# Anything after a '/' in a term (for example a tag suffix) is ignored.
#
def getNegationArray(doc, windowsize):
    PSEUDO = ('no increase', 'no wonder', 'no change', 'not cause',
              'not only', 'not necessarily')
    PRENEGATION = ('not', 'no', 'n\'t', 'cannot', 'declined',
                   'denied', 'denies', 'free of', 'fails to', 'no evidence',
                   'no new', 'no sign', 'no suspicious', 'no suggestion',
                   'rather than', 'with no', 'unremarkable', 'without',
                   'rules out', 'ruled out', 'rule out')
    POSNEGATION = ('unlikely', 'free', 'ruled out')
    ENDOFWINDOW = ('.', ':', ',', 'but', 'however', 'nevertheless',
                   'yet', 'though', 'although', 'still', 'aside from', 'except',
                   'apart from')
    # Initialise array
    vNEG = [0 for t in range(len(doc))]
    # Initialise window counters
    winstart = 0
    winend = min(windowsize, len(doc) - 1)
    docsize = len(doc)
    found_neg_fwd = 0
    found_neg_bck = 0
    inwindow = 0
    for i in range(docsize):
        # Reset per term, so that one pseudo negation does not switch off
        # negation detection for the rest of the document.
        found_pseudo = 0
        #
        # build 1-term and 2-term strings
        #
        unigram = doc[i].split('/')[0]
        if i < (docsize - 1):
            bigram = unigram + ' ' + doc[i + 1].split('/')[0]
        else:
            bigram = unigram
        #
        # Search for pseudo negations
        #
        for negterm in PSEUDO:
            if bigram == negterm:
                found_pseudo = 1
        if found_pseudo == 0:
            #
            # Look for pre negations
            #
            for negterm in PRENEGATION:
                if unigram == negterm or bigram == negterm:
                    found_neg_fwd = 1
            #
            # Look for post negations
            #
            for negterm in POSNEGATION:
                if unigram == negterm or bigram == negterm:
                    found_neg_bck = 1
        #
        # If found fwd/backw negation, then negate window
        #
        if found_neg_fwd == 1:
            #
            # negate terms forward up to window
            #
            if inwindow < windowsize:
                vNEG[i] = 1
                inwindow += 1
            else:
                # out of window space
                found_neg_fwd = 0
                inwindow = 0
        #
        # backward negation
        #
        if found_neg_bck == 1:
            #
            # negate back until window start
            #
            for counter in range(max(winstart, i - windowsize), i):
                vNEG[counter] = 1
            #
            # done with backwards negation
            #
            found_neg_bck = 0
        #
        # now move window
        #
        for negterm in ENDOFWINDOW:
            if unigram == negterm or bigram == negterm:
                #
                # found end of negation, must reset windows
                #
                inwindow = 0
                found_neg_fwd = 0
                winstart = i
                winend = min(windowsize + i, len(doc) - 1)
    return vNEG
posted by Andrew Mahon on Jan 27th, 2010

The above diagram represents the VoxPop server architecture for the first iteration of design.
One backend instance of VoxPop consists of two processes, the Producer and the Worker. The Producer is responsible for serving pages and data, receiving requests, and creating tasks. The Worker is then responsible for executing the tasks. The two processes are linked by the Beanstalkd task queue, and by the Memcached and CouchDB persistence layers.
As the Producer receives requests, it adds tasks to the Beanstalkd task queue. Beanstalkd is a lightweight task queue originally developed for Causes on Facebook. Beanstalkd allows tasks to exist in multiple ‘tubes,’ each of which can be prioritized and watched separately.
The Worker process is multi-threaded with at least one thread per Beanstalkd tube. As soon as a thread is ready to accept a task, it pulls it off the queue and executes it. Each task either produces new tasks to be added to Beanstalkd, or results to be persisted into Memcached and CouchDB.
As results are produced, they are persisted in both Memcached and CouchDB. Persisting into Memcached provides nearly instantaneous availability to the Producer, allowing it to serve out results as soon as they are produced. Persisting into CouchDB is a little bit slower, so results are stored there as soon as possible. CouchDB allows results to be retrieved indefinitely.
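The flow described above can be sketched in a few lines of Python. This is only an illustration of the pattern, not the VoxPop code: `queue.Queue` objects stand in for Beanstalkd tubes, plain dicts stand in for Memcached and CouchDB, and the tube names (`fetch`, `analyze`) and task contents are made up for the example.

```python
import queue
import threading

# Stand-ins (assumptions): Queue objects play the role of Beanstalkd tubes,
# and plain dicts play the role of Memcached and CouchDB.
tubes = {"fetch": queue.Queue(), "analyze": queue.Queue()}
fast_cache = {}      # Memcached stand-in: quick reads for the Producer
durable_store = {}   # CouchDB stand-in: long-term storage
store_lock = threading.Lock()


def produce(query):
    """Producer side: turn an incoming request into a task."""
    tubes["fetch"].put({"query": query})


def fetch_task(task):
    """A task may create follow-up tasks in another tube."""
    comments = ["comment about " + task["query"]]  # placeholder data
    tubes["analyze"].put({"query": task["query"], "comments": comments})


def analyze_task(task):
    """A task may instead produce a result, persisted to both stores."""
    result = {"query": task["query"], "count": len(task["comments"])}
    with store_lock:
        fast_cache[task["query"]] = result
        durable_store[task["query"]] = result


HANDLERS = {"fetch": fetch_task, "analyze": analyze_task}


def worker(tube_name):
    """One thread per tube: pull a task as soon as one is available."""
    while True:
        task = tubes[tube_name].get()
        if task is None:  # sentinel used in this sketch to stop the thread
            tubes[tube_name].task_done()
            break
        HANDLERS[tube_name](task)
        tubes[tube_name].task_done()


def run(queries):
    """Start the workers, submit the queries, and wait for all results."""
    threads = [threading.Thread(target=worker, args=(name,)) for name in tubes]
    for t in threads:
        t.start()
    for q in queries:
        produce(q)
    tubes["fetch"].join()    # every fetch task has queued its analyze task
    tubes["analyze"].join()  # every analyze task has stored its result
    for name in tubes:
        tubes[name].put(None)
    for t in threads:
        t.join()
    return fast_cache


results = run(["obama", "bush"])
```

Keeping one thread per tube means a slow kind of task only backs up its own tube, which is the main reason for separating task types in this sketch.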
posted by Zeke Shore on Nov 11th, 2009
Here is a quick update on how our emotive analysis engine is playing out. The end to end process (for this initial prototype) will work as follows:
First, the user provides a search query, and we pull (and cache) all of the NY Times articles that are related to that query that have comments using the Article Search API and the Community API (this will be made more efficient in the near future… more to come on that later).
After article or comment results to a query are returned from either the cache or a new API call, what we will need to deal with initially on the Natural Language Processing (NLP) side of the equation will be comments, in the form of text strings.
Using NLTK in Python, there is an information extraction architecture that is structured as follows:

For our purposes, one of the more difficult challenges that we have is knowing what words we care about. If we are trying to visualize the emotional or affective characteristics of the discourse surrounding a keyword, we cannot just look at the full thread of comments for an article that was returned for a given keyword, and log every word that holds emotive weight. The NY Times article Bipartisan Spirit, at Least for a Moment is a perfect example as to why not. The article is about a meeting between President Obama and George Bush Sr. So as one may guess, that article would have been returned when querying either ‘Bush’ or ‘Obama,’ and the 38-comment discussion that follows the article contains references to both.
So before any sort of emotive analysis can occur, we must parse the text down to the words that we care about. This first involves identifying instances of our keyword within each comment, and extracting the sentences that contain the keyword.
For further coverage, and also to account for the fact that web-based comments are often less verbose and less refined than other forms of discourse, if our keyword is a proper noun, we might also look at sentences with pronouns that immediately precede or follow sentences with our keyword.
Ultimately, we will need to develop a comprehensive weighted dependency grammar, so that we can efficiently parse the sentences that we care about into relatively accurate dependency structures. This will allow us to know (with far more precision) which words are referring to or modifying our keyword, and should therefore be emotively classified.
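As a rough illustration of the first two steps (keeping the sentences that contain the keyword, plus neighbouring sentences that contain a pronoun), something like the following could work. The sentence splitter, the pronoun list and the function names are assumptions made for this sketch, and the keyword is assumed to be a single word.

```python
import re

# A small, hypothetical pronoun list; a POS tagger would be more reliable.
PRONOUNS = {"he", "she", "him", "her", "his", "hers", "they", "them", "their"}


def split_sentences(text):
    """Naive splitter on '.', '!' and '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def relevant_sentences(comment, keyword, include_pronoun_neighbors=True):
    """Return sentences mentioning the keyword, optionally with pronoun neighbours."""
    sentences = split_sentences(comment)
    words = [set(re.findall(r"[a-z']+", s.lower())) for s in sentences]
    key = keyword.lower()
    keep = {i for i, w in enumerate(words) if key in w}
    if include_pronoun_neighbors:
        for i in list(keep):
            # Look at the sentence just before and just after a keyword sentence.
            for j in (i - 1, i + 1):
                if 0 <= j < len(sentences) and words[j] & PRONOUNS:
                    keep.add(j)
    return [sentences[i] for i in sorted(keep)]


comment = ("Obama spoke well. He seemed sincere. "
           "Bush was there too. The weather was nice.")
obama_sentences = relevant_sentences(comment, "Obama")
```

Note that querying ‘Bush’ on the same comment would also pull in “He seemed sincere.”, even though the pronoun most likely refers to Obama. That kind of misattribution is exactly what a proper dependency parse should help avoid.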

So now the fun part. Once we know what words we care about in relation to our keyword, we will go back to Charles Osgood’s Semantic Differential Theory, which I have discussed in a previous post and which maps words along three main axes: the Evaluative (good/bad), the Potency (strong/weak) and the Activity (active/passive). We can do this using the General Inquirer Dictionary, including the Lasswell Value Dictionary and the Harvard IV-4 dictionary, which maps about 12,000 words across Osgood’s semantic differential axes (among other classifications).
To make the process more efficient, since we have tagged the part of speech of every word, we can throw out words that we know should have neutral affective values, such as any determiners (‘the,’ ‘a,’ etc.) or any proper nouns, and map every other word against our three axes. For each axis, we will give a word a value of 1, 0, or -1. On the evaluative (EVA) axis, for example, any word living at the ‘positive’ or ‘good’ end of the axis would hold a value of 1, whereas a word living at the ‘negative’ or ‘bad’ end of the axis would hold a value of -1, and of course words that are neutral on the evaluative scale would hold a value of 0. This system would carry across the activity (ACT) and potency (POT) axes as well, in the form of
affectiveValue(word) = [EVA, ACT, POT]
affectiveValue(respect) = [1, -1, 0]
Where the word “respect” holds an evaluative value of ‘positive’ or ‘good,’ an activity value of ‘passive’ and a potency value of ‘neutral’ (neither ‘strong’ nor ‘weak’).
So ultimately this will leave us with six lists of words for each article in relationship to a given keyword, which we can then use as metrics for our data visualization.
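A minimal sketch of that mapping might look like the following. The tiny lexicon here is a stand-in for the General Inquirer data: only the values for “respect” come from the example above, the other entries are invented for illustration, and the tag set and function names are likewise assumptions.

```python
# Toy lexicon (an assumption): word -> [EVA, ACT, POT]. Real values would be
# looked up in the General Inquirer dictionary.
LEXICON = {
    "respect": [1, -1, 0],
    "attack": [-1, 1, 1],   # invented values, for illustration only
    "feeble": [-1, 0, -1],  # invented values, for illustration only
}
# Penn Treebank tags treated as affectively neutral: determiners, proper nouns.
NEUTRAL_TAGS = {"DT", "NNP", "NNPS"}
# One (positive end, negative end) pair of list names per axis, in EVA, ACT, POT order.
AXES = (("good", "bad"), ("active", "passive"), ("strong", "weak"))


def affective_value(word):
    """Return [EVA, ACT, POT] for a word, or all zeros if it is unknown."""
    return LEXICON.get(word.lower(), [0, 0, 0])


def six_lists(tagged_words):
    """tagged_words is a list of (word, pos_tag) pairs. Returns six word lists."""
    lists = {name: [] for pair in AXES for name in pair}
    for word, tag in tagged_words:
        if tag in NEUTRAL_TAGS:
            continue  # skip words we assume carry no affective weight
        for (pos_name, neg_name), value in zip(AXES, affective_value(word)):
            if value == 1:
                lists[pos_name].append(word)
            elif value == -1:
                lists[neg_name].append(word)
    return lists


tagged = [("The", "DT"), ("senators", "NNS"), ("respect", "VBP"),
          ("Obama", "NNP"), ("attack", "NN")]
lists = six_lists(tagged)
```

Each of the six lists can then be counted or weighted on its own, which keeps the three axes independent when they are turned into metrics.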
posted by Andrew Mahon on Nov 7th, 2009
Here is a quick tutorial on how to download and run a Python file from within Terminal. I will provide a quick overview of each command’s functionality. This is nowhere near comprehensive, but should help those inexperienced with the Terminal CLI get a cursory understanding.
Upon opening Terminal, you will be located in your home folder. On your main drive, it is located at /Users/username, or symbolically at ~/. If you don’t already have one, let’s create a folder to keep your Python projects in.
mkdir ~/python
The mkdir command makes a directory. Since you are in your home folder, your user has permissions to make and delete folders here. Check out Wikipedia for more info on mkdir.
Now that we have created the folder ~/python, let’s move to it. Type this one into Terminal instead of copying and pasting it. As you begin typing python, hit Tab. It should autocomplete the remainder of the path for you. This is a great thing to remember when using Terminal.
cd ~/python
The cd command changes the directory that you are in. Its most basic usage is ‘cd path’. After entering the above command, you should have moved to the ~/python folder. Check out Wikipedia for more info on cd.
Now, let’s download our source file. I have posted it to the VoxPop site, and it is compressed as a .tar.gz. We will download it using cURL.
curl -O http://blog.typeslashcode.com/voxpop/files/chunker.tar.gz
cURL is popular software that contains curl, a command line tool for getting files using URL syntax. The above command uses the -O flag, which indicates that we want to write the output of the request to a file. By default, the destination file will assume the name of the source file. More information on cURL can be found on Wikipedia.
Let’s make sure that our cURL request was successful and that there is a file named ‘chunker.tar.gz’ in our current directory. We will use the ls command.
ls
The ls command lists the contents of a directory. If the file was correctly downloaded, we should see ‘chunker.tar.gz’ in the above command’s output. Without any flags, ls lists all non-hidden files in the current directory. More on ls can be found on Wikipedia.
Since the file is compressed as a tarball, we need to extract it. We will use the tar command. While entering this command, remember to try autocomplete.
tar -xvzf chunker.tar.gz
Tar is a popular utility for creating and extracting tarballs. By default, tar extracts into the current directory. The flags above represent e(x)tract, (v)erbose, g(z)ip and (f)ilename. Generally, this flagset should work to extract most files you encounter. For more details on tar, you should check out Wikipedia.
The tarball you just untarred should have extracted to the folder ./chunker. Try ls to confirm this, and then switch to the new folder using cd.
ls
cd chunker
We will now be in ~/python/chunker, where chunker.py should have been extracted. Let’s confirm with ls, and then try to run the file through Python.
ls
python chunker.py
If all works out, the script should execute, and output its results!
This was just a very brief rundown of some vital Terminal commands. This is by no means comprehensive, and in fact excludes other vitals, such as sudo, rm &c.
For help on individual commands, you can try checking the BSD General Commands Manual for some details. For example, the command below will open the manual page for ls.
man ls
Most commands also have help built in. Usually built-in help is a cursory overview of possible flags, but occasionally it includes a more detailed guide. The command below will open help for tar.
tar --help
For a fairly comprehensive list of OS X terminal commands, check out SS64. Wikipedia is also a good resource, just search for the commands you are curious about.
posted by Andrew Mahon on Oct 21st, 2009
This is actually much simpler than previously explained. Given two words, we use NLTK to find the WordNet synsets for each word, and then use more built-in functionality to find the shortest path distance. For this to work, the NLTK WordNet corpus needs to be installed.
I wrote a simple HTTP GET function that returns a JSON packet containing both words, and the shortest path distance between them. Super easy! One thing you should note, and which will be fixed, is that if multiple synsets are found for a word, i.e. if a word can serve as multiple parts of speech (“I bicycle to the bicycle store”), this function will choose the first one returned by WordNet.
# Module-level imports needed by the method below.
import logging

import nltk


def get_distance(self, word_1=None, word_2=None):
    """
    Returns a JSON packet containing word_1 and word_2, and the
    Wordnet Shortest Path Distance between word_1 and word_2.
    """
    logging.info("#### Wordnet.get_distance [%s,%s]", word_1, word_2)
    # Both words must be provided and non-empty.
    if word_1 and word_2:
        output = {'word_1': word_1, 'word_2': word_2}
        # Only the first synset returned by Wordnet is used for each word.
        w1_synset = nltk.corpus.wordnet.synsets(word_1)[0]
        w2_synset = nltk.corpus.wordnet.synsets(word_2)[0]
        output['distance'] = w1_synset.shortest_path_distance(w2_synset)
        return self.json(output)
    else:
        logging.error('Must provide two words to find the distance between.')
        return self.json({'error': 'Must provide two words to find '
                                   'the distance between.'})
Please note that this function belongs to a class that extends a more general web controller with the function json(). json() serves to properly render JSON packets to the browser by adding the proper content-type header.
posted by Zeke Shore on Oct 15th, 2009
def shortest_path_distance(self, other):
    """
    Returns the distance of the shortest path linking the two synsets (if
    one exists). For each synset, all the ancestor nodes and their distances
    are recorded and compared. The ancestor node common to both synsets that
    can be reached with the minimum number of traversals is used. If no
    ancestor nodes are common, -1 is returned. If a node is compared with
    itself 0 is returned.

    @type other: L{Synset}
    @param other: The Synset to which the shortest path will be found.
    @return: The number of edges in the shortest path connecting the two
        nodes, or -1 if no path exists.
    """
    if self == other: return 0
    path_distance = -1
    dist_list1 = self.hypernym_distances(0)
    dist_dict1 = {}
    dist_list2 = other.hypernym_distances(0)
    dist_dict2 = {}
    # Transform each distance list into a dictionary. In cases where
    # there are duplicate nodes in the list (due to there being multiple
    # paths to the root) the duplicate with the shortest distance from
    # the original node is entered.
    for (l, d) in [(dist_list1, dist_dict1), (dist_list2, dist_dict2)]:
        for (key, value) in l:
            if key in d:
                if value < d[key]:
                    d[key] = value
            else:
                d[key] = value
    # For each ancestor synset common to both subject synsets, find the
    # connecting path length. Return the shortest of these.
    for synset1 in dist_dict1.keys():
        for synset2 in dist_dict2.keys():
            if synset1 == synset2:
                new_distance = dist_dict1[synset1] + dist_dict2[synset2]
                if path_distance < 0 or new_distance < path_distance:
                    path_distance = new_distance
    return path_distance
nltk.wordnet.synset source code
Now these shortest path distance values for any word we wish to emotively classify can be checked against the evaluative extremes (good/bad), its potency factor (strong/weak) and its activity factor (active/passive).

This is the equation as described by Kamps and Marx in their essay on Using WordNet to Measure Semantic Orientations of Adjectives for determining the evaluative characteristics of a word, returning a value between -1 and 1. A value closer to -1 indicates that the word is more strongly linked to ‘bad,’ and similarly a value closer to +1 indicates a strong link to ‘good.’ A value close to 0 describes the word as neutral. Download the full PDF of Kamps and Marx’s research below:
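As we understand the measure, it compares a word’s distance to each pole and normalizes by the distance between the poles: EVA(w) = (d(w, bad) - d(w, good)) / d(good, bad). The sketch below computes this on a tiny hand-made synonym graph using breadth-first search. The graph, the words in it and the function names are assumptions for illustration only, and real distances would come from WordNet.

```python
from collections import deque

# Toy undirected synonym graph (an assumption, not real WordNet data).
GRAPH = {
    "good": ["decent", "fine"],
    "decent": ["good", "fair"],
    "fine": ["good"],
    "fair": ["decent", "mediocre"],
    "mediocre": ["fair", "poor"],
    "poor": ["mediocre", "bad"],
    "bad": ["poor"],
}


def distance(graph, start, goal):
    """Breadth-first search: number of edges on the shortest path, or None."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if node == goal:
            return dist
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, dist + 1))
    return None


def eva(word, graph=GRAPH, positive="good", negative="bad"):
    """Evaluative score in [-1, 1], or None if a needed path does not exist."""
    d_pos = distance(graph, word, positive)
    d_neg = distance(graph, word, negative)
    d_poles = distance(graph, positive, negative)
    if None in (d_pos, d_neg, d_poles) or d_poles == 0:
        return None
    return (d_neg - d_pos) / d_poles
```

The same function can score the other two axes by passing different poles, for example `positive="strong", negative="weak"`, provided the graph contains those words.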
kamp_usin03
posted by Andrew Mahon on Oct 12th, 2009
Kicking off VoxPop we made a decision to push forward with cutting edge technologies:

Python: While this is not *cutting edge* per se, it is the language on which the Natural Language Toolkit is built. It also forms the base upon which Web.py, the minimalist web framework that VoxPop will be driven by, is built. While developing Make History over this past summer, Local Project’s Brian House built the site’s back-end on Web.py and has only had good things to say about it. Beyond his recommendations, I enjoy the fact that it provides a robust base to build things onto and does not try to provide too much functionality.

CouchDB: Over the summer I worked on a project built on Google’s App Engine, and used their Datastore database, a document based db built on top of BigTable. While DBDMS provide their fair share of development challenges, I enjoyed the flexibility. After beginning to develop in MySQL, I started searching for alternatives, and CouchDB came up as one of the best options. It is built in Erlang and queried using a Map/Reduce implementation running on Mozilla’s SpiderMonkey JavaScript engine, so I was instantly interested in giving it a shot. In later posts, I will explore the challenges and benefits of developing with a DBDMS.
HTML5/Canvas: Anything but Flash, really. More to come on this later.