posted by Zeke Shore on Feb 17th, 2010
While exploring existing sentiment analysis processes, we stumbled across what looks like a fully integrated open source solution to several issues identified in our recent round of research.
OpinionFinder appears to be hosted and primarily developed at the University of Pittsburgh, with contributions from Cornell University and the University of Utah. While the OpinionFinder system was only mentioned offhand in Bo Pang’s article Opinion Mining and Sentiment Analysis, it appears to include some of the best solutions available for many of the common challenges that accompany effective sentiment analysis.
OpinionFinder, which was initially released in 2006, employs a multi-stage NLP process. As stated in the project’s extended abstract,
“OpinionFinder aims to identify subjective sentences and to mark various aspects of subjectivity in these sentences, including the source (holder) of the subjectivity and words that are included in phrases expressing positive or negative sentiments.”
Working in “batch” mode as more of a back-end pipe, OpinionFinder works as follows:
Document Processing
Taking any incoming text source, HTML or XML meta info is removed, and sentences are split and POS tagged using OpenNLP. Next, stemming is accomplished using Steven Abney’s SCOL v1K stemmer program. SUNDANCE (Sentence UNDerstanding And Concept Extraction), a partial parser from the NLP laboratory at the University of Utah, is used by Autoslog-TS to identify extraction patterns needed by the sentence classifiers and the SourceFinder (which identifies the source of subjective content, distinguishing author statements from related or quoted statements). A final parse in batch mode establishes constituency parse trees which are converted to dependency parse trees for Named Entity and subject detection.
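To make the first step of that stage concrete (markup removal and sentence splitting only), here is a minimal Python sketch. It does not use OpenNLP or any OpinionFinder code: the class and function names are hypothetical, and the regex splitter is a naive stand-in for a real sentence splitter.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML-like document."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_markup(raw):
    """Remove tags and collapse whitespace (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(raw)
    parser.close()
    return re.sub(r"\s+", " ", "".join(parser.chunks)).strip()


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


raw = "<p>The policy is <b>good</b>. Critics said it was rushed!</p>"
sentences = split_sentences(strip_markup(raw))
print(sentences)
```

A splitter this simple will break on abbreviations such as "Dr.", which is one reason a trained splitter is used in practice.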
Subjectivity and Sentiment Analysis
At this point a Naive Bayes classifier identifies subjective sentences. The specs seem to indicate that the classifier is trained against subjective and objective sentences generated by two additional “rule-based” (unsupervised?) classifiers drawing from “a large corpus.” This point in the process will require some exploration and validation.
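Our reading of that bootstrapping idea can be sketched as follows: a high-precision rule labels only the clear-cut sentences, and a Naive Bayes classifier is then trained on those automatically labeled examples. This is a toy illustration under our own assumptions, not OpinionFinder's code. The clue-word list, the thresholds, and the tiny corpus are all made up.

```python
import math
from collections import Counter

# Hypothetical clue words; OpinionFinder's real rules and lexicon differ.
STRONG_CLUES = {"terrible", "love", "hate", "wonderful", "awful"}


def rule_label(sentence):
    """High-precision rule: label only the clear cases, else None."""
    hits = sum(word in STRONG_CLUES for word in sentence.lower().split())
    if hits >= 2:
        return "subjective"
    if hits == 0:
        return "objective"
    return None  # Ambiguous: leave unlabeled.


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing."""

    def fit(self, sentences, labels):
        self.word_counts = {}
        self.label_counts = Counter(labels)
        self.vocab = set()
        for sentence, label in zip(sentences, labels):
            counts = self.word_counts.setdefault(label, Counter())
            for word in sentence.lower().split():
                counts[word] += 1
                self.vocab.add(word)
        return self

    def predict(self, sentence):
        total = sum(self.label_counts.values())
        scores = {}
        for label, n in self.label_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n / total)
            for word in sentence.lower().split():
                score += math.log((counts[word] + 1) / denom)
            scores[label] = score
        return max(scores, key=scores.get)


corpus = [
    "i love this wonderful plan",
    "i hate this awful terrible bill",
    "the vote is on tuesday",
    "the bill has ten pages",
    "what a wonderful day",  # One clue only: skipped by the rule.
]
labeled = [(s, rule_label(s)) for s in corpus]
train = [(s, y) for s, y in labeled if y is not None]
model = NaiveBayes().fit(*zip(*train))
prediction = model.predict("i love this bill")
print(prediction)
```

The point of the sketch is that the Naive Bayes model can generalize to sentences the rule would have refused to label, such as one with a single clue word.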
Next a direct subjective expression and speech event classifier, built by Eric Breck, tags the direct subjective expressions and speech events found within the document using WordNet.
The final step applies actual sentiment analysis to sentences that have been identified as subjective. This is accomplished with two classifiers that were developed using the BoosTexter machine learning program and trained on the MPQA Corpus.
Evaluation
While we still need to rigorously explore the source code, this system appears to be a gold mine of solutions to both previously unresolved and newly discovered issues in our sentiment analysis process. Named Entity detection along with dependency parse trees will help us filter content to only include sentiment regarding the actual topic being explored (rather than visualizing all subjective content in a comment) as well as helping to reveal popular related topics that exist within any given topic of discussion.
Subjectivity detection and Speech Event Classification are challenges that are acknowledged in a lot of research on the topic of sentiment analysis, but comprehensive solutions have been much more difficult to come by. This system seems to combine a few processes towards those goals (including leveraging WordNet in a new way), and again could really help us filter down our corpus to relevant statements of sentiment for a given topic.
Finally, the actual positive/negative sentiment analysis that is applied to subjective sentences is different from any other process I have read about (most of which involve WordNet and trained classifiers, or our original ad hoc method of matching against the General Inquirer Dictionary). We might want to experiment a bit with this phase to compare how effective the different methods are.
One process that is surprisingly absent from the OpinionFinder system is any sort of negation detection. We may want to explore integrating the algorithm Bruno Ohana experimented with in his dissertation on sentiment analysis, or investigate other solutions.
It may also be interesting to see how things change if we begin to stack some of the processes used by OpinionFinder with systems that we already have in place, such as our GI Osgood Emotive Assignments.
You can download OpinionFinder for free from the project’s website under an open academic license, or download a PDF of the extended abstract/description of the project here:
OpinionFinder-Extended Abstract
posted by Zeke Shore on Feb 16th, 2010
I have come across some fantastic Semantic Analysis research over the past few days, and was able to tap into several research papers and dissertations exploring computational Sentiment Analysis or Opinion Mining (OM). Two that provided significant insight were “Opinion Mining and Sentiment Analysis” (Pang and Lee, 2008) and “Opinion Mining with the SentiWordNet Lexical Resource” (Ohana, 2009).
Recent progress in Opinion Mining techniques within natural language processing identifies a handful of challenges and potential solutions for accurate sentiment analysis of text-based content.
Subjectivity
If our goal is to extract the sentiment, opinions or emotions of users, then we should really only be looking at subjective statements within a user’s comment. This will prevent positively or negatively charged words that are present in objective statements from affecting the comment’s overall sentiment score. Subjectivity could be assessed through a trained classifier algorithm like Naive Bayes or Max Entropy.
On Topic
Topic relevance is an issue that we were already aware of, and we had been searching (with much difficulty) for solutions using dependency grammars. This new round of research seems to dismiss that approach as unrealistically difficult (I’m thinking that could be a project on its own). Unfortunately, no good solution strategies were explored for this issue.
Polarity
This is our root goal: applying a negative or positive sentiment score at various text-unit levels, such as word, sentence, or comment. While VoxPop has thus far been using the General Inquirer Dictionary evaluative definitions, it appears that a few recent projects have been using WordNet (which we explored earlier in our research) and the newer SentiWordNet lexicon for evaluative sentiment assignments.
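To show what scoring at different text-unit levels might look like, here is a minimal Python sketch. The four-word lexicon is invented for illustration. A real run would load evaluative values from the General Inquirer Dictionary or SentiWordNet instead, and this is not VoxPop's actual code.

```python
# Hypothetical mini-lexicon, for illustration only.
LEXICON = {"good": 1, "great": 1, "bad": -1, "awful": -1}


def word_score(word):
    """Word level: look the word up, defaulting to neutral (0)."""
    return LEXICON.get(word.lower().strip(".,!?"), 0)


def sentence_score(sentence):
    """Sentence level: sum of the word scores."""
    return sum(word_score(w) for w in sentence.split())


def comment_score(comment):
    """Comment level: sum of the sentence scores (naive split on '.')."""
    sentences = [s for s in comment.split(".") if s.strip()]
    return sum(sentence_score(s) for s in sentences)


comment = "The speech was good. The policy is awful. The crowd was great."
per_sentence = [sentence_score(s) for s in comment.split(".") if s.strip()]
total = comment_score(comment)
print(per_sentence, total)
```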
Negation Detection
An issue that was just now revealed to us is the problem of Negation Detection. Consider the following two sentences:
Obama’s policies are good.
Obama’s policies are not good.
A normal polarity tagger would give these two sentences the same sentiment score, since both contain one positive word (good). Of course, our second sentence expresses the opposite of positive sentiment, with the adverb ‘not’ inverting the value of “good.” A negation detection process aims to identify these negating words, and then invert the value of any positive or negative words that appear within n words before or after the negating term.
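The window idea can be sketched in a few lines of Python. This is a simplified illustration under our own assumptions, not the algorithm from either paper: the negator list, the two-word polarity table, and the default window size are all hypothetical.

```python
NEGATORS = {"not", "no", "never"}
# Hypothetical word polarities, for illustration only.
POLARITY = {"good": 1, "bad": -1}


def score_with_negation(sentence, window=2):
    """Sum word polarities, flipping any within `window` words of a negator."""
    words = [w.lower().strip(".,!?") for w in sentence.split()]
    negator_positions = [i for i, w in enumerate(words) if w in NEGATORS]
    total = 0
    for i, word in enumerate(words):
        value = POLARITY.get(word, 0)
        # Invert the value when a negating term is close enough.
        if any(abs(i - j) <= window for j in negator_positions):
            value = -value
        total += value
    return total


plain = score_with_negation("Obama's policies are good.")
negated = score_with_negation("Obama's policies are not good.")
print(plain, negated)
```

One limitation of this sketch is that it flips a word once no matter how many negators are nearby, so double negatives are not handled.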
Here are PDFs of two of the more informative articles:
Opinion Mining and Sentiment Analysis
Bo Pang, Lillian Lee
Opinion mining and sentiment analysis
Opinion Mining with the SentiWordNet Lexicon
Bruno Ohana
Opinion mining with the SentiWordNet lexical resource
posted by Zeke Shore on Oct 24th, 2009
We recently presented our research progress, focusing mostly on proof-of-concept results from using the NYT API as an effective corpus, and on exploring the work of Charles Osgood, Shortest Path Distance mapping with WordNet in the NLTK, and mapping words against the Lasswell Value Dictionary. We show some initial emotive analysis of a NYT article comment (which was part of a 38-comment discourse) to show what our WordNet + Lasswell engine might reveal.

Download the full presentation below:
research_presentation
posted by Andrew Mahon on Oct 21st, 2009
This is actually much simpler than previously explained. Given two words, we use NLTK to find the WordNet synsets for each word, and then use more built-in functionality to find the shortest path distance. For this to work, the NLTK WordNet Corpus needs to be installed.
I wrote a simple HTTP GET function that returns a JSON packet containing both words and the Shortest Path Distance between them. Super easy! One thing you should note, which will be fixed, is that if multiple synsets are found for a word, i.e. if a word can serve as multiple parts of speech (I bicycle to the bicycle store), this function will choose the first one returned by WordNet.
import logging
import nltk

def get_distance(self, word_1=None, word_2=None):
    """
    Returns a JSON packet containing word_1 and word_2, and the
    WordNet Shortest Path Distance between word_1 and word_2.
    """
    logging.info("#### Wordnet.get_distance [%s,%s]", word_1, word_2)
    # Both words must be present and non-empty.
    if word_1 and word_2:
        output = {'word_1': word_1, 'word_2': word_2}
        # Use the first synset WordNet returns for each word.
        w1_synset = nltk.corpus.wordnet.synsets(word_1)[0]
        w2_synset = nltk.corpus.wordnet.synsets(word_2)[0]
        output['distance'] = w1_synset.shortest_path_distance(w2_synset)
        return self.json(output)
    else:
        message = 'Must provide two words to find the distance between.'
        logging.error(message)
        return self.json({'error': message})
Please note that this function belongs to a class that extends a more general web controller with the function json(). json() serves to properly render JSON packets to the browser by adding the proper content-type header.
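One possible way to address the first-synset limitation, offered only as a sketch and not as the fix we ended up shipping, is to compare every pairing of synsets and keep the smallest distance. The helper name min_synset_distance is hypothetical. It relies only on each synset object exposing shortest_path_distance(), as in the function above.

```python
from itertools import product


def min_synset_distance(synsets_1, synsets_2):
    """Smallest path distance over every pairing of the two synset lists.

    Returns None when either list is empty or no pair is connected.
    """
    best = None
    for s1, s2 in product(synsets_1, synsets_2):
        distance = s1.shortest_path_distance(s2)
        # Older NLTK releases signal "no path" with -1, newer ones with None.
        if distance is None or distance < 0:
            continue
        if best is None or distance < best:
            best = distance
    return best


# With NLTK and its WordNet corpus installed, a call might look like:
#   from nltk.corpus import wordnet
#   min_synset_distance(wordnet.synsets("bicycle"), wordnet.synsets("store"))
```

The trade-off is cost: the number of comparisons grows with the product of the two synset counts, which may matter for very polysemous words.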
posted by Zeke Shore on Oct 15th, 2009
Returns the distance of the shortest path linking the two synsets (if one exists). For each synset, all the ancestor nodes and their distances are recorded and compared. The ancestor node common to both synsets that can be reached with the minimum number of traversals is used. If no ancestor nodes are common, -1 is returned. If a node is compared with itself 0 is returned.
def shortest_path_distance(self, other):
    """
    @type  other: L{Synset}
    @param other: The Synset to which the shortest path will be found.
    @return: The number of edges in the shortest path connecting the two
        nodes, or -1 if no path exists.
    """
    if self == other: return 0

    path_distance = -1

    dist_list1 = self.hypernym_distances(0)
    dist_dict1 = {}

    dist_list2 = other.hypernym_distances(0)
    dist_dict2 = {}

    # Transform each distance list into a dictionary. In cases where
    # there are duplicate nodes in the list (due to there being multiple
    # paths to the root) the duplicate with the shortest distance from
    # the original node is entered.
    for (l, d) in [(dist_list1, dist_dict1), (dist_list2, dist_dict2)]:
        for (key, value) in l:
            if key in d:
                if value < d[key]:
                    d[key] = value
            else:
                d[key] = value

    # For each ancestor synset common to both subject synsets, find the
    # connecting path length. Return the shortest of these.
    for synset1 in dist_dict1.keys():
        for synset2 in dist_dict2.keys():
            if synset1 == synset2:
                new_distance = dist_dict1[synset1] + dist_dict2[synset2]
                if path_distance < 0 or new_distance < path_distance:
                    path_distance = new_distance

    return path_distance
nltk.wordnet.synset source code
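To see the common-ancestor idea at work without installing WordNet, here is a small standalone Python sketch of the same approach on a hand-made hypernym graph. The graph and the helper names are invented for illustration, and the sketch uses a set intersection where the NLTK code uses nested loops.

```python
# Toy hypernym graph (child -> parents); not real WordNet data.
HYPERNYMS = {
    "dog": ["canine"],
    "cat": ["feline"],
    "canine": ["carnivore"],
    "feline": ["carnivore"],
    "carnivore": ["animal"],
    "animal": [],
    "rock": [],
}


def ancestor_distances(node, distance=0):
    """Map the node and each of its ancestors to the minimum distance."""
    distances = {node: distance}
    for parent in HYPERNYMS[node]:
        for ancestor, d in ancestor_distances(parent, distance + 1).items():
            if ancestor not in distances or d < distances[ancestor]:
                distances[ancestor] = d
    return distances


def shortest_path_distance(a, b):
    """Edges on the shortest path through a shared ancestor, or -1."""
    if a == b:
        return 0
    dist_a, dist_b = ancestor_distances(a), ancestor_distances(b)
    common = set(dist_a) & set(dist_b)
    if not common:
        return -1
    return min(dist_a[n] + dist_b[n] for n in common)


print(shortest_path_distance("dog", "cat"))   # via "carnivore"
print(shortest_path_distance("dog", "rock"))  # no shared ancestor
```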
These shortest path distance values can now be checked, for any word we wish to classify emotively, against the evaluative extremes (good/bad), the potency extremes (strong/weak), and the activity extremes (active/passive).

Kamps and Marx describe the equation for determining the evaluative characteristics of a word in their essay Using WordNet to Measure Semantic Orientations of Adjectives. It returns a value between -1 and 1. A value closer to -1 indicates that the word is more strongly linked to ‘bad,’ and similarly a value closer to +1 indicates a strong link to ‘good.’ A value close to 0 describes the word as neutral. Download the full PDF of Kamps and Marx’s research below:
kamp_usin03
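As we understand the paper, the evaluative measure has the form EVA(w) = (d(w, bad) - d(w, good)) / d(good, bad), where d is the shortest path distance. The Python sketch below is our own illustration of that formula, not code from the paper. The distance table is made up so the example runs without WordNet, and the function names are hypothetical.

```python
def eva(word, distance):
    """Evaluative score of `word`; `distance` is a two-argument callable."""
    return (distance(word, "bad") - distance(word, "good")) / distance("good", "bad")


# Made-up symmetric distances standing in for WordNet path lengths.
TOY = {
    frozenset(["good", "bad"]): 4,
    frozenset(["honest", "good"]): 1,
    frozenset(["honest", "bad"]): 3,
    frozenset(["evil", "good"]): 4,
    frozenset(["evil", "bad"]): 2,
}


def toy_distance(a, b):
    """Look up a toy distance; a word is at distance 0 from itself."""
    return 0 if a == b else TOY[frozenset([a, b])]


for word in ["good", "honest", "evil", "bad"]:
    print(word, eva(word, toy_distance))
```

In a real run, the distance argument would be a function backed by WordNet, such as one built on shortest_path_distance().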
posted by Zeke Shore on Oct 9th, 2009
Abstract
The traditional notion of word meaning used in natural language processing is literal or lexical meaning as used in dictionaries and lexicons. This relatively objective notion of lexical meaning is different from more subjective notions of emotive or affective meaning. Our aim is to come to grips with subjective aspects of meaning expressed in written texts, such as the attitude or value expressed in them. This paper explores how the structure of the WordNet lexical database might be used to assess affective or emotive meaning. In particular, we construct measures based on Osgood’s semantic differential technique.
By Jaap Kamps and Maarten Marx
Download the full PDF below
Words with attitude
Notes and Further Exploration
Kamps and Marx present some interesting research that is very applicable to emotively analyzing online discourse. In this paper, subjective understanding is computationally extracted from text using Charles Osgood’s Theory of Semantic Differentiation as a guide for mapping word relationships in Princeton University’s WordNet Lexical Database. Osgood’s work in the late 1950s established this technique, which the authors describe as follows:
“semantic differential technique is using several pairs of bipolar adjectives to scale the responses of subjects to words, short phrases, or texts. That is, subjects are asked to rate their meaning on scales like active–passive; good–bad; optimistic–pessimistic; positive–negative; strong–weak; serious–humorous; and ugly–beautiful.”
Osgood’s research further revealed that most of the variance in affective meaning could be attributed to three major factors:
“These three factors of the affective or emotive meaning are the evaluative factor (e.g., good–bad); the potency factor (e.g., strong-weak); and the activity factor (e.g., active–passive). Among these three factors, the evaluative factor has the strongest relative weight.”
Using these three emotive axes, Kamps and Marx mapped words as they related to “good” and “bad” within the context of the WordNet lexical database by finding minimal path lengths:
“The minimal path-length is a straightforward generalization of the synonymy relation. The synonymy relation connects words with similar meaning, so the minimal distance between words says something on the similarity of their meaning.”
Words can now be scaled via this process as either negative, neutral, or positive. This same process can be followed to assign activity and potency ratings to words as well, revealing clear emotive qualities of the content.
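To illustrate how the same process could extend to all three factors, here is a hedged Python sketch. It assumes that the potency and activity scores follow the same pattern as the evaluative one, with strong/weak and active/passive as the poles, which is our reading of the paper rather than a quotation. The distances, the neutral band, and all names are invented for illustration.

```python
def factor(word, positive_pole, negative_pole, distance):
    """Differential score along one pair of opposite adjectives."""
    span = distance(positive_pole, negative_pole)
    return (distance(word, negative_pole) - distance(word, positive_pole)) / span


def label(score, neutral_band=0.1):
    """Coarse label for a score; the band width is an arbitrary choice."""
    if score > neutral_band:
        return "positive"
    if score < -neutral_band:
        return "negative"
    return "neutral"


AXES = {
    "evaluative": ("good", "bad"),
    "potency": ("strong", "weak"),
    "activity": ("active", "passive"),
}


def profile(word, distance):
    """Score a word on all three axes."""
    return {name: factor(word, pos, neg, distance)
            for name, (pos, neg) in AXES.items()}


# Made-up distances for one word, standing in for WordNet path lengths.
TOY = {
    ("good", "bad"): 4, ("strong", "weak"): 4, ("active", "passive"): 4,
    ("fierce", "good"): 3, ("fierce", "bad"): 3,
    ("fierce", "strong"): 1, ("fierce", "weak"): 3,
    ("fierce", "active"): 2, ("fierce", "passive"): 4,
}


def toy_distance(a, b):
    """Symmetric lookup in the toy table."""
    return TOY.get((a, b), TOY.get((b, a)))


scores = profile("fierce", toy_distance)
print({name: (value, label(value)) for name, value in scores.items()})
```

Here "positive" simply means closer to the first pole of each pair (good, strong, or active), so on the potency and activity axes it should not be read as approval.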