beanstalkd

posted by Andrew Mahon on Oct 26th, 2009

A couple of days ago, I implemented a beanstalkd-driven queuing system for VoxPop’s longer-running tasks: New York Times API requests and natural language processing functions. From the beanstalkd website: “Beanstalk is a simple, fast workqueue service. Its interface is generic, but was originally designed for reducing the latency of page views in high-volume web applications by running time-consuming tasks asynchronously.”

As implemented right now, VoxPop maintains three tubes: one each for the Article API and the Community API, and one for NLP tasks. Python worker threads watch these tubes and process each tube’s jobs sequentially. The API threads add a delay between requests to avoid overloading the API server. I am working on a system for monitoring queued jobs as they are processed.
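The worker arrangement described above can be sketched roughly as follows. This is only an illustration of the pattern (one thread per tube, jobs handled in order, an optional pause after each job). An in-memory queue.Queue stands in for a beanstalkd tube so the sketch runs without a server, and the tube names and delay values are hypothetical, not VoxPop's actual settings.

```python
import queue
import threading
import time


def worker(tube, delay, handle, sleep=time.sleep):
    """Process one tube's jobs in order, pausing `delay` seconds after each."""
    while True:
        job = tube.get()
        if job is None:  # sentinel value: no more jobs, stop the worker
            break
        handle(job)
        if delay:
            sleep(delay)  # throttle so the remote API is not overloaded


def run(jobs_by_tube, delays):
    """Start one worker thread per tube; return the jobs each worker handled."""
    results = {name: [] for name in jobs_by_tube}
    threads = []
    for name, jobs in jobs_by_tube.items():
        tube = queue.Queue()  # stand-in for a beanstalkd tube (assumption)
        for job in jobs:
            tube.put(job)
        tube.put(None)
        thread = threading.Thread(
            target=worker, args=(tube, delays[name], results[name].append)
        )
        thread.start()
        threads.append(thread)
    for thread in threads:
        thread.join()
    return results
```

For example, run({"article_api": ["a1", "a2"], "nlp": ["n1"]}, {"article_api": 0.01, "nlp": 0.0}) handles the two API jobs in order with a short pause after each, while the NLP job runs without any delay.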

Building and running beanstalkd on OS X Snow Leopard (10.6) was pretty straightforward. Unless it is already installed, we must first compile and install libevent, the event library that beanstalkd runs on:

mkdir temp
cd temp
curl -O http://www.monkey.org/~provos/libevent-1.4.12-stable.tar.gz
tar xvzf libevent-1.4.12-stable.tar.gz
cd libevent-1.4.12-stable
./configure
make
sudo make install

Once libevent is up and available, we can move on to compiling and installing beanstalkd:

mkdir temp
cd temp
curl -O http://xph.us/dist/beanstalkd/beanstalkd-1.4.2.tar.gz
tar xvzf beanstalkd-1.4.2.tar.gz
cd beanstalkd-1.4.2
./configure
make
sudo make install

Once beanstalkd is installed, we can test it by firing it up:

./beanstalkd -d -l 10.0.1.5 -p 11300
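Since -d sends the daemon to the background, one quick way to confirm it is accepting connections is to open a TCP connection to the address and port given above. A minimal sketch using only the standard library; the helper name is_listening is illustrative:

```python
import socket


def is_listening(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False
```

With the command above, is_listening("10.0.1.5", 11300) should return True once the daemon is up.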

The Beanstalk protocol documentation can be found here: http://github.com/kr/beanstalkd/blob/v1.1/doc/protocol.txt
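The protocol is plain text over TCP, so the commands are easy to frame by hand. The sketch below builds the byte strings for a few of the commands in that document (put, use, watch, reserve, delete). The helper names are illustrative and not part of any client library, and the default priority and TTR values are arbitrary choices.

```python
def put(body, priority=1024, delay=0, ttr=60):
    """Frame a put command: 'put <pri> <delay> <ttr> <bytes>' then the body."""
    data = body.encode("utf-8")
    header = "put %d %d %d %d\r\n" % (priority, delay, ttr, len(data))
    return header.encode("ascii") + data + b"\r\n"


def use(tube):
    """Producer side: select the tube that later put commands go to."""
    return ("use %s\r\n" % tube).encode("ascii")


def watch(tube):
    """Worker side: add a tube to the set that reserve draws jobs from."""
    return ("watch %s\r\n" % tube).encode("ascii")


def reserve():
    """Worker side: ask for the next ready job from the watched tubes."""
    return b"reserve\r\n"


def delete(job_id):
    """Worker side: remove a finished job from the queue."""
    return ("delete %d\r\n" % job_id).encode("ascii")
```

A producer would send use(...) followed by put(...) over its socket; a worker would send watch(...), then loop on reserve() and delete(...) for each job it finishes.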

CouchDB Slideset

posted by Andrew Mahon on Oct 26th, 2009

A set of slides on CouchDB. While it is a bit lacking without the accompanying narration, the set gives a visual sense of some of CouchDB’s fundamental ideas.

Shortest Path Distance with NLTK

posted by Andrew Mahon on Oct 21st, 2009

This is actually much simpler than previously explained. Given two words, we use NLTK to find the WordNet synsets for each word, and then use more built-in functionality to find the shortest path distance. For this to work, the NLTK WordNet corpus needs to be installed.

I wrote a simple HTTP GET function that returns a JSON packet containing both words and the shortest path distance between them. Super easy! One thing to note, which will be fixed: if multiple synsets are found for a word, e.g. when a word can serve as multiple parts of speech (I bicycle to the bicycle store), this function chooses the first one returned by WordNet.

import logging

import nltk


def get_distance(self, word_1=None, word_2=None):
    """
    Returns a JSON packet containing word_1 and word_2, and the
    WordNet Shortest Path Distance between word_1 and word_2.
    """
    logging.info("#### Wordnet.get_distance [%s,%s]", word_1, word_2)
    # Both words must be present and non-empty.
    if word_1 and word_2:
        output = {'word_1': word_1, 'word_2': word_2}
        # Take the first synset WordNet returns for each word.
        w1_synset = nltk.corpus.wordnet.synsets(word_1)[0]
        w2_synset = nltk.corpus.wordnet.synsets(word_2)[0]
        output['distance'] = w1_synset.shortest_path_distance(w2_synset)
        return self.json(output)
    else:
        message = 'Must provide two words to find the distance between.'
        logging.error(message)
        return self.json({'error': message})

Please note that this function belongs to a class that extends a more general web controller with the function json(). json() renders JSON packets to the browser properly by adding the correct content-type header.
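One possible way to lift the first-synset limitation mentioned above is to compare every pair of synsets and keep the smallest distance. This is a sketch of that idea, not the fix VoxPop actually adopted. The helper name min_synset_distance is hypothetical, and it assumes shortest_path_distance signals "no path" with either None or a negative number, depending on the NLTK version.

```python
from itertools import product


def min_synset_distance(synsets_1, synsets_2):
    """Return the smallest path distance over all synset pairs, or None."""
    best = None
    for s1, s2 in product(synsets_1, synsets_2):
        distance = s1.shortest_path_distance(s2)
        # Skip pairs with no connecting path (None or a negative value).
        if distance is None or distance < 0:
            continue
        if best is None or distance < best:
            best = distance
    return best
```

It would be called with the full synset lists, e.g. min_synset_distance(nltk.corpus.wordnet.synsets(word_1), nltk.corpus.wordnet.synsets(word_2)), and it returns None when either list is empty or no pair is connected.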

Returning WordNet Shortest Path Distance with NLTK

posted by Zeke Shore on Oct 15th, 2009

Returns the distance of the shortest path linking the two synsets (if one exists). For each synset, all the ancestor nodes and their distances are recorded and compared. The ancestor node common to both synsets that can be reached with the minimum number of traversals is used. If no ancestor nodes are common, -1 is returned. If a node is compared with itself 0 is returned.

@type other: L{Synset}
@param other: The Synset to which the shortest path will be found.
@return: The number of edges in the shortest path connecting the two
nodes, or -1 if no path exists.

def shortest_path_distance(self, other):
    if self == other:
        return 0

    path_distance = -1

    dist_list1 = self.hypernym_distances(0)
    dist_dict1 = {}

    dist_list2 = other.hypernym_distances(0)
    dist_dict2 = {}

    # Transform each distance list into a dictionary. In cases where
    # there are duplicate nodes in the list (due to there being multiple
    # paths to the root) the duplicate with the shortest distance from
    # the original node is entered.

    for (l, d) in [(dist_list1, dist_dict1), (dist_list2, dist_dict2)]:
        for (key, value) in l:
            if key in d:
                if value < d[key]:
                    d[key] = value
            else:
                d[key] = value

    # For each ancestor synset common to both subject synsets, find the
    # connecting path length. Return the shortest of these.

    for synset1 in dist_dict1.keys():
        for synset2 in dist_dict2.keys():
            if synset1 == synset2:
                new_distance = dist_dict1[synset1] + dist_dict2[synset2]
                if path_distance < 0 or new_distance < path_distance:
                    path_distance = new_distance

    return path_distance

nltk.wordnet.synset source code
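To see the algorithm at work without installing WordNet, here is a self-contained sketch that applies the same idea to a toy hypernym hierarchy. The words and the parents mapping are invented for illustration and are not WordNet data; the functions are standalone re-implementations rather than NLTK's own code.

```python
def hypernym_distances(node, parents, distance=0):
    """Return (ancestor, distance) pairs for node, including node itself."""
    pairs = [(node, distance)]
    for parent in parents.get(node, ()):
        pairs.extend(hypernym_distances(parent, parents, distance + 1))
    return pairs


def shortest_path_distance(a, b, parents):
    """Edges on the shortest path through a common ancestor, or -1."""
    if a == b:
        return 0
    dist_a, dist_b = {}, {}
    # Keep the smallest distance when an ancestor is reachable several ways.
    for node, pairs in ((a, dist_a), (b, dist_b)):
        for ancestor, d in hypernym_distances(node, parents):
            pairs[ancestor] = min(d, pairs.get(ancestor, d))
    common = set(dist_a) & set(dist_b)
    if not common:
        return -1
    return min(dist_a[n] + dist_b[n] for n in common)


# Toy hierarchy (child -> list of parents); purely illustrative.
PARENTS = {
    "dog": ["canine"],
    "cat": ["feline"],
    "canine": ["carnivore"],
    "feline": ["carnivore"],
    "carnivore": ["animal"],
}

print(shortest_path_distance("dog", "cat", PARENTS))     # 4, via carnivore
print(shortest_path_distance("dog", "canine", PARENTS))  # 1
print(shortest_path_distance("dog", "rock", PARENTS))    # -1, no common ancestor
```

Here "dog" and "cat" each sit two edges below "carnivore", their nearest shared ancestor, so the distance is 2 + 2 = 4.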

Now the shortest path distance values for any word we wish to classify emotively can be checked against its evaluative extremes (good/bad), its potency factor (strong/weak), and its activity factor (active/passive).
[Image: eva_formula]

This is the equation described by Kamps and Marx in their essay Using WordNet to Measure Semantic Orientations of Adjectives for determining the evaluative characteristics of a word. It returns a value between -1 and 1. A value closer to -1 indicates that the word is more strongly linked to ‘bad,’ and similarly a value closer to +1 indicates a strong link to ‘good.’ A value close to 0 describes the word as neutral. Download the full PDF of Kamps and Marx’s research below:
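As a rough sketch, assuming the equation has the form EVA(w) = (d(w, bad) - d(w, good)) / d(good, bad), where d is the shortest path distance, the calculation comes down to a few lines. The function name eva and its parameter names are illustrative; the three distances would come from shortest-path lookups like the one above.

```python
def eva(d_word_good, d_word_bad, d_good_bad):
    """Evaluative score of a word from three shortest path distances.

    Assumed form: (d(w, bad) - d(w, good)) / d(good, bad).
    """
    if d_good_bad <= 0:
        raise ValueError("the distance between 'good' and 'bad' must be positive")
    return (d_word_bad - d_word_good) / float(d_good_bad)
```

Under this form, a word that coincides with ‘good’ has d(w, good) = 0 and d(w, bad) = d(good, bad), giving +1; a word that coincides with ‘bad’ gives -1; and a word equally far from both gives 0.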

kamp_usin03

Dev Environment

posted by Andrew Mahon on Oct 12th, 2009

Kicking off VoxPop, we decided to push forward with cutting-edge technologies:

webpy

Python: While this is not *cutting edge* per se, it is the language on which the Natural Language Toolkit is built. It also forms the base of Web.py, the minimalist web framework that will drive VoxPop. While developing Make History this past summer, Local Project’s Brian House built the site’s back end on Web.py and has had only good things to say about it. Beyond his recommendation, I enjoy the fact that it provides a robust base to build on and does not try to provide too much functionality.

CouchDB: Over the summer I worked on a project built on Google’s App Engine and used its Datastore, a document-based database built on top of BigTable. While document-based databases present their fair share of development challenges, I enjoyed the flexibility. After beginning to develop in MySQL, I started searching for alternatives, and CouchDB came up as one of the best options. It is built in Erlang and queried using a Map/Reduce implementation running on Mozilla’s SpiderMonkey JavaScript engine, so I was instantly interested in giving it a shot. In later posts, I will explore the challenges and benefits of developing with a document-based database.

HTML5/Canvas: Anything but Flash, really. More to come on this later.