Archive for January, 2010

Using Muddy as a simple entity extractor

Muddy performs a few different tasks, and you may find you don’t need all of them initially. Before building anything more on top of Muddy, most people first want it to act as a simple term/concept/entity extraction API: given a piece of text, return the notable things that occur in it. To support this we’ve recently added a simpler API method (‘extract’) that doesn’t require a collection and doesn’t store the entity extraction results. The API can be used with or without a Muddy account; if you’re not authenticated, you’ll be limited by IP address.

A sample (unauthenticated) curl session is shown below :

echo '<page>
<text>Gordon Brown and Tony Blair went to town.</text>
<options>
<realtime>true</realtime>
</options>
</page>' | curl -X POST -H 'Content-type: text/xml' -H 'Accept: text/xml' -d @- http://muddy.it/extract

<?xml version="1.0" encoding="UTF-8"?>
<response status="OK">
  <title></title>
  <entities>
    <entity>
      <term>Tony Blair</term>
      <uri>http://dbpedia.org/resource/Tony_Blair</uri>
      <confidence>1.0</confidence>
      <classification>http://muddy.it/ontology/Person</classification>
      <position>17</position>
    </entity>
...
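The same request can also be made from Ruby without curl or a client library. The following is a minimal sketch using only the standard library (Net::HTTP and REXML). The helper names build_extract_request, parse_entities and extract_entities are hypothetical, and the parsing assumes the response has the shape shown in the sample above; elements that are missing from an entity are returned as nil.

```ruby
require 'net/http'
require 'uri'
require 'rexml/document'

# Endpoint taken from the curl example above.
EXTRACT_URI = URI.parse('http://muddy.it/extract')

# Build the XML request body (hypothetical helper).
# REXML escapes characters such as & and < in the text for us.
def build_extract_request(text)
  doc = REXML::Document.new
  page = doc.add_element('page')
  page.add_element('text').text = text
  options = page.add_element('options')
  options.add_element('realtime').text = 'true'
  doc.to_s
end

# Return the text of a child element, or nil if the element is absent.
def child_text(element, name)
  child = element.elements[name]
  child && child.text
end

# Turn a response document into an array of hashes, one per entity.
# Assumes the <response>/<entities>/<entity> layout shown above.
def parse_entities(xml)
  doc = REXML::Document.new(xml)
  entities = []
  doc.elements.each('response/entities/entity') do |entity|
    confidence = child_text(entity, 'confidence')
    position = child_text(entity, 'position')
    entities << {
      :term => child_text(entity, 'term'),
      :uri => child_text(entity, 'uri'),
      :classification => child_text(entity, 'classification'),
      :confidence => confidence && confidence.to_f,
      :position => position && position.to_i
    }
  end
  entities
end

# POST the text to the extract endpoint and parse the reply.
# This performs a network request, so it is only defined here, not called.
def extract_entities(text)
  request = Net::HTTP::Post.new(EXTRACT_URI.path)
  request['Content-Type'] = 'text/xml'
  request['Accept'] = 'text/xml'
  request.body = build_extract_request(text)
  response = Net::HTTP.start(EXTRACT_URI.host, EXTRACT_URI.port) do |http|
    http.request(request)
  end
  parse_entities(response.body)
end
```

Calling extract_entities('Gordon Brown and Tony Blair went to town.') would then return one hash per entity, which can be filtered on :confidence or :classification as needed.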

Some sample code to extract ‘terms’ from a given piece of source text using the muddyit_fu gem and the new extract method is shown below :

#!/usr/bin/ruby
require 'rubygems'
require 'muddyit_fu'
muddyit = Muddyit.new('./config.yml')
page = muddyit.extract(ARGV[0], :disambiguate => false, :include_unclassified => true)
puts "Contains:"
page.entities.each do |entity|
  puts "\t#{entity.term}"
end

The script expects a text string as its first argument and prints out the extracted terms to STDOUT :

ruby extract.rb "Gordon Brown and Tony Blair went to town"
Contains:
	Tony Blair
	Gordon Brown

Because we want to retrieve as many terms as possible from the source text, we include entities that have no classification, and we disable disambiguation to improve response times (we’re only interested in the text terms rather than grounded entities). To retrieve disambiguated, grounded entities instead, enable the ‘disambiguate’ option again so that any entities identified are disambiguated where appropriate.

Building with Muddy and OAuth

There are two authentication methods available when building against the Muddy system: OAuth and HTTP Basic Auth. We strongly recommend using OAuth when allowing other systems access to your data in Muddy, as using HTTP Basic Auth can be a security risk. However, HTTP Basic Auth is easier to use, is often better supported across development languages, and can be appropriate if you are aware of its risks and are happy to work with them.

In the introductory ‘Getting Started with Muddy’ article, we used HTTP Basic Auth in the example given. We’ll now rework it using OAuth. If you are unfamiliar with OAuth, you might want to have a look at oauth.net for further information.

Authenticating with Muddy

In order to allow your programs to work with Muddy, you’ll first need to register them as client applications with your Muddy account. To register an application, log in, visit the OAuth clients page and click ‘Register your application’ :

[Screenshot: Muddy Register Application]

Add a title, an application URL and any other relevant attributes, then click ‘Register’ :

[Screenshot: Muddy Registered Application]

The ‘Consumer Key’ and ‘Consumer Secret’ are the attributes you’ll need to authorise your client application to access your Muddy data.
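To make the role of the two secrets more concrete, here is a minimal sketch of how an OAuth 1.0 client signs a request with the HMAC-SHA1 method described in the OAuth specification. It assumes Muddy accepts HMAC-SHA1 signatures, which the article does not state, and client libraries normally do this work for you. The helper names and the parameter values in the usage lines are made up for illustration.

```ruby
require 'openssl'

# Percent-encode a value as OAuth 1.0 requires: every byte except
# letters, digits, '-', '.', '_' and '~' becomes %XX (uppercase hex).
def percent_encode(value)
  value.to_s.gsub(/[^A-Za-z0-9\-._~]/) do |char|
    char.bytes.map { |byte| format('%%%02X', byte) }.join
  end
end

# Build the signature base string: METHOD & encoded URL & encoded,
# sorted parameter list.
def signature_base_string(http_method, url, params)
  normalized = params.map { |key, value| [percent_encode(key), percent_encode(value)] }
                     .sort
                     .map { |pair| pair.join('=') }
                     .join('&')
  [http_method.upcase, percent_encode(url), percent_encode(normalized)].join('&')
end

# Sign the base string. The key is the consumer secret and the token
# secret joined by '&'; the token secret is empty until a token exists.
def hmac_sha1_signature(base_string, consumer_secret, token_secret = '')
  key = "#{percent_encode(consumer_secret)}&#{percent_encode(token_secret)}"
  digest = OpenSSL::HMAC.digest(OpenSSL::Digest.new('SHA1'), key, base_string)
  [digest].pack('m0') # strict Base64, no trailing newline
end

# Illustrative values only; none of these are real credentials.
params = {
  :oauth_consumer_key => 'example-consumer-key',
  :oauth_token => 'example-access-token',
  :oauth_signature_method => 'HMAC-SHA1',
  :oauth_timestamp => '1262304000',
  :oauth_nonce => 'example-nonce',
  :oauth_version => '1.0'
}
base = signature_base_string('POST', 'http://muddy.it/extract', params)
puts hmac_sha1_signature(base, 'example-consumer-secret', 'example-token-secret')
```

The point to take away is that neither secret is ever sent over the wire: only the keys and the resulting signature travel with the request, which is why OAuth is safer than handing your password to a third party.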

A sample application : Newsminer

In the previous article we created a small application called ‘Newsminer’. We’ll rework this now, using OAuth instead of HTTP Basic Auth. Again, we’ll use the muddyit_fu Ruby client library.

#!/usr/bin/ruby
require 'rubygems'
require 'muddyit_fu'
require 'rss'
require 'open-uri'
config = { :collection_token => 'mwkllxs7',
           :consumer_key => 'Ta0kS7jAkezMmJTQYMKStQ',
           :consumer_secret => 'sEXDiVSWHVc9kqjWQ2bRDU3I1gnplDTDwB5MEJWxnNE',
           :access_token => 'Har7Us3ZsOaN6TpqwW0AA',
           :access_token_secret => '96PJgoZIxAKXiJKwu323wyh6UlhezPoLdtQShsbL0'
}
# Connect to Muddy
muddyit =  Muddyit.new(:consumer_key => config[:consumer_key],
                       :consumer_secret => config[:consumer_secret],
                       :access_token => config[:access_token],
                       :access_token_secret => config[:access_token_secret])
collection = muddyit.collection.find(config[:collection_token])
# Parse RSS
rss_content = ''
open('http://newsrss.bbc.co.uk/rss/newsonline_uk_edition/uk_politics/rss.xml') do |f|
  rss_content = f.read
end
rss = RSS::Parser.parse(rss_content, false)
# Loop through, analyse and display entities
rss.items.each do |item|
  page = collection.pages.create(item.guid.content, :realtime => true, :store => true)
  puts "#{item.guid.content} contains:"
  page.entities.each do |entity|
    puts "\t#{entity.term}, #{entity.classification}"
  end
end

For the script to work, you’ll need to log in and note down the token for the collection your content is stored in (the ‘collection_token’); you can find it by visiting ‘Dashboard’ → ‘View analysed Pages’ → ‘Settings’. You’ll also need to authorise the script via OAuth. To do this, register a client application as described previously, then use the convenience script provided with muddyit_fu to obtain the authentication details required by the newsminer script. A sample session is shown below :

$ ruby ./examples/oauth.rb

> enter consumer key
45048ANdEByjSuF2IogpQ
> enter consumer secret
9uew3saTCM2RlEU0k122RgbkMUZdNKpTLJM1mJiX5jw
> redirecting you to muddy to authorize
> opening http://muddy.it/oauth/authorize?oauth_token=ZXdoJsaphYwdBpLpt9xSZw
> authorize in the browser and then press enter

Access Details

Token : tuiBqD5ct6eZ1RlxNKdQ
Secret : EO9wJB2Xz7sEneoWqcOCnqslkSit4M9muJes4SF4

Add these details to the script and execute it. You’ll see the BBC News pages being indexed and the entities identified in them, and if you log in to Muddy you’ll see the indexed pages :

[Screenshot: Muddy - BBC News Stories]
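Pasting the keys and secrets straight into the script is fine for a demo, but they are easier to keep out of version control if they live in a separate file. The following is a minimal sketch, assuming a YAML file whose keys mirror the config hash in the Newsminer script. The file layout and the helper names are hypothetical and are not necessarily the format that muddyit_fu’s own config.yml expects.

```ruby
require 'yaml'

# The keys used by the config hash in the Newsminer script.
REQUIRED_KEYS = [:collection_token, :consumer_key, :consumer_secret,
                 :access_token, :access_token_secret]

# Parse YAML text into a symbol-keyed hash and check nothing is missing.
def parse_muddy_config(yaml_text)
  raw = YAML.safe_load(yaml_text)
  raise ArgumentError, 'config must be a YAML mapping' unless raw.is_a?(Hash)
  config = {}
  raw.each { |key, value| config[key.to_sym] = value }
  missing = REQUIRED_KEYS.reject { |key| config[key] }
  unless missing.empty?
    raise ArgumentError, "missing config keys: #{missing.join(', ')}"
  end
  config
end

# Read the settings from a file on disk (hypothetical path and layout).
def load_muddy_config(path)
  parse_muddy_config(File.read(path))
end
```

With this in place, the config hash in the script could be replaced by something like config = load_muddy_config('./newsminer.yml'), and a missing value would fail early with a clear message instead of as an authentication error.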

That’s it. As you can see, OAuth is a bit more complicated to use than HTTP Basic Auth, but it’s well worth using if you’re giving third parties access to your data.