Using Muddy as a simple entity extractor

Muddy performs a few different tasks, and you may find you don’t need all of them initially. The most common thing people want from Muddy, before they build anything on top of it, is a simple term/concept/entity extraction API: given a piece of text, return the notable things that occur in it. To support this we’ve recently added a simpler API method (‘extract’) that doesn’t require a collection and doesn’t store the entity extraction results. The API can be used with or without a Muddy account; if you’re not authenticated, you’ll be limited by IP address.

A sample (unauthenticated) curl session is shown below :

echo '<page>
<text>Gordon Brown and Tony Blair went to town.</text>
<options>
<realtime>true</realtime>
</options>
</page>' | curl -X POST -H 'Content-type: text/xml' -H 'Accept: text/xml' -d @- http://muddy.it/extract

<?xml version="1.0" encoding="UTF-8"?>
<response status="OK">
  <title></title>
  <entities>
    <entity>
      <term>Tony Blair</term>
      <uri>http://dbpedia.org/resource/Tony_Blair</uri>
      <confidence>1.0</confidence>
      <classification>http://muddy.it/ontology/Person</classification>
      <position>17</position>
    </entity>
...
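If you would rather work with the raw XML than use a client library, the response can be read with Ruby's REXML. The snippet below is only a sketch: the response above is truncated, so the sample document here has been closed by hand (the closing tags are our assumption), and the `parse_entities` helper is a name made up for this example.

```ruby
require 'rexml/document'

# Hand-completed sample: the response shown above is cut off after the first
# entity, so the closing </entities> and </response> tags are an assumption.
SAMPLE_RESPONSE = <<~XML
  <?xml version="1.0" encoding="UTF-8"?>
  <response status="OK">
    <title></title>
    <entities>
      <entity>
        <term>Tony Blair</term>
        <uri>http://dbpedia.org/resource/Tony_Blair</uri>
        <confidence>1.0</confidence>
        <classification>http://muddy.it/ontology/Person</classification>
        <position>17</position>
      </entity>
    </entities>
  </response>
XML

# Hypothetical helper: turn each <entity> element into a plain Ruby hash.
def parse_entities(xml)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, '/response/entities/entity').map do |entity|
    {
      :term           => entity.elements['term'].text,
      :uri            => entity.elements['uri'].text,
      :confidence     => entity.elements['confidence'].text.to_f,
      :classification => entity.elements['classification'].text,
      :position       => entity.elements['position'].text.to_i
    }
  end
end

parse_entities(SAMPLE_RESPONSE).each do |entity|
  puts "#{entity[:term]} (#{entity[:confidence]}) -> #{entity[:uri]}"
end
```

Converting the confidence and position to numbers up front makes it easier to sort or filter the results later.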

Some sample code to extract ‘terms’ from a given piece of source text using the muddyit_fu gem and the new extract method is shown below :

#!/usr/bin/ruby
require 'rubygems'
require 'muddyit_fu'
muddyit =  Muddyit.new('./config.yml')
page = muddyit.extract(ARGV[0], :disambiguate => false, :include_unclassified => true)
puts "Contains:"
page.entities.each do |entity|
  puts "\t#{entity.term}"
end

The script expects a text string as its first argument and prints the extracted terms to STDOUT :

ruby extract.rb "Gordon Brown and Tony Blair went to town"
Contains:
	Tony Blair
	Gordon Brown

As we want to retrieve as many terms as possible from the source text, we expand our list of available entities by including ones that have no classification and we disable disambiguation to improve response times (as we’re only interested in the text terms rather than a grounded entity). If we wanted to retrieve disambiguated, grounded entities, rather than just text terms, then the ‘disambiguate’ option can be enabled again to ensure any entities identified have been disambiguated where appropriate.

Building with Muddy and OAuth

There are two authentication methods provided when building against the Muddy system: OAuth and HTTP Basic Auth. We strongly recommend using OAuth when allowing other systems access to your data in Muddy, as using HTTP Basic Auth can be a security risk. However, HTTP Basic Auth is easier to use, is often better supported across development languages and can be appropriate if you are aware of its risks and are happy to work with them.

In the introductory ‘Getting Started with Muddy’ article, we used HTTP Basic Auth in the example given. We’ll now rework it using OAuth. If you are unfamiliar with OAuth, you might want to have a look at oauth.net for further information.

Authenticating with Muddy

In order to allow your programs to work with Muddy, you’ll first need to register them as client applications with your Muddy account. To register an application, log in, visit the OAuth clients page and click ‘Register your application’ :

Muddy Register Application

Add a title, an application URL and any other relevant attributes, then click ‘Register’ :

Muddy Registered Application

The ‘Consumer Key’ and ‘Consumer Secret’ are the attributes you’ll need to authorise your client application to access your Muddy data.

A sample application : Newsminer

In the previous article we created a small application called ‘Newsminer’.  We’ll rework this now, using OAuth instead of HTTP Basic Auth.   Again, we’ll use the muddyit_fu Ruby client library.

#!/usr/bin/ruby
require 'rubygems'
require 'muddyit_fu'
require 'rss'
require 'open-uri'
config = { :collection_token => 'mwkllxs7',
           :consumer_key => 'Ta0kS7jAkezMmJTQYMKStQ',
           :consumer_secret => 'sEXDiVSWHVc9kqjWQ2bRDU3I1gnplDTDwB5MEJWxnNE',
           :access_token => 'Har7Us3ZsOaN6TpqwW0AA',
           :access_token_secret => '96PJgoZIxAKXiJKwu323wyh6UlhezPoLdtQShsbL0'
}
# Connect to Muddy
muddyit =  Muddyit.new(:consumer_key => config[:consumer_key],
                       :consumer_secret => config[:consumer_secret],
                       :access_token => config[:access_token],
                       :access_token_secret => config[:access_token_secret])
collection = muddyit.collection.find(config[:collection_token])
# Parse RSS (URI.open is required on Ruby 3.0+, where Kernel#open no longer fetches URLs)
rss_content = ''
URI.open('http://newsrss.bbc.co.uk/rss/newsonline_uk_edition/uk_politics/rss.xml') do |f|
  rss_content = f.read
end
rss = RSS::Parser.parse(rss_content, false)
# Loop through, analyse and display entities
rss.items.each do |item|
  page = collection.pages.create(item.guid.content, :realtime => true, :store => true)
  puts "#{item.guid.content} contains:"
  page.entities.each do |entity|
    puts "\t#{entity.term}, #{entity.classification}"
  end
end

In order for the script to work, you’ll need to log in and note down the token for the collection your content is stored in (the ‘collection_token’). You can find this by visiting ‘Dashboard’ → ‘View analysed Pages’ → ‘Settings’. You’ll also need to authorise the script via OAuth. To do this, register a client application as described previously, then use the convenience script provided with muddyit_fu to obtain the authentication details required by the newsminer script. A sample session is shown below :

$ ruby ./examples/oauth.rb

> enter consumer key
45048ANdEByjSuF2IogpQ
> enter consumer secret
9uew3saTCM2RlEU0k122RgbkMUZdNKpTLJM1mJiX5jw
> redirecting you to muddy to authorize
> opening http://muddy.it/oauth/authorize?oauth_token=ZXdoJsaphYwdBpLpt9xSZw
> authorize in the browser and then press enter

Access Details

Token : tuiBqD5ct6eZ1RlxNKdQ
Secret : EO9wJB2Xz7sEneoWqcOCnqslkSit4M9muJes4SF4

Add these details to the script’s config (the Token and Secret are the ‘access_token’ and ‘access_token_secret’ values) and then execute it. You’ll see the BBC News pages being indexed and the entities identified in them. If you log in to Muddy you’ll also see the indexed pages :

Muddy - BBC News Stories
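Rather than pasting the keys and secrets directly into the script, one option is to read them from the environment. The snippet below is a sketch of that idea only: the environment variable names and the `load_oauth_config` helper are our own invention for this example, and nothing in Muddy or muddyit_fu requires them.

```ruby
# Hypothetical mapping from the option names used in the script above to
# environment variable names chosen purely for this example.
OAUTH_ENV_NAMES = {
  :consumer_key        => 'MUDDY_CONSUMER_KEY',
  :consumer_secret     => 'MUDDY_CONSUMER_SECRET',
  :access_token        => 'MUDDY_ACCESS_TOKEN',
  :access_token_secret => 'MUDDY_ACCESS_TOKEN_SECRET'
}

# Build the options hash, failing early if any setting is missing or blank.
def load_oauth_config(env = ENV)
  missing = OAUTH_ENV_NAMES.values.select { |name| env[name].to_s.empty? }
  unless missing.empty?
    raise ArgumentError, "missing settings: #{missing.join(', ')}"
  end
  OAUTH_ENV_NAMES.each_with_object({}) do |(key, name), config|
    config[key] = env[name]
  end
end

# Demonstration with a stand-in hash instead of the real environment.
sample_env = {
  'MUDDY_CONSUMER_KEY'        => 'example-key',
  'MUDDY_CONSUMER_SECRET'     => 'example-secret',
  'MUDDY_ACCESS_TOKEN'        => 'example-token',
  'MUDDY_ACCESS_TOKEN_SECRET' => 'example-token-secret'
}
config = load_oauth_config(sample_env)
puts config.keys.inspect
```

The resulting hash has the same four keys the newsminer script passes to Muddyit.new, so it could be handed over in place of the hard-coded values.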

That’s it. As you can see, OAuth is a bit more complicated to use than HTTP Basic Auth, but it’s well worth using if you’re giving third parties access to your data.

Getting Started with Muddy

For those of you that aren’t aware, Muddy is a webservice that allows you to mine your content and use it in ways you hadn’t previously been able to.  It combines elements of entity extraction, natural language processing and linked data to enable you to pick out the notable ‘things’ in your content and provides web-scale identifiers to describe them, allowing you to dig into your content and data and provide new views on existing and newly published content.  This post is going to be a quick introduction to Muddy and how to start using it with your own data.

The basics

Everything in Muddy belongs to a ‘collection’. A collection is a container for analysed content; think of it like a folder for documents. You only get one collection with a free Muddy account, so don’t worry too much about this for now: all your content will end up in that collection. Within a collection there are multiple ‘pages’; a page is a piece of web content (or text) that has been analysed by Muddy. Finally, in every page there are ‘entities’; an entity is a notable ‘thing’ that has been identified as occurring in the content. Every entity is ‘grounded’ using a linked data identifier, by which we mean it’s unambiguous. For example, when talking about ‘Apple’ Computers the identifier http://dbpedia.org/resource/Apple_Inc. is used, and when talking about ‘Apple’ Records the identifier http://dbpedia.org/resource/Apple_Records is used. This allows Muddy to describe the ambiguous term ‘Apple’ in different ways based on the context of the page it appears in.
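To make the vocabulary concrete, here is a rough sketch of how collections, pages and entities nest. The structs, the collection token and the example.com page addresses are invented for illustration and are not the classes used by Muddy or its client libraries; only the two dbpedia identifiers come from the text above.

```ruby
# Hypothetical structs mirroring the terms above (not the muddyit_fu classes).
Entity     = Struct.new(:term, :uri)
Page       = Struct.new(:identifier, :entities)
Collection = Struct.new(:token, :pages)

# The same surface term, grounded to two different identifiers.
computers = Entity.new('Apple', 'http://dbpedia.org/resource/Apple_Inc.')
records   = Entity.new('Apple', 'http://dbpedia.org/resource/Apple_Records')

# Made-up collection token and page addresses.
collection = Collection.new('example-token', [
  Page.new('http://example.com/tech-story',  [computers]),
  Page.new('http://example.com/music-story', [records])
])

# Two entities are the same 'thing' only when their identifiers match.
def same_thing?(a, b)
  a.uri == b.uri
end

collection.pages.each do |page|
  page.entities.each do |entity|
    puts "#{page.identifier}: #{entity.term} -> #{entity.uri}"
  end
end
puts same_thing?(computers, records)
```

Comparing identifiers rather than terms is what keeps the two meanings of ‘Apple’ apart.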

Linked data

Every entity identified in Muddy is notable in some way. Muddy uses Wikipedia as its proxy for notability: if there’s a page on Wikipedia for it, then Muddy should know about it. In many cases Muddy also knows what kind of thing it is, be it a Person, a Place or a Company (or many others). Muddy uses a common identifier for each entity it identifies (as defined by the dbpedia project), meaning you can relate your data to other web content that uses the same identifiers or possibly start marking up your content in new ways (have you seen commontag ?).

How does it work ?

Muddy uses the dbpedia project as its list of notable things, its ‘controlled vocabulary’. Muddy analyses the submitted content, finds the notable things that are mentioned and determines whether they are the ones in the controlled vocabulary. Many terms in the English language are ambiguous. Fortunately dbpedia ‘knows’ if something is ambiguous, and Muddy picks the correct disambiguation based on the textual content of the page being analysed. Muddy provides a confidence score based on a number of factors, including the ambiguity of the identified term and its contextual relevance to the content it appears in. This confidence score can be used to filter the quality of the results returned.

Muddy uses intelligent extraction algorithms to identify and analyse only the core text of a submitted webpage. It can determine where the key content on a page is and analyse only that, meaning that irrelevant content such as sidebar and footer elements isn’t included.

Let’s see an example of Muddy in use. Here we have the results page for a news story from the Guardian. It shows the entities extracted from the article :

Muddy - Entities View
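The confidence score mentioned above lends itself to simple filtering. The snippet below is a minimal sketch of that idea, assuming each entity exposes a numeric confidence. The struct, the 0.8 threshold and the scores are illustrative values of our own, not real Muddy output.

```ruby
# Hypothetical stand-in for the entity objects returned by a client library.
Entity = Struct.new(:term, :uri, :confidence)

# Keep only entities at or above the threshold, strongest first.
def confident_entities(entities, threshold = 0.8)
  entities.select { |entity| entity.confidence >= threshold }
          .sort_by { |entity| -entity.confidence }
end

# Illustrative values only; these are not real Muddy results.
entities = [
  Entity.new('Tony Blair', 'http://dbpedia.org/resource/Tony_Blair', 1.0),
  Entity.new('Town', 'http://dbpedia.org/resource/Town', 0.35),
  Entity.new('Gordon Brown', 'http://dbpedia.org/resource/Gordon_Brown', 0.9)
]

confident_entities(entities).each do |entity|
  puts "#{entity.term}\t#{entity.confidence}"
end
```

A lower threshold returns more terms at the cost of more doubtful matches, so the right value depends on what the results are used for.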

You can also start to ‘dig into the data’, for example, seeing which other articles feature a particular entity :

Muddy - Stories from entities

By finding content that shares similar entities, it’s possible to define new paths through the indexed content, whether that’s by aggregating pages around entities or finding related pages by looking at pages that share common entities.  Muddy makes this easier by providing views and APIs for both of these.
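As a rough illustration of the idea (not of how Muddy implements it), related pages can be found by indexing pages under the entity identifiers they contain and counting the overlap. The page names and data below are invented for this example, as are the two helper methods.

```ruby
require 'set'

# Hypothetical data: page identifier => entity URIs found on that page.
PAGES = {
  'story-1' => ['http://dbpedia.org/resource/Tony_Blair',
                'http://dbpedia.org/resource/Gordon_Brown'],
  'story-2' => ['http://dbpedia.org/resource/Gordon_Brown',
                'http://dbpedia.org/resource/London'],
  'story-3' => ['http://dbpedia.org/resource/Leeds']
}

# Build an inverted index: entity URI => set of pages that mention it.
def pages_by_entity(pages)
  index = Hash.new { |hash, uri| hash[uri] = Set.new }
  pages.each do |page_id, uris|
    uris.each { |uri| index[uri] << page_id }
  end
  index
end

# Rank the other pages by how many entities they share with page_id.
def related_pages(pages, page_id)
  index = pages_by_entity(pages)
  counts = Hash.new(0)
  pages.fetch(page_id).uniq.each do |uri|
    index[uri].each { |other| counts[other] += 1 unless other == page_id }
  end
  counts.sort_by { |other, shared| [-shared, other] }
end

p related_pages(PAGES, 'story-1')
```

The same index also answers the aggregation question directly: looking up one identifier gives every page that mentions that entity.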

Building with Muddy

Muddy is a webservice; it’s designed to be built on. What kinds of things could you build ? How about defining new ways into content for the BBC :

Channelography

Or examining where the news happens in the UK ?

Newsography

Both of these applications were built using Muddy. How do you go about building your own ? Muddy exposes its functionality as RESTful APIs with multiple response formats. You can see sample XML responses by adding .xml to the end of most URLs presented by Muddy. For more in-depth details (including API options), please refer to the Muddy developer guide.

A sample application : Newsminer

Now we’ve covered the basics of Muddy, let’s try to build a simple application. In this case we’ll create an RSS indexer for indexing the latest news stories from the BBC. To do this, we’ll use the muddyit_fu Ruby client library.

#!/usr/bin/ruby
require 'rubygems'
require 'muddyit_fu'
require 'rss'
require 'open-uri'

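# Token of the collection to store pages in (placeholder value; replace with your own)
config = { :collection_token => 'mycollectiontoken' }
# Connect to Muddy using HTTP Basic Auth
muddyit =  Muddyit.new(:username => 'myusername', :password => 'mypassword')
collection = muddyit.collection.find(config[:collection_token])
# Parse RSS (URI.open is required on Ruby 3.0+, where Kernel#open no longer fetches URLs)
rss_content = ''
URI.open('http://newsrss.bbc.co.uk/rss/newsonline_uk_edition/uk_politics/rss.xml') do |f|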
  rss_content = f.read
end
rss = RSS::Parser.parse(rss_content, false)
# Loop through, analyse and display entities
rss.items.each do |item|
  page = collection.pages.create(item.guid.content, :realtime => true, :store => true)
  puts "#{item.guid.content} contains:"
  page.entities.each do |entity|
    puts "\t#{entity.term}, #{entity.classification}"
  end
end

Muddy provides two ways to authenticate your requests: OAuth and HTTP Basic Auth. We strongly recommend using OAuth as it represents less of a security risk. For brevity we’ve used HTTP Basic Auth in this example; you can find the same example with the OAuth setup details in the ‘Building with Muddy and OAuth’ article.

Execute the script and you’ll see the BBC News pages being indexed and the entities identified in them. If you log in to Muddy you’ll also see the indexed pages :

Muddy - BBC News Stories

Hopefully, this has given you a useful introduction to Muddy, how it works and how you could go about using it in your own applications. For further details on the various elements of the API please see the developer guide.

Genesis

Not the band or that famous book, you’ll need to go elsewhere for that.  This is about the backstory to Muddy, which we thought would be nice to share, because although Muddy is essentially ‘middleware’, we want to think of it as a ‘consumable’, as an application with a life, that real people use and engage with (albeit not that many!).

Way back in March 2007, Rob and I submitted an idea to the BBC Labs, then run by Matt Locke, to improve the (horizontal) navigation across the BBC by grounding news articles in ‘subjects’ people could peruse. It wasn’t an earth-shattering idea, but it came from our frustration with a BBC News experience that was still about ‘pages’ and very ‘flat’ (it’s improved since then). So, together with Paul Farnell (a designer friend and CEO of Litmus), we spent five days in North Yorkshire taking the idea to pieces and rebuilding it.

What came of this process was a) a commission from BBC News and b) a greater appreciation of Wikipedia (and dbpedia which extracts structured information from Wikipedia ‘infoboxes’ and creates usable subject-predicate-object relationships from that data) for joining up content by acting as a ‘controlled vocabulary’, a glossary for an ever expanding range of concepts and things.

So, we produced a prototype ‘application’ for BBC News called Muddy Boots (we called it Muddy Boots because we felt we were trampling across the rather pristine lawn that is the BBC).  Muddy Boots took BBC News articles and identified ‘notable things’ (i.e. things in Wikipedia) in the articles and then via an algorithm and a social bookmarking service we attempted to provide relevant links on the web for that news story.  It kinda worked.  Jonathan Austin did a write up of it on the BBC News Journalism Labs blog which gives a fair bit of detail.  Whilst we were waiting for the testing phase run by BBC News to happen, we continued to develop Muddy Boots as we were interested in where it could go.  As we developed it we dropped the ‘Boots’ bit of the name.
