For those of you who aren’t aware, Muddy is a web service that allows you to mine your content and use it in ways you hadn’t previously been able to. It combines elements of entity extraction, natural language processing and linked data to pick out the notable ‘things’ in your content and provide web-scale identifiers to describe them, allowing you to dig into your data and offer new views on existing and newly published content. This post is a quick introduction to Muddy and how to start using it with your own data.
The basics
Everything in Muddy belongs to a ‘collection’. A collection is a container for analysed content; think of it as a folder for documents. You only get one collection with a free Muddy account, so don’t worry too much about this for now: all your content will end up in that collection. Within a collection there are multiple ‘pages’. A page is a piece of web content (or text) that has been analysed by Muddy. Finally, every page contains ‘entities’. An entity is a notable ‘thing’ that has been identified as occurring in the content. Every entity is ‘grounded’ using a linked data identifier, which means it is unambiguous. For example, when a page talks about ‘Apple’ the computer company, the identifier http://dbpedia.org/resource/Apple_Inc. is used; when it talks about ‘Apple’ Records, the identifier http://dbpedia.org/resource/Apple_Records is used. This allows Muddy to describe the ambiguous term ‘Apple’ in different ways based on the context of the page it appears in.
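As a rough mental model, the hierarchy described above can be sketched with plain Ruby structs. This is not the muddyit_fu client or Muddy’s actual schema: the struct and field names (Collection, Page, Entity, term, uri) and the example.com URLs are assumptions chosen for illustration. Only the two DBpedia identifiers come from the text.

```ruby
# Illustrative model only: these structs are hypothetical and are not
# part of muddyit_fu or the Muddy API.
Entity     = Struct.new(:term, :uri)
Page       = Struct.new(:url, :entities)
Collection = Struct.new(:name, :pages)

# The same surface term, grounded to two different identifiers
# depending on the page it appears in.
apple_inc     = Entity.new('Apple', 'http://dbpedia.org/resource/Apple_Inc.')
apple_records = Entity.new('Apple', 'http://dbpedia.org/resource/Apple_Records')

collection = Collection.new('my collection', [
  Page.new('http://example.com/iphone-review',   [apple_inc]),
  Page.new('http://example.com/beatles-history', [apple_records])
])

# Walk the hierarchy: collection -> pages -> entities.
collection.pages.each do |page|
  page.entities.each do |entity|
    puts "#{page.url}: '#{entity.term}' -> #{entity.uri}"
  end
end
```

The point of the sketch is that the term alone (‘Apple’) is not enough to tell the two entities apart, whereas the identifier is.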
Linked data
Every entity identified in Muddy is notable in some way. Muddy uses Wikipedia as its proxy for notability: if there’s a page on Wikipedia for something, then Muddy should know about it. In many cases Muddy also knows what kind of thing it is, be it a Person, a Place or a Company (or many others). Muddy uses a common identifier for each entity it identifies (as defined by the DBpedia project), meaning you can relate your data to other web content that uses the same identifiers, or possibly start marking up your content in new ways (have you seen commontag?).
How does it work?
Muddy uses the DBpedia project as its list of notable things, its ‘controlled vocabulary’. Muddy analyses the submitted content, finds the notable things mentioned in it and determines whether they match entries in the controlled vocabulary. Many terms in the English language are ambiguous. Fortunately, DBpedia ‘knows’ when a term is ambiguous, and Muddy picks the correct disambiguation based on the textual content of the page being analysed. Muddy provides a confidence score for each entity based on a number of factors, including the ambiguity of the identified term and its contextual relevance to the content it appears in. This confidence score can be used to filter the results by quality.
Muddy uses extraction algorithms to identify and analyse only the core text of a submitted webpage: it determines where the key content on a page is and analyses only that, so irrelevant content such as sidebar and footer elements isn’t included.
Let’s see an example of Muddy in use. Here we have the results page for a news story from the Guardian, showing the entities extracted from the article:

You can also start to ‘dig into the data’, for example by seeing which other articles feature a particular entity:

By finding content that shares similar entities, it’s possible to define new paths through the indexed content, whether that’s by aggregating pages around entities or finding related pages by looking at pages that share common entities. Muddy makes this easier by providing views and APIs for both of these.
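The two ideas above, aggregating pages around entities and finding related pages through shared entities, can be sketched in plain Ruby. This sketch doesn’t call Muddy at all: the page URLs, the ScoredEntity struct, the confidence values and the 0.5 threshold are all invented for illustration, and Muddy’s own views and APIs may well work differently.

```ruby
require 'set'

# Hypothetical stand-in for analysed content; not the Muddy API.
ScoredEntity = Struct.new(:uri, :confidence)

PAGES = {
  'http://example.com/a' => [
    ScoredEntity.new('http://dbpedia.org/resource/London', 0.9),
    ScoredEntity.new('http://dbpedia.org/resource/Apple_Inc.', 0.8)
  ],
  'http://example.com/b' => [
    ScoredEntity.new('http://dbpedia.org/resource/London', 0.7),
    ScoredEntity.new('http://dbpedia.org/resource/Apple_Inc.', 0.6),
    ScoredEntity.new('http://dbpedia.org/resource/Apple_Records', 0.2)
  ],
  'http://example.com/c' => [
    ScoredEntity.new('http://dbpedia.org/resource/Apple_Inc.', 0.95),
    ScoredEntity.new('http://dbpedia.org/resource/London', 0.1)
  ]
}

# Aggregate pages around entities: entity URI => set of page URLs,
# ignoring entities below the confidence threshold.
def pages_by_entity(pages, min_confidence)
  index = Hash.new { |hash, key| hash[key] = Set.new }
  pages.each do |url, entities|
    entities.each do |entity|
      index[entity.uri] << url if entity.confidence >= min_confidence
    end
  end
  index
end

# Related pages: other pages ranked by how many entities they share
# with the given page (ties broken by URL).
def related_pages(pages, url, min_confidence)
  shared = Hash.new(0)
  pages_by_entity(pages, min_confidence).each_value do |urls|
    next unless urls.include?(url)
    urls.each { |other| shared[other] += 1 unless other == url }
  end
  shared.sort_by { |other, count| [-count, other] }
end

index   = pages_by_entity(PAGES, 0.5)
related = related_pages(PAGES, 'http://example.com/a', 0.5)
related.each { |other, count| puts "#{other} (shared entities: #{count})" }
```

Raising the threshold trades recall for precision: low-confidence entities drop out of the index, so fewer pages are linked, but the links that remain are more likely to be meaningful.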
Building with Muddy
Muddy is a web service, so it’s designed to be built on. What kinds of things could you build? How about defining new ways into content for the BBC:

Or examining where the news happens in the UK?

Both of these applications were built using Muddy. How do you go about building your own? Muddy exposes its functionality as RESTful APIs with multiple response formats. You can see sample XML responses by adding .xml to the end of most URLs presented by Muddy. For more in-depth details (including API options), please refer to the Muddy developer guide.
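If you want to experiment with the XML responses from a script, a small helper for deriving the .xml variant of a URL might look like the sketch below. The host and path are placeholders rather than real Muddy URLs, the xml_variant name is made up, and placing the suffix before any query string is an assumption on our part, so check the developer guide for the actual URL scheme.

```ruby
require 'uri'

# Hypothetical helper: append ".xml" to the path of a URL.
# Assumes the format suffix belongs on the path, before any query string.
def xml_variant(url)
  uri = URI.parse(url)
  uri.path = "#{uri.path.chomp('/')}.xml"
  uri.to_s
end

# Placeholder URL, not a real Muddy endpoint.
puts xml_variant('http://muddy.example.com/collections/abc123/pages/42')
```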
A sample application: Newsminer
Now we’ve covered the basics of Muddy, let’s try to build a simple application. In this case we’ll create an RSS indexer for the latest news stories from the BBC. To do this, we’ll use the muddyit_fu Ruby client library.
#!/usr/bin/ruby
require 'rubygems'
require 'muddyit_fu'
require 'rss'
require 'open-uri'
# Connect to Muddy using HTTP Basic Auth
muddyit = Muddyit.new(:username => 'myusername', :password => 'mypassword')
# Replace this placeholder with the token of your own collection
collection_token = 'mycollectiontoken'
collection = muddyit.collection.find(collection_token)
# Parse RSS
# URI.open needs Ruby 2.5 or later; on older versions use plain open
rss_content = ''
URI.open('http://newsrss.bbc.co.uk/rss/newsonline_uk_edition/uk_politics/rss.xml') do |f|
  rss_content = f.read
end
rss = RSS::Parser.parse(rss_content, false)
# Loop through, analyse and display entities
rss.items.each do |item|
  page = collection.pages.create(item.guid.content, :realtime => true, :store => true)
  puts "#{item.guid.content} contains:"
  page.entities.each do |entity|
    puts "\t#{entity.term}, #{entity.classification}"
  end
end
Muddy provides two ways to authenticate your requests: OAuth and HTTP Basic Auth. We strongly recommend using OAuth as it represents less of a security risk. For brevity, we’ve used HTTP Basic Auth in this example; you can find the same example with the OAuth setup details in the ‘Building with Muddy and OAuth’ article.
Execute the script and you’ll see the BBC News pages being indexed along with the entities identified in them. If you log in to Muddy, you’ll see the indexed pages:

Hopefully, this has given you a useful introduction to Muddy, how it works and how you could go about using it in your own applications. For further details on the various elements of the API please see the developer guide.