Using Muddy as a simple entity extractor

Muddy performs several different tasks, and you may find you don’t need all of them initially. The most common thing people want before building on top of Muddy is a simple term/concept/entity extraction API: given a piece of text, return the notable things that occur in it. To support this, we’ve recently added a simpler API method (‘extract’) that doesn’t require a collection and doesn’t store the entity extraction results. The API can be used with or without a Muddy account; if you’re not authenticated, you’ll be limited by IP address.

A sample (unauthenticated) curl session is shown below:

echo '<page>
<text>Gordon Brown and Tony Blair went to town.</text>
<options>
<realtime>true</realtime>
</options>
</page>' | curl -X POST -H 'Content-type: text/xml' -H 'Accept: text/xml' -d @- http://muddy.it/extract

<?xml version="1.0" encoding="UTF-8"?>
<response status="OK">
  <title></title>
  <entities>
    <entity>
      <term>Tony Blair</term>
      <uri>http://dbpedia.org/resource/Tony_Blair</uri>
      <confidence>1.0</confidence>
      <classification>http://muddy.it/ontology/Person</classification>
      <position>17</position>
    </entity>
...
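
If you’re calling the API directly rather than through a client library, the response needs to be parsed. A minimal sketch using REXML (the XML library that ships with Ruby) is shown below. The helper name parse_entities and the FIELDS list are hypothetical, and the sample document is only the single entity visible in the response above, with closing tags added so that it parses; it is not a full response.

```ruby
require 'rexml/document'

# Trimmed sample: just the one entity shown in the response above,
# with the closing tags added by hand (the real response is longer).
SAMPLE_RESPONSE = <<~XML
  <?xml version="1.0" encoding="UTF-8"?>
  <response status="OK">
    <title></title>
    <entities>
      <entity>
        <term>Tony Blair</term>
        <uri>http://dbpedia.org/resource/Tony_Blair</uri>
        <confidence>1.0</confidence>
        <classification>http://muddy.it/ontology/Person</classification>
        <position>17</position>
      </entity>
    </entities>
  </response>
XML

# Child elements seen in the sample response (assumed, not an official schema).
FIELDS = %w[term uri confidence classification position]

# Hypothetical helper: returns one hash of strings per <entity> element.
def parse_entities(xml)
  doc = REXML::Document.new(xml)
  doc.get_elements('/response/entities/entity').map do |entity|
    FIELDS.each_with_object({}) do |name, fields|
      child = entity.elements[name]
      fields[name] = child && child.text # nil when the element is absent
    end
  end
end

parse_entities(SAMPLE_RESPONSE).each do |entity|
  puts "#{entity['term']} (#{entity['classification']})"
end
```

Values are kept as strings, and a missing child element becomes nil, which may matter for entities that come back without a classification.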

Sample code that extracts ‘terms’ from a given piece of source text using the muddyit_fu gem and the new extract method is shown below:

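#!/usr/bin/ruby
require 'rubygems'
require 'muddyit_fu'

# Create a client from the settings in ./config.yml
muddyit = Muddyit.new('./config.yml')

# Skip disambiguation and include unclassified entities to get as many terms as possible
page = muddyit.extract(ARGV[0], :disambiguate => false, :include_unclassified => true)

puts "Contains:"
page.entities.each do |entity|
  puts "\t#{entity.term}"
end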

The script expects a text string as its first argument and prints the extracted terms to STDOUT:

ruby extract.rb "Gordon Brown and Tony Blair went to town"
Contains:
	Tony Blair
	Gordon Brown

Because we want to retrieve as many terms as possible from the source text, we widen the set of candidate entities by including those that have no classification, and we disable disambiguation to improve response times (we’re only interested in the text terms, not in grounded entities). To retrieve disambiguated, grounded entities rather than just text terms, enable the ‘disambiguate’ option again so that any entities identified are disambiguated where appropriate.
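
For readers who prefer raw HTTP to the gem, the curl session above might be reproduced with Ruby’s standard Net::HTTP library, roughly as sketched below. Only the ‘realtime’ option element is confirmed by the curl example; passing ‘disambiguate’ or ‘include_unclassified’ as elements of the same name inside the options block is an assumption that mirrors the gem’s option names, not documented behaviour. The helper names (build_page_xml, build_request, extract) are hypothetical.

```ruby
require 'net/http'
require 'uri'
require 'rexml/document'

# Endpoint taken from the curl session above.
EXTRACT_URL = URI('http://muddy.it/extract')

# Build the <page> request body. Each key in options becomes a child of
# <options>; only 'realtime' is confirmed by the curl example, any other
# option name is an assumption.
def build_page_xml(text, options = {})
  doc = REXML::Document.new
  page = doc.add_element('page')
  page.add_element('text').text = text # REXML escapes &, < and > on output
  opts = page.add_element('options')
  options.each { |name, value| opts.add_element(name.to_s).text = value.to_s }
  doc.to_s
end

# Build (but do not send) the POST request, with the same headers as curl.
def build_request(text, options = {})
  request = Net::HTTP::Post.new(EXTRACT_URL.path)
  request['Content-Type'] = 'text/xml'
  request['Accept'] = 'text/xml'
  request.body = build_page_xml(text, options)
  request
end

# Send the request and return the raw XML response body.
def extract(text, options = {})
  Net::HTTP.start(EXTRACT_URL.host, EXTRACT_URL.port) do |http|
    http.request(build_request(text, options)).body
  end
end

# Example call (performs a network request, so it is left commented out):
# puts extract('Gordon Brown and Tony Blair went to town.', :realtime => true)
```

Keeping the request construction separate from the network call makes it possible to inspect the generated XML before sending anything.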
