Get Started with the Toolkit

The SupWSD toolkit is a supervised word sense disambiguation system. Its flexible framework allows users to combine different preprocessing modules, select feature extractors, and choose which classifier to use. SupWSD is lightweight and has a small memory footprint; it provides a simple XML file to configure the disambiguation process.

The SupWSD toolkit requires JRE 1.8 or above. The zip file is available from the download page.

Installation

To work with the SupWSD toolkit, unpack the zip file with:

unzip supwsd-toolkit.zip

You must also define the path of the WordNet dictionary in resources/wndictionary/prop.xml, through the value attribute of the dictionary_path parameter.
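
For example, the relevant entry might look like this (a sketch: the path is a placeholder for your local WordNet dictionary folder, and the rest of prop.xml is omitted):

<param name="dictionary_path" value="/usr/local/WordNet-3.0/dict"/>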

Running the Toolkit

To train a model, navigate to the installation folder in your shell and run:

java -jar supwsd-toolkit.jar train config/supwsd.xml corpus.xml senses.keys

To test a file, navigate to the installation folder in your shell and run:

java -jar supwsd-toolkit.jar test config/supwsd.xml tests.xml responses.keys

The toolkit will print precision, recall, and F-measure values at the end of the test run.

For an example of training instances and keys, see Word Sense Disambiguation: A Unified Evaluation Framework and Empirical Comparison.
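
To give an idea of the data format, here is a simplified sketch of an annotated sentence and of the corresponding keys file (the ids and the sense key are illustrative, not taken from a real dataset):

<sentence id="d000.s000">
	<wf lemma="the" pos="DET">The</wf>
	<instance id="d000.s000.t000" lemma="art" pos="NOUN">art</instance>
	<wf lemma="of" pos="ADP">of</wf>
	<wf lemma="painting" pos="NOUN">painting</wf>
</sentence>

Each line of the keys file maps an instance id to one or more WordNet sense keys:

d000.s000.t000 art%1:04:00::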

Working with Eclipse

Create your Eclipse project (File > New > Maven or Gradle project, give the project a name and click Finish). This creates a new folder with the project name under the Eclipse workspace folder.

Copy the config and resources folders from the supwsd-toolkit folder into your workspace/projectFolder.

Now, include the supwsd-toolkit.jar file in the project build classpath:

  1. Select the project in the Package Explorer view (Window > Show View > Package Explorer).
  2. From the menu bar, click Project and then Properties. Select Java Build Path from the contents column on the left and open the Libraries tab.
  3. Click the Add External JARs button and select the supwsd-toolkit.jar file.

Next, add the required libraries to the project.

If you are using Maven, add the following to the pom.xml file:

<project>
	<dependencies>
		<dependency>
			<groupId>edu.stanford.nlp</groupId>
			<artifactId>stanford-corenlp</artifactId>
			<version>3.9.2</version>
		</dependency>
		<dependency>
			<groupId>edu.stanford.nlp</groupId>
			<artifactId>stanford-corenlp</artifactId>
			<version>3.9.2</version>
			<classifier>models</classifier>
		</dependency>
		<dependency>
			<groupId>edu.mit</groupId>
			<artifactId>jwi</artifactId>
			<version>2.2.3</version>
		</dependency>
		<dependency>
			<groupId>net.sf.extjwnl</groupId>
			<artifactId>extjwnl</artifactId>
			<version>2.0.2</version>
		</dependency>
		<dependency>
			<groupId>net.sf.extjwnl</groupId>
			<artifactId>extjwnl-data-wn31</artifactId>
			<version>1.2</version>
		</dependency>
		<dependency>
			<groupId>org.apache.commons</groupId>
			<artifactId>commons-collections4</artifactId>
			<version>4.3</version>
		</dependency>
		<dependency>
			<groupId>org.apache.opennlp</groupId>
			<artifactId>opennlp-tools</artifactId>
			<version>1.9.1</version>
		</dependency>
		<dependency>
			<groupId>org.ehcache</groupId>
			<artifactId>ehcache</artifactId>
			<version>3.7.1</version>
		</dependency>
		<dependency>
			<groupId>com.google.code.externalsortinginjava</groupId>
			<artifactId>externalsortinginjava</artifactId>
			<version>0.2.5</version>
		</dependency>
		<dependency>
			<groupId>org.annolab.tt4j</groupId>
			<artifactId>org.annolab.tt4j</artifactId>
			<version>1.2.1</version>
		</dependency>
		<dependency>
			<groupId>tw.edu.ntu.csie</groupId>
			<artifactId>libsvm</artifactId>
			<version>3.23</version>
		</dependency>
		<dependency>
			<groupId>de.bwaldvogel</groupId>
			<artifactId>liblinear</artifactId>
			<version>2.30</version>
		</dependency>
	</dependencies>
</project>

If you are using Gradle, add the following to the build.gradle file:

repositories {
	mavenCentral()
}
dependencies {
	compile group: 'edu.stanford.nlp', name: 'stanford-corenlp', version: '3.9.2'
	compile group: 'edu.stanford.nlp', name: 'stanford-corenlp', version: '3.9.2', classifier: 'models'
	compile group: 'edu.mit', name: 'jwi', version: '2.2.3'
	compile group: 'net.sf.extjwnl', name: 'extjwnl', version: '2.0.2'
	compile group: 'net.sf.extjwnl', name: 'extjwnl-data-wn31', version: '1.2'
	compile group: 'org.apache.commons', name: 'commons-collections4', version: '4.3'
	compile group: 'org.apache.opennlp', name: 'opennlp-tools', version: '1.9.1'
	compile group: 'org.ehcache', name: 'ehcache', version: '3.7.1'
	compile group: 'com.google.code.externalsortinginjava', name: 'externalsortinginjava', version: '0.2.5'
	compile group: 'org.annolab.tt4j', name: 'org.annolab.tt4j', version: '1.2.1'
	compile group: 'tw.edu.ntu.csie', name: 'libsvm', version: '3.23'
	compile group: 'de.bwaldvogel', name: 'liblinear', version: '2.30'
}

The SupWSD class is the entry point of the library and provides two static methods to train and test your datasets:

SupWSD.train("config file path", "corpus file path", "keys file path");
SupWSD.test("config file path", "tests file path", "keys file path");
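
For example, a minimal driver class might look like this (a sketch: the class name is a placeholder, the file paths are the same as in the command-line examples above, and the import of SupWSD depends on its package inside the toolkit jar):

// import the SupWSD class here (its package is defined by the toolkit jar)

public class SupWSDDriver {

	public static void main(String[] args) {

		// train the models from an annotated corpus and its keys file
		SupWSD.train("config/supwsd.xml", "corpus.xml", "senses.keys");

		// disambiguate the test instances and score them against the gold keys
		SupWSD.test("config/supwsd.xml", "tests.xml", "responses.keys");
	}
}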

Configuration

You can customize the toolkit pipeline through the supconfig.xml file inside the config folder:

<supwsd xsi:noNamespaceSchemaLocation="supconfig.xsd">
	<working_directory></working_directory>
	<parser></parser>
	<preprocessor>
		<splitter model=""></splitter>
		<tokenizer model=""></tokenizer>
		<tagger model=""></tagger>
		<lemmatizer model=""></lemmatizer>
		<dependency_parser model=""></dependency_parser>
	</preprocessor>
	<extraction>
		<features>
			<pos_tags cutoff=""></pos_tags>
			<local_collocations cutoff=""></local_collocations>
			<surrounding_words cutoff="" window=""></surrounding_words>
			<word_embeddings strategy="" window="" vectors="" vocab="" cache=""></word_embeddings>
			<syntactic_relations></syntactic_relations>
		</features>
	</extraction>
	<classifier></classifier>
	<writer></writer>
	<sense_inventory dict=""></sense_inventory>
</supwsd>
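
For instance, a filled-in configuration might look like this (a sketch: the attribute values are placeholders, and the admissible values of each tag are documented in the sections below):

<supwsd xsi:noNamespaceSchemaLocation="supconfig.xsd">
	<working_directory>models</working_directory>
	<parser>senseval</parser>
	<preprocessor>
		<splitter model="">stanford</splitter>
		<tokenizer model="">stanford</tokenizer>
		<tagger model="">stanford</tagger>
		<lemmatizer model="">stanford</lemmatizer>
		<dependency_parser model="">none</dependency_parser>
	</preprocessor>
	<extraction>
		<features>
			<pos_tags cutoff="0">true</pos_tags>
			<local_collocations cutoff="0">true</local_collocations>
			<surrounding_words cutoff="0" window="-1">true</surrounding_words>
			<word_embeddings strategy="AVG" window="10" vectors="" vocab="" cache="0.2">false</word_embeddings>
			<syntactic_relations>false</syntactic_relations>
		</features>
	</extraction>
	<classifier>liblinear</classifier>
	<writer>all</writer>
	<sense_inventory dict="/usr/local/WordNet-3.0/dict">wordnet</sense_inventory>
</supwsd>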

Working directory

Specify the directory path in the file system where the trained models are to be saved.

Tag working_directory

Parser

SupWSD provides several parsers, targeted to the various formats of the Senseval/SemEval WSD competitions (both all-words and lexical sample tasks), along with a parser for plain text.

Tag parser
Value lexical | senseval | semeval7 | semeval13 | semeval15 | plain

Preprocessor

This tag can be used to set the components of the preprocessing pipeline.

For each component you can specify the model to apply through the model attribute. The simple component performs string splitting using the value of the model attribute. If you want to bypass a phase, set the value of the component to none.
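
For example, to bypass the dependency parsing phase you would set:

<dependency_parser model="">none</dependency_parser>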

Tag preprocessor
Children splitter, tokenizer, tagger, lemmatizer, dependency_parser

Sentence splitter

Tag splitter
Value stanford | open_nlp | simple | none
Attributes model

Tokenizer

Tag tokenizer
Value stanford | open_nlp | penn_tree_bank | simple | none
Attributes model

Part-Of-Speech tagger

Tag tagger
Value stanford | open_nlp | penn_tree_bank | simple | none
Attributes model

Lemmatizer

Tag lemmatizer
Value stanford | open_nlp | jwnl | tree_tagger | simple | none
Attributes model

Dependency parser

Tag dependency_parser
Value stanford | none
Attributes model

Features

Select which features to use. Set the value of a child to true/false to enable/disable the respective feature.

Tag features
Children pos_tags, surrounding_words, local_collocations, word_embeddings, syntactic_relations

Part-Of-Speech tags

The part-of-speech tag of the target word and the part-of-speech tags surrounding it (with a left and a right window of length 3).

Tag pos_tags
Value true | false
Attributes cutoff (numeric): Filters the features by a minimum frequency threshold; 0 disables the filter.

Surrounding words

The set of word tokens (excluding stopwords from a pre-specified list) appearing in the context of the target word.

Tag surrounding_words
Value true | false
Attributes cutoff (numeric): Filters the features by a minimum frequency threshold; 0 disables the filter.
window (numeric): Number of sentences in the neighborhood of the current sentence used to extract words (-1 to extract all the words).
stopwords (string): Path of a file containing the list of stop words, one word per line.

Local collocations

Ordered sequences of tokens around the target word.

Tag local_collocations
Value true | false
Attributes cutoff (numeric): Filters the features by a minimum frequency threshold; 0 disables the filter.
sequences (string): Path of a file containing the extraction sequences, one sequence per line.

Word embeddings

Pre-trained word embeddings, integrated according to three different strategies.

Tag word_embeddings
Value true | false
Attributes cache (decimal): Vector cache size as a percentage of the number of vectors.
strategy (AVG | FRA | EXP): AVG uses the centroid of the embeddings; FRA weights the vectors by their distance from the target word; EXP decays the vector weights exponentially.
vectors (string): Path of a file containing the word embeddings, one vector per line.
vocab (string): Path of a file containing the vocabulary words, one word per line.
window (numeric): Number of words in the neighborhood of the target word used to extract the embeddings.

Syntactic relations

A set of features based on the dependency tree of the sentence.

Tag syntactic_relations
Value true | false

Classifier

Select the machine learning library used to run a classification algorithm and generate a model for each sense-annotated word type in the input text.

Tag classifier
Value liblinear | libsvm

Writer

Choose the preferred way of printing the test results.

Tag writer
Value all: Export the results to a single file.
single: Generate a file for each test instance.
plain: Create a plain text file, one sentence per line, with senses and probabilities for the disambiguated words.

Sense inventory

By specifying a sense inventory, you can exploit the Most Frequent Sense (MFS) back-off strategy for those target words for which no training data are available. If no sense inventory is specified, the model does not provide an answer and SupWSD outputs "U" (unknown sense).

Tag sense_inventory
Value wordnet | none
Attributes dict (string): Path of the directory where the WordNet dictionary is installed.

Customization

Let's now look at how to implement new modules for SupWSD and integrate them into the framework at various stages of the pipeline.

New input parser

To integrate a new XML parser, it is enough to extend the XMLHandler class and implement the methods startElement, endElement and characters. You can transmit the parsed text to the preprocessing module through the global variable mAnnotationListener.

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import it.si3p.supWSD.modules.parser.xml.XMLHandler;

public class NewXMLHandler extends XMLHandler {

	@Override
	public void startElement(String uri, String localName, String name, Attributes attributes) throws SAXException {

		// resolve the tag corresponding to this element
		NewLexicalTag tag = ...;

		switch (tag) {
			// handle the element and its attributes
		}

		push(tag);
	}

	@Override
	public void endElement(String uri, String localName, String name) throws SAXException {

		// resolve the tag corresponding to this element
		NewLexicalTag tag = ...;

		switch (tag) {
			// handle the closing element
			case READY:
				// a complete block has been parsed: notify the preprocessing module
				mAnnotationListener.notifyAnnotations();
		}

		pop();
	}

	@Override
	public void characters(char ch[], int start, int length) throws SAXException {

		String sentence = new String(ch, start, length);

		switch ((NewLexicalTag) get()) {
			// store the text according to the enclosing tag
		}
	}
}

To integrate a parser for a generic (non-XML) format, instead, it is enough to extend the Parser class and implement the parse method.

New preprocessing module

To add a new module into the pipeline, it is enough to implement the interfaces in the package modules.preprocessing.units.
It is also possible to add a brand new step to the pipeline (e.g. Named Entity Recognition) by extending the class Unit and implementing the methods to load the models asynchronously.

Adding a new feature

A new feature can be implemented with a two-step procedure:

  1. Create a new class that extends the abstract class Feature. The constructor of this class requires a unique key and a name. It is also possible to set a default value for the feature by implementing the method getDefaultValue.

  2. Implement an extractor for the feature via the abstract class FeatureExtractor. In your constructor, invoke the superclass's constructor providing the cut-off value; then declare the class of the feature through the method getFeatureClass.

import java.util.Collection;
import it.si3p.supWSD.data.Annotation;
import it.si3p.supWSD.data.Lexel;
import it.si3p.supWSD.modules.extraction.features.Feature;

public abstract class FeatureExtractor {

	private final int mCutOff;

	public FeatureExtractor(int cutoff) {

		mCutOff = cutoff;
	}

	public final int getCutOff() {

		return mCutOff;
	}
	
	public abstract Class<? extends Feature> getFeatureClass();
	public abstract Collection<Feature> extract(Lexel lexel, Annotation annotation);
}
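
As an illustration, a hypothetical extractor might look like the following sketch. FirstLetterFeature and FirstLetterExtractor are invented names, not part of the toolkit, and the feature class assumes that the constructor of Feature takes a unique key and a name, as described in step 1:

import java.util.Collection;
import java.util.Collections;
import it.si3p.supWSD.data.Annotation;
import it.si3p.supWSD.data.Lexel;
import it.si3p.supWSD.modules.extraction.features.Feature;

// Hypothetical feature: the first letter of the target lemma.
class FirstLetterFeature extends Feature {

	FirstLetterFeature(String key, String name) {
		// forward the unique key and the name to Feature (signature assumed)
		super(key, name);
	}
}

public class FirstLetterExtractor extends FeatureExtractor {

	public FirstLetterExtractor(int cutoff) {
		// forward the cut-off value to the superclass
		super(cutoff);
	}

	@Override
	public Class<? extends Feature> getFeatureClass() {
		// declare the feature class handled by this extractor
		return FirstLetterFeature.class;
	}

	@Override
	public Collection<Feature> extract(Lexel lexel, Annotation annotation) {
		// build the features for this target word (illustrative key and name)
		return Collections.singletonList(new FirstLetterFeature("FLF", "first_letter"));
	}
}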

Adding a new classifier

A new classifier can be implemented by extending the generic abstract class Classifier, which declares the methods to train and test the models.
Feature conversion is carried out with the generic method getFeatureNodes.

import java.util.Collection;
import java.util.SortedSet;
import it.si3p.supWSD.modules.classification.instances.AmbiguityTest;
import it.si3p.supWSD.modules.classification.instances.AmbiguityTrain;
import it.si3p.supWSD.modules.classification.scorer.Result;
import it.si3p.supWSD.modules.extraction.features.Feature;

public abstract class Classifier<T,V>{

	public abstract Object train(AmbiguityTrain ambiguity);
	protected abstract double[] predict(T model, V[] featuresNodes);
	protected abstract V[] getFeatureNodes(SortedSet<Feature> features);
	
	public final Collection<Result> evaluate(AmbiguityTest ambiguity, Object model, String cls){
		// provided by the library: converts the features through getFeatureNodes,
		// invokes predict and collects the scored results
		...
	}
}
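
As an illustration, a new backend might be wired in as in the following sketch. NewClassifier and NewModel are invented names, and the training and prediction bodies are placeholders, not the toolkit's API:

import java.util.SortedSet;
import it.si3p.supWSD.modules.classification.instances.AmbiguityTrain;
import it.si3p.supWSD.modules.extraction.features.Feature;

public class NewClassifier extends Classifier<NewModel, double[]> {

	@Override
	public Object train(AmbiguityTrain ambiguity) {
		NewModel model = new NewModel();
		// fit the model on the training instances of this ambiguous word
		return model;
	}

	@Override
	protected double[] predict(NewModel model, double[][] featureNodes) {
		// return one score per sense class
		return model.score(featureNodes);
	}

	@Override
	protected double[][] getFeatureNodes(SortedSet<Feature> features) {
		// convert the extracted features into the backend's vector format
		return new double[][] { new double[features.size()] };
	}
}

// Placeholder model type used by the sketch.
class NewModel {

	double[] score(double[][] featureNodes) {
		return new double[0];
	}
}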