Wednesday, September 23, 2015

Thesis Published: DATA MINING IN LARGE AUDIO COLLECTIONS OF DOLPHIN SIGNALS



The study of dolphin cognition involves intensive research of animal vocalisations recorded in the field.
In this dissertation I address the automated analysis of audible dolphin communication.
I propose a system called the signal imager that automatically discovers patterns in dolphin signals.
These patterns are invariant to frequency shifts and time warping transformations.
The discovery algorithm is based on feature learning and unsupervised time series segmentation
using hidden Markov models. Researchers can inspect the patterns visually and interactively
run comparative statistics on the distributions of dolphin signals in different behavioral contexts.
The required statistics for the comparison describe dolphin communication as a combination of
the following models: a bag-of-words model, an n-gram model and an algorithm to learn
a set of regular expressions. Furthermore, the system can use the patterns to automatically
tag dolphin signals with behavior annotations. My results indicate that the signal imager provides
meaningful patterns to the marine biologist and that the comparative statistics are aligned with
the biologists’ domain knowledge.
[PDF]

Sunday, August 30, 2015

Random Intersection Trees

Found an interesting paper with a new idea on how to build interpretable decision rules. The paper
describes an algorithm to find a set of binary features that are together indicative of a class. For example, a set of binary features might be:

  • Humidity = high: true / false
  • Temperature > 10 C: true / false
  • Sunny = true / false
  • Rain = true / false
A feature is said to be active whenever it evaluates to true. For two classes C1 and C2, the goal is to find a subset of active features S such that P(S|C1) > O2 and P(S|C2) < O1, for two thresholds 0 < O1 < O2 < 1.

For example, let the classes be good weather and bad weather. Furthermore, I specified the thresholds
as O1 = 0.25 and O2 = 0.5. A good result might then be

  • P(S = {temperature > 10, sunny} | good weather) > 0.5
  • P(S = {temperature > 10, sunny} | bad weather) < 0.25
Such a result would indeed be interpretable: we could agree that a higher temperature together with sunshine makes good weather more likely.

The algorithm works in two steps: 
  1. Candidate generation for subsets S.
  2. Check subsets for constraint: P(S|C1) > O2 and P(S|C2) < O1 
The paper's main contribution is an efficient candidate generation algorithm.
The main idea is to organise the subset search into a random tree of fixed depth and breadth.
At every node in the tree we pick a random example from the dataset with a positive label and then
compute the intersection of that example with the active feature set of the parent node.



By intersecting with positive examples only, the feature set gets more and more refined at each level. All sets remaining at the specified depth are candidates. Afterwards, we only have to check if the subsets satisfy the constraints.
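To make the two steps concrete, here is a minimal sketch of the procedure in Java. The toy weather data, the thresholds and the tree parameters (depth 3, branching factor 2) are assumptions for illustration, not the paper's exact formulation:

import java.util.*;

public class RandomIntersectionTrees {

    static final Random RNG = new Random();

    // Step 1: grow a random tree of fixed depth and breadth. Every node
    // intersects its parent's feature set with the active features of a
    // random positive example; the sets at the leaves are the candidates.
    static void candidates(Set<String> node, List<Set<String>> positives,
                           int depth, int branch, List<Set<String>> leaves) {
        if (depth == 0) {
            if (!node.isEmpty()) {
                leaves.add(node);
            }
            return;
        }
        for (int i = 0; i < branch; i++) {
            Set<String> child = new HashSet<>(node);
            child.retainAll(positives.get(RNG.nextInt(positives.size())));
            candidates(child, positives, depth - 1, branch, leaves);
        }
    }

    // Fraction of examples whose active features contain all of s,
    // an estimate of P(S | class).
    static double support(Set<String> s, List<Set<String>> examples) {
        int hits = 0;
        for (Set<String> example : examples) {
            if (example.containsAll(s)) {
                hits++;
            }
        }
        return (double) hits / examples.size();
    }

    public static void main(String[] args) {
        // Each example is the set of its active features.
        List<Set<String>> goodWeather = List.of(
                Set.of("temperature > 10", "sunny"),
                Set.of("temperature > 10", "sunny", "humidity = high"),
                Set.of("temperature > 10", "sunny"));
        List<Set<String>> badWeather = List.of(
                Set.of("rain", "humidity = high"),
                Set.of("rain", "temperature > 10"));

        // The root starts as a random positive example.
        List<Set<String>> leaves = new ArrayList<>();
        Set<String> root = new HashSet<>(goodWeather.get(RNG.nextInt(goodWeather.size())));
        candidates(root, goodWeather, 3, 2, leaves);

        // Step 2: keep candidates with P(S|C1) > O2 and P(S|C2) < O1.
        double o1 = 0.25, o2 = 0.5;
        for (Set<String> s : leaves) {
            if (support(s, goodWeather) > o2 && support(s, badWeather) < o1) {
                System.out.println(s);
            }
        }
    }
}

The paper grows many such random trees and aggregates the surviving candidates across trees; the sketch above builds just one.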

A very interesting method for finding rules in a dataset, and a nice alternative to decision tree learning and Apriori rule discovery.

Monday, July 13, 2015

Social Network Sampling

I will start working for a social network in August: my new job is data scientist at the business network Xing. I decided to play with their open API a little. The Xing API gives access to users in the network as well as their contact lists, so an external app can analyze the social graph. However, there are restrictions on the number of queries, so the graph can only be analyzed partially. My goal is to extract an unbiased sample from the complete graph by performing a random walk from a start node. At every node, the algorithm expands all the node's successors and picks a random one; the chosen successor becomes the next node. With a probability of 0.15 the algorithm returns to the start node, which keeps the exploration centered around the start node and trades off the breadth of the exploration against its depth. I use the Ruby version of the API.
A code snippet for a random walk step in the Xing API
is shown below:

##
# Random walk step: 
#   1) With probability 0.15 go back to start
#   2) Randomly choose a node from the successors
def random_walk_step(user_id, start)
  reset = false
  if Random.rand < 0.15
    user_id = start    
    reset = true
  end
  num_contacts = 5 # control branching
  contacts     = XingApi::Contact.list(user_id, limit: num_contacts)[:contacts][:users]
  contacts.map! { |x| x[:id] }
  # the user might have fewer than num_contacts contacts
  next_user    = Random.rand(contacts.length)
  if reset
    return random_walk_step(contacts[next_user], start)
  end
  return [contacts[next_user], contacts, user_id]
end
The random step function takes a user id as the current node and the node where the random walk started. First it decides whether to restart at the start node, then it samples a neighbor. Repeating the process and updating the user id with the sampled neighbor traverses the graph. The function below explores the graph for 45 steps and writes the result to a dot file:
##
# Print graph as graphviz file
def dot(path)
  expanded = []
  File.open(path, 'w') do |file|
    file.write "digraph my_neighbors {\n"
    start_id = XingApi::User.me[:users].first[:id]
    ids = random_walk_step(start_id, start_id)  
    45.times do
      id = ids[2]
      username = XingApi::User.find(id)[:users].first[:display_name]
      username.gsub! /\W/, ""      
      if not expanded.include? username
        expanded = expanded + [username]
        ids[1].each do |neighbor| 
          neighborname = XingApi::User.find(neighbor)[:users].first[:display_name]
          neighborname.gsub! /\W/, ""       
          file.write "#{username} -> #{neighborname};\n"        
        end
      end
      ids = random_walk_step(ids[0], start_id)      
    end
    file.write "}"
  end
end
A dot file is a graph description that can be converted into graph images. A sample graph of my network is shown below.
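(The conversion can be done with Graphviz, for example: dot -Tpng graph.dot -o graph.png, where graph.dot stands for whatever path was passed to the Ruby function above.)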


Wednesday, May 13, 2015

Integrating Weka's UI into your own code

Weka is a machine learning toolkit written in Java. Although it can run as a standalone application with its own user interface, you can also use Weka from your own Java code. That means all algorithms you can find in the user interface are also trainable and usable in your Java application. The cool thing is, you can also integrate parts of Weka's user interface into your own. So my goal was to integrate the option to choose a Weka classifier and configure it using Weka's native components.


Choosing and configuring a Weka classifier, here a Support Vector Machine.


After browsing a little through the Weka UI code I found two classes that can be used to open the editor to choose a classifier and to configure it. Once the window closes, the classifier with its configuration is ready to be used by other functions. The first UI component is the GenericObjectEditor, which is responsible for choosing a classifier. The second is the PropertyDialog, which configures the current classifier. Below is some code that displays the dialog and prints the resulting classifier. I found the example at the bottom of Weka's GenericObjectEditor class. For the code to work you have to make sure weka.jar is on your CLASSPATH.



package wekaui;

import java.awt.Frame;
import java.awt.event.WindowAdapter;
import java.awt.event.WindowEvent;
import java.beans.PropertyEditor;
import weka.classifiers.Classifier;
import weka.classifiers.rules.ZeroR;
import weka.gui.GenericObjectEditor;
import weka.gui.PropertyDialog;

/**
 * Choose a classifier and configure it in your own java code.
 *
 * @author Daniel Kohlsdorf
 */
public class WekaUI {

    public static void main(String[] args) {
        Object initial = new ZeroR();
        GenericObjectEditor.registerEditors();
        GenericObjectEditor editor = new GenericObjectEditor(true);
        editor.setClassType(Classifier.class);
        editor.setValue(initial);

        // Setup the property view
        PropertyDialog propertyDialog = new PropertyDialog((Frame) null, editor, 100, 100);

        // Get the classifier when the window is closed
        propertyDialog.addWindowListener(new WindowAdapter() {
            @Override
            public void windowDeactivated(WindowEvent e) {
                PropertyEditor propertyEditor = ((PropertyDialog) e.getSource()).getEditor();
                if (propertyEditor.getValue() != null) {
                    Classifier classifier = (Classifier) propertyEditor.getValue();
                    System.out.println("CLASSIFIER: " + classifier);
                }
            }
        });
        propertyDialog.setVisible(true);
    }

}
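Once the dialog has been closed and the classifier printed, it behaves like any other Weka classifier. Below is a minimal sketch of a method that could be added to the class above to train and evaluate the chosen classifier; the dataset file weather.arff and the convention that the class is the last attribute are assumptions for illustration:

import weka.classifiers.Evaluation;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Train the chosen classifier on a dataset and estimate
// its accuracy with 10-fold cross validation.
static void trainAndEvaluate(Classifier classifier) throws Exception {
    Instances data = DataSource.read("weather.arff"); // placeholder dataset
    data.setClassIndex(data.numAttributes() - 1);     // class is the last attribute
    classifier.buildClassifier(data);                 // final model trained on all data

    // cross validation internally trains on fresh copies of the classifier
    Evaluation evaluation = new Evaluation(data);
    evaluation.crossValidateModel(classifier, data, 10, new java.util.Random(1));
    System.out.println(evaluation.toSummaryString());
}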