Skip to main content

Research Seminars 2015

Research Seminars 2015

Probability Models of Evolution


The talk will be about how to model evolution, and how to perform 'evolutionary computation'. I will start by reviewing an apparently different field - statistical computations using Markov-chain Monte-Carlo - and I will describe the basic techniques used, Gibbs sampling and Metropolis-Hastings. Then, starting from first principles, I will present models of evolutionary computation that are also standard MCMC methods. I will discuss the variety of ways in which evolutionary computations can be modelled in this way, and the advantages of doing so.

Programming language integration and migration


Programming languages are islands, each disconnected from the rest. We choose a language for a task and, for better or worse, stick with it. Communicating between programs written in different languages is such a slow, arduous, task that we avoid doing it whenever possible.


In this talk I will show how language composition can lower, and even remove, the barriers between languages, using compositions of Python and Prolog, and Python and PHP as concrete examples. We have pioneered new approaches to the two major challenges in language composition: editing and running composed programs.
Using our novel editor 'Eco', users can write PHP source files that can contain Python functions wherever PHP functions are valid. In other words, users can mix the two languages even within a single file. The two languages can be intertwined in surprising ways. For example, variable scopes are defined across the languages.
We then run programs upon a composed meta-tracing VM. Our preliminary results suggest that performance of composed PHP + Python programs not only exceeds that of the standard php.net interpreter, but is often competitive with high performing mono-language VMs.


I will conclude by examining the possible uses of a PHP and Python composition. For example, some may use it to gradually migrate a system from PHP to Python (or vice versa), while others may use it to take advantage of the different libraries available in each language.


Joint work with Edd Barrett, Carl Friedrich Bolz, and Lukas Diekmann

A New Characterization of Propositional Proofs


Does every Boolean tautology have a short propositional-calculus proof? Here, a propositional calculus proof is a proof starting from a set of axioms and deriving new Boolean formulas using a set of fixed sound derivation rules. Establishing any strong size lower bound on propositional-calculus proofs (in terms of the size of the formulas proved) is a major open problem in proof complexity, and among a handful of fundamental hardness questions in complexity theory by and large. It has also bearing on efficient SAT-solving and other proof-search algorithms.


In this talk I will show that in fact lower bounds on propositional-calculus proofs follow from certain size lower bounds on a very weak model of computation, namely, non-commutative arithmetic formulas. For this weak model of computation, many strong size lower bounds are already known since the early 90s. The argument is a new characterization of propositional proofs as non-commutative formulas.


(No specific prior knowledge will be assumed. Based on a joint work with Fu Li and Zhengyu Wang.)

History trumps conservation. Lessons learned from the comparative analysis of pseudogenes


Pseudogenes have long been considered degraded fossil copies of genes. Leveraging of the completed annotations of human, worm, and fly genomes, the pseudogene complements of the three distant phyla are compared. The results show that pseudogenes are lineage specific, much more so than protein-coding genes, reflecting the different remodelling processes marking each organism's genome evolution.


The large number of processed pseudogenes in mammals suggest that the mammalian pseudogene complement is governed by a large event, the retrotranspositional burst that occurred at the dawn of the primate lineage. In contrast, worm and fly pseudogene complements tell a story of numerous duplication events, selective sweeps, and large population sizes.


Despite large variations between the species, there are also some notable similarities. Overall, a broad spectrum of biochemical activity for pseudogenes ranging from "highly active" to "dead" can be identified in each organisms. The majority of pseudogenes (~75%) are found between these two extremes, exhibiting various proportions of partial activity. In particular, a consistent amount of transcription (~15%) is identified in each organism. The consistent distribution of these activity levels across all species implies a uniform degradation mechanism of functional elements.


In light of the broad spectrum of biochemical activity exhibited by the majority of pseudogenes as well as recent advances in genome biology, it is timely to revisit the notion of pseudogene to better and more accurately describe it and give this entity its rightful place in the sea of active genomic elements.

Positive definite matrices, Procrustes analysis, and other non-Euclidean approaches to diffusion weighted MRI.


Symmetric positive semi-definite (SPD) matrices have recently seen several new applications, including Diffusion Tensor Imaging (DTI) in MRI, covariance descriptors and structure tensors in computer vision, and kernels in machine learning. Depending on the application, various geometries have been explored for statistical analysis of SPD-valued data. We will focus on DTI, where the Euclidean approach was generally criticised for its ``swelling'' effect on interpolation, and for its violation of positive definiteness in extrapolation and other tasks. The affine invariant and log-Euclidean Riemannian metrics were subsequently proposed to remedy the above deficiencies. However, practitioners have recently argued that these geometric approaches are an overkill for some relevant noise models.


We will examine a couple of related alternative approaches that in a sense reside between the two aforementioned extremes. These alternatives are based on the square root Euclidean and Procrustes size-and-shape metrics. Unlike the Riemannian approach, our approaches, we think, operate more naturally with respect to the boundary of the cone of SPD matrices. In particular, we prove that the Procrustes metric, when used to compute weighted Frechet averages, preserves ranks. We also establish and prove a key relationship between these two metrics, as well as inequalities ranking traces and determinants of the interpolants based on the Riemannian, Euclidean, and our alternative metrics. Remarkably, traces and determinants of our alternative interpolants compare differently. A general proof of the determinant comparison was developed (8th February this year!) by Koenraad Audenaert, whose help was also crucial in proving the key relationship between these two metrics.
Several experimental illustrations will be shown based on synthetic and real human brain DT MRI data.


No special background in statistical analysis on non-Euclidean manifolds is assumed.


This is a joint work with Prof. Ian Dryden (University of Nottingham) and Dr Diwei Zhou (Loughborough University), with a recent participation of Dr Koenraad Audenaert (RHUL).

Multi-label hierarchical prediction methods and their application to the automatic function prediction of proteins


Several computational biology problems, ranging from protein function classification to the prediction of gene – disease associations, are characterized by ontologies that represent concepts, attributes and their relationships in the form a Directed Acyclic Graph (DAG). For instance, in the context of gene/protein function prediction the Gene Ontology (GO) represents classes of biological processes, molecular functions or cellular components and in the context of gene disease prioritization the Human Phenotype Ontology (HPO) covers the phenotypic abnormalities commonly encountered in human genetic diseases.


In this seminar we introduce ontology-aware machine learning methods in the context of the Automated Function Prediction (AFP) of proteins, and discuss novel multi-label ensemble methods for the hierarchical prediction of protein functions. Most AFP methods proposed in literature are “flat”, in the sense that they predict GO functional classes independently of each other, thus resulting in inconsistent predictions with respect to the underlying ontology, and moreover loose a priori information about the relationships between classes. To overcome these problems hierarchy-aware learning methods have been proposed in the literature, ranging from methods based on the kernelization of the output space, to hierarchical ensemble methods, but the former do not scale well when complex ontologies are analyzed, and the latter have been mainly applied to tree-structured taxonomies.


In this talk we present our on-going work based on the design and development of hierarchcial ensemble methods able to deal with DAG-structured taxonomies and to scale nicely with large structured output spaces. In particular we introduce HTD-DAG (Hierarchical Top-Down for DAGs), an ensemble method method that recursively exploits the parent-child relationship of the structured taxonomies and adopts a top-down traversal strategy of the ontology from the most general to most specific classes of the GO to combine the predictions of the base learners associated to each class. Then we present TPR-DAG (True Path Rule for DAGs) an alternative approach that adopts both a bottom-up and a top down learning strategy to improve both the sensitivity and the precision of the overall system. We provide theoretical guarantees that the proposed ensemble methods supply predictions consistent with the hierarchy of the taxonomy. Results with several model organisms show that both HTD-DAG and TPR-DAG significantly outperform flat methods and other hierarchical ensemble methods proposed in literature. Open problems of this exciting research area at the cutting edge between machine learning and computational biology are finally considered, outlining novel perspectives for future research.

Dynamic Performance Profiling of Cloud Caches


Large-scale in-memory object caches such as memcached are widely used to accelerate popular web sites and to reduce burden on backend databases. Yet current cache systems give cache operators limited information on what resources are required to optimally accommodate the present workload. This talk focuses on a key question for cache operators: how much total memory should be allocated to the in-memory cache tier to achieve desired performance?


I will present the results of our recent work where we developed several new techniques for online profiling of the cache hit rate as a function of memory size. Our techniques are lightweight and non-intrusive: they can be easily plugged into an existing LRU cache replacement policy, and only incur a constant processing overhead on each cache request. The resulting profiler enables cache operators to dynamically project the cost and performance impact from adding or removing memory resources within a distributed in-memory cache, allowing "what-if" questions about cache performance to be answered without laborious offline tuning.


We incorporated our techniques into a prototype system, called MIMIR, and studied its performance. Our experiments show that MIMIR produces dynamic hit rate curves with over 98% accuracy and 2 - 5% overhead on request latency and throughput when MIMIR is run in tandem with memcached, suggesting online cache profiling can be a practical tool for improving provisioning of large caches.
Joint work with Trausti Saemundsson (Reykjavik University), Hjortur Bjornsson (University of Iceland), and Ymir Vigfusson (Emory University) Published at the 5th ACM Symposium on Cloud Computing (SOCC'14): http://dl.acm.org/citation.cfm?id=2671007


The talk will be self-contained, and no prior knowledge of caching or cache replacement policies will be assumed.

Notions and notations: Babbage's language of thought


By the aid of the Mechanical Notation, the Analytical Engine became a reality: for it became susceptible of demonstration
Babbage has been called the 'great-uncle' of modern computing, a claim that rests simultaneously on his demonstrable understanding of most of the architectural principles underlying the modern computer, and the almost universal ignorance of Babbage's work before 1970.

There has been an explosion of interest both in Babbage's devices and the impact they might have had in some parallel history, as well as in Babbage himself as a man of great originality who had essentially no influence at all on subsequent technological development.


In all this, one fundamental question has been largely ignored: how is it that one individual working alone could have synthesised a workable computer design over quite a short period, designing an object whose complexity of behaviour so far exceeded that of contemporary machines that it would not be matched for over one hundred years?
We believe that the answer lies in the techniques Babbage developed to reason about complex systems: specifically his system of notations. The Leverhulme Trust have recently made a major project grant to support our effort document, formalise and implement simulators for Babbage's Notation.


We doing this by treating his system of notations as a formal language, adopting techniques from software language engineering. The team will develop abstract simulators which make specification written in the notation executable (and as a byproduct generate graphical visualisations of the system described by a specification) and use these simulations to verify our understanding against Babbages's descriptions and actual artefacts.


In this seminar illustrate and explain Babbage's core achievements in mechanical computing. I shall give an early report on progress with our project, including showing off the 'new' difference engine being designed by Piers Plummer.

Understanding Consistency


Maintaining consistency is a central issue of distributed systems. However, there is a fundamental design tension between highly-synchronised models, which are easy to use, and ones with less synchronisation but weaker properties. This results in a complex and confusing design space.


How do consistency models differ, and which one is right for me? To answer, we describe them in terms of elementary, orthogonal properties. We show both what are the guarantees provided by a given property (i.e., what kind of application invariants that it ensures), and, dually, the opportunities for parallelism and implementation freedoms that it provides.
Finally, we present a new top-down invariant-directed approach. The aim is to design a protocol that has just enough synchronisation to ensure the application’s invariants. The approach starts with a specification of the application’s invariants and of its actions. A proof tool tells us whether concurrent execution of actions maintains the invariants, and if not, where additional synchronisation is necessary.

Leveraging Probabilistic Reasoning in Deterministic Planning for Large-Scale Autonomous Search-and-Tracking


Search-And-Tracking (SaT) is the problem of searching for a mobile target and tracking it once it is found. Since SaT platforms such as UAVs and micro-aerial vehicles face many sources of uncertainty and operational constraints, progress in the field has been restricted to simple and unrealistic scenarios. In this talk, I will present a new hybrid approach to SaT that allows us to successfully address large-scale and complex SaT missions. The probabilistic structure of SaT is compiled into a deterministic planning model and Bayesian inference is directly incorporated in the planning mechanism. Thanks to this tight integration between automated planning and probabilistic reasoning, we are able to exploit the power of both approaches. Planning provides the tools to efficiently explore big search spaces, while Bayesian inference, by readily combining prior knowledge with observable data, allows the planner to make more informed and effective decisions. We offer experimental evidence of the potential of our approach.

The complexity of general-valued CSPs


Many computational problems can be cast as Valued Constraint Satisfaction Problems (VCSPs). An instance of VCSP is given by a finite set of variables, a finite domain of labels, and a sum of functions, each function depending on a subset of the variables. Each function can take finite values specifying costs of assignments of labels to its variables or the infinite value, which indicates an infeasible assignment. The goal is to find an assignment of labels to the variables that minimizes the sum. The case when all functions take only values 0 and infinity corresponds to the standard CSP. We study (assuming that P $B!b(BNP) how the complexity of VCSP depends on the set of functions allowed in the instances, the so-called constraint language. Massive progress has been made in the last three years on this complexity classification question, and our work gives, in a way, the final answer to it, modulo the complexity of CSPs. This is joint work with Vladimir Kolmogorov and Michal Rolinek (both from IST Austria).

Reversibility and Concurrency


Reversible computation is attracting increasing interest in recent years, with potential applications in low-energy devices and hardware, recovery-oriented software, reversible debugging, and the modelling of biochemical reactions. In this talk we show how to model several types of reversibility in the setting of concurrent computation as realised abstractly by event structures. Two different forms of event structures are considered: one where causality is defined by the causality relation and the other where a more general enabling relation is used. We then shown how to extend these event structures with reversibility.

Hierarchical Feature Selection for the Classification of Ageing-Related Genes


Ageing is a highly complex biological process that is still poorly understood. With the growing amount of data about the genetics of ageing available on the web, it is timely to apply data mining methods to that data, in order to try to discover patterns that may assist ageing research. In this work we introduce new hierarchical feature selection methods for the classification task of data mining and apply them to data about ageing-related genes from several model organisms. The main novel aspect of the proposed feature selection methods is that they exploit hierarchical relationships in the set of features (Gene Ontology terms) in order to improve the predictive accuracy of classifiers. The results show that our hierarchical feature selection methods improve the predictive performance of standard Bayesian network classifiers.

NetKAT: A Formal System for the Verification of Networks

NetKAT is a relatively new programming language and logic for reasoning about packet switching networks that fits well with the popular software defined networking (SDN) paradigm. NetKAT was introduced quite recently by Anderson et al. (POPL 2014) and further developed by Foster et al. (POPL 2015). The system provides general-purpose programming constructs such as parallel and sequential composition, conditional tests, and iteration, as well as special-purpose primitives for querying and modifying packet headers and encoding network topologies. The language allows the desired behavior of a network to be specified equationally. It has a formal categorical semantics and a deductive system that is sound and complete over that semantics, as well as an efficient decision procedure for the automatic verification of equationally-defined properties of networks.

Towards Optimal Distributed Algorithms for Large Graphs


Processing and analyzing large graph data efficiently is an important problem with many real world applications. Examples of large graphs include social networks, transportation routes, and biological networks. In many cases, these graphs are too large to be processed efficiently by a single commodity type machine. Instead, the data is spread across several machines that form a distributed system and run a distributed algorithm. In this talk, we will study fundamental graph problems in the distributed message-passing model, where a network of machines jointly perform a computation on a large graph. The graph is assumed to be randomly partitioned among the machines, which is a common implementation in many real world systems. We will discuss algorithms for several problems in this setting such as computing the PageRank and finding all triangles in the graph. We will also discuss some complexity lower bounds that show that these algorithms are optimal up to logarithmic factors.

Explore Royal Holloway

Get help paying for your studies at Royal Holloway through a range of scholarships and bursaries.

There are lots of exciting ways to get involved at Royal Holloway. Discover new interests and enjoy existing ones.

Heading to university is exciting. Finding the right place to live will get you off to a good start.

Whether you need support with your health or practical advice on budgeting or finding part-time work, we can help.

Discover more about our 21 departments and schools.

Find out why Royal Holloway is in the top 25% of UK universities for research rated ‘world-leading’ or ‘internationally excellent’.

Royal Holloway is a research intensive university and our academics collaborate across disciplines to achieve excellence.

Discover world-class research at Royal Holloway.

Discover more about who we are today, and our vision for the future.

Royal Holloway began as two pioneering colleges for the education of women in the 19th century, and their spirit lives on today.

We’ve played a role in thousands of careers, some of them particularly remarkable.

Find about our decision-making processes and the people who lead and manage Royal Holloway today.