First passage times dynamics in Markov Models with applications to HMM : induction, sequence classification and graph mining
Contributor(s)UCL. - FSA/INGI - Département d'ingénierie informatique
Full recordShow full item record
AbstractSequential data are encountered in many contexts of everyday life and in numerous scientific applications. They can for instance be SMS typeset on mobile phones, web pages reached while crossing hyperlinks, system logs or DNA samples, to name a few. Generating such data defines a sequential process. This thesis is concerned with the modeling of sequential processes from observed data. Sequential processes are here modeled using probabilistic models, namely discrete time Markov chains (MC), Hidden Markov Models (HMMs) and Partially Observable Markov Models (POMMs). Such models can answer questions such as (i) Which event will occur a couple of steps later? (ii) How many times will a particular event occur? and (iii) When does an event occur for the first time given the current situation? The last question is of particular interest in this thesis and is mathematically formalized under the notion of First Passage Times (FPT) dynamics of a process. The FPT dynamics is used here to solve the three following problems related to machine learning and data mining: (i) HMM/POMM induction, (ii) supervised sequence classification and (iii) relevant subgraph mining. Firstly, we propose a novel algorithm, called POMMStruct, for learning the structure and the parameters of POMMs to best fit the empirical FPT dynamics observed in the samples. Experimental results illustrate the benefit of POMMStruct in the modeling of sequential processes with a complex temporal dynamics while compared to classical induction approaches. Our second contribution is concerned with the classification of sequences. We propose to model the FPT in sequences with discrete phase-type (PH) distributions using a novel algorithm called PHit. These distributions are used to devise a new string kernel and a probabilistic classifier. Experimental results on biological data shows that our methods provides state-of-the-art classification results. Finally, we address the problem of mining subgraphs, which are relevant to model the relationships between selected nodes of interest, in large graphs such as biological networks. A new relevance criterion based on particular random walks called K-walks is proposed as well as efficient algorithms to compute this criterion. Experiments on the KEGG metabolic network and on randomly generated graphs are presented.
(FSA 3)--UCL, 2007