The Statistics Working Group is organized by Gaëlle Chagny and Antoine Channarond.

## 2016-2017 Program

29 June 2017 (10:15) Lorette Noiret (LMI, INSA de Rouen)

Determinants of RBC Alloantibody Detection Duration: Analysis of Multiply Alloimmunized Patients Supports Peri-Transfusion Factors

In this talk, I will present a statistical analysis of clinical data that helped to elucidate some of the mechanisms associated with the immune response after blood transfusions. Patients exposed to foreign red blood cell (RBC) antigens following blood transfusion, transplantation, or during pregnancy are at risk of developing alloantibodies. These alloantibodies can cause serious adverse transfusion reactions if the patient is re-exposed to the same RBC antigen. Patients with multiple alloantibodies can be seen as a population that is highly susceptible to alloimmunization, and they offer an opportunity to study the factors controlling the persistence of alloantibodies. Here, we studied retrospective medical records for alloimmunized patients at Massachusetts General Hospital and Brigham and Women’s Hospital (1,461 patients; 2,187 antibodies). By comparing the variability in detection duration within patients, we began to investigate the possible mechanisms controlling the persistence of alloimmunization.

8 June 2017 (10:15) Alexander Tartakovsky (AGT StatConsult, Los Angeles, USA)

Sequential Hypothesis Testing in Multiple Data Streams with Unknown Patterns

The problem of sequential hypothesis testing in multiple sensors, populations, streams or multichannel systems arises in numerous applications, including medicine, environmental monitoring, cyber security, genomics, and military defense. Motivated by these and many other applications, we consider the sequential hypothesis testing problem where observations are acquired in multiple data streams and the number and location of “patterns” (or “signals”) of interest are either completely or partially unknown a priori. The goal is to quickly detect either the absence of all patterns or the presence of an unknown subset of patterns while controlling the probabilities of type-I and type-II errors. We address a general multi-stream scenario where the various streams may be coupled and correlated, and the observations in streams are temporally dependent and non-identically distributed. We develop a general multi-stream asymptotic hypothesis testing theory as the probabilities of errors vanish, assuming that the log-likelihood ratio statistics satisfy certain asymptotic stability properties. Specifically, we show that two multi-stream sequential tests, the Generalized Sequential Likelihood Ratio Test and the Mixture Sequential Likelihood Ratio Test, asymptotically minimize the expected sample size or, more generally, the higher moments of the sample size distribution when the suitably normalized log-likelihood ratios between hypotheses satisfy an r-complete version of the Law of Large Numbers. Several challenging examples are considered.
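As background for the sequential likelihood ratio tests named in this abstract, here is a minimal sketch of the classical single-stream building block, Wald's sequential probability ratio test, for i.i.d. Gaussian observations with known variance. It is not the multi-stream Generalized or Mixture test from the talk; the function name `sprt_gaussian`, its parameters, and the use of Wald's approximate thresholds are assumptions made for illustration.

```python
import math


def sprt_gaussian(observations, mu0=0.0, mu1=1.0, sigma=1.0, alpha=0.05, beta=0.05):
    """Wald's SPRT for H0: mean = mu0 against H1: mean = mu1 (known sigma).

    Illustrative only: i.i.d. Gaussian data, a single stream.
    Returns (decision, n), where decision is "H0", "H1" or "undecided"
    and n is the number of observations consumed.
    """
    # Wald's approximate thresholds for the cumulative log-likelihood ratio.
    upper = math.log((1.0 - beta) / alpha)  # reaching it accepts H1
    lower = math.log(beta / (1.0 - alpha))  # reaching it accepts H0
    llr = 0.0
    n = 0
    for n, x in enumerate(observations, start=1):
        # Log-likelihood ratio increment contributed by one observation.
        llr += (mu1 - mu0) * (x - 0.5 * (mu0 + mu1)) / sigma**2
        if llr >= upper:
            return "H1", n
        if llr <= lower:
            return "H0", n
    return "undecided", n


# Observations sitting exactly at mu1 push the statistic up by 0.5 each step.
print(sprt_gaussian([1.0] * 20))  # ('H1', 6)
```

The stopping time is random: the test stops as soon as the accumulated evidence crosses a threshold, which is the source of the sample-size savings that the asymptotic theory quantifies.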

1 June 2017 (10:15) Serge Pergamenchtchikov (LMRS)

Adaptive efficient estimation for nonparametric regression models with small noise intensity

We consider the robust nonparametric estimation problem for regression models in continuous time, observed on a fixed time interval under small-intensity Lévy noise with jumps. An adaptive model selection procedure is proposed. Under general moment conditions on the noise distribution, a sharp non-asymptotic oracle inequality for the robust risks is obtained and robust efficiency is shown. We apply this procedure to the problem of estimating the number of signals in a multipath connection channel. This is joint work with S. Beltaief (LMRS) and O. Chernoyarov (Russian National Research University “Moscow Power Engineering Institute”).

6 April 2017 (10:15) Romain Azaïs (INRIA, Institut Élie Cartan de Lorraine, Nancy)

Two nonparametric strategies for estimating the jump rate of a piecewise-deterministic Markov process

Piecewise-deterministic Markov processes were introduced by Davis in the 1980s as a generalization of marked renewal processes. Between two jumps, the behavior of these processes is governed by a differential equation that may depend on the mark, often called the regime in this context. The jump rate of such a process is a function of the deterministic trajectory followed. I will present two nonparametric strategies for estimating this feature of the model: the first relies on Aalen's multiplicative intensity model, traditionally used in survival analysis; the second, based on recursive kernel methods, yields a family of consistent estimators from which the one with minimal variance is selected.

16 February 2017 (10:15) Sahar Albosaily (LMRS, Univ. Rouen)

The optimal investment and consumption for the financial market generated by the spread of risky assets

The aim of this talk is to find the optimal investment and consumption for financial markets generated by the spread of risky assets. Portfolio optimization problems usually consider financial assets of geometric Brownian motion type. In this talk we instead use a “spread” model of the financial market driven by an Ornstein-Uhlenbeck process. This extends the pure investment problem of Boguslavsky and Boguslavskaya (2004) for the same model. Moreover, we apply the probabilistic representation method for the solution of parabolic partial differential equations, based on the Feynman-Kac mapping. We chose this method because we could not apply the method proposed by Boguslavsky and Boguslavskaya (2004) to find an explicit solution of this equation. The problem with consumption involves additional variables. Finally, the Hamilton-Jacobi-Bellman (HJB) equation for this problem is obtained, and an existence and uniqueness theorem for its classical solutions is shown.

2 February 2017 (10:15) Lionel Truquet (ENSAI, Rennes)

Local stationarity and time-inhomogeneous Markov chains

We revisit a notion of local stationarity for fitting time-inhomogeneous Markov chain models. We consider triangular arrays of time-inhomogeneous Markov chains, defined by families of contracting and slowly varying Markov kernels. Using Dobrushin's contraction coefficients for adapted probability metrics, we show that the distribution of such Markov chains can be approximated locally by the distribution of ergodic Markov chains. Mixing properties of these triangular arrays are also discussed. As a consequence of our results, some classical geometrically ergodic homogeneous Markov chain models have a locally stationary version. In particular, we consider a model of finite-state Markov chains with a time-varying transition matrix, which is estimated nonparametrically.

8 December 2016 (10:15) Youri Kutoyants (LMM, Univ. du Maine, Le Mans)

On Misspecifications in Regularity and Properties of Estimators

The problem of parameter estimation from continuous-time observations of a deterministic signal in white Gaussian noise is considered. The asymptotic properties of the maximum likelihood estimator are described in the small-noise asymptotics (large signal-to-noise ratio). We are interested in the situation where the regularity conditions are misspecified. In particular, we suppose that the statistician uses a discontinuous (change-point type) or cusp-type model of the signal, when the true signal is a continuously differentiable function of the unknown parameter.

24 November 2016 (10:15) Viktor Konev (Tomsk State Univ.)

Non-asymptotic inference for the stochastic regression models with Gaussian noises

It is well known that, for point and interval estimation of unknown parameters in a variety of models of time series analysis and statistics of random processes, sequential methods based on sampling schemes with random stopping times turn out to be more efficient than the methods of fixed-sample theory. Such models include AR(1), AR(1)/ARCH(1), the threshold autoregressive model TAR(1), and others. The talk focuses on problems of statistical inference for stochastic regression models with Gaussian noise in a non-asymptotic setting. Two new properties of square integrable martingales with conditionally Gaussian increments, related to special stopping times, are established. They are used to construct sequential estimates with non-asymptotic normal distributions.

10 November 2016 (10:15) Ghislaine Gayraud (LMAC, UTC Compiègne)

Minimax detection of smooth signals with sparse structure

We consider a sequence model of $M\times N$ matrices whose entries are heterogeneous Gaussian random variables. Our model covers the setting of inverse problems. Our results are separation rates for the problem of detecting a significant submatrix of size $m\times n$, with $m$ and $n$ tending to infinity and, under the sparsity assumption, $m/M$ and $n/N$ tending to 0. We show how to make these testing rates adaptive in $(m, n)$, the size of the significant submatrices. Under an additional assumption on the relative size of the submatrices to be detected, we prove the corresponding lower bounds, which ensures that no testing procedure can distinguish the null hypothesis from the alternative at rates "better" than those achieved by our testing procedure.

20 October 2016 (10:15) Catherine Matias (LPMA, Univ. Pierre et Marie Curie)

Statistical clustering of temporal networks through a dynamic stochastic block model

Statistical node clustering in discrete-time dynamic networks is an emerging field that raises many challenges. Here, we explore statistical properties and frequentist inference in a model that combines a stochastic block model (SBM) for its static part with independent Markov chains for the evolution of the node groups through time. We model binary data as well as weighted dynamic random graphs (with discrete or continuous edge values). Our approach, motivated by the importance of controlling for label switching issues across the different time steps, focuses on detecting groups characterized by a stable within-group connectivity behavior. We study identifiability of the model parameters, and propose an inference procedure based on a variational expectation maximization algorithm as well as a model selection criterion to select the number of groups. We carefully discuss our initialization strategy, which plays an important role in the method, and compare our procedure with existing ones on synthetic datasets. We also illustrate our approach on dynamic contact networks: one of encounters among high school students and two others on animal interactions. An implementation of the method is available as an R package called dynsbm.
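The generative model described in this abstract (a static SBM combined with independent Markov chains on the node groups) can be sketched as a small simulator for the binary, undirected case. This is only an illustration of the model structure, not the inference procedure of the talk or the dynsbm package; the function `simulate_dynamic_sbm` and all of its parameter names are hypothetical.

```python
import random


def simulate_dynamic_sbm(n_nodes, n_steps, init_probs, transition, connect, seed=0):
    """Simulate binary undirected graphs from a dynamic stochastic block model.

    init_probs[q]    : probability that a node starts in group q
    transition[q][l] : probability that a node moves from group q to group l
    connect[q][l]    : edge probability between groups q and l (symmetric)
    Returns (groups, edges): groups[t][i] is the group of node i at time t,
    and edges[t] is the set of pairs (i, j), i < j, present at time t.
    """
    rng = random.Random(seed)
    labels = list(range(len(init_probs)))
    groups, edges = [], []
    for t in range(n_steps):
        if t == 0:
            current = rng.choices(labels, weights=init_probs, k=n_nodes)
        else:
            # Each node follows its own Markov chain, independently of the others.
            current = [rng.choices(labels, weights=transition[g])[0] for g in groups[-1]]
        groups.append(current)
        # Given the groups, edges are drawn independently (the static SBM part).
        present = set()
        for i in range(n_nodes):
            for j in range(i + 1, n_nodes):
                if rng.random() < connect[current[i]][current[j]]:
                    present.add((i, j))
        edges.append(present)
    return groups, edges


# Two groups with "sticky" memberships and denser within-group connectivity.
groups, edges = simulate_dynamic_sbm(
    n_nodes=10, n_steps=3,
    init_probs=[0.5, 0.5],
    transition=[[0.9, 0.1], [0.1, 0.9]],
    connect=[[0.8, 0.1], [0.1, 0.8]],
)
print(len(groups), len(edges))  # 3 3
```

Synthetic data of this kind is the sort of input on which a clustering procedure can be compared with existing ones, since the true group memberships are known at every time step.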

29 September 2016 (10:15) Svetlana Gribkova (LPMA, Univ. Paris Diderot)

Statistical models for the analysis of single-cell RNA-Seq high-throughput sequencing data

While the genome is identical in every cell of a given organism, the specificity of each cell is determined by the expression of its genes, which changes with cell function, state or type. The expression level of each gene corresponds to the quantity of RNA produced by that gene. NGS (Next-Generation Sequencing) technologies such as high-throughput RNA-Seq sequencing make it possible to measure the expression of several thousand genes simultaneously. Until very recently, RNA-Seq could only be performed on a tissue sample made up of thousands of cells. The expression measured for each gene thus corresponded to the overall quantity of RNA produced by that gene across all the cells. This makes it possible to study the expression of a gene at the scale of a tissue and to compare several tissues and cell states, but not to study the variability of expression between the individual cells that make up the tissue. This last question is very important for studies of developing tissues and of cancerous tissues, which have heterogeneous structures.
Single-cell RNA-Seq is a recent and revolutionary technique that quantifies gene expression in individual cells. It is now possible to follow the dynamics of the emergence of new cell populations as an organism develops from a single cell, or to study the cellular heterogeneity of tumors. These new biological questions in turn raise interesting and nontrivial statistical problems. Single-cell RNA-Seq data are counts that exhibit overdispersion and an excessive number of zeros, which may correspond both to true zero values and to missing values. This particular data structure makes the standard techniques of data analysis and dimension reduction, needed to study the structure of heterogeneity between cells, ineffective. In this talk, I will present a new technique for dimension reduction and variability analysis for single-cell RNA-Seq data, based on modeling the data with the zero-inflated negative binomial distribution.
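For reference, one common parametrization of the zero-inflated negative binomial distribution is written out here. The symbols are notation assumed for this note and are not taken from the talk: $\pi$ is the probability of an excess zero, $\mu$ the mean and $r$ the dispersion parameter of the negative binomial component.

$$
P(Y = y) =
\begin{cases}
\pi + (1-\pi)\left(\dfrac{r}{r+\mu}\right)^{r}, & y = 0,\\[2ex]
(1-\pi)\,\dfrac{\Gamma(y+r)}{\Gamma(r)\,y!}\left(\dfrac{r}{r+\mu}\right)^{r}\left(\dfrac{\mu}{r+\mu}\right)^{y}, & y = 1, 2, \dots
\end{cases}
$$

The negative binomial component has variance $\mu + \mu^2/r$, which accounts for overdispersion relative to a Poisson count, while the mixture weight $\pi$ accounts for the excess zeros.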