Publications

How News Evolves? Modeling News Text and Coverage using Graphs and Hawkes Process

Monitoring news content automatically is an important problem. News content, unlike traditional text, has a temporal component. However, few works have explored the combination of natural language processing and dynamic system models. One reason is that it is challenging to mathematically model the nuances of natural language. In this paper, we discuss how we built a novel dataset of news articles collected over time. Then, we present a method of converting news text collected over time to a sequence of directed multi-graphs, which represent semantic triples (Subject -> Predicate -> Object). We model the dynamics of specific topological changes from these graphs using discrete-time Hawkes processes. With our real-world data, we show that analyzing the structures of the graphs and the discrete-time Hawkes process model can yield insights on how the news events were covered and how to predict how it may be covered in the future.

submitted for publication

H.Zhang, J.Zhang
arXiv

Graph Coding for Model Selection and Anomaly Detection in Gaussian Graphical Models

A classic application of description length is for model selection with the minimum description length (MDL) principle. The focus of this paper is to extend description length for data analysis beyond simple model selection and sequences of scalars. More specifically, we extend the description length for data analysis in Gaussian graphical models. These are powerful tools to model interactions among variables in a sequence of i.i.d Gaussian data in the form of a graph. Our method uses universal graph coding methods to accurately account for model complexity, and therefore provide a more rigorous approach for graph model selection. The developed method is tested with synthetic and electrocardiogram (ECG) data to find the graph model and anomaly in Gaussian graphical models. The experiments show that our method gives better performance compared to commonly used methods.

Proceedings of the International Symposium on Information Theory 2021 (ISIT)

M.Abolfazli, A.Høst-Madsen,J.Zhang, and A.Bratincsak
arXiv

Differential Description Length for Hyperparameter Selection in Supervised Learning

Minimum description length (MDL) is an established method for model selection. For supervised learning problems, cross-validation is often used for model selection in practice. Reasons are 1) MDL is difficult to apply directly to data; 2) MDL may make restrictive statistical assumptions that decrease performance; and 3) MDL does not directly aim to minimize generalization error. In this paper, we introduce a modification to MDL, which we call differential description length (DDL). DDL partitions the data so that the codelength(s) it computes, reflects the conditional probability of seeing ‘new’ data given ‘old’ data. This differential codelength is what allows DDL to estimate generalization error like cross-validation. DDL is also better than cross-validation because it allows the learning algorithm to use the entire data without having to withhold subsets for validation and testing. Compared with MDL, DDL has both better performance (in finding models with smaller generalization error) and is easier to compute. Experiments with linear regression and deep neural networks show that DDL also outperforms cross-validation.

Proceedings of the International Symposium on Information Theory and Its Applications 2020 (ISITA)

M.Abolfazli, A.Høst-Madsen, and J.Zhang
arXiv

User Empathy in Smart Energy Management

This paper investigates interaction among residential electricity users and utility company in a distribution network with the capability of two-way communication provided by smart grid. The energy consumption scheduling of electricity users is formulated as a game-theoretic problem within a group where all players are not totally selfish. Considering empathetic behavior of human decision-making, empathetic player action to other players actions can be influenced by recognizing the well being of others. The proposed model captures the empathy of electricity users in energy scheduling and demonstrates how this behavior can decrease energy consumption and electricity prices. Furthermore, it is shown that selfish users cannot make full use of energy saving by empathetic users.

Proceedings of the IEEE Power & Energy Society General Meeting 2019 (PESGM)

M. Abolfazli, J. Zhang, and A. Kuh

Coding of graphs with application to graph anomaly detection

This paper has dual aims. First is to develop practical universal coding methods for unlabeled graphs. Second is to use these for graph anomaly detection. The paper develops two coding methods for unlabeled graphs: one based on the degree distribution, the second based on the triangle distribution. It is shown that these are efficient for different types of random graphs, and on real-world graphs. These coding methods is then used for detecting anomalous graphs, based on structure alone. It is shown that anomalous graphs can be detected with high probability.

Proceedings of IEEE International Symposium on Information Theory 2018 (ISIT)

A. Høst-Madsen and J. Zhang
arXiv

Greedy algorithm with approximation ratio for sampling noisy graph signals

We study the optimal sampling set selection problem in sampling a noisy $k$-bandlimited graph signal. To minimize the effect of noise when trying to reconstruct a $k$-bandlimited graph signal from $m$ samples, the optimal sampling set selection problem has been shown to be equivalent to finding a $m \times k$ submatrix with the maximum smallest singular value, $\sigma_{\min}$ \cite{chen2015discrete}. As the problem is NP-hard, we present a greedy algorithm inspired by a similar submatrix selection problem known in computer science and to which we add a local search refinement. We show that 1) in experiments, our algorithm finds a submatrix with larger $\sigma_{\min}$ than prior greedy algorithm \cite{chen2015discrete}, and 2) has a proven worst-case approximation ratio of $1/(1+\epsilon)k$, where $\epsilon$ is a constant.

Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing 2018 (ICASSP)

C. Wu, W. Chen, and J. Zhang,

Who is more at risk in heterogeneous networks?

Network-based epidemics models try to characterize the impact of network topology, which represents contagion pathways, on the spread of infection. Although these models explicitly consider the dynamics of individuals in the given network (i.e., the state of the system is $\mathbf{x}(t)= [x_1(t), x_2(t), \ldots, x_N(t)]^T$), analysis has focused on characterizing the vulnerability of the \emph{entire} population rather than the vulnerability of the individuals in the population. We focus on characterizing the vulnerability of the $i$th individual in the network by studying the marginal probability of infection, $P(x_i = 1)$, of the scaled SIS process. Studying the vulnerability of individuals is important because it may be tempting to assume that $P(x_i =1)$ is related to the degree of the $i$th node. Since infection rate is usually assumed to be dependent on the number of infected neighbors, then it seems reasonable that nodes with more connections (i.e., higher degree) would be more at risk. We show that this is \emph{not} always true. Further, with a closed-form approximation of $P(x_i =1)$, as solving for the exact probability requires the summation of $2^N$ terms, we characterize the conditions for when degree distribution is a good indicator of how susceptible an individual is to infection.

Proceedings of 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

J. Zhang and J.M.F. Moura,

Transmissibility of intra-host hepatitis C virus variants

Intra-host hepatitis C virus (HCV) population is genetically heterogeneous and organized in subpopulations. With exception of blood transfusion, transmission of HCV occurs via a small number of genetic variants, the effect of which is frequently described as a bottleneck. Stochasticity of transmission associated with the bottleneck is usually used to explain genetic differences among HCV populations identified in the source and recipient cases, which may be further exacerbated by intra-host HCV evolution and differential biological capacity of HCV variants to successfully establish population in a new host.

BMC Genomics, vol. 18, no. 10, 2017
D.S. Campo, J. Zhang, S. Ramachandran, Y. Khudyakov
paper

Cascading edge failures: a dynamic network process

This paper studies a network process that can potentially be used to model cascading failures in networks. The Dynamic Bond Percolation (DBP) process models, through stochastic local rules, the failure or recovery of an edge $(i,j)$ in a network. The probability that a working link fails or a failed link recovers may be independent of the state of other links \emph{or} may be dependent locally on the state of neighboring links as described by a cascade function $f$. In applications, this means that failures or recovery of links may have a regional preference, or, alternatively, relationships between neighbors in the network can lead to changes in the links between neighbors of neighbors. This paper shows that the dynamic evolution of $P(\mathbf{A},t)$, the probability that the network is in some state $\mathbf{A}$, describing the collective states of all the links, at time $t$ converges show that it converges to a stationary distribution. We use this distribution to study the emergence of global behaviors like consensus (i.e., catastrophic failure or full recovery of all the edges) or mixed (i.e., some failed and some working substructures). In particular, we show that, depending on the local dynamical rule, different network substructures, such as hub or triangle subgraphs, are more prone to failure.

IEEE Transactions on Network Science and Engineering DOI, 2017
J. Zhang and J.M.F. Moura
paper

Contact process with exogenous infection and the scaled SIS process

Propagation of contagion in networks depends on the graph topology. This paper is concerned with studying the time-asymptotic behavior of the extended contact processes on static, undirected, finite-size networks. This is a contact process with nonzero exogenous infection rate (also known as the $\epsilon$-SIS model). The only known analytical characterization of the equilibrium distribution of this process is for complete networks. For large networks with arbitrary topology, it is infeasible to numerically solve for the equilibrium distribution since it requires solving the eigenvalue-eigenvector problem of a matrix that is exponential in $N$, the size of the network. We derive a condition on the infection rates under which, depending on the degree distribution of the network, the equilibrium distribution of extended contact processes on \emph{arbitrary}, finite-size networks is well approximated by a closed-form formulation. We confirm the goodness of the approximation with small networks answering inference questions like the distribution of the percentage of infected individuals and the most-probable equilibrium configuration. We then use the approximation to analyze the equilibrium distribution of the extended contact process on the 4941-node US Western power grid.

Journal of Complex Networks, vol. 5, no. 5, Oct 2017
J. Zhang and J.M.F. Moura
paper   arXiv

Spectral radius and network processes with spontaneous infection/failure rate

We relate the time-limiting behavior of a network epidemics process to the spectral radius of the underlying network. The process we study is the scaled SIS network process, a continuous-time Markov process on a static network. Our analysis differs from previous work in that the scaled SIS process accounts for the possibility that a healthy individual has a nonzero probability of becoming infected even when all of its neighbors are healthy. For example, the source of infection may be outside a human only contact network for diseases with animal to human transmissions such as Ebola. We show that the sufficient condition for infection to become extinct not only depends on the ratio of infection and healing rates but also on $N$, the size of the network, whereas previous models, assuming no exogenous infection, showed dependency only on the infection and healing rates.

Proceedings of 2016 IEEE Global Conference on Signal and Information Processing (GlobalSIP)
J. Zhang and J.M.F. Moura
paper

Finding unique dense communities

Finding densely connected subgraphs, also called communities, in networks are of interest for many applications. In previous work, we showed an optimization method for efficiently finding subgraphs denser than the overall network. This result is derived from our studies of network processes, dynamical processes that model interactions between individual agents in networks (i.e., spread of infection or cascading failures). In this paper, we prove that these subgraphs are also unique in the sense that there are no other subgraphs in the network isomorphic to these subgraphs.

Proceedings of 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
J. Zhang and J.M.F. Moura
paper