{"title": "Discovering Hidden Variables: A Structure-Based Approach", "book": "Advances in Neural Information Processing Systems", "page_first": 479, "page_last": 485, "abstract": null, "full_text": "Discovering Hidden Variables: \nA Structure-Based Approach \n\nGal Elidan Noam Lotner Nir Friedman \n\nHebrew University \n\n{galel,noaml,nir}@cs.huji.ac.il \n\nDaphne Koller \nStanford University \n\nkoller@cs.stanford.edu \n\nAbstract \n\nA serious problem in learning probabilistic models is the presence of hid(cid:173)\nden variables. These variables are not observed, yet interact with several \nof the observed variables. As such, they induce seemingly complex de(cid:173)\npendencies among the latter. In recent years, much attention has been \ndevoted to the development of algorithms for learning parameters, and \nin some cases structure, in the presence of hidden variables. In this pa(cid:173)\nper, we address the related problem of detecting hidden variables that \ninteract with the observed variables. This problem is of interest both for \nimproving our understanding of the domain and as a preliminary step that \nguides the learning procedure towards promising models. A very natural \napproach is to search for \"structural signatures\" of hidden variables -\nsubstructures in the learned network that tend to suggest the presence of \na hidden variable. We make this basic idea concrete, and show how to \nintegrate it with structure-search algorithms. We evaluate this method on \nseveral synthetic and real-life datasets, and show that it performs surpris(cid:173)\ningly well. \n\n1 Introduction \n\nIn the last decade there has been a great deal of research focused on the problem of learning \nBayesian networks (BNs) from data (e.g., [7]). An important issue is the existence of \nhidden variables that are never observed, yet interact with observed variables. Naively, one \nmight think that, if a variable is never observed, we can simply ignore its existence. 
At a certain level, this intuition is correct. We can construct a network over the observable variables which is an I-map for the marginal distribution over these variables, i.e., captures all the dependencies among the observed variables. However, this approach is weak from a variety of perspectives. Consider, for example, the network in Figure 1(a). Assume that the data is generated from such a dependency model, but that the node H is hidden. A minimal I-map for the marginal distribution is shown in Figure 1(b). From a pure representation perspective, this network is clearly less useful. It contains 12 edges rather than 6, and the nodes have much bigger families. Hence, as a representation of the process in the domain, it is much less meaningful. From the perspective of learning these networks from data, the marginalized network has significant disadvantages. Assuming all the variables are binary, it uses 59 parameters rather than 17, leading to substantial data fragmentation and thereby to nonrobust parameter estimates. Moreover, with limited amounts of data the induced network will usually omit several of the dependencies in the model.

Figure 1: Hidden variable simplifies structure. (a) with hidden variable; (b) no hidden variable.

When a hidden variable is known to exist, we can introduce it into the network and apply known BN learning algorithms. If the network structure is known, algorithms such as EM [3, 9] or gradient ascent [2] can learn parameters. If the structure is not known, the Structural EM (SEM) algorithm of [4] can be used to perform structure learning with missing data. However, we cannot simply introduce a "floating" hidden variable and expect SEM to place it correctly. Hence, both of these algorithms assume that some other mechanism introduces the hidden variable in approximately the right location in the network.
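The edge and parameter counts quoted for Figure 1 can be checked mechanically. The sketch below is our reconstruction of the topology (three observed parents X1..X3 of the hidden H, and three children Y1..Y3); the exact figure is not reproduced in this text, but this topology is the one consistent with the quoted counts of 6 vs. 12 edges and 17 vs. 59 parameters.

```python
# Parameter count for a Bayesian network over binary variables:
# each node contributes 2^(number of parents) free parameters.
def num_params(parents):
    return sum(2 ** len(p) for p in parents.values())

def num_edges(parents):
    return sum(len(p) for p in parents.values())

# Assumed Figure 1(a) topology: X1..X3 -> H -> Y1..Y3.
with_hidden = {
    "X1": [], "X2": [], "X3": [],
    "H": ["X1", "X2", "X3"],
    "Y1": ["H"], "Y2": ["H"], "Y3": ["H"],
}

# Minimal I-map after marginalizing out H (Figure 1(b)): every parent
# of H points to every child, and the children form a clique (ordered
# here so the result stays a DAG).
marginalized = {
    "X1": [], "X2": [], "X3": [],
    "Y1": ["X1", "X2", "X3"],
    "Y2": ["X1", "X2", "X3", "Y1"],
    "Y3": ["X1", "X2", "X3", "Y1", "Y2"],
}
```

Running `num_edges` and `num_params` on these two structures reproduces the 6 vs. 12 edges and 17 vs. 59 parameters discussed above.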
Somewhat surprisingly, little work has been done on the problem of automatically detecting that a hidden variable might be present in a certain position in the network. In this paper, we investigate what is arguably the most straightforward approach for inducing the existence of a hidden variable. This approach, briefly mentioned in [7], is roughly as follows: We begin by using standard Bayesian model selection algorithms to learn a structure over the observable variables. We then search the structure for substructures, which we call semi-cliques, that seem as if they might be induced by a hidden variable. We temporarily introduce the hidden variable in a way that breaks up the clique, and then continue learning based on that new structure. If the resulting structure has a better score, we keep the hidden variable. Surprisingly, this very basic technique does not seem to have been pursued. (The approach of [10] is similar on the surface, but is actually quite different; see Section 5.) We provide a concrete and efficient instantiation of this approach and show how to integrate it with existing learning algorithms such as SEM. We apply our approach to several synthetic and real datasets, and show that it often provides a good initial placement for the introduced hidden variable. We can therefore use it as a preprocessing step for SEM, substantially reducing the SEM search space.

2 Learning Structure of Bayesian Networks

Consider a finite set X = {X_1, ..., X_n} of discrete random variables where each variable X_i may take on values from a finite set. A Bayesian network is an annotated directed acyclic graph G that encodes a joint probability distribution over X. The nodes of the graph correspond to the random variables X_1, ..., X_n. Each node is annotated with a conditional probability distribution that represents P(X_i | Pa(X_i)), where Pa(X_i) denotes the parents of X_i in G.
A Bayesian network B specifies a unique joint probability distribution over X given by P_B(X_1, ..., X_n) = prod_{i=1}^{n} P_B(X_i | Pa(X_i)).

The problem of learning a Bayesian network can be stated as follows. Given a training set D = {x[1], ..., x[M]} of instances of X, find a network B that best matches D. The common approach to this problem is to introduce a scoring function that evaluates each network with respect to the training data, and then to search for the best network according to this score. The scoring function most commonly used to learn Bayesian networks is the Bayesian scoring metric [8]. Given a scoring function, the structure learning task reduces to a problem of searching over the combinatorial space of structures for the structure that maximizes the score. The standard approach is to use a local search procedure that changes one arc at a time. Greedy hill-climbing with random restarts is typically used.

The problem of learning in the presence of partially observable data (or known hidden variables) is computationally and conceptually much harder. In the case of a fixed network structure, the Expectation Maximization (EM) algorithm of [3] can be used to search for a (local) maximum likelihood (or maximum a posteriori) assignment to the parameters. The structural EM algorithm of [4] extends this idea to the realm of structure search. Roughly speaking, the algorithm uses an E-step as part of structure search. The current model, structure as well as parameters, is used for computing expected sufficient statistics for other candidate structures. The candidate structures are scored based on these expected sufficient statistics. The search algorithm then moves to a new candidate structure. We can then run EM again, for our new structure, to get the desired expected sufficient statistics.
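The factorized joint distribution above is easy to exercise in code. This is a minimal sketch under our own assumptions: a toy network of three binary variables with hand-picked CPTs (none of it taken from the paper), checking that the product of local conditionals defines a proper joint distribution.

```python
# Toy Bayesian network over binary variables: node -> (parents, CPT),
# where the CPT maps a tuple of parent values to P(node = 1 | parents).
# The structure and the numbers are illustrative assumptions.
NETWORK = {
    "H": ((), {(): 0.3}),
    "X": (("H",), {(0,): 0.2, (1,): 0.9}),
    "Y": (("H",), {(0,): 0.5, (1,): 0.1}),
}

def joint_probability(network, assignment):
    """P_B(x_1, ..., x_n) = prod_i P_B(x_i | pa(x_i))."""
    prob = 1.0
    for node, (parents, cpt) in network.items():
        p_one = cpt[tuple(assignment[p] for p in parents)]
        prob *= p_one if assignment[node] == 1 else 1.0 - p_one
    return prob

# The joint over all 2^3 complete assignments must sum to 1.
total = sum(
    joint_probability(NETWORK, {"H": h, "X": x, "Y": y})
    for h in (0, 1) for x in (0, 1) for y in (0, 1)
)
```

For instance, P_B(H=1, X=1, Y=0) comes out as 0.3 * 0.9 * 0.9, exactly the product of the three local terms.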
3 Detecting Hidden Variables

We motivate our approach for detecting hidden variables by considering the simple example discussed in the introduction. Consider the distribution represented by the network shown in Figure 1(a), where H is a hidden variable. The variable H was the keystone for the conditional independence assumptions in this network. As a consequence, the marginal distribution over the remaining variables has almost no structure: each Y_j depends on all the X_i's, and the Y_j's themselves are also fully connected. A minimal I-map for this distribution is shown in Figure 1(b). It contains 12 edges compared to the original 6. We can show that this phenomenon is a typical effect of removing a hidden variable:

Proposition 3.1: Let G be a network over the variables X_1, ..., X_n, H. Let I be the conditional independence statements (statements of the form I(X; Y | Z)) that are implied by G and do not involve H. Let G' be the graph over X_1, ..., X_n that contains an edge from X_i to X_j whenever G contains such an edge, and in addition: G' contains a clique over the children Y_j of H, and G' contains an edge from any parent X_i of H to any child Y_j of H. Then G' is a minimal I-map for I.

We want to define a procedure that will suggest candidate hidden variables by finding structures of this type in the context of a learning algorithm. We will apply our procedure to networks induced by standard structure learning algorithms [7]. Clearly, it is unreasonable to hope that there is an exact mapping between substructures that have the form described in Proposition 3.1 and hidden variables. Learned networks are rarely an exact reflection of the minimal I-map for the underlying distribution. We therefore use a somewhat more flexible definition, which allows us to detect potential hidden variables. For a node X and a set of nodes Y, we define Delta(X; Y) to be the set of neighbors of X (parents or children) within the subset Y. We define a semi-clique to be a set of nodes Q where each node X in Q is linked to at least half of Q: |Delta(X; Q)| >= |Q|/2. (This is the strictest such threshold that still accepts a 4-clique in which each node is missing a single neighbor.)

We propose a simple heuristic for finding semi-cliques in the graph. We first observe that each semi-clique must contain a seed which is easy to spot; this seed is a 3-vertex clique.

Proposition 3.2: Any semi-clique of size 4 or more contains a clique of size 3.

The first phase of the algorithm is a search for all 3-cliques in the graph. The algorithm then tries to expand each of them into a maximal semi-clique in a greedy way. More precisely, at each iteration the algorithm attempts to add a node to the "current" semi-clique. If the expanded set satisfies the semi-clique property, then it is set as the new "current" clique. These tests are repeated until no additional variable can be added to the semi-clique. The algorithm outputs the expansions found based on the different 3-clique "seeds". We note that this greedy procedure does not find all semi-cliques. The exceptions are typically two semi-cliques that are joined by a small number of edges, making a larger legal semi-clique. These cases are of less interest to us, because they are less likely to arise from the marginalization of a hidden variable.

In the second phase, we convert each of the semi-cliques into a structure candidate containing a new hidden node. Suppose Q is a semi-clique. Our construction introduces a new variable H, and replaces all of the incoming edges into variables in Q by edges from H. Parents of nodes in Q are then made to be parents of H, unless the edge results in a cycle.
This process results in the removal of all intra-clique edges and makes H a proxy for all "outside" influences on the nodes in the clique.

In the third phase, we evaluate each of these candidate structures in an attempt to find the most useful hidden variable. There are several possible ways in which this candidate can be utilized by the learning algorithm. We propose three approaches. The simplest assumes that the network structure, after the introduction of the hidden variable, is fixed. In other words, we assume that the "true" structure of the network is indeed the result of applying our transformation to the input network (which was produced by the first stage of learning). We can then simply fit the parameters using EM, and score the resulting network.

We can improve this idea substantially by noting that our simple transformation of the semi-clique does not typically recover the true underlying structure of the original model. In our construction, we chose to make the hidden variable H the parent of all the nodes in the semi-clique, and eliminate all other incoming edges to variables in the clique. Clearly, this construction is very limited. There might well be cases where some of the edges in the clique are warranted even in the presence of the hidden variable. It might also be the case that some of the edges from H to the semi-clique variables should be reversed. Finally, it is plausible that some nodes were included in the semi-clique accidentally, and should not be directly correlated with H. We can therefore allow the learning algorithm, namely the SEM algorithm of [4], to adapt the structure after the hidden variable is introduced. One approach is to use SEM to fine-tune our model for the part of the network we just changed: the variables in the semi-clique and the new hidden variable.
Therefore, in the second approach we fix the remaining structure, and consider only adaptations of the edges within this set of variables. This restriction substantially reduces the search space for the SEM algorithm. The third approach allows full structural adaptation over the entire network. This offers the SEM algorithm greater flexibility, but is computationally more expensive.

To summarize our approach: In the first phase we analyze the network learned using conventional structure search to find semi-cliques that indicate potential locations of hidden variables. In the second phase we convert these semi-cliques into structure candidates (each containing a new hidden variable). Finally, in the third phase we evaluate each of these structures (possibly using them as a seed for further search) and return the best scoring network we find.

The main assumption of our approach is that we can find "structural signatures" of hidden variables via semi-cliques. As we discussed above, it is unrealistic to expect the learned network G to have exactly the structure described in Proposition 3.1. On the one hand, learned networks often have spurious edges resulting from statistical noise, which might cause fragments of the network to resemble these structures even if no hidden variable is involved. On the other hand, there might be edges that are missing or reversed. Spurious edges are less problematic. At worst, they will lead us to propose a spurious hidden variable which will be eliminated by the subsequent evaluation step. Our definition of semi-clique, with its more flexible structure, partially deals with the problem of missing edges. However, if our data is very sparse, so that standard learning algorithms will be very reluctant to produce clusters with many edges, the approach we propose will not work.
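The first phase described above (find 3-clique seeds, then greedily expand each into a maximal semi-clique) can be sketched as follows. The adjacency-set representation, the candidate ordering, and the toy graph are our own assumptions, not the paper's implementation; the semi-clique test itself is the |Delta(X; Q)| >= |Q|/2 criterion from the text.

```python
from itertools import combinations

def is_semi_clique(adj, q):
    """Each node in q must neighbor at least half of q: |Delta(X;Q)| >= |Q|/2."""
    return all(2 * len(adj[x] & q) >= len(q) for x in q)

def find_semi_cliques(adj):
    """Greedily expand every 3-clique seed into a maximal semi-clique."""
    nodes = set(adj)
    seeds = [
        {a, b, c}
        for a, b, c in combinations(sorted(nodes), 3)
        if b in adj[a] and c in adj[a] and c in adj[b]
    ]
    results = []
    for seed in seeds:
        current = set(seed)
        grew = True
        while grew:  # keep adding nodes while the semi-clique property holds
            grew = False
            for cand in sorted(nodes - current):
                if is_semi_clique(adj, current | {cand}):
                    current |= {cand}
                    grew = True
        if current not in results:  # different seeds may reach the same set
            results.append(current)
    return results

# Undirected skeleton of a small learned network (illustrative): the
# children Y1..Y3 form a triangle, and X1 attaches to two of them.
adj = {
    "Y1": {"Y2", "Y3", "X1"},
    "Y2": {"Y1", "Y3", "X1", "X2"},
    "Y3": {"Y1", "Y2"},
    "X1": {"Y1", "Y2"},
    "X2": {"Y2"},
}
cliques = find_semi_cliques(adj)
```

On this graph the triangle over Y1..Y3 is a seed, X1 (with two neighbors in the set) passes the half-of-Q test and is absorbed, and X2 (with only one neighbor) is rejected, so the single semi-clique found is {X1, Y1, Y2, Y3}.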
4 Experimental Results

Our aim is to evaluate the success of our procedure in detecting hidden variables. To do so, we evaluated our procedure on both synthetic and real-life data sets. The synthetic data sets were sampled from Bayesian networks that appear in the literature. We then created a training set in which we "hid" one variable. We chose to hide variables that are "central" in the network (i.e., variables that are the parents of several children). The synthetic data sets allow for a controlled evaluation, and for generating training and testing data sets of any desired size. However, the data is generated from a distribution that indeed has only a single hidden variable. A more realistic benchmark is real data, which may contain many confounding influences. In this case, of course, we do not have a generating model to compare against.

Insurance: A 27-node network developed to evaluate driver's insurance applications [2]. We hid the variables Accident, Age, MakeModel, and VehicleYear (A, G, M, V in Figure 2). Alarm: A 37-node network [1] developed to monitor ICU patients. We hid the variables HR, intubation, LVFailure, and VentLung (H, I, L, V in Figure 2). Stock Data: A real-life dataset that traces the daily change of 20 major US technology stocks for several years (1516 trading days). These values were discretized to three categories: "up", "no change", and "down". TB: A real-life dataset that records information about 2302 tuberculosis patients in the San Francisco county (courtesy of Dr. Peter Small, Stanford Medical Center). The data set contains demographic information such as gender, age, ethnic group, and medical information such as HIV status, TB infection type, and other test results.

Figure 2: Comparison of the different approaches (Original, Hidden, Naive) on the Insurance 1k, Alarm 1k, Alarm 10k, Stock, and TB data sets. Each point in the graph corresponds to a network learned by one of the methods. The graphs on the bottom row show the log of the Bayesian score. The graphs on the top row show log-likelihood of an independent test set. In all graphs, the scale is normalized to the performance of the No-hidden network, shown by the dashed line at "0".

In each data set, we applied our procedure as follows. First, we used a standard model selection procedure to learn a network from the training data (without any hidden variables). In our implementation, we used standard greedy hill-climbing search that stops when it reaches a plateau it cannot escape. We supplied the learned network as input to the clique-detecting algorithm, which returned a set of candidate hidden variables. We then used each candidate as the starting point for a new learning phase. The Hidden procedure returns the highest-scoring network that results from evaluating the different putative hidden variables. To gauge the quality of this learning procedure, we compared it to two "strawmen" approaches. The Naive strawman [4] initializes the learning with a network that has a single hidden variable as parent of all the observed variables.
It then applies SEM to get an improved network. This process is repeated several times, where each time a random perturbation (e.g., edge addition) is applied to help SEM escape local maxima. The Original strawman, which applies only to the synthetic data sets, is to use the true generating network on the data set. That is, we take the original network (that contains the variable we hid) and use standard parametric EM to learn parameters for it. This strawman corresponds to cases where the learner has additional prior knowledge about domain structure.

We quantitatively evaluated each of these networks in two ways. First, we computed the Bayesian score of each network on the training data. Second, we computed the logarithmic loss of predictions made by these networks on independent test data. The results are shown in Figure 2. In this evaluation, we used the performance of No-hidden as the baseline for comparing the other methods. Thus, a positive score of, say, 100 in Figure 2 indicates a score which is larger by 100 than the score of No-hidden. Since scores are the logarithm of the Bayesian posterior probability of structures (up to a constant), this implies that such a structure is 2^100 times more probable than the structure found by No-hidden.

We can see that, in most cases, the network learned by Hidden outperforms the network learned by No-hidden. In the artificial data sets, Original significantly outperforms our algorithm on test data. This is no surprise: Original has complete knowledge of the structure which generated the test data. Our algorithm can only evaluate networks according to their score; indeed, the scores of the networks found by Hidden are better than those of Original in 12 out of 13 cases tested. Thus, we see that the "correct" structure does not usually have the highest Bayesian score. Our approach usually outperforms the network learned by Naive.
This improvement is particularly significant in the real-life datasets.

As discussed in Section 3, there are three ways that a learning algorithm can utilize the original structure proposed by our algorithm. As our goal was to find the best model for the domain, we ran all three of them in each case, and chose the best resulting network. In all of our experiments, the variant that fixed the candidate structure and learned parameters for it resulted in scores that were significantly worse than the networks found by the variants that employed structure search. The networks trained by this variant also performed much worse on test data. This highlights the importance of structure search in evaluating a potential hidden variable. The initial structure candidate is often too simplified; on the one hand, it forces too many independencies among the variables in the semi-clique, and on the other, it can add too many parents to the new hidden variable.

The comparison between the two variants that use search is more complex. In many cases, the variant that gives SEM complete flexibility in adapting the network structure did not find a better scoring network than the variant that only searches for edges in the area of the new variable. In the cases where it did lead to an improvement, the difference in score was not significant. Since the variant that restricts SEM is computationally cheaper (often by an order of magnitude), we believe that it provides a good tradeoff between model quality and computational cost.

The structures found by our procedure are quite appealing. For example, in the stock market data, our procedure constructs a hidden variable that is the parent of several stocks: Microsoft, Intel, Dell, CISCO, and Yahoo. A plausible interpretation of this variable is "strong" market vs. "stationary" market.
When the hidden variable has the "strong" value, all the stocks have higher probability of going up. When the hidden variable has the "stationary" value, these stocks have much higher probability of being in the "no change" state. We do note that in the learned networks there were still many edges between the individual stocks. Thus, the hidden variable serves as a general market trend, while the additional edges provide a finer description of the correlations between individual stocks. The model we learned for the TB patient dataset was also interesting. One value of the hidden variable captures two highly dominant segments of the population: older, HIV-negative, foreign-born Asians, and younger, HIV-positive, US-born blacks. The hidden variable's children distinguished between the two aggregated subpopulations using the HIV-result variable, which was also a parent of most of them. We believe that, had we allowed the hidden variable to have three values, it would have separated these populations.

5 Discussion and Future Work

In this paper, we propose a simple and intuitive algorithm for finding plausible locations for hidden variables in BN learning. It attempts to detect structural signatures of a hidden variable in the network learned by standard structure search. We presented experiments showing that our approach is reasonably successful at producing better models. To our knowledge, this paper is also the first to provide systematic empirical tests of any approach to the task of discovering hidden variables.

The problem of detecting hidden variables has received surprisingly little attention. Spirtes et al. [11] suggest an approach that detects patterns of conditional independencies that can only be generated in the presence of hidden variables. This approach suffers from two limitations. First, it is sensitive to failures in even a few of the many independence tests it uses.
Second, it only detects hidden variables that are forced by the qualitative independence constraints. It cannot detect situations where the hidden variable provides a more succinct model of a distribution that can be described by a network without a hidden variable (as in the simple example of Figure 1).

Martin and VanLehn [10] propose an alternative approach that appears, on the surface, to be similar to ours. They start by checking correlations between all pairs of variables. This results in a "dependency" graph in which there is an edge from X to Y if their correlation is above a predetermined threshold. They then construct a two-layered network that contains independent hidden variables in the top level, and observables in the bottom layer, such that every dependency between two observed variables is "explained" by at least one common hidden parent. This approach suffers from three important drawbacks. First, it does not eliminate from consideration correlations that can be explained by direct edges among the observables. Thus, it forms clusters even in cases where the dependencies can be fully explained by a standard Bayesian network structure. Moreover, since it only examines pairwise dependencies, it cannot detect conditional independencies, such as X -> Y -> Z, from the data. (In this case, it would learn a hidden variable that is the parent of all three variables.) Finally, this approach learns a restricted form of networks that requires many hidden variables to represent dependencies among variables. Thus, it has limited utility in distinguishing "true" hidden variables from artifacts of the representation.

We plan to test further enhancements to the algorithm in several directions. First, other possibilities for structural signatures (for example, the structure resulting from a many-parents, many-children configuration) may expand the range of variables we can discover.
Second, our clique-discovering procedure is based solely on the structure of the network learned. Additional information, such as the confidence of learned edges [6, 5], might help the procedure avoid spurious signatures. Third, we plan to experiment with multi-valued hidden variables and better heuristics for selecting candidates out of the different proposed networks. Finally, we are considering approaches for dealing with sparse data, when the structural signatures do not manifest. Information-theoretic measures might provide a more statistical signature for the presence of a hidden variable.

Acknowledgements

This work was supported in part by ISF grant 244/99 and Israeli Ministry of Science grant 2008-1-99. Nir Friedman was supported by an Alon fellowship, and by the generosity of the Sacher foundation.

References

[1] I. Beinlich, G. Suermondt, R. Chavez, and G. Cooper. The ALARM monitoring system. In Proc. 2nd European Conf. on AI and Medicine, 1989.
[2] J. Binder, D. Koller, S. Russell, and K. Kanazawa. Adaptive probabilistic networks with hidden variables. Machine Learning, 29:213-244, 1997.
[3] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. J. Royal Stat. Soc., B 39:1-39, 1977.
[4] N. Friedman. The Bayesian structural EM algorithm. In UAI, 1998.
[5] N. Friedman and D. Koller. Being Bayesian about network structure. In UAI, 2000.
[6] N. Friedman, M. Goldszmidt, and A. Wyner. Data analysis with Bayesian networks: A bootstrap approach. In UAI, 1999.
[7] D. Heckerman. A tutorial on learning with Bayesian networks. In Learning in Graphical Models, 1998.
[8] D. Heckerman, D. Geiger, and D. M. Chickering. Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20:197-243, 1995.
[9] S. L. Lauritzen. The EM algorithm for graphical association models with missing data. Comp. Stat. and Data Anal., 19:191-201, 1995.
[10] J. Martin and K. VanLehn. Discrete factor analysis: Learning hidden variables in Bayesian networks. Technical report, Department of Computer Science, University of Pittsburgh, 1995.
[11] P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction and Search. Springer-Verlag, 1993.
", "award": [], "sourceid": 1940, "authors": [{"given_name": "Gal", "family_name": "Elidan", "institution": null}, {"given_name": "Noam", "family_name": "Lotner", "institution": null}, {"given_name": "Nir", "family_name": "Friedman", "institution": null}, {"given_name": "Daphne", "family_name": "Koller", "institution": null}]}