Abstract: The utility problem in speedup learning describes a common behavior of machine learning methods: the eventual degradation of performance due to increasing amounts of learned knowledge. The shape of the learning curve (cost of using a learning method vs. number of training examples) over several domains suggests a parameterized model relating performance to the amount of learned knowledge and a mechanism to limit the amount of learned knowledge for optimal performance. Many recent approaches to avoiding the utility problem in speedup learning rely on sophisticated utility measures and significant numbers of training data to accurately estimate the utility of control knowledge. Empirical results presented here and elsewhere indicate that a simple selection strategy of retaining all control rules derived from a training problem explanation quickly defines an efficient set of control knowledge from few training problems. This simple selection strategy provides a low-cost alternative to example-intensive approaches for improving the speed of a problem solver. Experimentation illustrates the existence of a minimum (representing least cost) in the learning curve which is reached after a few training examples. Stress is placed on controlling the amount of learned knowledge as opposed to which knowledge. An attempt is also made to relate domain characteristics to the shape of the learning curve.
Available as: mr96.ps
A shorter version of this article appears in our IEA/AIE-94 paper.
Abstract: Discovering repetitive substructure in a structural database improves the ability to interpret and compress the data. This paper describes the SUBDUE system that uses domain-independent and domain-dependent heuristics to find interesting and repetitive structures in structural data. This substructure discovery technique can be used to discover fuzzy concepts, compress the data description, and formulate hierarchical substructure definitions. Examples from the domains of scene analysis, chemical compound analysis, computer-aided design, and program analysis demonstrate the benefits of the discovery technique.
Available as: jiis95.ps
Abstract: Intermediate decision trees are the subtrees of the full (unpruned) decision tree generated in a breadth-first order. An extensive empirical investigation evaluates the classification error of intermediate decision trees and compares their performance to full and pruned trees. Empirical results were generated using C4.5 with 66 databases from the UCI machine learning database repository. Results show that when attempting to minimize the error of the pruned tree produced by C4.5, the best intermediate tree performs significantly better in 46 of the 66 databases. These and other results question the effectiveness of decision tree pruning strategies and suggest further consideration of the full tree and its intermediates. Also, the results reveal specific properties satisfied by databases in which the intermediate full tree performs best. Such relationships improve guidelines for selecting appropriate inductive strategies based on domain properties.
Available as: ijcai95.ps
Abstract: Discovering repetitive, interesting, and functional substructures in a structural database improves the ability to interpret and compress the data. However, scientists working with a database in their area of expertise often search for predetermined types of structures, or for structures exhibiting characteristics specific to the domain. This paper presents a method for guiding the discovery process with domain-specific knowledge. In this paper, the SUBDUE discovery system is used to evaluate the benefits of using domain knowledge to guide the discovery process. The domain knowledge is incorporated into SUBDUE following a single general methodology to guide the discovery process. Results show that domain-specific knowledge improves the search for substructures which are useful to the domain, and leads to greater compression of the data. To illustrate these benefits, examples and experiments from the computer programming, computer aided design circuit, and artificially-generated domains are presented.
Available as: kdd95.ps
Abstract: The ability to identify interesting and repetitive substructures is an essential component to discovering knowledge in structural data. We describe a new version of our SUBDUE substructure discovery system based on the minimum description length principle. The SUBDUE system discovers substructures that compress the original data and represent structural concepts in the data. By replacing previously-discovered substructures in the data, multiple passes of SUBDUE produce a hierarchical description of the structural regularities in the data. SUBDUE uses a computationally-bounded inexact graph match that identifies similar, but not identical, instances of a substructure and finds an approximate measure of closeness of two substructures when under computational constraints. In addition to the minimum description length principle, other background knowledge can be used by SUBDUE to guide the search towards more appropriate substructures. Experiments in a variety of domains demonstrate SUBDUE's ability to find substructures capable of compressing the original data and to discover structural concepts important to the domain.
Available as: jair94.ps
Abstract: Because many databases contain or can be embellished with structural information, a method for identifying interesting and repetitive substructures is an essential component to discovering knowledge in such databases. This paper describes the SUBDUE system, which uses the minimum description length (MDL) principle to discover substructures that compress the database and represent structural concepts in the data. By replacing previously-discovered substructures in the data, multiple passes of SUBDUE produce a hierarchical description of the structural regularities in the data. Inclusion of background knowledge guides SUBDUE toward appropriate substructures for a particular domain or discovery goal, and the use of an inexact graph match allows a controlled amount of deviations in the instance of a substructure concept. We describe the application of SUBDUE to a variety of domains. We also discuss approaches to combining SUBDUE with non-structural discovery systems.
Available as: kdd94.ps
Abstract: Many recent approaches to avoiding the utility problem in speedup learning (the eventual degradation of performance due to increasing amounts of learned problem-solver control knowledge) rely on sophisticated utility measures and significant numbers of training problems to accurately estimate the utility of control knowledge. Empirical results presented here and elsewhere indicate that a simple selection strategy of retaining all control rules derived from a training problem solution quickly defines an efficient set of control knowledge from few training problems. This simple selection strategy provides a low-cost alternative to example-intensive approaches for improving the speed of a problem solver. Experimentation illustrates the existence of a minimum (representing least cost) in the learning curve which is reached after a few training examples. Stress is placed on controlling the amount of learned knowledge as opposed to which knowledge. An attempt is also made to relate domain characteristics to the shape of the learning curve.
Available as: ieaaie94.ps
An extended version of this paper appears in our Machine Reasoning book chapter.
Abstract: This paper describes the application of Artificial Intelligence planning techniques to sensor pointing and control. In particular, the SAUSAGES plan execution system has been adapted to execute and control plans generated for the guidance of autonomous vehicles and control of camera movement associated with the vehicle. Additional capabilities have been added to allow for multiple vehicle and sensor coordination.
Available as: iuw94.ps
Abstract: Concept discovery in structural data requires the identification of repetitive substructures in the data. We describe a method for discovering substructures in data using an inexact graph match. A previous implementation of our SUBDUE system discovers substructures based on the psychologically-motivated criteria of cognitive savings, compactness, connectivity and coverage. However, the instances in the data must exactly match the discovered substructures. We describe a new implementation of SUBDUE that employs an inexact graph match to discover substructures which occur often in the data, but not always in the same form. This inexact substructure discovery can be used to formulate fuzzy concepts, compress the data description, and discover interesting structures in data that are found either in an identical or in a slightly convoluted form. Examples from the domains of scene analysis and chemical compound analysis demonstrate the benefits of the inexact discovery technique.
Available as: tkde93.ps
Abstract: Many recent approaches to avoiding the utility problem in speedup learning rely on sophisticated utility measures and significant numbers of training data to accurately estimate the utility of control knowledge. Empirical results presented here and elsewhere indicate that a simple selection strategy of retaining all control rules derived from a training problem explanation quickly defines an efficient set of control knowledge from few training problems. This simple selection strategy provides a low-cost alternative to example-intensive approaches for improving the speed of a problem solver.
Available as: kcsl.ps
Abstract: The overfit problem in inductive learning and the utility problem in speedup learning both describe a common behavior of machine learning methods: the eventual degradation of performance due to increasing amounts of learned knowledge. Plotting the performance of the changing knowledge during execution of a learning method (the performance response) reveals similar curves for several methods. The performance response generally indicates an increase to a single peak followed by a more gradual decrease in performance. The similarity in performance responses suggests a model relating performance to the amount of learned knowledge. This paper provides empirical evidence for the existence of a general model by plotting the performance responses of several learning programs. Formal models of the performance response are also discussed. These models can be used to control the amount of learning and avoid degradation of performance.
Available as: aaai92.ps
Abstract: This paper describes a method for discovering substructures in data using a fuzzy graph match. A previous implementation of the {\sc Subdue} system discovers substructures based on the psychologically-motivated criteria of cognitive savings, compactness, connectivity and coverage. However, the instances in the data must exactly match the discovered substructures. We describe a new implementation of {\sc Subdue} that employs a fuzzy graph match to discover substructures which occur often in the data, but not always in the same form. This fuzzy substructure discovery can be used to formulate fuzzy concepts, compress the data description, and discover interesting structures in data that are found either in their pure form or in a slightly convoluted form. Examples from the domains of scene analysis and chemical compound analysis demonstrate the fuzzy discovery technique.
Available as: ml92.ps
Abstract: The overfit problem in empirical learning and the utility problem in explanation-based learning describe a similar phenomenon: the degradation of performance due to an increase in the amount of learned knowledge. Plotting the performance of learned knowledge during the course of learning (the performance response) reveals a common trend for several learning methods. Modeling this trend allows a control system to constrain the amount of learned knowledge to achieve peak performance and avoid the general utility problem. Experiments evaluate a particular empirical model of the trend, and analysis of the learners derive several formal models. If, as evidence suggests, the general utility problem can be modeled using the same mechanisms for different learning paradigms, then the model serves to unify the paradigms into one framework capable of comparing and selecting different learning methods based on predicted achievable performance.
Available as: kcsl92.ps
Abstract: Experiments measuring knowledge utility during the course of learning reveal a common behavior among several learning methods. As the amount of learned knowledge increases, the utility (performance) of the knowledge initially increases, but eventually degrades. A multistrategy learning system based on a model of this common behavior selects a learning method according to the model's prediction of achievable performance. The model-based adaptive control (MBAC) approach provides this capability by maintaining models for each learning method. MBAC also predicts the amount of knowledge to be acquired by the method in order to avoid the degradation of performance. MBAC unifies multiple learning strategies based on the common behavior of the utility of their knowledge during the course of learning.
Available as: msl91.ps
Abstract: The overfit problem in empirical learning and the utility problem in analytical learning both describe a common behavior of machine learning methods: the eventual degradation of performance due to increasing amounts of learned knowledge. Plotting the performance of the changing knowledge during execution of a machine learning method (the performance response) reveals similar curves for several methods. The performance response generally indicates a single peak performance greater than that attained by popular pruning techniques. The similarity in performance responses suggests a parameterized model relating performance to the amount of learned knowledge. Given this model, a model-based adaptive control (MBAC) approach can be used to update the model based on feedback from the performance element and make control decisions regarding the amount of knowledge to be learned or unlearned.
In view of the large number of alternative learning methods, a more general utility problem exists in determining not only the correct amount of learned knowledge, but also the correct method for learning this knowledge. Relying too heavily on one particular learning method may result in less than optimal performance achievement. Overcoming this general utility problem requires a new control mechanism for determining the correct learning method and amount of learned knowledge in order to achieve the performance objectives of the task. Maintaining models for several learning methods allows the MBAC approach to decide the appropriate type of learning, in addition to the amount.
Experimentation analyzes the ability of the MBAC approach to converge upon the peak of the performance response and avoid generation of low utility knowledge. Results indicate that a quadratic model is sufficient to fit the peak of the performance response and that MBAC using the quadratic model performs well at selecting the best learning method for a given learning task. More formal analysis of the performance response supports the quadratic model for controlling how much knowledge to learn as opposed to which knowledge.
Available as: phdthesis.ps
Abstract: Integration of machine learning methods into knowledge-based systems requires greater control over the application of the learning methods. Recent research in machine learning has shown that isolated and unconstrained application of learning methods can eventually degrade performance. This paper presents an approach called performance-driven knowledge transformation for controlling the application of learning methods. The primary guidance for the control is performance of the knowledge base. The approach is implemented in the PEAK system. Two experiments with PEAK illustrate how the knowledge base is transformed using different learning methods to maintain performance goals. Results demonstrate the ability of performance-driven knowledge transformation to control the application of learning methods and maintain knowledge base performance.
Available as: ieaaie90.ps
Abstract: Experiments have revealed that uncontrolled application of the analytical learning paradigm results in knowledge having low utility. Because the performance element must consider low utility knowledge along with high utility knowledge, the proliferation of low utility knowledge eventually defeats the goal of improved performance. Experiments in empirical learning have demonstrated a similar phenomenon. Uncontrolled application of an empirical learning paradigm may result in inaccurate knowledge, and a post-processing stage is typically needed to repair the degradation in performance. The results from experimentation in both analytical and empirical learning imply a general utility problem in machine learning. This paper presents evidence for such a perspective and recommends a closer dependence between the learning paradigm and the performance goals for which it is designed. A new approach is presented along with experimentation that illustrates the applicability of the approach to the general utility problem.
Abstract: The complexity of most machine learning techniques can be improved by transforming iterative components into their parallel equivalent. Although this parallelization has been considered in theory, few implementations have been performed on existing parallel machines. The parallel architecture of the Connection Machine provides a platform for the implementation and evaluation of parallel learning techniques. The architecture of the Connection Machine is described along with limitations of the language interface that constrain the implementation of learning programs. Connection Machine implementations of two learning programs, Perceptron and AQ, are described, and their computational complexity is compared to that of the corresponding sequential versions using actual runs on the Connection Machine. Techniques for parallelizing ID3 are also analyzed, and the advantages and disadvantages of parallel implementation on the Connection Machine are discussed in the context of machine learning.
Available as: pdp90.ps
Abstract: Integration of machine learning methods with knowledge based systems requires sophisticated control mechanisms for applying methods appropriate to the performance task. Performance-driven knowledge transformation controls the application of learning methods based on their ability to achieve desired performance goals while preserving the performance on other tasks. A means-ends approach to performance-driven knowledge transformation is presented along with experimental results from an implementation. The results indicate that performance-driven knowledge transformation is able to maintain multiple performance goals by applying appropriate machine learning methods to transform a knowledge base.
Abstract: This paper describes the substructure discovery method used in the SUBDUE system. The method involves a computationally constrained best-first search guided by four heuristics: cognitive savings, compactness, connectivity and coverage. Each of the four heuristics are described along with their role in the evaluation of a substructure. An example demonstrates SUBDUE's ability to discover substructure and the advantages to be gained by other learning systems from the discovery of substructure concepts.