waffles_learn

A command-line tool that wraps supervised and semi-supervised learning algorithms. Here's the usage information:

Full Usage Information
[Square brackets] are used to indicate required arguments.
<Angled brackets> are used to indicate optional arguments.

waffles_learn [command]
   Supervised learning, transduction, cross-validation, etc.
   autotune [dataset] <data_opts> [algname]
      Use cross-validation to automatically determine a good set of parameters
      for the specified algorithm with the specified data. The selected
      parameters are printed to stdout.
      [dataset]
         The filename of a dataset.
      <data_opts>
         -labels [attr_list]
            Specify which attributes to use as labels. (If not specified, the
            default is to use the last attribute for the label.) [attr_list] is
            a comma-separated list of zero-indexed columns. A hyphen may be used
            to specify a range of columns.  A '*' preceding a value means to
            index from the right instead of the left. For example, "0,2-5"
            refers to columns 0, 2, 3, 4, and 5. "*0" refers to the last
            column. "0-*1" refers to all but the last column.
         -ignore [attr_list]
            Specify attributes to ignore. [attr_list] is a comma-separated list
            of zero-indexed columns. A hyphen may be used to specify a range of
            columns.  A '*' preceding a value means to index from the right
            instead of the left. For example, "0,2-5" refers to columns 0, 2,
            3, 4, and 5. "*0" refers to the last column. "0-*1" refers to all
            but the last column.
      [algname]
         The name of the algorithm that you wish to automatically tune.
         agglomerativetransducer
            An agglomerative transducer
         decisiontree
            A decision tree
         graphcuttransducer
            A graph-cut transducer
         knn
            A k-nearest-neighbor instance-based learner
         meanmarginstree
            A mean margins tree
         neuralnet
            A feed-forward neural network (a.k.a. multi-layer perceptron)
         naivebayes
            A naive Bayes model
         naiveinstance
            A naive instance model
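      For example, assuming a hypothetical dataset file named mydata.arff whose
      last column is the label, a decision tree could be tuned like this:
         waffles_learn autotune mydata.arff decisiontree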
   train <options> [dataset] <data_opts> [algorithm]
      Trains a supervised learning algorithm. The trained model-file is printed
      to stdout. (Typically, you will want to pipe this to a file.)
      <options>
         -seed [value]
            Specify a seed for the random number generator. (Use this option to
            ensure that your results are reproducible.)
         -calibrate
            Calibrate the model after it is trained, such that predicted
            distributions will approximate the distributions represented in the
            training data. This switch is typically used only if you plan to
            predict distributions (by calling predictdistribution) instead of
            just class labels or regression values. Calibration will not affect
            the predictions made by regular calls to 'predict', which is used
            by most other tools.
         -embed
            Escape the output model such that it can easily be embedded in C or
            C++ code.
      [dataset]
         The filename of a dataset.
      <data_opts>
         -labels [attr_list]
            Specify which attributes to use as labels. (If not specified, the
            default is to use the last attribute for the label.) [attr_list] is
            a comma-separated list of zero-indexed columns. A hyphen may be used
            to specify a range of columns.  A '*' preceding a value means to
            index from the right instead of the left. For example, "0,2-5"
            refers to columns 0, 2, 3, 4, and 5. "*0" refers to the last
            column. "0-*1" refers to all but the last column.
         -ignore [attr_list]
            Specify attributes to ignore. [attr_list] is a comma-separated list
            of zero-indexed columns. A hyphen may be used to specify a range of
            columns. A '*' preceding a value means to index from the right
            instead of the left. For example, "0,2-5" refers to columns 0, 2,
            3, 4, and 5. "*0" refers to the last column. "0-*1" refers to all
            but the last column.
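      For example, assuming a hypothetical dataset file named mydata.arff, a
      decision tree could be trained and saved like this (the output file name
      is arbitrary):
         waffles_learn train -seed 0 mydata.arff decisiontree > model.json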
   predict <options> [model-file] [dataset] <data_opts>
      Predict labels for all of the patterns in [dataset]. Results are printed
      in the form of a ".arff" file (containing only predicted labels) to
      stdout.
      <options>
         -seed [value]
            Specify a seed for the random number generator. (Use this option to
            ensure that your results are reproducible.)
      [model-file]
         The filename of a trained model. (This is the file to which you saved
         the output when you trained a supervised learning algorithm.)
      [dataset]
         The filename of a dataset. (There should already be placeholder labels
         in this dataset. The placeholder labels will be replaced in the output
         by the labels that the model predicts.)
      <data_opts>
         -labels [attr_list]
            Specify which attributes to use as labels. (If not specified, the
            default is to use the last attribute for the label.) [attr_list] is
            a comma-separated list of zero-indexed columns. A hyphen may be used
            to specify a range of columns.  A '*' preceding a value means to
            index from the right instead of the left. For example, "0,2-5"
            refers to columns 0, 2, 3, 4, and 5. "*0" refers to the last
            column. "0-*1" refers to all but the last column.
         -ignore [attr_list]
            Specify attributes to ignore. [attr_list] is a comma-separated list
            of zero-indexed columns. A hyphen may be used to specify a range of
            columns. A '*' preceding a value means to index from the right
            instead of the left. For example, "0,2-5" refers to columns 0, 2,
            3, 4, and 5. "*0" refers to the last column. "0-*1" refers to all
            but the last column.
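      For example, using hypothetical file names, a previously trained model
      could be applied to new patterns like this:
         waffles_learn predict model.json newdata.arff > predictions.arff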
   predictdistribution <options> [model-file] [dataset] <data_opts>
      Predict a distribution for all of the patterns in [dataset]. Results are
      printed in the form of a ".arff" file. (Typically, the '-calibrate'
      switch should be used when training the model. If the model is not
      calibrated, then the predicted distribution may not be a very good
      estimated distribution. Also, some models cannot be used to predict a
      distribution.)
      <options>
         -seed [value]
            Specify a seed for the random number generator. (Use this option to
            ensure that your results are reproducible.)
      [model-file]
         The filename of a trained model. (This is the file to which you saved
         the output when you trained a supervised learning algorithm.)
      [dataset]
         The filename of a dataset. (There should already be placeholder labels
         in this dataset. The placeholder labels will be replaced in the output
         by the labels that the model predicts.)
      <data_opts>
         -labels [attr_list]
            Specify which attributes to use as labels. (If not specified, the
            default is to use the last attribute for the label.) [attr_list] is
            a comma-separated list of zero-indexed columns. A hyphen may be used
            to specify a range of columns.  A '*' preceding a value means to
            index from the right instead of the left. For example, "0,2-5"
            refers to columns 0, 2, 3, 4, and 5. "*0" refers to the last
            column. "0-*1" refers to all but the last column.
         -ignore [attr_list]
            Specify attributes to ignore. [attr_list] is a comma-separated list
            of zero-indexed columns. A hyphen may be used to specify a range of
            columns.  A '*' preceding a value means to index from the right
            instead of the left. For example, "0,2-5" refers to columns 0, 2,
            3, 4, and 5. "*0" refers to the last column. "0-*1" refers to all
            but the last column.
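      For example, using hypothetical file names, a model could be trained with
      calibration and then used to predict distributions like this:
         waffles_learn train -calibrate mydata.arff naivebayes > model.json
         waffles_learn predictdistribution model.json newdata.arff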
   test <options> [model-file] [dataset] <data_opts>
      Test a trained model using some test data. Results are printed to stdout
      for each dimension in the label vector. Predictive accuracy is reported
      for nominal label dimensions, and mean-squared-error is reported for
      continuous label dimensions.
      <options>
         -seed [value]
            Specify a seed for the random number generator. (Use this option to
            ensure that your results are reproducible.)
         -confusion
            Print a confusion matrix for each nominal label attribute.
         -confusioncsv
            Print a confusion matrix in comma-separated value format for each
            nominal label attribute.
      [model-file]
         The filename of a trained model. (This is the file to which you saved
         the output when you trained a supervised learning algorithm.)
      [dataset]
         The filename of a test dataset. (This dataset must have the same
         number of columns as the dataset with which the model was trained.)
      <data_opts>
         -labels [attr_list]
            Specify which attributes to use as labels. (If not specified, the
            default is to use the last attribute for the label.) [attr_list] is
            a comma-separated list of zero-indexed columns. A hyphen may be used
            to specify a range of columns.  A '*' preceding a value means to
            index from the right instead of the left. For example, "0,2-5"
            refers to columns 0, 2, 3, 4, and 5. "*0" refers to the last
            column. "0-*1" refers to all but the last column.
         -ignore [attr_list]
            Specify attributes to ignore. [attr_list] is a comma-separated list
            of zero-indexed columns. A hyphen may be used to specify a range of
            columns.  A '*' preceding a value means to index from the right
            instead of the left. For example, "0,2-5" refers to columns 0, 2,
            3, 4, and 5. "*0" refers to the last column. "0-*1" refers to all
            but the last column.
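      For example, using hypothetical file names, a trained model could be
      evaluated on a held-out set, with a confusion matrix, like this:
         waffles_learn test -confusion model.json testdata.arff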
   transduce <options> [labeled-set] <data_opts1> [unlabeled-set] <data_opts2> [algorithm]
      Predict labels for [unlabeled-set] based on the examples in
      [labeled-set]. For most algorithms, this is the same as training on
      [labeled-set] and then predicting labels for [unlabeled-set]. Some
      algorithms, however, have no models. These can transduce, even though
      they cannot be trained. The predicted labels are printed to stdout as a
      ".arff" file.
      <options>
         -seed [value]
            Specify a seed for the random number generator. (Use this option to
            ensure that your results are reproducible.)
      [labeled-set]
         The filename of a dataset. The labels in this dataset are used to
         infer labels for the unlabeled set.
      [unlabeled-set]
         The filename of a dataset. This dataset must have placeholder labels,
         but these will be ignored when predicting new labels.
      <data_opts1>
         -labels [attr_list]
            Specify which attributes to use as labels. (If not specified, the
            default is to use the last attribute for the label.) [attr_list] is
            a comma-separated list of zero-indexed columns. A hyphen may be used
            to specify a range of columns.  A '*' preceding a value means to
            index from the right instead of the left. For example, "0,2-5"
            refers to columns 0, 2, 3, 4, and 5. "*0" refers to the last
            column. "0-*1" refers to all but the last column.
         -ignore [attr_list]
            Specify attributes to ignore. [attr_list] is a comma-separated list
            of zero-indexed columns. A hyphen may be used to specify a range of
            columns.  A '*' preceding a value means to index from the right
            instead of the left. For example, "0,2-5" refers to columns 0, 2,
            3, 4, and 5. "*0" refers to the last column. "0-*1" refers to all
            but the last column.
      <data_opts2>
         -labels [attr_list]
            Specify which attributes to use as labels. (If not specified, the
            default is to use the last attribute for the label.) [attr_list] is
            a comma-separated list of zero-indexed columns. A hyphen may be used
            to specify a range of columns.  A '*' preceding a value means to
            index from the right instead of the left. For example, "0,2-5"
            refers to columns 0, 2, 3, 4, and 5. "*0" refers to the last
            column. "0-*1" refers to all but the last column.
         -ignore [attr_list]
            Specify attributes to ignore. [attr_list] is a comma-separated list
            of zero-indexed columns. A hyphen may be used to specify a range of
            columns.  A '*' preceding a value means to index from the right
            instead of the left. For example, "0,2-5" refers to columns 0, 2,
            3, 4, and 5. "*0" refers to the last column. "0-*1" refers to all
            but the last column.
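      For example, using hypothetical file names, labels could be inferred for
      an unlabeled set like this:
         waffles_learn transduce labeled.arff unlabeled.arff graphcuttransducer > predicted.arff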
   transacc <options> [training-set] <data_opts1> [test-set] <data_opts2> [algorithm]
      Measure the transductive accuracy of [algorithm] with respect to the
      specified training and test sets. Results are printed to stdout for each
      dimension in the label vector. Predictive accuracy is reported for
      nominal labels, and mean-squared-error is reported for continuous labels.
      <options>
         -seed [value]
            Specify a seed for the random number generator. (Use this option to
            ensure that your results are reproducible.)
      [training-set]
         The filename of a dataset. The labels in this dataset are used to
         infer labels for the test set.
      [test-set]
         The filename of a test dataset. The labels in this dataset are used
         as the ground truth against which the predicted labels are compared.
      <data_opts1>
         -labels [attr_list]
            Specify which attributes to use as labels. (If not specified, the
            default is to use the last attribute for the label.) [attr_list] is
            a comma-separated list of zero-indexed columns. A hyphen may be used
            to specify a range of columns.  A '*' preceding a value means to
            index from the right instead of the left. For example, "0,2-5"
            refers to columns 0, 2, 3, 4, and 5. "*0" refers to the last
            column. "0-*1" refers to all but the last column.
         -ignore [attr_list]
            Specify attributes to ignore. [attr_list] is a comma-separated list
            of zero-indexed columns. A hyphen may be used to specify a range of
            columns.  A '*' preceding a value means to index from the right
            instead of the left. For example, "0,2-5" refers to columns 0, 2,
            3, 4, and 5. "*0" refers to the last column. "0-*1" refers to all
            but the last column.
      <data_opts2>
         -labels [attr_list]
            Specify which attributes to use as labels. (If not specified, the
            default is to use the last attribute for the label.) [attr_list] is
            a comma-separated list of zero-indexed columns. A hyphen may be used
            to specify a range of columns.  A '*' preceding a value means to
            index from the right instead of the left. For example, "0,2-5"
            refers to columns 0, 2, 3, 4, and 5. "*0" refers to the last
            column. "0-*1" refers to all but the last column.
         -ignore [attr_list]
            Specify attributes to ignore. [attr_list] is a comma-separated list
            of zero-indexed columns. A hyphen may be used to specify a range of
            columns.  A '*' preceding a value means to index from the right
            instead of the left. For example, "0,2-5" refers to columns 0, 2,
            3, 4, and 5. "*0" refers to the last column. "0-*1" refers to all
            but the last column.
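      For example, using hypothetical file names:
         waffles_learn transacc train.arff test.arff agglomerativetransducer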
   splittest <options> [dataset] <data_opts> [algorithm]
      This shuffles the data, then splits it into two parts, trains with one
      part, and tests with the other. (This also works with model-free
      algorithms.) Results are printed to stdout for each dimension in the
      label vector. Predictive accuracy is reported for nominal labels, and
      mean-squared-error is reported for continuous labels.
      <options>
         -seed [value]
            Specify a seed for the random number generator. (Use this option to
            ensure that your results are reproducible.)
         -trainratio [value]
            Specify the amount of the data (between 0 and 1) to use for
            training. The rest will be used for testing.
         -reps [value]
            Specify the number of repetitions to perform. If not specified, the
            default is 1.
         -writelastmodel [filename]
            Write the model generated on the last repetition to the given
            filename.  Note that this only works when the learner being used
            has an internal model.
      [dataset]
         The filename of a dataset.
      <data_opts>
         -labels [attr_list]
            Specify which attributes to use as labels. (If not specified, the
            default is to use the last attribute for the label.) [attr_list] is
            a comma-separated list of zero-indexed columns. A hyphen may be used
            to specify a range of columns.  A '*' preceding a value means to
            index from the right instead of the left. For example, "0,2-5"
            refers to columns 0, 2, 3, 4, and 5. "*0" refers to the last
            column. "0-*1" refers to all but the last column.
         -ignore [attr_list]
            Specify attributes to ignore. [attr_list] is a comma-separated list
            of zero-indexed columns. A hyphen may be used to specify a range of
            columns.  A '*' preceding a value means to index from the right
            instead of the left. For example, "0,2-5" refers to columns 0, 2,
            3, 4, and 5. "*0" refers to the last column. "0-*1" refers to all
            but the last column.
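      For example, a 75/25 split repeated 10 times with a hypothetical dataset
      file might look like this:
         waffles_learn splittest -seed 0 -trainratio 0.75 -reps 10 mydata.arff knn -neighbors 5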
   crossvalidate <options> [dataset] <data_opts> [algorithm]
      Perform cross-validation with the specified dataset and algorithm.
      Results are printed to stdout.
      <options>
         -seed [value]
            Specify a seed for the random number generator. (Use this option to
            ensure that your results are reproducible.)
         -reps [value]
            Specify the number of repetitions to perform. If not specified, the
            default is 5.
         -folds [value]
            Specify the number of folds to use. If not specified, the default
            is 2.
         -succinct
            Just report the average mean squared error. Do not report results
            at each fold.
      [dataset]
         The filename of a dataset.
      <data_opts>
         -labels [attr_list]
            Specify which attributes to use as labels. (If not specified, the
            default is to use the last attribute for the label.) [attr_list] is
            a comma-separated list of zero-indexed columns. A hyphen may be used
            to specify a range of columns.  A '*' preceding a value means to
            index from the right instead of the left. For example, "0,2-5"
            refers to columns 0, 2, 3, 4, and 5. "*0" refers to the last
            column. "0-*1" refers to all but the last column.
         -ignore [attr_list]
            Specify attributes to ignore. [attr_list] is a comma-separated list
            of zero-indexed columns. A hyphen may be used to specify a range of
            columns.  A '*' preceding a value means to index from the right
            instead of the left. For example, "0,2-5" refers to columns 0, 2,
            3, 4, and 5. "*0" refers to the last column. "0-*1" refers to all
            but the last column.
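      For example, 10-fold cross-validation repeated 3 times with a
      hypothetical dataset file might look like this:
         waffles_learn crossvalidate -seed 0 -reps 3 -folds 10 mydata.arff decisiontree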
   precisionrecall <options> [dataset] <data_opts> [algorithm]
      Compute the precision/recall for a dataset and algorithm
      <options>
         -seed [value]
            Specify a seed for the random number generator. (Use this option to
            ensure that your results are reproducible.)
         -labeldims [n]
            Specify the number of dimensions in the label (output) vector. The
            default is 1. (Don't confuse this with the number of class labels.
            It only takes one dimension to specify a class label, even if there
            are k possible labels.)
         -reps [n]
            Specify the number of reps to perform. More reps means it will take
            longer, but results will be more accurate. The default is 5.
         -samples [n]
            Specify the granularity at which to measure recall. If not
            specified, the default is 100.
      [dataset]
         The filename of a dataset.
      <data_opts>
         -labels [attr_list]
            Specify which attributes to use as labels. (If not specified, the
            default is to use the last attribute for the label.) [attr_list] is
            a comma-separated list of zero-indexed columns. A hyphen may be used
            to specify a range of columns.  A '*' preceding a value means to
            index from the right instead of the left. For example, "0,2-5"
            refers to columns 0, 2, 3, 4, and 5. "*0" refers to the last
            column. "0-*1" refers to all but the last column.
         -ignore [attr_list]
            Specify attributes to ignore. [attr_list] is a comma-separated list
            of zero-indexed columns. A hyphen may be used to specify a range of
            columns.  A '*' preceding a value means to index from the right
            instead of the left. For example, "0,2-5" refers to columns 0, 2,
            3, 4, and 5. "*0" refers to the last column. "0-*1" refers to all
            but the last column.
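      For example, using a hypothetical dataset file:
         waffles_learn precisionrecall -reps 5 mydata.arff naivebayes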
   sterilize <options> [dataset] <data_opts> [algorithm]
      Perform cross-validation to generate a new dataset that contains only the
      correctly-classified instances. The new sterilized data is printed to
      stdout.
      <options>
         -seed [value]
            Specify a seed for the random number generator. (Use this option to
            ensure that your results are reproducible.)
         -folds [n]
            Specify the number of cross-validation folds to perform.
         -diffthresh [d]
            Specify a threshold of absolute difference for continuous labels.
            Predictions with an absolute difference less than this threshold
            are considered to be "correct".
      [dataset]
         The filename of a dataset to sterilize.
      <data_opts>
         -labels [attr_list]
            Specify which attributes to use as labels. (If not specified, the
            default is to use the last attribute for the label.) [attr_list] is
            a comma-separated list of zero-indexed columns. A hyphen may be used
            to specify a range of columns.  A '*' preceding a value means to
            index from the right instead of the left. For example, "0,2-5"
            refers to columns 0, 2, 3, 4, and 5. "*0" refers to the last
            column. "0-*1" refers to all but the last column.
         -ignore [attr_list]
            Specify attributes to ignore. [attr_list] is a comma-separated list
            of zero-indexed columns. A hyphen may be used to specify a range of
            columns.  A '*' preceding a value means to index from the right
            instead of the left. For example, "0,2-5" refers to columns 0, 2,
            3, 4, and 5. "*0" refers to the last column. "0-*1" refers to all
            but the last column.
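      For example, using hypothetical file names:
         waffles_learn sterilize -folds 10 mydata.arff decisiontree > sterile.arff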
   regress [data] <data_opts> [equation]
      Use a hill climbing algorithm to optimize the parameters of [equation] to
      fit to the [data]. If [data] has d feature dimensions, then [equation]
      must have more than d parameters. The equation must be named f. The first
      d arguments to f are supplied by the data features. The remaining
      arguments are optimized by the hill climber. The data must have exactly 1
      label dimension, which the equation will attempt to predict. The
      sum-squared error and parameter values are printed to stdout.
      [data]
         The filename of a dataset.
      <data_opts>
         -labels [attr_list]
            Specify which attributes to use as labels. (If not specified, the
            default is to use the last attribute for the label.) [attr_list] is
            a comma-separated list of zero-indexed columns. A hyphen may be used
            to specify a range of columns.  A '*' preceding a value means to
            index from the right instead of the left. For example, "0,2-5"
            refers to columns 0, 2, 3, 4, and 5. "*0" refers to the last
            column. "0-*1" refers to all but the last column.
         -ignore [attr_list]
            Specify attributes to ignore. [attr_list] is a comma-separated list
            of zero-indexed columns. A hyphen may be used to specify a range of
            columns.  A '*' preceding a value means to index from the right
            instead of the left. For example, "0,2-5" refers to columns 0, 2,
            3, 4, and 5. "*0" refers to the last column. "0-*1" refers to all
            but the last column.
      [equation]
         An equation to regress to fit the data. The equation must be named
         'f'. It can call helper-equations, separated by semicolons, if needed.
         Example: "f(x1,x2,p1,p2,p3)=s(p1*x1+p2*x2+p3);s(x)=1/(1+e^(-x))"
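      For example, assuming a hypothetical dataset with a single feature column
      and a single label column, a straight line could be fitted like this:
         waffles_learn regress mydata.arff "f(x,m,b)=m*x+b"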
   metadata [data] <data_opts>
      Generate a vector of metadata values for the given dataset. This might be
      useful for meta-analysis. The resulting vector is printed to stdout in
      ARFF format.
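      For example, using a hypothetical dataset file:
         waffles_learn metadata mydata.arff > meta.arff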
   usage
      Print usage information.
[algorithm]
   A supervised learning algorithm, or a transductive algorithm.
   agglomerativetransducer
      A model-free transduction algorithm based on single-link agglomerative
      clustering. Unlabeled patterns take the label of the cluster with which
      they are joined. It never joins clusters with different labels.
   bag <contents> end
      A bagging (bootstrap aggregating) ensemble. This is a way to combine the
      power of many learning algorithms through voting. "end" marks the end of
      the ensemble contents. Each algorithm instance is trained using a
      training set created by drawing (with replacement) from the original data
      until the training set has the same number of instances as the original
      data.
      <contents>
         [instance_count] [algorithm]
            Specify the number of instances of a learning algorithm to add to
            the bagging ensemble.
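      For example, using a hypothetical dataset file, a bagging ensemble of 30
      decision trees could be trained like this:
         waffles_learn train mydata.arff bag 30 decisiontree end > ensemble.json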
   baseline
      This is one of the simplest of all supervised algorithms. It ignores all
      features. For nominal labels, it always predicts the most common class in
      the training set. For continuous labels, it always predicts the mean
      label in the training set. An effective learning algorithm should never
      do worse than baseline--hence the name "baseline".
   bucket <contents> end
      This uses cross-validation with the training set to select the best model
      from a bucket of models. When accuracy is measured across multiple
      datasets, it will usually do better than the best model in the bucket
      could do. "end" marks the end of the contents of the bucket.
      <contents>
         [algorithm]
            Add an algorithm to the bucket
   bma <contents> end
      A Bayesian model averaging ensemble. This trains each model after the
      manner of bagging, but then combines them weighted according to their
      probability given the data. Uniform priors are assumed.
      <contents>
         [instance_count] [algorithm]
            Specify the number of instances of a learning algorithm to add to
            the BMA ensemble.
   bmc <options> <contents> end
      A Bayesian model combination ensemble. This algorithm is described in
      Monteith, Kristine and Carroll, James and Seppi, Kevin and Martinez,
      Tony, Turning Bayesian Model Averaging into Bayesian Model Combination,
      Proceedings of the IEEE International Joint Conference on Neural Networks
      IJCNN'11, 2657--2663, 2011.
      <options>
         -samples [n]
            Specify the number of samples to draw from the simplex of possible
            ensemble combinations. (Larger values result in better accuracy
            with the cost of more computation.)
      <contents>
         [instance_count] [algorithm]
            Specify the number of instances of a learning algorithm to add to
            the BMC ensemble.
   boost <options> [algorithm]
      Uses ResamplingAdaBoost to create an ensemble that may be more accurate
      than a lone instance of the specified algorithm. (ResamplingAdaBoost is
      similar to AdaBoost, except that it uses resampling to approximate
      weighted instances in the training set. This difference enables it to
      work with algorithms that do not implicitly support weighted samples.)
      <options>
         -trainratio [value]
            When approximating the weighted training set by resampling, use a
            sample of size [value]*training_set_size
         -size [n]
            The number of base learners to use in the ensemble.
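      For example, using a hypothetical dataset file, a boosted ensemble of 30
      decision trees could be trained like this:
         waffles_learn train mydata.arff boost -size 30 decisiontree > ensemble.json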
   cvdt [n]
      This is a bucket of two bagging ensembles: one with [n] entropy-reducing
      decision trees, and one with [n] meanmarginstrees. (This algorithm is
      specified in Gashler, Michael S. and Giraud-Carrier, Christophe and
      Martinez, Tony. Decision Tree Ensemble: Small Heterogeneous Is Better
      Than Large Homogeneous. In The Seventh International Conference on
      Machine Learning and Applications, Pages 900 - 905, ICMLA '08. 2008)
   decisiontree <options>
      A decision tree.
      <options>
         -autotune
            Automatically determine a good set of parameters for this model
            with the current data.
         -random [draws]
            Use random divisions (instead of divisions that reduce entropy).
            Random divisions make the algorithm train faster, and also increase
            model variance, so it is better suited for ensembles, but random
            divisions also make the decision tree more vulnerable to problems
            with irrelevant features. [draws] is typically 1, but if you
            specify a larger value, it will pick the best out of the specified
            number of random draws.
         -binary
            Use binary divisions. For nominal attributes with more than 2
            categorical values, one specific value will be separated from all
            others at each division.
         -leafthresh [n]
            When building the tree, if the number of samples is <= this value,
            it will stop trying to divide the data and will create a leaf node.
            The default value is 1. For noisy data, larger values may be
            advantageous.
         -maxlevels [n]
            When building the tree, if the depth (the length of the path from
            the root to the node currently being formed, including the root and
            the currently forming node) is [n], it will stop trying to divide
            the data and will create a leaf node.  This means that there will
            be at most [n]-1 splits before a decision is made.  This crudely
            limits overfitting, and so can be helpful on small data sets.  It
            can also make the resulting trees easier to interpret.  If set to
            0, then there is no maximum (which is the default).
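      For example, using a hypothetical dataset file, a depth-limited tree with
      binary divisions could be trained like this:
         waffles_learn train mydata.arff decisiontree -binary -maxlevels 4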
   gaussianprocess <options>
      A Gaussian process model.
      <options>
         -noise [var]
            The variance of the noise parameter.
         -prior [var]
            The prior variance for the weights. (This value will be multiplied
            by an identity matrix to form the prior covariance for the weights.)
         -maxsamples [n]
            The maximum number of samples to train with. (If the training data
            contains more than [n] rows, then it will automatically randomly
            sub-sample the training data in order to limit computational
            complexity.)
         -kernel [k]
            Specify the kernel to use
            identity
               This simple kernel causes it to learn a linear model. If no
               kernel is specified, this is the default.
            chisquared
               A Chi Squared kernel.
            rbf [var]
               A Gaussian RBF kernel. [var] specifies the variance term for
               this kernel. Larger values result in a smoother model.
            polynomial [ofs] [order]
               A polynomial kernel.
               [ofs]
                  An offset value.
               [order]
                  The order of the polynomial.
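      For example, using a hypothetical dataset file, a Gaussian process with
      an RBF kernel could be trained like this:
         waffles_learn train mydata.arff gaussianprocess -kernel rbf 1.0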
   graphcuttransducer <options>
      This is a model-free transduction algorithm. It uses a min-cut/max-flow
      graph-cut algorithm to separate each label from all of the others.
      <options>
         -autotune
            Automatically determine a good set of parameters for this model
            with the current data.
         -neighbors [k]
            Set the number of neighbors to connect with each point in order to
            form the graph.
   hodgepodge
      This is a ready-made ensemble of various unrelated learning algorithms.
   knn <options>
      The k-Nearest-Neighbor instance-based learning algorithm. It uses
      Euclidean distance for continuous features and Hamming distance for
      nominal features.
      <options>
         -autotune
            Automatically determine a good set of parameters for this model
            with the current data.
         -neighbors [k]
            Specify the number of neighbors, k, to use.
         -nonormalize
            Specify not to normalize the scale of continuous features. (The
            default is to normalize by dividing by 2 times the deviation in
            that attribute.)
         -equalweight
            Give equal weight to every neighbor. (The default is to use linear
            weighting for continuous features, and squared linear weighting for
            nominal features.)
         -scalefeatures
            Use a hill-climbing algorithm on the training set to scale the
            feature dimensions in order to give more accurate results. This
            increases training time, but also improves accuracy and robustness
            to irrelevant features.
         -pearson
            Use Pearson's correlation coefficient to evaluate the similarity
            between sparse vectors. (Only compatible with sparse training.)
         -cosine
            Use the cosine method to evaluate the similarity between sparse
            vectors. (Only compatible with sparse training.)
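      For example, using a hypothetical dataset file:
         waffles_learn crossvalidate mydata.arff knn -neighbors 7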
   linear
      A linear regression model
   meanmarginstree
      This is a very simple oblique (or linear combination) tree. (This
      algorithm is specified in Gashler, Michael S. and Giraud-Carrier,
      Christophe and Martinez, Tony. Decision Tree Ensemble: Small
      Heterogeneous Is Better Than Large Homogeneous. In The Seventh
      International Conference on Machine Learning and Applications, Pages 900
      - 905, ICMLA '08. 2008)
   naivebayes <options>
      The naive Bayes learning algorithm.
      <options>
         -autotune
            Automatically determine a good set of parameters for this model
            with the current data.
         -ess [value]
            Specifies an equivalent sample size to prevent unsampled values
            from dominating the joint distribution. Good values typically range
            between 0 and 1.5.
   naiveinstance <options>
      This is an instance learner that assumes each dimension is conditionally
      independent from other dimensions. It lacks the accuracy of knn in low
      dimensional feature space, but scales much better to high dimensionality.
      <options>
         -autotune
            Automatically determine a good set of parameters for this model
            with the current data.
         -neighbors [k]
            Set the number of neighbors to use in each dimension
   neighbortransducer <options>
      This is a model-free transduction algorithm. It is an instance learner
      that propagates labels where the neighbors are most in agreement. This
      algorithm does well when classes sample a manifold (such as with text
      recognition).
      <options>
         -autotune
            Automatically determine a good set of parameters for this model
            with the current data.
         -neighbors [k]
            Set the number of neighbors to use with each point
   neuralnet <options>
      A single or multi-layer feed-forward neural network (a.k.a. multi-layer
      perceptron). It can be trained with online backpropagation (Rumelhart,
      D.E., Hinton, G.E., and Williams, R.J. Learning representations by
      back-propagating errors. Nature, 323:9, 1986.), or several other
      optimization methods.
      <options>
         -add [block]
            Add a block to this neural net.
            linear [units]
               A fully-connected block of linear weights
            bentidentity
               A bent identity nonlinearity block
            gaussian
               A gaussian nonlinearity block
            identity
               A block of identity (pass-through) units
            logistic
               A logistic nonlinearity block
            rectifier
               A rectifier nonlinearity block
            leakyrectifier
               A leaky rectifier nonlinearity block
            sigexp
               A sigexp nonlinearity block
            sine
               A sinusoid nonlinearity block
            softplus
               A softplus nonlinearity block
            softroot
               A softroot nonlinearity block
            tanh
               A tanh nonlinearity block
         -concat [inpos] [block]
            Concatenate a block to the last block in this neural net.
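      For example, using a hypothetical dataset file, a network with one hidden
      layer of 64 tanh units might be specified something like this (whether an
      output block must also be added depends on the data and the defaults):
         waffles_learn train mydata.arff neuralnet -add linear 64 -add tanh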
   randomforest [trees] <options>
      A bagging ensemble of decision trees that use random division
      boundaries. (This algorithm is described in Breiman, Leo (2001). Random
      Forests. Machine Learning 45 (1): 5-32. doi:10.1023/A:1010933404324.)
      [trees]
         Specify the number of trees in the random forest
      <options>
         -samples [n]
            Specify the number of randomly-drawn attributes to evaluate. The
            one that maximizes information gain will be chosen for the decision
            boundary. If [n] is 1, then the divisions are completely random.
            Larger values will decrease the randomness.
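      For example, using hypothetical file names, a random forest of 64 trees
      could be trained like this:
         waffles_learn train mydata.arff randomforest 64 > forest.json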
   reservoir <options>
      A reservoir network.
      <options>
         -augments [d]
            The number of dimensions to augment the data with. (Smaller values
            lead to smoother models.)
         -deviation [dev]
            The deviation to use to randomly initialize the weights in the
            reservoir.
         -layers [n]
            The number of hidden layers to use in the reservoir.
   wag <options>
      A multi-layer perceptron (MLP) that is trained by first training several
      MLP models, and then averaging their weights together using a process
      called wagging. (Before the weights in hidden layers can be averaged,
      they are first aligned using bipartite matching.)
      <options>
         -addlayer [size]
            Add a hidden layer with "size" logistic units to the network. You
            may use this option multiple times to add multiple layers. The
            first layer added is adjacent to the input features. The last layer
            added is adjacent to the output labels. If you don't add any hidden
            layers, the network is just a single layer of sigmoid units.
         -learningrate [value]
            Specify a value for the learning rate. The default is 0.1
         -models [k]
            Specify the number of MLP models to train and then average
            together.
         -momentum [value]
            Specifies a value for the momentum. The default is 0.0
         -windowepochs [value]
            Specifies the number of training epochs that are performed before
            the stopping criterion is tested again. Bigger values will result
            in a more stable stopping criterion. Smaller values will check the
            stopping criterion more frequently.
         -minwindowimprovement [value]
            Specify the minimum improvement that must occur over the window of
            epochs for training to continue. [value] specifies the minimum
            decrease in error as a ratio. For example, if value is 0.02, then
            training will stop when the mean squared error does not decrease by
            two percent over the window of epochs. Smaller values will
            typically result in longer training times.
         -noalign
            Specify to compute weight averages without first aligning the
            corresponding weights. This option will typically make results
            significantly worse, but it may be useful for evaluating the value
            of aligning the weights before averaging them together.
         -holdout [portion]
            Specify the portion of the data (between 0 and 1) to use as a
            hold-out set for validation. That is, this portion of the data will
            not be used for training, but will be used to determine when to
            stop training. If the holdout portion is set to 0, then no holdout
            set will be used, and the entire training set will be used for
            validation (which may lead to long training times and overfitting).
         -dontsquashoutputs
            Don't squash the output values with the logistic function. Just
            report the net value at the output layer. This is often used for
            regression.
   usage
      Print usage information.
