waffles_recommend
A command-line tool for predicting missing values in data and testing
collaborative-filtering recommendation systems. Here's the usage information:
Full Usage Information
[Square brackets] are used to indicate required arguments.
<Angled brackets> are used to indicate optional arguments.
waffles_recommend [command]
Predict missing values in data, and test collaborative-filtering
recommendation systems.
crossvalidate <options> [3col-data] [collab-filter]
Measure accuracy using cross-validation. Prints MSE and MAE to stdout.
<options>
-seed [value]
Specify a seed for the random number generator.
-folds [n]
Specify the number of folds. If not specified, the default is 2.
[3col-data]
The filename of a 3-column (user, item, rating) dataset. Column 0
contains a user ID. Column 1 contains an item ID. Column 2 contains
the known rating for that user-item pair.
fillmissingvalues <options> [data] [collab-filter]
Fill in the missing values in an ARFF file with predicted values and
print the resulting full dataset to stdout. ([data] is in full
users*items or patterns*attributes format, not the dense 3-column
format.)
<options>
-seed [value]
Specify a seed for the random number generator.
-nonormalize
Do not normalize all of the columns to fall between 0 and 1 before
imputing the missing values. (The default is to normalize first.)
[data]
The filename of a dataset with missing values to impute.
precisionrecall <options> [3col-data] [collab-filter]
Compute precision-recall data.
<options>
-seed [value]
Specify a seed for the random number generator.
-ideal
Ignore the model and compute ideal results (as if the model always
predicted correct ratings).
[3col-data]
The filename of a 3-column (user, item, rating) dataset. Column 0
contains a user ID. Column 1 contains an item ID. Column 2 contains
the known rating for that user-item pair.
roc <options> [3col-data] [collab-filter]
Compute data for an ROC curve. (The area under the curve will appear in
the comments at the top of the data.)
<options>
-seed [value]
Specify a seed for the random number generator.
-ideal
Ignore the model and compute ideal results (as if the model always
predicted correct ratings).
[3col-data]
The filename of a 3-column (user, item, rating) dataset. Column 0
contains a user ID. Column 1 contains an item ID. Column 2 contains
the known rating for that user-item pair.
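For illustration, the idea behind the ROC computation can be sketched in Python. This is an assumed sketch with made-up scores and relevance labels, not the toolkit's implementation: rank items by predicted rating, sweep a threshold, and integrate the curve.

```python
def roc_auc(scores, labels):
    """Sweep a threshold over predicted scores to build an ROC curve.

    scores: predicted ratings (higher means more likely relevant)
    labels: 1 if the user actually liked the item, else 0
    Assumes both classes are present.
    """
    # Sort by descending score; each prefix of this order is one threshold.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    pos = sum(labels)
    neg = len(labels) - pos
    tp = fp = 0
    points = [(0.0, 0.0)]  # (false-positive rate, true-positive rate)
    for i in order:
        if labels[i] == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    # The area under the curve, by trapezoidal integration.
    auc = sum((x1 - x0) * (y0 + y1) / 2
              for (x0, y0), (x1, y1) in zip(points, points[1:]))
    return points, auc
```

A perfect ranking (all relevant items scored above all irrelevant ones) yields an AUC of 1.0; a random ranking yields about 0.5.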
transacc <options> [train] [test] [collab-filter]
Train using [train], then test using [test]. Prints MSE and MAE to
stdout.
<options>
-seed [value]
Specify a seed for the random number generator.
[train]
The filename of a 3-column (user, item, rating) dataset with one row for
each rating. Column 0 contains a user ID. Column 1 contains an item
ID. Column 2 contains the known rating for that user-item pair.
[test]
The filename of a 3-column (user, item, rating) dataset with one row for
each rating. Column 0 contains a user ID. Column 1 contains an item
ID. Column 2 contains the known rating for that user-item pair.
usage
Print usage information.
[collab-filter]
A collaborative-filtering recommendation algorithm.
bag <contents> end
A bagging (bootstrap aggregating) ensemble. This is a way to combine the
power of collaborative filtering algorithms through voting. "end" marks
the end of the ensemble contents. Each collaborative filtering algorithm
instance is trained on a subset of the original data, where each
expressed element is given a probability of 0.5 of occurring in the
training set.
<contents>
[instance_count] [collab-filter]
Specify the number of instances of a collaborative filtering
algorithm to add to the bagging ensemble.
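The sampling scheme described above can be sketched as follows. This is an illustrative sketch, not the toolkit's code; `train_fn` stands in for any collaborative filter's training routine.

```python
import random

def bag_train(ratings, train_fn, instance_count, seed=0):
    """Train an ensemble where each member sees each known rating with probability 0.5.

    ratings: list of (user, item, rating) triples (the "expressed" elements)
    train_fn: callable that trains one collaborative filter from such triples
              and returns a predict(user, item) callable
    """
    rng = random.Random(seed)
    models = []
    for _ in range(instance_count):
        # Each expressed element is kept with probability 0.5.
        subset = [r for r in ratings if rng.random() < 0.5]
        models.append(train_fn(subset))

    def predict(user, item):
        # The ensemble votes by averaging the member predictions.
        return sum(m(user, item) for m in models) / len(models)
    return predict
```

Here `train_fn` could, for example, be the baseline item-average predictor described below, or any other filter in this list.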
baseline
A very simple recommendation algorithm. It always predicts the average
rating for each item. This algorithm is useful as a baseline algorithm
for comparison.
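As a sketch of the idea (not the toolkit's implementation), the baseline predictor can be written as:

```python
def baseline_train(ratings):
    """Predict the average known rating for each item.

    ratings: list of (user, item, rating) triples
    """
    totals, counts = {}, {}
    for _, item, rating in ratings:
        totals[item] = totals.get(item, 0.0) + rating
        counts[item] = counts.get(item, 0) + 1
    # Fall back to the global mean for items with no ratings at all.
    global_mean = sum(totals.values()) / max(sum(counts.values()), 1)

    def predict(user, item):
        if item in counts:
            return totals[item] / counts[item]
        return global_mean
    return predict
```

Note that the prediction ignores the user entirely, which is what makes this a useful floor for comparing real collaborative filters against.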
clusterdense [n] <options>
A collaborative-filtering algorithm that clusters users based on a dense
distance metric with k-means, and then makes uniform recommendations
within each cluster.
[n]
The number of clusters to use.
<options>
-norm [l]
Specify the norm for the L-norm distance metric to use.
-missingpenalty [d]
Specify the difference to use in the distance computation when a
value is missing from one or both of the vectors.
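The distance metric that clusterdense relies on can be sketched as follows; this is an illustration, not the toolkit's code. `None` marks a missing value, and the `-missingpenalty` difference is substituted wherever an element is missing.

```python
def lnorm_distance(a, b, l=2.0, missing_penalty=1.0):
    """L-norm distance between two rating vectors with missing values.

    a, b: equal-length lists of ratings; None marks a missing value.
    Whenever either element is missing, missing_penalty is used as the
    difference for that dimension.
    """
    total = 0.0
    for x, y in zip(a, b):
        diff = missing_penalty if x is None or y is None else abs(x - y)
        total += diff ** l
    return total ** (1.0 / l)
```

With `l=2` this is Euclidean distance over the observed dimensions, plus a fixed contribution for each missing one.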
clustersparse [n] <options>
A collaborative-filtering algorithm that clusters users based on a sparse
similarity metric with k-means, and then makes uniform recommendations
within each cluster.
[n]
The number of clusters to use.
<options>
-pearson
Use Pearson Correlation to compute the similarity between users.
(The default is to use the cosine method.)
instance [k] <options>
An instance-based collaborative-filtering algorithm that makes
recommendations based on the k-nearest neighbors of a user.
[k]
The number of neighbors to use.
<options>
-pearson
Use Pearson Correlation to compute the similarity between users.
(The default is to use the cosine method.)
-regularize [value]
Add [value] to the denominator in order to regularize the results.
This prevents the recommendations from being dominated by
similarities computed from only a small number of overlapping items.
Typically, [value] will be a small number, like 0.5 or 1.5.
-sigWeight [value]
Scale the significance weighting of the items based on how many
items two users have both rated. The default value of 0 indicates
that no significance weighting will be done. The significance is
scaled as numItemsRatedByBothUsers/sigWeight.
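The user-similarity computation with the -regularize and -sigWeight adjustments described above can be sketched like this. It is an assumed sketch, not the toolkit's implementation, and the rating values in the usage below are made up.

```python
import math

def user_similarity(a, b, regularize=0.0, sig_weight=0.0):
    """Cosine similarity between two users over the items both have rated.

    a, b: dicts mapping item id -> rating
    regularize: added to the denominator so a tiny overlap cannot dominate
    sig_weight: if > 0, scale the similarity by overlap / sig_weight,
                per the -sigWeight description above
    """
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    na = math.sqrt(sum(a[i] ** 2 for i in common))
    nb = math.sqrt(sum(b[i] ** 2 for i in common))
    sim = dot / (na * nb + regularize)
    if sig_weight > 0:
        # Significance weighting: distrust similarities built on few items.
        sim *= len(common) / sig_weight
    return sim
```

A k-nearest-neighbor recommender would then rate an item for a user as a similarity-weighted average over the k most similar users who rated that item.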
matrix [intrinsic] <options>
A matrix factorization collaborative-filtering algorithm. (Implemented
according to the specification on page 631 in Takacs, G., Pilaszy, I.,
Nemeth, B., and Tikk, D. Scalable collaborative filtering approaches for
large recommender systems. The Journal of Machine Learning Research,
10:623-656, 2009. ISSN 1532-4435., except with the addition of
learning-rate decay and a different stopping criterion.)
[intrinsic]
The number of intrinsic (or latent) feature dims to use to represent
each user's preferences.
<options>
-regularize [value]
Specify a regularization value. Typically, this is a small value.
Larger values will put more pressure on the system to use small
values in the matrix factors.
-miniters [value]
Specify the minimum number of iterations to train the model
before checking its validation error. This ensures that the model
does at least a certain amount of training before converging.
-decayrate [value]
Specify a decay rate in the range (0-1) for the learning-rate
parameter. Values closer to 1 cause the rate to decay more slowly,
while values closer to 0 cause a faster decay.
-nonneg
Constrain all non-bias weights to be non-negative
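A minimal sketch of the factorization idea follows: stochastic gradient descent on user and item factor vectors, with L2 regularization and learning-rate decay. This is an illustration of the technique, not the cited paper's or the toolkit's exact algorithm, and the hyperparameter defaults are assumptions.

```python
import random

def factorize(ratings, n_users, n_items, intrinsic=2, rate=0.05,
              regularize=0.02, decay=0.99, epochs=200, seed=0):
    """Fit user and item factor matrices to the known ratings by SGD.

    ratings: list of (user, item, rating) triples
    Returns predict(user, item) -> dot product of the two factor vectors.
    """
    rng = random.Random(seed)
    U = [[rng.uniform(-0.1, 0.1) for _ in range(intrinsic)]
         for _ in range(n_users)]
    V = [[rng.uniform(-0.1, 0.1) for _ in range(intrinsic)]
         for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            pred = sum(U[u][k] * V[i][k] for k in range(intrinsic))
            err = r - pred
            for k in range(intrinsic):
                uk, vk = U[u][k], V[i][k]
                # Gradient step with L2 regularization on the factors.
                U[u][k] += rate * (err * vk - regularize * uk)
                V[i][k] += rate * (err * uk - regularize * vk)
        rate *= decay  # learning-rate decay
    return lambda u, i: sum(U[u][k] * V[i][k] for k in range(intrinsic))
```

Missing cells are then predicted by the same dot product, which is what lets the factorization generalize to unrated (user, item) pairs.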
nlpca [intrinsic] <options>
A non-linear PCA collaborative-filtering algorithm. This algorithm was
published in Scholz, M., Kaplan, F., Guy, C. L., Kopka, J., Selbig, J.,
Non-linear PCA: a missing data approach, In Bioinformatics, Vol. 21,
Number 20, pp. 3887-3895, Oxford University Press, 2005. It uses a
generalization of backpropagation to train a multi-layer perceptron to
fit to the known ratings, and to predict unknown values.
[intrinsic]
The number of intrinsic (or latent) feature dims to use to represent
each user's preferences.
<options>
-addlayer [size]
Add a hidden layer with "size" logistic units to the network. You
may use this option multiple times to add multiple layers. The
first layer added is adjacent to the input features. The last layer
added is adjacent to the output labels. If you don't add any hidden
layers, the network is just a single layer of sigmoid units.
-learningrate [value]
Specify a value for the learning rate. The default is 0.1.
-momentum [value]
Specify a value for the momentum. The default is 0.0.
-windowepochs [value]
Specify the number of training epochs that are performed before the
stopping criterion is tested again. Bigger values will result in a
more stable stopping criterion. Smaller values will check the
stopping criterion more frequently.
-minwindowimprovement [value]
Specify the minimum improvement that must occur over the window of
epochs for training to continue. [value] specifies the minimum
decrease in error as a ratio. For example, if value is 0.02, then
training will stop when the mean squared error does not decrease by
two percent over the window of epochs. Smaller values will
typically result in longer training times.
-dontsquashoutputs
Don't squash the output values with the logistic function. Just
report the net value at the output layer. This is often used for
regression.
-noinputbias
Do not use an input bias.
-nothreepass
Use one-pass training instead of three-pass training.
-regularize [value]
Specify a regularization value. Typically, this is a small value.
Larger values will put more pressure on the system to use small
values in the matrix factors. Note that this is only used if
three-pass training is being used and there is at least one hidden
layer.
-miniters [value]
Specify the minimum number of iterations to train the model
before checking its validation error. This ensures that the model
does at least a certain amount of training before converging.
-decayrate [value]
Specify a decay rate in the range (0-1) for the learning-rate
parameter. Values closer to 1 cause the rate to decay more slowly,
while values closer to 0 cause a faster decay.
hybridnlpca [intrinsic] [item_dataset] <data_opts> <options>
A hybrid content-based recommendation and collaborative filter based on
NLPCA. This approach combines collaborative filtering with
content-based recommendation.
[intrinsic]
The number of intrinsic (or latent) feature dims to use to represent
each user's preferences.
[item_dataset] <data_opts>
The dataset representing the item attributes. It is assumed that the
item dataset matrix is in the form of item id followed by the
attribute values for each item. It assumes that the item corresponds
with the first column in the 3-col data.
<data_opts>
-labels [attr_list]
Specify which attributes to use as labels. (If not specified, the
default is to use the last attribute for the label.) [attr_list] is
a comma-separated list of zero-indexed columns. A hyphen may be used
to specify a range of columns. A '*' preceding a value means to
index from the right instead of the left. For example, "0,2-5"
refers to columns 0, 2, 3, 4, and 5. "*0" refers to the last
column. "0-*1" refers to all but the last column.
-ignore [attr_list]
Specify attributes to ignore. [attr_list] is a comma-separated list
of zero-indexed columns. A hyphen may be used to specify a range of
columns. A '*' preceding a value means to index from the right
instead of the left. For example, "0,2-5" refers to columns 0, 2,
3, 4, and 5. "*0" refers to the last column. "0-*1" refers to all
but the last column.
<options>
-addlayer [size]
Add a hidden layer with "size" logistic units to the network. You
may use this option multiple times to add multiple layers. The
first layer added is adjacent to the input features. The last layer
added is adjacent to the output labels. If you don't add any hidden
layers, the network is just a single layer of sigmoid units.
-learningrate [value]
Specify a value for the learning rate. The default is 0.1.
-momentum [value]
Specify a value for the momentum. The default is 0.0.
-windowepochs [value]
Specify the number of training epochs that are performed before the
stopping criterion is tested again. Bigger values will result in a
more stable stopping criterion. Smaller values will check the
stopping criterion more frequently.
-minwindowimprovement [value]
Specify the minimum improvement that must occur over the window of
epochs for training to continue. [value] specifies the minimum
decrease in error as a ratio. For example, if value is 0.02, then
training will stop when the mean squared error does not decrease by
two percent over the window of epochs. Smaller values will
typically result in longer training times.
-dontsquashoutputs
Don't squash the output values with the logistic function. Just
report the net value at the output layer. This is often used for
regression.
-crossentropy
Use cross-entropy instead of squared-error for the error signal.
-noinputbias
Do not use an input bias.
-nothreepass
Use one-pass training instead of three-pass training.
-regularize [value]
Specify a regularization value. Typically, this is a small value.
Larger values will put more pressure on the system to use small
weight values. Note that this is only used if three-pass training is
being used and there is at least one hidden layer.
-miniters [value]
Specify the minimum number of iterations to train the model
before checking its validation error. This ensures that the model
does at least a certain amount of training before converging.
-decayrate [value]
Specify a decay rate in the range (0-1) for the learning-rate
parameter. Values closer to 1 cause the rate to decay more slowly,
while values closer to 0 cause a faster decay.
contentbased [item_dataset] <data_opts> [learning_algorithm] <learning_opts>
A content-based filter. A content-based recommendation filter is built
using the supervised learning algorithms provided in the Waffles toolkit.
[item_dataset] <data_opts>
The dataset representing the item attributes. It is assumed that the
item dataset matrix is in the form of item id followed by the
attribute values for each item. It assumes that the item corresponds
with the first column in the 3-col data.
<data_opts>
-labels [attr_list]
Specify which attributes to use as labels. (If not specified, the
default is to use the last attribute for the label.) [attr_list] is
a comma-separated list of zero-indexed columns. A hyphen may be used
to specify a range of columns. A '*' preceding a value means to
index from the right instead of the left. For example, "0,2-5"
refers to columns 0, 2, 3, 4, and 5. "*0" refers to the last
column. "0-*1" refers to all but the last column.
-ignore [attr_list]
Specify attributes to ignore. [attr_list] is a comma-separated list
of zero-indexed columns. A hyphen may be used to specify a range of
columns. A '*' preceding a value means to index from the right
instead of the left. For example, "0,2-5" refers to columns 0, 2,
3, 4, and 5. "*0" refers to the last column. "0-*1" refers to all
but the last column.
[learning_algorithm] <learning_opts>
See the usage statement for the desired learning algorithm using
"waffles_learn usage".
cbcf [item_dataset] <data_opts> [learning_algorithm] <learning_opts> -- [k] <inst_options>
A content-boosted collaborative filter. This algorithm was published in
P. Melville, R. Mooney, and R. Nagarajan, Content-Boosted Collaborative
Filtering for Improved Recommendations, in Proceedings of the 18th
National Conference on Artificial Intelligence (AAAI-02), pp. 187-192,
2002. It uses a content-based filter to fill in the sparse matrix before
giving it to a collaborative filter. We followed the authors'
implementation and used an instance-based collaborative filter. Note that
this algorithm often takes a while to run.
[item_dataset] <data_opts>
The dataset representing the item attributes. It is assumed that the
item dataset matrix is in the form of item id followed by the
attribute values for each item. It assumes that the item corresponds
with the first column in the 3-col data.
<data_opts>
-labels [attr_list]
Specify which attributes to use as labels. (If not specified, the
default is to use the last attribute for the label.) [attr_list] is
a comma-separated list of zero-indexed columns. A hyphen may be used
to specify a range of columns. A '*' preceding a value means to
index from the right instead of the left. For example, "0,2-5"
refers to columns 0, 2, 3, 4, and 5. "*0" refers to the last
column. "0-*1" refers to all but the last column.
-ignore [attr_list]
Specify attributes to ignore. [attr_list] is a comma-separated list
of zero-indexed columns. A hyphen may be used to specify a range of
columns. A '*' preceding a value means to index from the right
instead of the left. For example, "0,2-5" refers to columns 0, 2,
3, 4, and 5. "*0" refers to the last column. "0-*1" refers to all
but the last column.
[learning_algorithm] <learning_opts>
See the usage statement for the desired learning algorithm using
"waffles_learn usage".
--
Denotes the ending of the learning algorithm parameters and the
parameters for the collaborative filter.
[k]
The number of neighbors to use.
<inst_options>
-pearson
Use Pearson Correlation to compute the similarity between users.
(The default is to use the cosine method.)
-regularize [value]
Add [value] to the denominator in order to regularize the results.
This prevents the recommendations from being dominated by
similarities computed from only a small number of overlapping items.
Typically, [value] will be a small number, like 0.5 or 1.5.
-sigWeight [value]
Scale the significance weighting of the items based on how many
items two users have both rated. The default value of 0 indicates
that no significance weighting will be done. The significance is
scaled as numItemsRatedByBothUsers/sigWeight.
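The content-boosting step that cbcf performs before collaborative filtering can be sketched as follows. This is an assumed sketch, not the published algorithm or the toolkit's code; `content_model` stands in for a supervised learner trained on the item attributes.

```python
def content_boost(ratings, n_users, n_items, content_model):
    """Densify a sparse rating matrix: keep known ratings, and fill every
    missing (user, item) cell with the content-based model's prediction.

    ratings: list of (user, item, rating) triples
    content_model: callable (user, item) -> predicted rating, e.g. a
                   learner trained on the item attributes (hypothetical)
    Returns the dense users x items matrix, ready for the instance-based
    collaborative filter to run on.
    """
    known = {(u, i): r for u, i, r in ratings}
    return [[known.get((u, i), content_model(u, i)) for i in range(n_items)]
            for u in range(n_users)]
```

Because every cell of the users x items matrix must be filled before the collaborative filter runs, this is also why the algorithm often takes a while on large datasets.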