waffles_recommend
A command-line tool for predicting missing values in data and testing
collaborative-filtering recommendation systems. Here's the usage information:
Full Usage Information
[Square brackets] are used to indicate required arguments.
<Angled brackets> are used to indicate optional arguments.
waffles_recommend [command]
Predict missing values in data, and test collaborative-filtering
recommendation systems.
crossvalidate <options> [3col-data] [collab-filter]
Measure accuracy using cross-validation. Prints MSE and MAE to stdout.
<options>
-seed [value]
Specify a seed for the random number generator.
-folds [n]
Specify the number of folds. If not specified, the default is 2.
[3col-data]
The filename of a 3-column (user, item, rating) dataset. Column 0
contains a user ID. Column 1 contains an item ID. Column 2 contains
the known rating for that user-item pair.
fillmissingvalues <options> [data] [collab-filter]
Fill in the missing values in an ARFF file with predicted values and
print the resulting full dataset to stdout. ([data] is in full
users*items or patterns*attributes format, not the dense 3-column
format.)
<options>
-seed [value]
Specify a seed for the random number generator.
-nonormalize
Do not normalize all of the columns to fall between 0 and 1 before
imputing the missing values. (The default is to normalize first.)
[data]
The filename of a dataset with missing values to impute.
precisionrecall <options> [3col-data] [collab-filter]
Compute precision-recall data.
<options>
-seed [value]
Specify a seed for the random number generator.
-ideal
Ignore the model and compute ideal results (as if the model always
predicted correct ratings).
[3col-data]
The filename of a 3-column (user, item, rating) dataset. Column 0
contains a user ID. Column 1 contains an item ID. Column 2 contains
the known rating for that user-item pair.
roc <options> [3col-data] [collab-filter]
Compute data for an ROC curve. (The area under the curve will appear in
the comments at the top of the data.)
<options>
-seed [value]
Specify a seed for the random number generator.
-ideal
Ignore the model and compute ideal results (as if the model always
predicted correct ratings).
[3col-data]
The filename of a 3-column (user, item, rating) dataset. Column 0
contains a user ID. Column 1 contains an item ID. Column 2 contains
the known rating for that user-item pair.
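For illustration, the idea behind the ROC computation can be sketched in Python. This is an assumed sketch with made-up scores and relevance labels, not the toolkit's implementation: rank items by predicted rating, sweep a threshold, and integrate the curve.

```python
def roc_auc(scores, labels):
    """Sweep a threshold over predicted scores to build an ROC curve.

    scores: predicted ratings (higher means more likely relevant)
    labels: 1 if the user actually liked the item, else 0
    Assumes both classes are present.
    """
    # Sort by descending score; each prefix of this order is one threshold.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    pos = sum(labels)
    neg = len(labels) - pos
    tp = fp = 0
    points = [(0.0, 0.0)]  # (false-positive rate, true-positive rate)
    for i in order:
        if labels[i] == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    # The area under the curve, by trapezoidal integration.
    auc = sum((x1 - x0) * (y0 + y1) / 2
              for (x0, y0), (x1, y1) in zip(points, points[1:]))
    return points, auc
```

A perfect ranking (all relevant items scored above all irrelevant ones) yields an AUC of 1.0; a random ranking yields about 0.5.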
transacc <options> [train] [test] [collab-filter]
Train using [train], then test using [test]. Prints MSE and MAE to
stdout.
<options>
-seed [value]
Specify a seed for the random number generator.
[train]
The filename of a 3-column (user, item, rating) dataset with one row for
each rating. Column 0 contains a user ID. Column 1 contains an item
ID. Column 2 contains the known rating for that user-item pair.
[test]
The filename of a 3-column (user, item, rating) dataset with one row for
each rating. Column 0 contains a user ID. Column 1 contains an item
ID. Column 2 contains the known rating for that user-item pair.
usage
Print usage information.
[collab-filter]
A collaborative-filtering recommendation algorithm.
bag <contents> end
A bagging (bootstrap aggregating) ensemble. This is a way to combine the
power of collaborative filtering algorithms through voting. "end" marks
the end of the ensemble contents. Each collaborative filtering algorithm
instance is trained on a subset of the original data, where each
expressed element is given a probability of 0.5 of occurring in the
training set.
<contents>
[instance_count] [collab-filter]
Specify the number of instances of a collaborative filtering
algorithm to add to the bagging ensemble.
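The sampling scheme described above can be sketched as follows. This is an illustrative sketch, not the toolkit's code; `train_fn` stands in for any collaborative filter's training routine.

```python
import random

def bag_train(ratings, train_fn, instance_count, seed=0):
    """Train an ensemble where each member sees each known rating with probability 0.5.

    ratings: list of (user, item, rating) triples (the "expressed" elements)
    train_fn: callable that trains one collaborative filter from such triples
              and returns a predict(user, item) callable
    """
    rng = random.Random(seed)
    models = []
    for _ in range(instance_count):
        # Each expressed element is kept with probability 0.5.
        subset = [r for r in ratings if rng.random() < 0.5]
        models.append(train_fn(subset))

    def predict(user, item):
        # The ensemble votes by averaging the member predictions.
        return sum(m(user, item) for m in models) / len(models)
    return predict
```

Here `train_fn` could, for example, be the baseline item-average predictor described below, or any other filter in this list.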
baseline
A very simple recommendation algorithm. It always predicts the average
rating for each item. This algorithm is useful as a baseline algorithm
for comparison.
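As a sketch of the idea (not the toolkit's implementation), the baseline predictor can be written as:

```python
def baseline_train(ratings):
    """Predict the average known rating for each item.

    ratings: list of (user, item, rating) triples
    """
    totals, counts = {}, {}
    for _, item, rating in ratings:
        totals[item] = totals.get(item, 0.0) + rating
        counts[item] = counts.get(item, 0) + 1
    # Fall back to the global mean for items with no ratings at all.
    global_mean = sum(totals.values()) / max(sum(counts.values()), 1)

    def predict(user, item):
        if item in counts:
            return totals[item] / counts[item]
        return global_mean
    return predict
```

Note that the prediction ignores the user entirely, which is what makes this a useful floor for comparing real collaborative filters against.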
clusterdense [n] <options>
A collaborative-filtering algorithm that clusters users based on a dense
distance metric with k-means, and then makes uniform recommendations
within each cluster.
[n]
The number of clusters to use.
<options>
-norm [l]
Specify the norm for the L-norm distance metric to use.
-missingpenalty [d]
Specify the difference to use in the distance computation when a
value is missing from one or both of the vectors.
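The distance metric that clusterdense relies on can be sketched as follows; this is an illustration, not the toolkit's code. `None` marks a missing value, and the `-missingpenalty` difference is substituted wherever an element is missing.

```python
def lnorm_distance(a, b, l=2.0, missing_penalty=1.0):
    """L-norm distance between two rating vectors with missing values.

    a, b: equal-length lists of ratings; None marks a missing value.
    Whenever either element is missing, missing_penalty is used as the
    difference for that dimension.
    """
    total = 0.0
    for x, y in zip(a, b):
        diff = missing_penalty if x is None or y is None else abs(x - y)
        total += diff ** l
    return total ** (1.0 / l)
```

With `l=2` this is Euclidean distance over the observed dimensions, plus a fixed contribution for each missing one.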
clustersparse [n] <options>
A collaborative-filtering algorithm that clusters users based on a sparse
similarity metric with k-means, and then makes uniform recommendations
within each cluster.
[n]
The number of clusters to use.
<options>
-pearson
Use Pearson Correlation to compute the similarity between users.
(The default is to use the cosine method.)
instance [k] <options>
An instance-based collaborative-filtering algorithm that makes
recommendations based on the k-nearest neighbors of a user.
[k]
The number of neighbors to use.
<options>
-pearson
Use Pearson Correlation to compute the similarity between users.
(The default is to use the cosine method.)
-regularize [value]
Add [value] to the denominator in order to regularize the results.
This prevents the recommendations from being dominated by
similarities computed from only a small number of overlapping items.
Typically, [value] will be a small number, like 0.5 or 1.5.
-sigWeight [value]
Scale the significance weighting of the items based on how many
items two users have both rated. The default value of 0 indicates
that no significance weighting will be done. The significance is
scaled as numItemsRatedByBothUsers/sigWeight.
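The user-similarity computation with the -regularize and -sigWeight adjustments described above can be sketched like this. It is an assumed sketch, not the toolkit's implementation, and the rating values in the usage below are made up.

```python
import math

def user_similarity(a, b, regularize=0.0, sig_weight=0.0):
    """Cosine similarity between two users over the items both have rated.

    a, b: dicts mapping item id -> rating
    regularize: added to the denominator so a tiny overlap cannot dominate
    sig_weight: if > 0, scale the similarity by overlap / sig_weight,
                per the -sigWeight description above
    """
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    na = math.sqrt(sum(a[i] ** 2 for i in common))
    nb = math.sqrt(sum(b[i] ** 2 for i in common))
    sim = dot / (na * nb + regularize)
    if sig_weight > 0:
        # Significance weighting: distrust similarities built on few items.
        sim *= len(common) / sig_weight
    return sim
```

A k-nearest-neighbor recommender would then rate an item for a user as a similarity-weighted average over the k most similar users who rated that item.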
matrix [intrinsic] <options>
A matrix factorization collaborative-filtering algorithm. (Implemented
according to the specification on page 631 in Takacs, G., Pilaszy, I.,
Nemeth, B., and Tikk, D. Scalable collaborative filtering approaches for
large recommender systems. The Journal of Machine Learning Research,
10:623-656, 2009. ISSN 1532-4435., except with the addition of
learning-rate decay and a different stopping criterion.)
[intrinsic]
The number of intrinsic (or latent) feature dims to use to represent
each user's preferences.
<options>
-regularize [value]
Specify a regularization value. Typically, this is a small value.
Larger values will put more pressure on the system to use small
values in the matrix factors.
-miniters [value]
Specify the minimum number of iterations to train the model
before checking its validation error. This ensures that the model
does at least a certain amount of training before converging.
-decayrate [value]
Specify a decay rate in the range (0-1) for the learning-rate
parameter. Values closer to 1 cause the rate to decay more slowly,
while values closer to 0 cause a faster decay.
-nonneg
Constrain all non-bias weights to be non-negative
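A minimal sketch of the factorization idea follows: stochastic gradient descent on user and item factor vectors, with L2 regularization and learning-rate decay. This is an illustration of the technique, not the cited paper's or the toolkit's exact algorithm, and the hyperparameter defaults are assumptions.

```python
import random

def factorize(ratings, n_users, n_items, intrinsic=2, rate=0.05,
              regularize=0.02, decay=0.99, epochs=200, seed=0):
    """Fit user and item factor matrices to the known ratings by SGD.

    ratings: list of (user, item, rating) triples
    Returns predict(user, item) -> dot product of the two factor vectors.
    """
    rng = random.Random(seed)
    U = [[rng.uniform(-0.1, 0.1) for _ in range(intrinsic)]
         for _ in range(n_users)]
    V = [[rng.uniform(-0.1, 0.1) for _ in range(intrinsic)]
         for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            pred = sum(U[u][k] * V[i][k] for k in range(intrinsic))
            err = r - pred
            for k in range(intrinsic):
                uk, vk = U[u][k], V[i][k]
                # Gradient step with L2 regularization on the factors.
                U[u][k] += rate * (err * vk - regularize * uk)
                V[i][k] += rate * (err * uk - regularize * vk)
        rate *= decay  # learning-rate decay
    return lambda u, i: sum(U[u][k] * V[i][k] for k in range(intrinsic))
```

Missing cells are then predicted by the same dot product, which is what lets the factorization generalize to unrated (user, item) pairs.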
nlpca [intrinsic] <options>
A non-linear PCA collaborative-filtering algorithm. This algorithm was
published in Scholz, M., Kaplan, F., Guy, C. L., Kopka, J., Selbig, J.,
Non-linear PCA: a missing data approach, In Bioinformatics, Vol. 21,
Number 20, pp. 3887-3895, Oxford University Press, 2005. It uses a
generalization of backpropagation to train a multi-layer perceptron to
fit to the known ratings, and to predict unknown values.
[intrinsic]
The number of intrinsic (or latent) feature dims to use to represent
each user's preferences.
<options>
-addlayer [size]
Add a hidden layer with "size" logistic units to the network. You
may use this option multiple times to add multiple layers. The
first layer added is adjacent to the input features. The last layer
added is adjacent to the output labels. If you don't add any hidden
layers, the network is just a single layer of sigmoid units.
-learningrate [value]
Specify a value for the learning rate. The default is 0.1.
-momentum [value]
Specify a value for the momentum. The default is 0.0.
-windowepochs [value]
Specify the number of training epochs that are performed before the
stopping criterion is tested again. Bigger values will result in a
more stable stopping criterion. Smaller values will check the
stopping criterion more frequently.
-minwindowimprovement [value]
Specify the minimum improvement that must occur over the window of
epochs for training to continue. [value] specifies the minimum
decrease in error as a ratio. For example, if value is 0.02, then
training will stop when the mean squared error does not decrease by
two percent over the window of epochs. Smaller values will
typically result in longer training times.
-dontsquashoutputs
Don't squash the output values with the logistic function. Just
report the net value at the output layer. This is often used for
regression.
-noinputbias
Do not use an input bias.
-nothreepass
Use one-pass training instead of three-pass training.
-regularize [value]
Specify a regularization value. Typically, this is a small value.
Larger values will put more pressure on the system to use small
values in the matrix factors. Note that this is only used if
three-pass training is being used and there is at least one hidden
layer.
-miniters [value]
Specify the minimum number of iterations to train the model
before checking its validation error. This ensures that the model
does at least a certain amount of training before converging.
-decayrate [value]
Specify a decay rate in the range (0-1) for the learning-rate
parameter. Values closer to 1 cause the rate to decay more slowly,
while values closer to 0 cause a faster decay.
hybridnlpca [intrinsic] [item_dataset] <data_opts> <options>
A hybrid content-based recommendation and collaborative filter based on
NLPCA. This approach combines collaborative filtering with
content-based recommendation.
[intrinsic]
The number of intrinsic (or latent) feature dims to use to represent
each user's preferences.
[item_dataset] <data_opts>
The dataset representing the item attributes. It is assumed that the
item dataset matrix is in the form of item id followed by the
attribute values for each item. It assumes that the item corresponds
with the first column in the 3-col data.
<data_opts>
-labels [attr_list]
Specify which attributes to use as labels. (If not specified, the
default is to use the last attribute for the label.) [attr_list] is
a comma-separated list of zero-indexed columns. A hyphen may be used
to specify a range of columns. A '*' preceding a value means to
index from the right instead of the left. For example, "0,2-5"
refers to columns 0, 2, 3, 4, and 5. "*0" refers to the last
column. "0-*1" refers to all but the last column.
-ignore [attr_list]
Specify attributes to ignore. [attr_list] is a comma-separated list
of zero-indexed columns. A hyphen may be used to specify a range of
columns. A '*' preceding a value means to index from the right
instead of the left. For example, "0,2-5" refers to columns 0, 2,
3, 4, and 5. "*0" refers to the last column. "0-*1" refers to all
but the last column.
<options>
-addlayer [size]
Add a hidden layer with "size" logistic units to the network. You
may use this option multiple times to add multiple layers. The
first layer added is adjacent to the input features. The last layer
added is adjacent to the output labels. If you don't add any hidden
layers, the network is just a single layer of sigmoid units.
-learningrate [value]
Specify a value for the learning rate. The default is 0.1.
-momentum [value]
Specify a value for the momentum. The default is 0.0.
-windowepochs [value]
Specify the number of training epochs that are performed before the
stopping criterion is tested again. Bigger values will result in a
more stable stopping criterion. Smaller values will check the
stopping criterion more frequently.
-minwindowimprovement [value]
Specify the minimum improvement that must occur over the window of
epochs for training to continue. [value] specifies the minimum
decrease in error as a ratio. For example, if value is 0.02, then
training will stop when the mean squared error does not decrease by
two percent over the window of epochs. Smaller values will
typically result in longer training times.
-dontsquashoutputs
Don't squash the output values with the logistic function. Just
report the net value at the output layer. This is often used for
regression.
-crossentropy
Use cross-entropy instead of squared-error for the error signal.
-noinputbias
Do not use an input bias.
-nothreepass
Use one-pass training instead of three-pass training.
-regularize [value]
Specify a regularization value. Typically, this is a small value.
Larger values will put more pressure on the system to use small
weight values. Note that this is only used if three-pass training is
being used and there is at least one hidden layer.
-miniters [value]
Specify the minimum number of iterations to train the model
before checking its validation error. This ensures that the model
does at least a certain amount of training before converging.
-decayrate [value]
Specify a decay rate in the range (0-1) for the learning-rate
parameter. Values closer to 1 cause the rate to decay more slowly,
while values closer to 0 cause a faster decay.
contentbased [item_dataset] <data_opts> [learning_algorithm] <learning_opts>
A content-based filter. A content-based recommendation filter is built
using the supervised learning algorithms provided in the Waffles toolkit.
[item_dataset] <data_opts>
The dataset representing the item attributes. It is assumed that the
item dataset matrix is in the form of item id followed by the
attribute values for each item. It assumes that the item corresponds
with the first column in the 3-col data.
<data_opts>
-labels [attr_list]
Specify which attributes to use as labels. (If not specified, the
default is to use the last attribute for the label.) [attr_list] is
a comma-separated list of zero-indexed columns. A hyphen may be used
to specify a range of columns. A '*' preceding a value means to
index from the right instead of the left. For example, "0,2-5"
refers to columns 0, 2, 3, 4, and 5. "*0" refers to the last
column. "0-*1" refers to all but the last column.
-ignore [attr_list]
Specify attributes to ignore. [attr_list] is a comma-separated list
of zero-indexed columns. A hyphen may be used to specify a range of
columns. A '*' preceding a value means to index from the right
instead of the left. For example, "0,2-5" refers to columns 0, 2,
3, 4, and 5. "*0" refers to the last column. "0-*1" refers to all
but the last column.
[learning_algorithm] <learning_opts>
See the usage statement for the desired learning algorithm using
"waffles_learn usage".
cbcf [item_dataset] <data_opts> [learning_algorithm] <learning_opts> -- [k] <inst_options>
A content-boosted collaborative filter. This algorithm was published in
P. Melville, R. Mooney, and R. Nagarajan, Content-Boosted Collaborative
Filtering for Improved Recommendations, in Proceedings of the 18th
National Conference on Artificial Intelligence (AAAI-02), pp. 187-192,
2002. It uses a content-based filter to fill in the sparse matrix before
giving it to a collaborative filter. We followed the authors'
implementation and used an instance-based collaborative filter. Note that
this algorithm often takes a while to run.
[item_dataset] <data_opts>
The dataset representing the item attributes. It is assumed that the
item dataset matrix is in the form of item id followed by the
attribute values for each item. It assumes that the item corresponds
with the first column in the 3-col data.
<data_opts>
-labels [attr_list]
Specify which attributes to use as labels. (If not specified, the
default is to use the last attribute for the label.) [attr_list] is
a comma-separated list of zero-indexed columns. A hyphen may be used
to specify a range of columns. A '*' preceding a value means to
index from the right instead of the left. For example, "0,2-5"
refers to columns 0, 2, 3, 4, and 5. "*0" refers to the last
column. "0-*1" refers to all but the last column.
-ignore [attr_list]
Specify attributes to ignore. [attr_list] is a comma-separated list
of zero-indexed columns. A hyphen may be used to specify a range of
columns. A '*' preceding a value means to index from the right
instead of the left. For example, "0,2-5" refers to columns 0, 2,
3, 4, and 5. "*0" refers to the last column. "0-*1" refers to all
but the last column.
[learning_algorithm] <learning_opts>
See the usage statement for the desired learning algorithm using
"waffles_learn usage".
--
Denotes the ending of the learning algorithm parameters and the
parameters for the collaborative filter.
[k]
The number of neighbors to use.
<inst_options>
-pearson
Use Pearson Correlation to compute the similarity between users.
(The default is to use the cosine method.)
-regularize [value]
Add [value] to the denominator in order to regularize the results.
This prevents the recommendations from being dominated by
similarities computed from only a small number of overlapping items.
Typically, [value] will be a small number, like 0.5 or 1.5.
-sigWeight [value]
Scale the significance weighting of the items based on how many
items two users have both rated. The default value of 0 indicates
that no significance weighting will be done. The significance is
scaled as numItemsRatedByBothUsers/sigWeight.
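The content-boosting step that cbcf performs before collaborative filtering can be sketched as follows. This is an assumed sketch, not the published algorithm or the toolkit's code; `content_model` stands in for a supervised learner trained on the item attributes.

```python
def content_boost(ratings, n_users, n_items, content_model):
    """Densify a sparse rating matrix: keep known ratings, and fill every
    missing (user, item) cell with the content-based model's prediction.

    ratings: list of (user, item, rating) triples
    content_model: callable (user, item) -> predicted rating, e.g. a
                   learner trained on the item attributes (hypothetical)
    Returns the dense users x items matrix, ready for the instance-based
    collaborative filter to run on.
    """
    known = {(u, i): r for u, i, r in ratings}
    return [[known.get((u, i), content_model(u, i)) for i in range(n_items)]
            for u in range(n_users)]
```

Because every cell of the users x items matrix must be filled before the collaborative filter runs, this is also why the algorithm often takes a while on large datasets.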