waffles_transform
A command-line tool for transforming datasets. It contains import/export functionality, unsupervised algorithms, and other useful transforms that you may wish to perform on a dataset. Here's the usage information:
Full Usage Information
[Square brackets] are used to indicate required arguments.
<Angled brackets> are used to indicate optional arguments.
waffles_transform [command]
Transform data, shuffle rows, swap columns, matrix operations, etc.
add [dataset1] [dataset2]
Adds two matrices together element-wise. Results are printed to stdout.
[dataset1]
The filename of the first matrix.
[dataset2]
The filename of the second matrix.
addindexcolumn [dataset] <options>
Add a column that specifies the index of each row. This column will be
inserted as column 0. (For example, suppose you would like to plot the
values in each column of your data against the row index. Most plotting
tools expect one of the columns to supply the position on the horizontal
axis. This feature will create such a column for you.)
[dataset]
The filename of a dataset.
<options>
-start [value]
Specify the initial index. (The default is 0.0.)
-increment [value]
Specify the increment amount. (The default is 1.0.)
-name [value]
Specify the name of the new attribute.
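To make the effect of -start and -increment concrete, here is a minimal Python sketch of the transform described above. The function name add_index_column is hypothetical and is not part of Waffles; rows are assumed to be plain lists of numbers.

```python
def add_index_column(rows, start=0.0, increment=1.0):
    """Return a copy of rows with an index value inserted as column 0."""
    # Row i receives start + i * increment, mirroring -start and -increment.
    return [[start + i * increment] + list(row) for i, row in enumerate(rows)]
```

Calling add_index_column(data, start=10.0, increment=0.5) would label the rows 10.0, 10.5, 11.0, and so on.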
addcategorycolumn [dataset] [name] [value]
Add a column with a constant categorical value. This column will be
inserted as column 0.
[dataset]
The filename of a dataset.
[name]
The name of the new column or attribute.
[value]
The name of the constant value to insert in every row.
addnoise [dataset] [dev] <options>
Add Gaussian noise with the specified deviation to all the elements in
the dataset. (Assumes that the values are all continuous.)
[dataset]
The filename of a dataset.
[dev]
The deviation of the Gaussian noise
<options>
-seed [value]
Specify a seed for the random number generator.
-excludelast [n]
Do not add noise to the last [n] columns.
aggregatecols [n]
Make a matrix by aggregating each column [n] from the .arff files in the
current directory. The resulting matrix is printed to stdout.
aggregaterows [n]
Make a matrix by aggregating each row [n] from the .arff files in the
current directory. The resulting matrix is printed to stdout.
align [a] [b]
Translates and rotates dataset [b] to minimize mean squared difference
with dataset [a]. (Uses the Kabsch algorithm.)
[a]
The filename of a dataset.
[b]
The filename of a dataset.
autocorrelation [dataset]
Compute the autocorrelation of the specified time-series data.
cholesky [dataset]
Compute the Cholesky decomposition of the specified matrix.
correlation [dataset] [attr1] [attr2] <options>
Compute the linear correlation coefficient of the two specified
attributes.
[dataset]
The filename of a dataset.
[attr1]
A zero-indexed attribute number.
[attr2]
A zero-indexed attribute number.
<options>
-aboutorigin
Compute the correlation about the origin. (The default is to
compute it about the mean.)
covariance [dataset]
Compute the covariance matrix of the specified matrix.
colstats [dataset]
Generates a 4-row table. Row 0 contains the min value of each column in
[dataset]. Row 1 contains the max value of each column in [dataset]. Row
2 contains the mean value of each column in [dataset]. Row 3 contains the
median value of each column in [dataset].
[dataset]
The filename of a dataset.
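The layout of the 4-row table can be sketched in Python as follows. This is only an illustration under assumptions: col_stats is a hypothetical name, every column is assumed to be numeric with no missing values, and the median of an even number of rows is taken as the average of the two middle values, which the usage text does not specify.

```python
from statistics import mean, median


def col_stats(rows):
    """Return [mins, maxes, means, medians], with one entry per column."""
    columns = list(zip(*rows))  # transpose the rows into columns
    return [
        [min(c) for c in columns],     # row 0: min of each column
        [max(c) for c in columns],     # row 1: max of each column
        [mean(c) for c in columns],    # row 2: mean of each column
        [median(c) for c in columns],  # row 3: median of each column
    ]
```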
cumulativecolumns [dataset] [column-list]
Accumulates the values in the specified columns. For example, a column
that contains the values 2,1,3,2 would be changed to 2,3,6,8. This might
be useful for converting a histogram of some distribution into a
histogram of the cumulative distribution.
[dataset]
The filename of a dataset.
[column-list]
A comma-separated list of zero-indexed columns to transform. A hyphen
may be used to specify a range of columns. Example: 0,2-5,7
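The running-total behavior in the example above (2,1,3,2 becoming 2,3,6,8) can be sketched in Python. The name cumulative_columns is hypothetical, and the columns argument is assumed to be an already-expanded list of zero-based indexes rather than a column-list string.

```python
from itertools import accumulate


def cumulative_columns(rows, columns):
    """Replace each listed column with its running total, row by row."""
    result = [list(row) for row in rows]
    for c in columns:
        totals = accumulate(row[c] for row in rows)
        for out_row, total in zip(result, totals):
            out_row[c] = total
    return result
```

Columns that are not listed are copied through unchanged.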
determinant [dataset]
Compute the determinant of the specified matrix.
discretize [dataset] <options>
Discretizes the continuous attributes in the specified dataset.
[dataset]
The filename of a dataset.
<options>
-buckets [count]
Specify the number of buckets to use. If not specified, the default
is to use the square root of the number of rows in the dataset.
-colrange [first] [last]
Specify a range of columns. Only continuous columns in the
specified range will be modified. (Columns are zero-indexed.)
dropcolumns [dataset] [column-list]
Remove one or more columns from a dataset and print the results to
stdout. (The input file is not modified.)
[dataset]
The filename of a dataset.
[column-list]
A comma-separated list of zero-indexed columns to drop. A hyphen may be
used to specify a range of columns. A '*' preceding a value means to
index from the right instead of the left. For example, "0,2-5" refers
to columns 0, 2, 3, 4, and 5. "*0" refers to the last column. "0-*1"
refers to all but the last column.
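The column-list notation can be easier to follow as code. The Python sketch below expands a list into explicit indexes according to the rules stated above. It is an illustration, not the Waffles parser: parse_column_list is a hypothetical name, the number of columns must be supplied by the caller, and no validation of out-of-range or duplicate columns is attempted.

```python
def parse_column_list(spec, n_cols):
    """Expand a column list such as "0,2-5" or "0-*1" into column indexes."""

    def index(token):
        # A leading '*' counts from the right, so "*0" is the last column.
        if token.startswith("*"):
            return n_cols - 1 - int(token[1:])
        return int(token)

    columns = []
    for part in spec.split(","):
        if "-" in part:
            # A hyphen gives an inclusive range of columns.
            first, last = part.split("-")
            columns.extend(range(index(first), index(last) + 1))
        else:
            columns.append(index(part))
    return columns
```

For a dataset with 4 columns, "0-*1" expands to columns 0, 1, and 2, which is every column except the last.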
drophomogcols [dataset]
Remove all columns that are homogeneous (have zero variance).
dropiftooclose [dataset] [col] [gap]
Drop each row whose value in the specified column exceeds the value in
the previous row by less than [gap].
[dataset]
The filename of a dataset.
[col]
The column to evaluate.
[gap]
The minimum gap between sequential values.
droprows [dataset] [after-size]
Removes all rows except for the first [after-size] rows.
[dataset]
The filename of a dataset.
[after-size]
The number of rows to keep
dropmissingvalues [dataset]
Remove all rows that contain missing values.
droprandomvalues [dataset] [portion] <options>
Drop random values from the specified dataset. The resulting dataset with
missing values is printed to stdout.
[dataset]
The filename of a dataset.
[portion]
The portion of the data to drop. For example, if [portion] is 0.1,
then 10% of the values will be replaced with unknown values
<options>
-seed [value]
Specify a seed for the random number generator.
dropunusedvalues [dataset]
Drops any nominal meta-data values that are not used.
export [dataset] <options>
Print the data as a list of comma separated values without any meta-data.
[dataset]
The filename of a dataset.
<options>
-tab
Separate with tabs instead of commas.
-space
Separate with spaces instead of commas.
-r
Use "NA" instead of "?" for missing values. (This is the format
used by R.)
-columnnames
Print column names on the first row. (The default is to not print
column names.)
-precision [val]
Specify how many digits of precision to use before truncating and
resorting to scientific notation.
fillmissingvalues [dataset] <options>
Replace all missing values in the dataset. (Note that the
fillmissingvalues command in the waffles_recommend tool performs a
similar task, but it can intelligently predict the missing values instead
of just using the baseline value.)
[dataset]
The filename of a dataset
<options>
-seed [value]
Specify a seed for the random number generator.
-random
Replace each missing value with a randomly chosen non-missing value
from the same attribute. (The default is to use the baseline value.
That is, the mean for continuous attributes, and the most-common
value for nominal attributes.)
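The default (baseline) behavior can be sketched for a single column in Python. This is an assumption-laden illustration: baseline_fill is a hypothetical name, missing values are represented as None rather than the "?" used in data files, and ties for the most-common nominal value are broken by whichever value appears first.

```python
from collections import Counter
from statistics import mean


def baseline_fill(column, continuous):
    """Fill None entries with the mean (continuous) or the mode (nominal)."""
    known = [v for v in column if v is not None]
    if continuous:
        fill = mean(known)
    else:
        fill = Counter(known).most_common(1)[0][0]
    return [fill if v is None else v for v in column]
```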
filterelements [dataset] [attr] [min] [max] <options>
Remove each element in the specified attribute that does not fall in a
certain range.
[dataset]
The filename of a dataset
[attr]
A zero-indexed column number
[min]
The minimum acceptable value
[max]
The maximum acceptable value
<options>
-invert
Drop elements that fall within the range instead of elements that
do not fall within the range.
filterrows [dataset] [attr] [min] [max] <options>
Remove each row where the value of the specified attribute does not fall
in a certain range. Rows with unknown values in the specified attribute
will also be deleted.
[dataset]
The filename of a dataset
[attr]
A zero-indexed column number
[min]
The minimum acceptable value
[max]
The maximum acceptable value
<options>
-invert
Drop the row if the value falls within the range instead of
dropping it if the value does not fall within the range.
-preserveOrder
Preserve the order of the input matrix. By default, the delete
operation does not guarantee the order will be preserved.
function [dataset] [equations]
Compute new data as a function of some existing data. Each row in the
output is computed from the corresponding row of the input dataset. Each
equation, f1, f2, f3, ... will produce one column in the output data.
[dataset]
The filename of a dataset
[equations]
A set of equations to compute the output data. The equations must be
named f1, f2, f3, etc. The parameters to these equations may have any
name, but will correspond with the columns of the input data in order.
geodistance [dataset] [lat1] [lon1] [lat2] [lon2] <options>
For each row in [dataset], compute the distance (in kilometers) between
two points (specified in latitude and longitude) by following a great
circle on the surface of a perfectly spherical Earth, using the haversine
formula.
[dataset]
The filename of a dataset
[lat1]
The latitude of point 1 in degrees.
[lon1]
The longitude of point 1 in degrees.
[lat2]
The latitude of point 2 in degrees.
[lon2]
The longitude of point 2 in degrees.
<options>
-radius [r]
Specify the radius of the Earth (or the sphere upon which the
points occur). The results will have the same units as the radius
specified. The default is 6371.0, which is approximately the radius
of the Earth in kilometers.
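For reference, the haversine formula mentioned above can be written as a short Python function. This is a sketch of the standard formula, not the Waffles source; the name haversine is an assumption, and the default radius of 6371.0 is taken from the -radius option described above.

```python
from math import asin, cos, radians, sin, sqrt


def haversine(lat1, lon1, lat2, lon2, radius=6371.0):
    """Great-circle distance between two points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = radians(lon2 - lon1)
    a = sin(d_phi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(d_lambda / 2) ** 2
    # Clamp to 1.0 so rounding error cannot push asin out of its domain.
    return 2 * radius * asin(min(1.0, sqrt(a)))
```

The result has the same units as the radius, so passing radius=3958.8 (an approximate Earth radius in miles) would yield miles instead of kilometers.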
import [dataset] <options>
Convert a text file of comma separated (or otherwise separated) values to
a .arff file. The meta-data is automatically determined. The .arff file
is printed to stdout. This makes it easy to operate on structured data
from a spreadsheet, database, or pretty-much any other source.
[dataset]
The filename of a dataset.
<options>
-tab
Data elements are separated with a tab character instead of a
comma.
-space
Data elements are separated with a single space instead of a comma.
-whitespace
Data elements are separated with an arbitrary amount of whitespace.
-semicolon
Data elements are separated with semicolons.
-separator [char]
Data elements are separated with the specified character.
-columnnames
Use the first row of data for column names.
-maxvals [n]
Specify the maximum number of unique values in a categorical
attribute before parsing of that attribute will be aborted.
-time [attr] [format]
Specify that a particular attribute is a date or time stamp in a
particular format. Example format: "YYYY-MM-DD hh:mm:ssssss".
-nominal [attr]
Indicate that the specified attribute should be treated as
nominal.
-real [attr]
Indicate that the specified attribute should be treated as real.
enumeratevalues [dataset] [col]
Enumerates all of the unique values in the specified column, and replaces
each value with its enumeration. (For example, if you have a column that
contains the social-security-number of each user, this will change them
to numbers from 0 to n-1, where n is the number of unique users.)
[dataset]
The filename of a dataset
[col]
The column index (starting with 0) to enumerate
keeponlycolumns [dataset] [column-list]
Removes all unlisted columns from a dataset and prints the results to
stdout. (The input file is not modified.)
[dataset]
The filename of a dataset.
[column-list]
A comma-separated list of zero-indexed columns that will not be
dropped. A hyphen may be used to specify a range of columns. A '*'
preceding a value means to index from the right instead of the left.
For example, "0,2-5" refers to columns 0, 2, 3, 4, and 5. "*0" refers
to the last column. "0-*1" refers to all but the last column.
measuremeansquarederror [dataset1] [dataset2] <options>
Print the mean squared error between two datasets. (Both datasets must be
the same size.)
[dataset1]
The filename of a dataset
[dataset2]
The filename of a dataset
<options>
-fit
Use a hill-climber to find an affine transformation to make
dataset2 fit as closely as possible to dataset1. Report results
after each iteration.
-sum
Sum the mean-squared error over each attribute and only report this
sum. (The default is to report the mean-squared error in each
attribute.)
mergehoriz [dataset1] [dataset2]
Merge two (or more) datasets horizontally. All datasets must already have
the same number of rows. The resulting dataset will have all the columns
of both datasets.
[dataset1]
The filename of a dataset
[dataset2]
The filename of a dataset
mergevert [dataset1] [dataset2] <options>
Merge two datasets vertically. Both datasets must already have the same
number of columns. The resulting dataset will have all the rows of both
datasets.
[dataset1]
The filename of a dataset
[dataset2]
The filename of a dataset
<options>
-f
Force merge, even if attribute names do not match.
multiply [a] [b] <options>
Matrix multiply [a] x [b]. Both arguments are the filenames of .arff
files. Results are printed to stdout.
[a]
The filename of a dataset
[b]
The filename of a dataset
<options>
-transposea
Transpose [a] before multiplying.
-transposeb
Transpose [b] before multiplying.
multiplyscalar [dataset] [scalar]
Multiply all elements in [dataset] by the specified scalar. Results are
printed to stdout.
[dataset]
The filename of a dataset.
[scalar]
A scalar to multiply each element by.
normalize [dataset] <options>
Normalize all continuous attributes to fall within the specified range.
(Nominal columns are left unchanged.)
[dataset]
The filename of a dataset
<options>
-range [min] [max]
Specify the output min and max values. (The default is 0 1.)
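One plausible reading of this command is a linear min-max rescaling of each continuous column, sketched below in Python. The usage text does not state the exact formula, so treat this as an assumption; normalize_column is a hypothetical name, and mapping a constant column to the output minimum is a choice made for this sketch only.

```python
def normalize_column(values, out_min=0.0, out_max=1.0):
    """Linearly rescale values so that they span [out_min, out_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread; this sketch maps it to out_min.
        return [out_min for _ in values]
    scale = (out_max - out_min) / (hi - lo)
    return [out_min + (v - lo) * scale for v in values]
```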
normalizemagnitude [dataset]
Normalize the magnitude of each row-vector to 1.
[dataset]
The filename of a dataset
nominaltocat [dataset] <options>
Convert all nominal attributes in the data to vectors of real values by
representing them as a categorical distribution. Columns with only two
nominal values are converted to 0 or 1. If there are three or more
possible values, a column is created for each value. The column
corresponding to the value is set to 1, and the others are set to 0.
(This is similar to Weka's NominalToBinaryFilter.)
[dataset]
The filename of a dataset
<options>
-maxvalues [cap]
Specify the maximum number of nominal values for which to create
new columns. If not specified, the default is 12.
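The encoding rule described above can be sketched for one nominal column in Python. Several details are assumptions: nominal_to_cat is a hypothetical name, the list of possible values is passed in explicitly, the first listed value of a two-valued attribute is mapped to 0, and the -maxvalues cap is ignored.

```python
def nominal_to_cat(column, values):
    """Encode a nominal column as 0/1 vectors, one output row per input row."""
    if len(values) == 2:
        # Two possible values collapse into a single 0-or-1 column.
        return [[float(values.index(v))] for v in column]
    # Three or more values: one column per value, with 1 marking the match.
    return [[1.0 if v == candidate else 0.0 for candidate in values]
            for v in column]
```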
obfuscate [data]
Strips comments from an ARFF file, and replaces meta-data with generic
meaningless values, thus making it difficult to determine what the data
means. (You may also want to normalize the data to make the range of
continuous attributes meaningless.) Note that the values of the actual
data are not altered, so it may still be possible to derive meaning from
them.
overlay [base] [over]
Combines two same-sized matrices by placing [over] on top of [base], such
that elements from [base] are used only if the same element is missing in
[over].
[base]
The matrix of values to use when they are missing in the other one.
[over]
The matrix of values to use as long as they are not missing.
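The element-wise rule can be sketched in Python as follows. The function name overlay and the use of None to represent a missing element are assumptions made for illustration.

```python
def overlay(base, over):
    """Take each element from over, falling back to base where over is None."""
    return [
        [b if o is None else o for b, o in zip(base_row, over_row)]
        for base_row, over_row in zip(base, over)
    ]
```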
powercolumns [dataset] [column-list] [exponent]
Raises the values in the specified columns to some power (or exponent).
[dataset]
The filename of a dataset.
[column-list]
A comma-separated list of zero-indexed columns to transform. A hyphen
may be used to specify a range of columns. Example: 0,2-5,7
[exponent]
An exponent value, such as 0.5, 2, etc.
prettify [json-file]
Pretty-prints a JSON file.
pseudoinverse [dataset]
Compute the Moore-Penrose pseudo-inverse of the specified matrix of real
values.
reducedrowechelonform [dataset]
Convert a matrix to reduced row echelon form. Results are printed to
stdout.
rotate [dataset] [col_x] [col_y] [angle_degrees]
Rotate [angle_degrees] degrees around the origin in the col_x,col_y plane. Only
affects the values in col_x and col_y.
[dataset]
The filename of a dataset.
[col_x]
The zero-based index of an attribute to serve as the x coordinate in
the plane of rotation. Rotation from x to y will be 90 degrees. col_x
must be a real-valued attribute.
[col_y]
The zero-based index of an attribute to serve as the y coordinate in
the plane of rotation. Rotation from y to x will be 270 degrees.
col_y must be a real-valued attribute.
[angle_degrees]
The angle in degrees to rotate around the origin in the col_x,col_y
plane.
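The rotation can be sketched in Python with the usual 2D rotation matrix. The sign convention here is inferred from the statement that rotating from x to y is 90 degrees, so a point on the positive x axis moves onto the positive y axis after a 90 degree rotation. The function name rotate is hypothetical, not the Waffles implementation.

```python
from math import cos, radians, sin


def rotate(rows, col_x, col_y, angle_degrees):
    """Rotate the (col_x, col_y) pair of every row about the origin."""
    c, s = cos(radians(angle_degrees)), sin(radians(angle_degrees))
    result = []
    for row in rows:
        new_row = list(row)  # all other columns are copied unchanged
        x, y = row[col_x], row[col_y]
        new_row[col_x] = x * c - y * s
        new_row[col_y] = x * s + y * c
        result.append(new_row)
    return result
```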
reordercolumns [dataset] [column-list]
Reorder the columns as specified in the column list.
[dataset]
The filename of a dataset.
[column-list]
A comma-separated list of zero-indexed columns. A hyphen may be used to
specify a range of columns. A '*' preceding a value means to index
from the right instead of the left. For example, "0,2-5" refers to
columns 0, 2, 3, 4, and 5. "*0" refers to the last column. "0-*1"
refers to all but the last column.
samplerows [dataset] [portion] <options>
Randomly samples from the rows in the specified dataset and prints them
to stdout. This tool reads each row one-at-a-time, so it is well-suited
for reducing the size of datasets that are too big to fit into memory.
(Note that unlike most other tools, this one does not convert CSV to ARFF
format internally. If the input is CSV, the output will be CSV too.)
[dataset]
The filename of a dataset. ARFF, CSV, and a few other formats are
supported.
[portion]
A value between 0 and 1 that specifies the likelihood that each row
will be printed to stdout.
<options>
-seed [value]
Specify a seed for the random number generator.
samplerowsregularly [dataset] [freq]
Samples from the rows in the specified dataset at regular intervals and
prints them to stdout. This tool reads each row one-at-a-time, so it is
well-suited for reducing the size of datasets that are too big to fit
into memory. (Note that unlike most other tools, this one does not
convert CSV to ARFF format internally. If the input is CSV, the output
will be CSV too.)
[dataset]
The filename of a dataset. ARFF, CSV, and a few other formats are
supported.
[freq]
The number of rows read for each row printed.
scalecolumns [dataset] [column-list] [scalar]
Multiply the values in the specified columns by a scalar.
[dataset]
The filename of a dataset.
[column-list]
A comma-separated list of zero-indexed columns to transform. A hyphen
may be used to specify a range of columns. Example: 0,2-5,7
[scalar]
A scalar value.
shiftcolumns [dataset] [column-list] [offset]
Add [offset] to all of the values in the specified columns.
[dataset]
The filename of a dataset.
[column-list]
A comma-separated list of zero-indexed columns to transform. A hyphen
may be used to specify a range of columns. Example: 0,2-5,7
[offset]
A positive or negative value to add to the values in the specified
columns.
shuffle [dataset] <options>
Shuffle the row order.
[dataset]
The filename of a dataset
<options>
-seed [value]
Specify a seed for the random number generator.
significance [dataset] [attr1] [attr2] <options>
Compute statistical significance values for the two specified attributes.
[dataset]
The filename of a .arff file.
[attr1]
A zero-indexed column number.
[attr2]
A zero-indexed column number.
<options>
-tol [value]
Sets the tolerance value for the Wilcoxon Signed Ranks test. The
default value is 0.001.
sortcolumn [dataset] [col] <options>
Sort the rows in [dataset] such that the values in the specified column
are in ascending order and print the results to stdout. (The input
file is not modified.)
[dataset]
The filename of a dataset.
[col]
The zero-indexed column number by which to sort
<options>
-descending
Sort in descending order instead of ascending order.
split [dataset] [rows] [filename1] [filename2] <options>
Split a dataset into two datasets. (Nothing is printed to stdout.)
[dataset]
The filename of a dataset.
[rows]
The number of rows to go into the first file. The rest go in the
second file.
<options>
-seed [value]
Specify a seed for the random number generator.
-shuffle
Shuffle the input data before splitting it.
[filename1]
The filename for one half of the data.
[filename2]
The filename for the other half of the data.
splitclass [data] [attr] <options>
Splits a dataset by a class attribute, such that a separate file is
created for each unique class label. The generated filenames will be
"[data]_[value]", where [value] is the unique class label value.
[data]
The filename of a dataset.
[attr]
The zero-indexed column number of the class attribute.
<options>
-dropclass
Drop the class attribute after splitting the data. (The default is
to include the class attribute in each of the output datasets,
which is rather redundant since every row in the file will have the
same class label.)
splitfold [dataset] [i] [n] <options>
Divides a dataset into [n] parts of approximately equal size, then puts
part [i] into one file, and the other [n]-1 parts in another file. (This
tool may be useful, for example, to implement n-fold cross validation.)
[dataset]
The filename of a dataset.
[i]
The (zero-based) index of the fold, or the part to put into the
training set. [i] must be less than [n].
[n]
The number of folds.
<options>
-out [train_filename] [test_filename]
Specify the filenames for the training and test portions of the
data. The default values are train.arff and test.arff.
squareddistance [a] [b]
Computes the sum and mean squared distance between dataset [a] and [b].
([a] and [b] are each the names of files in .arff format. They must have
the same dimensions.)
[a]
The filename of a dataset.
[b]
The filename of a dataset.
swapcolumns [dataset] [col1] [col2]
Swap two columns in the specified dataset and print the results to
stdout. (Columns are zero-indexed.)
[dataset]
The filename of a dataset
[col1]
A zero-indexed column number.
[col2]
A zero-indexed column number.
transition [action-sequence] [state-sequence] <options>
Given a sequence of actions and a sequence of states (each in separate
datasets), this generates a single dataset to map from action-state pairs
to the next state. This would be useful for generating the data to train
a transition function.
<options>
-delta
Predict the delta of the state transition instead of the new state.
threshold [dataset] [column] [threshold]
Outputs a copy of dataset such that any value v in the given column
becomes 0 if v <= threshold and 1 otherwise. Only works on continuous
attributes.
[dataset]
The filename of a dataset.
[column]
The zero-indexed column number to threshold.
[threshold]
The threshold value.
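The rule above (0 if v <= threshold, 1 otherwise) can be sketched in Python. The name threshold_column is hypothetical; note that a value exactly equal to the threshold maps to 0.

```python
def threshold_column(rows, column, threshold):
    """Copy rows, mapping row[column] to 0 if v <= threshold and 1 otherwise."""
    result = []
    for row in rows:
        new_row = list(row)
        new_row[column] = 0 if row[column] <= threshold else 1
        result.append(new_row)
    return result
```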
transpose [dataset]
Transpose the data such that columns become rows and rows become columns.
uglify [json-file]
Prints a JSON file with whitespace removed.
unique [dataset] [col] <options>
Discard rows with redundant values in [col].
[dataset]
The dataset on which to operate.
[col]
The column in which to preserve only one of each unique value.
<options>
-last
Preserve the last row with a unique value in [col]. (The default is
to preserve the first row with a unique value in [col].)
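The difference between the default and -last can be sketched in Python. The name unique_rows is hypothetical, and the sketch assumes the surviving rows keep their input order, which the usage text does not state.

```python
def unique_rows(rows, col, keep_last=False):
    """Keep one row per distinct value in col (the first, or the last)."""
    chosen = {}  # maps each value in col to the index of the row to keep
    for i, row in enumerate(rows):
        if keep_last or row[col] not in chosen:
            chosen[row[col]] = i
    return [rows[i] for i in sorted(chosen.values())]
```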
zeromean [dataset]
Subtracts the mean from all values of all continuous attributes, so that
their means in the result are zero. Leaves nominal attributes untouched.
usage
Print usage information.