waffles_transform

A command-line tool for transforming datasets. It contains import/export functionality, unsupervised algorithms, and other useful transforms that you may wish to perform on a dataset. Here's the usage information:

Full Usage Information

[Square brackets] are used to indicate required arguments.
<Angled brackets> are used to indicate optional arguments.

waffles_transform [command]
  Transform data, shuffle rows, swap columns, matrix operations, etc.

add [dataset1] [dataset2]
  Adds two matrices together element-wise. Results are printed to stdout.
    [dataset1]  The filename of the first matrix.
    [dataset2]  The filename of the second matrix.

addindexcolumn [dataset] <options>
  Add a column that specifies the index of each row. This column will be inserted as column 0. (For example, suppose you would like to plot the values in each column of your data against the row index. Most plotting tools expect one of the columns to supply the position on the horizontal axis. This feature will create such a column for you.)
    [dataset]  The filename of a dataset.
    <options>
      -start [value]  Specify the initial index. (The default is 0.0.)
      -increment [value]  Specify the increment amount. (The default is 1.0.)
      -name [value]  Specify the name of the new attribute.

addcategorycolumn [dataset] [name] [value]
  Add a column with a constant categorical value. This column will be inserted as column 0.
    [dataset]  The filename of a dataset.
    [name]  The name of the new column or attribute.
    [value]  The name of the constant value to insert in every row.

addnoise [dataset] [dev] <options>
  Add Gaussian noise with the specified deviation to all the elements in the dataset. (Assumes that the values are all continuous.)
    [dataset]  The filename of a dataset.
    [dev]  The deviation of the Gaussian noise.
    <options>
      -seed [value]  Specify a seed for the random number generator.
      -excludelast [n]  Do not add noise to the last [n] columns.
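To make the addnoise description concrete, here is a minimal Python sketch of the same idea. It is not the Waffles implementation: the function name `add_noise` and its parameters are hypothetical, and it assumes the data is a list of rows of floats.

```python
import random

def add_noise(rows, dev, seed=None, exclude_last=0):
    """Return a copy of rows with Gaussian noise added to each element.

    Hypothetical helper that mirrors the description of addnoise:
    dev is the standard deviation, and the last exclude_last columns
    are copied through unchanged (like -excludelast).
    """
    rng = random.Random(seed)  # seeded generator, like -seed
    noisy = []
    for row in rows:
        cut = len(row) - exclude_last
        perturbed = [v + rng.gauss(0.0, dev) for v in row[:cut]]
        noisy.append(perturbed + list(row[cut:]))
    return noisy
```

Calling `add_noise(rows, 0.1, seed=42, exclude_last=1)` would perturb every column except the last, which is a common choice when the last column holds a label.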
aggregatecols [n]
  Make a matrix by aggregating each column [n] from the .arff files in the current directory. The resulting matrix is printed to stdout.

aggregaterows [n]
  Make a matrix by aggregating each row [n] from the .arff files in the current directory. The resulting matrix is printed to stdout.

align [a] [b]
  Translates and rotates dataset [b] to minimize mean squared difference with dataset [a]. (Uses the Kabsch algorithm.)
    [a]  The filename of a dataset.
    [b]  The filename of a dataset.

autocorrelation [dataset]
  Compute the autocorrelation of the specified time-series data.

cholesky [dataset]
  Compute the Cholesky decomposition of the specified matrix.

correlation [dataset] [attr1] [attr2] <options>
  Compute the linear correlation coefficient of the two specified attributes.
    [dataset]  The filename of a dataset.
    [attr1]  A zero-indexed attribute number.
    [attr2]  A zero-indexed attribute number.
    <options>
      -aboutorigin  Compute the correlation about the origin. (The default is to compute it about the mean.)

covariance [dataset]
  Compute the covariance matrix of the specified matrix.

colstats [dataset]
  Generates a 4-row table. Row 0 contains the min value of each column in [dataset]. Row 1 contains the max value of each column in [dataset]. Row 2 contains the mean value of each column in [dataset]. Row 3 contains the median value of each column in [dataset].
    [dataset]  The filename of a dataset.

cumulativecolumns [dataset] [column-list]
  Accumulates the values in the specified columns. For example, a column that contains the values 2,1,3,2 would be changed to 2,3,6,8. This might be useful for converting a histogram of some distribution into a histogram of the cumulative distribution.
    [dataset]  The filename of a dataset.
    [column-list]  A comma-separated list of zero-indexed columns to transform. A hyphen may be used to specify a range of columns. Example: 0,2-5,7

determinant [dataset]
  Compute the determinant of the specified matrix.
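The running-sum behavior described for cumulativecolumns can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the tool's code; `cumulative_columns` is a hypothetical name, and the data is assumed to be a list of rows.

```python
from itertools import accumulate

def cumulative_columns(rows, columns):
    """Return a copy of rows where each listed column holds its running sum."""
    out = [list(r) for r in rows]  # copy so the input is not modified
    for c in columns:
        running = accumulate(row[c] for row in rows)
        for target, total in zip(out, running):
            target[c] = total
    return out

# The example from the text: 2,1,3,2 becomes 2,3,6,8 (column 0 only).
data = [[2, 10], [1, 10], [3, 10], [2, 10]]
print(cumulative_columns(data, [0]))
```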
discretize [dataset] <options>
  Discretizes the continuous attributes in the specified dataset.
    [dataset]  The filename of a dataset.
    <options>
      -buckets [count]  Specify the number of buckets to use. If not specified, the default is to use the square root of the number of rows in the dataset.
      -colrange [first] [last]  Specify a range of columns. Only continuous columns in the specified range will be modified. (Columns are zero-indexed.)

dropcolumns [dataset] [column-list]
  Removes one or more columns from a dataset and prints the results to stdout. (The input file is not modified.)
    [dataset]  The filename of a dataset.
    [column-list]  A comma-separated list of zero-indexed columns to drop. A hyphen may be used to specify a range of columns. A '*' preceding a value means to index from the right instead of the left. For example, "0,2-5" refers to columns 0, 2, 3, 4, and 5. "*0" refers to the last column. "0-*1" refers to all but the last column.

drophomogcols [dataset]
  Remove all columns that are homogeneous (have zero variance).

dropiftooclose [dataset] [col] [gap]
  Drop each row if the value in the specified column is less than [gap] greater than that in the previous row.
    [dataset]  The filename of a dataset.
    [col]  The column to evaluate.
    [gap]  The minimum gap between sequential values.

droprows [dataset] [after-size]
  Removes all rows except for the first [after-size] rows.
    [dataset]  The filename of a dataset.
    [after-size]  The number of rows to keep.

dropmissingvalues [dataset]
  Remove all rows that contain missing values.

droprandomvalues [dataset] [portion] <options>
  Drop random values from the specified dataset. The resulting dataset with missing values is printed to stdout.
    [dataset]  The filename of a dataset.
    [portion]  The portion of the data to drop. For example, if [portion] is 0.1, then 10% of the values will be replaced with unknown values.
    <options>
      -seed [value]  Specify a seed for the random number generator.
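The [column-list] syntax described under dropcolumns (commas, hyphenated ranges, and a '*' prefix for indexing from the right) can be illustrated with a small parser. The sketch below is an assumption about how such a list might be interpreted, based only on the examples in the text; `parse_column_list` and `parse_index` are hypothetical names, not part of Waffles.

```python
def parse_index(token, n_cols):
    """Convert one token to a column index; '*k' counts from the right."""
    token = token.strip()
    if token.startswith("*"):
        return n_cols - 1 - int(token[1:])
    return int(token)

def parse_column_list(spec, n_cols):
    """Expand a list such as '0,2-5' or '0-*1' into explicit column indexes."""
    cols = []
    for part in spec.split(","):
        if "-" in part:
            first, last = part.split("-", 1)
            lo = parse_index(first, n_cols)
            hi = parse_index(last, n_cols)
            cols.extend(range(lo, hi + 1))  # ranges are inclusive
        else:
            cols.append(parse_index(part, n_cols))
    for c in cols:
        if not 0 <= c < n_cols:
            raise ValueError(f"column {c} is out of range for {n_cols} columns")
    return cols

# With 8 columns, "*0" is column 7 and "0-*1" is columns 0 through 6.
print(parse_column_list("0,2-5", 8))
print(parse_column_list("0-*1", 8))
```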
dropunusedvalues [dataset]
  Drops any nominal meta-data values that are not used.

export [dataset] <options>
  Print the data as a list of comma separated values without any meta-data.
    [dataset]  The filename of a dataset.
    <options>
      -tab  Separate with tabs instead of commas.
      -space  Separate with spaces instead of commas.
      -r  Use "NA" instead of "?" for missing values. (This is the format used by R.)
      -columnnames  Print column names on the first row. (The default is to not print column names.)
      -precision [val]  Specify how many digits of precision to use before truncating and resorting to scientific notation.

fillmissingvalues [dataset] <options>
  Replace all missing values in the dataset. (Note that the fillmissingvalues command in the waffles_recommend tool performs a similar task, but it can intelligently predict the missing values instead of just using the baseline value.)
    [dataset]  The filename of a dataset.
    <options>
      -seed [value]  Specify a seed for the random number generator.
      -random  Replace each missing value with a randomly chosen non-missing value from the same attribute. (The default is to use the baseline value. That is, the mean for continuous attributes, and the most-common value for nominal attributes.)

filterelements [dataset] [attr] [min] [max] <options>
  Remove each element in the specified attribute that does not fall in a certain range.
    [dataset]  The filename of a dataset.
    [attr]  A zero-indexed column number.
    [min]  The minimum acceptable value.
    [max]  The maximum acceptable value.
    <options>
      -invert  Drop elements that fall within the range instead of elements that do not fall within the range.

filterrows [dataset] [attr] [min] [max] <options>
  Remove each row where the value of the specified attribute does not fall in a certain range. Rows with unknown values in the specified attribute will also be deleted.
    [dataset]  The filename of a dataset.
    [attr]  A zero-indexed column number.
    [min]  The minimum acceptable value.
    [max]  The maximum acceptable value.
    <options>
      -invert  Drop the row if the value falls within the range instead of dropping it if the value does not fall within the range.
      -preserveOrder  Preserve the order of the input matrix. By default, the delete operation does not guarantee the order will be preserved.

function [dataset] [equations]
  Compute new data as a function of some existing data. Each row in the output is computed from the corresponding row of the input dataset. Each equation, f1, f2, f3, ... will produce one column in the output data.
    [dataset]  The filename of a dataset.
    [equations]  A set of equations to compute the output data. The equations must be named f1, f2, f3, etc. The parameters to these equations may have any name, but will correspond with the columns of the input data in order.

geodistance [dataset] [lat1] [lon1] [lat2] [lon2] <options>
  For each row in [dataset], compute the distance (in kilometers) between two points (specified in latitude and longitude) by following a great circle on the surface of a perfectly spherical Earth, using the haversine formula.
    [dataset]  The filename of a dataset.
    [lat1]  The latitude of point 1 in degrees.
    [lon1]  The longitude of point 1 in degrees.
    [lat2]  The latitude of point 2 in degrees.
    [lon2]  The longitude of point 2 in degrees.
    <options>
      -radius [r]  Specify the radius of the Earth (or the sphere upon which the points occur). The results will have the same units as the radius specified. The default is 6371.0, which is approximately the radius of the Earth in kilometers.

import [dataset] <options>
  Convert a text file of comma separated (or otherwise separated) values to a .arff file. The meta-data is automatically determined. The .arff file is printed to stdout. This makes it easy to operate on structured data from a spreadsheet, database, or pretty much any other source.
    [dataset]  The filename of a dataset.
    <options>
      -tab  Data elements are separated with a tab character instead of a comma.
      -space  Data elements are separated with a single space instead of a comma.
      -whitespace  Data elements are separated with an arbitrary amount of whitespace.
      -semicolon  Data elements are separated with semicolons.
      -separator [char]  Data elements are separated with the specified character.
      -columnnames  Use the first row of data for column names.
      -maxvals [n]  Specify the maximum number of unique values in a categorical attribute before parsing of that attribute will be aborted.
      -time [attr] [format]  Specify that a particular attribute is a date or time stamp in a particular format. Example format: "YYYY-MM-DD hh:mm:ssssss".
      -nominal [attr]  Indicate that the specified attribute should be treated as nominal.
      -real [attr]  Indicate that the specified attribute should be treated as real.

enumeratevalues [dataset] [col]
  Enumerates all of the unique values in the specified column, and replaces each value with its enumeration. (For example, if you have a column that contains the social-security-number of each user, this will change them to numbers from 0 to n-1, where n is the number of unique users.)
    [dataset]  The filename of a dataset.
    [col]  The column index (starting with 0) to enumerate.

keeponlycolumns [dataset] [column-list]
  Removes all unlisted columns from a dataset and prints the results to stdout. (The input file is not modified.)
    [dataset]  The filename of a dataset.
    [column-list]  A comma-separated list of zero-indexed columns to keep. A hyphen may be used to specify a range of columns. A '*' preceding a value means to index from the right instead of the left. For example, "0,2-5" refers to columns 0, 2, 3, 4, and 5. "*0" refers to the last column. "0-*1" refers to all but the last column.

measuremeansquarederror [dataset1] [dataset2] <options>
  Print the mean squared error between two datasets. (Both datasets must be the same size.)
    [dataset1]  The filename of a dataset.
    [dataset2]  The filename of a dataset.
    <options>
      -fit  Use a hill-climber to find an affine transformation to make dataset2 fit as closely as possible to dataset1. Report results after each iteration.
      -sum  Sum the mean-squared error over each attribute and only report this sum. (The default is to report the mean-squared error in each attribute.)

mergehoriz [dataset1] [dataset2]
  Merge two (or more) datasets horizontally. All datasets must already have the same number of rows. The resulting dataset will have all the columns of both datasets.
    [dataset1]  The filename of a dataset.
    [dataset2]  The filename of a dataset.

mergevert [dataset1] [dataset2] <options>
  Merge two datasets vertically. Both datasets must already have the same number of columns. The resulting dataset will have all the rows of both datasets.
    [dataset1]  The filename of a dataset.
    [dataset2]  The filename of a dataset.
    <options>
      -f  Force merge, even if attribute names do not match.

multiply [a] [b] <options>
  Matrix multiply [a] x [b]. Both arguments are the filenames of .arff files. Results are printed to stdout.
    [a]  The filename of a dataset.
    [b]  The filename of a dataset.
    <options>
      -transposea  Transpose [a] before multiplying.
      -transposeb  Transpose [b] before multiplying.

multiplyscalar [dataset] [scalar]
  Multiply all elements in [dataset] by the specified scalar. Results are printed to stdout.
    [dataset]  The filename of a dataset.
    [scalar]  A scalar to multiply each element by.

normalize [dataset] <options>
  Normalize all continuous attributes to fall within the specified range. (Nominal columns are left unchanged.)
    [dataset]  The filename of a dataset.
    <options>
      -range [min] [max]  Specify the output min and max values. (The default is 0 1.)

normalizemagnitude [dataset]
  Normalize the magnitude of each row-vector to 1.
    [dataset]  The filename of a dataset.

nominaltocat [dataset] <options>
  Convert all nominal attributes in the data to vectors of real values by representing them as a categorical distribution. Columns with only two nominal values are converted to 0 or 1. If there are three or more possible values, a column is created for each value. The column corresponding to the value is set to 1, and the others are set to 0. (This is similar to Weka's NominalToBinaryFilter.)
    [dataset]  The filename of a dataset.
    <options>
      -maxvalues [cap]  Specify the maximum number of nominal values for which to create new columns. If not specified, the default is 12.

obfuscate [data]
  Strips comments from an ARFF file, and replaces meta-data with generic meaningless values, thus making it difficult to determine what the data means. (You may also want to normalize the data to make the range of continuous attributes meaningless.) Note that the values of the actual data are not altered, so it may still be possible to derive meaning from them.

overlay [base] [over]
  Combines two same-sized matrices by placing [over] on top of [base], such that elements from [base] are used only if the same element is missing in [over].
    [base]  The matrix of values to use when they are missing in the other one.
    [over]  The matrix of values to use as long as they are not missing.

powercolumns [dataset] [column-list] [exponent]
  Raises the values in the specified columns to some power (or exponent).
    [dataset]  The filename of a dataset.
    [column-list]  A comma-separated list of zero-indexed columns to transform. A hyphen may be used to specify a range of columns. Example: 0,2-5,7
    [exponent]  An exponent value, such as 0.5, 2, etc.

prettify [json-file]
  Pretty-prints a JSON file.

pseudoinverse [dataset]
  Compute the Moore-Penrose pseudo-inverse of the specified matrix of real values.

reducedrowechelonform [dataset]
  Convert a matrix to reduced row echelon form. Results are printed to stdout.
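As an illustration of the encoding that nominaltocat describes, the following Python sketch encodes a single nominal value. It is a simplified reading of the description, not the tool's code: `nominal_to_cat` is a hypothetical name, the mapping of the first declared value to 0 in the two-value case is an assumption, and the -maxvalues cap is not modeled.

```python
def nominal_to_cat(value, values):
    """Encode one nominal value as a list of reals.

    values is the attribute's list of possible nominal values.
    Assumption: with exactly two values, the first maps to 0 and
    the second to 1. With three or more, one column is produced
    per value, set to 1 for the matching value and 0 otherwise.
    """
    idx = values.index(value)
    if len(values) == 2:
        return [float(idx)]
    return [1.0 if i == idx else 0.0 for i in range(len(values))]

print(nominal_to_cat("no", ["yes", "no"]))
print(nominal_to_cat("green", ["red", "green", "blue"]))
```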
rotate [dataset] [col_x] [col_y] [angle_degrees]
  Rotate [angle_degrees] degrees around the origin in the col_x,col_y plane. Only affects the values in col_x and col_y.
    [dataset]  The filename of a dataset.
    [col_x]  The zero-based index of an attribute to serve as the x coordinate in the plane of rotation. Rotation from x to y will be 90 degrees. col_x must be a real-valued attribute.
    [col_y]  The zero-based index of an attribute to serve as the y coordinate in the plane of rotation. Rotation from y to x will be 270 degrees. col_y must be a real-valued attribute.
    [angle_degrees]  The angle in degrees to rotate around the origin in the col_x,col_y plane.

reordercolumns [dataset] [column-list]
  Reorder the columns as specified in the column list.
    [dataset]  The filename of a dataset.
    [column-list]  A comma-separated list of zero-indexed columns. A hyphen may be used to specify a range of columns. A '*' preceding a value means to index from the right instead of the left. For example, "0,2-5" refers to columns 0, 2, 3, 4, and 5. "*0" refers to the last column. "0-*1" refers to all but the last column.

samplerows [dataset] [portion] <options>
  Randomly samples from the rows in the specified dataset and prints them to stdout. This tool reads each row one-at-a-time, so it is well-suited for reducing the size of datasets that are too big to fit into memory. (Note that unlike most other tools, this one does not convert CSV to ARFF format internally. If the input is CSV, the output will be CSV too.)
    [dataset]  The filename of a dataset. ARFF, CSV, and a few other formats are supported.
    [portion]  A value between 0 and 1 that specifies the likelihood that each row will be printed to stdout.
    <options>
      -seed [value]  Specify a seed for the random number generator.

samplerowsregularly [dataset] [freq]
  Samples from the rows in the specified dataset at regular intervals and prints them to stdout.
  This tool reads each row one-at-a-time, so it is well-suited for reducing the size of datasets that are too big to fit into memory. (Note that unlike most other tools, this one does not convert CSV to ARFF format internally. If the input is CSV, the output will be CSV too.)
    [dataset]  The filename of a dataset. ARFF, CSV, and a few other formats are supported.
    [freq]  The number of rows read for each row printed.

scalecolumns [dataset] [column-list] [scalar]
  Multiply the values in the specified columns by a scalar.
    [dataset]  The filename of a dataset.
    [column-list]  A comma-separated list of zero-indexed columns to transform. A hyphen may be used to specify a range of columns. Example: 0,2-5,7
    [scalar]  A scalar value.

shiftcolumns [dataset] [column-list] [offset]
  Add [offset] to all of the values in the specified columns.
    [dataset]  The filename of a dataset.
    [column-list]  A comma-separated list of zero-indexed columns to transform. A hyphen may be used to specify a range of columns. Example: 0,2-5,7
    [offset]  A positive or negative value to add to the values in the specified columns.

shuffle [dataset] <options>
  Shuffle the row order.
    [dataset]  The filename of a dataset.
    <options>
      -seed [value]  Specify a seed for the random number generator.

significance [dataset] [attr1] [attr2] <options>
  Compute statistical significance values for the two specified attributes.
    [dataset]  The filename of a .arff file.
    [attr1]  A zero-indexed column number.
    [attr2]  A zero-indexed column number.
    <options>
      -tol [value]  Sets the tolerance value for the Wilcoxon Signed Ranks test. The default value is 0.001.

sortcolumn [dataset] [col] <options>
  Sort the rows in [dataset] such that the values in the specified column are in ascending order and print the results to stdout. (The input file is not modified.)
    [dataset]  The filename of a dataset.
    [col]  The zero-indexed column number by which to sort.
    <options>
      -descending  Sort in descending order instead of ascending order.
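The streaming behavior described for samplerows and samplerowsregularly (reading one row at a time so the whole dataset never has to fit in memory) can be sketched with Python generators. This is an illustration under stated assumptions, not the Waffles code: both function names are hypothetical, and which row of each group of [freq] rows gets printed is not stated in the text, so the sketch keeps the first one.

```python
import random

def sample_rows(rows, portion, seed=None):
    """Yield each row independently with probability portion."""
    rng = random.Random(seed)
    for row in rows:
        if rng.random() < portion:
            yield row

def sample_rows_regularly(rows, freq):
    """Yield one row out of every freq rows read.

    Assumption: the first row of each group of freq rows is kept.
    """
    for i, row in enumerate(rows):
        if i % freq == 0:
            yield row

# Any iterable works, including an open file read line by line.
print(list(sample_rows_regularly(range(10), 3)))
```

Because both functions are generators, they hold only one row at a time, which is the property that makes this approach suitable for very large files.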
split [dataset] [rows] [filename1] [filename2] <options>
  Split a dataset into two datasets. (Nothing is printed to stdout.)
    [dataset]  The filename of a dataset.
    [rows]  The number of rows to go into the first file. The rest go in the second file.
    <options>
      -seed [value]  Specify a seed for the random number generator.
      -shuffle  Shuffle the input data before splitting it.
    [filename1]  The filename for one half of the data.
    [filename2]  The filename for the other half of the data.

splitclass [data] [attr] <options>
  Splits a dataset by a class attribute, such that a separate file is created for each unique class label. The generated filenames will be "[data]_[value]", where [value] is the unique class label value.
    [data]  The filename of a dataset.
    [attr]  The zero-indexed column number of the class attribute.
    <options>
      -dropclass  Drop the class attribute after splitting the data. (The default is to include the class attribute in each of the output datasets, which is rather redundant since every row in the file will have the same class label.)

splitfold [dataset] [i] [n] <options>
  Divides a dataset into [n] parts of approximately equal size, then puts part [i] into one file, and the other [n]-1 parts in another file. (This tool may be useful, for example, to implement n-fold cross validation.)
    [dataset]  The filename of a dataset.
    [i]  The (zero-based) index of the fold, or the part to put into the training set. [i] must be less than [n].
    [n]  The number of folds.
    <options>
      -out [train_filename] [test_filename]  Specify the filenames for the training and test portions of the data. The default values are train.arff and test.arff.

squareddistance [a] [b]
  Computes the sum and mean squared distance between dataset [a] and [b]. ([a] and [b] are each the names of files in .arff format. They must have the same dimensions.)
    [a]  The filename of a dataset.
    [b]  The filename of a dataset.
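The partitioning that splitfold describes can be sketched as follows. This is only one plausible reading: the text does not say whether folds are contiguous or interleaved, so the sketch assumes contiguous folds, and `split_fold` is a hypothetical name rather than anything provided by Waffles.

```python
def split_fold(rows, i, n):
    """Split rows into (part i, all other parts) for fold i of n.

    Assumption: folds are contiguous blocks of approximately equal
    size, with boundaries at i*len/n (integer division).
    """
    if not 0 <= i < n:
        raise ValueError("[i] must be less than [n]")
    begin = i * len(rows) // n
    end = (i + 1) * len(rows) // n
    part = rows[begin:end]
    rest = rows[:begin] + rows[end:]
    return part, rest

# Looping i over range(n) visits every row exactly once as part i,
# which is the basis of n-fold cross validation.
part, rest = split_fold(list(range(10)), 1, 3)
print(part, rest)
```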
swapcolumns [dataset] [col1] [col2]
  Swaps two columns in the specified dataset and prints the results to stdout. (Columns are zero-indexed.)
    [dataset]  The filename of a dataset.
    [col1]  A zero-indexed column number.
    [col2]  A zero-indexed column number.

transition [action-sequence] [state-sequence] <options>
  Given a sequence of actions and a sequence of states (each in separate datasets), this generates a single dataset to map from action-state pairs to the next state. This would be useful for generating the data to train a transition function.
    <options>
      -delta  Predict the delta of the state transition instead of the new state.

threshold [dataset] [column] [threshold]
  Outputs a copy of dataset such that any value v in the given column becomes 0 if v <= threshold and 1 otherwise. Only works on continuous attributes.
    [dataset]  The filename of a dataset.
    [column]  The zero-indexed column number to threshold.
    [threshold]  The threshold value.

transpose [dataset]
  Transpose the data such that columns become rows and rows become columns.

uglify [json-file]
  Prints a JSON file with whitespace removed.

unique [dataset] [col] <options>
  Discard rows with redundant values in [col].
    [dataset]  The dataset on which to operate.
    [col]  The column in which to preserve only one of each unique value.
    <options>
      -last  Preserve the last row with a unique value in [col]. (The default is to preserve the first row with a unique value in [col].)

zeromean [dataset]
  Subtracts the mean from all values of all continuous attributes, so that their means in the result are zero. Leaves nominal attributes untouched.

usage
  Print usage information.
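To illustrate the difference between the default behavior of unique and its -last option, here is a short Python sketch. It is not taken from Waffles: `unique_rows` is a hypothetical name, and keeping the surviving rows in their original relative order is an assumption.

```python
def unique_rows(rows, col, keep_last=False):
    """Keep one row per distinct value in column col.

    By default the first row with each value is kept; with
    keep_last=True (like -last) the last such row is kept instead.
    Assumption: surviving rows stay in their original order.
    """
    source = reversed(rows) if keep_last else rows
    seen = set()
    kept = []
    for row in source:
        if row[col] not in seen:
            seen.add(row[col])
            kept.append(row)
    if keep_last:
        kept.reverse()  # restore the original ordering
    return kept

rows = [[1, "a"], [2, "b"], [1, "c"]]
print(unique_rows(rows, 0))
print(unique_rows(rows, 0, keep_last=True))
```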