Back to the table of contents

Previous      Next

Representing your data

There are two main data structures that you need to know about: GVec and GMatrix. The GVec class stores a vector of values. These values can be continuous (real-valued) or nominal (categorical).

The GMatrix class is a two-dimensional table. It can be thought of as an array of GVec objects.

To get started, you probably want to include GMatrix.h.

	#include <GClasses/GMatrix.h>

Then, you can load a GMatrix from ARFF format as follows:

	GMatrix M;
	M.loadArff("mydata.arff");

You can also load from a text file of comma-separated values (a CSV file):

	GCSVParser p;
	p.parse(M, "mydata.csv");

or tab-separated values:

	GCSVParser p;
	p.setSeparator('\t');
	p.parse(M, "mydata.tsv");

(The GCSVParser class also provides functionality to specify the format for date-stamp columns, to specify the type of ambiguous columns, to generate a report of parsing errors, etc. For more details, see the GCSVParser class in the GClasses API docs.)

You can construct an uninitialized GMatrix of continuous values by specifying the number of rows and columns:

	GMatrix M(5, 7); // 5 rows, 7 columns

You can use array notation to access a row of the matrix:

	GVec& row = M[3];

This will set element 3,5 to 3.14159, and element 0,2 to 3.34159:

	M[3][5] = 3.14159;
	M[0][2] = M[3][5] + 0.2;

You can also manually construct a matrix where some columns (or attributes) have nominal (categorical) values. This is done using a vector that specifies how many values are supported in each column. (Use 0 for continuous values.) This example will create a matrix with two rows and three columns. The first column is continuous. The second column is binary, and the third column supports three class values:

	vector<size_t> vals;
	vals.push_back(0);
	vals.push_back(2);
	vals.push_back(3);
	GMatrix mydata(vals);
	GVec& row0 = mydata.newRow();
	GVec& row1 = mydata.newRow();

Nominal elements contain an integer value (stored as a double) that specifies the index of its class or category. For example, a nominal attribute with 4 possible class values supports the values 0, 1, 2, and 3. (These values are stored internally as doubles, but most operations will cast them to integers, thus dropping the decimal, when working with nominal values.)

Now, to tie it all together, here is a simple example program that will load some data, mess with it a little bit, and then print the results to stdout.

	#include <GClasses/GMatrix.h>
	#include <GClasses/GRand.h>
	#include <iostream>

	using namespace GClasses;
	using std::cout;

	int main(int argc, char *argv[])
	{
		GMatrix M; // make a matrix
		M.loadArff("mydata.arff"); // load some data
		GRand rand(0); // construct a random number generator with seed 0
		M.shuffle(&rand); // shuffle the row order
		M.sort(2); // sort rows by column 2 in ascending order
		M.reverseRows(); // reverse the row order
		M.swapColumns(2, 4); // swap columns 2 and 4
		M[0][0] = M.median(3); // set element 0,0 to the median of column 3
		M[1][0] = M.entropy(5); // compute the entropy in column 5
		M.deleteRow(7); // delete a row
		M.print(cout); // print M to stdout
		return 0;
	}

Previous      Next

Back to the table of contents