
Classification tool manual

Introduction

Data about the whereabouts of birds can be gathered using a UvA-BiTS GPS tracker with accelerometer and analysed using this tool. Often, we are not directly interested in the forces measured by the accelerometer. What we actually want to know is whether the bird was sitting, walking, foraging or flying. Assigning a behavior to data measurements is called annotation or classification. Ideally, we want a system, or model, that automatically classifies unseen data recorded by the tracker. It is hard to design such a model by hand, but it can be learned using machine learning. Learning a model from data is called 'training'.

Supervised machine learning

This software uses a variant of machine learning called supervised machine learning. This means that a classifying model, or classifier, is trained on a train set: an already annotated dataset. The trained classifier can then be applied to unseen data, so that the unseen data is classified automatically. This method does, however, require a train set to train the classifier on.

A train set can be constructed by annotating a dataset by hand using the annotation tool. Make sure that at least 10, but preferably many more, examples exist in the train set for each class. For instance, if your train set only contains 2 examples of some exotic behavior, you will not be able to train a model that can reliably recognize this behavior.

After training, you would like to know how well your classifier performs. This can be done by classifying another annotated dataset and comparing the classes that were 'predicted' by the model with the actual classes. This is called 'testing'. Using the train set for this purpose will not work, because this is the data the model was trained on: it is too easy compared to the more complete population of behaviors you want your model to generalize to. Testing on the train set will always give you an overly optimistic and unrealistic impression of your model's performance!

To get a train set and a test set, a dataset that was annotated by hand can be split in two. This is often done by splitting the data randomly. If the test set and the train set are not completely independent, testing might give overly optimistic results. It may therefore be wise to take the test set and the train set from different trackers on different birds. This also depends on the population you want the model to generalize to.
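
To make this concrete, below is a minimal sketch in Python (not part of the tool itself) of splitting annotated segments by device id, so that no bird contributes data to both sets. The segment representation and the 'device_id' field are assumptions for illustration.

import random

def split_by_device(segments, test_fraction=0.25, seed=42):
    # Put whole devices (birds) in either the train or the test set,
    # so the two sets stay independent.
    device_ids = sorted({s['device_id'] for s in segments})
    random.Random(seed).shuffle(device_ids)
    n_test = max(1, int(len(device_ids) * test_fraction))
    test_devices = set(device_ids[:n_test])
    train = [s for s in segments if s['device_id'] not in test_devices]
    test = [s for s in segments if s['device_id'] in test_devices]
    return train, test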

Outline

The classification tool's functions are divided over several processes that can be run independently of each other. This means that you can run all processes in one go, or run the data splitting only once, and run the training and testing processes repeatedly later. However, the training process cannot be run if the dataset has not yet been split into a train, test and validation set. Similarly, the testing process cannot be run without a trained model, and so on. Each of these processes is described below. Which processes are executed can be set from the settings file using the lines below.

execute_dataset_splitting_process = true
execute_train_process = true
execute_test_process = true
execute_classification_process = true
execute_output_features_csv_process = true
                            

Examples

The classification tool comes with example jobs with example data. These examples are included to demonstrate how to set up your data so that all of the processes above can be executed. They can also help demonstrate most of the visualizations without the need to use your own data. However, the performance of the classification should not be considered representative. The reason for this is that the example data was annotated by a developer, a non-expert in the ecology domain. Besides many annotation errors, not enough segments were annotated to train a reliable model.

Data splitting

Splitting the data can be done by executing the data splitting process. This process loads annotated measurements from the source that is defined in the settings file. A single measurement in this context is the combination of a single x, y and z sample of the accelerometer, together with the speed as measured by the GPS.

Loading annotated accelerometer data from a Matlab file

As a source, a mat-file can be selected. This mat-file should be located in the data folder.

 annotated_measurement_source_paths = 210911_meeuw_alldata_reformatted.mat
                            

Multiple mat-files can be selected using the same setting. In that case, a comma-separated list of file names can be given. Make sure not to use line breaks within the list. Below is an example of loading 2 source files.

 annotated_measurement_source_paths = Anot6020_4343.mat, Anot6020_4352.mat
                            

The mat-files must have a certain structure for the classification tool to be able to load them. A mat-file with this structure is what the Matlab annotation tool saves as output.

outputStruct =

 nOfSamples: 60
   sampleID: [1x60 double]
       year: [1x60 double]
      month: [1x60 double]
        day: [1x60 double]
       hour: [1x60 double]
        min: [1x60 double]
        sec: [60x1 double]
       accX: {60x1 cell}
       accY: {60x1 cell}
       accZ: {60x1 cell}
       accP: []
       accT: []
       tags: {1x60 cell}
annotations: [33x6 double]
     gpsSpd: [60x1 double]
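
If you want to inspect such a file outside Matlab, the sketch below (Python with scipy; the variable name outputStruct inside the mat-file is an assumption) shows one way to read it. This is an illustration, not part of the tool.

from scipy.io import loadmat

# struct_as_record=False gives attribute-style access to struct fields
mat = loadmat('210911_meeuw_alldata_reformatted.mat',
              squeeze_me=True, struct_as_record=False)
s = mat['outputStruct']  # assumed variable name inside the mat-file

print(s.nOfSamples)      # 60
print(len(s.accX))       # one accelerometer block (cell) per sample
print(s.gpsSpd[:5])      # gps speed per sample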
                            

It is recommended that measurements containing NaN (not a number) values are filtered out. This can be done using the setting below.

remove_measurements_containing_nan = true
                            

Loading annotated GPS points from csv files

There is also a newer annotation tool that runs in a web browser. This tool differs from the old one in that gps points, together with complete accelerometer blocks, are annotated, whereas in the old Matlab tool each accelerometer data point could be annotated independently of its neighboring points. The new web browser annotation tool saves its output as a csv (comma separated values) file. This output contains only the annotations, not the data itself. These files can be loaded using the settings file.

gps_records_path = gull1gps.csv
gps_record_annotations_path = gull1annotations.csv
                            

The file with the actual data that goes with the annotations is also a csv file. Such a file contains the results of a SQL query like the one below.

SELECT device_info_serial, date_time, latitude, longitude, altitude, 
       pressure, temperature, satellites_used, gps_fixtime, positiondop, 
       h_accuracy, v_accuracy, x_speed, y_speed, z_speed, speed_accuracy, 
       location, userflag, speed_2d, speed_3d, direction, altitude_agl
  FROM gps.uva_tracking_data101 
 WHERE date_time > '2014-05-04 0:00:00'
   AND date_time < '2014-05-05 0:00:00'
   AND device_info_serial = 184
   AND altitude IS NOT NULL;
                            

The exported csv file must contain at least the columns selected in the SQL statement above, and have correct headers. You need to choose whether you want to load annotated gps records from a csv file, as described above, instead of accelerometer data from a mat-file. This has to be set in the settings file.

use_gps_records_instead_of_accelerometer_data = true
                            

It is common for GPS data to be incomplete. Often, columns like h_accuracy, v_accuracy or pressure are missing for a given GPS record. Machine learning algorithms often cannot handle missing values. Multiple solutions exist in the literature, such as discarding every row that is missing a value. This is not feasible for the GPS data, because too many rows would have to be discarded. In this software, a zero is used for every missing value when loading GPS records.
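
For example, if you preprocess such csv exports yourself, the equivalent of this behavior looks roughly like the Python sketch below (using pandas; the file name is just an example).

import pandas as pd

gps = pd.read_csv('gull1gps.csv')
gps = gps.fillna(0)  # mirror the tool: substitute 0 for every missing value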

Segmenting accelerometer measurements

Consecutive accelerometer measurements grouped together are called a segment. Classification of accelerometer data is done at the segment level, because a single accelerometer point often does not provide enough information to make a reliable classification. The number of consecutive accelerometer measurements in a single segment can be defined using the setting below.

measurement_segment_size = 20
                            

Segments can be either overlapping or non-overlapping. The number of overlapping measurements can be set.

accelerometer_overlap_size = 19
                            

Note that the step size between two segments cannot be set explicitly. It is defined implicitly as the segment size minus the overlap; with the settings above, the step size is 20 - 19 = 1.
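
As an illustration, the Python sketch below (not the tool's own code) generates the start index of every segment for a given segment size and overlap.

def segment_starts(n_measurements, segment_size=20, overlap=19):
    # The step size is implicit: segment size minus overlap
    # (20 - 19 = 1 with the settings shown above).
    step = segment_size - overlap
    return range(0, n_measurements - segment_size + 1, step)

# e.g. 60 consecutive measurements give 41 maximally overlapping segments
print(len(segment_starts(60)))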

Only homogeneous segments are created. This means that a segment will never contain measurements with different time stamps, device ids or labels. If there are too few consecutive homogeneous measurements to form a segment, these measurements are all discarded. If a device recorded many measurements at a given time, multiple segments can be created with the same device and time stamp. In some situations, these instances cannot be regarded as independent. To prevent multiple segments from being created from the data of a single device and time stamp, use the line below.

segments_must_have_unique_id_timestamp_combination = true
                            

Segmenting gps records

Like accelerometer measurements, gps points are grouped into segments before classification. To determine the number of gps records in a single segment, use the setting below.

gps_segment_size = 5
                            

When classifying gps records, we often want a classification for each individual record, even though classification is done at the segment level. This is achieved by assigning the class of each segment to the individual gps record in the center of that segment. By creating overlapping windows, every individual gps record can be given a classification. To allow segments to have this overlap, use the setting below.

gps_segments_may_overlap = true
                            

Use caution when using overlapping segments. The resulting segments are far from independent of one another. Randomly splitting overlapping segments over a train and test set would therefore be incorrect: doing so would give overly optimistic test results.
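
To make the center-assignment described above concrete, here is a minimal Python sketch (the record representation and the classify function are hypothetical stand-ins; the tool's own implementation may differ).

def classify_per_record(records, classify_segment, segment_size=5):
    # Assign each gps record the class of the segment centered on it.
    labels = [None] * len(records)
    half = segment_size // 2
    for start in range(len(records) - segment_size + 1):
        segment = records[start:start + segment_size]
        labels[start + half] = classify_segment(segment)
    return labels  # records near the edges keep label None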

Label schema

The labels, their descriptions, and the colors used for visualization are defined in a schema file. Note that the schema file should be located in the data folder of the software. The schema to use can be set from the settings file. The line below indicates that the file at location /data/schemaGull.txt is to be used.

label_schema_file_path = schemaGull.txt
                            

Below is a typical schema file, defining the details of the classes 1 through 9. Each line contains a class id, its description, and the red, green and blue components of its color.

1 stand 1.00 0.00 0.00
2 flap 0.75 0.00 0.75
3 soar 0.00 0.00 1.00
4 walk 0.00 0.50 0.00
5 sit 0.50 0.50 0.50
6 XflapL 1.00 0.80 1.00
7 float 1.00 1.00 0.00
8 XflapS 0.00 1.00 1.00
9 other 0.00 1.00 0.00
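
A schema file of this form is simple to parse; below is a Python sketch of reading it (an illustration of the format, not the tool's own loader).

def load_schema(path='schemaGull.txt'):
    # Each line: class id, description, then red, green, blue in [0, 1]
    schema = {}
    with open(path) as f:
        for line in f:
            class_id, name, r, g, b = line.split()
            schema[int(class_id)] = (name, float(r), float(g), float(b))
    return schema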
                            

Remapping the label schema

Several classes can be merged using a remapping schema. Such a schema can be defined in a file, and the remapping schema file name can be set in the settings file. As with the schema file, the schema remapping file should be located in the data folder.

label_ids_must_be_remapped = true
label_schema_remapping_path = schemaGullRemap.txt
                            

The schema remapping file is a space-delimited text file. The file below would cause the original labels 1 and 2 to be merged into a new label 1 with the description 'Feeding'. The original label 3 is given the new label 2 with description 'Walk', and so on.

1 Feeding 1,2
2 Walk 3
3 Stand 4
4 Preen 5
5 Fly 6,7,8
6 Other 9
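
The semantics of such a file can be illustrated with the following Python sketch (not the tool's own code), which builds a lookup table from old label ids to new ones.

def load_remapping(path='schemaGullRemap.txt'):
    # Each line: new id, description, comma separated list of old ids
    old_to_new = {}
    with open(path) as f:
        for line in f:
            new_id, _description, old_ids = line.split()
            for old_id in old_ids.split(','):
                old_to_new[int(old_id)] = int(new_id)
    return old_to_new

With the file above, load_remapping() maps the original labels 6, 7 and 8 all to the new label 5 ('Fly').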
                            

Splitting

The resulting segments are then divided over 3 sets: a train set, a test set, and a validation set. The train set is used for training the model, also known as the classifier. Training a model could, for instance, mean generating a decision tree. The test set is for testing the performance of the trained model. Trying out different training methods with various parameters yields several trained models that vary in performance on the test set. By picking the best performing model, the model is somewhat optimized to the test set. To objectively test the performance of the final version of a trained model, a validation set is used. The segments can be split over these data sets according to ratios defined in the settings file. The lines below define a 50%, 25%, 25% distribution.

dataset_split_train_ratio = 0.5
dataset_split_test_ratio = 0.25
dataset_split_validation_ratio = 0.25
                            

The lines below are equivalent.

dataset_split_train_ratio = 50
dataset_split_test_ratio = 25
dataset_split_validation_ratio = 25
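
The effect of these ratios can be sketched in a few lines of Python (an illustration, not the tool's implementation):

import random

def split_by_ratio(segments, train=0.5, test=0.25, validation=0.25, seed=42):
    # Shuffle, then cut the list at the points given by the ratios.
    segments = list(segments)
    random.Random(seed).shuffle(segments)
    n_train = int(len(segments) * train)
    n_test = int(len(segments) * test)
    return (segments[:n_train],
            segments[n_train:n_train + n_test],
            segments[n_train + n_test:])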
                            

Often, the data is not equally divided over the available classes. This can result in poor performance on classes for which less data is available. In some extreme cases, the training process can even choose to ignore a class completely. Therefore, a fixed number of segments per class can be defined. If this method is used, the defined number of segments is taken from the train set; the rest of the train set is discarded. The lines below will result in sampling 25 segments of each class, 1 through 5, to train on. Note that when using a schema remapping, the new, remapped labels are used in this setting.

train_on_fixed_class_numbers = false
train_instances_per_class = 1:25, 2:25, 3:25, 4:25, 5:25
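
A sketch of this sampling step in Python (the (label, data) pair representation is an assumption for illustration):

from collections import defaultdict

def sample_per_class(train_set, per_class=25):
    # Keep a fixed number of segments per class; discard the rest.
    by_class = defaultdict(list)
    for label, data in train_set:
        by_class[label].append((label, data))
    sampled = []
    for segments in by_class.values():
        sampled.extend(segments[:per_class])
    return sampled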
                            

After this, the train, test and validation sets are saved in json format in the data folder. The file names under which the data sets are saved are set in the settings file.

train_set_file_path = train_set.json
test_set_file_path = test_set.json
validation_set_file_path = validation_set.json
                            

Feature exploration

Before training, we want to decide which features of the data we want to build a model with. For some behaviors, the mean forces measured by the accelerometer might be informative, while for others periodic properties, such as fundamental frequencies in the data, are more informative. Often, it is hard to predict which features will be especially good at separating the classes in the data. To get an idea of which features are important for classifying your dataset over your classes, the features can be explored.

execute_output_features_csv_process = true
                            
This will load the train, test and validation sets and calculate the features of all the data. In the output in the job folder, your data can be explored using a scatter plot matrix. Each plot shows the complete data set with respect to two features. Features that separate one or more classes from the rest of the data tend to be useful for classification.

Data used in the scatter plot matrix is saved in csv-format in file 'featurescomplete.csv' (under output/data in the job folder). Besides exploring the data in the 'feature space' using this tool, the data and their features can be explored further in an external tool like Matlab or Excel.

Training

During the training process, the train set is loaded from the data folder. The name of the file from which train instances should be loaded is defined in the settings file.

train_set_file_path = train_set.json
                            

Feature extraction

For each of the segments in the train set, features are calculated. Based on these features, a model is trained. From a fixed set of features, a selection can be picked that is used for training. Note that the trained model can only classify instances that have the exact same features. In other words, a model trained on mean_x, mean_y, mean_z cannot be used to classify data based on std_x, std_y, std_z. Which features are going to be used can be set in the settings file. A complete list of standard features and some explanation can be found in the feature description document. The line below causes only the means and standard deviations of each dimension to be used as features.

extract_features = mean_x, mean_y, mean_z, std_x, std_y, std_z
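
As an illustration of what these six features amount to, the Python sketch below computes them for one segment (using numpy; this mirrors the feature definitions, not the tool's own code).

import numpy as np

def mean_std_features(segment):
    # segment: an (n, 3) array of x, y, z accelerometer values
    means = segment.mean(axis=0)  # mean_x, mean_y, mean_z
    stds = segment.std(axis=0)    # std_x, std_y, std_z
    return np.concatenate([means, stds])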
                            

Standard features

The classification tool comes with a set of features that can be calculated for each segment by selecting them as shown above. The complete list of these standard features is shown in the table below.

Feature name Description
mean_x The mean of the x value over the accelerometer points in the segment.
mean_y The mean of the y value over the accelerometer points in the segment.
mean_z The mean of the z value over the accelerometer points in the segment.
std_x The standard deviation of the x value over the accelerometer points in the segment.
std_y The standard deviation of the y value over the accelerometer points in the segment.
std_z The standard deviation of the z value over the accelerometer points in the segment.
mean_pitch The mean value of the pitch over the accelerometer points in the segment. The pitch is defined as atan2(x, sqrt(y^2 + z^2)) in degrees.
std_pitch The standard deviation of the pitch over the accelerometer points in the segment.
mean_roll The mean value of the roll over the accelerometer points in the segment. The roll is defined as atan2(y, sqrt(x^2 + z^2)) in degrees.
std_roll The standard deviation of the roll over the accelerometer points in the segment.
correlation_xy The Pearson’s correlation between the signal of x and the signal of y.
correlation_yz The Pearson’s correlation between the signal of y and the signal of z.
correlation_xz The Pearson’s correlation between the signal of x and the signal of z.
gps_speed The speed as measured by the GPS device.
meanabsder_x The mean of the absolute value of the derivative of x. The derivative is calculated by convolving the signal with a kernel of [-1,1].
meanabsder_y The mean of the absolute value of the derivative of y. The derivative is calculated by convolving the signal with a kernel of [-1,1].
meanabsder_z The mean of the absolute value of the derivative of z. The derivative is calculated by convolving the signal with a kernel of [-1,1].
noise_x Measure of the noise in the x signal. Noise is measured by convolving the signal with a kernel of [-0.5, 1, -0.5].
noise_y Measure of the noise in the y signal. Noise is measured by convolving the signal with a kernel of [-0.5, 1, -0.5].
noise_z Measure of the noise in the z signal. Noise is measured by convolving the signal with a kernel of [-0.5, 1, -0.5].
noise/absder_x Noise in signal of x divided by the mean of the absolute derivative of x. This is effectively the quotient between noise_x and meanabsder_x.
noise/absder_y Noise in signal of y divided by the mean of the absolute derivative of y. This is effectively the quotient between noise_y and meanabsder_y.
noise/absder_z Noise in signal of z divided by the mean of the absolute derivative of z. This is effectively the quotient between noise_z and meanabsder_z.
fundfreq_x The fundamental frequency of the x signal. It is defined as the frequency belonging to the highest peak in the frequency domain of the Fourier transformation of the signal. A Hamming window is used. The windowed signal is zero padded. The number of bins used can be configured.
fundfreq_y The fundamental frequency of the y signal.
fundfreq_z The fundamental frequency of the z signal.
odba Overall dynamic body acceleration. A measure that can be used as a proxy for energy expenditure.
vedba Vector of dynamic body acceleration. A measure that can be used as a proxy for energy expenditure.
fundfreqcorr_x Pearson correlation of signal x with a generated sine wave with equal mean, and the fundamental frequency of x as its frequency. The sine wave’s phase was shifted to maximize the correlation.
fundfreqcorr_y Pearson correlation of signal y with a generated sine wave with equal mean, and the fundamental frequency of y as its frequency. The sine wave’s phase was shifted to maximize the correlation.
fundfreqcorr_z Pearson correlation of signal z with a generated sine wave with equal mean, and the fundamental frequency of z as its frequency. The sine wave’s phase was shifted to maximize the correlation.
fundfreqmagnitude_x The magnitude of the highest peak in the frequency domain of the Fourier transformation of the x signal.
fundfreqmagnitude_y The magnitude of the highest peak in the frequency domain of the Fourier transformation of the y signal.
fundfreqmagnitude_z The magnitude of the highest peak in the frequency domain of the Fourier transformation of the z signal.
raw The raw input. The keyword raw will add all values of x, y and z to the features. This is a feature group rather than a single feature.
first_x The first (raw) value of the x signal.
first_y The first (raw) value of the y signal.
first_z The first (raw) value of the z signal.
measurement_classifier Each measurement is classified individually by a specific classifier. This is again a feature group rather than a single feature. A classifier for this feature can be set in the configuration file. The feature is a normalized histogram of the measurements that were put in each class. To train a classifier for this role, use the features first_x, first_y, first_z and gps_speed.
stepresponse The maximum response of the x signal (with its mean subtracted) to convolution with a kernel shaped like the smoothed average of several x signals of a vulture stepping. The resulting kernel: [-0.0667, 0.1463, 0.3886, 0.4430, 0.3763, 0.3213, 0.2795, 0.2016, 0.0878, -0.0424, -0.1720, -0.2821, -0.3319, -0.2668]
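
To make a few of these definitions concrete, below is a Python sketch of the pitch and roll features as defined in the table (using numpy; a hand-written illustration, not the tool's code).

import numpy as np

def pitch_roll_features(x, y, z):
    # x, y, z: 1-d arrays of the accelerometer values in one segment
    pitch = np.degrees(np.arctan2(x, np.sqrt(y**2 + z**2)))
    roll = np.degrees(np.arctan2(y, np.sqrt(x**2 + z**2)))
    return pitch.mean(), pitch.std(), roll.mean(), roll.std()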

Custom features

Besides the standard features, user-defined features can be added. This functionality allows users to define their own features without having to change the code of the software. Custom features can be defined in a text file. The location of the text file, in the data folder, can be set with the following line.

custom_feature_extractor_file_path = custom_features.txt
                            

The file containing the custom feature definitions is semicolon-separated. Below is an example of such a file in which 2 features are defined.

myfeature; x1*y1*z1 + x2*y2*z2 + x3*y3*z3
someotherfeature; x1+x2+x3+x4 - 4 * meanx
                            

The first column contains the name of the feature, and the second column contains the expression defining the feature. For interpreting the expression, the exp4j library is used. Note that the variable names x1, x2 etc. refer to the different elements of x of the segment. The variable names meanx and stdx refer to the mean and standard deviation of x. In addition, the speed as measured by the gps can be referred to by speed1. The complete list of variable names is below.

x1
x2
x3
...
y1
y2
y3
...
z1
z2
z3
...
meanx
stdx
meany
stdy
meanz
stdz
speed1
                            

Note that while defining new features, the number of elements has to be taken into account. Using z30, for example, on a segment of only 20 measurements will result in an error.
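
As a reading aid, a Python equivalent of the 'someotherfeature' expression above would be the following (exp4j itself is a Java library; this only illustrates the semantics):

def someotherfeature(x):
    # x1 + x2 + x3 + x4 - 4 * meanx, for a segment with x-values in x
    meanx = sum(x) / len(x)
    return x[0] + x[1] + x[2] + x[3] - 4 * meanx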

If you are working with gps records (use_gps_records_instead_of_accelerometer_data = true), a number of additional variable names are available. These variables follow the same pattern as above, i.e. lat1, lat2, ..., meanlat, stdlat for the base name lat, which refers to latitude. Here, lat2 refers to the latitude of the second gps point in a segment. The complete list of base names with a short explanation is shown below.

Base name Description
lat latitude
long longitude
alt altitude
pres pressure
temp temperature
sat satellites used for the gps fix
fixtime duration of the fix
speed2d speed parallel to the ground plane
speed3d speed
dir direction

Note that referring to lat5, for example, when there are only 4 gps records in a single segment (gps_segment_size = 4), will result in an error.

Once the custom features are defined and loaded as described above, they are added to the list of features available for selection. To use them, they need to be selected as shown below.

extract_features = myfeature, someotherfeature
                            

External feature values

The custom feature functionality has some limitations. More complicated features are often hard or even impossible to describe in a single-line equation. For this reason, the user is given the possibility to calculate feature values externally. It is possible, for instance, to load your data into Matlab, or any other tool, and calculate feature values there. These externally calculated feature values can be loaded by the classification tool as a csv file.

externally_calculated_feature_value_csv_path = externalvalues.csv

For each combination of device id and timestamp that exists in the data, the feature value must be defined in the csv file. This file is used as a lookup table mapping a pair of id and timestamp to a feature value. If such an entry is missing, the software will not be able to successfully finish execution. The first two columns of the csv file must contain the device ids and timestamps respectively; the following columns are interpreted as features. This means that multiple features can be loaded using a single csv file. Below is an example of such a file.

device_info_serial,date_time,my_own_feature1,my_second_feature
1011,2013-04-19 00:28:09,18.3329124961778,17.78
1011,2013-04-19 00:58:07,16.9489105376155,9.86
1011,2013-04-19 01:27:59,17.2219288606688,15.67
1011,2013-04-19 01:58:07,15.319438765775,14.18
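
For example, such a file could be produced with a few lines of Python (a sketch using pandas; the input file, column names and the feature itself are hypothetical):

import pandas as pd

# one feature value per (device id, timestamp) pair, in the expected layout
data = pd.read_csv('g1acc.csv', parse_dates=['date_time'])
feature = (data.groupby(['device_info_serial', 'date_time'])['x_cal']
               .agg(lambda x: x.abs().mean())
               .rename('my_own_feature1'))
feature.reset_index().to_csv('externalvalues.csv', index=False)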
                            

Note that loading externally calculated features adds those features to the list of features available for selection. To use them, they need to be selected as shown below.

extract_features = my_own_feature1, my_second_feature
                            

Classifier

A model is trained on the train set using a machine learning algorithm. Which algorithm should be used is set in the settings file. To use the C4.5 tree-learning algorithm, use the line below. Note that J48 is the name of a Java implementation of C4.5. The -R parameter turns on reduced-error pruning; this is a setting specific to the J48 algorithm. An explanation of the algorithm, its parameters and their default values can be found in WEKA's documentation, and in more detail in Ross Quinlan (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA.

machine_learning_algorithm_string = weka.classifiers.trees.J48 -R
                            

The line below causes the Random Forest algorithm to be used with 500 trees. For more information about this algorithm, its parameters and their defaults, see WEKA's documentation, and in more detail Leo Breiman (2001). Random Forests. Machine Learning. 45(1):5-32.

machine_learning_algorithm_string = weka.classifiers.trees.RandomForest -I 500
                            

There doesn't seem to be a complete list of all possible classifiers with their options. For exploring various classifiers, it is recommended to run the WEKA explorer and choose a classifier from its interface. This interface also provides the same information as the WEKA documentation. WEKA can even construct a machine_learning_algorithm_string that can be used in the classification tool. This is recommended when defining more complicated classifiers that include wrappers or ensembles, for instance.

After training, the resulting trained model, also known as classifier, is saved in the data folder under the file name that is defined in the settings file.

classifier_path = classifier.cls
                            

If the trained model is a decision tree, its structure is also saved in the job folder under "treegraph.json". By opening "treevisualization.html" in the same folder, the structure of the tree can be visualized. More complex tree structures can be inspected further by collapsing intermediate nodes.

Testing

Testing is done to evaluate the performance of a trained model. The trained model to evaluate can be set from the settings file. The file containing the trained model should be located in the data folder.

classifier_path = classifier.cls
                            

The file, within the data folder, containing the test set is also set from the settings file.

test_set_file_path = test_set.json
                            

For each of the segments in the test set, features are calculated. Note that the same features need to be used for testing as were used for training the model. For further explanation of features and how to set which ones are used, see the Training section.

After testing, a test report is generated, containing some statistics, including an error rate and a confusion matrix. This report is saved in the job folder under the file name "test_report.txt". Misclassified instances of the test set are saved in json format in the job folder under "misclassifications.json". By opening "misclassifications.html", visualizations of the misclassifications can be inspected together with their predicted and actual label and feature values.

Classification

During the classification process, unannotated data can be labeled using a trained model. The file from which the model is to be loaded can be set from the settings file. The file should be located in the data folder.

classifier_path = classifier.cls
                            

The classifications are saved to a csv file so they can be used for further analysis in other applications, like Matlab. The file can also be uploaded to the eEcology database using the GPS annotation service. If you choose 'acceleration classification' as the table template, the classification can be viewed in the annotation tool.

Both kinds of unannotated data, GPS or accelerometer, can be used for classification. How to load either one is described below.

Loading unannotated accelerometer data

Unannotated accelerometer data can be loaded from a mat-file or a csv-file and segmented the same way as in the data splitting process. Features are calculated for each segment the same way as described in the Training section. The file containing the unannotated measurements should be in the data folder. Its file name can be set from the settings file.

unannotated_measurement_source_paths = g1acc.csv, febomeeuw.mat
                            

Loading unannotated accelerometer data from csv

The format of an unannotated accelerometer data file must be similar to the example below.

"device_info_serial","date_time","speed","longitude","latitude","altitude","tspeed","index","x_cal","y_cal","z_cal"
1,"2010-06-30 12:00:23",13.0578709628735,4.5754068,52.8965984,-5,3.63009687805689,0,-0.28836729463202664293,-0.08380981041227806199,-0.17589082638362395754
1,"2010-06-30 12:00:23",13.0578709628735,4.5754068,52.8965984,-5,3.63009687805689,1,-0.02527727650897015128,0.33749623833885043635,-0.03032600454890068234
1,"2010-06-30 12:00:23",13.0578709628735,4.5754068,52.8965984,-5,3.63009687805689,2,0.20686097477607969429,0.15693650315979536563,0.31084154662623199393
                            

Such data can be pulled from the database by running a SQL query like the one below.

    SELECT s.device_info_serial, 
           s.date_time, 
           s.speed_2d as speed, 
           s.longitude, 
           s.latitude, 
           s.altitude, 
           t.speed as tspeed, 
           a.index,
           (a.x_acceleration-d.x_o)/d.x_s as x_cal, 
           (a.y_acceleration-d.y_o)/d.y_s as y_cal, 
           (a.z_acceleration-d.z_o)/d.z_s as z_cal 
      FROM gps.ee_tracking_speed_limited s 
      JOIN gps.ee_acceleration_limited a   
            ON (s.device_info_serial = a.device_info_serial AND s.date_time = a.date_time) 
      JOIN gps.ee_tracker_limited d 
            ON a.device_info_serial = d.device_info_serial 
      JOIN gps.get_uvagps_track_speed ('1', '2010-06-30 00:00:00', '2010-07-01 00:00:00') t 
            ON s.device_info_serial = t.device_info_serial and s.date_time = t.date_time 
     WHERE s.device_info_serial = '1'
       AND s.date_time >'2010-06-30 12:00:00'
       AND s.date_time < '2010-07-01 12:10:00'
       AND s.latitude is not null and s.userflag <> 1 
  ORDER BY s.date_time, a.index;
                            

Loading unannotated accelerometer data from Matlab file

Unannotated accelerometer data can also be loaded from a mat-file. The file should contain a struct similar to the one below.

outputStruct = 

 nOfSamples: 60
   sampleID: [1x60 double]
       year: [1x60 double]
      month: [1x60 double]
        day: [1x60 double]
       hour: [1x60 double]
        min: [1x60 double]
        sec: [60x1 double]
       accX: {60x1 cell}
       accY: {60x1 cell}
       accZ: {60x1 cell}
       accP: []
       accT: []
       tags: {1x60 cell}
annotations: [33x6 double]
     gpsSpd: [60x1 double]
                            

After classifying, all segments have been assigned a label by the model. The results are saved in a csv file called "classifications.csv" in the job folder. Each row in this file describes a single segment and its classification. The columns in this file are described below.

Column Description
device_info_serial The device id
date_time The time at which the device started recording accelerometer data
first_index The index of the first measurement of the segment
class_id The id of the assigned class
class_name The name of the assigned class
class_red The amount of red in the associated color of the class (for visualisations only)
class_green The amount of green in the associated color of the class (for visualisations only)
class_blue The amount of blue in the associated color of the class (for visualisations only)
longitude The longitude measured by the GPS when the accelerometer started recording
latitude The latitude measured by the GPS when the accelerometer started recording
altitude The altitude measured by the GPS when the accelerometer started recording
gpsspeed The speed measured directly by the GPS when the accelerometer started recording

Below is an example of an output of the classification process.

device_info_serial,date_time,first_index,class_id,class_name,class_red,class_green,class_blue,longitude,latitude,altitude,gpsspeed
538,2011-06-08T08:05:31.000Z,0,3,soar,0.0,0.0,1.0,4.4429206,52.7374668,-4.0,0.8106154424176607
538,2011-06-08T08:15:07.000Z,0,7,float,1.0,1.0,0.0,4.4449913,52.7411828,-2.0,0.6645039573775211
538,2011-06-08T08:19:55.000Z,0,7,float,1.0,1.0,0.0,4.4460503,52.7427955,-5.0,0.6157412031979498
                            

Often, we are interested in a label for each device-timestamp combination, as opposed to one for each segment. Such a device-timestamp combination can have multiple segments with possibly different labels. To cope with this, the label of the first segment with a unique device id - timestamp combination is used for the complete device-timestamp combination. If no segments exist for an id-timestamp combination, no label is given.

Loading unannotated gps data

Unannotated GPS data can be loaded to be classified with a model trained on GPS data. When loading unannotated GPS data, the same setting is used as for loading annotated data.

gps_records_path = gull1gps.csv
                            

See the section about loading annotated GPS points, in the section about data splitting, for the format of the csv file. Because the same setting is used to load either unannotated or annotated GPS data, it is not possible to train a classifier on GPS data and classify other GPS data in a single job. Doing both in a single job would be unnatural anyway, because one normally needs to test and evaluate a trained classifier before applying it to unseen data.

Settings

In the sections above, all settings in the settings file are discussed. Because only a selection of the above processes can be selected, having to set all possible settings is sometimes counter-intuitive. When only the data splitting process is selected, it doesn't make sense to have to set a classifier_path, for instance, as this setting is obviously not needed to run that particular process.

It was chosen, however, to check that every possible setting has been set before executing. This ensures that no process has to be stopped halfway because of a missing setting.

Output files

After the Classification tool has run, the output can be found in the output subfolder in the job folder.

Purpose of each output file

The output is saved as a small website in the output folder. This results in a file/folder structure that might look complicated. The table below describes the purpose of each file.

File Location Purpose
treegraph.json <job>/output/data This file contains a representation of a learned decision tree. It is read when opening the tree visualization in the web browser, as it contains everything that needs to be visualized there.
confusionMatrix.txt <job>/output/data This text file contains the confusion matrix and is saved for use in external applications only. It is never read by the Classification tool.
classifications.csv <job>/output/data This file is the output of the classification process. It contains the classification of the unseen data. This file can be uploaded to the annotation tool for visualization or loaded in an external tool.
teststatistics.json <job>/output/data This file contains all the data shown when opening the test results in the web browser. It contains error rates, the confusion matrix and other statistics.
misclassifications.json <job>/output/data This file contains all the data shown when opening the visualization of the misclassifications in the web browser.
schema.json <job>/output/data This file contains the schema (classes, class labels, associated colors) in json format. It is loaded by several visualizations in the web browser.
featurescomplete.csv <job>/output/data This file contains the features of all the data in the test, train and validation sets. It is loaded when opening the scatter plot matrix visualization in the web browser. It can also be loaded with external tools.
*.css <job>/output/css Style sheets used by the visualizations in the web browser.
*.js <job>/output/js Code that runs in the web browser for loading and running the visualizations.
*.html <job>/output The pages containing the visualizations of the output.

Sharing results

To share results with someone, you could share the complete output folder. This shares all visualizations at once. A downside is that you're automatically sharing large parts of the original data as well, which may not be what you want. Another downside is that, because this data is included, the files you share can be quite large.

There is an alternative method of sharing results. Go to any of the visualizations in the browser that you want to share and press 'save' (ctrl + s). This lets you save the result as an html file, called 'myresults.html' for example. Note that the used stylesheets are also saved in a folder called 'myresults_files' in the same folder. Both the html file and the folder containing the stylesheets need to be shared in order to view the results correctly. Note that, using this method, only the data visible at the time of saving the page will be included. For example, collapsed branches of the tree visualization, filtered-out misclassifications, or unselected features in the scatter plot matrix will not be saved.