TSCC contains constrained clustering implementations that use the DTW dissimilarity measure (although Euclidean is still an option) and, where necessary, DBA averaging. Please refer to the following publication for further details.

## Reference

This toolbox accompanies the paper: T. Lampert, T.-B.-H. Dao, B. Lafabregue, N. Serrette, G. Forestier, B. Crémilleux, C. Vrain, P. Gançarski. Constrained Distance Based Clustering for Time-Series: A Comparative and Experimental Study. (submitted)

It contains modified implementations of constrained clustering methods that were found online. These methods have been adapted to use the DTW (Dynamic Time Warping) distance measure (although Euclidean is still an option) and, where necessary, DBA (DTW Barycenter Averaging). Please refer to the above publication for further details.
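To illustrate the difference between the two distance options, here is a minimal Python sketch of the textbook dynamic-programming formulation of DTW next to the Euclidean distance. It is not the toolbox's implementation. The function names, the squared local cost, and the final square root are assumptions made for this example only.

```python
import math


def dtw_distance(a, b):
    """Textbook DTW between two univariate sequences (illustrative only)."""
    if not a or not b:
        raise ValueError("both sequences must be non-empty")
    n, m = len(a), len(b)
    # cost[i][j] holds the cheapest cumulative cost of aligning a[:i] with b[:j].
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = (a[i - 1] - b[j - 1]) ** 2  # assumed local cost
            cost[i][j] = local + min(
                cost[i - 1][j],      # a[i-1] is matched to b[j-1] again
                cost[i][j - 1],      # b[j-1] is matched to a[i-1] again
                cost[i - 1][j - 1],  # both sequences advance
            )
    return math.sqrt(cost[n][m])


def euclidean_distance(a, b):
    """Point-wise Euclidean distance; requires equal lengths."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

For two series that differ only by a shift in time, such as `[0, 0, 1]` and `[0, 1, 1]`, the warping path absorbs the shift, so the DTW distance is smaller than the Euclidean distance.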

## Instructions

To start using the toolbox, download some of the UCR datasets (http://www.cs.ucr.edu/~eamonn/time_series_data/) and edit `UCR_path` in `prepare_Datasets.m` to point to their location. In Matlab, execute `prepare_Datasets()`. This converts all UCR datasets found under `UCR_path` and creates a subdirectory called datasets, which contains the data in the form expected by the scripts (described in more detail below).

To execute the methods, use any of the `test_<method>_unconstrained.sh` or `test_<method>_constrained.sh` scripts, where _<method>_ is the method to be run (see below). These scripts run through all datasets located in the datasets subdirectory (i.e. those that have been processed by the `prepare_Datasets()` function). Before executing any method implemented in Matlab, edit the script so that the `matlab_path` variable points to the directory where Matlab is installed. Each script stores the clusterings, constraint satisfaction results, and performance results in subdirectories of the method called clusterings, constraint_satisfaction, and results respectively.

If the method includes parameters that should be trained, e.g. Kamvar03 and Li09, there are separate `train_<method>_unconstrained.sh` and `train_<method>_constrained.sh` scripts that do this using a grid search. The Matlab function `summarise_training.m` can then be used to select the best parameter values. The train and test scripts include options to run multiple iterations in parallel; change the `n_proc` variable to the desired number of parallel processes. For the methods that have parameters, the values are kept in a case code block within the scripts; the parameter values used in the paper are included for the Li09 and Kamvar03 methods.
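The grid search mentioned above is carried out by the shell scripts and `summarise_training.m`. As a generic illustration of the idea only, the following Python sketch evaluates every combination of candidate values and keeps the best-scoring one. The parameter names `sigma` and `n_neighbours`, the candidate values, and the toy scoring function are hypothetical and are not taken from the toolbox.

```python
import itertools


def grid_search(param_grid, evaluate):
    """Return the best-scoring parameter combination and its score.

    param_grid maps a parameter name to a list of candidate values;
    evaluate takes a dict of parameters and returns a score (higher is better).
    """
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in itertools.product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    """Hypothetical stand-in for a clustering quality score on training data."""
    return -((params["sigma"] - 0.5) ** 2) - ((params["n_neighbours"] - 10) ** 2) / 100.0


# Hypothetical parameter names and candidate values, for illustration only.
example_grid = {"sigma": [0.1, 0.5, 1.0], "n_neighbours": [5, 10, 20]}
best_params, best_score = grid_search(example_grid, toy_score)
```

In practice `evaluate` would run a method on the TRAIN data with the given parameters and return its performance, which is the role the training scripts play here.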

...

...

These `inter_cluster_distance` function can be used to measure the silhouette sc

The methods are included in the 'methods' subdirectory. Each method may have its own setup procedure, so please check the readme in its subdirectory. The licenses for the included software can be found in the respective directories.

Each dataset contained within the datasets subdirectory should contain a TEST subdirectory (and a TRAIN subdirectory if training is necessary), and each of these should contain the following files:

* `<dataset_name>.data` -- a tab-delimited csv containing the data, having N rows, where each row is a time-series of length M (if multiple features per time point the length will be MxF, where F is the number of features, see below, and columns 1,...,F are the features for the first time point, F+1,...,2F those for the second time point, etc).

* `<dataset_name>.distances` -- a tab-delimited csv containing the pre-computed distance matrix of size N x N.

* `<dataset_name>.k` -- the number of clusters in the dataset.

* `<dataset_name>.labels` -- the ground-truth labels of the dataset (consecutive integers starting at 1), N rows.

* `<dataset_name>.metric` -- the distance metric used to calculate the distances, 'dtw' or 'euclid'.

* `<dataset_name>.nfeatures` -- the number of features in each time point, F such that each row of data has MxF columns.

* `<dataset_name>_<constraint_fraction>_<iteration>.constraints` -- a file containing the constraints, where each row defines a pairwise constraint in the format p1 p2 c, where the constraint is between points p1 and p2 (1-indexed), and c is -1 (cannot-link) or 1 (must-link).
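The file layout above can be sketched in Python as follows. This is an illustrative reader, not part of the toolbox: the function names are hypothetical, the constraints file is assumed to be whitespace-separated with integer fields, and the satisfaction measure shown is simply the fraction of constraints a labelling respects.

```python
import csv
import io


def split_time_points(row, n_features):
    """Regroup a flat row of M*F values into M time points of F features each."""
    if len(row) % n_features != 0:
        raise ValueError("row length is not a multiple of n_features")
    return [row[i:i + n_features] for i in range(0, len(row), n_features)]


def read_data(stream, n_features):
    """Read a tab-delimited .data stream; one time-series per row."""
    reader = csv.reader(stream, delimiter="\t")
    return [
        split_time_points([float(v) for v in record], n_features)
        for record in reader
        if record  # skip blank lines
    ]


def read_constraints(stream):
    """Read rows of 'p1 p2 c' (1-indexed points; c is -1 or 1)."""
    constraints = []
    for line in stream:
        parts = line.split()  # assumption: whitespace-separated fields
        if not parts:
            continue
        p1, p2, c = int(parts[0]), int(parts[1]), int(parts[2])
        if c not in (-1, 1):
            raise ValueError("constraint type must be -1 or 1")
        constraints.append((p1, p2, c))
    return constraints


def satisfied_fraction(labels, constraints):
    """Fraction of constraints respected by a labelling (labels[0] is point 1)."""
    if not constraints:
        return 1.0
    satisfied = 0
    for p1, p2, c in constraints:
        same_cluster = labels[p1 - 1] == labels[p2 - 1]
        if (c == 1) == same_cluster:  # must-link wants same, cannot-link wants different
            satisfied += 1
    return satisfied / len(constraints)
```

With F = 2, a row `1 2 3 4` is read as two time points, `[1, 2]` and `[3, 4]`, which matches the column ordering described for the `.data` file.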

## Algorithms

The following implementations are included:

**Subdirectory:** Babaki14<br>

**Name:** CCCG -- Constrained Clustering using Column Generation<br>

**Language:** C++<br>

**URL:** https://dtai.cs.kuleuven.be/CP4IM/cccg/<br>

**Publication:** B. Babaki, T. Guns, S. Nijssen. Constrained Clustering using Column Generation. Proceedings of the 11th International Conference on Integration of Artificial Intelligence (AI) and Operations Research (OR) Techniques in Constraint Programming, 2014.

**Publication:** M. Cucuringu, I. Koutis, S. Chawla, G. Miller, R. Peng. Simple and Scalable Constrained Clustering: A Generalized Spectral Method. Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, 2016.

**Subdirectory:** Hiep16<br>

**Name:** Pairwise Constrained Clustering By Local Search<br>

**Publication:** T. K. Hiep, N. M. Duc, B. Q. Trung. Pairwise Constrained Clustering by Local Search. Proceedings of the 7th Symposium on Information and Communication Technology, 2016.

**Subdirectory:** Kamvar03<br>

**Language:** Matlab<br>

**Publication:** S. Kamvar, S. Klein, C. Manning. Spectral Learning. Proceedings of the 18th International Joint Conference on Artificial Intelligence, 2003.

**Subdirectory:** Li09<br>

**Name:** CCSR -- Constrained Clustering via Spectral Regularization<br>

**Language:** Matlab<br>

**URL:** http://www.ee.columbia.edu/~zgli/<br>

**Publication:** Z. Li, J. Liu, X. Tang. Constrained Clustering via Spectral Regularization. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2009.

**Publication:** K. Wagstaff, C. Cardie, S. Rogers, S. Schroedl. Constrained K-means Clustering with Background Knowledge. Proceedings of the 18th International Conference on Machine Learning, 2001.

**Subdirectory:** Wang14<br>

**Name:** CSP<br>

**Language:** Matlab<br>

**URL:** https://github.com/gnaixgnaw/CSP<br>

**Publication:** X. Wang, I. Davidson. Flexible Constrained Spectral Clustering. Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2010.

## Funding

This work was carried out as part of the A2CNES project, which is funded by CNES/UNISTRA R&T grant number 2016-033.