Commit 0f3a4cd4 authored by Tom Lampert's avatar Tom Lampert

Initial commit

# Constrained Clustering Toolbox
Thomas A. Lampert\
ICube, University of Strasbourg\
tomalampert@outlook.com
This work was carried out as part of the A2CNES project, which is funded by CNES/UNISTRA R&T grant number 2016-033.
This toolbox accompanies the paper: T. Lampert, T.-B.-H. Dao, B. Lafabregue, N. Serrette, G. Forestier, B. Crémilleux, C. Vrain, P. Gançarski. Constrained Distance Based Clustering for Time-Series: A Comparative and Experimental Study. (submitted)
It contains modified implementations of constrained clustering methods that have been found online. These methods have been modified to use the DTW distance measure (although Euclidean is still an option) and, where necessary, DBA averaging. Please refer to the above publication for further details.
To start using the toolbox, download some of the UCR datasets (http://www.cs.ucr.edu/~eamonn/time_series_data/) and edit the `UCR_path` variable in `prepare_Datasets.m` to point to the correct location. Using Matlab, execute `prepare_Datasets()`. This will create a subdirectory called `datasets` containing the data in the form expected by the scripts (described in more detail below); all UCR datasets under `UCR_path` will be converted.
To run the methods, use any of the `test_<method>_unconstrained.sh` or `test_<method>_constrained.sh` scripts, where `<method>` is the method to be run (see below). These scripts iterate over all datasets located in the `datasets` subdirectory (i.e. those that have been processed by the `prepare_Datasets()` function). Before executing any method implemented in Matlab, edit the script so that the `matlab_path` variable points to the directory where Matlab is installed. Each script stores the clusterings, performance results, and constraint satisfaction results in subdirectories of the method called `clusterings`, `results`, and `constraint_satisfaction` respectively.

If a method has parameters that should be trained, e.g. Kamvar03 and Li09, separate `train_<method>_unconstrained.sh` and `train_<method>_constrained.sh` scripts perform this training using a grid search; the Matlab function `summarise_training.m` can then be used to select the best parameter values. The train and test scripts can run multiple iterations in parallel: set the `n_proc` variable to the desired number of parallel processes. For methods that take parameter values, these are kept in a case block within the scripts; the values used in the paper are included for the Li09 and Kamvar03 methods.
These scripts, and the `prepare_Datasets` function, can be used as templates for preparing other (non-UCR) datasets.
The `inter_cluster_distance` function can be used to measure the silhouette score for a dataset (see the paper for details).
The following methods are included in the `methods` subdirectory. Each method may have its own setup procedure, so please check any READMEs in the respective subdirectories. All licenses for the included software can be found in their respective directories.
Each dataset contained within the `datasets` subdirectory should contain a TEST subdirectory (and TRAIN if training is necessary), and each of these should contain the following files:
* `<dataset_name>.data` -- a tab-delimited csv containing the data, with N rows, where each row is a time-series of length M (if there are multiple features per time point, the row length is MxF, where F is the number of features, see below; columns 1,...,F are the features for the first time point, F+1,...,2F those for the second time point, etc).
* `<dataset_name>.distances` -- a tab-delimited csv containing the pre-computed distance matrix of size N x N.
* `<dataset_name>.k` -- the number of clusters in the dataset.
* `<dataset_name>.labels` -- the ground-truth (continuous integer) labels of the dataset (starting at 1), N rows.
* `<dataset_name>.metric` -- the distance metric used to calculate the distances, 'dtw' or 'euclid'.
* `<dataset_name>.nfeatures` -- the number of features in each time point, F such that each row of data has MxF columns.
* `<dataset_name>_<constraint_fraction>_<iteration>.constraints` -- a file containing the constraints, where each row defines a pairwise constraint in the format `p1 p2 c`: the constraint is between points p1 and p2 (1-indexed), and c is -1 (cannot-link) or 1 (must-link).
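These files are plain text, so datasets from other sources can be written directly. The following sketch (a hypothetical helper, not part of the toolbox) writes a toy dataset in the layout described above, assuming a single feature per time point and Euclidean distances:

```python
# Sketch: write a toy dataset in the layout described above.
# The helper name and example values are illustrative, not part of the toolbox.
import csv
import os

def write_dataset(root, name, series, labels, k, metric="euclid", nfeatures=1):
    """series: N rows, each a time-series of M*F values (tab-delimited)."""
    d = os.path.join(root, name, "TEST")
    os.makedirs(d, exist_ok=True)
    path = lambda ext: os.path.join(d, name + ext)

    with open(path(".data"), "w", newline="") as f:
        csv.writer(f, delimiter="\t").writerows(series)

    # Pre-computed N x N distance matrix (Euclidean here, for illustration)
    dist = [[sum((a - b) ** 2 for a, b in zip(r1, r2)) ** 0.5 for r2 in series]
            for r1 in series]
    with open(path(".distances"), "w", newline="") as f:
        csv.writer(f, delimiter="\t").writerows(dist)

    with open(path(".labels"), "w") as f:
        f.write("\n".join(str(l) for l in labels))
    with open(path(".k"), "w") as f:
        f.write(str(k))
    with open(path(".metric"), "w") as f:
        f.write(metric)
    with open(path(".nfeatures"), "w") as f:
        f.write(str(nfeatures))

write_dataset("datasets", "Toy", [[0.0, 0.1], [1.0, 1.1]], [1, 2], k=2)
```

A matching `.constraints` file can be written with the same whitespace-separated `p1 p2 c` rows.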
The methods included are as follows:
* **Subdirectory:** Babaki14
**Name:** CCCG -- Constrained Clustering using Column Generation
**Language:** C++
**URL:** https://dtai.cs.kuleuven.be/CP4IM/cccg/
**Publication:** B. Babaki, T. Guns, S. Nijssen. Constrained Clustering using Column Generation. Proceedings of the 11th International Conference on Integration of Artificial Intelligence (AI) and Operations Research (OR) Techniques in Constraint Programming, 2014.
* **Subdirectory:** BabakiXX
**Name:** MIPKmeans
**Language:** Python
**URL:** https://github.com/Behrouz-Babaki/MIPKmeans
* **Subdirectory:** Cucuringu16
**Language:** Matlab
**Publication:** M. Cucuringu, I. Koutis, S. Chawla, G. Miller, R. Peng. Simple and Scalable Constrained Clustering: A Generalized Spectral Method. Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, 2016.
* **Subdirectory:** Hiep16
**Name:** Pairwise Constrained Clustering By Local Search
**Language:** R
**URL:** https://github.com/cran/conclust/blob/master/R/ccls.R
**Publication:** T. K. Hiep, N. M. Duc, B. Q. Trung. Pairwise Constrained Clustering by Local Search. Proceedings of the 7th Symposium on Information and Communication Technology, 2016.
* **Subdirectory:** Kamvar03
**Language:** Matlab
**Publication:** S. Kamvar, S. Klein, C. Manning. Spectral Learning. Proceedings of the 18th International Joint Conference on Artificial Intelligence, 2003.
* **Subdirectory:** Li09
**Name:** CCSR: Constrained Clustering via Spectral Regularization
**Language:** Matlab
**URL:** http://www.ee.columbia.edu/~zgli/
**Publication:** Z. Li, J. Liu, X. Tang. Constrained Clustering via Spectral Regularization. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2009.
* **Subdirectory:** Wagstaff01
**Name:** COP-KMeans
**Language:** R
**URL:** https://github.com/cran/conclust/blob/master/R/ckmeans.R
**Publication:** K. Wagstaff, C. Cardie, S. Rogers, S. Schroedl. Constrained K-means Clustering with Background Knowledge. Proceedings of the 18th International Conference on Machine Learning, 2001.
* **Subdirectory:** Wang14
**Name:** CSP
**Language:** Matlab
**URL:** https://github.com/gnaixgnaw/CSP
**Publication:** X. Wang, I. Davidson. Flexible Constrained Spectral Clustering. Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2010.
#!/bin/bash
# Summarise constraint satisfaction results for one method (first argument).
subdir=$1
data_path=./datasets
test_type=test
indir=./methods/${subdir}/clusterings/constrained
indir2=./methods/${subdir}/clusterings/unconstrained
resdir=./constraint_satisfaction/${subdir}

mkdir -p ${resdir}

for d in ${data_path}/*/
do
    dataname=$(basename ${d})
    echo ${dataname}

    resultdir=./methods/${subdir}/constraint_satisfaction/${test_type}
    mkdir -p ${resultdir}

    for constraintsfilename in $(find ${data_path}/${dataname}/${test_type} -name '*.constraints')
    do
        # Constraint files follow <dataset>_<fraction>_<iteration>.constraints
        IN=$(basename ${constraintsfilename%.*})
        arrIN=(${IN//_/ })
        cnstrnt_frac=${arrIN[1]}
        cnstrnt_iter=${arrIN[2]}

        filenamepattern=${dataname}_${test_type}_${cnstrnt_frac}_${cnstrnt_iter}
        infilename=${indir}/${filenamepattern}.clustering
        resultfilename=${resultdir}/${filenamepattern}.satisfaction
        if [ -f ${infilename} ]
        then
            echo ${infilename}
            if [ ! -f ${resultfilename} ]
            then
                echo ${resultfilename}
                python3.5 ./utils/constraint_satisfaction.py -r ${infilename} -c ${constraintsfilename} -o ${resultfilename} -i 1
            fi
        fi

        # Evaluate the unconstrained clustering (fraction 0) once per iteration,
        # recomputing its satisfaction result each time
        if [ "${cnstrnt_frac}" = "0.5" ]
        then
            filenamepattern2=${dataname}_${test_type}_0_${cnstrnt_iter}
            infilename=${indir2}/${filenamepattern2}.clustering
            resultfilename=${resultdir}/${filenamepattern2}.satisfaction
            if [ -f ${infilename} ]
            then
                echo ${infilename}
                rm -f ${resultfilename}
                echo ${resultfilename}
                python3.5 ./utils/constraint_satisfaction.py -r ${infilename} -c ${constraintsfilename} -o ${resultfilename} -i 1
            fi
        fi
    done

    # Summarise the per-iteration results for each constraint fraction
    for cnstrnt_frac in 0 0.05 0.1 0.15 0.5
    do
        outfilename=${resdir}/${dataname}_${test_type}_${cnstrnt_frac}.satisfaction
        if [ ! -f ${outfilename} ]
        then
            echo ${outfilename}
            Rscript --vanilla ./utils/summarise_satisfaction_results.R ${resultdir} ${dataname} ${test_type} ${cnstrnt_frac} ${outfilename}
        fi
    done
done
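The `constraint_satisfaction.py` utility called above is not reproduced here; the quantity it reports — the fraction of constraints satisfied by a clustering — can be sketched as follows (an illustrative re-implementation, assuming the `p1 p2 c` constraint encoding used by the toolbox):

```python
# Illustrative sketch of the satisfaction fraction; this is not the
# toolbox's constraint_satisfaction.py itself.
def satisfaction(assignment, constraints):
    """assignment: dict mapping point id -> cluster label;
    constraints: iterable of (p1, p2, c), c = 1 (must-link) or -1 (cannot-link).
    Returns the fraction of constraints the clustering satisfies."""
    constraints = list(constraints)
    if not constraints:
        return 1.0
    ok = 0
    for p1, p2, c in constraints:
        same = assignment[p1] == assignment[p2]
        ok += (c == 1 and same) or (c == -1 and not same)
    return ok / len(constraints)
```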
function m = inter_cluster_distance(dataset)

    addpath('utils');

    distance = 'dtw';
    %distance = 'euclid';
    experiment = 'test';
    data_path = './datasets';

    if strcmpi(distance, 'dtw')
        distances = dlmread([data_path, filesep, dataset, filesep, experiment, filesep, dataset, '.distances']);
    else
        X = dlmread([data_path, filesep, dataset, filesep, experiment, filesep, dataset, '.data'], '\t');
        distances = squareform(pdist(X, 'euclidean'));
    end

    gt = dlmread([data_path, filesep, dataset, filesep, experiment, filesep, dataset, '.labels']);

    %%%%%%
    % Silhouette coefficient (mean over all points)
    %%%%%%
    m = zeros(1, size(distances, 1));
    for ind = 1:size(distances, 1)
        m(ind) = inter_cluster_distance_point(distances(ind, :), gt, ind);
    end
    m = mean(m);

end
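The per-point values above are computed by `inter_cluster_distance_point` (not shown). For reference, the standard mean silhouette coefficient this corresponds to can be sketched directly from a distance matrix (illustrative Python, assuming at least two clusters):

```python
def mean_silhouette(D, labels):
    """D: N x N distance matrix (list of lists); labels: length-N cluster labels.
    Returns the standard silhouette coefficient averaged over all points."""
    n = len(labels)
    clusters = set(labels)
    s = []
    for i in range(n):
        own = [D[i][j] for j in range(n) if labels[j] == labels[i] and j != i]
        if not own:  # singleton cluster: silhouette conventionally 0
            s.append(0.0)
            continue
        a = sum(own) / len(own)  # mean intra-cluster distance
        b = min(                 # mean distance to the nearest other cluster
            sum(D[i][j] for j in range(n) if labels[j] == c)
            / sum(1 for j in range(n) if labels[j] == c)
            for c in clusters if c != labels[i]
        )
        s.append((b - a) / max(a, b))
    return sum(s) / n
```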
The MIT License (MIT)
Copyright (c) 2014 Behrouz Babaki, Tias Guns, Siegfried Nijssen
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.
# Makefile for C++ constrained clustering code using SCIP for branch-and-price
#-----------------------------------------------------------------------------
# paths
#-----------------------------------------------------------------------------
ifdef SCIP_HOME
SCIPDIR = $(SCIP_HOME)
else
$(error Environment variable SCIP_HOME is not defined.)
endif
#-----------------------------------------------------------------------------
# include default project Makefile from SCIP
#-----------------------------------------------------------------------------
include $(SCIPDIR)/make/make.project
CXXFLAGS+= -std=c++0x -O3 -I /usr/local/include -I /opt/local/include
DFLAGS+= -std=c++0x
#-----------------------------------------------------------------------------
# Main Program
#-----------------------------------------------------------------------------
MAINNAME = cccg
MAINOBJ = main_optclust.o \
pricer_optclust.o \
reader.o \
dtw.o \
init.o \
subsolver.o \
SiegBnB.o \
SiegBnBCons.o \
cons_mlcl.o \
branch_ryanfoster.o
MAINSRC = $(addprefix $(SRCDIR)/,$(MAINOBJ:.o=.cpp))
MAINDEP = $(SRCDIR)/depend.cppmain.$(OPT)
MAIN = $(MAINNAME).$(BASE).$(LPS)$(EXEEXTENSION)
MAINFILE = $(BINDIR)/$(MAIN)
MAINSHORTLINK = $(BINDIR)/$(MAINNAME)
MAINOBJFILES = $(addprefix $(OBJDIR)/,$(MAINOBJ))
#-----------------------------------------------------------------------------
# Rules
#-----------------------------------------------------------------------------
ifeq ($(VERBOSE),false)
.SILENT: $(MAINFILE) $(MAINOBJFILES) $(MAINSHORTLINK)
endif
.PHONY: all
all: $(SCIPDIR) $(MAINFILE) $(MAINSHORTLINK)
.PHONY: lint
lint: $(MAINSRC)
	-rm -f lint.out
	$(SHELL) -ec 'for i in $^; \
		do \
		echo $$i; \
		$(LINT) $(SCIPDIR)/lint/scip.lnt +os\(lint.out\) -u -zero \
		$(FLAGS) -UNDEBUG -UWITH_READLINE -UROUNDING_FE $$i; \
		done'

.PHONY: scip
scip:
	@$(MAKE) -C $(SCIPDIR) libs $^

.PHONY: doc
doc:
	@-(cd doc && ln -fs ../$(SCIPDIR)/doc/scip.css);
	@-(cd doc && ln -fs ../$(SCIPDIR)/doc/pictures/scippy.png);
	@-(cd doc && ln -fs ../$(SCIPDIR)/doc/pictures/miniscippy.png);
	@-(cd doc && ln -fs ../$(SCIPDIR)/doc/scipfooter.html footer.html);
	cd doc; $(DOXY) $(MAINNAME).dxy

$(MAINSHORTLINK): $(MAINFILE)
	@rm -f $@
	cd $(dir $@) && ln -s $(notdir $(MAINFILE)) $(notdir $@)

$(OBJDIR):
	@-mkdir -p $(OBJDIR)

$(BINDIR):
	@-mkdir -p $(BINDIR)

.PHONY: clean
clean: $(OBJDIR)
ifneq ($(OBJDIR),)
	-rm -f $(OBJDIR)/*.o
	-rmdir $(OBJDIR)
endif
	-rm -f $(MAINFILE)
	-rm -f $(BINDIR)/cccg

.PHONY: test
test: $(MAINFILE)
	@-($(MAINFILE) -d ./examples/iris.data -n ./examples/iris.constraints -k 3);

.PHONY: depend
depend: $(SCIPDIR)
	$(SHELL) -ec '$(DCXX) $(FLAGS) $(DFLAGS) $(MAINSRC) \
		| sed '\''s|^\([0-9A-Za-z\_]\{1,\}\)\.o *: *$(SRCDIR)/\([0-9A-Za-z\_]*\).cpp|$$\(OBJDIR\)/\2.o: $(SRCDIR)/\2.cpp|g'\'' \
		>$(MAINDEP)'

-include $(MAINDEP)

$(MAINFILE): $(BINDIR) $(OBJDIR) $(SCIPLIBFILE) $(LPILIBFILE) $(NLPILIBFILE) $(MAINOBJFILES)
	@echo "-> linking $@"
	$(LINKCXX) $(MAINOBJFILES) \
		$(LINKCXX_L)$(SCIPDIR)/lib $(LINKCXX_l)$(SCIPLIB)$(LINKLIBSUFFIX) \
		$(LINKCXX_l)$(OBJSCIPLIB)$(LINKLIBSUFFIX) $(LINKCXX_l)$(LPILIB)$(LINKLIBSUFFIX) $(LINKCXX_l)$(NLPILIB)$(LINKLIBSUFFIX) \
		$(OFLAGS) $(LPSLDFLAGS) $(CSPFLAGS) \
		$(LDFLAGS) $(LINKCXX_o)$@

$(OBJDIR)/%.o: $(SRCDIR)/%.c
	@echo "-> compiling $@"
	$(CC) $(FLAGS) $(OFLAGS) $(BINOFLAGS) $(CFLAGS) -c $< $(CC_o)$@

$(OBJDIR)/%.o: $(SRCDIR)/%.cpp
	@echo "-> compiling $@"
	$(CXX) $(FLAGS) $(OFLAGS) $(BINOFLAGS) $(CXXFLAGS) -c $< $(CXX_o)$@
#---- EOF --------------------------------------------------------------------
CCCG -- Constrained Clustering using Column Generation
What is it?
-----------
The problem of clustering in the presence of additional constraints
has been studied extensively, and a number of efficient methods to
solve these problems (approximately) have been developed. However,
these methods are often limited to a small number of specific
constraints. CCCG is an attempt to build a more general framework
for constrained clustering. It is based on an integer linear
programming formulation of the clustering problem, and hence obtains
an exact solution to this problem. The clustering criterion used in
CCCG is similar to that of the k-means algorithm, namely to minimize
the sum of squared distances from cluster centers.
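That criterion — the sum of squared distances from each object to its cluster center — can be written out directly. A minimal sketch (illustrative, not CCCG's code):

```python
# Illustrative computation of the k-means-style objective CCCG minimizes.
def kmeans_objective(points, assignment):
    """points: list of equal-length vectors; assignment: list of cluster ids.
    Returns the sum, over clusters, of squared Euclidean distances
    from each member to its cluster mean."""
    clusters = {}
    for p, c in zip(points, assignment):
        clusters.setdefault(c, []).append(p)
    total = 0.0
    for members in clusters.values():
        dim = len(members[0])
        center = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        total += sum(sum((p[d] - center[d]) ** 2 for d in range(dim))
                     for p in members)
    return total
```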
Installation
------------
CCCG depends on the Boost C++ libraries, which can be obtained from:
<http://www.boost.org/>
CCCG also depends on SCIP, a mixed integer programming (MIP)
solver. CCCG has been tested with versions 3.0.1 and 3.1.0. You can
obtain SCIP for free from:
<http://scip.zib.de/>
We recommend that you install not just the SCIP MIP solver but
rather the whole SCIP Optimization Suite, which also contains the LP
solver SoPlex. SCIP needs such an LP solver for solving the LP
relaxations. Alternatively, you can install only SCIP and later
configure it to use your favorite LP solver.
Once you have installed SCIP, define the environment variable
SCIP_HOME such that it points to the directory where you have
installed SCIP. For example:
$ export SCIP_HOME=/usr/local/scipoptsuite-3.0.1/scip-3.0.1
Then go to the CCCG home directory and run:
$ make depend
$ make
If you want to run CCCG pre-seeded with COP-kmeans clustering results, you should build COP-kmeans as well:
$ cd bin/Davidson_mod; make
There are a number of example datasets and example sets of
constraints in /examples directory. To verify that CCCG is working
correctly, you could run it on these examples. Perhaps the easiest
way to do so is to go to the CCCG home directory and run
$ make test
Usage
-----
Like many other clustering algorithms, CCCG assumes that the number
of clusters is specified in advance by the user, so the minimum
input for CCCG is the data describing the objects and the number of
clusters. After going to the CCCG home directory, run the following
command:
$ ./bin/cccg -d <DATA> -k <#CLUSTERS>
In the above command, DATA is a file in which the dimensions of the
data objects are stored, and #CLUSTERS is the number of clusters.
If you want to seed CCCG with the output of COP-kmeans, use the 'cccg-seed.py' script instead of the 'cccg' binary:
$ ./bin/cccg-seed.py -d <DATA> -k <#CLUSTERS>
To have CCCG consider constraints when generating the optimal
clustering, they should be specified in an input file which is
communicated to CCCG code using flag -n:
$ ./bin/cccg -d <DATA> -k <#CLUSTERS> -n <FILE>
In the input constraint file, the Must-Link (ML) and Cannot-Link
(CL) constraints are specified in the usual way: a pair of
must-linked objects is followed by +1, and a pair of cannot-linked
objects is followed by -1. For an example constraint file, look in
the /examples directory in the CCCG home directory.
There are several other options that can be used with CCCG. For a
list of example commands using these options, look in the /examples
directory. Some of these options are described below:
* The efficiency of CCCG can be improved by providing an initial
clustering, using the flag -i:
$ ./bin/cccg -d <DATA> -k <#CLUSTERS> -i <INIT>
* Instead of specifying a single clustering as an initial
solution for CCCG, one can alternatively specify a
directory containing several clusterings. In this case,
CCCG will collect all clusters from all of these
clusterings and add each and every one of them as
a column to the problem. This is done using
the flag -r:
$ ./bin/cccg -d <DATA> -k <#CLUSTERS> -r <DIR>
* In each call to the subproblem solver, several columns
with negative reduced cost are found. One can add either
all of these columns or only the most negative one; the
default behavior is the latter. To add all columns,
use the flag -a:
$ ./bin/cccg -d <DATA> -k <#CLUSTERS> -a
For a list of all available options, run ./bin/cccg with
no arguments.
Availability
------------
CCCG software source can be found on the CCCG page under
<http://dtai.cs.kuleuven.be/CP4IM/cccg/>
Authors
-------
Behrouz Babaki
Tias Guns
Siegfried Nijssen
Licensing
---------
Please see the file named LICENSE.
Publications
------------
Babaki, B., Guns, T., & Nijssen, S. (2014) Constrained Clustering
using Column Generation. Eleventh International Conference on
Integration of Artificial Intelligence (AI) and Operations Research
(OR) techniques in Constraint Programming (CPAIOR 2014)
Contact
-------
For questions, bug reports, and any other matters, contact
Behrouz Babaki at this email address:
<Behrouz.Babaki@cs.kuleuven.be>
The MIT License (MIT)
Copyright (c) 2015
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
# COP-Kmeans
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.275118.svg)](https://doi.org/10.5281/zenodo.275118)
<p align="left">
<img src="https://cdn.rawgit.com/Behrouz-Babaki/COP-Kmeans/gh-pages/images/diagram.svg"
width="200">
</p>
This is an implementation of the *Constrained K-means Algorithm*
developed by Wagstaff et al., following the description of the
algorithm presented in [[1][1]].
## The COP-Kmeans algorithm
This is the *COP-Kmeans* algorithm, as described in [[1][1]]:
<img src="https://cdn.rawgit.com/Behrouz-Babaki/COP-Kmeans/gh-pages/images/algo.svg"
width="550">
## Usage
```
usage: run_ckm.py [-h] [--ofile OFILE] [--n_rep N_REP] [--m_iter M_ITER] [--tol TOL] dfile cfile k
Run COP-Kmeans algorithm
positional arguments:
dfile data file
cfile constraint file
k number of clusters
optional arguments:
-h, --help show this help message and exit
--ofile OFILE file to store the output
--n_rep N_REP number of times to repeat the algorithm
--m_iter M_ITER maximum number of iterations of the main loop
--tol TOL tolerance for deciding on convergence
```
To see a run of the algorithm on example data and constraints, run the script `runner.sh` in the `examples` directory.
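The heart of COP-Kmeans is the check performed during the assignment step: each point joins its nearest centroid whose cluster violates none of its constraints, and the algorithm fails if no such cluster exists. A sketch of that check (illustrative, not this repository's code):

```python
# Illustrative sketch of the COP-Kmeans constraint check, as described
# in Wagstaff et al. [1]; names and structure are assumptions.
def violates(point, cluster, assignment, must_link, cannot_link):
    """True if putting `point` into `cluster` breaks any constraint,
    given the (possibly partial) current assignment {point: cluster}."""
    for a, b in must_link:
        other = b if a == point else a if b == point else None
        # a must-linked partner already placed elsewhere forbids this cluster
        if other is not None and other in assignment and assignment[other] != cluster:
            return True
    for a, b in cannot_link:
        other = b if a == point else a if b == point else None
        # a cannot-linked partner already in this cluster forbids it
        if other is not None and assignment.get(other) == cluster:
            return True
    return False
```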
## Citing
If you want to cite this implementation, you can use the following bibtex entry (other formats are also [available](https://doi.org/10.5281/zenodo.275118)):
```
@misc{behrouz_babaki_2017_275118,
author = {Behrouz Babaki},
title = {COP-Kmeans v1.0},
month = feb,
year = 2017,
doi = {10.5281/zenodo.275118},
url = {https://doi.org/10.5281/zenodo.275118}
}
```
## There's more ...
### Other implementations
- [Mateusz Zawiślak](https://github.com/mateuszzawislak) has a [java implementation](https://github.com/mateuszzawislak/k-means-clustering) of the COP-Kmeans algorithm.
- The R package [conclust](https://cran.r-project.org/web/packages/conclust/index.html) contains an implementation of COP-Kmeans, among a number of other constrained clustering algorithms.
### Other types of constraints
There is another version of constrained Kmeans that handles *size* constraints [[2][2]]. A python implementation of the algorithm (and its extensions) is available [here](https://github.com/Behrouz-Babaki/MinSizeKmeans).
### Exact algorithms for constrained clustering
In 2013-14, I was working on developing an integer linear programming
formulation for an instance of the constrained clustering problem. The
approach that I chose was *Branch-and-Price* (also referred to as
column generation). In the initialization step of my algorithm, I
needed another algorithm that could produce solutions of reasonably
good quality very quickly. The algorithm **COP-Kmeans** turned out to
be exactly what I was looking for. Interested in knowing more about my
own work? Go to my [homepage][page], from where you can access my
paper [[3][3]] and the corresponding code.
There is also a [body of work](http://cp4clustering.com/) on using constraint programming for exact constrained clustering. In particular, [[4][4]] is the state-of-the-art in exact constrained clustering.
## References
1. Wagstaff, K., Cardie, C., Rogers, S., & Schrödl, S. (2001,
June). Constrained k-means clustering with background knowledge. In
ICML (Vol. 1, pp. 577-584).
2. Bradley, P. S., K. P. Bennett, and Ayhan Demiriz. "Constrained k-means clustering." Microsoft Research, Redmond (2000): 1-8.
3. Babaki, B., Guns, T., & Nijssen, S. (2014). Constrained clustering
using column generation. In Integration of AI and OR Techniques in
Constraint Programming (pp. 438-454). Springer International
Publishing.
4. Guns, Tias, Christel Vrain, and Khanh-Chuong Duong. "Repetitive branch-and-bound using constraint programming for constrained minimum sum-of-squares clustering." 22nd European Conference on Artificial Intelligence. 2016.