swamp.clustering.clustering module¶

class Clustering(nthreads=1, n_iter=100, logger=None)[source]¶

Bases: abc.ABC

Abstract class for SWAMP library clustering purposes, used to cluster similar helical pairs into ensembles

Implements data structures and methods commonly used in clustering tasks and optimal hyper-parameter search

Parameters:


nthreads (int) – number of threads to be used for fragment clustering (default 1)
n_iter (int) – number of iterations for the grid_search() of the optimal clustering hyper-parameters (default 100)
logger (SwampLogger) – logger instance to record log messages

Variables:

error (bool) – True if an error has occurred at any point during clustering
best_params (dict) – the optimal hyper-parameters found by grid_search()
labels (list) – the cluster labels assigned to each member of the similarity_mtx
child_threads (list) – the threading.Thread.name of each child threading.Thread instance used by grid_search()
rmsd_mtx (pandas.DataFrame) – square dataframe with the rmsd distance between fragments in the library
similarity_mtx (pandas.DataFrame) – square dataframe with the similarity between fragments in the library
nalign_mtx (pandas.DataFrame) – square dataframe with the no. of aligned residues between fragments in the library
centroid_dict (dict) – dictionary with the centroids associated with each cluster id
cluster_dict (dict) – dictionary with the list of fragment ids that form each cluster
composition_dict (dict) – dictionary with the final composition of each ensemble
ensemble_dict (dict) – dictionary with the list of fragment ids that form each ensemble
semaphore (threading.Semaphore) – a threading.Semaphore instance to control the execution of threads in grid_search()
gridsearch_results (ParameterSearchResults) – a ParameterSearchResults instance with the results obtained by grid_search()

assess_clustering(labels)[source]¶

Method to assess the quality of the obtained clustering

Parameters:	labels (tuple) – the labels assigned to each fragment
Returns:	a tuple with the no. of clusters, average cluster size, average cluster qscore, silhouette score and the no. of singletons
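The size-based entries of that tuple can be derived from the labels alone. The following is a minimal sketch, not the SWAMP implementation: `summarise_labels` is a hypothetical helper, it treats every distinct label as a cluster, and it leaves out the qscore and silhouette terms, which need the pairwise matrices.

```python
from collections import Counter


def summarise_labels(labels):
    """Return (n_clusters, average cluster size, n_singletons) for a label sequence.

    Hypothetical helper for illustration; every distinct label counts as a cluster.
    """
    sizes = Counter(labels)  # cluster label -> number of members
    n_clusters = len(sizes)
    avg_size = sum(sizes.values()) / n_clusters if n_clusters else 0.0
    n_singletons = sum(1 for size in sizes.values() if size == 1)
    return n_clusters, avg_size, n_singletons


summary = summarise_labels((0, 0, 1, 2, 2, 2, 3))
print(summary)
```

If the clustering algorithm marks outliers with a special label, that label would have to be filtered out before counting.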

cluster()[source]¶: Abstract method to run the clustering algorithm
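Because the class derives from abc.ABC and cluster() is abstract, only subclasses that implement cluster() can be instantiated. The sketch below shows that pattern with stand-in classes; `BaseClustering` and `SingleClusterDemo` are hypothetical names, not part of SWAMP, and the "algorithm" is deliberately trivial.

```python
from abc import ABC, abstractmethod


class BaseClustering(ABC):
    """Hypothetical stand-in for an abstract clustering base class."""

    def __init__(self, nthreads=1, n_iter=100):
        self.nthreads = nthreads
        self.n_iter = n_iter
        self.labels = None

    @abstractmethod
    def cluster(self):
        """Run the clustering algorithm and set self.labels."""


class SingleClusterDemo(BaseClustering):
    """Toy subclass that assigns every fragment to cluster 0."""

    def __init__(self, fragment_ids, **kwargs):
        super().__init__(**kwargs)
        self.fragment_ids = list(fragment_ids)

    def cluster(self):
        # Trivial placeholder: one label per fragment, all in cluster 0
        self.labels = [0] * len(self.fragment_ids)


demo = SingleClusterDemo(["frag_a", "frag_b"], nthreads=2)
demo.cluster()
print(demo.labels)
```

Trying to instantiate `BaseClustering` directly raises a TypeError, which is how abc enforces that each concrete clustering class supplies its own algorithm.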

clustering_header¶: The header to be displayed when initialising the logger

distance_mtx¶: Square matrix that corresponds to 1 - similarity_mtx

fragment_list¶: Columns of the distance_mtx, which correspond to the list of fragments in the library

get_centroid_dict()[source]¶: Get a dictionary with the centroid of each cluster in cluster_dict

get_centroid_id(frag_id)[source]¶

Get the unique identifier of the cluster to which a given centroid is assigned

Parameters:	frag_id (str) – the id of the centroid of interest
Returns:	the cluster id where the fragment can be found (str)

get_clst_id(frag_id)[source]¶

Get the unique identifier of the cluster to which a given fragment is assigned

Parameters:	frag_id (str) – the id of the fragment of interest
Returns:	the cluster id where the fragment can be found (str)

get_cluster_dict(labels, inplace=True)[source]¶

Method to generate a cluster dictionary containing the cluster identifiers as keys and the fragment names as the values

Parameters:

labels (tuple) – the labels assigned to each fragment in the library
inplace (bool) – if True, then it sets cluster_dict (default True)

Returns:

a dictionary containing the cluster id as keys and the frag ids as the values (if not inplace)
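The mapping from labels to a cluster dictionary can be sketched as below. This is an illustration under assumptions rather than the library code: `build_cluster_dict` is a hypothetical function, and it assumes the labels are given in the same order as the fragment ids.

```python
from collections import defaultdict


def build_cluster_dict(fragment_ids, labels):
    """Group fragment ids by cluster label (hypothetical helper).

    Assumes labels[i] is the cluster assigned to fragment_ids[i].
    """
    if len(fragment_ids) != len(labels):
        raise ValueError("fragment_ids and labels must have the same length")
    cluster_dict = defaultdict(list)
    for frag_id, label in zip(fragment_ids, labels):
        cluster_dict[label].append(frag_id)
    return dict(cluster_dict)


clusters = build_cluster_dict(["f1", "f2", "f3", "f4"], (0, 1, 0, 2))
print(clusters)
```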

get_ensemble(centroid_id, rmsd_threshold=0.7, nalign_threshold=30, qscore_threshold=None, nthreads=1)[source]¶

Search for fragments to form an ensemble given a centroid identifier

Fragments within the threshold distance from the centroid will be included in the ensemble. Other centroids will be excluded from this search.

Parameters:


centroid_id (int) – centroid identifier of the centroid of interest
rmsd_threshold (float) – the rmsd threshold at which fragments are included into the ensemble (default 0.7)
nalign_threshold (int) – threshold of aligned residues to include a fragment into an ensemble (default 30)
qscore_threshold (float) – qscore threshold at which fragments are included into the ensemble (default None)
nthreads (int) – number of threads to compute the ensemble optimal model alignment (default 1)

Returns:

a gemmi.Structure hierarchy with the ensemble
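The selection rule described above can be sketched as follows. This only covers picking the member ids, not building the gemmi.Structure. `select_ensemble_members` is a hypothetical function, the matrices are plain nested dictionaries instead of dataframes, and the direction of each comparison (rmsd at or below its threshold, aligned residues at or above theirs) is an assumption.

```python
def select_ensemble_members(centroid_id, centroids, rmsd, nalign,
                            rmsd_threshold=0.7, nalign_threshold=30):
    """Return the centroid plus the fragments close enough to join its ensemble.

    rmsd and nalign are nested dicts: matrix[frag_a][frag_b] -> value.
    Assumption: a fragment qualifies if rmsd <= rmsd_threshold and
    nalign >= nalign_threshold.
    """
    members = [centroid_id]
    for frag_id, distance in rmsd[centroid_id].items():
        if frag_id == centroid_id or frag_id in centroids:
            continue  # other centroids are excluded from the search
        if distance <= rmsd_threshold and nalign[centroid_id][frag_id] >= nalign_threshold:
            members.append(frag_id)
    return members


centroids = {"c1", "c2"}
rmsd = {"c1": {"c1": 0.0, "c2": 0.5, "f1": 0.4, "f2": 0.9, "f3": 0.6}}
nalign = {"c1": {"c1": 40, "c2": 38, "f1": 35, "f2": 36, "f3": 20}}
members = select_ensemble_members("c1", centroids, rmsd, nalign)
print(members)
```

In this toy data `c2` is skipped because it is a centroid, `f2` fails the rmsd threshold and `f3` has too few aligned residues.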

get_ensemble_dict(rmsd_threshold=0.7, nalign_threshold=30, qscore_threshold=None, nthreads=1)[source]¶

Merge similar fragments to create a dictionary with fragment ensembles (clustering with replacement)

Fragments can be selected by an rmsd threshold combined with a threshold on the no. of aligned residues; if a qscore threshold is given, it is used instead of these two

Parameters:


rmsd_threshold (float) – the rmsd threshold at which fragments are included into the ensemble (default 0.7)
nalign_threshold (int) – threshold of aligned residues to include a fragment into an ensemble (default 30)
qscore_threshold (float) – qscore threshold at which fragments are included into the ensemble (default None)
nthreads (int) – number of threads to compute the ensemble optimal parameter (default 1)

grid_search()[source]¶: Method to run a random grid search over the range of clustering hyper-parameters defined in _hyper_params
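A random search over a hyper-parameter grid, with a semaphore capping the number of concurrently running threads, might look like the sketch below. It is not the SWAMP implementation: `random_grid_search`, the grid contents and the scoring function are all hypothetical, and a real search would score each combination by running the clustering.

```python
import itertools
import random
import threading


def random_grid_search(param_grid, score_fn, n_iter=100, nthreads=1, seed=0):
    """Score up to n_iter random grid points and return the best (score, params)."""
    # Enumerate every combination in the grid, then draw a random subset
    combos = [dict(zip(param_grid, values))
              for values in itertools.product(*param_grid.values())]
    sampled = random.Random(seed).sample(combos, min(n_iter, len(combos)))

    semaphore = threading.Semaphore(nthreads)  # caps concurrent evaluations
    lock = threading.Lock()                    # protects the shared result list
    results = []

    def worker(params):
        with semaphore:
            score = score_fn(**params)
        with lock:
            results.append((score, params))

    threads = [threading.Thread(target=worker, args=(params,)) for params in sampled]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return max(results, key=lambda item: item[0])


# Hypothetical grid and toy score with a single best point at (3, 0.05)
grid = {"min_samples": [2, 3, 5], "xi": [0.01, 0.05, 0.1]}
best_score, best_params = random_grid_search(
    grid, lambda min_samples, xi: -(abs(min_samples - 3) + abs(xi - 0.05)), nthreads=2)
print(best_params)
```

With the default n_iter of 100 and only nine combinations in this grid, every point is evaluated, so the best point is always found; with a larger grid the result would depend on the random draw.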

register_library(library)[source]¶

Parameters:	library (SwampLibrary) – the `SwampLibrary` instance to be registered
Raises:	TypeError – if library is not a `SwampLibrary` instance

class ParameterSearchResults(logger)[source]¶

Bases: object

Class to hold the results from a multi-threaded grid_search()

Implements Semaphore methods to regulate thread I/O into a result list

Parameters:	logger (SwampLogger) – logger instance to record thread log messages
Variables:

lock (threading.Lock) – lock to control I/O to the result instance
value (pandas.DataFrame) – dataframe with the results of the grid search

register(new_results)[source]¶

Parameters:	new_results (pandas.DataFrame) – the set of new results to be registered
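The idea of a lock-protected result container that several search threads write into can be sketched as below. `ResultStore` is a hypothetical stand-in for illustration: it keeps the results in a plain list rather than a pandas.DataFrame, and the record fields are invented for the example.

```python
import threading


class ResultStore:
    """Hypothetical thread-safe container for grid search results."""

    def __init__(self):
        self.lock = threading.Lock()
        self.value = []  # a plain list stands in for the results dataframe

    def register(self, new_results):
        # Only one thread at a time may append to the shared results
        with self.lock:
            self.value.extend(new_results)


store = ResultStore()
threads = [
    threading.Thread(target=store.register, args=([{"run": i, "score": i * 0.1}],))
    for i in range(8)
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
print(len(store.value))
```

The order of the stored records depends on thread scheduling, so any consumer should sort or index them rather than rely on insertion order.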