swamp.clustering.clustering module

class Clustering(nthreads=1, n_iter=100, logger=None)[source]

Bases: abc.ABC

Abstract class for SWAMP library clustering purposes, used to cluster similar helical pairs into ensembles

Implements data structures and methods commonly used in clustering tasks and optimal hyper-parameter search

  • nthreads (int) – number of threads to be used for fragment clustering (default 1)
  • n_iter (int) – number of iterations for the grid_search() of the optimal clustering hyper-parameters (default 100)
  • logger (SwampLogger) – logger instance to record log messages
  • error (bool) – True if errors have occurred at some point during the clustering process
  • best_params (dict) – contains the optimal hyper-parameters found by grid_search()
  • labels (list) – a list with the cluster labels assigned to each member of the similarity_mtx
  • child_threads (list) – a list with the threading.Thread.name of each child threading.Thread instance used in grid_search()
  • rmsd_mtx (pandas.DataFrame) – square dataframe with the rmsd distance across fragments in the library
  • similarity_mtx (pandas.DataFrame) – square dataframe with the similarity across fragments in the library
  • nalign_mtx (pandas.DataFrame) – square dataframe with the no. of aligned residues between fragments in the library
  • centroid_dict (dict) – dictionary with the centroids associated with each cluster id
  • cluster_dict (dict) – dictionary with the list of fragments ids that form each cluster
  • composition_dict (dict) – dictionary with the final composition of each ensemble
  • ensemble_dict (dict) – dictionary with the list of fragment ids that form each ensemble
  • semaphore (threading.Semaphore) – a threading.Semaphore instance to control the execution of threads in grid_search()
  • gridsearch_results (ParameterSearchResults) – a ParameterSearchResults instance with the results obtained from grid_search()

Method to assess the quality of the obtained clustering

Parameters: labels (tuple) – the labels assigned to each fragment
Returns: a tuple with the no. of clusters, average cluster size, average cluster qscore, silhouette score and the no. of singletons
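
As a rough illustration of the label-based part of this summary, the sketch below derives the no. of clusters, the average cluster size and the no. of singletons from a tuple of labels. The helper name `summarise_labels` is hypothetical and not part of the library; the qscore and silhouette terms are left out because they depend on the distance data.

```python
from collections import Counter


def summarise_labels(labels):
    """Hypothetical helper: label-only statistics for a clustering."""
    sizes = Counter(labels)  # cluster id -> no. of members
    n_clusters = len(sizes)
    avg_size = len(labels) / n_clusters if n_clusters else 0.0
    # A singleton is a cluster that holds exactly one fragment
    n_singletons = sum(1 for size in sizes.values() if size == 1)
    return n_clusters, avg_size, n_singletons


stats = summarise_labels((0, 0, 1, 2, 2, 2))
print(stats)  # (3, 2.0, 1)
```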

Abstract method to run the clustering algorithm


The header to be displayed when initialising the logger


Square matrix that corresponds to 1 - distance_mtx
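
A minimal sketch of this relationship, assuming the distance matrix is a square pandas.DataFrame whose index and columns are the fragment ids. The ids and values below are made up for illustration.

```python
import pandas as pd

ids = ["frag_a", "frag_b", "frag_c"]  # hypothetical fragment ids
distance_mtx = pd.DataFrame(
    [[0.0, 0.25, 0.75],
     [0.25, 0.0, 0.5],
     [0.75, 0.5, 0.0]],
    index=ids,
    columns=ids,
)

# Element-wise complement: small distances become high similarities
similarity_mtx = 1 - distance_mtx
print(similarity_mtx.loc["frag_a", "frag_b"])  # 0.75
```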


Columns of the distance_mtx, which correspond with the list of fragments in the library


Get a dictionary with the centroids of each cluster in cluster_dict


Get the unique cluster identifier where a given centroid is assigned

Parameters: frag_id (str) – the id of the centroid of interest
Returns: the cluster id where the fragment can be found (str)

Get the unique cluster identifier where a given fragment is assigned

Parameters: frag_id (str) – the id of the fragment of interest
Returns: the cluster id where the fragment can be found (str)

get_cluster_dict(labels, inplace=True)[source]

Method to generate a cluster dictionary containing the cluster identifiers as keys and the fragment names as the values

  • labels (tuple) – the labels assigned to each fragment in the library
  • inplace (bool) – if True, then it sets cluster_dict (default True)

Returns: a dictionary containing the cluster id as keys and the frag ids as the values (if not inplace)
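
The mapping described above can be sketched as follows. This assumes the labels are given in the same order as the fragment ids; `build_cluster_dict` and the ids are hypothetical, and the sketch does not reproduce the inplace behaviour.

```python
def build_cluster_dict(labels, fragment_ids):
    """Hypothetical helper: group fragment ids by their cluster label."""
    cluster_dict = {}
    for frag_id, label in zip(fragment_ids, labels):
        # Create the list on first sight of a label, then append the member
        cluster_dict.setdefault(label, []).append(frag_id)
    return cluster_dict


fragment_ids = ["frag_a", "frag_b", "frag_c", "frag_d"]  # hypothetical ids
cluster_dict = build_cluster_dict((0, 1, 0, 1), fragment_ids)
print(cluster_dict)  # {0: ['frag_a', 'frag_c'], 1: ['frag_b', 'frag_d']}
```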

get_ensemble(centroid_id, rmsd_threshold=0.7, nalign_threshold=30, qscore_threshold=None, nthreads=1)[source]

Search for fragments to form an ensemble given a centroid identifier

Fragments within the threshold distance from the centroid will be included in the ensemble. Other centroids will be excluded from this search.

  • centroid_id (int) – identifier of the centroid of interest
  • rmsd_threshold (float) – the rmsd threshold at which fragments are included into the ensemble (default 0.7)
  • nalign_threshold (int) – threshold of aligned residues to include a fragment into an ensemble (default 30)
  • qscore_threshold (float) – qscore threshold at which fragments are included into the ensemble (default None)
  • nthreads (int) – number of threads to compute the ensemble optimal model alignment (default 1)

Returns: a gemmi.Structure hierarchy with the ensemble
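
The membership selection described above might look like the sketch below. It only picks fragment ids and does not build the gemmi.Structure. The helper name, the explicit `centroids` argument and the direction of the cut-offs (rmsd at or below the threshold, aligned residues at or above it) are assumptions, and the qscore threshold is not covered.

```python
import pandas as pd


def select_ensemble_members(centroid_id, rmsd_mtx, nalign_mtx, centroids,
                            rmsd_threshold=0.7, nalign_threshold=30):
    """Hypothetical helper: ids of fragments close enough to a centroid."""
    rmsd = rmsd_mtx[centroid_id]
    nalign = nalign_mtx[centroid_id]
    # Assumed direction of the cut-offs: low rmsd, high no. of aligned residues
    mask = (rmsd <= rmsd_threshold) & (nalign >= nalign_threshold)
    # Other centroids are left out of the search; the centroid itself stays
    others = [c for c in centroids if c != centroid_id]
    mask = mask & ~mask.index.isin(others)
    return sorted(mask[mask].index)


ids = ["cen_1", "cen_2", "frag_a", "frag_b"]  # hypothetical ids
rmsd_mtx = pd.DataFrame(
    [[0.0, 0.5, 0.4, 1.2],
     [0.5, 0.0, 0.9, 0.3],
     [0.4, 0.9, 0.0, 0.8],
     [1.2, 0.3, 0.8, 0.0]],
    index=ids, columns=ids,
)
nalign_mtx = pd.DataFrame(
    [[40, 35, 32, 38],
     [35, 40, 31, 36],
     [32, 31, 40, 20],
     [38, 36, 20, 40]],
    index=ids, columns=ids,
)

members = select_ensemble_members("cen_1", rmsd_mtx, nalign_mtx, ["cen_1", "cen_2"])
print(members)  # ['cen_1', 'frag_a']
```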

get_ensemble_dict(rmsd_threshold=0.7, nalign_threshold=30, qscore_threshold=None, nthreads=1)[source]

Merge similar fragments to create a dictionary with fragment ensembles (clustering with replacement)

The rmsd and no. of aligned residues thresholds are used by default; if a qscore threshold is given, it is used instead

  • rmsd_threshold (float) – the rmsd threshold at which fragments are included into the ensemble (default 0.7)
  • nalign_threshold (int) – threshold of aligned residues to include a fragment into an ensemble (default 30)
  • qscore_threshold (float) – qscore threshold at which fragments are included into the ensemble (default None)
  • nthreads (int) – number of threads to compute the ensemble optimal parameter (default 1)

Method to run a random grid search over the range of clustering hyper-parameters defined in _hyper_params
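
To illustrate the general idea of a multi-threaded random search throttled by a semaphore, here is a minimal sketch. It is not the library's implementation: the search space, the parameter names `alpha` and `beta`, and the `score` objective are all hypothetical stand-ins.

```python
import random
import threading

# Hypothetical search space; the real one is defined at _hyper_params
HYPER_PARAMS = {"alpha": [0.5, 0.7, 0.9], "beta": [-10, -5, 0]}


def score(params):
    """Stand-in objective; a real search would cluster and assess here."""
    return params["alpha"] + params["beta"] / 10


def random_search(n_iter=20, nthreads=2, seed=0):
    rng = random.Random(seed)
    semaphore = threading.Semaphore(nthreads)  # at most nthreads workers score at once
    lock = threading.Lock()                    # guards the shared results list
    results = []

    def worker(params):
        with semaphore:
            value = score(params)
            with lock:
                results.append((value, params))

    threads = []
    for _ in range(n_iter):
        # Draw one random combination from the search space
        params = {name: rng.choice(values) for name, values in HYPER_PARAMS.items()}
        thread = threading.Thread(target=worker, args=(params,))
        thread.start()
        threads.append(thread)
    for thread in threads:
        thread.join()

    best_params = max(results, key=lambda item: item[0])[1]
    return best_params, results


best_params, results = random_search()
print(best_params)
```

The semaphore is acquired inside each worker, so every thread is started up front but only `nthreads` of them evaluate a parameter set at any one time.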


Register a given SwampLibrary instance in order to load the fragment distance info

Parameters: library (SwampLibrary) – the SwampLibrary instance to be registered
Raises: TypeError – if library is not a SwampLibrary instance

class ParameterSearchResults(logger)[source]

Bases: object

Class to hold the results from a multi-threaded grid_search()

Implements Semaphore methods to regulate thread I/O into a result list


logger (SwampLogger) – logger instance to record thread log messages

  • lock (threading.Lock) – lock to control I/O to the result instance
  • value (pandas.DataFrame) – dataframe with the results of the grid search

Register a given set of new results into value

Parameters: new_results (pandas.DataFrame) – the set of new results to be registered
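
A lock-protected results holder of this kind could be sketched as below. The class name `SearchResults`, the `register` method name and the column names are hypothetical; the sketch only shows the pattern of guarding a shared pandas.DataFrame with a threading.Lock and omits the logger.

```python
import threading

import pandas as pd


class SearchResults:
    """Hypothetical stand-in for a thread-safe results holder."""

    def __init__(self):
        self.lock = threading.Lock()
        self.value = pd.DataFrame()

    def register(self, new_results):
        # Only one thread at a time may extend the shared dataframe
        with self.lock:
            if self.value.empty:
                self.value = new_results.reset_index(drop=True)
            else:
                self.value = pd.concat([self.value, new_results], ignore_index=True)


results = SearchResults()
workers = [
    threading.Thread(
        target=results.register,
        args=(pd.DataFrame({"n_clusters": [i], "score": [i / 10]}),),
    )
    for i in range(4)
]
for worker in workers:
    worker.start()
for worker in workers:
    worker.join()

print(len(results.value))  # 4
```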