swamp.clustering.clustering module

class Clustering(nthreads=1, n_iter=100, logger=None)[source]

Bases: abc.ABC

Abstract class for SWAMP library clustering purposes, used to cluster similar helical pairs into ensembles

Implements data structures and methods commonly used in clustering tasks and optimal hyper-parameter search

Parameters:
  • nthreads (int) – number of threads to be used for fragment clustering (default 1)
  • n_iter (int) – number of iterations for the grid_search() of the optimal clustering hyper-parameters (default 100)
  • logger (SwampLogger) – logger instance to record log messages
Variables:
  • error (bool) – True if errors have occurred at some point during the clustering process
  • best_params (dict) – contains the optimal hyper-parameters found by grid_search()
  • labels (list) – a list with the cluster labels assigned to each member of the similarity_mtx
  • child_threads (list) – a list with the threading.Thread.name of each child threading.Thread instance used by grid_search()
  • rmsd_mtx (pandas.DataFrame) – square dataframe with the rmsd distance across fragments in the library
  • similarity_mtx (pandas.DataFrame) – square dataframe with the similarity across fragments in the library
  • nalign_mtx (pandas.DataFrame) – square dataframe with the no. of aligned residues between fragments in the library
  • centroid_dict (dict) – dictionary with the centroids associated with each cluster id
  • cluster_dict (dict) – dictionary with the list of fragments ids that form each cluster
  • composition_dict (dict) – dictionary with the final composition of each ensemble
  • ensemble_dict (dict) – dictionary with the list of fragment ids that form each ensemble
  • semaphore (threading.Semaphore) – a threading.Semaphore instance to control the execution of threads on grid_search()
  • gridsearch_results (ParameterSearchResults) – a ParameterSearchResults instance with the results obtained on grid_search()
assess_clustering(labels)[source]

Method to assess the quality of the obtained clustering

Parameters:labels (tuple) – the labels assigned to each fragment
Returns:a tuple with no. clusters, average cluster size, average cluster qscore, silhouette score and the no. singletons
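
The label-only part of such an assessment can be sketched as follows. This is a minimal illustration, not the SWAMP implementation: summarise_labels is a hypothetical helper, every label is assumed to be a cluster id (no noise label), and the qscore and silhouette terms are left out because they need the similarity data.

```python
from collections import Counter


def summarise_labels(labels):
    """Return (n_clusters, avg_cluster_size, n_singletons) for a label sequence.

    Hypothetical helper, not part of SWAMP. Every label is treated as a
    cluster id; no label is treated as noise.
    """
    sizes = Counter(labels)  # cluster id -> number of members
    n_clusters = len(sizes)
    n_singletons = sum(1 for size in sizes.values() if size == 1)
    avg_size = len(labels) / n_clusters if n_clusters else 0.0
    return n_clusters, avg_size, n_singletons


# Three clusters with sizes 2, 1 and 3
summary = summarise_labels((0, 0, 1, 2, 2, 2))
```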
cluster()[source]

Abstract method to run the clustering algorithm
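
Since cluster() is abstract, a concrete subclass has to provide it. The sketch below shows the general abc pattern only; BaseClustering and SingleCluster are hypothetical stand-ins, not SWAMP classes, and the trivial "algorithm" is there purely to make the example runnable.

```python
from abc import ABC, abstractmethod


class BaseClustering(ABC):
    """Hypothetical stand-in for an abstract clustering base class."""

    def __init__(self):
        self.labels = None

    @abstractmethod
    def cluster(self):
        """Run the clustering algorithm and set self.labels."""


class SingleCluster(BaseClustering):
    """Toy subclass that assigns every fragment to cluster 0."""

    def __init__(self, fragment_ids):
        super().__init__()
        self.fragment_ids = list(fragment_ids)

    def cluster(self):
        self.labels = [0] * len(self.fragment_ids)


demo = SingleCluster(["frag_a", "frag_b"])
demo.cluster()
```

Instantiating the base class directly raises TypeError, because cluster() has no implementation there.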

clustering_header

The header to be displayed when initialising the logger

distance_mtx

Square matrix that corresponds to 1 - similarity_mtx
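
Assuming the distance is the complement of the similarity, the conversion is an element-wise subtraction. The sketch below uses plain nested lists instead of a pandas.DataFrame, and similarity_to_distance is a hypothetical helper name.

```python
def similarity_to_distance(similarity_rows):
    """Return 1 - similarity for every cell of a square matrix (list of rows)."""
    return [[1.0 - value for value in row] for row in similarity_rows]


# Two fragments with a pairwise similarity of 0.75
similarity = [[1.0, 0.75], [0.75, 1.0]]
distance = similarity_to_distance(similarity)
```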

fragment_list

Columns of the distance_mtx, which correspond to the list of fragments in the library

get_centroid_dict()[source]

Get a dictionary with the centroid of each cluster in cluster_dict

get_centroid_id(frag_id)[source]

Get the unique cluster identifier where a given centroid is assigned

Parameters:frag_id (str) – the id of the centroid of interest
Returns:the cluster id where the fragment can be found (str)
get_clst_id(frag_id)[source]

Get the unique cluster identifier where a given fragment is assigned

Parameters:frag_id (str) – the id of the fragment of interest
Returns:the cluster id where the fragment can be found (str)
get_cluster_dict(labels, inplace=True)[source]

Method to generate a cluster dictionary containing the cluster identifiers as keys and the fragment names as the values

Parameters:
  • labels (tuple) – the labels assigned to each fragment in the library
  • inplace (bool) – if True, then it sets cluster_dict (default True)
Returns:

a dictionary containing the cluster id as keys and the frag ids as the values (if not inplace)
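
The mapping described above can be sketched by grouping fragment ids by label. This is an illustration under assumptions: build_cluster_dict is a hypothetical helper, the fragment ids are made up, and labels are assumed to be in the same order as the fragment list.

```python
def build_cluster_dict(labels, fragment_ids):
    """Group fragment ids by their cluster label.

    Hypothetical helper; assumes labels[i] belongs to fragment_ids[i].
    """
    clusters = {}
    for frag_id, label in zip(fragment_ids, labels):
        clusters.setdefault(label, []).append(frag_id)
    return clusters


example = build_cluster_dict((0, 1, 0), ("frag_a", "frag_b", "frag_c"))
```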

get_ensemble(centroid_id, rmsd_threshold=0.7, nalign_threshold=30, qscore_threshold=None, nthreads=1)[source]

Search for fragments to form an ensemble given a centroid identifier

Fragments within the threshold distance from the centroid will be included in the ensemble. Other centroids will be excluded from this search.

Parameters:
  • centroid_id (int) – centroid identifier of the centroid of interest
  • rmsd_threshold (float) – the rmsd threshold at which fragments are included into the ensemble (default 0.7)
  • nalign_threshold (int) – threshold of aligned residues to include a fragment into an ensemble (default 30)
  • qscore_threshold (float) – qscore threshold at which fragments are included into the ensemble (default None)
  • nthreads (int) – number of threads to compute the ensemble optimal model alignment (default 1)
Returns:

a gemmi.Structure hierarchy with the ensemble

get_ensemble_dict(rmsd_threshold=0.7, nalign_threshold=30, qscore_threshold=None, nthreads=1)[source]

Merge similar fragments to create a dictionary with fragment ensembles (clustering with replacement)

Thresholds on the rmsd and the no. of aligned residues can be used together, but if a qscore threshold is given, it is used instead

Parameters:
  • rmsd_threshold (float) – the rmsd threshold at which fragments are included into the ensemble (default 0.7)
  • nalign_threshold (int) – threshold of aligned residues to include a fragment into an ensemble (default 30)
  • qscore_threshold (float) – qscore threshold at which fragments are included into the ensemble (default None)
  • nthreads (int) – number of threads to compute the ensemble optimal parameter (default 1)
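
The precedence between thresholds described above can be sketched as a single predicate. The comparison directions (rmsd at or below its threshold, aligned residues and qscore at or above theirs) are assumptions made for this illustration, and passes_thresholds is a hypothetical helper, not a SWAMP function.

```python
def passes_thresholds(rmsd, nalign, qscore,
                      rmsd_threshold=0.7, nalign_threshold=30,
                      qscore_threshold=None):
    """Decide whether a fragment would join an ensemble.

    Assumed semantics: a qscore threshold, when given, replaces the
    rmsd / aligned-residue criteria entirely.
    """
    if qscore_threshold is not None:
        return qscore >= qscore_threshold
    return rmsd <= rmsd_threshold and nalign >= nalign_threshold


default_check = passes_thresholds(rmsd=0.5, nalign=35, qscore=0.1)
qscore_check = passes_thresholds(rmsd=0.5, nalign=35, qscore=0.1,
                                 qscore_threshold=0.5)
```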

Method to run a random grid search (grid_search()) over the range of clustering hyper-parameters defined in _hyper_params
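
A random search over hyper-parameter ranges, with a semaphore limiting how many evaluations run at once, might look like the sketch below. It is an illustration under assumptions, not the SWAMP implementation: random_search, toy_score and the parameter names x and y are hypothetical, and candidates are drawn uniformly from each range.

```python
import random
import threading


def random_search(param_ranges, score_fn, n_iter=100, nthreads=1, seed=0):
    """Sample n_iter parameter sets at random and return the best (score, params)."""
    rng = random.Random(seed)
    semaphore = threading.Semaphore(nthreads)  # caps concurrent evaluations
    results = [None] * n_iter  # one slot per thread, so no shared appends

    def worker(index, params):
        with semaphore:
            results[index] = (score_fn(**params), params)

    threads = []
    for index in range(n_iter):
        # Draw one value per hyper-parameter from its candidate list
        params = {name: rng.choice(values) for name, values in param_ranges.items()}
        thread = threading.Thread(target=worker, args=(index, params))
        threads.append(thread)
        thread.start()
    for thread in threads:
        thread.join()
    return max(results, key=lambda item: item[0])


def toy_score(x, y):
    """Toy objective with its maximum at x == 2."""
    return -abs(x - 2) - y


best_score, best_params = random_search(
    {"x": [1, 2, 3], "y": [0]}, toy_score, n_iter=20, nthreads=2
)
```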

register_library(library)[source]

Register a given SwampLibrary instance in order to load the fragment distance info

Parameters:library (SwampLibrary) – the SwampLibrary instance to be registered
Raises:TypeError – if library is not a SwampLibrary instance
class ParameterSearchResults(logger)[source]

Bases: object

Class to hold the results from a multi-threaded grid_search()

Implements Semaphore methods to regulate thread I/O into a result list

Parameters:

logger (SwampLogger) – logger instance to record thread log messages

Variables:
  • lock (threading.Lock) – lock to control I/O to the result instance
  • value (pandas.DataFrame) – dataframe with the results of the grid search
register(new_results)[source]

Register a given set of new results into value

Parameters:new_results (pandas.DataFrame) – the set of new results to be registered
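
The idea of lock-guarded registration can be sketched as follows. ResultStore is a hypothetical stand-in, not the class documented here: it keeps rows in a plain list rather than a pandas.DataFrame, and the row fields are made up for the example.

```python
import threading


class ResultStore:
    """Hypothetical lock-guarded results holder (not the SWAMP class)."""

    def __init__(self):
        self.lock = threading.Lock()
        self.value = []  # plain list of rows instead of a DataFrame

    def register(self, new_results):
        # Only one thread at a time may extend the shared result list
        with self.lock:
            self.value.extend(new_results)


store = ResultStore()
threads = [
    threading.Thread(target=store.register, args=([{"run": i, "score": i * 0.1}],))
    for i in range(8)
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
```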