swamp.clustering.clustering module
-
class
Clustering
(nthreads=1, n_iter=100, logger=None) Bases:
abc.ABC
Abstract class for clustering the SWAMP library, used to group similar helical pairs into ensembles.
Implements data structures and methods commonly used in clustering tasks and in the search for optimal hyper-parameters.
Parameters: - nthreads (int) – number of threads to be used for fragment clustering (default 1)
- n_iter (int) – number of iterations for the
grid_search()
of the optimal clustering hyper-parameters (default 100)
- logger (SwampLogger) – logger instance to record log messages
Variables: - error (bool) – True if errors have occurred at some point during the clustering process
- best_params (dict) – contains the optimal hyper-parameters as found by the
grid_search()
- labels (list) – a list with the cluster labels assigned to each member of the
similarity_mtx
- child_threads (list) – a list with the
threading.thread.name
of each child threading.thread
instance used in the grid_search()
- rmsd_mtx (pandas.DataFrame) – square dataframe with the rmsd distance across fragments in the library
- similarity_mtx (pandas.DataFrame) – square dataframe with the similarity across fragments in the library
- nalign_mtx (pandas.DataFrame) – square dataframe with the no. of aligned residues between fragments in the library
- centroid_dict (dict) – dictionary with the centroids associated with each cluster id
- cluster_dict (dict) – dictionary with the list of fragment ids that form each cluster
- composition_dict (dict) – dictionary with the final composition of each ensemble
- ensemble_dict (dict) – dictionary with the list of fragment ids that form each ensemble
- semaphore (threading.Semaphore) – a threading.Semaphore instance to control the execution of threads on
grid_search()
- gridsearch_results (ParameterSearchResults) – a
ParameterSearchResults
instance with the results obtained on grid_search()
-
assess_clustering
(labels) Method to assess the quality of the obtained clustering
Parameters: labels (tuple) – the labels assigned to each fragment Returns: a tuple with the number of clusters, average cluster size, average cluster qscore, silhouette score and the number of singletons
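As a rough illustration of the size-based entries in that tuple, the sketch below derives the number of clusters, the average cluster size and the number of singletons from a sequence of labels. The helper name `summarise_labels` is hypothetical and is not part of SWAMP. Treating every distinct label as a cluster (no special noise label) is an assumption. The qscore and silhouette entries are left out because they depend on the similarity data.

```python
from collections import Counter


def summarise_labels(labels):
    """Hypothetical helper: size statistics for a flat sequence of cluster labels."""
    sizes = Counter(labels)  # cluster label -> number of members
    n_clusters = len(sizes)
    avg_size = sum(sizes.values()) / n_clusters if n_clusters else 0.0
    # A singleton is a cluster with exactly one member
    n_singletons = sum(1 for size in sizes.values() if size == 1)
    return n_clusters, avg_size, n_singletons


# Made-up labels: three clusters, one of them a singleton
stats = summarise_labels((0, 0, 1, 2, 2, 2))
print(stats)
```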
-
clustering_header
The header to be displayed when initialising the logger
-
distance_mtx
Square matrix that corresponds to 1 -
similarity_mtx
-
fragment_list
Columns of the
distance_mtx
, which correspond to the list of fragments in the library
-
get_centroid_id
(frag_id) Get the unique identifier of the cluster to which a given centroid is assigned
Parameters: frag_id (str) – the id of the centroid of interest Returns: the cluster id where the fragment can be found (str)
-
get_clst_id
(frag_id) Get the unique identifier of the cluster to which a given fragment is assigned
Parameters: frag_id (str) – the id of the fragment of interest Returns: the cluster id where the fragment can be found (str)
-
get_cluster_dict
(labels, inplace=True) Method to generate a cluster dictionary containing the cluster identifiers as keys and the fragment names as the values
Parameters: Returns: a dictionary containing the cluster id as keys and the frag ids as the values (if not inplace)
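The mapping described here can be pictured with a minimal sketch. It assumes the labels are given in the same order as the fragments, and both helper names (`build_cluster_dict`, `find_cluster_id`) are hypothetical stand-ins rather than SWAMP functions. The second helper shows the kind of reverse lookup that a method such as get_clst_id() would need.

```python
from collections import defaultdict


def build_cluster_dict(labels, fragment_ids):
    """Hypothetical helper: map each cluster id to the fragment ids it contains."""
    clusters = defaultdict(list)
    # Assumption: labels[i] is the cluster assigned to fragment_ids[i]
    for frag_id, label in zip(fragment_ids, labels):
        clusters[label].append(frag_id)
    return dict(clusters)


def find_cluster_id(cluster_dict, frag_id):
    """Return the id of the cluster holding frag_id, or None if it is absent."""
    for clst_id, members in cluster_dict.items():
        if frag_id in members:
            return clst_id
    return None


# Made-up fragment ids, for illustration only
cluster_dict = build_cluster_dict([0, 1, 0], ["frag_a", "frag_b", "frag_c"])
print(cluster_dict)
print(find_cluster_id(cluster_dict, "frag_c"))
```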
-
get_ensemble
(centroid_id, rmsd_threshold=0.7, nalign_threshold=30, qscore_threshold=None, nthreads=1) Search for fragments to form an ensemble given a centroid identifier
Fragments within the threshold distance from the centroid will be included in the ensemble. Other centroids will be excluded from this search.
Parameters: - centroid_id (int) – identifier of the centroid of interest
- rmsd_threshold (float) – the rmsd threshold at which fragments are included into the ensemble (default 0.7)
- nalign_threshold (int) – threshold of aligned residues to include a fragment into an ensemble (default 30)
- qscore_threshold (float) – qscore threshold at which fragments are included into the ensemble (default None)
- nthreads (int) – number of threads to compute the ensemble optimal model alignment (default 1)
Returns: a
gemmi.Structure
hierarchy with the ensemble
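The member selection step might look roughly like the sketch below. It covers only the choice of fragment ids, not the construction of the returned structure. The helper `select_ensemble_members` and its dictionary inputs are hypothetical. The direction of each comparison (rmsd at or below its threshold, aligned residues and qscore at or above theirs) is an assumption, as is letting a qscore threshold replace the other two.

```python
def select_ensemble_members(centroid_id, rmsd_row, nalign_row, centroid_ids,
                            rmsd_threshold=0.7, nalign_threshold=30,
                            qscore_row=None, qscore_threshold=None):
    """Hypothetical helper: pick the fragment ids that join a centroid's ensemble.

    Each *_row maps a fragment id to its value measured against the centroid.
    """
    members = [centroid_id]
    for frag_id in rmsd_row:
        # Skip the centroid itself and every other centroid
        if frag_id == centroid_id or frag_id in centroid_ids:
            continue
        if qscore_threshold is not None:
            # Assumption: a qscore threshold, when given, replaces the other two
            keep = qscore_row[frag_id] >= qscore_threshold
        else:
            # Assumption: low rmsd and enough aligned residues are both required
            keep = (rmsd_row[frag_id] <= rmsd_threshold
                    and nalign_row[frag_id] >= nalign_threshold)
        if keep:
            members.append(frag_id)
    return members


# Made-up values against centroid "c1"; "c2" is another centroid
rmsd_row = {"c1": 0.0, "c2": 0.5, "f1": 0.4, "f2": 0.9, "f3": 0.6}
nalign_row = {"c1": 40, "c2": 35, "f1": 32, "f2": 38, "f3": 20}
members = select_ensemble_members("c1", rmsd_row, nalign_row, {"c1", "c2"})
print(members)
```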
-
get_ensemble_dict
(rmsd_threshold=0.7, nalign_threshold=30, qscore_threshold=None, nthreads=1) Merge similar fragments to create a dictionary with fragment ensembles (clustering with replacement)
Thresholds on the rmsd and the no. of aligned residues are used by default; if a qscore threshold is given, it is used instead
Parameters: - rmsd_threshold (float) – the rmsd threshold at which fragments are included into the ensemble (default 0.7)
- nalign_threshold (int) – threshold of aligned residues to include a fragment into an ensemble (default 30)
- qscore_threshold (float) – qscore threshold at which fragments are included into the ensemble (default None)
- nthreads (int) – number of threads to compute the ensemble optimal parameter (default 1)
-
grid_search
() Method to run a random grid search over the range of clustering hyper-parameters defined at
_hyper_params
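The general pattern of a threaded parameter search bounded by a semaphore can be sketched as follows. This is not the SWAMP implementation: `run_search`, `toy_score` and the parameter names `a` and `b` are hypothetical, and sampling one random value per parameter on each iteration is an assumption about what a random grid search does here.

```python
import random
import threading


def run_search(param_grid, score_fn, n_iter=10, nthreads=2, seed=0):
    """Hypothetical sketch: sample parameter sets and score them in worker threads."""
    rng = random.Random(seed)
    semaphore = threading.Semaphore(nthreads)  # at most nthreads workers score at once
    lock = threading.Lock()                    # guards the shared results list
    results = []

    def worker(params):
        with semaphore:
            score = score_fn(**params)
        with lock:
            results.append((score, params))

    threads = []
    for _ in range(n_iter):
        # Draw one value per hyper-parameter from its list of candidates
        params = {name: rng.choice(values) for name, values in param_grid.items()}
        thread = threading.Thread(target=worker, args=(params,))
        thread.start()
        threads.append(thread)
    for thread in threads:
        thread.join()

    # Keep the best-scoring parameter set (higher is better in this sketch)
    return max(results, key=lambda item: item[0])


def toy_score(a, b):
    """Made-up objective with its maximum at a=2, b=1."""
    return -(a - 2) ** 2 - (b - 1) ** 2


best_score, best_params = run_search({"a": [1, 2, 3], "b": [0, 1]}, toy_score)
print(best_score, best_params)
```

Every worker thread is started up front and waits on the semaphore, so the number of threads doing work at any moment never exceeds `nthreads`.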
-
register_library
(library) Register a given
SwampLibrary
instance in order to load the fragment distance info
Parameters: library (SwampLibrary) – the SwampLibrary instance to be registered
Raises: TypeError – if library is not a SwampLibrary instance
-
class
ParameterSearchResults
(logger) Bases:
object
Class to hold the results from a multi-threaded
grid_search()
Implements
Semaphore
methods to regulate thread I/O into a result list
Parameters: logger (SwampLogger) – logger instance to record thread log messages
Variables: - lock (threading.lock) – lock to control I/O to the result instance
- value (pandas.DataFrame) – dataframe with the results of the grid search
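A lock-guarded result container of this kind can be illustrated with a minimal sketch. The class `ResultStore` and its `store` method are hypothetical, and a plain list stands in for the dataframe so the example needs only the standard library.

```python
import threading


class ResultStore:
    """Hypothetical stand-in for a thread-safe container of search results."""

    def __init__(self):
        self.lock = threading.Lock()
        self.rows = []  # a plain list replaces the dataframe in this sketch

    def store(self, row):
        # Only one thread at a time may modify the shared list
        with self.lock:
            self.rows.append(row)


store = ResultStore()
threads = [threading.Thread(target=store.store, args=({"run": i},)) for i in range(8)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
print(len(store.rows))
```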