Draft: Added logging of cluster assignments to wandb
- 3 new metrics added to wandb: `sim_proteins_found_in_chunk`, `cluster_accuracy_nn`, `cluster_accuracy_kmeans`.
- How does it work? We have a single protein (`A0A346LI80`) with several hundred (~690) similar proteins (TM-score > 0.9).
  - As soon as the first chunk has been clustered and trained, we determine this protein's cluster assignment.
  - After each training chunk, we find the subset of the 690 similar proteins present in the current chunk and compute the fraction of them located in the cluster that `A0A346LI80` belongs to (versus those that are not), according to both NN (`cluster_accuracy_nn`) and K-Means (`cluster_accuracy_kmeans`).
  - For broader context, we also log the number of similar proteins found, `sim_proteins_found_in_chunk`, since the base metrics should be interpreted differently depending on whether it is 2 or 20.
- Also, in the logs (not in the graphs), we include the bins and counts of the cluster assignments of the similar proteins (again, to give broader context).
- See it in action: https://wandb.ai/protein-db/large-data-training/runs/32w7g6mg
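
To make the per-chunk computation described above concrete, here is a minimal sketch; it is not the code from this PR. The function name `similar_protein_metrics`, its arguments, and the toy protein IDs are hypothetical, and it assumes the chunk's cluster assignments are available as a mapping from protein ID to cluster label. The same function would be called once with the NN assignments and once with the K-Means assignments.

```python
from collections import Counter
from typing import Hashable, Mapping, Set


def similar_protein_metrics(
    chunk_assignments: Mapping[str, Hashable],
    similar_ids: Set[str],
    reference_cluster: Hashable,
) -> dict:
    """Summarize where the reference protein's similar proteins landed in one chunk.

    All names here are illustrative assumptions, not the PR's actual API.
    """
    # Cluster labels of the similar proteins that are present in this chunk.
    found = [label for pid, label in chunk_assignments.items() if pid in similar_ids]
    # Bins and counts of those cluster assignments.
    counts = Counter(found)
    # Fraction sharing the reference protein's cluster; None if none were found.
    accuracy = counts[reference_cluster] / len(found) if found else None
    return {
        "sim_proteins_found_in_chunk": len(found),
        "cluster_accuracy": accuracy,
        "cluster_counts": dict(counts),
    }


# Toy chunk: three of its four proteins belong to the similar set.
assignments = {"P1": 7, "P2": 7, "P3": 2, "P4": 5}
metrics = similar_protein_metrics(
    assignments, {"P1", "P2", "P3", "P9"}, reference_cluster=7
)
print(metrics)
```

The two scalar values could then be passed to `wandb.log` under the metric names listed above, while the counts would go to the regular logs. Returning `None` when no similar proteins are found is a design choice of this sketch, so that an empty chunk is not mistaken for an accuracy of zero.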