Draft: Added logging of cluster assignments to wandb (!5) · Merge requests · FI LMI / protein-db

Terézia Slanináková requested to merge terka into dev Oct 25, 2023

3 new metrics added to wandb: `sim_proteins_found_in_chunk`, `cluster_accuracy_nn`, `cluster_accuracy_kmeans`.
How does it work? We have a single protein (A0A346LI80) with several hundred (~690) similar proteins (TM-score > 0.9):
- as soon as the first chunk has been clustered and trained on, we determine this protein's cluster assignment
- after each training chunk, we find the subset of the ~690 similar proteins present in the current chunk and compute the proportion of them assigned to the same cluster as A0A346LI80 (as opposed to a different cluster), according to both the NN (`cluster_accuracy_nn`) and K-Means (`cluster_accuracy_kmeans`)
- for broader context, we also log the number of similar proteins found in the chunk (`sim_proteins_found_in_chunk`), since the accuracy metrics should be interpreted differently depending on whether that number is 2 or 20
- the logs (not the graphs) also include the bins and counts of the similar proteins' cluster assignments, again for broader context
- see it in action: https://wandb.ai/protein-db/large-data-training/runs/32w7g6mg
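
The per-chunk evaluation described above could be sketched roughly as follows. This is a minimal illustration, not the code in this merge request: the function name `cluster_metrics`, its arguments, and the toy protein IDs are all hypothetical, and it assumes cluster assignments are available as a plain mapping from protein ID to cluster label.

```python
from collections import Counter


def cluster_metrics(reference_cluster, chunk_assignments, similar_ids):
    """Compare similar proteins' clusters with the reference protein's cluster.

    reference_cluster: cluster label of the reference protein (e.g. A0A346LI80)
    chunk_assignments: hypothetical mapping {protein_id: cluster_label} for one chunk
    similar_ids: IDs of the proteins known to be similar to the reference
    """
    # Only the similar proteins that actually occur in this chunk can be evaluated.
    found = [pid for pid in similar_ids if pid in chunk_assignments]

    # Bins and counts: how many similar proteins landed in each cluster.
    counts = Counter(chunk_assignments[pid] for pid in found)

    # Proportion of found similar proteins sharing the reference protein's cluster.
    # Undefined (None) when the chunk contains no similar proteins.
    accuracy = counts[reference_cluster] / len(found) if found else None

    return {
        "sim_proteins_found_in_chunk": len(found),
        "cluster_accuracy": accuracy,
        "cluster_counts": dict(counts),
    }


# Toy data (made up): three of the four similar proteins are in this chunk.
chunk = {"P1": 3, "P2": 3, "P3": 7, "P9": 1}
similar = {"P1", "P2", "P3", "P4"}
metrics = cluster_metrics(reference_cluster=3, chunk_assignments=chunk, similar_ids=similar)
print(metrics)
```

Under these assumptions, the function would be called once with the NN assignments and once with the K-Means assignments, and the resulting values passed to `wandb.log` under the metric names listed above.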

Edited Oct 25, 2023 by Terézia Slanináková
