Draft: Added logging of cluster assignments to wandb

Terézia Slanináková requested to merge terka into dev
  • 3 new metrics added to wandb: `sim_proteins_found_in_chunk`, `cluster_accuracy_nn`, `cluster_accuracy_kmeans`.
  • How does it work? We have a single protein (A0A346LI80) with several hundred (~690) similar proteins (TM-score > 0.9)
    • as soon as we have the first chunk clustered and trained, we want to figure out this protein's cluster assignment

    • after each training chunk, we find the subset of the ~690 similar proteins that is present in the current chunk and evaluate what portion of these proteins fall in the cluster that A0A346LI80 belongs to versus those that do not, according to both the NN (`cluster_accuracy_nn`) and K-Means (`cluster_accuracy_kmeans`)

    • for broader context, we also log the number of similar proteins found in the chunk as `sim_proteins_found_in_chunk` (the base metrics should be interpreted differently depending on whether it is 2 or 20)

    • also in the logs (not in the graphs), we include the bins and counts of the cluster assignments of the similar proteins (again, to give broader context)

    • see it in action: https://wandb.ai/protein-db/large-data-training/runs/32w7g6mg
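
A minimal sketch of how the per-chunk metrics described above could be computed. This is not the code from this merge request: the helper `cluster_accuracy`, the toy protein IDs, and the assumption that NN and K-Means each give the reference protein its own cluster label are all hypothetical and only serve as illustration.

```python
from collections import Counter


def cluster_accuracy(chunk_ids, assignments, query_cluster, similar_ids):
    """Return (accuracy, n_found, bins) for one training chunk.

    accuracy: fraction of the similar proteins present in the chunk that
              share the reference protein's cluster (None if none are present)
    n_found:  number of similar proteins present in the chunk
    bins:     Counter mapping cluster label -> count among the found proteins
    """
    similar = set(similar_ids)
    # Cluster labels of the similar proteins that occur in this chunk.
    found = [c for pid, c in zip(chunk_ids, assignments) if pid in similar]
    if not found:
        return None, 0, Counter()
    hits = sum(1 for c in found if c == query_cluster)
    return hits / len(found), len(found), Counter(found)


# Toy data (hypothetical): 5 proteins in the chunk, 3 of them are "similar".
chunk_ids = ["P1", "P2", "P3", "P4", "P5"]
similar_ids = {"P1", "P2", "P4", "P9"}   # P9 is not in this chunk
nn_assignments = [3, 3, 0, 1, 3]         # cluster per protein according to NN
kmeans_assignments = [3, 1, 0, 1, 2]     # cluster per protein according to K-Means
query_cluster_nn = 3                     # assumed cluster of the reference protein (NN)
query_cluster_kmeans = 3                 # assumed cluster of the reference protein (K-Means)

acc_nn, n_found, bins_nn = cluster_accuracy(
    chunk_ids, nn_assignments, query_cluster_nn, similar_ids
)
acc_kmeans, _, bins_kmeans = cluster_accuracy(
    chunk_ids, kmeans_assignments, query_cluster_kmeans, similar_ids
)

metrics = {
    "sim_proteins_found_in_chunk": n_found,
    "cluster_accuracy_nn": acc_nn,
    "cluster_accuracy_kmeans": acc_kmeans,
}
# In a training loop this dict could be passed to wandb.log(metrics);
# the bins would go to the regular logs rather than to the graphs.
print(metrics)
print("NN bins:", dict(bins_nn), "K-Means bins:", dict(bins_kmeans))
```

Returning `None` when no similar protein is present keeps an empty chunk from being reported as 0% accuracy, which is why the count is logged alongside the two ratios.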

Edited by Terézia Slanináková