Potential validation data contamination of label assigner
In grid search, I am using a portion of the validation data for one-split validation with `PredefinedSplit`. During the grid search, only the training data (`X_train`) are used in `.fit()`; however, afterwards all of the data (`X`) are used to re-train the `best_estimator_`. This may or may not be problematic.
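For reference, the setup described above can be sketched as follows. The estimator, data shapes, and hyperparameter grid here are illustrative assumptions, not the actual pipeline; the point is that with `refit=True` (the default), `GridSearchCV` re-trains `best_estimator_` on all of `X`, including the validation fold:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, PredefinedSplit

# Toy data standing in for the real X/y (shapes are assumptions).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=100)

# One-split validation: -1 marks rows that are always in training,
# 0 marks rows belonging to the single validation fold.
test_fold = np.full(100, -1)
test_fold[80:] = 0
split = PredefinedSplit(test_fold)

search = GridSearchCV(
    Ridge(),
    {"alpha": [0.1, 1.0, 10.0]},  # hypothetical grid
    cv=split,
    refit=True,  # after selection, best_estimator_ is re-fit on ALL of X,
                 # including the 20 validation rows -- the behavior discussed above
)
search.fit(X, y)
print(search.best_params_)
```

During the search itself, each candidate is fit only on the 80 training rows; only the final refit sees the full `X`.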
**Why it may not be problematic.** I am not using that part of the validation set anywhere after this selection pipeline; therefore, I am in no way testing on training data.
**Why it may be problematic.** I am using real-world data (10 examples) together with synthetic data (~1860 examples) to train a predictor. There is possibly a distribution shift between the two sets. Still, the amount of real-world data is so small that it probably does not make much of a difference.
For now, I will keep the predictor as is, with the re-train on the whole dataset (`X`). I am creating this issue to keep this in mind for the future.
Also note that the `test_rrmse` metric is not affected by this -- i.e., it still reflects true test performance, since that split was not included in `X`.