Potential validation data contamination of label assigner
In grid search, I am using a portion of the validation data for one-split validation with `PredefinedSplit`. During the grid search, only the training data (`X_train`) are used in `.fit()`; however, afterwards all of the data (`X`) are used to re-train the `best_estimator_`. This may or may not be problematic.
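For reference, the setup described above can be sketched as follows. The estimator, data shapes, and hyperparameter grid here are illustrative assumptions, not the actual pipeline; the point is that with `refit=True` (the default), `GridSearchCV` re-trains `best_estimator_` on all of `X`, including the validation fold:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, PredefinedSplit

# Toy data standing in for the real X/y (shapes are assumptions).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=100)

# One-split validation: -1 marks rows that are always in training,
# 0 marks rows belonging to the single validation fold.
test_fold = np.full(100, -1)
test_fold[80:] = 0
split = PredefinedSplit(test_fold)

search = GridSearchCV(
    Ridge(),
    {"alpha": [0.1, 1.0, 10.0]},  # hypothetical grid
    cv=split,
    refit=True,  # after selection, best_estimator_ is re-fit on ALL of X,
                 # including the 20 validation rows -- the behavior discussed above
)
search.fit(X, y)
print(search.best_params_)
```

During the search itself, each candidate is fit only on the 80 training rows; only the final refit sees the full `X`.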
**Why it may not be problematic.** I am not using that part of the validation set anywhere after this selection pipeline; therefore, I am in no way testing on training data.
**Why it may be problematic.** I am using real-world data (10 examples) together with synthetic data (~1860 examples) to train a predictor. There is possibly a distribution shift between the two sets. Still, the amount of real-world data is so small that it probably does not make much of a difference.
For now, I will keep the predictor as is, with the re-train on the whole dataset (`X`). I am creating this issue to keep this in mind for the future.
Also note that the `test_rrmse` metric is not affected by this -- i.e., it still reflects true test performance, since that split was not included in `X`.