Adding Perrone example for building surrogate#832
Conversation
|
@mfeurer might need some more textual descriptions, but I presume that the person going through this example already is aware of the paper. |
| a tabular format that can be used to build models. | ||
| """ | ||
|
|
||
| def fetch_evaluations(run_full=False, flow_type='svm', metric = 'area_under_roc_curve'): |
There was a problem hiding this comment.
Could you please remove the whitespace around the equals for metric=...?
| return eval_df, task_ids, flow_id | ||
|
|
||
|
|
||
| def create_table_from_evaluations(eval_df, flow_type='svm', run_count=np.iinfo(np.int64).max, |
There was a problem hiding this comment.
Could you please put each argument into its own line?
|
|
||
|
|
||
| def create_table_from_evaluations(eval_df, flow_type='svm', run_count=np.iinfo(np.int64).max, | ||
| metric = 'area_under_roc_curve', task_ids=None): |
| values : list | ||
| ''' | ||
| if task_ids is not None: | ||
| eval_df = eval_df.loc[eval_df.task_id.isin(task_ids)] |
There was a problem hiding this comment.
What about eval_df.query()? I usually find this easier to read where possible.
| ''' | ||
| if task_ids is not None: | ||
| eval_df = eval_df.loc[eval_df.task_id.isin(task_ids)] | ||
| ncols = 4 if flow_type == 'svm' else 10 # ncols determine the number of hyperparameters |
There was a problem hiding this comment.
That line is duplicated in the if/else statement below.
| # Replacing NaNs with fixed values outside the range of the parameters | ||
| # given in the supplement material of the paper | ||
| if flow_type == 'svm': | ||
| eval_table.kernel.fillna("None", inplace=True) |
There was a problem hiding this comment.
Could you please not use the syntax of indexing the columns of the array like they were attributes? This is likely to be removed from pandas in the future (and kind of confusing attributes of the dataframe object with the content of the dataframe).
| eval_table = impute_missing_values(eval_table, flow_type) | ||
| # Encode categorical variables as one-hot vectors | ||
| enc = OneHotEncoder(handle_unknown='ignore') | ||
| enc.fit(eval_table.kernel.to_numpy().reshape(-1, 1)) |
|
|
||
|
|
||
| def preprocess(eval_table, flow_type='svm'): | ||
| eval_table = impute_missing_values(eval_table, flow_type) |
There was a problem hiding this comment.
Could you please construct a scikit-learn pipeline? That could then be easily used for also predicting for new hyperparameter settings with calling a single function.
|
|
||
| eval_df, task_ids, flow_id = fetch_evaluations(run_full=False) | ||
| X, y = create_table_from_evaluations(eval_df, run_count=1000) | ||
| X = preprocess(X) |
There was a problem hiding this comment.
Could you please add a print statement here so that the user sees the data format?
There was a problem hiding this comment.
Also, could you please relate more to the paper? Is this the metadata for the meta-tasks in the paper?
Codecov Report
@@ Coverage Diff @@
## develop #832 +/- ##
==========================================
Coverage ? 89.78%
==========================================
Files ? 36
Lines ? 5118
Branches ? 0
==========================================
Hits ? 4595
Misses ? 523
Partials ? 0Continue to review full report at Codecov.
|
|
@Neeratyoy could you please have a look at my changes? Apparently, the flake8 checks only fail on the push, not the pr check, so we should be fine. Please feel free to merge if you agree with my changes. |
What does this PR implement/fix? Explain your changes.
Shows an example of Perrone et al's use of previous OpenML runs to build a surrogate model.