openml · PGijsbers · Aug 22, 2022 · Aug 22, 2022 · Aug 24, 2022 · Sep 9, 2022
diff --git a/docs/README.md b/docs/README.md
@@ -1,6 +1,6 @@
 # OpenML AutoML Benchmark
 
-The OpenML AutoML Benchmark provides a framework for evaluating and comparing open-source AutoML systems.  The system is *extensible* because you can [add your own](https://github.com/openml/automlbenchmark/blob/master/docs/extending.md) AutoML frameworks and datasets. For a thorough explanation of the benchmark, and evaluation of results, you can read our [paper](https://openml.github.io/automlbenchmark/paper.html) which was accepted at the [2019 ICML AutoML Workshop](https://sites.google.com/view/automl2019icml/).
+The OpenML AutoML Benchmark provides a framework for evaluating and comparing open-source AutoML systems. The system is *extensible* because you can [add your own](https://github.com/openml/automlbenchmark/blob/master/docs/extending.md) AutoML frameworks and datasets. For a thorough explanation of the benchmark, and evaluation of results, refer to our preprint [AMLB: an AutoML Benchmark](https://arxiv.org/pdf/2207.12560.pdf).
 
 _**NOTE:**_ _This benchmarking framework currently features binary and multiclass classification; extending to regression is a work in progress.  Please file an issue with any concerns/questions._
 
@@ -28,7 +28,7 @@ This toolkit aims to address these problems by setting up standardized environme
 Documentation: <https://openml.github.io/automlbenchmark/>
 
 ### Features:
-* Curated suites of [benchmarking datasets](https://openml.github.io/automlbenchmark/benchmark_datasets.html) from [OpenML](https://www.openml.org/s/218/data).
+* Curated suites of [benchmarking datasets](https://github.com/openml/automlbenchmark/blob/master/docs/benchmark_datasets.md) from [OpenML](https://www.openml.org/s/218/data).
 * Includes code to benchmark a number of [popular AutoML systems](https://openml.github.io/automlbenchmark/automl_overview.html) on regression and classification tasks.
 * [New AutoML systems can be added](./HOWTO.md#add-an-automl-framework)
 * Experiments can be run in Docker or Singularity containers

diff --git a/reports/CleanResults.ipynb b/reports/CleanResults.ipynb
@@ -0,0 +1,298 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "d3d4b7a3",
+   "metadata": {},
+   "source": [
+    "Given the raw result files, generate a final cleaned version of the results:\n",
+    "  - Keep only the latest results, excluding:\n",
+    "      - jobs which failed because of the benchmark framework and were redone, and\n",
+    "      - frameworks which were later excluded because issues were identified with the integration itself.\n",
+    "  - Transfer `RandomForest` results from 1 hour to 4 hours if the 1 hour jobs ran to completion.\n",
+    "  - Impute failed `TunedRandomForest` results with `RandomForest` results of the same budget.\n",
+    "  "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 17,
+   "id": "7954b9cd",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pandas as pd\n",
+    "\n",
+    "def filter_results(results):\n",
+    "    results = results.sort_values(by=\"utc\", na_position=\"first\")\n",
+    "    # There was a mistake in the old KDDCup09-Upselling task, so it was replaced with a new task.\n",
+    "    results = results[results[\"id\"] != \"openml.org/t/360947\"]\n",
+    "    # Use only the latest results (earlier failures don't count, only justified reruns are done)\n",
+    "    results = results.drop_duplicates([\"framework\", \"task\", \"fold\"], keep=\"last\")\n",
+    "    results = results[~results[\"framework\"].isin([\"autoxgboost\", \"GAMA\", \"MLPlanWEKA\"])]\n",
+    "    return results\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 18,
+   "id": "b9dfccc2",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Pick which results to clean; the last assignment wins.\n",
+    "ttype = \"regression\"\n",
+    "ttype = \"classification\""
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 19,
+   "id": "550aaf17",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "one = pd.read_csv(r\"http://openml-test.win.tue.nl/amlb/{}_1h8c.csv\".format(ttype))\n",
+    "one = filter_results(one)\n",
+    "four = pd.read_csv(r\"http://openml-test.win.tue.nl/amlb/{}_4h8c.csv\".format(ttype))\n",
+    "four = filter_results(four)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "a7fc8cc4",
+   "metadata": {},
+   "source": [
+    "\n",
+    "First, a sanity check that 1H RF has results for every job (even if it is not a full forest):"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 20,
+   "id": "a229de00",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "rf = one[one.framework == \"RandomForest\"]\n",
+    "assert len(rf) == (330 if ttype == \"regression\" else 710)\n",
+    "assert rf[\"info\"].isna().all()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "05f43033",
+   "metadata": {},
+   "source": [
+    "Impute one hour Tuned Random Forest with Random Forest:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 21,
+   "id": "f1a98a6e",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Original dataset with 8519 entries of which 272 had missing results.\n",
+      "The new  dataset with 8519 entries of which 252 are missing results because 20 results were imputed.\n"
+     ]
+    }
+   ],
+   "source": [
+    "def impute_trf_with_rf(results):\n",
+    "    rf = results[results.framework == \"RandomForest\"]\n",
+    "    trf = results[results.framework == \"TunedRandomForest\"]\n",
+    "    missing_results = trf[~trf[\"info\"].isna()][[\"task\", \"fold\"]].itertuples(index=False, name=None)\n",
+    "\n",
+    "    imputation_values = rf.set_index([\"task\", \"fold\"]).loc[missing_results].reset_index().copy()\n",
+    "    imputation_values[\"framework\"] = \"TunedRandomForest\"\n",
+    "\n",
+    "    trf_success = trf[trf[\"info\"].isna()]\n",
+    "    trf_imputed = pd.concat([trf_success, imputation_values])\n",
+    "\n",
+    "    no_trf = results[results.framework != \"TunedRandomForest\"]\n",
+    "    imputed = pd.concat([no_trf, trf_imputed])\n",
+    "    print(f\"Original dataset with {len(results)} entries of which {sum(~results['info'].isna())} had missing results.\")\n",
+    "    print(f\"The new  dataset with {len(imputed)} entries of which {sum(~imputed['info'].isna())} are missing results because {len(imputation_values)} results were imputed.\")\n",
+    "    return imputed\n",
+    "\n",
+    "one_imputed = impute_trf_with_rf(one)    "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "79cacc03",
+   "metadata": {},
+   "source": [
+    "Impute four hour Random Forest with complete one hour Random Forest:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 22,
+   "id": "7bde3af8",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Original dataset had   8570 entries of which     50 were 4H RF.\n",
+      "The new  dataset has   9230 entries of which    710  are 4H RF.\n"
+     ]
+    }
+   ],
+   "source": [
+    "rf_one = one[one.framework == \"RandomForest\"]\n",
+    "keep = rf_one[(rf_one[\"models_count\"] == 2000.0) & (~rf_one[\"result\"].isna())].copy()\n",
+    "keep[\"constraint\"] = \"4h8c_gp3\"\n",
+    "four_added = pd.concat([four, keep])\n",
+    "print(f\"Original dataset had {len(four):6d} entries of which {len(four[four.framework == 'RandomForest']):6d} were 4H RF.\")\n",
+    "print(f\"The new  dataset has {len(four_added):6d} entries of which {len(four_added[four_added.framework == 'RandomForest']):6d}  are 4H RF.\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "bf74eb01",
+   "metadata": {},
+   "source": [
+    "The file above is useful to avoid running 4H RF experiments that would grow forests of the same size as the 1H budget ones. It was also the result file passed to `--resume`, to automatically find the remaining 4H RF experiments."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "9ceb0e57",
+   "metadata": {},
+   "source": [
+    "After completing all RF results, we can use them to impute the `TunedRandomForest` where it otherwise failed."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 23,
+   "id": "b5fcdcf6",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Original dataset with 9230 entries of which 451 had missing results.\n",
+      "The new  dataset with 9230 entries of which 427 are missing results because 24 results were imputed.\n"
+     ]
+    }
+   ],
+   "source": [
+    "four_imputed = impute_trf_with_rf(four_added)  "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "15b8d68b",
+   "metadata": {},
+   "source": [
+    "We only need to perform the `constantpredictor` baseline once, since the result is deterministic regardless of time budget. We performed the set of experiments with a four hour time budget and transfer them to one hour:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 24,
+   "id": "d410ddc3",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "constants = four_imputed[four_imputed.framework == \"constantpredictor\"].copy()\n",
+    "constants[\"constraint\"] = \"1h8c_gp3\"\n",
+    "one_imputed = pd.concat([one_imputed, constants])"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 25,
+   "id": "ee5a7991",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# The constantpredictor experiments were run without the gp3 SSD, but we can rename the result.\n",
+    "four_imputed.loc[four_imputed.framework == \"constantpredictor\", \"constraint\"] = \"4h8c_gp3\""
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 26,
+   "id": "80869c02",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "final = pd.concat([one_imputed, four_imputed])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "76860af9",
+   "metadata": {},
+   "source": [
+    "A few sanity checks:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 27,
+   "id": "22acb1bf",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "if ttype == \"classification\":\n",
+    "    jobs = 71 * 10 * 13 * 2 - 1 # tasks * folds * frameworks * time budgets - known failures\n",
+    "else:\n",
+    "    jobs = 33 * 10 * 12 * 2 # autosklearn2 does not support regression\n",
+    "\n",
+    "assert len(final) == jobs\n",
+    "assert len(final) == len(final.drop_duplicates([\"framework\", \"task\", \"fold\", \"constraint\"]))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 28,
+   "id": "2cf9a9ce",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "final.to_csv(f\"{ttype}_all_cleaned.csv\", index=False)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "1b503d5f",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.8.3"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
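
For readers who want to follow the cleaning logic in `reports/CleanResults.ipynb` without downloading the result files, here is a minimal, dependency-free sketch of its two central steps: keeping only the latest row per `(framework, task, fold)` and imputing failed `TunedRandomForest` rows with the matching `RandomForest` row. The notebook itself uses pandas. The toy rows, their values, and the plain-dict representation are assumptions made for illustration. Unlike the notebook, the sketch also assumes every row has a `utc` timestamp.

```python
# Toy rows: the keys mirror the notebook's columns, the values are made up.
ROWS = [
    {"framework": "RandomForest", "task": "t1", "fold": 0,
     "utc": "2022-01-01T00:00:00", "result": 0.80, "info": None},
    {"framework": "RandomForest", "task": "t1", "fold": 1,
     "utc": "2022-01-01T00:00:00", "result": 0.82, "info": None},
    {"framework": "TunedRandomForest", "task": "t1", "fold": 0,
     "utc": "2022-01-01T00:00:00", "result": None, "info": "Timeout"},
    {"framework": "TunedRandomForest", "task": "t1", "fold": 1,
     "utc": "2022-01-01T00:00:00", "result": None, "info": "Crash"},
    {"framework": "TunedRandomForest", "task": "t1", "fold": 1,
     "utc": "2022-01-02T00:00:00", "result": 0.85, "info": None},
]


def keep_latest(rows):
    """Keep only the most recent row per (framework, task, fold)."""
    latest = {}
    # ISO 8601 timestamps sort chronologically as strings.
    for row in sorted(rows, key=lambda r: r["utc"]):
        # Later rows overwrite earlier ones with the same key.
        latest[(row["framework"], row["task"], row["fold"])] = row
    return list(latest.values())


def impute_trf_with_rf(rows):
    """Replace failed TunedRandomForest rows with the RandomForest row
    of the same task and fold, relabelled as TunedRandomForest."""
    rf = {(r["task"], r["fold"]): r
          for r in rows if r["framework"] == "RandomForest"}
    imputed = []
    for row in rows:
        # As in the notebook, a non-empty "info" field marks a failed job.
        failed = (row["framework"] == "TunedRandomForest"
                  and row["info"] is not None)
        if failed:
            # Raises KeyError if no RandomForest row exists for this job.
            row = {**rf[(row["task"], row["fold"])],
                   "framework": "TunedRandomForest"}
        imputed.append(row)
    return imputed


cleaned = impute_trf_with_rf(keep_latest(ROWS))
```

In this toy data the fold 1 `TunedRandomForest` crash is superseded by its later rerun, while the fold 0 timeout has no rerun and is therefore imputed from `RandomForest`.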