aws-samples
diff --git a/‎examples/Getting-started-emr-serverless.ipynb‎
Lines changed: 315 additions & 0 deletions b/‎examples/Getting-started-emr-serverless.ipynb‎
Lines changed: 315 additions & 0 deletions
@@ -0,0 +1,315 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "1f7439a3",
+   "metadata": {},
+   "source": [
+    "# Get started with EMR Serverless on EMR Studio"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "e283e844",
+   "metadata": {},
+   "source": [
+    "#### Topics covered in this example\n",
+    "<ol>\n",
+    "    <li> Configure a Spark session </li>\n",
+    "    <li> Import a library to help with plot </li>\n",
+    "    <li> Spark DataFrames: reading a public dataset, selecting data and writing to a S3 location </li>\n",
+    "    <li> Spark SQL: creating a new view and selecting data </li>\n",
+    "    <li> Visualize your data </li>\n",
+    "</ol>"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "d16c0e10",
+   "metadata": {
+    "execution": {
+     "iopub.execute_input": "2023-10-16T17:21:25.407818Z",
+     "iopub.status.busy": "2023-10-16T17:21:25.407393Z",
+     "iopub.status.idle": "2023-10-16T17:21:39.912554Z",
+     "shell.execute_reply": "2023-10-16T17:21:39.911928Z",
+     "shell.execute_reply.started": "2023-10-16T17:21:25.407789Z"
+    }
+   },
+   "source": [
+    "***\n",
+    "\n",
+    "## Prerequisites\n",
+    "<div class=\"alert alert-block alert-info\">\n",
+    "<b>NOTE :</b> In order to execute this notebook successfully as is, please ensure the following prerequisites are completed.</div>\n",
+    "\n",
+    "* EMR Serverless should be chosen as the Compute.\n",
+    "* Make sure the Studio user role has permission to attach the Workspace to the Application and to pass the runtime role to it.\n",
+    "* This notebook uses the `PySpark` kernel.\n",
+    "* Your Serverless Application must be configured with a VPC that has internet connectivity. [Learn more](https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/vpc-access.html)\n",
+    "***"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "8af6027b",
+   "metadata": {},
+   "source": [
+    "## 1. Configure your Spark session.\n",
+    "Configure the Spark Session to use Virtualenv. Virtualenv is needed to install other Python packages."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "24ce8423",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "%%configure -f\n",
+    "{\n",
+    "  \"conf\": {\n",
+    "    \"spark.pyspark.virtualenv.enabled\": \"true\",\n",
+    "    \"spark.pyspark.virtualenv.bin.path\": \"/usr/bin/virtualenv\",\n",
+    "    \"spark.pyspark.virtualenv.type\": \"native\",\n",
+    "    \"spark.pyspark.python\": \"/usr/bin/python3\",\n",
+    "    \"spark.executorEnv.PYSPARK_PYTHON\": \"/usr/bin/python3\"\n",
+    "  }\n",
+    "}"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b2165194",
+   "metadata": {},
+   "source": [
+    "Let's start a Spark session:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "d14c84e6",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "spark"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "1ea0659c",
+   "metadata": {},
+   "source": [
+    "Let's run the `%%info` magic command which shows the Spark configuration for the current session as well as provides links to navigate to the live Spark UI for the session:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "4148b249",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "%%info"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "02facc47",
+   "metadata": {},
+   "source": [
+    "---\n",
+    "## 2. Install packages from PyPI\n",
+    "We will install matplotlib Python package. \n",
+    "<div class=\"alert alert-block alert-info\">\n",
+    "<b>NOTE :</b> You will need internet access to do this step.</div>"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "bd12d484",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "sc.install_pypi_package(\"matplotlib\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "26a5516b",
+   "metadata": {},
+   "source": [
+    "---\n",
+    "## 3. Read data from S3\n",
+    "We will use a public data set on NYC yellow taxis. Read the Parquet file from S3. The file has headers and we want Spark to infer the schema. \n",
+    "<div class=\"alert alert-block alert-info\">\n",
+    "<b>NOTE :</b> You will need to update your runtime role to allow Get access to the s3://athena-examples-us-east-1/notebooks/ folder and its sub-folders.</div>"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "34e4291d",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "file_name = \"s3://athena-examples-us-east-1/notebooks/yellow_tripdata_2016-01.parquet\"\n",
+    "\n",
+    "taxi_df = (spark.read.format(\"parquet\").option(\"header\", \"true\") \\\n",
+    "           .option(\"inferSchema\", \"true\").load(file_name))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "8f910a35",
+   "metadata": {},
+   "source": [
+    "#### Use Spark Dataframe to group and count specific column from taxi_df"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "6c66389d",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "taxi1_df = taxi_df.groupBy(\"VendorID\", \"passenger_count\").count()\n",
+    "taxi1_df.show()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "afe654d5",
+   "metadata": {},
+   "source": [
+    "### Use the %%display magic to quickly visualize a dataframe\n",
+    "<ol>\n",
+    "    <li> You can choose to view the results in a table format. </li>\n",
+    "    <li> You can also choose to visualize your data with five types of charts. You can select the display type below and the chart will change accordingly. </li>\n",
+    "</ol>"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "a1649eed",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "%%display\n",
+    "taxi1_df"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "6f8a3889",
+   "metadata": {},
+   "source": [
+    "---\n",
+    "## 4. Run Spark SQL commands\n",
+    "#### Create a new temporary view taxis. Use Spark SQL to select data from this view. Create a taxi dataframe for further processing"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "d34e2a59",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "taxi_df.createOrReplaceTempView(\"taxis\")\n",
+    "\n",
+    "sqlDF = spark.sql(\n",
+    "    \"SELECT DOLocationID, sum(total_amount) as sum_total_amount \\\n",
+    "     FROM taxis where DOLocationID < 25 Group by DOLocationID ORDER BY DOLocationID\"\n",
+    ")\n",
+    "sqlDF.show(50)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ea77d28f",
+   "metadata": {},
+   "source": [
+    "Use %%sql magic"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "ecbeea32",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%sql\n",
+    "SHOW DATABASES"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "08a44bb0",
+   "metadata": {},
+   "source": [
+    "---\n",
+    "## 5. Visualize your data using Python \n",
+    "#### Use matplotlib to plot the drop off location and the total amount as a bar chart"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "fef525f5",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import matplotlib.pyplot as plt\n",
+    "import numpy as np\n",
+    "import pandas as pd\n",
+    "\n",
+    "plt.clf()\n",
+    "df = sqlDF.toPandas()\n",
+    "plt.bar(df.DOLocationID, df.sum_total_amount)\n",
+    "%matplot plt"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "0ec35ea5",
+   "metadata": {},
+   "source": [
+    "### You have made it to the end of the demo notebook!!"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "PySpark",
+   "language": "python",
+   "name": "spark_magic_pyspark"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "python",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "pyspark",
+   "pygments_lexer": "python3"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}