
Commit 58ae326

Add EMR Serverless Notebook examples (#34)
* Add EMR Serverless Notebook examples
* Address review comments
1 parent d4faeb9 commit 58ae326

5 files changed: 1,087 additions & 0 deletions

This file: 315 additions & 0 deletions
@@ -0,0 +1,315 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "1f7439a3",
   "metadata": {},
   "source": [
    "# Get started with EMR Serverless on EMR Studio"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e283e844",
   "metadata": {},
   "source": [
    "#### Topics covered in this example\n",
    "<ol>\n",
    " <li> Configure a Spark session </li>\n",
    " <li> Import a library to help with plotting </li>\n",
    " <li> Spark DataFrames: reading a public dataset, selecting data and writing to an S3 location </li>\n",
    " <li> Spark SQL: creating a new view and selecting data </li>\n",
    " <li> Visualize your data </li>\n",
    "</ol>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d16c0e10",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-10-16T17:21:25.407818Z",
     "iopub.status.busy": "2023-10-16T17:21:25.407393Z",
     "iopub.status.idle": "2023-10-16T17:21:39.912554Z",
     "shell.execute_reply": "2023-10-16T17:21:39.911928Z",
     "shell.execute_reply.started": "2023-10-16T17:21:25.407789Z"
    }
   },
   "source": [
    "***\n",
    "\n",
    "## Prerequisites\n",
    "<div class=\"alert alert-block alert-info\">\n",
    "<b>NOTE :</b> In order to execute this notebook successfully as is, please ensure the following prerequisites are completed.</div>\n",
    "\n",
    "* EMR Serverless should be chosen as the Compute.\n",
    "* Make sure the Studio user role has permission to attach the Workspace to the Application and to pass the runtime role to it.\n",
    "* This notebook uses the `PySpark` kernel.\n",
    "* Your Serverless Application must be configured with a VPC that has internet connectivity. [Learn more](https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/vpc-access.html)\n",
    "***"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8af6027b",
   "metadata": {},
   "source": [
    "## 1. Configure your Spark session\n",
    "Configure the Spark session to use virtualenv, which is needed to install additional Python packages."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "24ce8423",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "%%configure -f\n",
    "{\n",
    "  \"conf\": {\n",
    "    \"spark.pyspark.virtualenv.enabled\": \"true\",\n",
    "    \"spark.pyspark.virtualenv.bin.path\": \"/usr/bin/virtualenv\",\n",
    "    \"spark.pyspark.virtualenv.type\": \"native\",\n",
    "    \"spark.pyspark.python\": \"/usr/bin/python3\",\n",
    "    \"spark.executorEnv.PYSPARK_PYTHON\": \"/usr/bin/python3\"\n",
    "  }\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b2165194",
   "metadata": {},
   "source": [
    "Let's start a Spark session:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d14c84e6",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "spark"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1ea0659c",
   "metadata": {},
   "source": [
    "Let's run the `%%info` magic command, which shows the Spark configuration for the current session and provides links to the live Spark UI for the session:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4148b249",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "%%info"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "02facc47",
   "metadata": {},
   "source": [
    "---\n",
    "## 2. Install packages from PyPI\n",
    "We will install the matplotlib Python package.\n",
    "<div class=\"alert alert-block alert-info\">\n",
    "<b>NOTE :</b> You will need internet access to do this step.</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bd12d484",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "sc.install_pypi_package(\"matplotlib\")"
   ]
  },
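  {
   "cell_type": "markdown",
   "id": "b7f1c2d3",
   "metadata": {},
   "source": [
    "As a quick sanity check, you can import matplotlib on the Spark driver and print its version; a successful import confirms that the notebook-scoped package installed above is visible to this session."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c8a9d0e1",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sanity check: the import runs on the Spark driver, so a successful\n",
    "# import (and a printed version) confirms the notebook-scoped package\n",
    "# installed in the previous cell is available to this session.\n",
    "import matplotlib\n",
    "\n",
    "print(matplotlib.__version__)"
   ]
  },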
  {
   "cell_type": "markdown",
   "id": "26a5516b",
   "metadata": {},
   "source": [
    "---\n",
    "## 3. Read data from S3\n",
    "We will use a public dataset of NYC yellow taxi trips. Read the Parquet file from S3 and let Spark infer the schema from the file.\n",
    "<div class=\"alert alert-block alert-info\">\n",
    "<b>NOTE :</b> You will need to update your runtime role to allow Get access to the s3://athena-examples-us-east-1/notebooks/ folder and its sub-folders.</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "34e4291d",
   "metadata": {},
   "outputs": [],
   "source": [
    "file_name = \"s3://athena-examples-us-east-1/notebooks/yellow_tripdata_2016-01.parquet\"\n",
    "\n",
    "taxi_df = (spark.read.format(\"parquet\").option(\"header\", \"true\") \\\n",
    " .option(\"inferSchema\", \"true\").load(file_name))"
   ]
  },
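  {
   "cell_type": "markdown",
   "id": "a2b3c4d5",
   "metadata": {},
   "source": [
    "Before transforming the data, it can help to inspect the schema Spark read from the Parquet metadata and count the rows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e6f7a8b9",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Print the column names and types Spark read from the Parquet metadata,\n",
    "# then count the rows to get a feel for the size of the dataset.\n",
    "taxi_df.printSchema()\n",
    "print(taxi_df.count())"
   ]
  },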
  {
   "cell_type": "markdown",
   "id": "8f910a35",
   "metadata": {},
   "source": [
    "#### Use the Spark DataFrame API to group and count specific columns from taxi_df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6c66389d",
   "metadata": {},
   "outputs": [],
   "source": [
    "taxi1_df = taxi_df.groupBy(\"VendorID\", \"passenger_count\").count()\n",
    "taxi1_df.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "afe654d5",
   "metadata": {},
   "source": [
    "### Use the %%display magic to quickly visualize a dataframe\n",
    "<ol>\n",
    " <li> You can choose to view the results in a table format. </li>\n",
    " <li> You can also choose to visualize your data with five types of charts. You can select the display type below and the chart will change accordingly. </li>\n",
    "</ol>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a1649eed",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "%%display\n",
    "taxi1_df"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6f8a3889",
   "metadata": {},
   "source": [
    "---\n",
    "## 4. Run Spark SQL commands\n",
    "#### Create a new temporary view `taxis`, use Spark SQL to select data from this view, and create a DataFrame for further processing"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d34e2a59",
   "metadata": {},
   "outputs": [],
   "source": [
    "taxi_df.createOrReplaceTempView(\"taxis\")\n",
    "\n",
    "sqlDF = spark.sql(\n",
    " \"SELECT DOLocationID, sum(total_amount) as sum_total_amount \\\n",
    " FROM taxis where DOLocationID < 25 Group by DOLocationID ORDER BY DOLocationID\"\n",
    ")\n",
    "sqlDF.show(50)"
   ]
  },
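  {
   "cell_type": "markdown",
   "id": "f0a1b2c3",
   "metadata": {},
   "source": [
    "If you want to persist the aggregated results, you can write the DataFrame back to an S3 location as Parquet. The bucket and prefix below are placeholders; use a location that your runtime role is allowed to write to."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d4e5f6a7",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Write the aggregated results back to S3 as Parquet.\n",
    "# Replace the placeholder bucket/prefix with a location your runtime role\n",
    "# can write to (for example, s3:PutObject on that prefix).\n",
    "output_path = \"s3://<your-bucket>/notebooks/taxi_sum_by_dolocation/\"\n",
    "\n",
    "sqlDF.write.mode(\"overwrite\").parquet(output_path)"
   ]
  },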
  {
   "cell_type": "markdown",
   "id": "ea77d28f",
   "metadata": {},
   "source": [
    "Use the %%sql magic:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ecbeea32",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%sql\n",
    "SHOW DATABASES"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "08a44bb0",
   "metadata": {},
   "source": [
    "---\n",
    "## 5. Visualize your data using Python\n",
    "#### Use matplotlib to plot the drop-off location and the total amount as a bar chart"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fef525f5",
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "plt.clf()\n",
    "df = sqlDF.toPandas()\n",
    "plt.bar(df.DOLocationID, df.sum_total_amount)\n",
    "%matplot plt"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0ec35ea5",
   "metadata": {},
   "source": [
    "### You have made it to the end of the demo notebook!"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "PySpark",
   "language": "python",
   "name": "spark_magic_pyspark"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "python",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "pyspark",
   "pygments_lexer": "python3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
