From cf78c38bef60193c1166a783b7052655ca91ea99 Mon Sep 17 00:00:00 2001
From: albhasan
Date: Sun, 19 Nov 2023 12:09:36 -0300
Subject: [PATCH 01/16] Solves #118

---
 episodes/04-data-structures-part2.Rmd | 63 +++++++++++++++++++++++++--
 1 file changed, 59 insertions(+), 4 deletions(-)

diff --git a/episodes/04-data-structures-part2.Rmd b/episodes/04-data-structures-part2.Rmd
index 2ce4db0d..724dcdf2 100644
--- a/episodes/04-data-structures-part2.Rmd
+++ b/episodes/04-data-structures-part2.Rmd
@@ -37,10 +37,10 @@ So far, you have seen the basics of manipulating data frames with our nordic dat
 
 ::::::::::::::::::::::::::::::::::::::::: instructor
 
-Pay attention to and explain the errors and warnings generated from the
+Pay attention to and explain the errors and warnings generated from the
 examples in this episode.
 
-:::::::::::::::::::::::::::::::::::::::::
+:::::::::::::::::::::::::::::::::::::::::
 
 ```{r, echo=TRUE}
 gapminder <- read.csv("data/gapminder_data.csv")
@@ -75,7 +75,7 @@ gapminder <- read.csv("https://datacarpentry.org/r-intro-geospatial/data/gapmind
 
 - You can read directly from excel spreadsheets without converting them to plain text first by using the [readxl](https://cran.r-project.org/package=readxl) package.
-
+
 ::::::::::::::::::::::::::::::::::::::::::::::::::
 
@@ -86,10 +86,12 @@ always do is check out what the data looks like with `str`:
 str(gapminder)
 ```
 
-We can also examine individual columns of the data frame with our `class` function:
+We can also examine individual columns of the data frame with the `class` or
+'typeof' functions.:
 
 ```{r}
 class(gapminder$year)
+typeof(gapminder$year)
 class(gapminder$country)
 str(gapminder$country)
 ```
@@ -281,6 +283,59 @@ tail(gapminder_norway)
 
 To understand why R is giving us a warning when we try to add this row, let's learn a little more about factors.
+
+## Removing columns and rows in data frames
+
+To remove columns from a data frame, we can use the 'subset' function.
+This function allows us to remove columns using their names:
+
+```{r}
+life_expectancy <- subset(gapminder, select = -c(continent, pop, gdpPercap))
+head(life_expectancy)
+```
+
+We can also use a logical vector to achieve the same result. Make sure the
+vector's length match the number of columns in the data frame (to avoid vector
+recycling):
+
+```{r}
+life_expectancy <- gapminder[c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE)]
+head(life_expectancy)
+```
+
+Alternatively, we can use column's positions:
+
+```{r}
+life_expectancy <- gapminder[-c(3, 4, 6)]
+head(life_expectancy)
+```
+
+Note that the easy way to remove rows from a data frame is selecting the rows
+we want to keep instead.
+Anyway, to remove rows from a data frame, we can use their positions:
+
+```{r}
+# Filter data for Afghanistan during the 20th century:
+afghanistan_20c <- gapminder[gapminder$country == "Afghanistan" &
+                             gapminder$year > 2000, ]
+
+# Now remove data for 2002, that is, the first row:
+afghanistan_20c[-1, ]
+```
+
+
+An interesting case is removing rows containing NAs:
+
+```{r}
+# Turn some values into NAs:
+afghanistan_20c <- gapminder[gapminder$country == "Afghanistan", ]
+afghanistan_20c[afghanistan_20c$year < 2007, "year"] <- NA
+
+# Remove NAs
+na.omit(afghanistan_20c)
+```
+
+
 ## Factors
 
 Here is another thing to look out for: in a `factor`, each different value

From 8cb90a89e501130bc6bddf0d201cc3d040aca999 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Alber=20S=C3=A1nchez?=
Date: Thu, 11 Jan 2024 11:02:18 -0300
Subject: [PATCH 02/16] Update episodes/04-data-structures-part2.Rmd

Co-authored-by: Michael Mahoney
---
 episodes/04-data-structures-part2.Rmd | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/episodes/04-data-structures-part2.Rmd b/episodes/04-data-structures-part2.Rmd
index 724dcdf2..9a92cb27 100644
--- a/episodes/04-data-structures-part2.Rmd
+++ b/episodes/04-data-structures-part2.Rmd
@@ -87,7 +87,7 @@ str(gapminder)
 ```
 
 We can also examine individual columns of the data frame with the `class` or
-'typeof' functions.:
+'typeof' functions:
 
 ```{r}
 class(gapminder$year)

From 4abdbf6b5fb7547194706f1d6d534745cc8a978b Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Alber=20S=C3=A1nchez?=
Date: Wed, 26 Nov 2025 11:11:58 -0300
Subject: [PATCH 03/16] Update episodes/04-data-structures-part2.Rmd

Grammar improvement.

Co-authored-by: Sarah Stevens
---
 episodes/04-data-structures-part2.Rmd | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/episodes/04-data-structures-part2.Rmd b/episodes/04-data-structures-part2.Rmd
index 9a92cb27..6b5485b6 100644
--- a/episodes/04-data-structures-part2.Rmd
+++ b/episodes/04-data-structures-part2.Rmd
@@ -303,7 +303,7 @@ life_expectancy <- gapminder[c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE)]
 head(life_expectancy)
 ```
 
-Alternatively, we can use column's positions:
+Alternatively, we can use column positions:
 
 ```{r}
 life_expectancy <- gapminder[-c(3, 4, 6)]

From 91e02e30b7d7eba62f19a972ca839038ec222c8c Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Alber=20S=C3=A1nchez?=
Date: Wed, 26 Nov 2025 13:50:38 -0300
Subject: [PATCH 04/16] Update episodes/04-data-structures-part2.Rmd

Co-authored-by: Sarah Stevens
---
 episodes/04-data-structures-part2.Rmd | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/episodes/04-data-structures-part2.Rmd b/episodes/04-data-structures-part2.Rmd
index 6b5485b6..0bb008be 100644
--- a/episodes/04-data-structures-part2.Rmd
+++ b/episodes/04-data-structures-part2.Rmd
@@ -312,7 +312,7 @@ head(life_expectancy)
 
 Note that the easy way to remove rows from a data frame is selecting the rows
 we want to keep instead.
-Anyway, to remove rows from a data frame, we can use their positions:
+However, to remove rows from a data frame, we can use their positions:
 
 ```{r}
 # Filter data for Afghanistan during the 20th century:

From ae053604f6912ea7f0d9b0f66a734f7fdf49bf38 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Alber=20S=C3=A1nchez?=
Date: Wed, 26 Nov 2025 13:51:10 -0300
Subject: [PATCH 05/16] Update episodes/04-data-structures-part2.Rmd

Co-authored-by: Sarah Stevens
---
 episodes/04-data-structures-part2.Rmd | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/episodes/04-data-structures-part2.Rmd b/episodes/04-data-structures-part2.Rmd
index 0bb008be..eebe4bdf 100644
--- a/episodes/04-data-structures-part2.Rmd
+++ b/episodes/04-data-structures-part2.Rmd
@@ -324,7 +324,7 @@ afghanistan_20c[-1, ]
 ```
 
 
-An interesting case is removing rows containing NAs:
+In research, you may want to remove all the missing data prior to an analysis. Let's first add some missing values (NAs) into the data and then we can use `na.omit()` to remove them.
 
 ```{r}
 # Turn some values into NAs:

From 0c1af3eb7861a06bfb916fd35a8d58494a6a7780 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Alber=20S=C3=A1nchez?=
Date: Wed, 26 Nov 2025 13:51:49 -0300
Subject: [PATCH 06/16] Update episodes/04-data-structures-part2.Rmd

Co-authored-by: Sarah Stevens
---
 episodes/04-data-structures-part2.Rmd | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/episodes/04-data-structures-part2.Rmd b/episodes/04-data-structures-part2.Rmd
index eebe4bdf..40fbc09c 100644
--- a/episodes/04-data-structures-part2.Rmd
+++ b/episodes/04-data-structures-part2.Rmd
@@ -287,7 +287,8 @@ To understand why R is giving us a warning when we try to add this row, let's le
 ## Removing columns and rows in data frames
 
 To remove columns from a data frame, we can use the 'subset' function.
-This function allows us to remove columns using their names:
+This function allows us to remove columns using their names.
+If we want to keep all columns except continent, pop and gdpPercap we can use the following `subset` command:
 
 ```{r}
 life_expectancy <- subset(gapminder, select = -c(continent, pop, gdpPercap))

From c93dec119fffc253b2f4b5e0b2b89d4abe726c4b Mon Sep 17 00:00:00 2001
From: albhasan
Date: Tue, 6 Jan 2026 14:16:46 -0300
Subject: [PATCH 07/16] Added explanation and reference to vector recycling.

---
 episodes/04-data-structures-part2.Rmd | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/episodes/04-data-structures-part2.Rmd b/episodes/04-data-structures-part2.Rmd
index 40fbc09c..b7c0d72f 100644
--- a/episodes/04-data-structures-part2.Rmd
+++ b/episodes/04-data-structures-part2.Rmd
@@ -304,6 +304,11 @@ life_expectancy <- gapminder[c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE)]
 head(life_expectancy)
 ```
 
+Vector recycling occurs when working with vectors of different length and it
+consist on repeating the elements of the shorter vector up to the lenght of
+the larger one. For more information, check the book R for Data Science and its
+[chapter about vectors](https://r4ds.had.co.nz/vectors.html#scalars-and-recycling-rules).
+
 Alternatively, we can use column positions:
 
 ```{r}

From 14ea9ef3b32ca5ab7fbeedc070eed38c0187320a Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Alber=20S=C3=A1nchez?=
Date: Tue, 6 Jan 2026 14:22:09 -0300
Subject: [PATCH 08/16] Update episodes/04-data-structures-part2.Rmd

Co-authored-by: Sarah Stevens
---
 episodes/04-data-structures-part2.Rmd | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/episodes/04-data-structures-part2.Rmd b/episodes/04-data-structures-part2.Rmd
index b7c0d72f..e2c5bb97 100644
--- a/episodes/04-data-structures-part2.Rmd
+++ b/episodes/04-data-structures-part2.Rmd
@@ -316,7 +316,7 @@ life_expectancy <- gapminder[-c(3, 4, 6)]
 head(life_expectancy)
 ```
 
-Note that the easy way to remove rows from a data frame is selecting the rows
+Note that typically we select the rows we want to keep, rather than removing rows we do not want in the data.
 we want to keep instead.
 However, to remove rows from a data frame, we can use their positions:
 

From 8d72c0c80f75821e7dd7f54ce2c75bf017f5df2c Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Alber=20S=C3=A1nchez?=
Date: Wed, 7 Jan 2026 10:28:57 -0300
Subject: [PATCH 09/16] Update episodes/04-data-structures-part2.Rmd

Co-authored-by: Sarah Stevens
---
 episodes/04-data-structures-part2.Rmd | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/episodes/04-data-structures-part2.Rmd b/episodes/04-data-structures-part2.Rmd
index e2c5bb97..fdf6b0df 100644
--- a/episodes/04-data-structures-part2.Rmd
+++ b/episodes/04-data-structures-part2.Rmd
@@ -296,8 +296,7 @@ head(life_expectancy)
 ```
 
 We can also use a logical vector to achieve the same result. Make sure the
-vector's length match the number of columns in the data frame (to avoid vector
-recycling):
+vector's length match the number of columns in the data frame (to avoid R repeating the shorter vector to match the length of the longer vector):
 
 ```{r}
 life_expectancy <- gapminder[c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE)]

From 1c83f0dde80263023e3550c3788b5d0b8e7dc300 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Alber=20S=C3=A1nchez?=
Date: Wed, 7 Jan 2026 10:38:27 -0300
Subject: [PATCH 10/16] Update episodes/04-data-structures-part2.Rmd

Co-authored-by: Sarah Stevens
---
 episodes/04-data-structures-part2.Rmd | 1 -
 1 file changed, 1 deletion(-)

diff --git a/episodes/04-data-structures-part2.Rmd b/episodes/04-data-structures-part2.Rmd
index fdf6b0df..7d41535a 100644
--- a/episodes/04-data-structures-part2.Rmd
+++ b/episodes/04-data-structures-part2.Rmd
@@ -316,7 +316,6 @@ head(life_expectancy)
 ```
 
 Note that typically we select the rows we want to keep, rather than removing rows we do not want in the data.
-we want to keep instead.
 However, to remove rows from a data frame, we can use their positions:
 
 ```{r}

From f161796c73eac749e6e6a18ac83059a1a0bc214a Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Alber=20S=C3=A1nchez?=
Date: Wed, 7 Jan 2026 10:38:53 -0300
Subject: [PATCH 11/16] Update episodes/04-data-structures-part2.Rmd

Co-authored-by: Sarah Stevens
---
 episodes/04-data-structures-part2.Rmd | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/episodes/04-data-structures-part2.Rmd b/episodes/04-data-structures-part2.Rmd
index 7d41535a..e0e8172b 100644
--- a/episodes/04-data-structures-part2.Rmd
+++ b/episodes/04-data-structures-part2.Rmd
@@ -316,7 +316,9 @@ head(life_expectancy)
 ```
 
 Note that typically we select the rows we want to keep, rather than removing rows we do not want in the data.
-However, to remove rows from a data frame, we can use their positions:
+However, to remove rows from a data frame, we can use their positions.
+To practice on a smaller subset, we will filter the data to only those entries from Afghanistan after the year 2000.
+This smaller dataset will be easier for us to inspect by eye and see the changes we are making.
 
 ```{r}
 # Filter data for Afghanistan during the 20th century:

From 84870543e700542d01bcc355da0f559c9ef1449c Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Alber=20S=C3=A1nchez?=
Date: Wed, 7 Jan 2026 10:39:47 -0300
Subject: [PATCH 12/16] Update episodes/04-data-structures-part2.Rmd

Co-authored-by: Sarah Stevens
---
 episodes/04-data-structures-part2.Rmd | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/episodes/04-data-structures-part2.Rmd b/episodes/04-data-structures-part2.Rmd
index e0e8172b..a24e0603 100644
--- a/episodes/04-data-structures-part2.Rmd
+++ b/episodes/04-data-structures-part2.Rmd
@@ -330,7 +330,8 @@ afghanistan_20c[-1, ]
 ```
 
 
-In research, you may want to remove all the missing data prior to an analysis. Let's first add some missing values (NAs) into the data and then we can use `na.omit()` to remove them.
+In research, we often remove rows based on features of the data itself, rather than its location.
+For example, you may want to remove all the missing data prior to an analysis. Let's first add some missing values (NAs) into the data and then we can use `na.omit()` to remove them.
 
 ```{r}
 # Turn some values into NAs:

From 9da42addda4e1e92237189a44a642260324a2872 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Alber=20S=C3=A1nchez?=
Date: Wed, 7 Jan 2026 11:15:18 -0300
Subject: [PATCH 13/16] Update episodes/04-data-structures-part2.Rmd

Co-authored-by: Sarah Stevens
---
 episodes/04-data-structures-part2.Rmd | 1 +
 1 file changed, 1 insertion(+)

diff --git a/episodes/04-data-structures-part2.Rmd b/episodes/04-data-structures-part2.Rmd
index a24e0603..7c302510 100644
--- a/episodes/04-data-structures-part2.Rmd
+++ b/episodes/04-data-structures-part2.Rmd
@@ -337,6 +337,7 @@ For example, you may want to remove all the missing data prior to an analysis.
 # Turn some values into NAs:
 afghanistan_20c <- gapminder[gapminder$country == "Afghanistan", ]
 afghanistan_20c[afghanistan_20c$year < 2007, "year"] <- NA
+head(afghanistan_20c)
 
 # Remove NAs
 na.omit(afghanistan_20c)

From 992bd9039708bc1348e9f3234b76a1057ead1a4d Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Alber=20S=C3=A1nchez?=
Date: Thu, 22 Jan 2026 13:22:12 -0300
Subject: [PATCH 14/16] Update episodes/04-data-structures-part2.Rmd

Co-authored-by: Sarah Stevens
---
 episodes/04-data-structures-part2.Rmd | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/episodes/04-data-structures-part2.Rmd b/episodes/04-data-structures-part2.Rmd
index 7c302510..d319cd98 100644
--- a/episodes/04-data-structures-part2.Rmd
+++ b/episodes/04-data-structures-part2.Rmd
@@ -296,7 +296,7 @@ head(life_expectancy)
 ```
 
 We can also use a logical vector to achieve the same result. Make sure the
-vector's length match the number of columns in the data frame (to avoid R repeating the shorter vector to match the length of the longer vector):
+vector's length match the number of columns in the data frame (to avoid R repeating the shorter vector to match the length of the longer vector, called "vector recycling"):
 
 ```{r}
 life_expectancy <- gapminder[c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE)]

From b7233e9c86b491f7ce5edba798bfe19f356d175c Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Alber=20S=C3=A1nchez?=
Date: Thu, 22 Jan 2026 13:22:46 -0300
Subject: [PATCH 15/16] Update episodes/04-data-structures-part2.Rmd

Co-authored-by: Sarah Stevens
---
 episodes/04-data-structures-part2.Rmd | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/episodes/04-data-structures-part2.Rmd b/episodes/04-data-structures-part2.Rmd
index d319cd98..74ce6bac 100644
--- a/episodes/04-data-structures-part2.Rmd
+++ b/episodes/04-data-structures-part2.Rmd
@@ -303,10 +303,15 @@ life_expectancy <- gapminder[c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE)]
 head(life_expectancy)
 ```
 
+:::::: spoiler
+
+### Vector Recycling
+
 Vector recycling occurs when working with vectors of different length and it
 consist on repeating the elements of the shorter vector up to the lenght of
 the larger one. For more information, check the book R for Data Science and its
 [chapter about vectors](https://r4ds.had.co.nz/vectors.html#scalars-and-recycling-rules).
+::::::::
 
 Alternatively, we can use column positions:
 

From 26eff3863624d6a352f3fb0b34d8d02ef1d53459 Mon Sep 17 00:00:00 2001
From: albhasan
Date: Mon, 2 Feb 2026 14:13:15 -0300
Subject: [PATCH 16/16] Fix typos (found by @coopermkr).

---
 episodes/04-data-structures-part2.Rmd | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/episodes/04-data-structures-part2.Rmd b/episodes/04-data-structures-part2.Rmd
index 74ce6bac..732123dc 100644
--- a/episodes/04-data-structures-part2.Rmd
+++ b/episodes/04-data-structures-part2.Rmd
@@ -296,7 +296,9 @@ head(life_expectancy)
 ```
 
 We can also use a logical vector to achieve the same result. Make sure the
-vector's length match the number of columns in the data frame (to avoid R repeating the shorter vector to match the length of the longer vector, called "vector recycling"):
+vector's length matches the number of columns in the data frame (to avoid R
+repeating the shorter vector to match the length of the longer vector, called
+"vector recycling"):
 
 ```{r}
 life_expectancy <- gapminder[c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE)]
@@ -308,7 +310,7 @@ head(life_expectancy)
 ### Vector Recycling
 
 Vector recycling occurs when working with vectors of different length and it
-consist on repeating the elements of the shorter vector up to the lenght of
+consist of repeating the elements of the shorter vector up to the length of
 the larger one. For more information, check the book R for Data Science and its
 [chapter about vectors](https://r4ds.had.co.nz/vectors.html#scalars-and-recycling-rules).
 ::::::::
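
The recycling caveat in the logical-vector example can be checked without the gapminder file. The following is only a minimal sketch, assuming a small made-up data frame `df` (hypothetical, not part of the lesson) whose six columns follow the same order as the gapminder columns used in the patches above:

```r
# Hypothetical stand-in for gapminder: six columns in the same order
df <- data.frame(
  country   = c("A", "B"),
  year      = c(2002, 2007),
  pop       = c(1, 2),
  continent = c("X", "Y"),
  lifeExp   = c(50, 60),
  gdpPercap = c(100, 200)
)

# One TRUE/FALSE per column: keeps exactly the columns we marked TRUE
full <- df[c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE)]
names(full)

# Only two values for six columns: R silently recycles c(TRUE, FALSE),
# so every other column is kept, which is probably not what was intended
recycled <- df[c(TRUE, FALSE)]
names(recycled)
```

With the full-length vector the result keeps `country`, `year` and `lifeExp`; with the two-element vector R keeps columns 1, 3 and 5 instead, and it does so without a warning, which is why the lesson text asks for a vector as long as the number of columns.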