diff --git a/.github/workflows/publish_image_to_dockerhub.yml b/.github/workflows/publish_image_to_dockerhub.yml new file mode 100644 index 000000000..af51c8044 --- /dev/null +++ b/.github/workflows/publish_image_to_dockerhub.yml @@ -0,0 +1,40 @@ +name: Publish Docker image on DockerHub +on: + push: + paths: + - 'Dockerfile' +jobs: + build: + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@master + - uses: actions/checkout@master + with: + fetch-depth: '0' + - name: Bump version and push tag + uses: anothrNick/github-tag-action@1.17.2 + env: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + WITH_V: true + id: bump + - name: Create Release + id: create_release + uses: actions/create-release@v1 + env: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} # This token is provided by Actions, you do not need to create your own token + with: + tag_name: ${{ steps.bump.outputs.new_tag }} + release_name: ${{ steps.bump.outputs.new_tag }} + body: | + Changes in this Release + - Rebuilt Docker image and published to DockerHub with new tag + draft: false + prerelease: false + - name: Publish to Registry + uses: elgohr/Publish-Docker-Github-Action@master + with: + name: ubcdsci/intro-to-ds + username: ${{ secrets.DOCKER_USERNAME }} + password: ${{ secrets.DOCKER_PASSWORD }} + tags: "latest,${{ steps.bump.outputs.new_tag }}" + diff --git a/03-viz.Rmd b/03-viz.Rmd index fdc3e27f7..270cbefd6 100644 --- a/03-viz.Rmd +++ b/03-viz.Rmd @@ -39,7 +39,7 @@ are external references that contain a wealth of additional information on the t - Use `ggsave` to save visualizations in `.png` and `.svg` format ## Choosing the visualization -#### *Ask a question, and answer it* {-#my-section} +#### *Ask a question, and answer it* The purpose of a visualization is to answer a question about a data set of interest. So naturally, the first thing to do **before** creating a visualization is to formulate the question about the data that you are trying to answer. @@ -68,7 +68,7 @@ again typically a better alternative. ## Refining the visualization -#### *Convey the message, minimize noise* {-#my-section} +#### *Convey the message, minimize noise* Just being able to make a visualization in R with `ggplot2` (or any other tool for that matter) doesn't mean that it is effective at communicating your message to others. Once you have selected a broad type of visualization to use, you will have to refine it to suit your particular need. @@ -98,7 +98,7 @@ making it easier for them to quickly understand and remember your message. ## Creating visualizations with `ggplot2` -#### *Build the visualization iteratively* {-#my-section} +#### *Build the visualization iteratively* This section will cover examples of how to choose and refine a visualization given a data set and a question that you want to answer, and then how to create the visualization in R using `ggplot2`. To use the `ggplot2` library, we need to load the `tidyverse` metapackage. @@ -379,7 +379,7 @@ admirable job given the technology available at the time period. ## Explaining the visualization -#### *Tell a story* {-#my-section} +#### *Tell a story* Typically, your visualization will not be shown completely on its own, but rather it will be part of a larger presentation. Further, visualizations can provide supporting information for any part of a presentation, from opening to conclusion. 
@@ -425,7 +425,7 @@ worth further investigation into the differences between these experiments to se ## Saving the visualization -#### *Choose the right output format for your needs* {-#my-section} +#### *Choose the right output format for your needs* Just as there are many ways to store data sets, there are many ways to store visualizations and images. Which one you choose can depend on a number of factors, such as file size/type limitations diff --git a/Dockerfile b/Dockerfile index ea588ad82..e0e67f30c 100644 --- a/Dockerfile +++ b/Dockerfile @@ -1,3 +1,4 @@ +# Copyright (c) UBC-DSCI Development Team. FROM rocker/verse:4.0.0 RUN apt-get update --fix-missing \ diff --git a/data/can_lang.csv b/data/can_lang.csv index 658527676..33869dbcb 100644 --- a/data/can_lang.csv +++ b/data/can_lang.csv @@ -1,3 +1,4 @@ +<<<<<<< HEAD category,language,mother_tongue,most_at_home,most_at_work,lang_known Aboriginal languages,"Aboriginal languages, n.o.s.",590,235,30,665 Non-Official & Non-Aboriginal languages,Afrikaans,10260,4785,85,23415 @@ -213,3 +214,220 @@ Aboriginal languages,Woods Cree,1840,800,75,2665 Non-Official & Non-Aboriginal languages,Wu (Shanghainese),12915,7650,105,16530 Non-Official & Non-Aboriginal languages,Yiddish,13555,7085,895,20985 Non-Official & Non-Aboriginal languages,Yoruba,9080,2615,15,22415 +======= +category,language,mother_tongue,most_at_home,most_at_work,lang_known +Aboriginal languages,"Aboriginal languages, n.o.s.",590,235,30,665 +Non-Official & Non-Aboriginal languages,Afrikaans,10260,4785,85,23415 +Non-Official & Non-Aboriginal languages,"Afro-Asiatic languages, n.i.e.",1150,445,10,2775 +Non-Official & Non-Aboriginal languages,Akan (Twi),13460,5985,25,22150 +Non-Official & Non-Aboriginal languages,Albanian,26895,13135,345,31930 +Aboriginal languages,"Algonquian languages, n.i.e.",45,10,0,120 +Aboriginal languages,Algonquin,1260,370,40,2480 +Non-Official & Non-Aboriginal languages,American Sign Language,2685,3020,1145,21930 +Non-Official & Non-Aboriginal languages,Amharic,22465,12785,200,33670 +Non-Official & Non-Aboriginal languages,Arabic,419890,223535,5585,629055 +Non-Official & Non-Aboriginal languages,Armenian,33460,21510,450,41295 +Non-Official & Non-Aboriginal languages,Assyrian Neo-Aramaic,16070,10510,205,19740 +Aboriginal languages,"Athabaskan languages, n.i.e.",50,10,0,85 +Aboriginal languages,Atikamekw,6150,5465,1100,6645 +Non-Official & Non-Aboriginal languages,"Austro-Asiatic languages, n.i.e",170,80,0,190 +Non-Official & Non-Aboriginal languages,"Austronesian languages, n.i.e.",4195,1160,35,5585 +Non-Official & Non-Aboriginal languages,Azerbaijani,3255,1245,25,5455 +Aboriginal languages,Babine (Wetsuwet'en),110,20,10,210 +Non-Official & Non-Aboriginal languages,Bamanankan,1535,345,0,3190 +Aboriginal languages,Beaver,190,50,0,340 +Non-Official & Non-Aboriginal languages,Belarusan,810,225,0,2265 +Non-Official & Non-Aboriginal languages,Bengali,73125,47350,525,91220 +Non-Official & Non-Aboriginal languages,"Berber languages, n.i.e.",8985,2615,15,12510 +Non-Official & Non-Aboriginal languages,Bikol,1785,290,0,2075 +Non-Official & Non-Aboriginal languages,Bilen,805,615,15,1085 +Aboriginal languages,Blackfoot,2815,1110,85,5645 +Non-Official & Non-Aboriginal languages,Bosnian,12215,6045,155,18265 +Non-Official & Non-Aboriginal languages,Bulgarian,20020,11985,200,22425 +Non-Official & Non-Aboriginal languages,Burmese,3585,2245,75,4995 +Non-Official & Non-Aboriginal languages,Cantonese,565270,400220,58820,699125 +Aboriginal languages,Carrier,1025,250,15,2100 
+Non-Official & Non-Aboriginal languages,Catalan,870,350,30,2035 +Aboriginal languages,Cayuga,45,10,10,125 +Non-Official & Non-Aboriginal languages,Cebuano,19890,7205,70,27040 +Non-Official & Non-Aboriginal languages,"Celtic languages, n.i.e.",525,80,10,3595 +Non-Official & Non-Aboriginal languages,Chaldean Neo-Aramaic,5545,3445,35,7115 +Aboriginal languages,Chilcotin,655,255,15,1150 +Non-Official & Non-Aboriginal languages,"Chinese languages, n.i.e.",615,280,0,590 +Non-Official & Non-Aboriginal languages,"Chinese, n.o.s.",38580,23940,2935,41685 +Aboriginal languages,Comox,85,0,0,185 +Aboriginal languages,"Cree, n.o.s.",64050,37950,7800,86115 +Non-Official & Non-Aboriginal languages,"Creole languages, n.i.e.",4985,2005,15,16635 +Non-Official & Non-Aboriginal languages,"Creole, n.o.s.",64110,24570,310,133045 +Non-Official & Non-Aboriginal languages,Croatian,48200,16775,220,69835 +Non-Official & Non-Aboriginal languages,"Cushitic languages, n.i.e.",365,180,0,480 +Non-Official & Non-Aboriginal languages,Czech,22295,6235,70,28725 +Aboriginal languages,Dakota,1210,255,20,1760 +Non-Official & Non-Aboriginal languages,Danish,12630,855,85,15750 +Aboriginal languages,Dene,10700,7710,770,13060 +Non-Official & Non-Aboriginal languages,Dinka,2120,1130,0,2475 +Aboriginal languages,Dogrib (Tlicho),1650,1020,165,2375 +Non-Official & Non-Aboriginal languages,"Dravidian languages, n.i.e.",490,190,0,790 +Non-Official & Non-Aboriginal languages,Dutch,99015,9565,1165,120870 +Non-Official & Non-Aboriginal languages,Edo,1670,410,0,3220 +Official languages,English,19460850,22162865,15265335,29748265 +Non-Official & Non-Aboriginal languages,Estonian,5445,975,55,6070 +Non-Official & Non-Aboriginal languages,Ewe,1760,405,10,3000 +Non-Official & Non-Aboriginal languages,Fijian,745,195,0,1665 +Non-Official & Non-Aboriginal languages,Finnish,15295,2790,105,17590 +Official languages,French,7166700,6943800,3825215,10242945 +Non-Official & Non-Aboriginal languages,Frisian,2100,185,40,2910 +Non-Official & Non-Aboriginal languages,"Fulah (Pular, Pulaar, Fulfulde)",2825,825,0,4725 +Non-Official & Non-Aboriginal languages,Ga,920,250,0,2250 +Non-Official & Non-Aboriginal languages,Ganda,1295,345,25,2495 +Non-Official & Non-Aboriginal languages,Georgian,1710,1040,25,2150 +Non-Official & Non-Aboriginal languages,German,384040,120335,10065,502735 +Non-Official & Non-Aboriginal languages,"Germanic languages, n.i.e.",525,1630,725,8705 +Aboriginal languages,Gitxsan (Gitksan),880,315,10,1305 +Non-Official & Non-Aboriginal languages,Greek,106525,44550,1020,150965 +Non-Official & Non-Aboriginal languages,Gujarati,108780,64150,885,149045 +Aboriginal languages,Gwich'in,255,50,10,360 +Aboriginal languages,Haida,80,10,0,465 +Aboriginal languages,Haisla,90,20,0,175 +Non-Official & Non-Aboriginal languages,Haitian Creole,3030,1280,25,6855 +Non-Official & Non-Aboriginal languages,Hakka,10910,4085,70,12445 +Aboriginal languages,Halkomelem,480,50,20,1060 +Non-Official & Non-Aboriginal languages,Harari,1320,735,0,1715 +Non-Official & Non-Aboriginal languages,Hebrew,19530,8560,825,75020 +Aboriginal languages,Heiltsuk,100,5,10,125 +Non-Official & Non-Aboriginal languages,Hiligaynon,6880,2210,25,7925 +Non-Official & Non-Aboriginal languages,Hindi,110645,55510,1405,433365 +Non-Official & Non-Aboriginal languages,Hmong-Mien languages,795,335,10,870 +Non-Official & Non-Aboriginal languages,Hungarian,61235,19480,440,71285 +Non-Official & Non-Aboriginal languages,Icelandic,1285,270,0,1780 +Non-Official & Non-Aboriginal languages,Igbo,4235,1000,10,8855 
+Non-Official & Non-Aboriginal languages,Ilocano,26345,9125,110,34530 +Non-Official & Non-Aboriginal languages,"Indo-Iranian languages, n.i.e.",5185,2380,20,8870 +Aboriginal languages,Inuinnaqtun (Inuvialuktun),1020,165,30,1975 +Aboriginal languages,"Inuit languages, n.i.e.",310,90,15,470 +Aboriginal languages,Inuktitut,35210,29230,8795,40620 +Aboriginal languages,"Iroquoian languages, n.i.e.",35,5,0,115 +Non-Official & Non-Aboriginal languages,Italian,375635,115415,1705,574725 +Non-Official & Non-Aboriginal languages,"Italic (Romance) languages, n.i.e.",720,175,25,2680 +Non-Official & Non-Aboriginal languages,Japanese,43640,19785,3255,83095 +Non-Official & Non-Aboriginal languages,Kabyle,13150,5490,15,17120 +Non-Official & Non-Aboriginal languages,Kannada,3970,1630,10,8245 +Non-Official & Non-Aboriginal languages,Karenic languages,4705,3860,135,4895 +Non-Official & Non-Aboriginal languages,Kashmiri,565,135,0,905 +Aboriginal languages,Kaska (Nahani),180,20,10,365 +Non-Official & Non-Aboriginal languages,Khmer (Cambodian),20130,10885,475,27035 +Non-Official & Non-Aboriginal languages,Kinyarwanda (Rwanda),5250,1530,25,7860 +Non-Official & Non-Aboriginal languages,Konkani,3330,720,10,6790 +Non-Official & Non-Aboriginal languages,Korean,153425,109705,12150,172750 +Non-Official & Non-Aboriginal languages,Kurdish,11705,6580,185,15290 +Aboriginal languages,Kutenai,110,10,0,170 +Aboriginal languages,Kwakiutl (Kwak'wala),325,25,15,605 +Non-Official & Non-Aboriginal languages,Lao,12670,6175,150,17235 +Non-Official & Non-Aboriginal languages,Latvian,5450,1255,35,6500 +Aboriginal languages,Lillooet,315,25,15,790 +Non-Official & Non-Aboriginal languages,Lingala,3805,1045,10,17010 +Non-Official & Non-Aboriginal languages,Lithuanian,7075,2015,60,8185 +Non-Official & Non-Aboriginal languages,Macedonian,16770,6830,95,23075 +Non-Official & Non-Aboriginal languages,Malagasy,1430,430,0,2340 +Non-Official & Non-Aboriginal languages,Malay,12275,3625,140,22470 +Non-Official & Non-Aboriginal languages,Malayalam,28565,15440,95,37810 +Aboriginal languages,Malecite,300,55,10,760 +Non-Official & Non-Aboriginal languages,Maltese,5565,1125,25,7625 +Non-Official & Non-Aboriginal languages,Mandarin,592040,462890,60090,814450 +Non-Official & Non-Aboriginal languages,Marathi,8295,3780,30,15565 +Aboriginal languages,Mi'kmaq,6690,3565,915,9025 +Aboriginal languages,Michif,465,80,10,1210 +Non-Official & Non-Aboriginal languages,Min Dong,1230,345,30,1045 +Non-Official & Non-Aboriginal languages,"Min Nan (Chaochow, Teochow, Fukien, Taiwanese)",31800,13965,565,42840 +Aboriginal languages,Mohawk,985,255,30,2415 +Non-Official & Non-Aboriginal languages,Mongolian,1575,905,10,2095 +Aboriginal languages,Montagnais (Innu),10235,8585,2055,11445 +Aboriginal languages,Moose Cree,105,10,0,195 +Aboriginal languages,Naskapi,1205,1195,370,1465 +Non-Official & Non-Aboriginal languages,Nepali,18275,13375,195,21385 +Non-Official & Non-Aboriginal languages,"Niger-Congo languages, n.i.e.",19135,4010,30,40760 +Non-Official & Non-Aboriginal languages,"Nilo-Saharan languages, n.i.e.",3750,1520,0,4550 +Aboriginal languages,Nisga'a,400,75,10,1055 +Aboriginal languages,North Slavey (Hare),765,340,95,1005 +Aboriginal languages,Northern East Cree,315,110,35,550 +Aboriginal languages,Northern Tutchone,220,30,0,280 +Non-Official & Non-Aboriginal languages,Norwegian,4615,350,70,8120 +Aboriginal languages,Nuu-chah-nulth (Nootka),280,30,10,560 +Aboriginal languages,Oji-Cree,12855,7905,1080,15605 +Aboriginal languages,Ojibway,17885,6175,765,28580 +Aboriginal 
languages,Okanagan,275,80,20,820 +Aboriginal languages,Oneida,60,15,0,185 +Non-Official & Non-Aboriginal languages,Oriya (Odia),1055,475,0,1530 +Non-Official & Non-Aboriginal languages,Oromo,4960,3410,45,6245 +Non-Official & Non-Aboriginal languages,"Other languages, n.i.e.",3685,1110,80,9730 +Aboriginal languages,Ottawa (Odawa),150,75,0,205 +Non-Official & Non-Aboriginal languages,"Pampangan (Kapampangan, Pampango)",4045,1200,10,5425 +Non-Official & Non-Aboriginal languages,Pangasinan,1390,240,0,1800 +Non-Official & Non-Aboriginal languages,Pashto,16905,10590,50,23180 +Non-Official & Non-Aboriginal languages,Persian (Farsi),214200,143025,4580,252325 +Aboriginal languages,Plains Cree,3065,1345,95,5905 +Non-Official & Non-Aboriginal languages,Polish,181710,74780,2495,214965 +Non-Official & Non-Aboriginal languages,Portuguese,221535,98710,7485,295955 +Non-Official & Non-Aboriginal languages,Punjabi (Panjabi),501680,349140,27865,668240 +Non-Official & Non-Aboriginal languages,Quebec Sign Language,695,730,130,4665 +Non-Official & Non-Aboriginal languages,Romanian,96660,53325,745,115050 +Non-Official & Non-Aboriginal languages,Rundi (Kirundi),5850,2110,0,8590 +Non-Official & Non-Aboriginal languages,Russian,188255,116595,4855,269645 +Aboriginal languages,"Salish languages, n.i.e.",260,25,0,560 +Aboriginal languages,Sarsi (Sarcee),80,10,0,145 +Non-Official & Non-Aboriginal languages,Scottish Gaelic,1090,190,15,3980 +Aboriginal languages,Sekani,85,15,0,185 +Non-Official & Non-Aboriginal languages,"Semitic languages, n.i.e.",2150,1205,65,3220 +Non-Official & Non-Aboriginal languages,Serbian,57350,31750,530,73780 +Non-Official & Non-Aboriginal languages,Serbo-Croatian,9550,3890,30,11275 +Non-Official & Non-Aboriginal languages,Shona,3185,1035,0,5430 +Aboriginal languages,Shuswap (Secwepemctsin),445,50,35,1305 +Non-Official & Non-Aboriginal languages,"Sign languages, n.i.e",4125,6690,645,22280 +Non-Official & Non-Aboriginal languages,Sindhi,11860,4975,35,20260 +Non-Official & Non-Aboriginal languages,Sinhala (Sinhalese),16335,7790,40,27825 +Aboriginal languages,"Siouan languages, n.i.e.",55,20,0,140 +Aboriginal languages,"Slavey, n.o.s.",280,105,10,675 +Non-Official & Non-Aboriginal languages,"Slavic languages, n.i.e.",2420,670,10,2995 +Non-Official & Non-Aboriginal languages,Slovak,17580,5610,100,21470 +Non-Official & Non-Aboriginal languages,Slovene (Slovenian),9785,2055,15,11490 +Non-Official & Non-Aboriginal languages,Somali,36755,22895,220,49660 +Aboriginal languages,South Slavey,945,370,35,1365 +Aboriginal languages,Southern East Cree,45,15,0,40 +Aboriginal languages,Southern Tutchone,70,5,0,145 +Non-Official & Non-Aboriginal languages,Spanish,458850,263505,13030,995260 +Aboriginal languages,Squamish,40,5,10,285 +Aboriginal languages,Stoney,3025,1950,240,3675 +Aboriginal languages,Straits,80,25,15,365 +Non-Official & Non-Aboriginal languages,Swahili,13370,5370,80,38685 +Aboriginal languages,Swampy Cree,1440,330,10,2350 +Non-Official & Non-Aboriginal languages,Swedish,6840,1050,125,14140 +Non-Official & Non-Aboriginal languages,"Tagalog (Pilipino, Filipino)",431385,213790,3450,612735 +Aboriginal languages,Tahltan,95,5,0,265 +Non-Official & Non-Aboriginal languages,"Tai-Kadai languages, n.i.e",85,30,0,115 +Non-Official & Non-Aboriginal languages,Tamil,140720,96955,2085,189860 +Non-Official & Non-Aboriginal languages,Telugu,15660,8280,40,23165 +Non-Official & Non-Aboriginal languages,Thai,9255,3365,525,15395 +Aboriginal languages,Thompson (Ntlakapamux),335,20,0,450 +Non-Official & Non-Aboriginal 
languages,Tibetan,6160,4590,50,7050 +Non-Official & Non-Aboriginal languages,"Tibeto-Burman languages, n.i.e.",1405,655,15,2380 +Non-Official & Non-Aboriginal languages,Tigrigna,16645,10205,130,21340 +Aboriginal languages,Tlingit,95,0,10,260 +Aboriginal languages,Tsimshian,200,30,10,410 +Non-Official & Non-Aboriginal languages,"Turkic languages, n.i.e.",1315,455,10,1875 +Non-Official & Non-Aboriginal languages,Turkish,32815,18955,690,50770 +Non-Official & Non-Aboriginal languages,Ukrainian,102485,28250,1210,132115 +Non-Official & Non-Aboriginal languages,"Uralic languages, n.i.e.",10,5,0,25 +Non-Official & Non-Aboriginal languages,Urdu,210815,128785,1495,322220 +Non-Official & Non-Aboriginal languages,Uyghur,1035,610,20,1390 +Non-Official & Non-Aboriginal languages,Uzbek,1720,995,15,2465 +Non-Official & Non-Aboriginal languages,Vietnamese,156430,104245,8075,198895 +Non-Official & Non-Aboriginal languages,Vlaams (Flemish),3895,355,35,4400 +Aboriginal languages,"Wakashan languages, n.i.e.",10,0,0,25 +Non-Official & Non-Aboriginal languages,Waray-Waray,1110,310,0,1395 +Non-Official & Non-Aboriginal languages,Welsh,1075,95,0,1695 +Non-Official & Non-Aboriginal languages,Wolof,3990,1385,10,8240 +Aboriginal languages,Woods Cree,1840,800,75,2665 +Non-Official & Non-Aboriginal languages,Wu (Shanghainese),12915,7650,105,16530 +Non-Official & Non-Aboriginal languages,Yiddish,13555,7085,895,20985 +Non-Official & Non-Aboriginal languages,Yoruba,9080,2615,15,22415 +>>>>>>> dev diff --git a/docs/GitHub.html b/docs/GitHub.html index dd4e1de2e..9e81ad257 100644 --- a/docs/GitHub.html +++ b/docs/GitHub.html @@ -26,7 +26,7 @@ - + diff --git a/docs/_main_files/figure-html/11-bootstrapping-six-bootstrap-samples-1.png b/docs/_main_files/figure-html/11-bootstrapping-six-bootstrap-samples-1.png new file mode 100644 index 000000000..b278a25f4 Binary files /dev/null and b/docs/_main_files/figure-html/11-bootstrapping-six-bootstrap-samples-1.png differ diff --git a/docs/classification-continued.html b/docs/classification-continued.html index 91d5890ca..56225a305 100644 --- a/docs/classification-continued.html +++ b/docs/classification-continued.html @@ -26,7 +26,7 @@ - + @@ -410,26 +410,26 @@

7.3 Evaluating accuracy

start by loading the necessary libraries, reading in the breast cancer data from the previous chapter, and making a quick scatter plot visualization of tumour cell concavity versus smoothness coloured by diagnosis.

-
# load libraries
-library(tidyverse)
-library(tidymodels)
-
-#load data
-cancer <- read_csv("data/unscaled_wdbc.csv") %>% 
-  mutate(Class = as_factor(Class)) # convert the character Class variable to the factor datatype
-
-# colour palette
-cbPalette <- c("#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7", "#999999") 
-
-# create scatter plot of tumour cell concavity versus smoothness, 
-# labelling the points be diagnosis class
-perim_concav <- cancer %>%  
-  ggplot(aes(x = Smoothness, y = Concavity, color = Class)) + 
-    geom_point(alpha = 0.5) +
-    labs(color = "Diagnosis") + 
-    scale_color_manual(labels = c("Malignant", "Benign"), values = cbPalette)
-
-perim_concav
+
# load libraries
+library(tidyverse)
+library(tidymodels)
+
+#load data
+cancer <- read_csv("data/unscaled_wdbc.csv") %>% 
+  mutate(Class = as_factor(Class)) # convert the character Class variable to the factor datatype
+
+# colour palette
+cbPalette <- c("#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7", "#999999") 
+
+# create scatter plot of tumour cell concavity versus smoothness, 
+# labelling the points by diagnosis class
+perim_concav <- cancer %>%  
+  ggplot(aes(x = Smoothness, y = Concavity, color = Class)) + 
+    geom_point(alpha = 0.5) +
+    labs(color = "Diagnosis") + 
+    scale_color_manual(labels = c("Malignant", "Benign"), values = cbPalette)
+
+perim_concav

1. Create the train / test split

Once we have decided on a predictive question to answer and done some @@ -441,45 +441,45 @@

7.3 Evaluating accuracy

using a larger test data set). Here, we will use 75% of the data for training, and 25% for testing. To do this we will use the initial_split function, specifying that prop = 0.75 and the target variable is Class:

-
set.seed(1)
-cancer_split <- initial_split(cancer, prop = 0.75, strata = Class)
-cancer_train <- training(cancer_split)
-cancer_test <- testing(cancer_split)
+
set.seed(1)
+cancer_split <- initial_split(cancer, prop = 0.75, strata = Class)
+cancer_train <- training(cancer_split)
+cancer_test <- testing(cancer_split)

Note: You will see in the code above that we use the set.seed function again, as discussed in the previous chapter. In this case it is because initial_split uses random sampling to choose which rows will be in the training set. Since we want our code to be reproducible and generate the same train/test split each time it is run, we use set.seed.
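
To see what set.seed buys us, here is a minimal sketch (using a toy sample rather than the cancer data): running the same sampling code after setting the same seed produces the same result every time.

set.seed(1)
sample(1:10, 3) # three "random" numbers

set.seed(1)
sample(1:10, 3) # same seed, so the same three numbers again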

-
glimpse(cancer_train)
+
glimpse(cancer_train)
## Rows: 427
 ## Columns: 12
-## $ ID                <dbl> 842302, 842517, 84300903, 84348301, 84358402, 84378…
-## $ Class             <fct> M, M, M, M, M, M, M, M, M, M, M, M, M, M, M, B, B, …
-## $ Radius            <dbl> 17.990, 20.570, 19.690, 11.420, 20.290, 12.450, 18.…
-## $ Texture           <dbl> 10.38, 17.77, 21.25, 20.38, 14.34, 15.70, 19.98, 20…
-## $ Perimeter         <dbl> 122.80, 132.90, 130.00, 77.58, 135.10, 82.57, 119.6…
-## $ Area              <dbl> 1001.0, 1326.0, 1203.0, 386.1, 1297.0, 477.1, 1040.…
-## $ Smoothness        <dbl> 0.11840, 0.08474, 0.10960, 0.14250, 0.10030, 0.1278…
-## $ Compactness       <dbl> 0.27760, 0.07864, 0.15990, 0.28390, 0.13280, 0.1700…
-## $ Concavity         <dbl> 0.30010, 0.08690, 0.19740, 0.24140, 0.19800, 0.1578…
-## $ Concave_Points    <dbl> 0.14710, 0.07017, 0.12790, 0.10520, 0.10430, 0.0808…
-## $ Symmetry          <dbl> 0.2419, 0.1812, 0.2069, 0.2597, 0.1809, 0.2087, 0.1…
-## $ Fractal_Dimension <dbl> 0.07871, 0.05667, 0.05999, 0.09744, 0.05883, 0.0761…
-
glimpse(cancer_test)
+## $ ID <dbl> 842302, 842517, 84300903, 84348301, 84358402, 843786, 844359, 84458202, 84501001, 845…
+## $ Class <fct> M, M, M, M, M, M, M, M, M, M, M, M, M, M, M, B, B, M, M, M, M, M, M, M, M, M, M, M, M…
+## $ Radius <dbl> 17.990, 20.570, 19.690, 11.420, 20.290, 12.450, 18.250, 13.710, 12.460, 16.020, 15.78…
+## $ Texture <dbl> 10.38, 17.77, 21.25, 20.38, 14.34, 15.70, 19.98, 20.83, 24.04, 23.24, 17.89, 24.80, 2…
+## $ Perimeter <dbl> 122.80, 132.90, 130.00, 77.58, 135.10, 82.57, 119.60, 90.20, 83.97, 102.70, 103.60, 1…
+## $ Area <dbl> 1001.0, 1326.0, 1203.0, 386.1, 1297.0, 477.1, 1040.0, 577.9, 475.9, 797.8, 781.0, 112…
+## $ Smoothness <dbl> 0.11840, 0.08474, 0.10960, 0.14250, 0.10030, 0.12780, 0.09463, 0.11890, 0.11860, 0.08…
+## $ Compactness <dbl> 0.27760, 0.07864, 0.15990, 0.28390, 0.13280, 0.17000, 0.10900, 0.16450, 0.23960, 0.06…
+## $ Concavity <dbl> 0.30010, 0.08690, 0.19740, 0.24140, 0.19800, 0.15780, 0.11270, 0.09366, 0.22730, 0.03…
+## $ Concave_Points <dbl> 0.147100, 0.070170, 0.127900, 0.105200, 0.104300, 0.080890, 0.074000, 0.059850, 0.085…
+## $ Symmetry <dbl> 0.2419, 0.1812, 0.2069, 0.2597, 0.1809, 0.2087, 0.1794, 0.2196, 0.2030, 0.1528, 0.184…
+## $ Fractal_Dimension <dbl> 0.07871, 0.05667, 0.05999, 0.09744, 0.05883, 0.07613, 0.05742, 0.07451, 0.08243, 0.05…
+
glimpse(cancer_test)
## Rows: 142
 ## Columns: 12
-## $ ID                <dbl> 844981, 84799002, 848406, 849014, 8510426, 8511133,…
-## $ Class             <fct> M, M, M, M, B, M, M, M, M, M, M, B, B, M, M, M, B, …
-## $ Radius            <dbl> 13.000, 14.540, 14.680, 19.810, 13.540, 15.340, 18.…
-## $ Texture           <dbl> 21.82, 27.54, 20.13, 22.15, 14.36, 14.26, 25.11, 26…
-## $ Perimeter         <dbl> 87.50, 96.73, 94.74, 130.00, 87.46, 102.50, 124.80,…
-## $ Area              <dbl> 519.8, 658.8, 684.5, 1260.0, 566.3, 704.4, 1088.0, …
-## $ Smoothness        <dbl> 0.12730, 0.11390, 0.09867, 0.09831, 0.09779, 0.1073…
-## $ Compactness       <dbl> 0.19320, 0.15950, 0.07200, 0.10270, 0.08129, 0.2135…
-## $ Concavity         <dbl> 0.18590, 0.16390, 0.07395, 0.14790, 0.06664, 0.2077…
-## $ Concave_Points    <dbl> 0.093530, 0.073640, 0.052590, 0.094980, 0.047810, 0…
-## $ Symmetry          <dbl> 0.2350, 0.2303, 0.1586, 0.1582, 0.1885, 0.2521, 0.2…
-## $ Fractal_Dimension <dbl> 0.07389, 0.07077, 0.05922, 0.05395, 0.05766, 0.0703…
+## $ ID <dbl> 844981, 84799002, 848406, 849014, 8510426, 8511133, 853401, 854002, 855167, 856106, 8…
+## $ Class <fct> M, M, M, M, B, M, M, M, M, M, M, B, B, M, M, M, B, M, M, B, B, B, M, M, B, B, M, B, B…
+## $ Radius <dbl> 13.000, 14.540, 14.680, 19.810, 13.540, 15.340, 18.630, 19.270, 13.440, 13.280, 18.22…
+## $ Texture <dbl> 21.82, 27.54, 20.13, 22.15, 14.36, 14.26, 25.11, 26.47, 21.58, 20.28, 18.70, 11.79, 1…
+## $ Perimeter <dbl> 87.50, 96.73, 94.74, 130.00, 87.46, 102.50, 124.80, 127.90, 86.18, 87.32, 120.30, 54.…
+## $ Area <dbl> 519.8, 658.8, 684.5, 1260.0, 566.3, 704.4, 1088.0, 1162.0, 563.0, 545.2, 1033.0, 224.…
+## $ Smoothness <dbl> 0.12730, 0.11390, 0.09867, 0.09831, 0.09779, 0.10730, 0.10640, 0.09401, 0.08162, 0.10…
+## $ Compactness <dbl> 0.19320, 0.15950, 0.07200, 0.10270, 0.08129, 0.21350, 0.18870, 0.17190, 0.06031, 0.14…
+## $ Concavity <dbl> 0.18590, 0.16390, 0.07395, 0.14790, 0.06664, 0.20770, 0.23190, 0.16570, 0.03110, 0.09…
+## $ Concave_Points <dbl> 0.093530, 0.073640, 0.052590, 0.094980, 0.047810, 0.097560, 0.124400, 0.075930, 0.020…
+## $ Symmetry <dbl> 0.2350, 0.2303, 0.1586, 0.1582, 0.1885, 0.2521, 0.2183, 0.1853, 0.1784, 0.1974, 0.209…
+## $ Fractal_Dimension <dbl> 0.07389, 0.07077, 0.05922, 0.05395, 0.05766, 0.07032, 0.06197, 0.06261, 0.05587, 0.06…

We can see from glimpse in the code above that the training set contains 427 observations, while the test set contains 142 observations. This corresponds to a train / test split of 75% / 25%, as desired.
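
To double-check that proportion yourself, one quick sketch (using the cancer, cancer_train, and cancer_test data frames above) is to compare row counts directly:

nrow(cancer_train) / nrow(cancer)  # fraction of rows used for training (should be about 0.75)
nrow(cancer_test) / nrow(cancer)   # fraction of rows held out for testing (should be about 0.25)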

@@ -494,37 +494,37 @@

7.3 Evaluating accuracy

Fortunately, the recipe framework from tidymodels makes it simple to handle this properly. Below we construct and prepare the recipe using only the training data (due to data = cancer_train in the first line).

-
cancer_recipe <- recipe(Class ~ Smoothness + Concavity,  data = cancer_train) %>%
-       step_scale(all_predictors()) %>%
-       step_center(all_predictors())
+
cancer_recipe <- recipe(Class ~ Smoothness + Concavity,  data = cancer_train) %>%
+       step_scale(all_predictors()) %>%
+       step_center(all_predictors())

3. Train the classifier

Now that we have split our original data set into training and test sets, we can create our K-nearest neighbour classifier with only the training set using the technique we learned in the previous chapter. For now, we will just choose the number \(K\) of neighbours to be 3, and use concavity and smoothness as the predictors.

-
set.seed(1)
-knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 3) %>%
-       set_engine("kknn") %>%
-       set_mode("classification")
-
-knn_fit <- workflow() %>%
-             add_recipe(cancer_recipe) %>%
-             add_model(knn_spec) %>%
-        fit(data = cancer_train)
-
-knn_fit
-
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
+
set.seed(1)
+knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 3) %>%
+       set_engine("kknn") %>%
+       set_mode("classification")
+
+knn_fit <- workflow() %>%
+             add_recipe(cancer_recipe) %>%
+             add_model(knn_spec) %>%
+        fit(data = cancer_train)
+
+knn_fit
+
## ══ Workflow [trained] ════════════════════════════════════════════════════════════════════════════════════════════
 ## Preprocessor: Recipe
 ## Model: nearest_neighbor()
 ## 
-## ── Preprocessor ────────────────────────────────────────────────────────────────
+## ── Preprocessor ──────────────────────────────────────────────────────────────────────────────────────────────────
 ## 2 Recipe Steps
 ## 
 ## ● step_scale()
 ## ● step_center()
 ## 
-## ── Model ───────────────────────────────────────────────────────────────────────
+## ── Model ─────────────────────────────────────────────────────────────────────────────────────────────────────────
 ## 
 ## Call:
 ## kknn::train.kknn(formula = formula, data = data, ks = ~3, kernel = ~"rectangular")
@@ -547,26 +547,25 @@ 

7.3 Evaluating accuracy

cancer_test_predictions data frame. The Class variable contains the true diagnoses, while the .pred_class contains the predicted diagnoses from the model.

-
cancer_test_predictions <- predict(knn_fit, cancer_test) %>%
-                            bind_cols(cancer_test)
-head(cancer_test_predictions)
+
cancer_test_predictions <- predict(knn_fit, cancer_test) %>%
+                            bind_cols(cancer_test)
+head(cancer_test_predictions)
## # A tibble: 6 x 13
-##   .pred_class     ID Class Radius Texture Perimeter  Area Smoothness Compactness
-##   <fct>        <dbl> <fct>  <dbl>   <dbl>     <dbl> <dbl>      <dbl>       <dbl>
-## 1 M           8.45e5 M       13      21.8      87.5  520.     0.127       0.193 
-## 2 M           8.48e7 M       14.5    27.5      96.7  659.     0.114       0.160 
-## 3 B           8.48e5 M       14.7    20.1      94.7  684.     0.0987      0.072 
-## 4 M           8.49e5 M       19.8    22.2     130   1260      0.0983      0.103 
-## 5 B           8.51e6 B       13.5    14.4      87.5  566.     0.0978      0.0813
-## 6 M           8.51e6 M       15.3    14.3     102.   704.     0.107       0.214 
-## # … with 4 more variables: Concavity <dbl>, Concave_Points <dbl>,
-## #   Symmetry <dbl>, Fractal_Dimension <dbl>
+## .pred_class ID Class Radius Texture Perimeter Area Smoothness Compactness Concavity Concave_Points Symmetry
+## <fct> <dbl> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
+## 1 M 8.45e5 M 13 21.8 87.5 520. 0.127 0.193 0.186 0.0935 0.235
+## 2 M 8.48e7 M 14.5 27.5 96.7 659. 0.114 0.160 0.164 0.0736 0.230
+## 3 B 8.48e5 M 14.7 20.1 94.7 684. 0.0987 0.072 0.0740 0.0526 0.159
+## 4 M 8.49e5 M 19.8 22.2 130 1260 0.0983 0.103 0.148 0.0950 0.158
+## 5 B 8.51e6 B 13.5 14.4 87.5 566. 0.0978 0.0813 0.0666 0.0478 0.188
+## 6 M 8.51e6 M 15.3 14.3 102. 704. 0.107 0.214 0.208 0.0976 0.252
+## # … with 1 more variable: Fractal_Dimension <dbl>

5. Compute the accuracy

Finally we can assess our classifier’s accuracy. To do this we use the metrics function from tidymodels to get the statistics about the quality of our model, specifying the truth and estimate arguments:

-
cancer_test_predictions %>%
-    metrics(truth = Class, estimate = .pred_class)
+
cancer_test_predictions %>%
+    metrics(truth = Class, estimate = .pred_class)
## # A tibble: 2 x 3
 ##   .metric  .estimator .estimate
 ##   <chr>    <chr>          <dbl>
@@ -575,8 +574,8 @@ 

7.3 Evaluating accuracy

This shows that the accuracy of the classifier on the test data was 88%. We can also look at the confusion matrix for the classifier, which shows the table of predicted labels and correct labels, using the conf_mat function:

-
cancer_test_predictions %>%
-    conf_mat(truth = Class, estimate = .pred_class)
+
cancer_test_predictions %>%
+    conf_mat(truth = Class, estimate = .pred_class)
##           Truth
 ## Prediction  M  B
 ##          M 43  7
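
As an aside, the same accuracy (along with several related statistics) can be derived from the confusion matrix object itself; a minimal sketch using summary:

cancer_test_predictions %>%
    conf_mat(truth = Class, estimate = .pred_class) %>%
    summary() # accuracy, sensitivity, specificity, and other statistics computed from the table
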
@@ -622,39 +621,39 @@ 

7.4.1 Cross-validation

values in the set.seed function to generate five different train / validation splits of our overall training data, train five different K-nearest neighbour models, and evaluate their accuracy.

-
accuracies <- c()
-for (i in 1:5){
-    set.seed(i) # makes the random selection of rows reproducible
-
-    # create the 25/75 split of the training data into training and validation
-    cancer_split <- initial_split(cancer_train, prop = 0.75, strata = Class)
-    cancer_subtrain <- training(cancer_split)
-    cancer_validation <- testing(cancer_split)
-
-    # recreate the standardization recipe from before (since it must be based on the training data)
-    cancer_recipe <- recipe(Class ~ Smoothness + Concavity,  data = cancer_subtrain) %>%
-       step_scale(all_predictors()) %>%
-       step_center(all_predictors())
-
-    # fit the knn model (we can reuse the old knn_spec model from before)
-    knn_fit <- workflow() %>%
-             add_recipe(cancer_recipe) %>%
-             add_model(knn_spec) %>%
-             fit(data = cancer_subtrain)
-
-    # get predictions on the validation data
-    validation_predicted <- predict(knn_fit, cancer_validation) %>%
-                              bind_cols(cancer_validation)
-    
-    #compute the accuracy
-    acc <- validation_predicted %>% 
-              metrics(truth = Class, estimate = .pred_class) %>%
-              filter(.metric == "accuracy") %>%
-              select(.estimate) %>%
-              pull()
-    accuracies <- append(accuracies, acc)
-}
-accuracies
+
accuracies <- c()
+for (i in 1:5){
+    set.seed(i) # makes the random selection of rows reproducible
+
+    # create the 25/75 split of the training data into training and validation
+    cancer_split <- initial_split(cancer_train, prop = 0.75, strata = Class)
+    cancer_subtrain <- training(cancer_split)
+    cancer_validation <- testing(cancer_split)
+
+    # recreate the standardization recipe from before (since it must be based on the training data)
+    cancer_recipe <- recipe(Class ~ Smoothness + Concavity,  data = cancer_subtrain) %>%
+       step_scale(all_predictors()) %>%
+       step_center(all_predictors())
+
+    # fit the knn model (we can reuse the old knn_spec model from before)
+    knn_fit <- workflow() %>%
+             add_recipe(cancer_recipe) %>%
+             add_model(knn_spec) %>%
+             fit(data = cancer_subtrain)
+
+    # get predictions on the validation data
+    validation_predicted <- predict(knn_fit, cancer_validation) %>%
+                              bind_cols(cancer_validation)
+    
+    #compute the accuracy
+    acc <- validation_predicted %>% 
+              metrics(truth = Class, estimate = .pred_class) %>%
+              filter(.metric == "accuracy") %>%
+              select(.estimate) %>%
+              pull()
+    accuracies <- append(accuracies, acc)
+}
+accuracies
## [1] 0.9150943 0.8679245 0.8490566 0.8962264 0.9150943
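
Because each of these five numbers is just one estimate of the same underlying accuracy, a natural way to summarize them is their average and spread; a small sketch using the accuracies vector above:

mean(accuracies) # average validation accuracy across the five splits
sd(accuracies)   # how much the estimate varies from split to split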

With five different shuffles of the data, we get five different values for accuracy. None of these is necessarily “more correct” than any other; they’re @@ -676,8 +675,8 @@

7.4.1 Cross-validation

5-fold cross-validation. To do 5-fold cross-validation in R with tidymodels, we use another function: vfold_cv. This function splits our training data into v folds automatically:

-
cancer_vfold <- vfold_cv(cancer_train, v = 5, strata = Class)
-cancer_vfold
+
cancer_vfold <- vfold_cv(cancer_train, v = 5, strata = Class)
+cancer_vfold
## #  5-fold cross-validation using stratification 
 ## # A tibble: 5 x 2
 ##   splits           id   
@@ -694,20 +693,20 @@ 

7.4.1 Cross-validation

Note: we set the seed before we fit the model, not only because of the potential for ties, but also because we are doing cross-validation. Cross-validation uses a random process to select how to partition the training data.

-
set.seed(1)
-
-# recreate the standardization recipe from before (since it must be based on the training data)
-cancer_recipe <- recipe(Class ~ Smoothness + Concavity,  data = cancer_train) %>%
-   step_scale(all_predictors()) %>%
-   step_center(all_predictors())
-
-# fit the knn model (we can reuse the old knn_spec model from before)
-knn_fit <- workflow() %>%
-         add_recipe(cancer_recipe) %>%
-         add_model(knn_spec) %>%
-         fit_resamples(resamples = cancer_vfold)
-
-knn_fit
+
set.seed(1)
+
+# recreate the standardization recipe from before (since it must be based on the training data)
+cancer_recipe <- recipe(Class ~ Smoothness + Concavity,  data = cancer_train) %>%
+   step_scale(all_predictors()) %>%
+   step_center(all_predictors())
+
+# fit the knn model (we can reuse the old knn_spec model from before)
+knn_fit <- workflow() %>%
+         add_recipe(cancer_recipe) %>%
+         add_model(knn_spec) %>%
+         fit_resamples(resamples = cancer_vfold)
+
+knn_fit
## #  5-fold cross-validation using stratification 
 ## # A tibble: 5 x 4
 ##   splits           id    .metrics         .notes          
@@ -725,7 +724,7 @@ 

7.4.1 Cross-validation

error is 0.02, you can expect the true average accuracy of the classifier to be somewhere roughly between 0.86 and 0.90 (although it may fall outside this range).

-
knn_fit %>% collect_metrics()
+
knn_fit %>% collect_metrics()
## # A tibble: 2 x 5
 ##   .metric  .estimator  mean     n std_err
 ##   <chr>    <chr>      <dbl> <int>   <dbl>
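
To recover the rough range quoted above directly from this output, one could add the standard error to (and subtract it from) the mean; a minimal sketch:

knn_fit %>%
  collect_metrics() %>%
  filter(.metric == "accuracy") %>%
  mutate(lower = mean - std_err,  # approximate lower end of the expected range
         upper = mean + std_err)  # approximate upper end of the expected range
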
@@ -741,13 +740,13 @@ 

7.4.1 Cross-validation

error process, but typically \(C\) is chosen to be either 5 or 10. Here we show how the standard error decreases when we use 10-fold cross validation rather than 5-fold:

-
cancer_vfold <- vfold_cv(cancer_train, v = 10, strata = Class)
-
-workflow() %>%
-    add_recipe(cancer_recipe) %>%
-    add_model(knn_spec) %>%
-    fit_resamples(resamples = cancer_vfold) %>%
-    collect_metrics()
+
cancer_vfold <- vfold_cv(cancer_train, v = 10, strata = Class)
+
+workflow() %>%
+    add_recipe(cancer_recipe) %>%
+    add_model(knn_spec) %>%
+    fit_resamples(resamples = cancer_vfold) %>%
+    collect_metrics()
## # A tibble: 2 x 5
 ##   .metric  .estimator  mean     n std_err
 ##   <chr>    <chr>      <dbl> <int>   <dbl>
@@ -769,20 +768,20 @@ 

7.4.2 Parameter value selection

tidymodels package collection provides a very simple syntax for tuning models: each parameter in the model to be tuned should be specified as tune() in the model specification rather than given a particular value.

-
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) %>%
-       set_engine("kknn") %>%
-       set_mode("classification")
+
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) %>%
+       set_engine("kknn") %>%
+       set_mode("classification")

Then instead of using fit or fit_resamples, we will use the tune_grid function to fit the model for each value in a range of parameter values. Here the grid = 10 argument specifies that the tuning should try 10 values of the number of neighbours \(K\) when tuning. We set the seed prior to tuning to ensure results are reproducible:

-
set.seed(1)
-knn_results <- workflow() %>%
-                 add_recipe(cancer_recipe) %>%
-                 add_model(knn_spec) %>%
-                 tune_grid(resamples = cancer_vfold, grid = 10) %>%
-                 collect_metrics()
-knn_results
+
set.seed(1)
+knn_results <- workflow() %>%
+                 add_recipe(cancer_recipe) %>%
+                 add_model(knn_spec) %>%
+                 tune_grid(resamples = cancer_vfold, grid = 10) %>%
+                 collect_metrics()
+knn_results
## # A tibble: 20 x 6
 ##    neighbors .metric  .estimator  mean     n std_err
 ##        <int> <chr>    <chr>      <dbl> <int>   <dbl>
@@ -808,14 +807,14 @@ 

7.4.2 Parameter value selection

We can select the best value of the number of neighbours (i.e., the one that results in the highest classifier accuracy estimate) by plotting the accuracy versus \(K\):

-
accuracies <- knn_results %>%
-                 filter(.metric == 'accuracy')
-
-accuracy_vs_k <- ggplot(accuracies, aes(x = neighbors, y = mean)) +
-  geom_point() +
-  geom_line() +
-  labs(x = 'Neighbors', y = 'Accuracy Estimate')
-accuracy_vs_k
+
accuracies <- knn_results %>%
+                 filter(.metric == 'accuracy')
+
+accuracy_vs_k <- ggplot(accuracies, aes(x = neighbors, y = mean)) +
+  geom_point() +
+  geom_line() +
+  labs(x = 'Neighbors', y = 'Accuracy Estimate')
+accuracy_vs_k
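
As a complement to reading the plot, the top-performing value of \(K\) can also be pulled out programmatically; a sketch using the accuracies data frame created above:

# sort the accuracy estimates from highest to lowest and keep the top row
accuracies %>%
  arrange(desc(mean)) %>%
  slice(1)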

This visualization suggests that \(K = 7\) provides the highest accuracy. But as you can see, there is no exact or perfect answer here; @@ -838,22 +837,22 @@

7.4.3 Under/overfitting

actually starts to decrease! Rather than setting grid = 10 and letting tidymodels decide what values of \(K\) to try, let's specify the values explicitly by creating a data frame with a neighbors variable. Take a look at the plot below as we vary \(K\) from 1 to almost the number of observations in the data set:

-
set.seed(1)
-k_lots = tibble(neighbors = seq(from = 1, to = 385, by = 10))
-knn_results <- workflow() %>%
-                 add_recipe(cancer_recipe) %>%
-                 add_model(knn_spec) %>%
-                 tune_grid(resamples = cancer_vfold, grid = k_lots) %>%
-                 collect_metrics()
-
-accuracies <- knn_results %>%
-                 filter(.metric == 'accuracy')
-
-accuracy_vs_k_lots <- ggplot(accuracies, aes(x = neighbors, y = mean)) +
-  geom_point() +
-  geom_line() + 
-  labs(x = 'Neighbors', y = 'Accuracy Estimate')
-accuracy_vs_k_lots
+
set.seed(1)
+k_lots <- tibble(neighbors = seq(from = 1, to = 385, by = 10))
+knn_results <- workflow() %>%
+                 add_recipe(cancer_recipe) %>%
+                 add_model(knn_spec) %>%
+                 tune_grid(resamples = cancer_vfold, grid = k_lots) %>%
+                 collect_metrics()
+
+accuracies <- knn_results %>%
+                 filter(.metric == 'accuracy')
+
+accuracy_vs_k_lots <- ggplot(accuracies, aes(x = neighbors, y = mean)) +
+  geom_point() +
+  geom_line() + 
+  labs(x = 'Neighbors', y = 'Accuracy Estimate')
+accuracy_vs_k_lots

Underfitting: What is actually happening to our classifier that causes this? As we increase the number of neighbours, more and more of the training diff --git a/docs/classification.html b/docs/classification.html index 9553a8c3c..6cc0df227 100644 --- a/docs/classification.html +++ b/docs/classification.html @@ -26,7 +26,7 @@ - + @@ -433,24 +433,23 @@

6.4 Exploring a labelled data set

The forcats library enables us to easily manipulate factors in R; factors are a special categorical type of variable in R that are often used for class label data.

-
library(tidyverse)
-library(forcats)
+
library(tidyverse)
+library(forcats)

In this case, the file containing the breast cancer data set is a simple .csv file with headers. We’ll use the read_csv function with no additional arguments, and then the head function to inspect its contents:

-
cancer <- read_csv("data/wdbc.csv")
-head(cancer)
+
cancer <- read_csv("data/wdbc.csv")
+head(cancer)
## # A tibble: 6 x 12
-##       ID Class Radius Texture Perimeter   Area Smoothness Compactness Concavity
-##    <dbl> <chr>  <dbl>   <dbl>     <dbl>  <dbl>      <dbl>       <dbl>     <dbl>
-## 1 8.42e5 M      1.10   -2.07      1.27   0.984      1.57        3.28     2.65  
-## 2 8.43e5 M      1.83   -0.353     1.68   1.91      -0.826      -0.487   -0.0238
-## 3 8.43e7 M      1.58    0.456     1.57   1.56       0.941       1.05     1.36  
-## 4 8.43e7 M     -0.768   0.254    -0.592 -0.764      3.28        3.40     1.91  
-## 5 8.44e7 M      1.75   -1.15      1.78   1.82       0.280       0.539    1.37  
-## 6 8.44e5 M     -0.476  -0.835    -0.387 -0.505      2.24        1.24     0.866 
-## # … with 3 more variables: Concave_Points <dbl>, Symmetry <dbl>,
-## #   Fractal_Dimension <dbl>
+## ID Class Radius Texture Perimeter Area Smoothness Compactness Concavity Concave_Points Symmetry
+## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
+## 1 8.42e5 M 1.10 -2.07 1.27 0.984 1.57 3.28 2.65 2.53 2.22
+## 2 8.43e5 M 1.83 -0.353 1.68 1.91 -0.826 -0.487 -0.0238 0.548 0.00139
+## 3 8.43e7 M 1.58 0.456 1.57 1.56 0.941 1.05 1.36 2.04 0.939
+## 4 8.43e7 M -0.768 0.254 -0.592 -0.764 3.28 3.40 1.91 1.45 2.86
+## 5 8.44e7 M 1.75 -1.15 1.78 1.82 0.280 0.539 1.37 1.43 -0.00955
+## 6 8.44e5 M -0.476 -0.835 -0.387 -0.505 2.24 1.24 0.866 0.824 1.00
+## # … with 1 more variable: Fractal_Dimension <dbl>

Variable descriptions

Breast tumours can be diagnosed by performing a biopsy, a process where tissue is removed from the body and examined for the presence of disease. @@ -483,42 +482,42 @@

6.4 Exploring a labelled data set

A magnified image of a malignant breast fine needle aspiration. White lines denote the boundary of the cell nuclei. Source

Below we use glimpse to preview the data frame. This function is similar to head, but can be easier to read when we have a lot of columns:

-
glimpse(cancer)
+
glimpse(cancer)
## Rows: 569
 ## Columns: 12
-## $ ID                <dbl> 842302, 842517, 84300903, 84348301, 84358402, 84378…
-## $ Class             <chr> "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "…
-## $ Radius            <dbl> 1.0960995, 1.8282120, 1.5784992, -0.7682333, 1.7487…
-## $ Texture           <dbl> -2.0715123, -0.3533215, 0.4557859, 0.2535091, -1.15…
-## $ Perimeter         <dbl> 1.26881726, 1.68447255, 1.56512598, -0.59216612, 1.…
-## $ Area              <dbl> 0.98350952, 1.90703027, 1.55751319, -0.76379174, 1.…
-## $ Smoothness        <dbl> 1.56708746, -0.82623545, 0.94138212, 3.28066684, 0.…
-## $ Compactness       <dbl> 3.28062806, -0.48664348, 1.05199990, 3.39991742, 0.…
-## $ Concavity         <dbl> 2.65054179, -0.02382489, 1.36227979, 1.91421287, 1.…
-## $ Concave_Points    <dbl> 2.53024886, 0.54766227, 2.03543978, 1.45043113, 1.4…
-## $ Symmetry          <dbl> 2.215565542, 0.001391139, 0.938858720, 2.864862154,…
-## $ Fractal_Dimension <dbl> 2.25376381, -0.86788881, -0.39765801, 4.90660199, -…
+## $ ID <dbl> 842302, 842517, 84300903, 84348301, 84358402, 843786, 844359, 84458202, 844981, 84501…
+## $ Class <chr> "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", …
+## $ Radius <dbl> 1.09609953, 1.82821197, 1.57849920, -0.76823332, 1.74875791, -0.47595587, 1.16987830,…
+## $ Texture <dbl> -2.0715123, -0.3533215, 0.4557859, 0.2535091, -1.1508038, -0.8346009, 0.1605082, 0.35…
+## $ Perimeter <dbl> 1.26881726, 1.68447255, 1.56512598, -0.59216612, 1.77501133, -0.38680772, 1.13712450,…
+## $ Area <dbl> 0.98350952, 1.90703027, 1.55751319, -0.76379174, 1.82462380, -0.50520593, 1.09433201,…
+## $ Smoothness <dbl> 1.56708746, -0.82623545, 0.94138212, 3.28066684, 0.28012535, 2.23545452, -0.12302797,…
+## $ Compactness <dbl> 3.28062806, -0.48664348, 1.05199990, 3.39991742, 0.53886631, 1.24324156, 0.08821762, …
+## $ Concavity <dbl> 2.65054179, -0.02382489, 1.36227979, 1.91421287, 1.36980615, 0.86554001, 0.29980860, …
+## $ Concave_Points <dbl> 2.53024886, 0.54766227, 2.03543978, 1.45043113, 1.42723695, 0.82393067, 0.64636637, 0…
+## $ Symmetry <dbl> 2.215565542, 0.001391139, 0.938858720, 2.864862154, -0.009552062, 1.004517928, -0.064…
+## $ Fractal_Dimension <dbl> 2.25376381, -0.86788881, -0.39765801, 4.90660199, -0.56195552, 1.88834350, -0.7616619…

We can see from the summary of the data above that Class is of type character (denoted by <chr>). Since we are going to be working with Class as a categorical statistical variable, we will convert it to factor using the function as_factor.

-
cancer <- cancer %>% 
-  mutate(Class = as_factor(Class)) 
-glimpse(cancer)
+
cancer <- cancer %>% 
+  mutate(Class = as_factor(Class)) 
+glimpse(cancer)
## Rows: 569
 ## Columns: 12
-## $ ID                <dbl> 842302, 842517, 84300903, 84348301, 84358402, 84378…
-## $ Class             <fct> M, M, M, M, M, M, M, M, M, M, M, M, M, M, M, M, M, …
-## $ Radius            <dbl> 1.0960995, 1.8282120, 1.5784992, -0.7682333, 1.7487…
-## $ Texture           <dbl> -2.0715123, -0.3533215, 0.4557859, 0.2535091, -1.15…
-## $ Perimeter         <dbl> 1.26881726, 1.68447255, 1.56512598, -0.59216612, 1.…
-## $ Area              <dbl> 0.98350952, 1.90703027, 1.55751319, -0.76379174, 1.…
-## $ Smoothness        <dbl> 1.56708746, -0.82623545, 0.94138212, 3.28066684, 0.…
-## $ Compactness       <dbl> 3.28062806, -0.48664348, 1.05199990, 3.39991742, 0.…
-## $ Concavity         <dbl> 2.65054179, -0.02382489, 1.36227979, 1.91421287, 1.…
-## $ Concave_Points    <dbl> 2.53024886, 0.54766227, 2.03543978, 1.45043113, 1.4…
-## $ Symmetry          <dbl> 2.215565542, 0.001391139, 0.938858720, 2.864862154,…
-## $ Fractal_Dimension <dbl> 2.25376381, -0.86788881, -0.39765801, 4.90660199, -…
+## $ ID <dbl> 842302, 842517, 84300903, 84348301, 84358402, 843786, 844359, 84458202, 844981, 84501…
+## $ Class <fct> M, M, M, M, M, M, M, M, M, M, M, M, M, M, M, M, M, M, M, B, B, B, M, M, M, M, M, M, M…
+## $ Radius <dbl> 1.09609953, 1.82821197, 1.57849920, -0.76823332, 1.74875791, -0.47595587, 1.16987830,…
+## $ Texture <dbl> -2.0715123, -0.3533215, 0.4557859, 0.2535091, -1.1508038, -0.8346009, 0.1605082, 0.35…
+## $ Perimeter <dbl> 1.26881726, 1.68447255, 1.56512598, -0.59216612, 1.77501133, -0.38680772, 1.13712450,…
+## $ Area <dbl> 0.98350952, 1.90703027, 1.55751319, -0.76379174, 1.82462380, -0.50520593, 1.09433201,…
+## $ Smoothness <dbl> 1.56708746, -0.82623545, 0.94138212, 3.28066684, 0.28012535, 2.23545452, -0.12302797,…
+## $ Compactness <dbl> 3.28062806, -0.48664348, 1.05199990, 3.39991742, 0.53886631, 1.24324156, 0.08821762, …
+## $ Concavity <dbl> 2.65054179, -0.02382489, 1.36227979, 1.91421287, 1.36980615, 0.86554001, 0.29980860, …
+## $ Concave_Points <dbl> 2.53024886, 0.54766227, 2.03543978, 1.45043113, 1.42723695, 0.82393067, 0.64636637, 0…
+## $ Symmetry <dbl> 2.215565542, 0.001391139, 0.938858720, 2.864862154, -0.009552062, 1.004517928, -0.064…
+## $ Fractal_Dimension <dbl> 2.25376381, -0.86788881, -0.39765801, 4.90660199, -0.56195552, 1.88834350, -0.7616619…

Factors have what are called “levels”, which you can think of as categories. We can ask for the levels from the Class column by using the levels function. This function should return the name of each category in that column. Given @@ -527,20 +526,20 @@

6.4 Exploring a labelled data set

a vector argument, while the select function outputs a data frame; so we use the pull function, which converts a single column of a data frame into a vector.

-
cancer %>% 
-  select(Class) %>% 
-  pull() %>% # turns a data frame into a vector
-  levels()
+
cancer %>% 
+  select(Class) %>% 
+  pull() %>% # turns a data frame into a vector
+  levels()
## [1] "M" "B"

Exploring the data

Before we start doing any modelling, let’s explore our data set. Below we use the group_by + summarize code pattern we used before to see that we have 357 (63%) benign and 212 (37%) malignant tumour observations.

-
num_obs <- nrow(cancer)
-cancer %>% 
-  group_by(Class) %>% 
-  summarize(n = n(),
-            percentage = n() / num_obs * 100)
+
num_obs <- nrow(cancer)
+cancer %>% 
+  group_by(Class) %>% 
+  summarize(n = n(),
+            percentage = n() / num_obs * 100)
## # A tibble: 2 x 3
 ##   Class     n percentage
 ##   <fct> <int>      <dbl>
@@ -552,15 +551,15 @@ 

6.4 Exploring a labelled data set

the scale_color_manual function. We also make the category labels (“B” and “M”) more readable by changing them to “Benign” and “Malignant” using the labels argument.

-
# colour palette
-cbPalette <- c("#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7", "#999999") 
-
-perim_concav <- cancer %>%  
-  ggplot(aes(x = Perimeter, y = Concavity, color = Class)) + 
-    geom_point(alpha = 0.5) +
-    labs(color = "Diagnosis") + 
-    scale_color_manual(labels = c("Malignant", "Benign"), values = cbPalette)
-perim_concav
+
# colour palette
+cbPalette <- c("#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7", "#999999") 
+
+perim_concav <- cancer %>%  
+  ggplot(aes(x = Perimeter, y = Concavity, color = Class)) + 
+    geom_point(alpha = 0.5) +
+    labs(color = "Diagnosis") + 
+    scale_color_manual(labels = c("Malignant", "Benign"), values = cbPalette)
+perim_concav

In this visualization, we can see that malignant observations typically fall in the upper right-hand corner of the plot area. By contrast, benign @@ -635,12 +634,12 @@

6.5 Classification with K-nearest
-
new_obs_Perimeter <- 0
-new_obs_Concavity <- 3.5
-cancer %>% select(ID, Perimeter, Concavity, Class) %>% 
-  mutate(dist_from_new = sqrt((Perimeter - new_obs_Perimeter)^2  + (Concavity - new_obs_Concavity)^2)) %>% 
-  arrange(dist_from_new) %>% 
-  head(n = 5)
+
new_obs_Perimeter <- 0
+new_obs_Concavity <- 3.5
+cancer %>% select(ID, Perimeter, Concavity, Class) %>% 
+  mutate(dist_from_new = sqrt((Perimeter - new_obs_Perimeter)^2  + (Concavity - new_obs_Concavity)^2)) %>% 
+  arrange(dist_from_new) %>% 
+  head(n = 5)
## # A tibble: 5 x 5
 ##        ID Perimeter Concavity Class dist_from_new
 ##     <dbl>     <dbl>     <dbl> <fct>         <dbl>
@@ -714,8 +713,8 @@ 

6.5 Classification with K-nearest

and then took the square root; now we will do the same, except for all of our \(m\) variables. In other words, the distance formula becomes

\[Distance = \sqrt{(u_{1} - v_{1})^2 + (u_{2} - v_{2})^2 + \dots + (u_{m} - v_{m})^2}\]
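
For concreteness, here is a small sketch of this computation in R, with two made-up observations u and v measured on \(m = 4\) variables:

u <- c(0.2, 1.5, -0.3, 2.0)
v <- c(1.0, 0.5, 0.1, -1.2)

# straight-line distance between u and v in 4-dimensional space
sqrt(sum((u - v)^2))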

-
- +
+

Click and drag the plot above to rotate it, and scroll to zoom. Note that in general we recommend against using 3D visualizations; here we show the data in 3D only to illustrate what “higher dimensions” look like for learning @@ -743,15 +742,15 @@

6.6 K-nearest neighbours with tidymodels:

-
library(tidymodels)
+
library(tidymodels)

Let’s again suppose we have a new observation with perimeter 0 and concavity 3.5, but its diagnosis is unknown (as in our example above). Suppose we want to use the perimeter and concavity explanatory variables/predictors to predict the diagnosis class of this observation. Let’s pick out our 2 desired predictor variables and class label and store it as a new dataset named cancer_train:

-
cancer_train <- cancer %>%
-  select(Class, Perimeter, Concavity)
-head(cancer_train)
+
cancer_train <- cancer %>%
+  select(Class, Perimeter, Concavity)
+head(cancer_train)
## # A tibble: 6 x 3
 ##   Class Perimeter Concavity
 ##   <fct>     <dbl>     <dbl>
@@ -772,10 +771,10 @@ 

6.6 K-nearest neighbours with

kknn engine) for training the model with the set_engine function. Finally we specify that this is a classification problem with the set_mode function.

-
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 5) %>%
-       set_engine("kknn") %>%
-       set_mode("classification")
-knn_spec
+
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 5) %>%
+       set_engine("kknn") %>%
+       set_mode("classification")
+knn_spec
## K-Nearest Neighbor Model Specification (classification)
 ## 
 ## Main Arguments:
@@ -788,12 +787,12 @@ 

6.6 K-nearest neighbours with

Class ~ . argument specifies that Class is the target variable (the one we want to predict), and . (everything except Class) is to be used as the predictor.

-
knn_fit <- knn_spec %>%
-        fit(Class ~ ., data = cancer_train)
-knn_fit
+
knn_fit <- knn_spec %>%
+        fit(Class ~ ., data = cancer_train)
+knn_fit
## parsnip model object
 ## 
-## Fit time:  14ms 
+## Fit time:  37ms 
 ## 
 ## Call:
 ## kknn::train.kknn(formula = formula, data = data, ks = ~5, kernel = ~"rectangular")
@@ -815,8 +814,8 @@ 

6.6 K-nearest neighbours with

knn_fit object classifies the new observation as malignant (“M”). Note that the predict function outputs a data frame with a single variable named .pred_class.

-
new_obs <- tibble(Perimeter = 0, Concavity = 3.5)
-predict(knn_fit, new_obs)
+
new_obs <- tibble(Perimeter = 0, Concavity = 3.5)
+predict(knn_fit, new_obs)
## # A tibble: 1 x 1
 ##   .pred_class
 ##   <fct>      
@@ -856,10 +855,10 @@ 

6.7.1 Centering and scaling

cancer data set; we have been using a standardized version of the data set up until now. To keep things simple, we will just use the Area, Smoothness, and Class variables:

-
unscaled_cancer <- read_csv("data/unscaled_wdbc.csv") %>% 
-             mutate(Class = as_factor(Class)) %>%
-            select(Class, Area, Smoothness)
-head(unscaled_cancer)
+
unscaled_cancer <- read_csv("data/unscaled_wdbc.csv") %>% 
+             mutate(Class = as_factor(Class)) %>%
+            select(Class, Area, Smoothness)
+head(unscaled_cancer)
## # A tibble: 6 x 3
 ##   Class  Area Smoothness
 ##   <fct> <dbl>      <dbl>
@@ -878,8 +877,8 @@ 

6.7.1 Centering and scaling

In the tidymodels framework, all data preprocessing happens using a recipe. Here we will initialize a recipe for the unscaled_cancer data above, specifying that the Class variable is the target, and all other variables are predictors:

-
uc_recipe <- recipe(Class ~ ., data = unscaled_cancer)
-print(uc_recipe)
+
uc_recipe <- recipe(Class ~ ., data = unscaled_cancer)
+print(uc_recipe)
## Data Recipe
 ## 
 ## Inputs:
@@ -893,11 +892,11 @@ 

6.7.1 Centering and scaling

The prep function finalizes the recipe by using the data (here, unscaled_cancer) to compute anything necessary to run the recipe (in this case, the column means and standard deviations):

-
uc_recipe <- uc_recipe %>%
-       step_scale(all_predictors()) %>%
-       step_center(all_predictors()) %>%
-       prep()
-uc_recipe
+
uc_recipe <- uc_recipe %>%
+       step_scale(all_predictors()) %>%
+       step_center(all_predictors()) %>%
+       prep()
+uc_recipe
## Data Recipe
 ## 
 ## Inputs:
@@ -927,8 +926,8 @@ 

6.7.1 Centering and scaling

You can find a full set of all the steps and variable selection functions on the recipes home page. We finally use the bake function to apply the recipe.

-
scaled_cancer <- bake(uc_recipe, unscaled_cancer)
-head(scaled_cancer)
+
scaled_cancer <- bake(uc_recipe, unscaled_cancer)
+head(scaled_cancer)
## # A tibble: 6 x 3
 ##     Area Smoothness Class
 ##    <dbl>      <dbl> <fct>
@@ -973,17 +972,17 @@ 

6.7.2 Balancing

what the data would look like if the cancer was rare. We will do this by picking only 3 observations randomly from the malignant group, and keeping all of the benign observations.

-
set.seed(3)
-rare_cancer <- bind_rows(filter(cancer, Class == "B"),
-                         cancer %>% filter(Class == "M") %>% sample_n(3)) %>%
-                  select(Class, Perimeter, Concavity)
-        
-rare_plot <- rare_cancer %>%  
-  ggplot(aes(x = Perimeter, y = Concavity, color = Class)) + 
-    geom_point(alpha = 0.5) +
-    labs(color = "Diagnosis") + 
-    scale_color_manual(labels = c("Malignant", "Benign"), values = cbPalette)
-rare_plot
+
set.seed(3)
+rare_cancer <- bind_rows(filter(cancer, Class == "B"),
+                         cancer %>% filter(Class == "M") %>% sample_n(3)) %>%
+                  select(Class, Perimeter, Concavity)
+        
+rare_plot <- rare_cancer %>%  
+  ggplot(aes(x = Perimeter, y = Concavity, color = Class)) + 
+    geom_point(alpha = 0.5) +
+    labs(color = "Diagnosis") + 
+    scale_color_manual(labels = c("Malignant", "Benign"), values = cbPalette)
+rare_plot

Note: You will see in the code above that we use the set.seed function. This is because we are using sample_n to artificially pick @@ -1017,10 +1016,10 @@
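To see what the seed buys us: with the same seed, sample_n draws exactly the same rows every time the code is run, which keeps the analysis reproducible. A small illustration (a sketch, not part of the original text), assuming cancer is loaded as above:

# the same seed before each call reproduces the same "random" draw
set.seed(3)
first_draw <- cancer %>% filter(Class == "M") %>% sample_n(3)

set.seed(3)
second_draw <- cancer %>% filter(Class == "M") %>% sample_n(3)

identical(first_draw, second_draw)   # TRUE: both draws contain the same 3 rows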

6.7.2 Balancing

step to the earlier uc_recipe recipe with the step_upsample function. We show below how to do this, and also use the group_by + summarize pattern we’ve seen before to see that our classes are now balanced:

-
ups_recipe <- recipe(Class ~ ., data = rare_cancer) %>%
-       step_upsample(Class, over_ratio = 1, skip = FALSE) %>%
-       prep()
-ups_recipe
+
ups_recipe <- recipe(Class ~ ., data = rare_cancer) %>%
+       step_upsample(Class, over_ratio = 1, skip = FALSE) %>%
+       prep()
+ups_recipe
## Data Recipe
 ## 
 ## Inputs:
@@ -1034,11 +1033,11 @@ 

6.7.2 Balancing

## Operations:
## 
## Up-sampling based on Class [trained]
-
upsampled_cancer <- bake(ups_recipe, rare_cancer)
-
-upsampled_cancer %>% 
-    group_by(Class) %>%
-    summarize(n = n())
+
upsampled_cancer <- bake(ups_recipe, rare_cancer)
+
+upsampled_cancer %>% 
+    group_by(Class) %>%
+    summarize(n = n())
## # A tibble: 2 x 2
 ##   Class     n
 ##   <fct> <int>
@@ -1059,19 +1058,19 @@ 

6.8 Putting it together in a workflow

…unscaled_wdbc.csv data. First we will load the data, create a model, and specify a recipe for how the data should be preprocessed:

-
# load the unscaled cancer data and make sure the target Class variable is a factor
-unscaled_cancer <- read_csv("data/unscaled_wdbc.csv") %>%
-                      mutate(Class = as_factor(Class))
-
-# create the KNN model
-knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 7) %>%
-       set_engine("kknn") %>%
-       set_mode("classification")
-
-# create the centering / scaling recipe
-uc_recipe <- recipe(Class ~ Area + Smoothness, data = unscaled_cancer) %>%
-       step_scale(all_predictors()) %>%
-       step_center(all_predictors()) 
+
# load the unscaled cancer data and make sure the target Class variable is a factor
+unscaled_cancer <- read_csv("data/unscaled_wdbc.csv") %>%
+                      mutate(Class = as_factor(Class))
+
+# create the KNN model
+knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 7) %>%
+       set_engine("kknn") %>%
+       set_mode("classification")
+
+# create the centering / scaling recipe
+uc_recipe <- recipe(Class ~ Area + Smoothness, data = unscaled_cancer) %>%
+       step_scale(all_predictors()) %>%
+       step_center(all_predictors()) 

Note that each of these steps is exactly the same as earlier, except for one major difference: we did not use the select function to extract the relevant variables from the data frame, and instead simply specified the relevant variables to use via the @@ -1082,22 +1081,22 @@
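In other words, the recipe formula now does the column selection for us. A minimal sketch of the two equivalent routes (an illustration, assuming tidymodels is loaded and unscaled_cancer exists as above):

# route 1 (used earlier): subset the columns, then let the recipe take everything
cancer_small <- unscaled_cancer %>% select(Class, Area, Smoothness)
recipe(Class ~ ., data = cancer_small)

# route 2 (used here): keep the full data frame and name the predictors in the formula
recipe(Class ~ Area + Smoothness, data = unscaled_cancer)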

6.8 Putting it together in a workflow

…fit function to run the whole workflow on the unscaled_cancer data. Note another difference from earlier here: we do not include a formula in the fit function. This is again because we included the formula in the recipe, so there is no need to respecify it:

-
knn_fit <- workflow() %>%
-           add_recipe(uc_recipe) %>%
-           add_model(knn_spec) %>% 
-    fit(data = unscaled_cancer)
-knn_fit
-
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
+
knn_fit <- workflow() %>%
+           add_recipe(uc_recipe) %>%
+           add_model(knn_spec) %>% 
+    fit(data = unscaled_cancer)
+knn_fit
+
## ══ Workflow [trained] ════════════════════════════════════════════════════════════════════════════════════════════
 ## Preprocessor: Recipe
 ## Model: nearest_neighbor()
 ## 
-## ── Preprocessor ────────────────────────────────────────────────────────────────
+## ── Preprocessor ──────────────────────────────────────────────────────────────────────────────────────────────────
 ## 2 Recipe Steps
 ## 
 ## ● step_scale()
 ## ● step_center()
 ## 
-## ── Model ───────────────────────────────────────────────────────────────────────
+## ── Model ─────────────────────────────────────────────────────────────────────────────────────────────────────────
 ## 
 ## Call:
 ## kknn::train.kknn(formula = formula, data = data, ks = ~7, kernel = ~"rectangular")
@@ -1116,26 +1115,26 @@ 

6.8 Putting it together in a workflow

…alpha value) and large point radius. We include the code here as a learning challenge; see if you can figure out what each line is doing!

-
# create the grid of area/smoothness vals, and arrange in a data frame
-are_grid <- seq(min(unscaled_cancer$Area), max(unscaled_cancer$Area), length.out = 100)
-smo_grid <- seq(min(unscaled_cancer$Smoothness), max(unscaled_cancer$Smoothness), length.out = 100)
-asgrid <- as_tibble(expand.grid(Area=are_grid, Smoothness=smo_grid))
-
-# use the fit workflow to make predictions at the grid points
-knnPredGrid <- predict(knn_fit, asgrid)
-
-# bind the predictions as a new column with the grid points
-prediction_table <- bind_cols(knnPredGrid, asgrid) %>% rename(Class = .pred_class)
-
-# plot:
-# 1. the coloured scatter of the original data
-# 2. the faded coloured scatter for the grid points 
-wkflw_plot <-
-  ggplot() +
-    geom_point(data = unscaled_cancer, mapping = aes(x = Area, y = Smoothness, color = Class), alpha=0.75) +
-    geom_point(data = prediction_table, mapping = aes(x = Area, y = Smoothness, color = Class), alpha=0.02, size=5.)+
-    labs(color = "Diagnosis") +
-    scale_color_manual(labels = c("Malignant", "Benign"), values = cbPalette)
+
# create the grid of area/smoothness vals, and arrange in a data frame
+are_grid <- seq(min(unscaled_cancer$Area), max(unscaled_cancer$Area), length.out = 100)
+smo_grid <- seq(min(unscaled_cancer$Smoothness), max(unscaled_cancer$Smoothness), length.out = 100)
+asgrid <- as_tibble(expand.grid(Area=are_grid, Smoothness=smo_grid))
+
+# use the fit workflow to make predictions at the grid points
+knnPredGrid <- predict(knn_fit, asgrid)
+
+# bind the predictions as a new column with the grid points
+prediction_table <- bind_cols(knnPredGrid, asgrid) %>% rename(Class = .pred_class)
+
+# plot:
+# 1. the coloured scatter of the original data
+# 2. the faded coloured scatter for the grid points 
+wkflw_plot <-
+  ggplot() +
+    geom_point(data = unscaled_cancer, mapping = aes(x = Area, y = Smoothness, color = Class), alpha=0.75) +
+    geom_point(data = prediction_table, mapping = aes(x = Area, y = Smoothness, color = Class), alpha=0.02, size=5.)+
+    labs(color = "Diagnosis") +
+    scale_color_manual(labels = c("Malignant", "Benign"), values = cbPalette)
diff --git a/docs/clustering.html b/docs/clustering.html index 4fe49b8e7..ed15ee480 100644 --- a/docs/clustering.html +++ b/docs/clustering.html @@ -26,7 +26,7 @@ - + @@ -410,7 +410,7 @@

10.3 Clustering

customer loyalty and satisfaction, and we want to learn whether there are distinct “types” of customer. Understanding this might help us come up with better products or promotions to improve our business in a data-driven way.

-
head(marketing_data)
+
head(marketing_data)
## # A tibble: 6 x 2
 ##   loyalty  csat
 ##     <dbl> <dbl>
@@ -589,9 +589,9 @@ 

10.5 K-means in R

and K, the number of clusters (here we choose K = 3). Note that since the K-means algorithm uses a random initialization of assignments, we need to set the random seed to make the clustering reproducible.

-
set.seed(1234)
-marketing_clust <- kmeans(marketing_data, centers = 3)
-marketing_clust
+
set.seed(1234)
+marketing_clust <- kmeans(marketing_data, centers = 3)
+marketing_clust
## K-means clustering with 3 clusters of sizes 2, 10, 7
 ## 
 ## Cluster means:
@@ -609,8 +609,8 @@ 

10.5 K-means in R

## 
## Available components:
## 
-## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
-## [6] "betweenss"    "size"         "iter"         "ifault"      
+## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss" "betweenss"    "size"        
+## [8] "iter"         "ifault"      
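Since the kmeans output is just a named list, any of the components listed above can also be pulled out directly with $ (a quick sketch, assuming marketing_clust from the chunk above):

# direct access to individual components of the kmeans object
marketing_clust$centers        # cluster means, one row per cluster
marketing_clust$tot.withinss   # total within-cluster sum of squares
marketing_clust$cluster[1:6]   # cluster assignments of the first six points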

As you can see above, the clustering object returned has a lot of information about our analysis that we need to explore. Let’s take a look at it now. To do this, we will call in help from the broom package so that we get the model @@ -619,8 +619,8 @@

10.5 K-means in R

we use the augment function. Augment takes in the model and the original data frame, and returns a data frame with the data and the cluster assignments for each point:

-
clustered_data <- augment(marketing_clust, marketing_data)
-head(clustered_data)
+
clustered_data <- augment(marketing_clust, marketing_data)
+head(clustered_data)
## # A tibble: 6 x 4
 ##   loyalty  csat label .cluster
 ##     <dbl> <dbl> <chr> <fct>   
@@ -631,17 +631,17 @@ 

10.5 K-means in R

## 5       8     3 2     2       
## 6       3     2 1     2       

Now that we have this data frame, we can easily plot the data (i.e., cluster assignments of each point):

-
cluster_plot <- ggplot(clustered_data, aes(x = csat, y = loyalty, colour = .cluster), size=2) +
-  geom_point() +
-  labs(x = 'Customer satisfaction', y = 'Loyalty', colour = 'Cluster')
-cluster_plot
+
cluster_plot <- ggplot(clustered_data, aes(x = csat, y = loyalty, colour = .cluster), size=2) +
+  geom_point() +
+  labs(x = 'Customer satisfaction', y = 'Loyalty', colour = 'Cluster')
+cluster_plot

As mentioned above, we need to choose a K to perform K-means clustering by finding where the “elbow” occurs in the plot of total WSSD versus number of clusters. We can get at the total WSSD (tot.withinss) from our clustering using broom’s glance function (it gives model-level statistics). For example:

-
glance(marketing_clust)
+
glance(marketing_clust)
## # A tibble: 1 x 4
 ##   totss tot.withinss betweenss  iter
 ##   <dbl>        <dbl>     <dbl> <int>
@@ -653,10 +653,10 @@ 

10.5 K-means in R

This results in a complex data frame with 3 columns, one for K, one for the models, and one for the model statistics (output of glance, which is a data frame):

-
marketing_clust_ks <- tibble(k = 1:9) %>%
-  mutate(marketing_clusts = map(k, ~kmeans(marketing_data, .x)),
-         glanced = map(marketing_clusts, glance)) 
-head(marketing_clust_ks)
+
marketing_clust_ks <- tibble(k = 1:9) %>%
+  mutate(marketing_clusts = map(k, ~kmeans(marketing_data, .x)),
+         glanced = map(marketing_clusts, glance)) 
+head(marketing_clust_ks)
## # A tibble: 6 x 3
 ##       k marketing_clusts glanced         
 ##   <int> <list>           <list>          
@@ -669,10 +669,10 @@ 

10.5 K-means in R

We now extract the total WSSD from the glanced column. Given that each item in this column is a data frame, we will need to use the unnest function to unpack the data frames.

-
clustering_statistics <- marketing_clust_ks %>%
-  unnest(glanced)
-
-head(clustering_statistics)
+
clustering_statistics <- marketing_clust_ks %>%
+  unnest(glanced)
+
+head(clustering_statistics)
## # A tibble: 6 x 6
 ##       k marketing_clusts totss tot.withinss betweenss  iter
 ##   <int> <list>           <dbl>        <dbl>     <dbl> <int>
@@ -684,13 +684,13 @@ 

10.5 K-means in R

## 6     6 <kmeans>          245.         15.8 2.29e+2     2

Now that we have tot.withinss and k as columns in a data frame, we can make a line plot and search for the “elbow” to find which value of K to use.

-
elbow_plot <- ggplot(clustering_statistics, aes(x = k, y = tot.withinss)) +
-  geom_point() +
-  geom_line() +
-  xlab("K") +
-  ylab("Total within-cluster sum of squares")+
-  scale_x_continuous(breaks = 1:9)
-elbow_plot
+
elbow_plot <- ggplot(clustering_statistics, aes(x = k, y = tot.withinss)) +
+  geom_point() +
+  geom_line() +
+  xlab("K") +
+  ylab("Total within-cluster sum of squares")+
+  scale_x_continuous(breaks = 1:9)
+elbow_plot

It looks like 3 clusters is the right choice for this data. But why is there a “bump” in the total WSSD plot here? Shouldn’t total WSSD always @@ -698,20 +698,20 @@

10.5 K-means in R

get “stuck” in a bad solution. Unfortunately, for K = 6 we had an unlucky initialization and found a bad clustering! We can help prevent finding a bad clustering by trying a few different random initializations via the nstart argument (here we use 10 restarts).

-
marketing_clust_ks <- tibble(k = 1:9) %>%
-  mutate(marketing_clusts = map(k, ~kmeans(marketing_data, nstart = 10, .x)),
-         glanced = map(marketing_clusts, glance)) 
-
-clustering_statistics <- marketing_clust_ks %>%
-  unnest(glanced)
-
-elbow_plot <- ggplot(clustering_statistics, aes(x = k, y = tot.withinss)) +
-  geom_point() +
-  geom_line() +
-  xlab("K") +
-  ylab("Total within-cluster sum of squares")+
-  scale_x_continuous(breaks = 1:9)
-elbow_plot
+
marketing_clust_ks <- tibble(k = 1:9) %>%
+  mutate(marketing_clusts = map(k, ~kmeans(marketing_data, nstart = 10, .x)),
+         glanced = map(marketing_clusts, glance)) 
+
+clustering_statistics <- marketing_clust_ks %>%
+  unnest(glanced)
+
+elbow_plot <- ggplot(clustering_statistics, aes(x = k, y = tot.withinss)) +
+  geom_point() +
+  geom_line() +
+  xlab("K") +
+  ylab("Total within-cluster sum of squares")+
+  scale_x_continuous(breaks = 1:9)
+elbow_plot

diff --git a/docs/img/intro-bootstrap.svg b/docs/img/intro-bootstrap.svg
new file mode 100644
index 000000000..aa31d51c6
--- /dev/null
+++ b/docs/img/intro-bootstrap.svg
@@ -0,0 +1,457 @@
(SVG markup not shown)
diff --git a/docs/img/population_vs_sample.svg b/docs/img/population_vs_sample.svg
index fc7c44382..cba6fefdb 100644
--- a/docs/img/population_vs_sample.svg
+++ b/docs/img/population_vs_sample.svg
@@ -45,16 +45,13 @@
@@ -463,15 +460,20 @@
@@ -544,9 +546,9 @@
(SVG markup changes not shown)
diff --git a/docs/index.html b/docs/index.html
index 9c14bed71..3226a13db 100644
--- a/docs/index.html
+++ b/docs/index.html
@@ -26,7 +26,11 @@
+<<<<<<< HEAD
+=======
+
+>>>>>>> dev
@@ -342,7 +346,11 @@

Introduction to Data Science

Tiffany-Anne Timbers

Trevor Campbell

Melissa Lee

+<<<<<<< HEAD

2020-12-07

+======= +

2020-11-24

+>>>>>>> dev

Chapter 1 R, Jupyter, and the tidyverse

diff --git a/docs/inference.html b/docs/inference.html index 481aea4ec..f5e5e1a76 100644 --- a/docs/inference.html +++ b/docs/inference.html @@ -26,7 +26,7 @@ - + @@ -348,14 +348,13 @@

Chapter 11 Introduction to Statistical Inference

11.1 Overview

-

In almost all data analysis tasks in practice, we want to draw conclusions about some unknown +

A typical data analysis task in practice is to draw conclusions about some unknown aspect of a population of interest based on observed data sampled from that -population; we typically do not get data on the full population. +population; we typically do not get data on the entire population. Data analysis questions regarding how summaries, -patterns, trends, or relationships in a dataset +patterns, trends, or relationships in a data set extend to the wider population are called inferential questions. This chapter will start -with the fundamental ideas of sampling from populations, and then will work towards -introducing two common techniques in statistical inference: point estimation and +with the fundamental ideas of sampling from populations and then introduce two common techniques in statistical inference: point estimation and interval estimation.

@@ -378,38 +377,31 @@

11.2 Chapter learning objectives<

11.3 Why do we need sampling?

Statistical inference can help us decide how quantities we observe in a subset of data relate to the same quantities in the broader -population. Here is an example question that we might use statistical inference to answer:

+population. Suppose a retailer is considering selling iPhone accessories, and they want to estimate how big the market might be. Additionally, they want to strategize how they can market their products on North American college and university campuses. This retailer might use statistical inference to answer the question:

What proportion of all undergraduate students in North America own an iPhone?

In the above question, we are interested in making a conclusion about all -undergraduate students in North America. This is our population: -in general, the population is the complete collection of individuals or cases we are interested in studying. +undergraduate students in North America; this is our population. +In general, the population is the complete collection of individuals or cases we are interested in studying. Further, in the above question, we are interested in computing a quantity—the proportion -of iPhone owners—based on the entire population. This is our population parameter: -in general, a population parameter is a numerical characteristic -of the entire population. In order to compute this number in the example above, we would need to ask +of iPhone owners—based on the entire population. This is our population parameter. +In general, a population parameter is a numerical characteristic +of the entire population. To compute this number in the example above, we would need to ask every single undergraduate in North America whether or not they own an iPhone. In practice, directly computing population parameters is often time-consuming and costly, and sometimes impossible.

A more practical approach would be to collect measurements for a sample: a subset of -individuals collected from the population. We can then compute a sample statistic—a numerical -characteristic of the sample—that estimates the population parameter. For example, if we -randomly selected 100 undergraduate students across North America (the sample) and computed the fraction of those -students who own an iPhone (the sample statistic), we might suspect that that fraction is a reasonable -estimate of the full population fraction.

-
-Figure 11.1: Population versus sample +individuals collected from the population. We can then compute a sample estimate—a numerical +characteristic of the sample—that estimates the population parameter. For example, suppose we randomly selected 100 undergraduate students across North America (the sample) and computed the proportion of those +students who own an iPhone (the sample estimate). In that case, we might suspect that that proportion is a reasonable estimate of the proportion of students who own an iPhone in the entire population.

+
+Population versus sample +

+Figure 11.1: Population versus sample +
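As a toy illustration of this idea (made-up numbers, not from the original text), we could simulate surveying 100 students when the unknown true proportion happens to be 0.55:

# simulate asking 100 students; each 1 means "owns an iPhone"
set.seed(123)
owns_iphone <- rbinom(100, size = 1, prob = 0.55)   # 0.55 is a hypothetical true proportion
mean(owns_iphone)    # the sample proportion: our point estimate of the true proportion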

+
-

Note that proportions are not the only kind of population parameter we might be interested in. -Let’s consider another example question that we might tackle with statistical inference:

-

What is the average price-per-night of one-bedroom apartment rentals in Vancouver, Canada?

-

Here, the population consists of all one-bedroom apartment rental offerings in Vancouver, and the population -parameter is the average price-per-night. But even within this one example, we could also be interested -in many other population parameters: the median price, the fraction of -one-bedroom apartments that cost more than $200 per night, -the standard deviation of the price, and the list goes on. -If we were somehow able to observe the whole population of one-bedroom apartments in Vancouver, -we could compute each of these numbers exactly; therefore these are all population parameters. -There are many kinds of observations and population parameters that you will run into in practice, -but in this chapter we will focus on two settings:

+

Note that proportions are not the only kind of population parameter we might be interested in. Suppose an undergraduate student studying at the University of British Columbia in Vancouver, British Columbia, is looking for an apartment to rent. They need to create a budget, so they want to know something about studio apartment rental prices in Vancouver, BC. This student might use statistical inference to tackle the question:

+

What is the average price-per-month of studio apartment rentals in Vancouver, Canada?

+

The population consists of all studio apartment rentals in Vancouver, and the population parameter is the average price-per-month. Here we used the average as a measure of center to describe the “typical value” of studio apartment rental prices. But even within this one example, we could also be interested in many other population parameters. For instance, we know that not every studio apartment rental in Vancouver will have the same price-per-month. The student might be interested in how much monthly prices vary and want to find a measure of the rentals’ spread (or variability), such as the standard deviation. We might be interested in the fraction of studio apartment rentals that cost more than $1000 per month. And the list of population parameters we might want to calculate goes on. The question we want to answer will help us determine the parameter we want to estimate. If we were somehow able to observe the whole population of studio apartment rental offerings in Vancouver, we could compute each of these numbers exactly; therefore, these are all population parameters. There are many kinds of observations and population parameters that you will run into in practice, but in this chapter, we will focus on two settings:

  1. Using categorical observations to estimate the proportion of each category
  2. Using quantitative observations to estimate the average (or mean)
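For the studio-apartment example above, each of these quantities is easy to write down in code. A hypothetical sketch, assuming we somehow had the full population in a data frame rentals with a price column (price per month):

# population parameters for the (hypothetical) full population of studio rentals
library(tidyverse)
rentals %>%
  summarize(mean_price = mean(price),              # average price-per-month
            sd_price = sd(price),                  # spread (standard deviation) of prices
            prop_over_1000 = mean(price > 1000))   # fraction costing more than $1000 per month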
@@ -422,94 +414,93 @@

    11.4.1 Sampling distributions for

    Let’s start with an illustrative (and tasty!) example. Timbits are bite-sized doughnuts sold at Tim Hortons, a popular Canadian-based fast-food restaurant chain founded in Hamilton, Ontario, Canada.

    -
    - -

    Timbits. Source: wikimedia.org

    +
    +Timbits. Source: wikimedia.org +

    +Figure 11.2: Timbits. Source: wikimedia.org +

    Suppose we wanted to estimate the true proportion of chocolate doughnuts at Tim Hortons restaurants. Now, of course, we (the authors!) do not have access to the true population. -So in this chapter, we will simulate a synthetic box of 10,000 Timbits with two types—old-fashioned +So for this chapter, we created a fictitious box of 10,000 Timbits with two flavours—old-fashioned and chocolate—as our population, and use this to illustrate -inferential concepts. Below we create a tibble() with a subject ID and Timbit type as our columns.

    -
    library(tidyverse) 
    -library(ggplot2)
    -library(infer)
    -library(gridExtra)
    -set.seed(1234)
    -virtual_box <- tibble(timbit_id = seq(1, 10000, by = 1),
    -                     color = factor(rbinom(10000, 1, 0.63),
    -                     labels = c("old fashioned", "chocolate")))
    -head(virtual_box)
    -
    ## # A tibble: 6 x 2
    -##   timbit_id color        
    -##       <dbl> <fct>        
    -## 1         1 chocolate    
    -## 2         2 chocolate    
    -## 3         3 chocolate    
    -## 4         4 chocolate    
    -## 5         5 old fashioned
    -## 6         6 old fashioned
    +inferential concepts. Below we have a tibble() called virtual_box with a Timbit ID and flavour as our columns. We have also loaded our necessary packages: tidyverse and the infer package, which we will need to perform sampling later in the chapter.

    +
    library(tidyverse) 
    +library(infer)
    +virtual_box
    +
    ## # A tibble: 10,000 x 2
    +##    timbit_id flavour      
    +##        <dbl> <fct>        
    +##  1         1 chocolate    
    +##  2         2 chocolate    
    +##  3         3 chocolate    
    +##  4         4 chocolate    
    +##  5         5 old fashioned
    +##  6         6 old fashioned
    +##  7         7 chocolate    
    +##  8         8 chocolate    
    +##  9         9 old fashioned
    +## 10        10 chocolate    
    +## # … with 9,990 more rows
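If you want to re-create a virtual_box-style population yourself, the chunk removed above shows one way: draw 10,000 flavours with rbinom, where each Timbit is chocolate with probability 0.63. A minimal sketch along those lines (with the column renamed to flavour to match the output shown here; the seed is arbitrary):

# one way to construct a fictitious population of 10,000 Timbits, ~63% chocolate
library(tidyverse)
set.seed(1234)
virtual_box <- tibble(timbit_id = seq(1, 10000, by = 1),
                      flavour = factor(rbinom(10000, 1, 0.63),
                                       labels = c("old fashioned", "chocolate")))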

    From our simulated box, we can see that the proportion of chocolate Timbits is -0.63. This value, 0.63, is the population parameter. Note that this parameter value is -usually unknown in real data analysis problems.

    -
    virtual_box %>% 
    -    group_by(color) %>% 
    -    summarize(n = n(),
    -             proportion = n() / 10000)
    +0.63. This value, 0.63, is the population parameter. Note that this parameter value is usually unknown in real data analysis problems.

    +
    virtual_box %>% 
    +    group_by(flavour) %>% 
    +    summarize(n = n(),
    +             proportion = n() / 10000)
    ## # A tibble: 2 x 3
    -##   color             n proportion
    +##   flavour           n proportion
     ##   <fct>         <int>      <dbl>
     ## 1 old fashioned  3705      0.370
     ## 2 chocolate      6295      0.630
    -

    Suppose we buy a box of 40 randomly-selected Timbits and count the number of chocolate Timbits, -i.e., take a random sample of size 40 from our Timbits population. The function -rep_sample_n from the infer package will allow us to sample. The arguments -of rep_sample_n are (1) the data frame to sample from, and (2) the size of the sample -to take.

    -
    samples_1 <- rep_sample_n(tbl = virtual_box, size = 40)
    -choc_sample_1 <- summarize(samples_1, n = sum(color == "chocolate"),
    -                                        prop = sum(color == "chocolate") / 40)
    -choc_sample_1
    +

    What would happen if we were to buy a box of 40 randomly-selected Timbits and count the number of chocolate Timbits (i.e., take a random sample of size 40 from our Timbits population)? Let’s use R to simulate this using our virtual_box population. We can do this using the rep_sample_n function from the infer package. The arguments +of rep_sample_n are (1) the data frame (or tibble) to sample from, and (2) the size of the sample to take.

    +
    set.seed(1)
    +samples_1 <- rep_sample_n(tbl = virtual_box, size = 40)
    +choc_sample_1 <- summarize(samples_1, 
    +                           n = sum(flavour == "chocolate"),
    +                           prop = sum(flavour == "chocolate") / 40)
    +choc_sample_1
    ## # A tibble: 1 x 3
     ##   replicate     n  prop
     ##       <int> <int> <dbl>
    -## 1         1    20   0.5
    +## 1 1 29 0.725

Here we see that the proportion of chocolate Timbits in this random sample is -0.5. This value is our sample statistic; since it is a single -value that is used to estimate a population parameter, we refer to it as a point estimate.

+0.72. This value is our estimate — our best guess of our population parameter using this sample. Given that it is a single +value that we are estimating, we often refer to it as a point estimate.

Now imagine we took another random sample of 40 Timbits from the population. Do you -think you would get the same proportion? Let’s try sampling from the population +think we would get the same proportion? Let’s try sampling from the population again and see what happens.

-
set.seed(2)
-samples_2 <- rep_sample_n(virtual_box, size = 40)
-choc_sample_2 <- summarize(samples_2, n = sum(color == "chocolate"),
-                                        prop = sum(color == "chocolate") / 40)
-choc_sample_2
+
set.seed(2)
+samples_2 <- rep_sample_n(virtual_box, size = 40)
+choc_sample_2 <- summarize(samples_2, n = sum(flavour == "chocolate"),
+                                        prop = sum(flavour == "chocolate") / 40)
+choc_sample_2
## # A tibble: 1 x 3
 ##   replicate     n  prop
 ##       <int> <int> <dbl>
 ## 1         1    27 0.675
-

Notice that we get a different value for our statistic this time. The +

Notice that we get a different value for our estimate this time. The proportion of chocolate Timbits in this sample is 0.68. If we were to do this again, another random sample could also give a -different result. Statistics vary from sample to sample +different result. Estimates vary from sample to sample due to sampling variability.

-

But just how much should we expect the statistics of our random -samples to vary? In order to understand this, we will simulate more samples +

But just how much should we expect the estimates of our random +samples to vary? In order to understand this, we will simulate taking more samples of size 40 from our population of Timbits, and calculate the proportion of chocolate Timbits in each sample. We can then -construct the distribution of sample proportions we calculate. The distribution -of the statistic for all possible samples of size \(n\) from a population is +visualize the distribution of sample proportions we calculate. The distribution +of the estimate for all possible samples of a given size (which we commonly refer to as \(n\)) from a population is called a sampling distribution. The sampling distribution will help us see how much we would expect our sample proportions from this population to vary for samples of size 40. Below we again use the rep_sample_n to take samples of size 40 from our population of Timbits, but we set the reps argument -to specify the number of samples to take.

-
samples <- rep_sample_n(virtual_box, size = 40, reps = 1000)
-head(samples)
+to specify the number of samples to take, here 15,000. We will use the function head() to see the first few rows and tail() to see the last few rows of our samples data frame.

+
samples <- rep_sample_n(virtual_box, size = 40, reps = 15000)
+head(samples)
## # A tibble: 6 x 3
 ## # Groups:   replicate [1]
-##   replicate timbit_id color        
+##   replicate timbit_id flavour      
 ##       <int>     <dbl> <fct>        
 ## 1         1      9054 chocolate    
 ## 2         1      4322 old fashioned
@@ -517,23 +508,24 @@ 

11.4.1 Sampling distributions for ## 4 1 3958 chocolate ## 5 1 2765 old fashioned ## 6 1 358 old fashioned

-
tail(samples)
+
tail(samples)
## # A tibble: 6 x 3
 ## # Groups:   replicate [1]
-##   replicate timbit_id color        
+##   replicate timbit_id flavour      
 ##       <int>     <dbl> <fct>        
-## 1      1000      4677 chocolate    
-## 2      1000      3619 chocolate    
-## 3      1000       142 old fashioned
-## 4      1000      8991 chocolate    
-## 5      1000      8945 chocolate    
-## 6      1000      7564 old fashioned
-

Notice the column replicate is indicating the replicate with which each -Timbit belongs. Since we took 1000 samples of size 40, there are 1000 replicates.

-
sample_estimates <- samples %>% 
-    group_by(replicate) %>% 
-    summarise(sample_proportion = sum(color == "chocolate") / 40)
-head(sample_estimates)
+## 1 15000 4633 old fashioned +## 2 15000 552 chocolate +## 3 15000 7998 old fashioned +## 4 15000 8649 chocolate +## 5 15000 2974 chocolate +## 6 15000 7811 old fashioned
+

Notice the column replicate is indicating the replicate, or sample, with which each +Timbit belongs. Since we took 15,000 samples of size 40, there are 15,000 replicates. +Now that we have taken 15,000 samples, to create a sampling distribution of sample proportions for samples of size 40, we need to calculate the proportion of chocolate Timbits for each sample, \(\hat{p}_\text{chocolate}\):

+
sample_estimates <- samples %>% 
+    group_by(replicate) %>% 
+    summarise(sample_proportion = sum(flavour == "chocolate") / 40)
+head(sample_estimates)
## # A tibble: 6 x 2
 ##   replicate sample_proportion
 ##       <int>             <dbl>
@@ -543,182 +535,162 @@ 

11.4.1 Sampling distributions for ## 4 4 0.675 ## 5 5 0.45 ## 6 6 0.425

-
tail(sample_estimates)
+
tail(sample_estimates)
## # A tibble: 6 x 2
 ##   replicate sample_proportion
 ##       <int>             <dbl>
-## 1       995             0.675
-## 2       996             0.75 
-## 3       997             0.7  
-## 4       998             0.475
-## 5       999             0.6  
-## 6      1000             0.375
-
sampling_distribution <-  ggplot(sample_estimates, aes(x = sample_proportion)) +
-    geom_histogram(fill="#0072B2", color="#e9ecef", binwidth = 0.05) +
-    xlab("Sample proportions") 
-sampling_distribution
-
-Sampling distribution of the sample proportion for sample size 40
+## 1     14995             0.575
+## 2     14996             0.6  
+## 3     14997             0.45 
+## 4     14998             0.675
+## 5     14999             0.7  
+## 6     15000             0.525
+

Now that we have calculated the proportion of chocolate Timbits for each sample, \(\hat{p}_\text{chocolate}\), we can visualize the sampling distribution of sample proportions for samples of size 40:

+
sampling_distribution <-  ggplot(sample_estimates, aes(x = sample_proportion)) +
+    geom_histogram(fill="dodgerblue3", color="lightgrey", bins = 12) +
+    xlab("Sample proportions") 
+sampling_distribution
+
+Sampling distribution of the sample proportion for sample size 40

-Figure 11.1: Sampling distribution of the sample proportion for sample size 40 +Figure 11.3: Sampling distribution of the sample proportion for sample size 40

The sampling distribution appears to be bell-shaped with one peak. It is centered around 0.6 and the sample proportions range from about 0.3 to -about 0.8. In fact, we can calculate -the mean and standard deviation of the sample proportions.

-
sample_estimates %>% 
-  summarise(mean = mean(sample_proportion), sd = sd(sample_proportion))
-
## # A tibble: 1 x 2
-##    mean     sd
-##   <dbl>  <dbl>
-## 1 0.621 0.0783
-

We notice that the sample proportions are centred around the population -proportion value. The standard deviation of the sample proportions -is 0.078.

-
-

Note: If random samples of size \(n\) are taken from a population, \(\hat{p}\) will be approximately Normal with mean \(p\) and standard deviation \(\sqrt{\frac{p(1-p)}{n}}\) as long as the sample size \(n\) is large enough such that \(np\) and \(n(1 - p)\) are at least 10, where \(p\) is the population proportion, \(\hat{p}\) is the sample proportion and \(n\) is the sample size.

-
+about 0.9. In fact, we can calculate +the mean of the sample proportions.

+
sample_estimates %>% 
+  summarise(mean = mean(sample_proportion))
+
## # A tibble: 1 x 1
+##    mean
+##   <dbl>
+## 1 0.629
+

We notice that the sample proportions are centred around the population proportion value, 0.63! In general, the mean of the distribution of \(\hat{p}\) should be equal to \(p\), which is good because that means the sample proportion is neither an overestimate nor an underestimate of the population proportion.

+

So what can we learn from this sampling distribution? This distribution tells us what we might expect from proportions from samples of size \(40\) when our population proportion is 0.63. In practice, we usually don’t know the proportion of our population, but if we can use what we know about the sampling distribution, we can use it to make inferences about our population when we only have a single sample.

+

11.4.2 Sampling distributions for means

In the previous section, our variable of interest—Timbit flavour—was categorical, and the population parameter of interest was the proportion of chocolate -Timbits. What if we wanted to infer something about a population of quantitative variables instead? -As mentioned in the introduction to this chapter, there are many choices of population parameter -for each type of observed variable. In this section, we will study the case where we are interested -in the population mean of a quantitative variable.

-

In particular, we will look at an example using data from Airbnb, an online -marketplace for arranging or offering places to stay. The dataset contains -Airbnb listings for Vancouver, Canada, in September 2020 -from Inside Airbnb. -Let’s imagine (for learning purposes) that our dataset represents the population of all Airbnb rental listings in Vancouver, -and we are interested in the population mean price per night. +Timbits. As mentioned in the introduction to this chapter, there are many choices of population parameter for each type of observed variable. What if we wanted to infer something about a population of quantitative variables instead? For instance, a traveller visiting Vancouver, BC may wish to know about the prices of staying somewhere using Airbnb, an online marketplace for arranging places to stay. Particularly, they might be interested in estimating the population mean price per night of Airbnb listings in Vancouver, BC. This section will study the case where we are interested in the population mean of a quantitative variable.

+

We will look at an example using data from Inside Airbnb. The data set contains Airbnb listings for Vancouver, Canada, in September 2020. Let’s imagine (for learning purposes) that our data set represents the population of all Airbnb rental listings in Vancouver, and we are interested in the population mean price per night. Our data contains an ID number, neighbourhood, -type of room, the number of people that the rental accommodates, number of -bathrooms, bedrooms, beds, and the price per night.

+type of room, the number of people the rental accommodates, number of bathrooms, bedrooms, beds, and the price per night.

## # A tibble: 6 x 8
-##      id neighbourhood     room_type  accommodates bathrooms bedrooms  beds price
-##   <int> <chr>             <chr>             <dbl> <chr>        <dbl> <dbl> <dbl>
-## 1     1 Downtown          Entire ho…            5 2 baths          2     2   150
-## 2     2 Downtown Eastside Entire ho…            4 2 baths          2     2   132
-## 3     3 West End          Entire ho…            2 1 bath           1     1    85
-## 4     4 Kensington-Cedar… Entire ho…            2 1 bath           1     0   146
-## 5     5 Kensington-Cedar… Entire ho…            4 1 bath           1     2   110
-## 6     6 Hastings-Sunrise  Entire ho…            4 1 bath           2     3   195
+##      id neighbourhood            room_type       accommodates bathrooms bedrooms  beds price
+##   <int> <chr>                    <chr>                  <dbl> <chr>        <dbl> <dbl> <dbl>
+## 1     1 Downtown                 Entire home/apt            5 2 baths          2     2   150
+## 2     2 Downtown Eastside        Entire home/apt            4 2 baths          2     2   132
+## 3     3 West End                 Entire home/apt            2 1 bath           1     1    85
+## 4     4 Kensington-Cedar Cottage Entire home/apt            2 1 bath           1     0   146
+## 5     5 Kensington-Cedar Cottage Entire home/apt            4 1 bath           1     2   110
+## 6     6 Hastings-Sunrise         Entire home/apt            4 1 bath           2     3   195

We can visualize the population distribution of the price per night with a histogram.

-
population_distribution <-  ggplot(airbnb, aes(x = price)) +
-    geom_histogram(fill="#0072B2", color="#e9ecef") +
-    xlab("Price per night ($)") 
-population_distribution
-
-Population distribution of price per night ($) for all Airbnb listings in Vancouver, Canada +
population_distribution <-  ggplot(airbnb, aes(x = price)) +
+    geom_histogram(fill="dodgerblue3", color="lightgrey") +
+    xlab("Price per night ($)") 
+population_distribution
+
+Population distribution of price per night ($) for all Airbnb listings in Vancouver, Canada

-Figure 11.2: Population distribution of price per night ($) for all Airbnb listings in Vancouver, Canada +Figure 11.4: Population distribution of price per night ($) for all Airbnb listings in Vancouver, Canada

-
population_parameters <- airbnb %>% 
-    summarize(pop_mean = mean(price),
-             pop_sd = sd(price))
-population_parameters
-
## # A tibble: 1 x 2
-##   pop_mean pop_sd
-##      <dbl>  <dbl>
-## 1     155.   116.

We see that the distribution has one peak and is skewed—most of the listings are less than $250 per night, but a small proportion of listings cost more -than that, creating a long tail on the histogram’s right side. -The population mean is $154.51 and the -population standard deviation is $115.79.

-

Suppose we take a sample of 20 observations from our population. Below we -create a histogram to visualize the +than that, creating a long tail on the histogram’s right side.

+

We can also calculate the population mean, the average price per night for all the Airbnb listings.

+
population_parameters <- airbnb %>% 
+    summarize(pop_mean = mean(price))
+population_parameters
+
## # A tibble: 1 x 1
+##   pop_mean
+##      <dbl>
+## 1     155.
+

The price per night of all Airbnb rentals in Vancouver, BC is $154.51, on average. This value is our population parameter since we are calculating it using the population data.

+

Suppose that we did not have access to the population data, yet we still wanted to estimate the mean price per night. We could answer this question by taking a random sample of as many Airbnb listings as we had time to, let’s say we could do this for 40 listings. What would such a sample look like?

+

Let’s take advantage of the fact that we do have access to the population data and simulate taking one random sample of 40 listings in R, again using rep_sample_n. After doing this we create a histogram to visualize the distribution of observations in the sample, -and calculate the mean and standard deviation of our sample. These two numbers -are point estimates for the mean and standard deviation of the full population.

-
sample_1 <- airbnb %>% 
-    rep_sample_n(20)
-head(sample_1)
+and calculate the mean of our sample. This number is a point estimate for the mean of the full population.

+
sample_1 <- airbnb %>% 
+    rep_sample_n(40)
+head(sample_1)
## # A tibble: 6 x 9
 ## # Groups:   replicate [1]
-##   replicate    id neighbourhood room_type accommodates bathrooms bedrooms  beds
-##       <int> <int> <chr>         <chr>            <dbl> <chr>        <dbl> <dbl>
-## 1         1   401 Kitsilano     Private …            2 2 baths          2     2
-## 2         1  3187 South Cambie  Private …            4 1 privat…        1     2
-## 3         1  2127 Sunset        Entire h…            8 2 baths          3     4
-## 4         1  3203 Downtown      Entire h…            4 1 bath           1     0
-## 5         1   455 Kitsilano     Entire h…            6 2 baths          3     3
-## 6         1  3116 Marpole       Entire h…            2 1 bath           1     1
-## # … with 1 more variable: price <dbl>
-
sample_distribution <- ggplot(sample_1, aes(price)) + 
-    geom_histogram(fill="#0072B2", color="#e9ecef") +
-    xlab("Price per night ($)") 
-sample_distribution
-
-Distribution of price per night ($) for sample of 20 Airbnb listings
+## replicate    id neighbourhood            room_type       accommodates bathrooms      bedrooms  beds price
+##     <int> <int> <chr>                    <chr>                  <dbl> <chr>             <dbl> <dbl> <dbl>
+## 1       1   436 Kensington-Cedar Cottage Entire home/apt            7 2 baths               3     6   140
+## 2       1  2794 Riley Park               Entire home/apt            3 1 bath                1     1   100
+## 3       1  4423 Mount Pleasant           Entire home/apt            6 1 bath                3     3   207
+## 4       1   853 Kensington-Cedar Cottage Private room               2 1 shared bath         1     1    45
+## 5       1  1545 Kensington-Cedar Cottage Entire home/apt            4 1 bath                2     4    80
+## 6       1  2505 Oakridge                 Private room               2 1 private bath        1     1   154

+
sample_distribution <- ggplot(sample_1, aes(price)) + 
+    geom_histogram(fill="dodgerblue3", color="lightgrey") +
+    xlab("Price per night ($)") 
+sample_distribution
+
+Distribution of price per night ($) for sample of 40 Airbnb listings

-Figure 11.3: Distribution of price per night ($) for sample of 20 Airbnb listings +Figure 11.5: Distribution of price per night ($) for sample of 40 Airbnb listings

-
estimates <- sample_1 %>% 
-    summarize(sample_mean = mean(price),
-             sample_sd = sd(price))
-estimates
-
## # A tibble: 1 x 3
-##   replicate sample_mean sample_sd
-##       <int>       <dbl>     <dbl>
-## 1         1        168.      117.
+
estimates <- sample_1 %>% 
+    summarize(sample_mean = mean(price))
+estimates
+
## # A tibble: 1 x 2
+##   replicate sample_mean
+##       <int>       <dbl>
+## 1         1        128.

Recall that the population mean -was $154.51 and the population standard deviation -was $115.79. We see that our point -estimates for the mean and standard deviation -are $167.93 and $116.96, -respectively. So our estimates were actually quite close to the population parameters: the mean was -about 8.7% off, -while the standard deviation was -about 1% off. +was $154.51. We see that our point +estimate for the mean is $127.8. So our estimate was actually quite close to the population parameter: the mean was +about 17.3% off. Note that in practice, we usually cannot compute the accuracy of the estimate, since we do not have access to the population parameter; if we did, we wouldn’t need to estimate it!
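The percentage quoted here is just the relative difference between the point estimate and the population mean; using the rounded values reported above:

# relative error of the sample mean, as a percentage of the population mean
abs(127.8 - 154.51) / 154.51 * 100   # approximately 17.3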

Also recall from the previous section that the point estimate can vary; if -we took another random sample from the population, then the value of our statistic may change. +we took another random sample from the population, then the value of our estimate may change. So then did we just get lucky with our point estimate above? -How much does our estimate vary across different samples of size 20 in this example? Again, since we have access to the population, -we can take many samples and plot the sampling distribution of the point estimates to get a sense -for this variation. In this case, we’ll use 1500 samples of size 20.

-
samples <- rep_sample_n(airbnb, size = 20, reps = 1500)
-head(samples)
+How much does our estimate vary across different samples of size 40 in this example? Again, since we have access to the population, +we can take many samples and plot the sampling distribution of sample means for samples of size 40 to get a sense +for this variation. In this case, we’ll use 15,000 samples of size 40.

+
samples <- rep_sample_n(airbnb, size = 40, reps = 15000)
+head(samples)
## # A tibble: 6 x 9
 ## # Groups:   replicate [1]
-##   replicate    id neighbourhood room_type accommodates bathrooms bedrooms  beds
-##       <int> <int> <chr>         <chr>            <dbl> <chr>        <dbl> <dbl>
-## 1         1  3750 Hastings-Sun… Entire h…            4 1 bath           2     2
-## 2         1  3254 Hastings-Sun… Entire h…            2 1 bath           1     1
-## 3         1  2318 Mount Pleasa… Entire h…            4 1.5 baths        2     2
-## 4         1  3940 West Point G… Private …            1 1 shared…        1     1
-## 5         1  4331 West End      Entire h…            4 1 bath           1     1
-## 6         1   159 Mount Pleasa… Entire h…            5 2.5 baths        3     3
-## # … with 1 more variable: price <dbl>
-
sample_estimates <- samples %>% 
-    group_by(replicate) %>% 
-    summarise(sample_mean = mean(price))
-head(sample_estimates)
+## replicate id neighbourhood room_type accommodates bathrooms bedrooms beds price +## <int> <int> <chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl> +## 1 1 3101 West End Entire home/apt 5 2 baths 2 2 152 +## 2 1 281 Renfrew-Collingwood Private room 2 1 shared bath 1 1 40 +## 3 1 3343 Oakridge Entire home/apt 4 1 bath 1 2 101 +## 4 1 3394 Mount Pleasant Entire home/apt 6 1 bath 2 2 126 +## 5 1 3339 Downtown Eastside Entire home/apt 4 1 bath 1 2 113 +## 6 1 1908 Riley Park Entire home/apt 3 1 bath 2 3 102
+
sample_estimates <- samples %>% 
+    group_by(replicate) %>% 
+    summarise(sample_mean = mean(price))
+head(sample_estimates)
## # A tibble: 6 x 2
 ##   replicate sample_mean
 ##       <int>       <dbl>
-## 1         1        164.
-## 2         2        201.
-## 3         3        185 
-## 4         4        190.
-## 5         5        156.
-## 6         6        158.
-
sampling_distribution_20 <-  ggplot(sample_estimates, aes(x = sample_mean)) +
-    geom_histogram(fill="#0072B2", color="#e9ecef") + 
-    xlab("Sample mean price per night ($)") 
-sampling_distribution_20
-
-Sampling distribution of the sample means for sample size of 20 +## 1 1 136. +## 2 2 145. +## 3 3 111. +## 4 4 173. +## 5 5 131. +## 6 6 174.
+
sampling_distribution_40 <-  ggplot(sample_estimates, aes(x = sample_mean)) +
+    geom_histogram(fill="dodgerblue3", color="lightgrey") + 
+    xlab("Sample mean price per night ($)") 
+sampling_distribution_40
+
+Sampling distribution of the sample means for sample size of 40

-Figure 11.4: Sampling distribution of the sample means for sample size of 20 +Figure 11.6: Sampling distribution of the sample means for sample size of 40

Here we see that the sampling distribution of the mean has one peak and is @@ -728,7 +700,7 @@

11.4.2 Sampling distributions for a good fraction of cases outside this range (i.e., where the point estimate was not close to the population parameter). So it does indeed look like we were quite lucky when we estimated the population mean -with only 8.7% error. +with only 17.3% error. Let’s visualize the population distribution, distribution of the sample, and the sampling distribution on one plot to compare them. @@ -743,49 +715,50 @@

11.4.2 Sampling distributions for ## # A tibble: 1 x 1 ## mean_of_sample_means ## -## 1 155. +## 1 154. ``` -Notice that the mean of the sample means is \$155.08. Recall that the population mean +Notice that the mean of the sample means is \$154.29. Recall that the population mean was \$154.51. -->

-
-Comparision of population distribution, sample distribution and sampling distribution +
+Comparision of population distribution, sample distribution and sampling distribution

-Figure 11.5: Comparision of population distribution, sample distribution and sampling distribution +Figure 11.7: Comparision of population distribution, sample distribution and sampling distribution

Given that there is quite a bit of variation in the sampling distribution of the sample mean—i.e., the point estimate that we obtain is not very reliable—is there any way to improve the estimate? One way to improve a point estimate is to take a larger sample. To illustrate what effect this has, -we will take 1500 samples of size 20, 50, 100, and 500, and plot the sampling distribution of the sample mean +we will take many samples of size 20, 50, 100, and 500, and plot the sampling distribution of the sample mean below.

-
-Comparision of sampling distributions +
+Comparision of sampling distributions

-Figure 11.6: Comparision of sampling distributions +Figure 11.8: Comparision of sampling distributions

Based on the visualization, two points about the sample mean become clear. First, the mean of the sample mean (across samples) is equal to the population mean. Second, increasing the size of the sample -decreases the standard deviation (i.e., the variability) in the sample mean +decreases the spread (i.e., the variability) in the sample mean point estimate of the population mean. Therefore, a larger sample size results in a more reliable point estimate of the population parameter.
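One way to see the second point numerically (a sketch, not from the original text, assuming airbnb and the infer package are loaded as above) is to compare the spread of the sample means for a small and a large sample size:

# spread (standard deviation) of sample means shrinks as the sample size grows
sd_of_sample_means <- function(n, reps = 1500) {
  rep_sample_n(airbnb, size = n, reps = reps) %>%
    group_by(replicate) %>%
    summarise(sample_mean = mean(price)) %>%
    summarise(sd_of_means = sd(sample_mean)) %>%
    pull(sd_of_means)
}

sd_of_sample_means(20)    # relatively large spread for small samples
sd_of_sample_means(500)   # much smaller spread for large samples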

-
-

Note: If random samples of size \(n\) are taken from a population, the sample mean \(\bar{x}\) will be approximately Normal with mean \(\mu\) and standard deviation \(\frac{\sigma}{\sqrt{n}}\) as long as the sample size \(n\) is large enough. \(\mu\) is the population mean, \(\sigma\) is the population standard deviation, \(\bar{x}\) is the sample mean, and \(n\) is the sample size. -If samples are selected from a finite population as we are doing in this chapter, we should apply a finite population correction. We multiply \(\frac{\sigma}{\sqrt{n}}\) by \(\sqrt{\frac{N - n}{N - 1}}\) where \(N\) is the population size and \(n\) is the sample size. If our sample size, \(n\), is small relative to the population size, this finite correction factor is less important.

-
+

11.4.3 Summary

    -
  1. A statistic is a value computed using a sample from a population; a point estimate is a statistic that is a single value (e.g. a mean or proportion)
  2. -
  3. The sampling distribution of a statistic is the distribution of the statistic for all possible samples of a fixed size from the same population.
  4. +
  5. A point estimate is a single value computed using a sample from a population (e.g. a mean or proportion)
  6. +
  7. The sampling distribution of an estimate is the distribution of the estimate for all possible samples of a fixed size from the same population.
  8. The sample means and proportions calculated from samples are centered around the population mean and proportion, respectively.
  9. The spread of the sampling distribution is related to the sample size. As the sample size increases, the spread of the sampling distribution decreases.
  10. The shape of the sampling distribution is usually bell-shaped with one peak and centred at the population mean or proportion.
+

Why all this emphasis on sampling distributions?

+

Usually, we don’t have access to the population data, so we cannot construct the sampling distribution as we did in this section. As we saw, our sample estimate’s value will likely not equal the population parameter value exactly. We saw from the sampling distribution just how much our estimates can vary. So reporting a single point estimate for the population parameter alone may not be enough. Using simulations, we can see what the sample estimate’s sampling distribution would look like for a sample of a given size. We can use these patterns to approximate the sampling distribution when we only have one sample, which is the realistic case. If we can “predict” what the sampling distribution would look like for a sample, we can construct a range of values in which we think the population parameter’s value might lie. We can use our single sample and its properties that influence sampling distributions, such as the spread and sample size, to approximate the sampling distribution as best as we can. There are several methods to do this; in this book, we will use the bootstrap method, as we will see in the next section.

@@ -799,12 +772,16 @@

11.5.1 Overview

But in real data analysis settings, we usually have just one sample from our population, and do not have access to the population itself. So how do we get a sense for how variable our point estimate is when we only have one sample to work with? -In this section, we will discuss interval estimation and construct confidence intervals -using just a single sample from a population.


In this section, we will discuss interval estimation and construct confidence intervals using just a single sample from a population. A confidence interval is a range of plausible values for our population parameter.


Here is the key idea. First, if you take a big enough sample, it looks like the population. Notice the histograms’ shapes for samples of different sizes taken from the population in the picture below. We see that for a large enough sample, the sample’s distribution looks like that of the population.

Figure 11.9: Comparison of samples of different sizes from the population

In the previous section, we took many samples of the same size from our population to get a sense for the variability of a sample estimate. But if our sample is big enough that it looks like our population, we can pretend that our sample is the population, and take more samples (with replacement) of the same size from it instead! This very clever technique is called the bootstrap. Note that by taking many samples from our single, observed sample, we do not obtain the true sampling distribution, but rather an approximation of it.


  • Repeat steps (1) - (5) many times to create a distribution of point estimates (the bootstrap distribution)
  • Calculate the plausible range of values around our observed point estimate
Figure 11.10: Overview of the bootstrap process

    11.5.2 Bootstrapping in R



Let’s continue working with our Airbnb data. Once again, let’s say we are interested in estimating the population mean price per night of all Airbnb listings in Vancouver, Canada using a single sample we collected of size 40.


To simulate doing this in R, we will use rep_sample_n to take a random sample from our population. In real life we wouldn’t do this step in R; we would instead simply load into R the data that we, or our collaborators, collected.


After we have our sample, we will visualize its distribution and calculate our point estimate, the sample mean.

one_sample <- airbnb %>% 
    rep_sample_n(40) %>% 
    ungroup() %>%   # ungroup the data frame
    select(price)   # drop the replicate column
head(one_sample)
    ## # A tibble: 6 x 1
     ##   price
     ##   <dbl>
## 1   250
## 2   106
## 3   150
## 4   357
## 5    50
## 6   110

one_sample_dist <- ggplot(one_sample, aes(price)) + 
    geom_histogram(fill = "dodgerblue3", color = "lightgrey") +
    xlab("Price per night ($)") 
one_sample_dist

Figure 11.11: Histogram of price per night ($) for one sample of size 40

    one_sample_estimates <- one_sample %>% 
    summarise(sample_mean = mean(price))
one_sample_estimates
    ## # A tibble: 1 x 1
     ##   sample_mean
     ##         <dbl>
## 1        166.

The sample distribution is skewed with a few observations out to the right. The mean of the sample is $165.62. Remember, in practice, we usually only have one sample from the population. So this sample and estimate are the only data we can work with.


    We now perform steps (1) - (5) listed above to generate a single bootstrap sample in R using the sample we just took, and calculate the bootstrap estimate for that sample. We will use the rep_sample_n function as we did when we were creating our sampling distribution. Since we want to sample with replacement, we change the argument for replace from its default value of FALSE to TRUE.

    boot1 <- one_sample %>%
    rep_sample_n(size = 40, replace = TRUE, reps = 1)
head(boot1)
    ## # A tibble: 6 x 2
     ## # Groups:   replicate [1]
     ##   replicate price
     ##       <int> <dbl>
## 1         1   201
## 2         1   199
## 3         1   127.
## 4         1    85
## 5         1   169
## 6         1    60
boot1_dist <- ggplot(boot1, aes(price)) + 
    geom_histogram(fill = "dodgerblue3", color = "lightgrey") +
    xlab("Price per night ($)") 

boot1_dist

Figure 11.12: Bootstrap distribution

    summarise(boot1, mean = mean(price))
    ## # A tibble: 1 x 2
     ##   replicate  mean
     ##       <int> <dbl>
## 1         1   152.

Notice that our bootstrap distribution has a similar shape to the original sample distribution. Though the shapes of the distributions are similar, they are not identical. You’ll also notice that the original sample mean and the bootstrap sample mean differ. This is because we are sampling with replacement from the original sample, so we don’t end up with the same sample values again. We are trying to mimic drawing another sample from the population without actually having to do that.


    Let’s now take 15,000 bootstrap samples from the original sample we drew from the population (one_sample) using rep_sample_n and calculate the means for each of those replicates. Recall that this assumes that one_sample looks like our original population; but since we do not have access to the population itself, this is often the best we can do.

boot15000 <- one_sample %>%
    rep_sample_n(size = 40, replace = TRUE, reps = 15000)
head(boot15000)
    ## # A tibble: 6 x 2
     ## # Groups:   replicate [1]
     ##   replicate price
     ##       <int> <dbl>
## 1         1   200
## 2         1   176
## 3         1   105
## 4         1   105
## 5         1   105
## 6         1   132
    tail(boot15000)
    ## # A tibble: 6 x 2
     ## # Groups:   replicate [1]
     ##   replicate price
     ##       <int> <dbl>
## 1     15000   357
## 2     15000    49
## 3     15000   115
## 4     15000   169
## 5     15000   145
## 6     15000   357


    Let’s take a look at histograms of the first six replicates of our bootstrap samples.

six_bootstrap_samples <- boot15000 %>% 
  filter(replicate <= 6)
ggplot(six_bootstrap_samples, aes(price)) +
  geom_histogram(fill = "dodgerblue3", color = "lightgrey") + 
  xlab("Price per night ($)") +
  facet_wrap(~replicate) 


    We see in the graph above how the bootstrap samples differ. We can also calculate the sample mean for each of these six replicates.

six_bootstrap_samples %>% 
  group_by(replicate) %>% 
  summarize(mean = mean(price))

## # A tibble: 6 x 2
##   replicate  mean
##       <int> <dbl>
## 1         1  154.
## 2         2  162.
## 3         3  151.
## 4         4  163.
## 5         5  158.
## 6         6  156.

    We can see that the bootstrap sample distributions and the sample means are different. This is because we are sampling with replacement. We will now calculate point estimates for our 15,000 bootstrap samples and generate a bootstrap distribution of our point estimates. The bootstrap distribution suggests how we might expect our point estimate to behave if we took another sample.

boot15000_means <- boot15000 %>% 
  group_by(replicate) %>% 
  summarize(mean = mean(price))
head(boot15000_means)
    ## # A tibble: 6 x 2
     ##   replicate  mean
     ##       <int> <dbl>
## 1         1  154.
## 2         2  162.
## 3         3  151.
## 4         4  163.
## 5         5  158.
## 6         6  156.

    tail(boot15000_means)
    ## # A tibble: 6 x 2
     ##   replicate  mean
     ##       <int> <dbl>
## 1     14995  155.
## 2     14996  148.
## 3     14997  139.
## 4     14998  156.
## 5     14999  158.
## 6     15000  176.

boot_est_dist <- ggplot(boot15000_means, aes(x = mean)) +
    geom_histogram(fill = "dodgerblue3", color = "lightgrey") +
    xlab("Sample mean price per night ($)") 

    Let’s compare our bootstrap distribution with the true sampling distribution (taking many samples from the population).
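One way to draw such a comparison (a sketch only, assuming a data frame samples_means holding the sample means computed from the many population samples earlier in the chapter; the name is hypothetical) is to stack the two histograms with grid.arrange from the gridExtra package:

library(gridExtra)

sampling_dist <- ggplot(samples_means, aes(x = mean)) +    # hypothetical data frame of sample means
    geom_histogram(fill = "dodgerblue3", color = "lightgrey") +
    xlab("Sample mean price per night ($)") +
    ggtitle("Sampling distribution")

grid.arrange(sampling_dist,
             boot_est_dist + ggtitle("Bootstrap distribution"),
             ncol = 1)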

Figure 11.13: Comparison of distribution of the bootstrap sample means and sampling distribution


    There are two essential points that we can take away from these plots. First, the shape and spread of the true sampling distribution and the bootstrap distribution are similar; the bootstrap distribution lets us get a sense of the point estimate’s variability. The second important point is that the means of these two distributions are different. The sampling distribution is centred at $154.51, the population mean value. However, the bootstrap distribution is centred at the original sample’s mean price per night, $165.56. Because we are resampling from the original sample repeatedly, we see that the bootstrap distribution is centred at the original sample’s mean value (unlike the sampling distribution of the sample mean, which is centred at the population parameter value).


    The idea here is that we can use this distribution of bootstrap sample means to approximate the sampling distribution of the sample means when we only have one sample. Since the bootstrap distribution pretty well approximates the sampling distribution spread, we can use the bootstrap spread to help us develop a plausible range for our population parameter along with our estimate!

Figure 11.14: Summary of bootstrapping process

    11.5.3 Using the bootstrap to calculate a plausible range


Now that we have constructed our bootstrap distribution, let’s use it to create an approximate bootstrap confidence interval, a range of plausible values for the population mean. We will build a 95% percentile bootstrap confidence interval by finding the range of values that covers the middle 95% of the bootstrap distribution. A 95% confidence interval means that if we were to repeat the sampling process many times and calculate a 95% confidence interval each time, then 95% of those intervals would capture the population parameter’s value. Note that there’s nothing particularly special about 95%; we could have used other confidence levels, such as 90% or 99%. There is a balance between the level of confidence and an interval’s precision: a higher confidence level corresponds to a wider interval, and a lower confidence level corresponds to a narrower one. Therefore, the level we choose depends on what chance of being wrong we are willing to accept in our application. In general, we choose a confidence level that makes us comfortable with our uncertainty, but not so strict that the interval is unhelpful. For instance, if our decision impacts human life and the implications of being wrong are deadly, we may want to choose a higher confidence level.


    To calculate our 95% percentile bootstrap confidence interval, we will do the following:

    1. Arrange the observations in the bootstrap distribution in ascending order
    2. Find the value such that 2.5% of observations fall below it (the 2.5% percentile). Use that value as the lower bound of the interval
    3. Find the value such that 97.5% of observations fall below it (the 97.5% percentile). Use that value as the upper bound of the interval

    To do this in R, we can use the quantile() function:

bounds <- boot15000_means %>% 
    select(mean) %>% 
    pull() %>% 
    quantile(c(0.025, 0.975))
bounds
    ##     2.5%    97.5% 
## 134.0778 200.2759

    Our interval, $134.08 to $200.28, captures the middle 95% of the sample mean prices in the bootstrap distribution. We can visualize the interval on our distribution in the picture below.
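The plotting code is not shown here; one simple sketch, reusing boot_est_dist and the bounds computed above, is to add dashed vertical lines at the two percentiles:

boot_est_dist +
    geom_vline(xintercept = bounds, linetype = "dashed")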

Figure 11.15: Distribution of the bootstrap sample means with percentile lower and upper bounds


To finish our estimation of the population parameter, we would report the point estimate and our confidence interval’s lower and upper bounds. Here the sample mean price per night of 40 Airbnb listings was $165.62, and we are 95% “confident” that the true population mean price per night for all Airbnb listings in Vancouver is between $134.08 and $200.28.


Notice that our interval does indeed contain the true population mean value, $154.51! However, in practice, we would not know whether our interval captured the population parameter or not because we usually only have a single sample, not the entire population. Still, this is the best we can do when we only have one sample!


This chapter is only the beginning of the journey into statistical inference. We can extend the concepts learned here to do much more than report point estimates and confidence intervals, such as testing hypotheses about differences between populations or testing for associations between variables. We have just scratched the surface of statistical inference; however, the material presented here will serve as the foundation for more advanced statistical techniques you may learn about in the future!

    11.6 Additional readings

For more about statistical inference and bootstrapping, refer to pages 187-190 of Introduction to Statistical Learning with Applications in R by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani, and Chapters 7 - 8 of Modern Dive: Statistical Inference via Data Science by Chester Ismay and Albert Y. Kim.
diff --git a/docs/reading.html b/docs/reading.html index c170b5a63..749482d3e 100644

2.4.1 Skipping rows when reading

## cols(
##   `Data source: https://datausa.io/` = col_character()
## )

    ## Warning: 53 parsing failures.
## row col  expected    actual                                     file
##   3  -- 1 columns 6 columns 'data/state_property_vote_meta-data.csv'
##   4  -- 1 columns 6 columns 'data/state_property_vote_meta-data.csv'
##   5  -- 1 columns 6 columns 'data/state_property_vote_meta-data.csv'
##   6  -- 1 columns 6 columns 'data/state_property_vote_meta-data.csv'
##   7  -- 1 columns 6 columns 'data/state_property_vote_meta-data.csv'
## ... ... ......... ......... ........................................
## See problems(...) for more details.
    us_data
    ## # A tibble: 55 x 1
##    `Data source: https://datausa.io/`
##    <chr>
##  1 Record of how data was collected: https://github.com/UBC-DSCI/introduction-to-datascience/blob/master/data/ret…
##  2 Date collected: 2020-07-08
##  3 state
##  4 Montana
##  5 Alabama
##  6 Arizona
##  7 Arkansas
##  8 California
##  9 Colorado
## 10 Connecticut
     ## # … with 45 more rows

    To successfully read data like this into R, the skip argument can be useful to tell R how many lines to skip before it should start reading in the data. In the example above, we would set this value to 3:

    us_data <- read_csv("data/state_property_vote_meta-data.csv", skip = 3)
    ## Parsed with column specification:
     ## cols(
     ##   state = col_character(),
##   pop = col_double(),
##   med_prop_val = col_double(),
##   med_income = col_double(),
##   avg_commute = col_double(),
##   party = col_character()
## )

    us_data
    ## # A tibble: 52 x 6
     ##    state                     pop med_prop_val med_income avg_commute party     
     ##    <chr>                   <dbl>        <dbl>      <dbl>       <dbl> <chr>     

2.4.2 read_delim

District of Columbia   681170     576100   75506   28.96   Democratic
Florida                20612439   197700   47439   25.8    Republican

To get this into R using the read_delim() function, we specify the first argument as the path to the file (as done with read_csv), and then provide values to the delim argument (here a tab, which we represent by "\t") and the col_names argument (here we specify that there are no column names by assigning it the value of FALSE). Both read_csv() and read_delim() have a col_names argument and the default is TRUE.

    us_data <- read_delim("data/state_property_vote.tsv",  delim = "\t", col_names = FALSE)
    ## Parsed with column specification:
     ## cols(
     ##   X1 = col_character(),
##   X2 = col_double(),
##   X3 = col_double(),
##   X4 = col_double(),
##   X5 = col_double(),
##   X6 = col_character()
## )

    us_data
    ## # A tibble: 52 x 6
     ##    X1                         X2     X3    X4    X5 X6        
     ##    <chr>                   <dbl>  <dbl> <dbl> <dbl> <chr>     


    2.4.3 Reading tabular data directly from a URL

We can also use read_csv() or read_delim() (and related functions) to read tabular data directly from a URL. In this case, we provide the URL to read_csv() instead of a path to a local file on our computer. As with a local path, we need to surround the URL with quotes. All other arguments are the same as when using these functions with a local file.

    us_data <- read_csv("https://raw.githubusercontent.com/UBC-DSCI/introduction-to-datascience/master/data/state_property_vote.csv")
    ## Parsed with column specification:
     ## cols(
     ##   state = col_character(),
##   pop = col_double(),
##   med_prop_val = col_double(),
##   med_income = col_double(),
##   avg_commute = col_double(),
##   party = col_character()
## )

    us_data
    ## # A tibble: 52 x 6
     ##    state                     pop med_prop_val med_income avg_commute party     
     ##    <chr>                   <dbl>        <dbl>      <dbl>       <dbl> <chr>     

2.5 Reading data from a Microsoft Excel file

    This type of file representation allows Excel files to store additional things that you cannot store in a .csv file, such as fonts, text formatting, graphics, multiple sheets and more. And despite looking odd in a plain text editor, we can read Excel spreadsheets into R using the readxl package developed specifically for this purpose.

    library(readxl)
us_data <- read_excel("data/state_property_vote.xlsx")
us_data
    ## # A tibble: 52 x 6
     ##    state                     pop med_prop_val med_income avg_commute party     
     ##    <chr>                   <dbl>        <dbl>      <dbl>       <dbl> <chr>     

2.6 Reading data from a database

    2.6.1 Reading data from a SQLite database

    SQLite is probably the simplest relational database that one can use in combination with R. SQLite databases are self-contained and usually stored and accessed locally on one computer. Data is usually stored in a file with a .db extension. Similar to Excel files, these are not plain text files and cannot be read in a plain text editor.

    The first thing you need to do to read data into R from a database is to connect to the database. We do that using the dbConnect function from the DBI (database interface) package. This does not read in the data, but simply tells R where the database is and opens up a communication channel.

    library(DBI)
con_state_data <- dbConnect(RSQLite::SQLite(), "data/state_property_vote.db")

Oftentimes relational databases have many tables, and their power comes from the useful ways they can be joined. Thus, anytime you want to access data from a relational database, you need to know the table names. You can get the names of all the tables in the database using the dbListTables function:

    tables <- dbListTables(con_state_data)
tables
    ## [1] "state"

We only get one table name returned from calling dbListTables, which tells us that there is only one table in this database. To reference a table in the database so we can do things like select columns and filter rows, we use the tbl function from the dbplyr package:

    library(dbplyr)
## 
## Attaching package: 'dbplyr'

## The following objects are masked from 'package:dplyr':
## 
##     ident, sql

state_db <- tbl(con_state_data, "state")
state_db
    ## # Source:   table<state> [?? x 6]
## # Database: sqlite 3.30.1 [/home/rstudio/introduction-to-datascience/data/state_property_vote.db]
     ##    state                     pop med_prop_val med_income avg_commute party     
     ##    <chr>                   <dbl>        <dbl>      <dbl>       <dbl> <chr>     
     ##  1 Montana               1042520       217200      46608        16.4 Republican

…stored on your computer, but rather on a more powerful machine somewhere on the web. So R is lazy and waits to bring this data into memory until you explicitly tell it to do so using the collect function from the dbplyr library.

    Here we will filter for only states that voted for the Republican candidate in the 2016 Presidential election, and then use collect to finally bring this data into R as a data frame.

    republican_db <- filter(state_db, party == "Republican")
republican_db
    ## # Source:   lazy query [?? x 6]
## # Database: sqlite 3.30.1 [/home/rstudio/introduction-to-datascience/data/state_property_vote.db]
     ##    state         pop med_prop_val med_income avg_commute party     
     ##    <chr>       <dbl>        <dbl>      <dbl>       <dbl> <chr>     
     ##  1 Montana   1042520       217200      46608        16.4 Republican

##  9 Iowa      3134693       142300      53816        18.1 Republican
## 10 Kansas    2907289       144900      52392        18.5 Republican
## # … with more rows

    republican_data <- collect(republican_db)
republican_data
    ## # A tibble: 30 x 6
     ##    state         pop med_prop_val med_income avg_commute party     
     ##    <chr>       <dbl>        <dbl>      <dbl>       <dbl> <chr>     

…you can use to directly feed the database reference (what tbl gives you) into downstream analysis functions (e.g., ggplot2 for data visualization and lm for linear regression modeling). However, this does not work in every case; look what happens when we try to use nrow to count rows in a data frame:

    -
    nrow(republican_db)
    +
    nrow(republican_db)
    ## [1] NA

    or tail to preview the last 6 rows of a data frame:

    tail(republican_db)


    2.8.3 Using rvest

Now that we have our CSS selectors, we can use the rvest R package to scrape our desired data from the website. First we start by loading the rvest package:

    library(rvest)
## Loading required package: xml2

## 
## Attaching package: 'rvest'

## The following object is masked from 'package:purrr':
## 
##     pluck

## The following object is masked from 'package:readr':
## 
##     guess_encoding

    library(rvest) gives error…

    If you get an error about R not being able to find the package (e.g., Error in library(rvest) : there is no package called ‘rvest’) this is likely because it was not installed. To install the rvest package, run the following command once inside R (and then delete that line of code): install.packages("rvest").

    Next, we tell R what page we want to scrape by providing the webpage’s URL in quotations to the function read_html:

    page <- read_html("https://en.wikipedia.org/wiki/Canada")

Then we send the page object to the html_nodes function. We also provide that function with the CSS selectors we obtained from the selectorgadget tool, surrounded by quotations. The html_nodes function selects nodes from the HTML document using CSS selectors. Nodes are the HTML tag pairs as well as the content between the tags. For our CSS selector td:nth-child(5), an example node that would be selected would be: <td style="text-align:left;background:#f0f0f0;"><a href="/wiki/London,_Ontario" title="London, Ontario">London</a></td>

    population_nodes <- html_nodes(page, "td:nth-child(5) , td:nth-child(7) , .infobox:nth-child(122) td:nth-child(1) , .infobox td:nth-child(3)")
head(population_nodes)
    ## {xml_nodeset (6)}
     ## [1] <td style="text-align:right;">5,928,040</td>
## [2] <td style="text-align:left;background:#f0f0f0;"><a href="/wiki/London,_Ontario" title="London, Ontario">Lon ...
     ## [3] <td style="text-align:right;">494,069\n</td>
     ## [4] <td style="text-align:right;">4,098,927</td>
## [5] <td style="text-align:left;background:#f0f0f0;">\n<a href="/wiki/St._Catharines" title="St. Catharines">St. ...
     ## [6] <td style="text-align:right;">406,074\n</td>

Next we extract the meaningful data from the HTML nodes using the html_text function. For our example, this function's only required argument is an html_nodes object, which we named population_nodes. In the case of this example node: <td style="text-align:left;background:#f0f0f0;"><a href="/wiki/London,_Ontario" title="London, Ontario">London</a></td>, the html_text function would return London.

population_text <- html_text(population_nodes)
head(population_text)

## [1] "5,928,040"              "London"                 "494,069\n"              "4,098,927"             
## [5] "St. Catharines–Niagara" "406,074\n"

    Are we done? Not quite… If you look at the data closely you see that the data is not in an optimal format for data analysis. Both the city names and population are encoded as characters in a single vector instead of being in a data frame with one character column for city and one numeric column for population (think of how you would organize the data in a spreadsheet). Additionally, the populations contain commas (not useful for programmatically dealing with numbers), and some even contain a line break character at the end (\n). Next chapter we will learn more about data wrangling using R so that we can easily clean up this data with a few lines of code.
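As a preview of that cleanup (a sketch only, not the book's code), the line breaks and commas could be stripped with stringr functions, after which the purely numeric strings can be converted to numbers; arranging the cities and populations into a proper data frame is left for the wrangling chapter, since it depends on the order of the scraped nodes:

library(tidyverse)

cleaned <- population_text %>%
  str_remove_all("\n") %>%   # drop trailing line break characters
  str_remove_all(",")        # drop thousands separators

# convert the entries that are purely digits into numbers
as.numeric(cleaned[str_detect(cleaned, "^[0-9]+$")])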

diff --git a/docs/regression1.html b/docs/regression1.html index e640c76b7..b1a89ea01 100644

8.4 Sacramento real estate example

…exploratory analysis. The Sacramento real estate data set we will study in this chapter was originally reported in the Sacramento Bee, but we have provided it with this repository as a stable source for the data.

    library(tidyverse)
library(tidymodels)
library(gridExtra)

sacramento <- read_csv('data/sacramento.csv')
head(sacramento)
    ## # A tibble: 6 x 9
     ##   city       zip     beds baths  sqft type        price latitude longitude
     ##   <chr>      <chr>  <dbl> <dbl> <dbl> <chr>       <dbl>    <dbl>     <dbl>

…the data as a scatter plot where we place the predictor/explanatory variable (house size) on the x-axis, and we place the target/response variable that we want to predict (price) on the y-axis:

    eda <- ggplot(sacramento, aes(x = sqft, y = price)) +
  geom_point(alpha = 0.4) +
  xlab("House size (square footage)") +
  ylab("Price (USD)") +
  scale_y_continuous(labels = dollar_format()) 
eda

Based on the visualization above, we can see that in Sacramento, CA, as the size of a house increases, so does its sale price. Thus, we can reason that we…

8.5 K-nearest neighbours regression
  • tbl (a data frame-like object to sample from)
  • size (the number of observations/rows to be randomly selected/sampled)
    set.seed(1234)
small_sacramento <- sample_n(sacramento, size = 30)

Next let’s say we come across a 2,000 square-foot house in Sacramento we are interested in purchasing, with an advertised list price of $350,000. Should we offer to pay the asking price for this house, or is it overpriced and we should…

…sale prices we have already observed. But in the plot below, we have no observations of a house of size exactly 2000 square feet. How can we predict the price?

    small_plot <- ggplot(small_sacramento, aes(x = sqft, y = price)) +
  geom_point() +
  xlab("House size (square footage)") +
  ylab("Price (USD)") +
  scale_y_continuous(labels = dollar_format()) +
  geom_vline(xintercept = 2000, linetype = "dotted") 
small_plot

    We will employ the same intuition from the classification chapter, and use the neighbouring points to the new point of interest to suggest/predict what its price should be. For the example above, we find and label the 5 nearest neighbours to our observation of a house that is 2000 square feet:

    nearest_neighbours <- small_sacramento %>% 
  mutate(diff = abs(2000 - sqft)) %>% 
  arrange(diff) %>% 
  head(5)
nearest_neighbours
    ## # A tibble: 5 x 10
##   city           zip     beds baths  sqft type         price latitude longitude  diff
##   <chr>          <chr>  <dbl> <dbl> <dbl> <chr>        <dbl>    <dbl>     <dbl> <dbl>
## 1 GOLD_RIVER     z95670     3     2  1981 Residential 305000     38.6     -121.    19
## 2 ELK_GROVE      z95758     4     2  2056 Residential 275000     38.4     -121.    56
## 3 ELK_GROVE      z95624     5     3  2136 Residential 223058     38.4     -121.   136
## 4 RANCHO_CORDOVA z95742     4     2  1713 Residential 263500     38.6     -121.   287
## 5 RIO_LINDA      z95673     2     2  1690 Residential 136500     38.7     -121.   310

    Now that we have the 5 nearest neighbours (in terms of house size) to our new 2,000 square-foot house of interest, we can use their values to predict a selling price for the new home. Specifically, we can take the mean (or average) of these 5 values as our predicted value.

    prediction <- nearest_neighbours %>% 
  summarise(predicted = mean(price))
prediction
    ## # A tibble: 1 x 1
     ##   predicted
     ##       <dbl>

8.6 Training, evaluating, and tuning the model

…will come back to only after we choose our final model. Let’s take care of that now. Note that for the remainder of the chapter we’ll be working with the entire Sacramento data set, as opposed to the smaller sample of 30 points above.

    set.seed(1234)
sacramento_split <- initial_split(sacramento, prop = 0.6, strata = price)
sacramento_train <- training(sacramento_split)
sacramento_test <- testing(sacramento_split)

Next, we’ll use cross-validation to choose \(K\). In K-NN classification, we used accuracy to see how well our predictions matched the true labels. Here in the context of K-NN regression we will use root mean square prediction error (RMSPE)…
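For reference, the standard definition of RMSPE, with \(y_i\) the observed response values, \(\hat{y}_i\) the corresponding predicted values, and \(n\) the number of observations being predicted, is

\[\text{RMSPE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}\]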

…in our preprocessing to build good habits, but since we only have one predictor it is technically not necessary; there is no risk of comparing two predictors of different scales.

    +
    sacr_recipe <- recipe(price ~ sqft, data = sacramento_train) %>%
    +                  step_scale(all_predictors()) %>%
    +                  step_center(all_predictors())
    +
    +sacr_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) %>%
    +                  set_engine("kknn") %>%
    +                  set_mode("regression")
    +
    +sacr_vfold <- vfold_cv(sacramento_train, v = 5, strata = price)
    +
    +sacr_wkflw <- workflow() %>%
    +                 add_recipe(sacr_recipe) %>%
    +                 add_model(sacr_spec)
    +sacr_wkflw
    +
    ## ══ Workflow ══════════════════════════════════════════════════════════════════════════════════════════════════════
     ## Preprocessor: Recipe
     ## Model: nearest_neighbor()
     ## 
    -## ── Preprocessor ────────────────────────────────────────────────────────────────
    +## ── Preprocessor ──────────────────────────────────────────────────────────────────────────────────────────────────
     ## 2 Recipe Steps
     ## 
     ## ● step_scale()
     ## ● step_center()
     ## 
    -## ── Model ───────────────────────────────────────────────────────────────────────
    +## ── Model ─────────────────────────────────────────────────────────────────────────────────────────────────────────
     ## K-Nearest Neighbor Model Specification (regression)
     ## 
     ## Main Arguments:

…tells tidymodels that we need to use different metrics (RMSPE, not accuracy) for tuning and evaluation. You can see this in the following code, which tunes the model and returns the RMSPE for each number of neighbours.

    -
    gridvals <- tibble(neighbors = seq(1,200))
    -
    -sacr_results <- sacr_wkflw %>%
    -                   tune_grid(resamples = sacr_vfold, grid = gridvals) %>%
    -                   collect_metrics() 
    -
    -# show all the results
    -sacr_results
    +
    gridvals <- tibble(neighbors = seq(1,200))
    +
    +sacr_results <- sacr_wkflw %>%
    +                   tune_grid(resamples = sacr_vfold, grid = gridvals) %>%
    +                   collect_metrics() 
    +
    +# show all the results
    +sacr_results
    ## # A tibble: 400 x 6
     ##    neighbors .metric .estimator       mean     n   std_err
     ##        <int> <chr>   <chr>           <dbl> <int>     <dbl>
    @@ -609,11 +609,11 @@ 

    8.6 Training, evaluating, and tun ## 10 5 rsq standard 0.543 5 0.0303 ## # … with 390 more rows

    We take the minimum RMSPE to find the best setting for the number of neighbours:

    # show only the row of minimum RMSPE
sacr_min <- sacr_results %>%
               filter(.metric == 'rmse') %>%
               filter(mean == min(mean))
sacr_min
    ## # A tibble: 1 x 6
     ##   neighbors .metric .estimator   mean     n std_err
     ##       <int> <chr>   <chr>       <dbl> <int>   <dbl>

8.8 Evaluating on the test set

…set_mode, the metrics function knows to output a quality summary related to regression, and not, say, classification.

    +
    set.seed(1234)
    +kmin <- sacr_min %>% pull(neighbors)
    +sacr_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = kmin) %>%
    +            set_engine("kknn") %>%
    +            set_mode("regression")
    +
    +sacr_fit <- workflow() %>%
    +           add_recipe(sacr_recipe) %>%
    +           add_model(sacr_spec) %>%
    +           fit(data = sacramento_train)
    +
    +sacr_summary <- sacr_fit %>% 
    +           predict(sacramento_test) %>%
    +           bind_cols(sacramento_test) %>%
    +           metrics(truth = price, estimate = .pred) 
    +
    +sacr_summary
    ## # A tibble: 3 x 3
     ##   .metric .estimator .estimate
     ##   <chr>   <chr>          <dbl>
    @@ -726,20 +726,20 @@ 

    8.8 Evaluating on the test set\(k\) affects K-NN regression, but we show it again now, along with the code that generated it:

    +
    set.seed(1234)
    +sacr_preds <- sacr_fit %>%
    +                predict(sacramento_train) %>%
    +                bind_cols(sacramento_train)
    +
    +plot_final <- ggplot(sacr_preds, aes(x = sqft, y = price)) +
    +            geom_point(alpha = 0.4) +
    +            xlab("House size (square footage)") +
    +            ylab("Price (USD)") +
    +            scale_y_continuous(labels = dollar_format())  +
    +            geom_line(data = sacr_preds, aes(x = sqft, y = .pred), color = "blue") +
    +            ggtitle(paste0("K = ", kmin))
    +
    +plot_final


8.10 Multivariate K-NN regression

…visualizing the data, before we start modeling the data. Thus the first thing we will do is use ggpairs (from the GGally package) to plot all the variables we are interested in using in our analyses:

    library(GGally)
plot_pairs <- sacramento %>% 
  select(price, sqft, beds) %>% 
  ggpairs()
plot_pairs

    From this we can see that generally, as both house size and number of bedrooms increase, so does price. Does adding the number of bedrooms to our model improve our ability to predict house price? To answer that question, we will have to come up with the test error for a K-NN regression model using house size and number of bedrooms, and then we can compare it to the test error for the model we previously came up with that only used house size to see if it is smaller (decreased test error indicates increased prediction quality). Let’s do that now!

    First we’ll build a new model specification and recipe for the analysis. Note that we use the formula price ~ sqft + beds to denote that we have two predictors, and set neighbors = tune() to tell tidymodels to tune the number of neighbours for us.

    sacr_recipe <- recipe(price ~ sqft + beds, data = sacramento_train) %>%
                  step_scale(all_predictors()) %>%
                  step_center(all_predictors())

sacr_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) %>%
                  set_engine("kknn") %>%
                  set_mode("regression")

    Next, we’ll use 5-fold cross-validation to choose the number of neighbours via the minimum RMSPE:

    gridvals <- tibble(neighbors = seq(1,200))
sacr_k <- workflow() %>%
                 add_recipe(sacr_recipe) %>%
                 add_model(sacr_spec) %>%
                 tune_grid(sacr_vfold, grid = gridvals) %>%
                 collect_metrics() %>%
                 filter(.metric == 'rmse') %>%
                 filter(mean == min(mean)) %>%
                 pull(neighbors)
sacr_k
    ## [1] 14

    Here we see that the smallest RMSPE occurs when \(K =\) 14.

    Now that we have chosen \(K\), we need to re-train the model on the entire training data set, and after that we can use that model to predict on the test data to get our test error.

    sacr_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = sacr_k) %>%
                  set_engine("kknn") %>%
                  set_mode("regression")

knn_mult_fit <- workflow() %>%
                add_recipe(sacr_recipe) %>%
                add_model(sacr_spec) %>%
                fit(data = sacramento_train)

knn_mult_preds <- knn_mult_fit %>%
                predict(sacramento_test) %>%
                bind_cols(sacramento_test)

knn_mult_mets <- metrics(knn_mult_preds, truth = price, estimate = .pred) 
knn_mult_mets
    ## # A tibble: 3 x 3
     ##   .metric .estimator .estimate
     ##   <chr>   <chr>          <dbl>


We can also visualize the model’s predictions overlaid on top of the data. This time the predictions will be a surface in 3-D space, instead of a line in 2-D space, as we have 2 predictors instead of 1.


We can see that the predictions in this case, where we have 2 predictors, form a surface instead of a line. Because the newly added predictor, number of bedrooms, is correlated with price (USD) (meaning as price changes, so does…

diff --git a/docs/regression2.html b/docs/regression2.html index b1993cd99..225b2d85c 100644

    9.4 Linear regression in R

    As usual, we start by putting some test data away in a lock box that we can come back to after we choose our final model. Let’s take care of that now.

    set.seed(1234)
sacramento_split <- initial_split(sacramento, prop = 0.6, strata = price)
sacramento_train <- training(sacramento_split)
sacramento_test <- testing(sacramento_split)

    Now that we have our training data, we will create the model specification and recipe, and fit our simple linear regression model:

    +
    lm_spec <- linear_reg() %>%
    +            set_engine("lm") %>%
    +            set_mode("regression")
    +
    +lm_recipe <- recipe(price ~ sqft, data = sacramento_train) 
    +
    +lm_fit <- workflow() %>%
    +            add_recipe(lm_recipe) %>%
    +            add_model(lm_spec) %>%
    +            fit(data = sacramento_train)
    +lm_fit
    +
    ## ══ Workflow [trained] ════════════════════════════════════════════════════════════════════════════════════════════
     ## Preprocessor: Recipe
     ## Model: linear_reg()
     ## 
    -## ── Preprocessor ────────────────────────────────────────────────────────────────
    +## ── Preprocessor ──────────────────────────────────────────────────────────────────────────────────────────────────
     ## 0 Recipe Steps
     ## 
    -## ── Model ───────────────────────────────────────────────────────────────────────
    +## ── Model ─────────────────────────────────────────────────────────────────────────────────────────────────────────
     ## 
     ## Call:
     ## stats::lm(formula = formula, data = data)
    @@ -480,11 +480,11 @@ 

    9.4 Linear regression in R

    and that the model predicts that houses start at $15059 for 0 square feet, and that every extra square foot increases the cost of the house by $138. Finally, we predict on the test data set to assess how well our model does:

    -
    lm_test_results <- lm_fit %>%
    -                predict(sacramento_test) %>%
    -                bind_cols(sacramento_test) %>%
    -                metrics(truth = price, estimate = .pred)
    -lm_test_results
    +
    lm_test_results <- lm_fit %>%
    +                predict(sacramento_test) %>%
    +                bind_cols(sacramento_test) %>%
    +                metrics(truth = price, estimate = .pred)
    +lm_test_results
    ## # A tibble: 3 x 3
     ##   .metric .estimator .estimate
     ##   <chr>   <chr>          <dbl>
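    As a quick sanity check on the reported intercept and slope, you can plug a house size into the fitted line by hand. A minimal sketch using the rounded estimates quoted above (your exact values will depend on the random split):

    # Hand calculation with the rounded estimates from the text: price = 15059 + 138 * sqft
    intercept <- 15059
    slope <- 138
    intercept + slope * 2000     # predicted price (USD) for a hypothetical 2000 square foot house
    ## [1] 291059

    This should agree (up to rounding) with what predict(lm_fit, new_data = tibble(sqft = 2000)) returns.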
    @@ -507,20 +507,20 @@ 

    9.4 Linear regression in R

    plausible range to this line that we are not interested in at this point, so to avoid plotting it, we provide the argument se = FALSE in our call to geom_smooth.

    -
    lm_plot_final <- ggplot(sacramento_train, aes(x = sqft, y = price)) +
    -    geom_point(alpha = 0.4) +
    -    xlab("House size (square footage)") +
    -    ylab("Price (USD)") +
    -    scale_y_continuous(labels = dollar_format())  +
    -    geom_smooth(method = "lm", se = FALSE) 
    -lm_plot_final
    +
    lm_plot_final <- ggplot(sacramento_train, aes(x = sqft, y = price)) +
    +    geom_point(alpha = 0.4) +
    +    xlab("House size (square footage)") +
    +    ylab("Price (USD)") +
    +    scale_y_continuous(labels = dollar_format())  +
    +    geom_smooth(method = "lm", se = FALSE) 
    +lm_plot_final

    We can extract the coefficients from our model by accessing the fit object that is output by the fit function; we first have to extract it from the workflow using the pull_workflow_fit function, and then apply the tidy function to convert the result into a data frame:

    -
    coeffs <- tidy(pull_workflow_fit(lm_fit))
    -coeffs
    +
    coeffs <- tidy(pull_workflow_fit(lm_fit))
    +coeffs
    ## # A tibble: 2 x 5
     ##   term        estimate std.error statistic   p.value
     ##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
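    Note: in more recent releases of the workflows package, pull_workflow_fit has been superseded by extract_fit_parsnip. If the call above produces a deprecation warning, the equivalent (assuming a recent workflows version) is:

    # Newer accessor for the underlying parsnip model fit (assumes a recent workflows release)
    coeffs <- tidy(extract_fit_parsnip(lm_fit))
    coeffs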
    @@ -591,21 +591,21 @@ 

    9.6 Multivariate linear regression

    We will continue to use house sale price as our outcome/target variable that we are trying to predict. We will start by changing the formula in the recipe to include both the sqft and beds variables as predictors:

    -
    lm_recipe <- recipe(price ~ sqft + beds, data = sacramento_train) 
    +
    lm_recipe <- recipe(price ~ sqft + beds, data = sacramento_train) 

    Now we can build our workflow and fit the model:

    -
    lm_fit <- workflow() %>%
    -            add_recipe(lm_recipe) %>%
    -            add_model(lm_spec) %>%
    -            fit(data = sacramento_train)
    -lm_fit
    -
    ## ══ Workflow [trained] ══════════════════════════════════════════════════════════
    +
    lm_fit <- workflow() %>%
    +            add_recipe(lm_recipe) %>%
    +            add_model(lm_spec) %>%
    +            fit(data = sacramento_train)
    +lm_fit
    +
    ## ══ Workflow [trained] ════════════════════════════════════════════════════════════════════════════════════════════
     ## Preprocessor: Recipe
     ## Model: linear_reg()
     ## 
    -## ── Preprocessor ────────────────────────────────────────────────────────────────
    +## ── Preprocessor ──────────────────────────────────────────────────────────────────────────────────────────────────
     ## 0 Recipe Steps
     ## 
    -## ── Model ───────────────────────────────────────────────────────────────────────
    +## ── Model ─────────────────────────────────────────────────────────────────────────────────────────────────────────
     ## 
     ## Call:
     ## stats::lm(formula = formula, data = data)
    @@ -614,11 +614,11 @@ 

    9.6 Multivariate linear regression

    ## (Intercept)        sqft        beds
    ##     52690.1       154.8    -20209.4
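    Reading those coefficients as a plane, price = 52690.1 + 154.8 * sqft - 20209.4 * beds, lets you check a prediction by hand. A minimal sketch with the rounded estimates shown above, for a hypothetical 1,500 square foot, 3 bedroom house:

    # Hand calculation of the plane equation using the rounded estimates from the text
    52690.1 + 154.8 * 1500 + (-20209.4) * 3     # predicted price (USD)
    ## [1] 224261.9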

    And finally, we predict on the test data set to assess how well our model does:

    -
    lm_mult_test_results <- lm_fit %>%
    -                predict(sacramento_test) %>%
    -                bind_cols(sacramento_test) %>%
    -                metrics(truth = price, estimate = .pred)
    -lm_mult_test_results
    +
    lm_mult_test_results <- lm_fit %>%
    +                predict(sacramento_test) %>%
    +                bind_cols(sacramento_test) %>%
    +                metrics(truth = price, estimate = .pred)
    +lm_mult_test_results
    ## # A tibble: 3 x 3
     ##   .metric .estimator .estimate
     ##   <chr>   <chr>          <dbl>
    @@ -626,8 +626,8 @@ 

    9.6 Multivariate linear regression

    ## 2 rsq     standard       0.596
    ## 3 mae     standard   61008.

    In the case of two predictors, our linear regression creates a plane of best fit, shown below:

    [3-D plot: linear regression plane of best fit]

    We see that the predictions from linear regression with two predictors form a flat plane. This is the hallmark of linear regression, and differs from the wiggly, flexible surface we get from other methods such as K-NN regression.

    @@ -635,8 +635,8 @@

    9.6 Multivariate linear regression

    predictor, we can get slopes/intercept from linear regression, and thus describe the plane mathematically. We can extract those slope values from our model object as shown below:

    -
    coeffs <- tidy(pull_workflow_fit(lm_fit))
    -coeffs
    +
    coeffs <- tidy(pull_workflow_fit(lm_fit))
    +coeffs
    ## # A tibble: 3 x 5
     ##   term        estimate std.error statistic  p.value
     ##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
    @@ -661,7 +661,7 @@ 

    9.6 Multivariate linear regression

    predicting compared to K-NN regression in this multivariate regression case. To do that we can use this linear regression model to predict on the test data to get our test error.

    -
    lm_mult_test_results
    +
    lm_mult_test_results
    ## # A tibble: 3 x 3
     ##   .metric .estimator .estimate
     ##   <chr>   <chr>          <dbl>
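    Since knn_mult_mets and lm_mult_test_results were both computed on the same test set, a convenient way to compare the two models is to stack their metric tables with a label. A minimal sketch, assuming both objects from above are still in your session:

    # Stack the test-set metrics of the two models for a side-by-side comparison (sketch)
    bind_rows(
        mutate(knn_mult_mets, model = "K-NN regression"),
        mutate(lm_mult_test_results, model = "linear regression")
    )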
    diff --git a/docs/search_index.json b/docs/search_index.json
    index 30e2e348b..15533b68c 100644
    --- a/docs/search_index.json
    +++ b/docs/search_index.json
    @@ -1,3 +1,17 @@
     [
     ["index.html", "Introduction to Data Science Chapter 1 R, Jupyter, and the tidyverse 1.1 Chapter learning objectives 1.2 Jupyter notebooks 1.3 Loading a spreadsheet-like dataset 1.4 Assigning value to a data frame 1.5 Creating subsets of data frames with select & filter 1.6 Exploring data with visualizations", " Introduction to Data Science Tiffany-Anne Timbers Trevor Campbell Melissa Lee 2020-12-07 Chapter 1 R, Jupyter, and the tidyverse This is an open source textbook aimed at introducing undergraduate students to data science. It was originally written for the University of British Columbia’s DSCI 100 - Introduction to Data Science course. In this book, we define data science as the study and development of reproducible, auditable processes to obtain value (i.e., insight) from data. The book is structured so that learners spend the first four chapters learning how to use the R programming language and Jupyter notebooks to load, wrangle/clean, and visualize data, while answering descriptive and exploratory data analysis questions. The remaining chapters illustrate how to solve four common problems in data science, which are useful for answering predictive and inferential data analysis questions: Predicting a class/category for a new observation/measurement (e.g., cancerous or benign tumour) Predicting a value for a new observation/measurement (e.g., 10 km race time for 20 year old females with a BMI of 25). Finding previously unknown/unlabelled subgroups in your data (e.g., products commonly bought together on Amazon) Estimating an average or a proportion from a representative sample (group of people or units) and using that estimate to generalize to the broader population (e.g., the proportion of undergraduate students that own an iphone) For each of these problems, we map them to the type of data analysis question being asked and discuss what kinds of data are needed to answer such questions. More advanced (e.g., causal or mechanistic) data analysis questions are beyond the scope of this text. Types of data analysis questions Question type Description Example Descriptive A question which asks about summarized characteristics of a data set without interpretation (i.e., report a fact). How many people live in each US state? Exploratory A question asks if there are patterns, trends, or relationships within a single data set. Often used to propose hypotheses for future study. Does politcal party voting change with indicators of wealth in a set of data collected from groups of individuals from several regions in the United States? Inferential A question that looks for patterns, trends, or relationships in a single data set and also asks for quantification of how applicable these findings are to the wider population. Does politcal party voting change with indicators of wealth in the United States? Predictive A question that asks about predicting measurements or labels for individuals (people or things). The focus is on what things predict some outcome, but not what causes the outcome. What political party will someone vote for in the next US election? Causal A question that asks about whether changing one factor will lead to a change in another factor, on average, in the wider population. Does wealth lead to voting for a certain political party candidate in the US Presidential election? Mechanistic A question that asks about the underlying mechanism of the observed patterns, trends, or relationship (i.e., how does it happen?) 
How does wealth lead to voting for a certain political party candidate in the US Presidential election? Source: What is the question? by Jeffery T. Leek, Roger D. Peng & The Art of Data Science by Roger Peng & Elizabeth Matsui 1.1 Chapter learning objectives By the end of the chapter, students will be able to: use a Jupyter notebook to execute provided R code edit code and markdown cells in a Jupyter notebook create new code and markdown cells in a Jupyter notebook load the tidyverse library into R create new variables and objects in R using the assignment symbol use the help and documentation tools in R match the names of the following functions from the tidyverse library to their documentation descriptions: read_csv select mutate filter ggplot aes 1.2 Jupyter notebooks Jupyter notebooks are documents that contain a mix of computer code (and its output) and formattable text. Given that they are able to combine these two in a single document—code is not separate from the output or written report—notebooks are one of the leading tools to create reproducible data analyses. A reproducible data analysis is one where you can reliably and easily recreate the same results when analyzing the same data. Although this sounds like something that should always be true of any data analysis, in reality this is not often the case; one needs to make a conscious effort to perform data analysis in a reproducible manner. The name Jupyter came from combining the names of the three programming language that it was initially targeted for (Julia, Python, and R), and now many other languages can be used with Jupyter notebooks. A notebook looks like this: We have included a short demo video here to help you get started and to introduce you to R and Jupyter. However, the best way to learn how to write and run code and formattable text in a Jupyter notebook is to do it yourself! Here is a worksheet that provides a step-by-step guide through the basics. 1.3 Loading a spreadsheet-like dataset Often, the first thing we need to do in data analysis is to load a dataset into R. When we bring spreadsheet-like (think Microsoft Excel tables) data, generally shaped like a rectangle, into R it is represented as what we call a data frame object. It is very similar to a spreadsheet where the rows are the collected observations and the columns are the variables. The first kind of data we will learn how to load into R (as a data frame) is the spreadsheet-like comma-separated values format (.csv for short). These files have names ending in .csv, and can be opened open and saved from common spreadsheet programs like Microsoft Excel and Google Sheets. For example, a .csv file named state_property_vote.csv is included with the code for this book. This file— originally from Data USA—has US state-level property, income, population and voting data from 2015 and 2016. 
If we were to open this data in a plain text editor, we would see each row on its own line, and each entry in the table separated by a comma: state,pop,med_prop_val,med_income,avg_commute,party Montana,1042520,217200,46608,16.35,Republican Alabama,4863300,136200,42917,23.78,Republican Arizona,6931071,205900,50036,23.69,Republican Arkansas,2988248,123300,41335,20.49,Republican California,39250017,477500,61927,27.67,Democratic Colorado,5540545,314200,61324,23.02,Democratic Connecticut,3576452,274600,70007,24.92,Democratic Delaware,952065,243400,59853,24.97,Democratic District of Columbia,681170,576100,75506,28.96,Democratic To load this data into R, and then to do anything else with it afterwards, we will need to use something called a function. A function is a special word in R that takes in instructions (we call these arguments) and does something. The function we will use to read a .csv file into R is called read_csv. In its most basic use-case, read_csv expects that the data file: has column names (or headers), uses a comma (,) to separate the columns, and does not have row names. Below you’ll see the code used to load the data into R using the read_csv function. But there is one extra step we need to do first. Since read_csv is not included in the base installation of R, to be able to use it we have to load it from somewhere else: a collection of useful functions known as a library. The read_csv function in particular is in the tidyverse library (more on this later), which we load using the library function. Next, we call the read_csv function and pass it a single argument: the name of the file, \"state_property_vote.csv\". We have to put quotes around filenames and other letters and words that we use in our code to distinguish it from the special words that make up R programming language. This is the only argument we need to provide for this file, because our file satifies everthing else the read_csv function expects in the default use-case (which we just discussed). Later in the course, we’ll learn more about how to deal with more complicated files where the default arguments are not appropriate. For example, files that use spaces or tabs to separate the columns, or with no column names. library(tidyverse) read_csv("data/state_property_vote.csv") ## # A tibble: 52 x 6 ## state pop med_prop_val med_income avg_commute party ## <chr> <dbl> <dbl> <dbl> <dbl> <chr> ## 1 Montana 1042520 217200 46608 16.4 Republican ## 2 Alabama 4863300 136200 42917 23.8 Republican ## 3 Arizona 6931071 205900 50036 23.7 Republican ## 4 Arkansas 2988248 123300 41335 20.5 Republican ## 5 California 39250017 477500 61927 27.7 Democratic ## 6 Colorado 5540545 314200 61324 23.0 Democratic ## 7 Connecticut 3576452 274600 70007 24.9 Democratic ## 8 Delaware 952065 243400 59853 25.0 Democratic ## 9 District of Columbia 681170 576100 75506 29.0 Democratic ## 10 Florida 20612439 197700 47439 25.8 Republican ## # … with 42 more rows Above you can also see something neat that Jupyter does to help us understand our code: it colours text depending on its meaning in R. For example, you’ll note that functions get bold green text, while letters and words surrounded by quotations like filenames get blue text. In case you want to know more (optional): We use the read_csv function from the tidyverse instead of the base R function read.csv because it’s faster and it creates a nicer variant of the base R data frame called a tibble. This has several benefits that we’ll discuss in further detail later in the course. 
1.4 Assigning value to a data frame When we loaded the US state-level property, income, population, and voting data in R above using read_csv, we did not give this data frame a name, so it was just printed to the screen and we cannot do anything else with it. That isn’t very useful; what we would like to do is give a name to the data frame that read_csv outputs so that we can use it later for analysis and visualization. To assign name to something in R, there are two possible ways—using either the assignment symbol (<-) or the equals symbol (=). From a style perspective, the assignment symbol is preferred and is what we will use in this course. When we name something in R using the assignment symbol, <-, we do not need to surround it with quotes like the filename. This is because we are formally telling R about this word and giving it a value. Only characters and words that act as values need to be surrounded by quotes. Let’s now use the assignment symbol to give the name us_data to the US state-level property, income, population, and voting data frame that we get from read_csv. us_data <- read_csv("data/state_property_vote.csv") Wait a minute! Nothing happened this time! Or at least it looks like that. But actually something did happen: the data was read in and now has the name us_data associated with it. And we can use that name to access the data frame and do things with it. First we will type the name of the data frame to print it to the screen. us_data ## # A tibble: 52 x 6 ## state pop med_prop_val med_income avg_commute party ## <chr> <dbl> <dbl> <dbl> <dbl> <chr> ## 1 Montana 1042520 217200 46608 16.4 Republican ## 2 Alabama 4863300 136200 42917 23.8 Republican ## 3 Arizona 6931071 205900 50036 23.7 Republican ## 4 Arkansas 2988248 123300 41335 20.5 Republican ## 5 California 39250017 477500 61927 27.7 Democratic ## 6 Colorado 5540545 314200 61324 23.0 Democratic ## 7 Connecticut 3576452 274600 70007 24.9 Democratic ## 8 Delaware 952065 243400 59853 25.0 Democratic ## 9 District of Columbia 681170 576100 75506 29.0 Democratic ## 10 Florida 20612439 197700 47439 25.8 Republican ## # … with 42 more rows 1.5 Creating subsets of data frames with select & filter Now, we are going to learn how to obtain subsets of data from a data frame in R using two other tidyverse functions: select and filter. The select function allows you to create a subset of the columns of a data frame, while the filter function allows you to obtain a subset of the rows with specific values. Before we start using select and filter, let’s take a look at the US state-level property, income, and population data again to familiarize ourselves with it. We will do this by printing the data we loaded earlier in the chapter to the screen. 
us_data ## # A tibble: 52 x 6 ## state pop med_prop_val med_income avg_commute party ## <chr> <dbl> <dbl> <dbl> <dbl> <chr> ## 1 Montana 1042520 217200 46608 16.4 Republican ## 2 Alabama 4863300 136200 42917 23.8 Republican ## 3 Arizona 6931071 205900 50036 23.7 Republican ## 4 Arkansas 2988248 123300 41335 20.5 Republican ## 5 California 39250017 477500 61927 27.7 Democratic ## 6 Colorado 5540545 314200 61324 23.0 Democratic ## 7 Connecticut 3576452 274600 70007 24.9 Democratic ## 8 Delaware 952065 243400 59853 25.0 Democratic ## 9 District of Columbia 681170 576100 75506 29.0 Democratic ## 10 Florida 20612439 197700 47439 25.8 Republican ## # … with 42 more rows In this data frame there are 52 rows (corresponding to the 50 US states, the District of Columbia and the US territory, Puerto Rico) and 6 columns: US state name Population Median property value Median household income Average commute time in minutes The party each state voted for in the 2016 US presidential election Now let’s use select to extract the state column from this data frame. To do this, we need to provide the select function with two arguments. The first argument is the name of the data frame object, which in this example is us_data. The second argument is the column name that we want to select, here state. After passing these two arguments, the select function returns a single column (the state column that we asked for) as a data frame. state_column <- select(us_data, state) state_column ## # A tibble: 52 x 1 ## state ## <chr> ## 1 Montana ## 2 Alabama ## 3 Arizona ## 4 Arkansas ## 5 California ## 6 Colorado ## 7 Connecticut ## 8 Delaware ## 9 District of Columbia ## 10 Florida ## # … with 42 more rows 1.5.1 Using select to extract multiple columns We can also use select to obtain a subset of the data frame with multiple columns. Again, the first argument is the name of the data frame. Then we list all the columns we want as arguments separated by commas. Here we create a subset of three columns: state, median property value, and mean commute time in minutes. three_columns <- select(us_data, state, med_prop_val, avg_commute) three_columns ## # A tibble: 52 x 3 ## state med_prop_val avg_commute ## <chr> <dbl> <dbl> ## 1 Montana 217200 16.4 ## 2 Alabama 136200 23.8 ## 3 Arizona 205900 23.7 ## 4 Arkansas 123300 20.5 ## 5 California 477500 27.7 ## 6 Colorado 314200 23.0 ## 7 Connecticut 274600 24.9 ## 8 Delaware 243400 25.0 ## 9 District of Columbia 576100 29.0 ## 10 Florida 197700 25.8 ## # … with 42 more rows 1.5.2 Using select to extract a range of columns We can also use select to obtain a subset of the data frame constructed from a range of columns. To do this we use the colon (:) operator to denote the range. For example, to get all the columns in the data frame from state to med_prop_val we pass state:med_prop_val as the second argument to the select function. column_range <- select(us_data, state:med_prop_val) column_range ## # A tibble: 52 x 3 ## state pop med_prop_val ## <chr> <dbl> <dbl> ## 1 Montana 1042520 217200 ## 2 Alabama 4863300 136200 ## 3 Arizona 6931071 205900 ## 4 Arkansas 2988248 123300 ## 5 California 39250017 477500 ## 6 Colorado 5540545 314200 ## 7 Connecticut 3576452 274600 ## 8 Delaware 952065 243400 ## 9 District of Columbia 681170 576100 ## 10 Florida 20612439 197700 ## # … with 42 more rows 1.5.3 Using filter to extract a single row We can use the filter function to obtain the subset of rows with desired values from a data frame. 
Again, our first argument is the name of the data frame object, us_data. The second argument is a logical statement to use when filtering the rows. Here, for example, we’ll say that we are interested in rows where the state is New York. To make this comparison, we use the equivalency operator == to compare the values of the state column with the value \"New York\". Similar to when we loaded the data file and put quotes around the filename, here we need to put quotes around \"New York\" to tell R that this is a character value and not one of the special words that make up R programming language, nor one of the names we have given to data frames in the code we have already written. With these arguments, filter returns a data frame that has all the columns of the input data frame but only the rows we asked for in our logical filter statement. new_york <- filter(us_data, state == "New York") new_york ## # A tibble: 1 x 6 ## state pop med_prop_val med_income avg_commute party ## <chr> <dbl> <dbl> <dbl> <dbl> <chr> ## 1 New York 19745289 302400 58771 32.0 Democratic 1.5.4 Using filter to extract rows with values above a threshold If we are interested in finding information about the states who have a higher median household income than New York—whose median household income is $58,771—then we can create a filter to obtain rows where the value of med_income is greater than 58771. In this case, we see that filter returns a data frame with 16 rows; this indicates that there are 16 states with higher median household incomes than New York. high_incomes <- filter(us_data, med_income > 58771) high_incomes ## # A tibble: 16 x 6 ## state pop med_prop_val med_income avg_commute party ## <chr> <dbl> <dbl> <dbl> <dbl> <chr> ## 1 California 39250017 477500 61927 27.7 Democratic ## 2 Colorado 5540545 314200 61324 23.0 Democratic ## 3 Connecticut 3576452 274600 70007 24.9 Democratic ## 4 Delaware 952065 243400 59853 25.0 Democratic ## 5 District of Columbia 681170 576100 75506 29.0 Democratic ## 6 Hawaii 1428557 592000 69549 26.0 Democratic ## 7 Maryland 6016447 306900 73851 31.3 Democratic ## 8 Massachusetts 6811779 366900 69200 28.0 Democratic ## 9 Minnesota 5519952 211800 61473 22.1 Democratic ## 10 Alaska 741894 267800 70898 17.0 Republican ## 11 New Hampshire 1334795 251100 66469 25.2 Democratic ## 12 New Jersey 8944469 328200 71968 30.3 Democratic ## 13 North Dakota 757953 184100 60227 16.5 Republican ## 14 Utah 3051217 250300 60943 20.3 Republican ## 15 Virginia 8411808 264000 64923 27.0 Democratic ## 16 Washington 7288000 306400 61358 26.2 Democratic 1.6 Exploring data with visualizations Creating effective data visualizations is an essential piece to any data analysis. For the remainder of Chapter 1, we will learn how to use functions from the tidyverse to make visualizations that let us explore relationships in data. In particular, we’ll develop a visualization of the US property, income, population, and voting data we’ve been working with that will help us understand two potential relationships in the data: first, the relationship between median household income and median propery value across the US, and second, whether there is a pattern in which party each state voted for in the 2016 US election. This is an example of an exploratory data analysis question: we are looking for relationships and patterns within the data set we have, but are not trying to generalize what we find beyond this data set. 
1.6.1 Using ggplot to create a scatter plot Taking another look at our dataset below, we can immediately see that the three columns (or variables) we are interested in visualizing—median household income, median property value, and election result—are all in separate columns. In addition, there is a single row (or observation) for each state. The data are therefore in what we call a tidy data format. This is particularly important and will be a major focus in the remainder of this course: many of the functions from tidyverse require tidy data, including the ggplot function that we will use shortly for our visualization. Note below that we use the print function to display the us_data rather than just typing us_data; for data frames, these do the same thing. print(us_data) ## # A tibble: 52 x 6 ## state pop med_prop_val med_income avg_commute party ## <chr> <dbl> <dbl> <dbl> <dbl> <chr> ## 1 Montana 1042520 217200 46608 16.4 Republican ## 2 Alabama 4863300 136200 42917 23.8 Republican ## 3 Arizona 6931071 205900 50036 23.7 Republican ## 4 Arkansas 2988248 123300 41335 20.5 Republican ## 5 California 39250017 477500 61927 27.7 Democratic ## 6 Colorado 5540545 314200 61324 23.0 Democratic ## 7 Connecticut 3576452 274600 70007 24.9 Democratic ## 8 Delaware 952065 243400 59853 25.0 Democratic ## 9 District of Columbia 681170 576100 75506 29.0 Democratic ## 10 Florida 20612439 197700 47439 25.8 Republican ## # … with 42 more rows 1.6.2 Using ggplot to create a scatter plot We will begin with a scatter plot of the income and property value columns from our data frame. To create a scatter plot of these two variables using the ggplot function, we do the following: call the ggplot function provide the name of the data frame as the first argument call the aesthetic function, aes, to specify which column will correspond to the x-axis and which will correspond to the y-axis add a + symbol at the end of the ggplot call to add a layer to the plot call the geom_point function to tell R that we want to represent the data points as dots/points to create a scatter plot. ggplot(us_data, aes(x = med_income, y = med_prop_val)) + geom_point() In case you have used R before and are curious: There are a small number of situations in which you can have a single R expression span multiple lines. Here, the + symbol at the end of the first line tells R that the expression isn’t done yet and to continue reading on the next line. While not strictly necessary, this sort of pattern will appear a lot when using ggplot as it keeps things more readable. 1.6.3 Formatting ggplot objects One common and easy way to format your ggplot visualization is to add additional layers to the plot object using the + symbol. For example, we can use the xlab and ylab functions to add layers where we specify human readable labels for the x and y axes. Again, since we are specifying words (e.g. \"Income (USD)\") as arguments to xlab and ylab, we surround them with double quotes. There are many more layers we can add to format the plot further, and we will explore these in later chapters. ggplot(us_data, aes(x = med_income, y = med_prop_val)) + geom_point() + xlab("Income (USD)") + ylab("Median property value (USD)") From this visualization we see that for the 52 US regions in this data set, as median household income increases so does median property value. When we see two variables do this, we call this a positive relationship. Because the increasing pattern is fairly clear (not fuzzy) we can say that the relationship is strong. 
Because of the data point in the lower left-hand corner, drawing a straight line through these points wouldn’t fit very well. When a straight-line doesn’t fit the data well we say that it’s non-linear. However, we should have caution when using one point to claim non-linearity. As we will see later, this might be due to a single point not really belonging in the data set (this is often called an outlier). Learning how to describe data visualizations is a very useful skill. We will provide descriptions for you in this course (as we did above) until we get to Chapter 4, which focuses on data visualization. Then, we will explicitly teach you how to do this yourself, and how to not over-state or over-interpret the results from a visualization. 1.6.4 Coloring points by group Now we’ll move onto the second part of our exploratory data analysis question: when considering the relationship between median household income and median property value, is there a pattern in which party each state voted for in the 2016 US election? One common way to explore this is to colour the data points on the scatter plot we have already created by group/category. For example, given that we have the party each state voted for in the 2016 US Presidential election in the column named party, we can colour the points in our previous scatter plot to represent who each stated voted for. To do this we modify our scatter plot code above. Specifically, we will add an argument to the aes function, specifying that the points should be coloured by the party column: ggplot(us_data, aes(x = med_income, y = med_prop_val, color = party)) + geom_point() + xlab("Income (USD)") + ylab("Median property value (USD)") This data visualization shows that the one data point we singled out earlier on the far left of the plot has the label of “not applicable” instead of “democrat” or “republican”. Let’s use filter to look at the row that contains the “not applicable” value in the party column: missing_party <- filter(us_data, party == "Not Applicable") missing_party ## # A tibble: 1 x 6 ## state pop med_prop_val med_income avg_commute party ## <chr> <dbl> <dbl> <dbl> <dbl> <fct> ## 1 Puerto Rico 3411307 111900 20078 28.4 Not Applicable That explains it! That row in the dataset is actually not a US state, but rather the US territory of Peurto Rico. Similar to other US territories, residents of Puerto Rico cannot vote in presidential elections. Hence the “not applicable” label. Let’s remove this row from the data frame and rename the data frame vote_data. To do this, we use the opposite of the equivalency operator (==) for our filter statement, the not equivalent operator (!=). 
vote_data <- filter(us_data, party != "Not Applicable") vote_data ## # A tibble: 51 x 6 ## state pop med_prop_val med_income avg_commute party ## <chr> <dbl> <dbl> <dbl> <dbl> <fct> ## 1 Montana 1042520 217200 46608 16.4 Republican ## 2 Alabama 4863300 136200 42917 23.8 Republican ## 3 Arizona 6931071 205900 50036 23.7 Republican ## 4 Arkansas 2988248 123300 41335 20.5 Republican ## 5 California 39250017 477500 61927 27.7 Democratic ## 6 Colorado 5540545 314200 61324 23.0 Democratic ## 7 Connecticut 3576452 274600 70007 24.9 Democratic ## 8 Delaware 952065 243400 59853 25.0 Democratic ## 9 District of Columbia 681170 576100 75506 29.0 Democratic ## 10 Florida 20612439 197700 47439 25.8 Republican ## # … with 41 more rows Now we see that the data frame has 51 rows corresponding to the 50 states and the District of Columbia - all regions where residents can vote in US presidential elections. Let’s now recreate the scatter plot we made above using this data frame subset: ggplot(vote_data, aes(x = med_income, y = med_prop_val, color = party)) + geom_point() + xlab("Income (USD)") + ylab("Median property value (USD)") What do we see when considering the second part of our exploratory question? Do we see a pattern in how certain states voted in the 2016 Presidential election? We do! Most of the US States who voted for the Republican candidate in the 2016 US Presidential election had lower median household income and lower median property values (data points primarily fall in lower left-hand side of the scatter plot), whereas most of the US states who voted for the Democratic candidate in the 2016 US Presidential election had higher median household income and higher median property values (data points primarily fall in the upper right-hand side of the scatter plot). Does this mean that rich states usually vote for Democrats and poorer states generally vote for Republicans? Or could we use this data visualization on its own to predict which party each state will vote for in the next presidential election? The answer to both these questions is “no.” What we can do with this exploratory data analysis is create new hypotheses, ideas, and questions (like the ones at the beginning of this paragraph). Answering those questions would likely involve gathering additional data and doing more complex analyses, which we will see more of later in this course. 1.6.5 Putting it all together Below, we put everything from this chapter together in one code chunk. This demonstrates the power of R: in relatively few lines of code, we are able to create an entire data science workflow. library(tidyverse) us_data <- read_csv("data/state_property_vote.csv") vote_data <- filter(us_data, party != "Not Applicable") ggplot(vote_data, aes(x = med_income, y = med_prop_val, color = party)) + geom_point() + xlab("Income (USD)") + ylab("Median property value (USD)") "]
    +=======
    +["index.html", "Introduction to Data Science Chapter 1 R, Jupyter, and the tidyverse 1.1 Chapter learning objectives 1.2 Jupyter notebooks 1.3 Loading a spreadsheet-like dataset 1.4 Assigning value to a data frame 1.5 Creating subsets of data frames with select & filter 1.6 Exploring data with visualizations", " Introduction to Data Science Tiffany-Anne Timbers Trevor Campbell Melissa Lee 2020-11-24 Chapter 1 R, Jupyter, and the tidyverse This is an open source textbook aimed at introducing undergraduate students to data science. It was originally written for the University of British Columbia’s DSCI 100 - Introduction to Data Science course. In this book, we define data science as the study and development of reproducible, auditable processes to obtain value (i.e., insight) from data. The book is structured so that learners spend the first four chapters learning how to use the R programming language and Jupyter notebooks to load, wrangle/clean, and visualize data, while answering descriptive and exploratory data analysis questions. The remaining chapters illustrate how to solve four common problems in data science, which are useful for answering predictive and inferential data analysis questions: Predicting a class/category for a new observation/measurement (e.g., cancerous or benign tumour) Predicting a value for a new observation/measurement (e.g., 10 km race time for 20 year old females with a BMI of 25). Finding previously unknown/unlabelled subgroups in your data (e.g., products commonly bought together on Amazon) Estimating an average or a proportion from a representative sample (group of people or units) and using that estimate to generalize to the broader population (e.g., the proportion of undergraduate students that own an iphone) For each of these problems, we map them to the type of data analysis question being asked and discuss what kinds of data are needed to answer such questions. More advanced (e.g., causal or mechanistic) data analysis questions are beyond the scope of this text. Types of data analysis questions Question type Description Example Descriptive A question which asks about summarized characteristics of a data set without interpretation (i.e., report a fact). How many people live in each US state? Exploratory A question asks if there are patterns, trends, or relationships within a single data set. Often used to propose hypotheses for future study. Does politcal party voting change with indicators of wealth in a set of data collected from groups of individuals from several regions in the United States? Inferential A question that looks for patterns, trends, or relationships in a single data set and also asks for quantification of how applicable these findings are to the wider population. Does politcal party voting change with indicators of wealth in the United States? Predictive A question that asks about predicting measurements or labels for individuals (people or things). The focus is on what things predict some outcome, but not what causes the outcome. What political party will someone vote for in the next US election? Causal A question that asks about whether changing one factor will lead to a change in another factor, on average, in the wider population. Does wealth lead to voting for a certain political party candidate in the US Presidential election? Mechanistic A question that asks about the underlying mechanism of the observed patterns, trends, or relationship (i.e., how does it happen?) 
How does wealth lead to voting for a certain political party candidate in the US Presidential election? Source: What is the question? by Jeffery T. Leek, Roger D. Peng & The Art of Data Science by Roger Peng & Elizabeth Matsui 1.1 Chapter learning objectives By the end of the chapter, students will be able to: use a Jupyter notebook to execute provided R code edit code and markdown cells in a Jupyter notebook create new code and markdown cells in a Jupyter notebook load the tidyverse library into R create new variables and objects in R using the assignment symbol use the help and documentation tools in R match the names of the following functions from the tidyverse library to their documentation descriptions: read_csv select mutate filter ggplot aes 1.2 Jupyter notebooks Jupyter notebooks are documents that contain a mix of computer code (and its output) and formattable text. Given that they are able to combine these two in a single document—code is not separate from the output or written report—notebooks are one of the leading tools to create reproducible data analyses. A reproducible data analysis is one where you can reliably and easily recreate the same results when analyzing the same data. Although this sounds like something that should always be true of any data analysis, in reality this is not often the case; one needs to make a conscious effort to perform data analysis in a reproducible manner. The name Jupyter came from combining the names of the three programming language that it was initially targeted for (Julia, Python, and R), and now many other languages can be used with Jupyter notebooks. A notebook looks like this: We have included a short demo video here to help you get started and to introduce you to R and Jupyter. However, the best way to learn how to write and run code and formattable text in a Jupyter notebook is to do it yourself! Here is a worksheet that provides a step-by-step guide through the basics. 1.3 Loading a spreadsheet-like dataset Often, the first thing we need to do in data analysis is to load a dataset into R. When we bring spreadsheet-like (think Microsoft Excel tables) data, generally shaped like a rectangle, into R it is represented as what we call a data frame object. It is very similar to a spreadsheet where the rows are the collected observations and the columns are the variables. The first kind of data we will learn how to load into R (as a data frame) is the spreadsheet-like comma-separated values format (.csv for short). These files have names ending in .csv, and can be opened open and saved from common spreadsheet programs like Microsoft Excel and Google Sheets. For example, a .csv file named state_property_vote.csv is included with the code for this book. This file— originally from Data USA—has US state-level property, income, population and voting data from 2015 and 2016. 
If we were to open this data in a plain text editor, we would see each row on its own line, and each entry in the table separated by a comma: state,pop,med_prop_val,med_income,avg_commute,party Montana,1042520,217200,46608,16.35,Republican Alabama,4863300,136200,42917,23.78,Republican Arizona,6931071,205900,50036,23.69,Republican Arkansas,2988248,123300,41335,20.49,Republican California,39250017,477500,61927,27.67,Democratic Colorado,5540545,314200,61324,23.02,Democratic Connecticut,3576452,274600,70007,24.92,Democratic Delaware,952065,243400,59853,24.97,Democratic District of Columbia,681170,576100,75506,28.96,Democratic To load this data into R, and then to do anything else with it afterwards, we will need to use something called a function. A function is a special word in R that takes in instructions (we call these arguments) and does something. The function we will use to read a .csv file into R is called read_csv. In its most basic use-case, read_csv expects that the data file: has column names (or headers), uses a comma (,) to separate the columns, and does not have row names. Below you’ll see the code used to load the data into R using the read_csv function. But there is one extra step we need to do first. Since read_csv is not included in the base installation of R, to be able to use it we have to load it from somewhere else: a collection of useful functions known as a library. The read_csv function in particular is in the tidyverse library (more on this later), which we load using the library function. Next, we call the read_csv function and pass it a single argument: the name of the file, \"state_property_vote.csv\". We have to put quotes around filenames and other letters and words that we use in our code to distinguish it from the special words that make up R programming language. This is the only argument we need to provide for this file, because our file satifies everthing else the read_csv function expects in the default use-case (which we just discussed). Later in the course, we’ll learn more about how to deal with more complicated files where the default arguments are not appropriate. For example, files that use spaces or tabs to separate the columns, or with no column names. library(tidyverse) read_csv("data/state_property_vote.csv") ## # A tibble: 52 x 6 ## state pop med_prop_val med_income avg_commute party ## <chr> <dbl> <dbl> <dbl> <dbl> <chr> ## 1 Montana 1042520 217200 46608 16.4 Republican ## 2 Alabama 4863300 136200 42917 23.8 Republican ## 3 Arizona 6931071 205900 50036 23.7 Republican ## 4 Arkansas 2988248 123300 41335 20.5 Republican ## 5 California 39250017 477500 61927 27.7 Democratic ## 6 Colorado 5540545 314200 61324 23.0 Democratic ## 7 Connecticut 3576452 274600 70007 24.9 Democratic ## 8 Delaware 952065 243400 59853 25.0 Democratic ## 9 District of Columbia 681170 576100 75506 29.0 Democratic ## 10 Florida 20612439 197700 47439 25.8 Republican ## # … with 42 more rows Above you can also see something neat that Jupyter does to help us understand our code: it colours text depending on its meaning in R. For example, you’ll note that functions get bold green text, while letters and words surrounded by quotations like filenames get blue text. In case you want to know more (optional): We use the read_csv function from the tidyverse instead of the base R function read.csv because it’s faster and it creates a nicer variant of the base R data frame called a tibble. This has several benefits that we’ll discuss in further detail later in the course. 
1.4 Assigning value to a data frame When we loaded the US state-level property, income, population, and voting data in R above using read_csv, we did not give this data frame a name, so it was just printed to the screen and we cannot do anything else with it. That isn’t very useful; what we would like to do is give a name to the data frame that read_csv outputs so that we can use it later for analysis and visualization. To assign name to something in R, there are two possible ways—using either the assignment symbol (<-) or the equals symbol (=). From a style perspective, the assignment symbol is preferred and is what we will use in this course. When we name something in R using the assignment symbol, <-, we do not need to surround it with quotes like the filename. This is because we are formally telling R about this word and giving it a value. Only characters and words that act as values need to be surrounded by quotes. Let’s now use the assignment symbol to give the name us_data to the US state-level property, income, population, and voting data frame that we get from read_csv. us_data <- read_csv("data/state_property_vote.csv") Wait a minute! Nothing happened this time! Or at least it looks like that. But actually something did happen: the data was read in and now has the name us_data associated with it. And we can use that name to access the data frame and do things with it. First we will type the name of the data frame to print it to the screen. us_data ## # A tibble: 52 x 6 ## state pop med_prop_val med_income avg_commute party ## <chr> <dbl> <dbl> <dbl> <dbl> <chr> ## 1 Montana 1042520 217200 46608 16.4 Republican ## 2 Alabama 4863300 136200 42917 23.8 Republican ## 3 Arizona 6931071 205900 50036 23.7 Republican ## 4 Arkansas 2988248 123300 41335 20.5 Republican ## 5 California 39250017 477500 61927 27.7 Democratic ## 6 Colorado 5540545 314200 61324 23.0 Democratic ## 7 Connecticut 3576452 274600 70007 24.9 Democratic ## 8 Delaware 952065 243400 59853 25.0 Democratic ## 9 District of Columbia 681170 576100 75506 29.0 Democratic ## 10 Florida 20612439 197700 47439 25.8 Republican ## # … with 42 more rows 1.5 Creating subsets of data frames with select & filter Now, we are going to learn how to obtain subsets of data from a data frame in R using two other tidyverse functions: select and filter. The select function allows you to create a subset of the columns of a data frame, while the filter function allows you to obtain a subset of the rows with specific values. Before we start using select and filter, let’s take a look at the US state-level property, income, and population data again to familiarize ourselves with it. We will do this by printing the data we loaded earlier in the chapter to the screen. 
us_data ## # A tibble: 52 x 6 ## state pop med_prop_val med_income avg_commute party ## <chr> <dbl> <dbl> <dbl> <dbl> <chr> ## 1 Montana 1042520 217200 46608 16.4 Republican ## 2 Alabama 4863300 136200 42917 23.8 Republican ## 3 Arizona 6931071 205900 50036 23.7 Republican ## 4 Arkansas 2988248 123300 41335 20.5 Republican ## 5 California 39250017 477500 61927 27.7 Democratic ## 6 Colorado 5540545 314200 61324 23.0 Democratic ## 7 Connecticut 3576452 274600 70007 24.9 Democratic ## 8 Delaware 952065 243400 59853 25.0 Democratic ## 9 District of Columbia 681170 576100 75506 29.0 Democratic ## 10 Florida 20612439 197700 47439 25.8 Republican ## # … with 42 more rows In this data frame there are 52 rows (corresponding to the 50 US states, the District of Columbia and the US territory, Puerto Rico) and 6 columns: US state name Population Median property value Median household income Average commute time in minutes The party each state voted for in the 2016 US presidential election Now let’s use select to extract the state column from this data frame. To do this, we need to provide the select function with two arguments. The first argument is the name of the data frame object, which in this example is us_data. The second argument is the column name that we want to select, here state. After passing these two arguments, the select function returns a single column (the state column that we asked for) as a data frame. state_column <- select(us_data, state) state_column ## # A tibble: 52 x 1 ## state ## <chr> ## 1 Montana ## 2 Alabama ## 3 Arizona ## 4 Arkansas ## 5 California ## 6 Colorado ## 7 Connecticut ## 8 Delaware ## 9 District of Columbia ## 10 Florida ## # … with 42 more rows 1.5.1 Using select to extract multiple columns We can also use select to obtain a subset of the data frame with multiple columns. Again, the first argument is the name of the data frame. Then we list all the columns we want as arguments separated by commas. Here we create a subset of three columns: state, median property value, and mean commute time in minutes. three_columns <- select(us_data, state, med_prop_val, avg_commute) three_columns ## # A tibble: 52 x 3 ## state med_prop_val avg_commute ## <chr> <dbl> <dbl> ## 1 Montana 217200 16.4 ## 2 Alabama 136200 23.8 ## 3 Arizona 205900 23.7 ## 4 Arkansas 123300 20.5 ## 5 California 477500 27.7 ## 6 Colorado 314200 23.0 ## 7 Connecticut 274600 24.9 ## 8 Delaware 243400 25.0 ## 9 District of Columbia 576100 29.0 ## 10 Florida 197700 25.8 ## # … with 42 more rows 1.5.2 Using select to extract a range of columns We can also use select to obtain a subset of the data frame constructed from a range of columns. To do this we use the colon (:) operator to denote the range. For example, to get all the columns in the data frame from state to med_prop_val we pass state:med_prop_val as the second argument to the select function. column_range <- select(us_data, state:med_prop_val) column_range ## # A tibble: 52 x 3 ## state pop med_prop_val ## <chr> <dbl> <dbl> ## 1 Montana 1042520 217200 ## 2 Alabama 4863300 136200 ## 3 Arizona 6931071 205900 ## 4 Arkansas 2988248 123300 ## 5 California 39250017 477500 ## 6 Colorado 5540545 314200 ## 7 Connecticut 3576452 274600 ## 8 Delaware 952065 243400 ## 9 District of Columbia 681170 576100 ## 10 Florida 20612439 197700 ## # … with 42 more rows 1.5.3 Using filter to extract a single row We can use the filter function to obtain the subset of rows with desired values from a data frame. 
Again, our first argument is the name of the data frame object, us_data. The second argument is a logical statement to use when filtering the rows. Here, for example, we’ll say that we are interested in rows where the state is New York. To make this comparison, we use the equivalency operator == to compare the values of the state column with the value \"New York\". Similar to when we loaded the data file and put quotes around the filename, here we need to put quotes around \"New York\" to tell R that this is a character value and not one of the special words that make up R programming language, nor one of the names we have given to data frames in the code we have already written. With these arguments, filter returns a data frame that has all the columns of the input data frame but only the rows we asked for in our logical filter statement. new_york <- filter(us_data, state == "New York") new_york ## # A tibble: 1 x 6 ## state pop med_prop_val med_income avg_commute party ## <chr> <dbl> <dbl> <dbl> <dbl> <chr> ## 1 New York 19745289 302400 58771 32.0 Democratic 1.5.4 Using filter to extract rows with values above a threshold If we are interested in finding information about the states who have a higher median household income than New York—whose median household income is $58,771—then we can create a filter to obtain rows where the value of med_income is greater than 58771. In this case, we see that filter returns a data frame with 16 rows; this indicates that there are 16 states with higher median household incomes than New York. high_incomes <- filter(us_data, med_income > 58771) high_incomes ## # A tibble: 16 x 6 ## state pop med_prop_val med_income avg_commute party ## <chr> <dbl> <dbl> <dbl> <dbl> <chr> ## 1 California 39250017 477500 61927 27.7 Democratic ## 2 Colorado 5540545 314200 61324 23.0 Democratic ## 3 Connecticut 3576452 274600 70007 24.9 Democratic ## 4 Delaware 952065 243400 59853 25.0 Democratic ## 5 District of Columbia 681170 576100 75506 29.0 Democratic ## 6 Hawaii 1428557 592000 69549 26.0 Democratic ## 7 Maryland 6016447 306900 73851 31.3 Democratic ## 8 Massachusetts 6811779 366900 69200 28.0 Democratic ## 9 Minnesota 5519952 211800 61473 22.1 Democratic ## 10 Alaska 741894 267800 70898 17.0 Republican ## 11 New Hampshire 1334795 251100 66469 25.2 Democratic ## 12 New Jersey 8944469 328200 71968 30.3 Democratic ## 13 North Dakota 757953 184100 60227 16.5 Republican ## 14 Utah 3051217 250300 60943 20.3 Republican ## 15 Virginia 8411808 264000 64923 27.0 Democratic ## 16 Washington 7288000 306400 61358 26.2 Democratic 1.6 Exploring data with visualizations Creating effective data visualizations is an essential piece to any data analysis. For the remainder of Chapter 1, we will learn how to use functions from the tidyverse to make visualizations that let us explore relationships in data. In particular, we’ll develop a visualization of the US property, income, population, and voting data we’ve been working with that will help us understand two potential relationships in the data: first, the relationship between median household income and median propery value across the US, and second, whether there is a pattern in which party each state voted for in the 2016 US election. This is an example of an exploratory data analysis question: we are looking for relationships and patterns within the data set we have, but are not trying to generalize what we find beyond this data set. 
1.6.1 Using ggplot to create a scatter plot Taking another look at our dataset below, we can immediately see that the three columns (or variables) we are interested in visualizing—median household income, median property value, and election result—are all in separate columns. In addition, there is a single row (or observation) for each state. The data are therefore in what we call a tidy data format. This is particularly important and will be a major focus in the remainder of this course: many of the functions from tidyverse require tidy data, including the ggplot function that we will use shortly for our visualization. Note below that we use the print function to display the us_data rather than just typing us_data; for data frames, these do the same thing. print(us_data) ## # A tibble: 52 x 6 ## state pop med_prop_val med_income avg_commute party ## <chr> <dbl> <dbl> <dbl> <dbl> <chr> ## 1 Montana 1042520 217200 46608 16.4 Republican ## 2 Alabama 4863300 136200 42917 23.8 Republican ## 3 Arizona 6931071 205900 50036 23.7 Republican ## 4 Arkansas 2988248 123300 41335 20.5 Republican ## 5 California 39250017 477500 61927 27.7 Democratic ## 6 Colorado 5540545 314200 61324 23.0 Democratic ## 7 Connecticut 3576452 274600 70007 24.9 Democratic ## 8 Delaware 952065 243400 59853 25.0 Democratic ## 9 District of Columbia 681170 576100 75506 29.0 Democratic ## 10 Florida 20612439 197700 47439 25.8 Republican ## # … with 42 more rows 1.6.2 Using ggplot to create a scatter plot We will begin with a scatter plot of the income and property value columns from our data frame. To create a scatter plot of these two variables using the ggplot function, we do the following: call the ggplot function provide the name of the data frame as the first argument call the aesthetic function, aes, to specify which column will correspond to the x-axis and which will correspond to the y-axis add a + symbol at the end of the ggplot call to add a layer to the plot call the geom_point function to tell R that we want to represent the data points as dots/points to create a scatter plot. ggplot(us_data, aes(x = med_income, y = med_prop_val)) + geom_point() In case you have used R before and are curious: There are a small number of situations in which you can have a single R expression span multiple lines. Here, the + symbol at the end of the first line tells R that the expression isn’t done yet and to continue reading on the next line. While not strictly necessary, this sort of pattern will appear a lot when using ggplot as it keeps things more readable. 1.6.3 Formatting ggplot objects One common and easy way to format your ggplot visualization is to add additional layers to the plot object using the + symbol. For example, we can use the xlab and ylab functions to add layers where we specify human readable labels for the x and y axes. Again, since we are specifying words (e.g. \"Income (USD)\") as arguments to xlab and ylab, we surround them with double quotes. There are many more layers we can add to format the plot further, and we will explore these in later chapters. ggplot(us_data, aes(x = med_income, y = med_prop_val)) + geom_point() + xlab("Income (USD)") + ylab("Median property value (USD)") From this visualization we see that for the 52 US regions in this data set, as median household income increases so does median property value. When we see two variables do this, we call this a positive relationship. Because the increasing pattern is fairly clear (not fuzzy) we can say that the relationship is strong. 
Because of the data point in the lower left-hand corner, drawing a straight line through these points wouldn’t fit very well. When a straight line doesn’t fit the data well we say that it’s non-linear. However, we should be cautious about using a single point to claim non-linearity. As we will see later, this might be due to a single point not really belonging in the data set (this is often called an outlier). Learning how to describe data visualizations is a very useful skill. We will provide descriptions for you in this course (as we did above) until we get to Chapter 4, which focuses on data visualization. Then, we will explicitly teach you how to do this yourself, and how to not over-state or over-interpret the results from a visualization. 1.6.4 Coloring points by group Now we’ll move on to the second part of our exploratory data analysis question: when considering the relationship between median household income and median property value, is there a pattern in which party each state voted for in the 2016 US election? One common way to explore this is to colour the data points on the scatter plot we have already created by group/category. For example, given that we have the party each state voted for in the 2016 US Presidential election in the column named party, we can colour the points in our previous scatter plot to represent which party each state voted for. To do this we modify our scatter plot code above. Specifically, we will add an argument to the aes function, specifying that the points should be coloured by the party column: ggplot(us_data, aes(x = med_income, y = med_prop_val, color = party)) + geom_point() + xlab("Income (USD)") + ylab("Median property value (USD)") This data visualization shows that the one data point we singled out earlier on the far left of the plot has the label “Not Applicable” instead of “Democratic” or “Republican”. Let’s use filter to look at the row that contains the “Not Applicable” value in the party column: missing_party <- filter(us_data, party == "Not Applicable") missing_party ## # A tibble: 1 x 6 ## state pop med_prop_val med_income avg_commute party ## <chr> <dbl> <dbl> <dbl> <dbl> <fct> ## 1 Puerto Rico 3411307 111900 20078 28.4 Not Applicable That explains it! That row in the dataset is actually not a US state, but rather the US territory of Puerto Rico. Similar to other US territories, residents of Puerto Rico cannot vote in presidential elections. Hence the “Not Applicable” label. Let’s remove this row from the data frame and name the resulting data frame vote_data. To do this, we use the opposite of the equivalency operator (==) for our filter statement, the not-equal operator (!=).
vote_data <- filter(us_data, party != "Not Applicable") vote_data ## # A tibble: 51 x 6 ## state pop med_prop_val med_income avg_commute party ## <chr> <dbl> <dbl> <dbl> <dbl> <fct> ## 1 Montana 1042520 217200 46608 16.4 Republican ## 2 Alabama 4863300 136200 42917 23.8 Republican ## 3 Arizona 6931071 205900 50036 23.7 Republican ## 4 Arkansas 2988248 123300 41335 20.5 Republican ## 5 California 39250017 477500 61927 27.7 Democratic ## 6 Colorado 5540545 314200 61324 23.0 Democratic ## 7 Connecticut 3576452 274600 70007 24.9 Democratic ## 8 Delaware 952065 243400 59853 25.0 Democratic ## 9 District of Columbia 681170 576100 75506 29.0 Democratic ## 10 Florida 20612439 197700 47439 25.8 Republican ## # … with 41 more rows Now we see that the data frame has 51 rows corresponding to the 50 states and the District of Columbia - all regions where residents can vote in US presidential elections. Let’s now recreate the scatter plot we made above using this data frame subset: ggplot(vote_data, aes(x = med_income, y = med_prop_val, color = party)) + geom_point() + xlab("Income (USD)") + ylab("Median property value (USD)") What do we see when considering the second part of our exploratory question? Do we see a pattern in how certain states voted in the 2016 Presidential election? We do! Most of the US States who voted for the Republican candidate in the 2016 US Presidential election had lower median household income and lower median property values (data points primarily fall in lower left-hand side of the scatter plot), whereas most of the US states who voted for the Democratic candidate in the 2016 US Presidential election had higher median household income and higher median property values (data points primarily fall in the upper right-hand side of the scatter plot). Does this mean that rich states usually vote for Democrats and poorer states generally vote for Republicans? Or could we use this data visualization on its own to predict which party each state will vote for in the next presidential election? The answer to both these questions is “no.” What we can do with this exploratory data analysis is create new hypotheses, ideas, and questions (like the ones at the beginning of this paragraph). Answering those questions would likely involve gathering additional data and doing more complex analyses, which we will see more of later in this course. 1.6.5 Putting it all together Below, we put everything from this chapter together in one code chunk. This demonstrates the power of R: in relatively few lines of code, we are able to create an entire data science workflow. library(tidyverse) us_data <- read_csv("data/state_property_vote.csv") vote_data <- filter(us_data, party != "Not Applicable") ggplot(vote_data, aes(x = med_income, y = med_prop_val, color = party)) + geom_point() + xlab("Income (USD)") + ylab("Median property value (USD)") "],
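If you also want to keep the final scatter plot as an image file rather than only viewing it in your notebook, one option is ggplot2’s ggsave function. This is a sketch only, not part of the workflow above, and the file name us_scatter.png is a hypothetical choice:
# save the most recently displayed plot to a PNG file (hypothetical file name)
ggsave("us_scatter.png")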
    +["reading.html", "Chapter 2 Reading in data locally and from the web 2.1 Overview 2.2 Chapter learning objectives 2.3 Absolute and relative file paths 2.4 Reading tabular data from a plain text file into R 2.5 Reading data from an Microsoft Excel file 2.6 Reading data from a database 2.7 Writing data from R to a .csv file 2.8 Scraping data off the web using R 2.9 Additional readings/resources", " Chapter 2 Reading in data locally and from the web 2.1 Overview In this chapter, you’ll learn to read spreadsheet-like data of various formats into R from your local device and from the web. “Reading” (or “loading”) is the process of converting data (stored as plain text, a database, HTML, etc.) into an object (e.g., a dataframe) that R can easily access and manipulate, and is thus the gateway to any data analysis; you won’t be able to analyze data unless you’ve loaded it first. And because there are many ways to store data, there are similarly many ways to read data into R. If you spend more time upfront matching the data reading method to the type of data you have, you will have to spend less time re-formatting, cleaning and wrangling your data (the second step to all data analyses). It’s like making sure your shoelaces are tied well before going for a run so that you don’t trip later on! 2.2 Chapter learning objectives By the end of the chapter, students will be able to: define the following: absolute file path relative file path url read data into R using a relative path and a url compare and contrast the following functions: read_csv read_tsv read_csv2 read_delim read_excel match the following tidyverse read_* function arguments to their descriptions: file delim col_names skip choose the appropriate tidyverse read_* function and function arguments to load a given plain text tabular data set into R use readxl library’s read_excel function and arguments to load a sheet from an excel file into R connect to a database using the DBI library’s dbConnect function list the tables in a database using the DBI library’s dbListTables function create a reference to a database table that is queriable using the tbl from the dbplyr library retrieve data from a database query and bring it into R using the collect function from the dbplyr library (optional) scrape data from the web read/scrape data from an internet URL using the rvest html_nodes and html_text functions compare downloading tabular data from a plain text file (e.g. .csv) from the web versus scraping data from a .html file 2.3 Absolute and relative file paths When you load a data set into R, you first need to tell R where that files lives. The file could live on your computer (local), or somewhere on the internet (remote). In this section we will discuss the case where the file lives on your computer. The place where the file lives on your computer is called the “path”. You can think of the path as directions to the file. There are two kinds of paths: relative paths and absolute paths. A relative path is where the file is in respect to where you currently are on the computer (e.g., where the Jupyter notebook file that you’re working in is). On the other hand, an absolute path is where the file is in respect to the base (or root) folder of the computer’s filesystem. Suppose our computer’s filesystem looks like the picture below, and we are working in the Jupyter notebook titled worksheetk_02.ipynb. 
If we want to read the .csv file named happiness_report.csv into our Jupyter notebook using R, we could do this using either a relative or an absolute path. We show both choices below. Reading happiness_report.csv using a relative path: happiness_data <- read_csv("data/happiness_report.csv") Reading happiness_report.csv using an absolute path: happiness_data <- read_csv("/home/jupyter/dsci-100/worksheet_02/data/happiness_report.csv") So which one should you use? Generally speaking, to ensure your code can be run on a different computer, you should use relative paths (and it’s also less typing!). This is because the absolute path of a file (the names of folders between the computer’s root / and the file) isn’t usually the same across multiple computers. For example, suppose Alice and Bob are working on a project together on the happiness_report.csv data. Alice’s file is stored at /home/alice/project/data/happiness_report.csv, while Bob’s is stored at /home/bob/project/data/happiness_report.csv. Even though Alice and Bob stored their files in the same place on their computers (in their home folders), the absolute paths are different due to their different usernames. If Bob has code that loads the happiness_report.csv data using an absolute path, the code won’t work on Alice’s computer. But the relative path from inside the project folder (data/happiness_report.csv) is the same on both computers; any code that uses relative paths will work on both! See this video for another explanation: Source: Udacity course “Linux Command Line Basics” 2.4 Reading tabular data from a plain text file into R Now we will learn more about reading tabular data from a plain text file into R, as well as how to write tabular data to a file. Last chapter we learned about using the tidyverse read_csv function when reading files that match that function’s expected defaults (column names are present and commas are used as the delimiter/separator between columns). In this section, we will learn how to read files that do not satisfy the default expectations of read_csv. Before we jump into the cases where the data aren’t in the expected default format for tidyverse and read_csv, let’s revisit the simpler case where the defaults hold and the only argument we need to give to the function is the path to the file, data/state_property_vote.csv. We put data/ before the name of the file when we are loading the dataset because this dataset is located in a sub-folder, named data, relative to where we are running our R code.
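If you are ever unsure which folder R is running in (and therefore how a relative path will be resolved), base R can tell you. The following sketch is optional and not needed for the examples that follow:
# print the folder R is currently running in; relative paths are resolved from here
getwd()
# list the files inside the data sub-folder to confirm the file is where we expect
list.files("data")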
Here is what the file would look in a plain text editor: state,pop,med_prop_val,med_income,avg_commute,party Montana,1042520,217200,46608,16.35,Republican Alabama,4863300,136200,42917,23.78,Republican Arizona,6931071,205900,50036,23.69,Republican Arkansas,2988248,123300,41335,20.49,Republican California,39250017,477500,61927,27.67,Democratic Colorado,5540545,314200,61324,23.02,Democratic Connecticut,3576452,274600,70007,24.92,Democratic Delaware,952065,243400,59853,24.97,Democratic District of Columbia,681170,576100,75506,28.96,Democratic And here is a review of how we can use read_csv to load it into R: library(tidyverse) us_data <- read_csv("data/state_property_vote.csv") us_data ## # A tibble: 52 x 6 ## state pop med_prop_val med_income avg_commute party ## <chr> <dbl> <dbl> <dbl> <dbl> <chr> ## 1 Montana 1042520 217200 46608 16.4 Republican ## 2 Alabama 4863300 136200 42917 23.8 Republican ## 3 Arizona 6931071 205900 50036 23.7 Republican ## 4 Arkansas 2988248 123300 41335 20.5 Republican ## 5 California 39250017 477500 61927 27.7 Democratic ## 6 Colorado 5540545 314200 61324 23.0 Democratic ## 7 Connecticut 3576452 274600 70007 24.9 Democratic ## 8 Delaware 952065 243400 59853 25.0 Democratic ## 9 District of Columbia 681170 576100 75506 29.0 Democratic ## 10 Florida 20612439 197700 47439 25.8 Republican ## # … with 42 more rows 2.4.1 Skipping rows when reading in data Often times information about how data was collected, or other relevant information, is included at the top of the data file. This information is usually written in sentence and paragraph form, with no delimiter because it is not organized into columns. An example of this is shown below: Data source: https://datausa.io/ Record of how data was collected: https://github.com/UBC-DSCI/introduction-to-datascience/blob/master/data/retrieve_data.ipynb Date collected: 2020-07-08 state,pop,med_prop_val,med_income,avg_commute,party Montana,1042520,217200,46608,16.35,Republican Alabama,4863300,136200,42917,23.78,Republican Arizona,6931071,205900,50036,23.69,Republican Arkansas,2988248,123300,41335,20.49,Republican California,39250017,477500,61927,27.67,Democratic Colorado,5540545,314200,61324,23.02,Democratic Connecticut,3576452,274600,70007,24.92,Democratic Delaware,952065,243400,59853,24.97,Democratic District of Columbia,681170,576100,75506,28.96,Democratic Using read_csv as we did previously does not allow us to correctly load the data into R. In the case of this file we end up only reading in one column of the data set: us_data <- read_csv("data/state_property_vote_meta-data.csv") ## Parsed with column specification: ## cols( ## `Data source: https://datausa.io/` = col_character() ## ) ## Warning: 53 parsing failures. ## row col expected actual file ## 3 -- 1 columns 6 columns 'data/state_property_vote_meta-data.csv' ## 4 -- 1 columns 6 columns 'data/state_property_vote_meta-data.csv' ## 5 -- 1 columns 6 columns 'data/state_property_vote_meta-data.csv' ## 6 -- 1 columns 6 columns 'data/state_property_vote_meta-data.csv' ## 7 -- 1 columns 6 columns 'data/state_property_vote_meta-data.csv' ## ... ... ......... ......... ........................................ ## See problems(...) for more details. 
us_data ## # A tibble: 55 x 1 ## `Data source: https://datausa.io/` ## <chr> ## 1 Record of how data was collected: https://github.com/UBC-DSCI/introduction-to-datascience/blob/master/data/ret… ## 2 Date collected: 2020-07-08 ## 3 state ## 4 Montana ## 5 Alabama ## 6 Arizona ## 7 Arkansas ## 8 California ## 9 Colorado ## 10 Connecticut ## # … with 45 more rows To successfully read data like this into R, the skip argument can be useful to tell R how many lines to skip before it should start reading in the data. In the example above, we would set this value to 3: us_data <- read_csv("data/state_property_vote_meta-data.csv", skip = 3) ## Parsed with column specification: ## cols( ## state = col_character(), ## pop = col_double(), ## med_prop_val = col_double(), ## med_income = col_double(), ## avg_commute = col_double(), ## party = col_character() ## ) us_data ## # A tibble: 52 x 6 ## state pop med_prop_val med_income avg_commute party ## <chr> <dbl> <dbl> <dbl> <dbl> <chr> ## 1 Montana 1042520 217200 46608 16.4 Republican ## 2 Alabama 4863300 136200 42917 23.8 Republican ## 3 Arizona 6931071 205900 50036 23.7 Republican ## 4 Arkansas 2988248 123300 41335 20.5 Republican ## 5 California 39250017 477500 61927 27.7 Democratic ## 6 Colorado 5540545 314200 61324 23.0 Democratic ## 7 Connecticut 3576452 274600 70007 24.9 Democratic ## 8 Delaware 952065 243400 59853 25.0 Democratic ## 9 District of Columbia 681170 576100 75506 29.0 Democratic ## 10 Florida 20612439 197700 47439 25.8 Republican ## # … with 42 more rows 2.4.2 read_delim as a more flexible method to get tabular data into R When our tabular data comes in a different format, we can use the read_delim function instead. For example, a different version of this same dataset has no column names and uses tabs as the delimiter instead of commas. Here is how the file would look in a plain text editor: Montana 1042520 217200 46608 16.35 Republican Alabama 4863300 136200 42917 23.78 Republican Arizona 6931071 205900 50036 23.69 Republican Arkansas 2988248 123300 41335 20.49 Republican California 39250017 477500 61927 27.67 Democratic Colorado 5540545 314200 61324 23.02 Democratic Connecticut 3576452 274600 70007 24.92 Democratic Delaware 952065 243400 59853 24.97 Democratic District of Columbia 681170 576100 75506 28.96 Democratic Florida 20612439 197700 47439 25.8 Republican To get this into R using the read_delim() function, we specify the first argument as the path to the file (as done with read_csv), and then provide values to the delim argument (here a tab, which we represent by \"\\t\") and the col_names argument (here we specify that there are no column names be assigning it the value of FALSE). Both read_csv() and read_delim() have a col_names argument and the default is TRUE. 
us_data <- read_delim("data/state_property_vote.tsv", delim = "\\t", col_names = FALSE) ## Parsed with column specification: ## cols( ## X1 = col_character(), ## X2 = col_double(), ## X3 = col_double(), ## X4 = col_double(), ## X5 = col_double(), ## X6 = col_character() ## ) us_data ## # A tibble: 52 x 6 ## X1 X2 X3 X4 X5 X6 ## <chr> <dbl> <dbl> <dbl> <dbl> <chr> ## 1 Montana 1042520 217200 46608 16.4 Republican ## 2 Alabama 4863300 136200 42917 23.8 Republican ## 3 Arizona 6931071 205900 50036 23.7 Republican ## 4 Arkansas 2988248 123300 41335 20.5 Republican ## 5 California 39250017 477500 61927 27.7 Democratic ## 6 Colorado 5540545 314200 61324 23.0 Democratic ## 7 Connecticut 3576452 274600 70007 24.9 Democratic ## 8 Delaware 952065 243400 59853 25.0 Democratic ## 9 District of Columbia 681170 576100 75506 29.0 Democratic ## 10 Florida 20612439 197700 47439 25.8 Republican ## # … with 42 more rows Data frames in R need to have column names, thus if you read data into R as a data frame without column names then R assigns column names for them. If you used the read_* functions to read the data into R, then R gives each column a name of X1, X2, …, XN, where N is the number of columns in the data set. 2.4.3 Reading tabular data directly from a URL We can also use read_csv() or read_delim() (and related functions) to read in tabular data directly from a url that contains tabular data. In this case, we provide the url to read_csv() as the path to the file instead of a path to a local file on our computer. Similar to when we specify a path on our local computer, here we need to surround the url by quotes. All other arguments that we use are the same as when using these functions with a local file on our computer. us_data <- read_csv("https://raw.githubusercontent.com/UBC-DSCI/introduction-to-datascience/master/data/state_property_vote.csv") ## Parsed with column specification: ## cols( ## state = col_character(), ## pop = col_double(), ## med_prop_val = col_double(), ## med_income = col_double(), ## avg_commute = col_double(), ## party = col_character() ## ) us_data ## # A tibble: 52 x 6 ## state pop med_prop_val med_income avg_commute party ## <chr> <dbl> <dbl> <dbl> <dbl> <chr> ## 1 Montana 1042520 217200 46608 16.4 Republican ## 2 Alabama 4863300 136200 42917 23.8 Republican ## 3 Arizona 6931071 205900 50036 23.7 Republican ## 4 Arkansas 2988248 123300 41335 20.5 Republican ## 5 California 39250017 477500 61927 27.7 Democratic ## 6 Colorado 5540545 314200 61324 23.0 Democratic ## 7 Connecticut 3576452 274600 70007 24.9 Democratic ## 8 Delaware 952065 243400 59853 25.0 Democratic ## 9 District of Columbia 681170 576100 75506 29.0 Democratic ## 10 Florida 20612439 197700 47439 25.8 Republican ## # … with 42 more rows 2.4.4 Previewing a data file before reading it into R In all the examples above, we gave you previews of the data file before we read it into R. This is essential so that you can see whether or not there are column names, what the delimiters are, and if there are lines you need to skip. You should do this yourself when trying to read in data files. In Jupyter, you preview data as a plain text file by clicking on the file’s name in the Jupyter home menu. We demonstrate this in the video below: 2.5 Reading data from an Microsoft Excel file There are many other ways to store tabular datasets beyond plain text files, and similarly many ways to load those datasets into R. 
For example, it is very common to encounter, and need to load into R, data stored as a Microsoft Excel spreadsheet (with the filename extension .xlsx). To be able to do this, a key thing to know is that even though .csv and .xlsx files look almost identical when loaded into Excel, the data themselves are stored completely differently. While .csv files are plain text files, where the characters you see when you open the file in a text editor are exactly the data they represent, this is not the case for .xlsx files. Take a look at what a .xlsx file would look like in a text editor: ,?'O _rels/.rels???J1??>E?{7? <?V????w8?'J???'QrJ???Tf?d??d?o?wZ'???@>?4'?|??hlIo??F t 8f??3wn ????t??u"/ %~Ed2??<?w?? ?Pd(??J-?E???7?'t(?-GZ?????y???c~N?g[^_r?4 yG?O ?K??G?RPX?<??,?'O[Content_Types].xml???n?0E%?J ]TUEe??O??c[???????6q??s??d?m???\\???H?^????3} ?rZY? ?:L60?^?????XTP+?|?3???"~?3T1W3???,?#p?R?!??w(??R???[S?D?kP?P!XS(?i?t?$?ei X?a??4VT?,D?Jq D ?????u?]??;??L?.8AhfNv}?hHF*??Jr?Q?%?g?U??CtX"8x>?.|????5j?/$???JE?c??~??4iw?????E;?+?S??w?cV+?:???2l???=?2nel???;|?V??????c'?????9?P&Bcj,?'OdocProps/app.xml??1 ?0???k????A?u?U?]??{#?:;/<?g?Cd????M+?=???Z?O??R+??u?P?X KV@??M$??a???d?_???4??5v?R????9D????t??Fk?Ú'P?=?,?'OdocProps/core.xml??MO?0 ??J?{???3j?h'??(q??U4J ??=i?I'?b??[v?!??{gk? F2????v5yj??"J???,?d???J???C??l??4?-?`$?4t?K?.;?%c?J??G<?H???? X????z???6?????~q??X??????q^>??tH???*?D???M?g ??D?????????d?:g).?3.??j?P?F?'Oxl/_rels/workbook.xml.rels??Ak1??J?{7???R?^J?kk@Hf7??I?L???E]A?Þ?{a??`f?????b?6xUQ?@o?m}??o????X{???Q?????;?y?\\? O ?YY??4?L??S??k?252j?? ??V ?C?g?C]??????? ? ???E??TENyf6% ?Y????|??:%???}^ N?Q?N'????)??F?\\??P?G??,?'O'xl/printerSettings/printerSettings1.bin?Wmn? ??Sp>?G???q?# ?I??5R'???q????(?L ??m??8F?5< L`??`?A??2{dp??9R#?>7??Xu???/?X??HI?|? ??r)???\\?VA8?2dFfq???I]]o 5`????6A ? This type of file representation allows Excel files to store additional things that you cannot store in a .csv file, such as fonts, text formatting, graphics, multiple sheets and more. And despite looking odd in a plain text editor, we can read Excel spreadsheets into R using the readxl package developed specifically for this purpose. library(readxl) us_data <- read_excel("data/state_property_vote.xlsx") us_data ## # A tibble: 52 x 6 ## state pop med_prop_val med_income avg_commute party ## <chr> <dbl> <dbl> <dbl> <dbl> <chr> ## 1 Montana 1042520 217200 46608 16.4 Republican ## 2 Alabama 4863300 136200 42917 23.8 Republican ## 3 Arizona 6931071 205900 50036 23.7 Republican ## 4 Arkansas 2988248 123300 41335 20.5 Republican ## 5 California 39250017 477500 61927 27.7 Democratic ## 6 Colorado 5540545 314200 61324 23.0 Democratic ## 7 Connecticut 3576452 274600 70007 24.9 Democratic ## 8 Delaware 952065 243400 59853 25.0 Democratic ## 9 District of Columbia 681170 576100 75506 29.0 Democratic ## 10 Florida 20612439 197700 47439 25.8 Republican ## # … with 42 more rows If the .xlsx file has multiple sheets, then you have to use the sheet argument to specify either the sheet number or name. You can also specify cell ranges using the range argument. This is useful in cases where a single sheet contains multiple tables (a sad thing that happens to many Excel spreadsheets). As with plain text files, you should always try to explore the data file before importing it into R. This helps you decide which arguments you need to use to successfully load the data into R. If you do not have the Excel program on your computer, there are other free programs you can use to preview the file. 
Examples include Google Sheets and LibreOffice. 2.6 Reading data from a database Another very common form of data storage to be read into R for the purpose of data analysis is the relational database. There are many relational database management systems, such as SQLite, MySQL, PostgreSQL, Oracle, and many more. Almost all employ SQL (structured query language) to pull data from the database. Thankfully, you don’t need to know SQL to analyze data from a database; several packages have been written that allow R to connect to relational databases and use the R programming language as the front end (what the user types in) to pull data from them. In this book we will give examples of how to do this using R with SQLite and PostgreSQL databases. 2.6.1 Reading data from a SQLite database SQLite is probably the simplest relational database that one can use in combination with R. SQLite databases are self-contained and usually stored and accessed locally on one computer. Data is usually stored in a file with a .db extension. Similar to Excel files, these are not plain text files and cannot be read in a plain text editor. The first thing you need to do to read data into R from a database is to connect to the database. We do that using the dbConnect function from the DBI (database interface) package. This does not read in the data, but simply tells R where the database is and opens up a communication channel. library(DBI) con_state_data <- dbConnect(RSQLite::SQLite(), "data/state_property_vote.db") Oftentimes relational databases have many tables, and their power comes from the useful ways they can be joined. Thus, anytime you want to access data from a relational database, you need to know the table names. You can get the names of all the tables in the database using the dbListTables function: tables <- dbListTables(con_state_data) tables ## [1] "state" We only get one table name returned from calling dbListTables, and this tells us that there is only one table in this database. To reference a table in the database so we can do things like select columns and filter rows, we use the tbl function from the dbplyr package: library(dbplyr) ## ## Attaching package: 'dbplyr' ## The following objects are masked from 'package:dplyr': ## ## ident, sql state_db <- tbl(con_state_data, "state") state_db ## # Source: table<state> [?? x 6] ## # Database: sqlite 3.30.1 [/home/rstudio/introduction-to-datascience/data/state_property_vote.db] ## state pop med_prop_val med_income avg_commute party ## <chr> <dbl> <dbl> <dbl> <dbl> <chr> ## 1 Montana 1042520 217200 46608 16.4 Republican ## 2 Alabama 4863300 136200 42917 23.8 Republican ## 3 Arizona 6931071 205900 50036 23.7 Republican ## 4 Arkansas 2988248 123300 41335 20.5 Republican ## 5 California 39250017 477500 61927 27.7 Democratic ## 6 Colorado 5540545 314200 61324 23.0 Democratic ## 7 Connecticut 3576452 274600 70007 24.9 Democratic ## 8 Delaware 952065 243400 59853 25.0 Democratic ## 9 District of Columbia 681170 576100 75506 29.0 Democratic ## 10 Florida 20612439 197700 47439 25.8 Republican ## # … with more rows Although it looks like we just got a data frame from the database, we didn’t! It’s a reference, showing us data that is still in the SQLite database (note the first two lines of the output). It does this because databases are often more efficient at selecting, filtering and joining large datasets than R. And typically, the database will not even be stored on your computer, but rather a more powerful machine somewhere on the web.
So R is lazy and waits to bring this data into memory until you explicitly tell it to do so using the collect function from the dbplyr library. Here we will filter for only states that voted for the Republican candidate in the 2016 Presidential election, and then use collect to finally bring this data into R as a data frame. republican_db <- filter(state_db, party == "Republican") republican_db ## # Source: lazy query [?? x 6] ## # Database: sqlite 3.30.1 [/home/rstudio/introduction-to-datascience/data/state_property_vote.db] ## state pop med_prop_val med_income avg_commute party ## <chr> <dbl> <dbl> <dbl> <dbl> <chr> ## 1 Montana 1042520 217200 46608 16.4 Republican ## 2 Alabama 4863300 136200 42917 23.8 Republican ## 3 Arizona 6931071 205900 50036 23.7 Republican ## 4 Arkansas 2988248 123300 41335 20.5 Republican ## 5 Florida 20612439 197700 47439 25.8 Republican ## 6 Georgia 10310371 166800 49240 26.9 Republican ## 7 Idaho 1683140 189400 47572 19.7 Republican ## 8 Indiana 6633053 134800 49384 22.7 Republican ## 9 Iowa 3134693 142300 53816 18.1 Republican ## 10 Kansas 2907289 144900 52392 18.5 Republican ## # … with more rows republican_data <- collect(republican_db) republican_data ## # A tibble: 30 x 6 ## state pop med_prop_val med_income avg_commute party ## <chr> <dbl> <dbl> <dbl> <dbl> <chr> ## 1 Montana 1042520 217200 46608 16.4 Republican ## 2 Alabama 4863300 136200 42917 23.8 Republican ## 3 Arizona 6931071 205900 50036 23.7 Republican ## 4 Arkansas 2988248 123300 41335 20.5 Republican ## 5 Florida 20612439 197700 47439 25.8 Republican ## 6 Georgia 10310371 166800 49240 26.9 Republican ## 7 Idaho 1683140 189400 47572 19.7 Republican ## 8 Indiana 6633053 134800 49384 22.7 Republican ## 9 Iowa 3134693 142300 53816 18.1 Republican ## 10 Kansas 2907289 144900 52392 18.5 Republican ## # … with 20 more rows Why bother to use the collect function? The data looks pretty similar in both outputs shown above. And dbplyr provides lots of functions similar to filter that you can use to directly feed the database reference (what tbl gives you) into downstream analysis functions (e.g., ggplot2 for data visualization and lm for linear regression modeling). However, this does not work in every case; look what happens when we try to use nrow to count rows in a data frame: nrow(republican_db) ## [1] NA or tail to preview the last 6 rows of a data frame: tail(republican_db) ## Error: tail() is not supported by sql sources Additionally, some operations will not work to extract columns or single values from the reference given by the tbl function. Thus, once you have finished your data wrangling of the tbl database reference object, it is advisable to then bring it into your local machine’s memory using collect as a data frame. 2.6.2 Reading data from a PostgreSQL database PostgreSQL (also called Postgres) is a very popular free and open-source option for relational database software. Unlike SQLite, PostgreSQL uses a client–server database engine, as it was designed to be used and accessed on a network. This means that you have to provide more information to R when connecting to Postgres databases. 
The additional information that you need to include when you call the dbConnect function is listed below: dbname - the name of the database (a single PostgreSQL instance can host more than one database) host - the URL pointing to where the database is located port - the communication endpoint between R and the PostgreSQL database (this is typically 5432 for PostgreSQL) user - the username for accessing the database password - the password for accessing the database Additionally, we must use the RPostgres library instead of RSQLite in the dbConnect function call. Below we demonstrate how to connect to a version of the can_mov_db database, which contains information about Canadian movies (note - this is a synthetic, or artificial, database). library(RPostgres) can_mov_db_con <- dbConnect(RPostgres::Postgres(), dbname = "can_mov_db", host = "r7k3-mds1.stat.ubc.ca", port = 5432, user = "user0001", password = '################') After opening the connection, everything looks and behaves almost identically to when we were using an SQLite database in R. For example, we can again use dbListTables to find out what tables are in the can_mov_db database: dbListTables(can_mov_db_con) [1] "themes" "medium" "titles" "title_aliases" "forms" [6] "episodes" "names" "names_occupations" "occupation" "ratings" We see that there are 10 tables in this database. Let’s first look at the \"ratings\" table to find the lowest rating that exists in the can_mov_db database: ratings_db <- tbl(can_mov_db_con, "ratings") ratings_db # Source: table<ratings> [?? x 3] # Database: postgres [user0001@r7k3-mds1.stat.ubc.ca:5432/can_mov_db] title average_rating num_votes <chr> <dbl> <int> 1 The Grand Seduction 6.6 150 2 Rhymes for Young Ghouls 6.3 1685 3 Mommy 7.5 1060 4 Incendies 6.1 1101 5 Bon Cop, Bad Cop 7.0 894 6 Goon 5.5 1111 7 Monsieur Lazhar 5.6 610 8 What if 5.3 1401 9 The Barbarian Invations 5.8 99 10 Away from Her 6.9 2311 # … with more rows To find the lowest rating that exists in the data base, we first need to extract the average_rating column using select: avg_rating_db <- select(ratings_db, average_rating) avg_rating_db # Source: lazy query [?? x 1] # Database: postgres [user0001@r7k3-mds1.stat.ubc.ca:5432/can_mov_db] average_rating <dbl> 1 6.6 2 6.3 3 7.5 4 6.1 5 7.0 6 5.5 7 5.6 8 5.3 9 5.8 10 6.9 # … with more rows Next we use min to find the minimum rating in that column: min(avg_rating_db) Error in min(avg_rating_db) : invalid 'type' (list) of argument Instead of the minimum, we get an error! This is another example of when we need to use the collect function to bring the data into R for further computation: avg_rating_data <- collect(avg_rating_db) min(avg_rating_data) [1] 1 We see the lowest rating given to a movie is 1, indicating that it must have been a really bad movie… 2.7 Writing data from R to a .csv file At the middle and end of a data analysis, we often want to write a data frame that has changed (either through filtering, selecting, mutating or summarizing) to a file so that we can share it with others or use it for another step in the analysis. The most straightforward way to do this is to use the write_csv function from the tidyverse library. The default arguments for this file are to use a comma (,) as the delimiter and include column names. 
Below we demonstrate creating a new version of the US state-level property, income, population and voting data from 2015 and 2016 that does not contain the territory of Puerto Rico, and then writing this to a .csv file: state_data <- filter(us_data, state != "Puerto Rico") write_csv(state_data, "data/us_states_only.csv") 2.8 Scraping data off the web using R In the first part of this chapter we learned how to read in data from plain text files that are usually “rectangular” in shape using the tidyverse read_* functions. Sadly, not all data comes in this simple format, but happily there are many other tools we can use to read in more messy/wild data formats. One common place people often want/need to read in data from is websites. Such data exists in a non-rectangular format. One quick and easy solution to get this data is to copy and paste it; however, this becomes painstakingly long and boring when there is a lot of data that needs gathering, and anytime you start doing a lot of copying and pasting it is very likely you will introduce errors. The formal name for gathering non-rectangular data from the web and transforming it into a more useful format for data analysis is web scraping. There are two different ways to do web scraping: 1) screen scraping (similar to copying and pasting from a website, but done in a programmatic way to minimize errors and maximize efficiency) and 2) web APIs (application programming interfaces) (a website that provides a programmatic way of returning the data as JSON or XML files via http requests). In this course we will explore the first method, screen scraping using R’s rvest package. 2.8.1 HTML and CSS selectors Before we jump into scraping, let’s set up some motivation and learn a little bit about what the “source code” of a website looks like. Say we are interested in knowing the average rental price (per square footage) of the most recently available 1 bedroom apartments in Vancouver from https://vancouver.craigslist.org. When we visit the Vancouver Craigslist website and search for 1 bedroom apartments, this is what we are shown: From that page, it’s pretty easy for our human eyes to find the apartment price and square footage. But how can we do this programmatically so we don’t have to copy and paste all these numbers?
Well, we have to deal with the webpage source code, which we show a snippet of below (and link to the entire source code here): <span class="result-meta"> <span class="result-price">$800</span> <span class="housing"> 1br - </span> <span class="result-hood"> (13768 108th Avenue)</span> <span class="result-tags"> <span class="maptag" data-pid="6786042973">map</span> </span> <span class="banish icon icon-trash" role="button"> <span class="screen-reader-text">hide this posting</span> </span> <span class="unbanish icon icon-trash red" role="button" aria-hidden="true"></span> <a href="#" class="restore-link"> <span class="restore-narrow-text">restore</span> <span class="restore-wide-text">restore this posting</span> </a> </span> </p> </li> <li class="result-row" data-pid="6788463837"> <a href="https://vancouver.craigslist.org/nvn/apa/d/north-vancouver-luxury-1-bedroom/6788463837.html" class="result-image gallery" data-ids="1:00U0U_lLWbuS4jBYN,1:00T0T_9JYt6togdOB,1:00r0r_hlMkwxKqoeq,1:00n0n_2U8StpqVRYX,1:00M0M_e93iEG4BRAu,1:00a0a_PaOxz3JIfI,1:00o0o_4VznEcB0NC5,1:00V0V_1xyllKkwa9A,1:00G0G_lufKMygCGj6,1:00202_lutoxKbVTcP,1:00R0R_cQFYHDzGrOK,1:00000_hTXSBn1SrQN,1:00r0r_2toXdps0bT1,1:01616_dbAnv07FaE7,1:00g0g_1yOIckt0O1h,1:00m0m_a9fAvCYmO9L,1:00C0C_8EO8Yl1ELUi,1:00I0I_iL6IqV8n5MB,1:00b0b_c5e1FbpbWUZ,1:01717_6lFcmuJ2glV"> <span class="result-price">$2285</span> </a> <p class="result-info"> <span class="icon icon-star" role="button"> <span class="screen-reader-text">favorite this post</span> </span> <time class="result-date" datetime="2019-01-06 12:06" title="Sun 06 Jan 12:06:01 PM">Jan 6</time> <a href="https://vancouver.craigslist.org/nvn/apa/d/north-vancouver-luxury-1-bedroom/6788463837.html" data-id="6788463837" class="result-title hdrlnk">Luxury 1 Bedroom CentreView with View - Lonsdale</a> This is not easy for our human eyeballs to read! However, it is easy for us to use programmatic tools to extract the data we need by specifying which HTML tags (things inside < and > in the code above). For example, if we look in the code above and search for lines with a price, we can also look at the tags that are near that price and see if there’s a common “word” we can use that is near the price but doesn’t exist on other lines that have information we are not interested in: <span class="result-price">$800</span> and <span class="result-price">$2285</span> What we can see is there is a special “word” here, “result-price”, which appears only on the lines with prices and not on the other lines (that have information we are not interested in). This special word and the context in which is is used (learned from the other words inside the HTML tag) can be combined to create something called a CSS selector. The CSS selector can then be used by R’s rvest package to select the information we want (here price) from the website source code. Now, many websites are quite large and complex, and so then is their website source code. And as you saw above, it is not easy to read and pick out the special words we want with our human eyeballs. So to make this easier, we will use the SelectorGadget tool. It is an open source tool that simplifies generating and finding CSS selectors. We recommend you use the Chrome web browser to use this tool, and install the selector gadget tool from the Chrome Web Store. 
Here is a short video on how to install and use the SelectorGadget tool to get a CSS selector for use in web scraping: After installing and using SelectorGadget as shown in the video above, we get the two CSS selectors .housing and .result-price that we can use to scrape information about the square footage and the rental price, respectively. SelectorGadget returns them to us as a comma-separated list (here .housing , .result-price), which is exactly the format we need to provide to R if we are using more than one CSS selector. 2.8.2 Are you allowed to scrape that website? BEFORE scraping data from the web, you should always check whether or not you are ALLOWED to scrape it! There are two documents that are important for this: the robots.txt file and the website’s Terms of Service document. The website’s Terms of Service document is probably the more important of the two, and so you should look there first. What happens when we look at Craigslist’s Terms of Service document? Well we read this: “You agree not to copy/collect CL content via robots, spiders, scripts, scrapers, crawlers, or any automated or manual equivalent (e.g., by hand).” source: https://www.craigslist.org/about/terms.of.use Want to learn more about the legalities of web scraping and crawling? Read this interesting blog post titled “Web Scraping and Crawling Are Perfectly Legal, Right?” by Benoit Bernard (this is optional, not required reading). So what to do now? Well, we shouldn’t scrape Craigslist! Let’s instead scrape some data on the population of Canadian cities from Wikipedia (whose Terms of Service document does not explicitly say not to scrape). In the video below we demonstrate using the SelectorGadget tool to get CSS selectors from Wikipedia’s Canada page to scrape a table that contains city names and their populations from the 2016 Canadian Census: 2.8.3 Using rvest Now that we have our CSS selectors we can use the rvest R package to scrape our desired data from the website. First we start by loading the rvest package: library(rvest) ## Loading required package: xml2 ## ## Attaching package: 'rvest' ## The following object is masked from 'package:purrr': ## ## pluck ## The following object is masked from 'package:readr': ## ## guess_encoding library(rvest) gives error… If you get an error about R not being able to find the package (e.g., Error in library(rvest) : there is no package called ‘rvest’) this is likely because it was not installed. To install the rvest package, run the following command once inside R (and then delete that line of code): install.packages(\"rvest\"). Next, we tell R what page we want to scrape by providing the webpage’s URL in quotations to the function read_html: page <- read_html("https://en.wikipedia.org/wiki/Canada") Then we send the page object to the html_nodes function. We also provide that function with the CSS selectors we obtained from the SelectorGadget tool. These should be surrounded by quotations. The html_nodes function selects nodes from the HTML document using CSS selectors. Nodes are the HTML tag pairs as well as the content between the tags.
For our CSS selector td:nth-child(5), an example node that would be selected would be: <td style=\"text-align:left;background:#f0f0f0;\"><a href=\"/wiki/London,_Ontario\" title=\"London, Ontario\">London</a></td> population_nodes <- html_nodes(page, "td:nth-child(5) , td:nth-child(7) , .infobox:nth-child(122) td:nth-child(1) , .infobox td:nth-child(3)") head(population_nodes) ## {xml_nodeset (6)} ## [1] <td style="text-align:right;">5,928,040</td> ## [2] <td style="text-align:left;background:#f0f0f0;"><a href="/wiki/London,_Ontario" title="London, Ontario">Lon ... ## [3] <td style="text-align:right;">494,069\\n</td> ## [4] <td style="text-align:right;">4,098,927</td> ## [5] <td style="text-align:left;background:#f0f0f0;">\\n<a href="/wiki/St._Catharines" title="St. Catharines">St. ... ## [6] <td style="text-align:right;">406,074\\n</td> Next we extract the meaningful data from the HTML nodes using the html_text function. For our example, this function’s only required argument is an html_nodes object, which we named population_nodes. In the case of this example node: <td style=\"text-align:left;background:#f0f0f0;\"><a href=\"/wiki/London,_Ontario\" title=\"London, Ontario\">London</a></td>, the html_text function would return London. population_text <- html_text(population_nodes) head(population_text) ## [1] "5,928,040" "London" "494,069\\n" "4,098,927" ## [5] "St. Catharines–Niagara" "406,074\\n" Are we done? Not quite… If you look at the data closely you see that the data is not in an optimal format for data analysis. Both the city names and populations are encoded as characters in a single vector instead of being in a data frame with one character column for city and one numeric column for population (think of how you would organize the data in a spreadsheet). Additionally, the populations contain commas (not useful for programmatically dealing with numbers), and some even contain a line break character at the end (\\n). Next chapter we will learn more about data wrangling using R so that we can easily clean up this data with a few lines of code. 2.9 Additional readings/resources Data import chapter from R for Data Science by Garrett Grolemund & Hadley Wickham "],
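As a small preview of that clean-up (you do not need to do this yet), the readr function parse_number, which is loaded as part of the tidyverse, strips commas and surrounding non-numeric characters such as the trailing line break from a single scraped value:
# parse_number removes the comma and the trailing "\n" from a scraped value
parse_number("494,069\n")
## [1] 494069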
    +["wrangling.html", "Chapter 3 Cleaning and wrangling data 3.1 Overview 3.2 Chapter learning objectives 3.3 Vectors and Data frames 3.4 Tidy Data 3.5 Combining functions using the pipe operator, %>%: 3.6 Iterating over data with group_by + summarize 3.7 Additional reading on the dplyr functions 3.8 Using purrr’s map* functions to iterate 3.9 Additional readings/resources", " Chapter 3 Cleaning and wrangling data 3.1 Overview This chapter will be centered around tools for cleaning and wrangling data that move data from its raw format into a format that is suitable for data analysis. They will be presented in the context of a real world data science application, providing more practice working through a whole case study. 3.2 Chapter learning objectives By the end of the chapter, students will be able to: define the term “tidy data” discuss the advantages and disadvantages from storing data in a tidy data format recall and use the following tidyverse functions and operators for their intended data wrangling tasks: select filter %>% map mutate summarize group_by gather separate %in% 3.3 Vectors and Data frames At this point, we know how to load data into R from various file formats. Once loaded into R, all the tools we have learned about for reading data into R represent the data as a data frame. So now we will spend some time learning more about data frames in R, such that we have a better understanding of how we can use and manipulate these objects. 3.3.1 What is a data frame? Let’s first start by defining exactly what a data frame is. From a data perspective, it is a rectangle where the rows are the observations: and the columns are the variables: From a computer programming perspective, in R, a data frame is a special subtype of a list object whose elements (columns) are vectors. For example, the data frame below has 3 elements that are vectors whose names are state, year and population. 3.3.2 What is a vector? In R, vectors are objects that can contain 1 or more elements. The vector elements are ordered, and they must all be of the same type. Common example types of vectors are character (e.g., letter or words), numeric (whole numbers and fractions) and logical (e.g., TRUE or FALSE). In the vector shown below, the elements are of numeric type: 3.3.3 How are vectors different from a list? Lists are also objects in R that have multiple elements. Vectors and lists differ by the requirement of element type consistency. All elements within a single vector must be of the same type (e.g., all elements are numbers), whereas elements within a single list can be of different types (e.g., characters, numbers, logicals and even other lists can be elements in the same list). 3.3.4 What does this have to do with data frames? As mentioned earlier, a data frame is really a special type of list where the elements can only be vectors. Representing data with such an object enables us to easily work with our data in a rectangular/spreadsheet like manner, and to have columns/vectors of different characteristics associated/linked in one object. This is similar to a table in a spreadsheet or a database. 3.4 Tidy Data There are many ways spreadsheet-like dataset can be organized. In this chapter we are going to focus on the tidy data format of organization, and how to make your raw (and likely messy) data tidy. This is because a variety of tools we would like to be able to use in R are designed to work most effectively (and efficiently) with tidy data. 3.4.1 What is tidy data? 
Tidy data satisfy the following three criteria: each row is a single observation, each column is a single variable, and each value is a single cell (i.e., its row and column position in the data frame is not shared with another value) image source: R for Data Science by Garrett Grolemund & Hadley Wickham Definitions to know: observation - all of the quantities or qualities we collect from a given entity/object variable - any characteristic, number, or quantity that can be measured or collected value - a single collected quantity or quality from a given entity/object 3.4.2 Why is tidy data important in R? First, one of the most popular plotting toolsets in R, the ggplot2 library, expects the data to be in a tidy format. Second, most statistical analysis functions expect data in tidy format. Given that both of these tasks are central in virtually any data analysis project, it is well worth spending the time to get your data into a tidy format up front. Luckily there are many well-designed tidyverse data cleaning/wrangling tools to help you easily tidy your data. Let’s explore them now! 3.4.3 Going from wide to long (or tidy!) using gather One common thing that often has to be done to get data into a tidy format is to combine columns that are really part of the same variable but currently stored in separate columns. To do this we can use the function gather. gather acts to combine columns, and thus makes the data frame narrower. Data is often stored in a wider, not tidy, format because this format is often more intuitive for human readability and understanding, and humans create data sets. An example of this is shown below: library(tidyverse) hist_vote_wide <- read_csv("data/us_vote.csv") hist_vote_wide <- select(hist_vote_wide, election_year, winner, runnerup) hist_vote_wide <- tail(hist_vote_wide, 10) hist_vote_wide ## # A tibble: 10 x 3 ## election_year winner runnerup ## <dbl> <chr> <chr> ## 1 1980 Ronald Reagan Jimmy Carter ## 2 1984 Ronald Reagan Walter Mondale ## 3 1988 George H. W. Bush Michael Dukakis ## 4 1992 Bill Clinton George H. W. Bush ## 5 1996 Bill Clinton Bob Dole ## 6 2000 George W. Bush Al Gore ## 7 2004 George W. Bush John Kerry ## 8 2008 Barack Obama John McCain ## 9 2012 Barack Obama Mitt Romney ## 10 2016 Donald Trump Hillary Clinton What is wrong with our untidy format above? From a data analysis perspective, this format is not ideal because the outcome of the variable “result” (winner or runner-up) is stored as column names and is not easily accessible for the data analysis functions we will want to apply to our data set. Additionally, the values of the “candidate” variable are spread across two columns and will require some sort of binding or joining to get them into one single column to allow us to do our desired visualization and statistical tasks later on. To accomplish this data transformation we will use the tidyverse function gather.
To use gather we need to specify: the dataset the key: the name of a new column that will be created, whose values will come from the names of the columns that we want to combine (the result argument) the value: the name of a new column that will be created, whose values will come from the values of the columns we want to combine (the value argument) the names of the columns that we want to combine (we list these after specifying the key and value, and separate the column names with commas) For the above example, we use gather to combine the winner and runnerup columns into a single column called candidate, and create a column called result that contains the outcome of the election for each candidate: hist_vote_tidy <- gather(hist_vote_wide, key = result, value = candidate, winner, runnerup) hist_vote_tidy ## # A tibble: 20 x 3 ## election_year result candidate ## <dbl> <chr> <chr> ## 1 1980 winner Ronald Reagan ## 2 1984 winner Ronald Reagan ## 3 1988 winner George H. W. Bush ## 4 1992 winner Bill Clinton ## 5 1996 winner Bill Clinton ## 6 2000 winner George W. Bush ## 7 2004 winner George W. Bush ## 8 2008 winner Barack Obama ## 9 2012 winner Barack Obama ## 10 2016 winner Donald Trump ## 11 1980 runnerup Jimmy Carter ## 12 1984 runnerup Walter Mondale ## 13 1988 runnerup Michael Dukakis ## 14 1992 runnerup George H. W. Bush ## 15 1996 runnerup Bob Dole ## 16 2000 runnerup Al Gore ## 17 2004 runnerup John Kerry ## 18 2008 runnerup John McCain ## 19 2012 runnerup Mitt Romney ## 20 2016 runnerup Hillary Clinton Splitting code across lines: In the code above, the call to the gather function is split across several lines. This is allowed and encouraged when programming in R when your code line gets too long to read clearly. When doing this, it is important to end the line with a comma , so that R knows the function should continue to the next line.* The data above is now tidy because all 3 criteria for tidy data have now been met: All the variables (candidate and result) are now their own columns in the data frame. Each observation, i.e., each candidate’s name, result, and candidacy year, are in a single row. Each value is a single cell, i.e., its row, column position in the data frame is not shared with another value. 3.4.4 Using separate to deal with multiple delimiters As discussed above, data are also not considered tidy when multiple values are stored in the same cell. In addition to the previous untidiness we addressed in the earlier version of this data set, the one we show below is even messier: the winner and runnerup columns contain both the candidate’s name as well as their political party. To make this messy data tidy we’ll have to fix both of these issues. 
hist_vote_party <- read_csv("data/historical_vote_messy.csv") hist_vote_party ## # A tibble: 10 x 3 ## election_year winner runnerup ## <dbl> <chr> <chr> ## 1 2016 Donald Trump/Rep Hillary Clinton/Dem ## 2 2012 Barack Obama/Dem Mitt Romney/Rep ## 3 2008 Barack Obama/Dem John McCain/Rep ## 4 2004 George W Bush/Rep John Kerry/Dem ## 5 2000 George W Bush/Rep Al Gore/Dem ## 6 1996 Bill Clinton/Dem Bob Dole/Rep ## 7 1992 Bill Clinton/Dem George HW Bush/Rep ## 8 1988 George HW Bush/Rep Michael Dukakis/Dem ## 9 1984 Ronald Reagan/Rep Walter Mondale/Dem ## 10 1980 Ronald Reagan/Rep Jimmy Carter/Dem First we’ll use gather to create the result and candidate columns, as we did previously: hist_vote_party_gathered <- gather(hist_vote_party, key = result, value = candidate, winner, runnerup) hist_vote_party_gathered ## # A tibble: 20 x 3 ## election_year result candidate ## <dbl> <chr> <chr> ## 1 2016 winner Donald Trump/Rep ## 2 2012 winner Barack Obama/Dem ## 3 2008 winner Barack Obama/Dem ## 4 2004 winner George W Bush/Rep ## 5 2000 winner George W Bush/Rep ## 6 1996 winner Bill Clinton/Dem ## 7 1992 winner Bill Clinton/Dem ## 8 1988 winner George HW Bush/Rep ## 9 1984 winner Ronald Reagan/Rep ## 10 1980 winner Ronald Reagan/Rep ## 11 2016 runnerup Hillary Clinton/Dem ## 12 2012 runnerup Mitt Romney/Rep ## 13 2008 runnerup John McCain/Rep ## 14 2004 runnerup John Kerry/Dem ## 15 2000 runnerup Al Gore/Dem ## 16 1996 runnerup Bob Dole/Rep ## 17 1992 runnerup George HW Bush/Rep ## 18 1988 runnerup Michael Dukakis/Dem ## 19 1984 runnerup Walter Mondale/Dem ## 20 1980 runnerup Jimmy Carter/Dem Then we’ll use separate to split the candidate column into two columns, one that contains only the candidate’s name (“candidate”), and one that contains a short identifier for which political party the candidate belonged to (“party”): hist_vote_party_tidy <- separate(hist_vote_party_gathered, col = candidate, into = c("candidate", "party"), sep = "/") hist_vote_party_tidy ## # A tibble: 20 x 4 ## election_year result candidate party ## <dbl> <chr> <chr> <chr> ## 1 2016 winner Donald Trump Rep ## 2 2012 winner Barack Obama Dem ## 3 2008 winner Barack Obama Dem ## 4 2004 winner George W Bush Rep ## 5 2000 winner George W Bush Rep ## 6 1996 winner Bill Clinton Dem ## 7 1992 winner Bill Clinton Dem ## 8 1988 winner George HW Bush Rep ## 9 1984 winner Ronald Reagan Rep ## 10 1980 winner Ronald Reagan Rep ## 11 2016 runnerup Hillary Clinton Dem ## 12 2012 runnerup Mitt Romney Rep ## 13 2008 runnerup John McCain Rep ## 14 2004 runnerup John Kerry Dem ## 15 2000 runnerup Al Gore Dem ## 16 1996 runnerup Bob Dole Rep ## 17 1992 runnerup George HW Bush Rep ## 18 1988 runnerup Michael Dukakis Dem ## 19 1984 runnerup Walter Mondale Dem ## 20 1980 runnerup Jimmy Carter Dem Is this data now tidy? Well, if we recall the 3 criteria for tidy data: each row is a single observation, each column is a single variable, and each value is a single cell. We can see that this data now satifies all 3 criteria, making it easier to analyze. 
For example, we could visualize the number of winning candidates for each party over this time span: ggplot(hist_vote_party_tidy, aes(x = result, fill = party)) + geom_bar() + scale_fill_manual(values=c("blue", "red")) + xlab("US Presidential election result") + ylab("Number of US Presidential candidates") + ggtitle("US Presidential candidates (1980 - 2016)") From this visualization, we can see that between 1980 and 2016 (inclusive) the Republican party won more US Presidential elections than the Democratic party. 3.4.5 Notes on defining tidy data Is there only one shape for tidy data for a given data set? Not necessarily; it depends on the statistical question you are asking and what the variables are for that question. For tidy data, each variable should be its own column. So just as it is important to match your statistical question with the appropriate data analysis tool (classification, clustering, hypothesis testing, etc.), it is important to match your statistical question with the appropriate variables and ensure that each of them is represented as its own column to make the data tidy. 3.5 Combining functions using the pipe operator, %>%: In R, we often have to call multiple functions in a sequence to process a data frame. The basic ways of doing this can quickly become unreadable if there are many steps. For example, suppose we need to perform three operations on a data frame data: add a new column new_col that is double another old_col, filter for rows where another column, other_col, is more than 5, and select only the new column new_col for those rows. One way of doing this is to just write multiple lines of code, storing temporary objects as you go: output_1 <- mutate(data, new_col = old_col*2) output_2 <- filter(output_1, other_col > 5) output <- select(output_2, new_col) This is difficult to understand for multiple reasons. The reader may be tricked into thinking the named output_1 and output_2 objects are important for some reason, while they are just temporary intermediate computations. Further, the reader has to look through and find where output_1 and output_2 are used in each subsequent line. Another option for doing this would be to compose the functions: output <- select(filter(mutate(data, new_col = old_col*2), other_col > 5), new_col) Code like this can also be difficult to understand. Functions compose (reading from left to right) in the opposite order in which they are computed by R (above, mutate happens first, then filter, then select). It is also just a really long line of code to read in one go. The pipe operator %>% solves this problem, resulting in cleaner and easier-to-follow code. The code below accomplishes the same thing as the previous two code blocks: output <- data %>% mutate(new_col = old_col*2) %>% filter(other_col > 5) %>% select(new_col) You can think of the pipe as a physical pipe. It takes the output from the function on the left-hand side of the pipe, and passes it as the first argument to the function on the right-hand side of the pipe. Note here that we have again split the code across multiple lines for readability; R is fine with this, since it knows that a line ending in a pipe %>% is continued on the next line. Next, let's learn about the details of using the pipe, and look at some examples of how to use it in data analysis.
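Before moving on, here is a minimal sketch of the rule the pipe follows, using a toy vector rather than data from this chapter: x %>% f(y) is just another way of writing f(x, y).
# these two expressions are equivalent;
# the pipe supplies c(5, 1, 3) as the first argument of sort
sort(c(5, 1, 3))
c(5, 1, 3) %>% sort()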
3.5.1 Using %>% to combine filter and select Recall the US state-level property, income, population, and voting data that we explored in chapter 1: us_data <- read_csv("data/state_property_vote.csv") us_data ## # A tibble: 52 x 6 ## state pop med_prop_val med_income avg_commute party ## <chr> <dbl> <dbl> <dbl> <dbl> <chr> ## 1 Montana 1042520 217200 46608 16.4 Republican ## 2 Alabama 4863300 136200 42917 23.8 Republican ## 3 Arizona 6931071 205900 50036 23.7 Republican ## 4 Arkansas 2988248 123300 41335 20.5 Republican ## 5 California 39250017 477500 61927 27.7 Democratic ## 6 Colorado 5540545 314200 61324 23.0 Democratic ## 7 Connecticut 3576452 274600 70007 24.9 Democratic ## 8 Delaware 952065 243400 59853 25.0 Democratic ## 9 District of Columbia 681170 576100 75506 29.0 Democratic ## 10 Florida 20612439 197700 47439 25.8 Republican ## # … with 42 more rows Suppose we want to create a subset of the data with only the values for median income and median property value for the state of California. To do this, we can use the functions filter and select. First we use filter to create a data frame called ca_prop_data that contains only values for the state of California. We then use select on this data frame to keep only the median income and median property value variables: ca_prop_data <- filter(us_data, state == "California") ca_inc_prop <- select(ca_prop_data, med_income, med_prop_val) ca_inc_prop ## # A tibble: 1 x 2 ## med_income med_prop_val ## <dbl> <dbl> ## 1 61927 477500 Although this is valid code, there is a more readable approach we could take by using the pipe, %>%. With the pipe, we do not need to create an intermediate object to store the output from filter. Instead we can directly send the output of filter to the input of select: ca_inc_prop <- filter(us_data, state == "California") %>% select(med_income, med_prop_val) ca_inc_prop ## # A tibble: 1 x 2 ## med_income med_prop_val ## <dbl> <dbl> ## 1 61927 477500 But wait - why does our select function call look different in these two examples? When you use the pipe, the output of the function on the left is automatically provided as the first argument for the function on the right, and thus you do not specify that argument in that function call. In the code above, the first argument of select is the data frame we are select-ing from, which is provided by the output of filter. As you can see, both of these approaches give us the same output but the second approach is more clear and readable. 3.5.2 Using %>% with more than two functions The %>% can be used with any function in R. Additionally, we can pipe together more than two functions. For example, we can pipe together three functions to order the states by commute time for states whose population is less than 1 million people: small_state_commutes <- filter(us_data, pop < 1000000) %>% select(state, avg_commute) %>% arrange(avg_commute) small_state_commutes ## # A tibble: 7 x 2 ## state avg_commute ## <chr> <dbl> ## 1 South Dakota 15.7 ## 2 Wyoming 15.9 ## 3 North Dakota 16.5 ## 4 Alaska 17.0 ## 5 Vermont 21.5 ## 6 Delaware 25.0 ## 7 District of Columbia 29.0 Note:: arrange is a function that takes the name of a data frame and one or more column(s), and returns a data frame where the rows are ordered by those columns in ascending order. Here we used only one column for sorting (avg_commute), but more than one can also be used. To do this, list additional columns separated by commas. The order they are listed in indicates the order in which they will be used for sorting. 
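For instance, a hypothetical two-column sort (not shown in the output above) would order the states first by party, and then by average commute time within each party:
# sort by party first, then by avg_commute within each party
arrange(us_data, party, avg_commute)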
This is much like how an English dictionary sorts words: first by the first letter, then by the second letter, and so on. Another Note: You might also have noticed that we split the function calls across lines after the pipe, similar to what we did earlier in the chapter for long function calls. Again, this is allowed and recommended, especially when the piped function calls would otherwise create a long line of code; doing this makes your code more readable. When you do this, it is important to end each line with the pipe operator %>% to tell R that your code is continuing onto the next line. 3.6 Iterating over data with group_by + summarize 3.6.1 Calculating summary statistics: As a part of many data analyses, we need to calculate a summary value for the data (a summary statistic). A useful dplyr function for doing this is summarize. Examples of summary statistics we might want to calculate are the number of observations, the average/mean value for a column, the minimum value for a column, etc. Below we show how to use the summarize function to calculate the minimum, maximum and mean commute time for all US states: us_commute_time_summary <- summarize(us_data, min_mean_commute = min(avg_commute), max_mean_commute = max(avg_commute), mean_mean_commute = mean(avg_commute)) us_commute_time_summary ## # A tibble: 1 x 3 ## min_mean_commute max_mean_commute mean_mean_commute ## <dbl> <dbl> <dbl> ## 1 15.7 32.0 23.3 3.6.2 Calculating group summary statistics: A common pairing with summarize is group_by. Pairing these functions together lets you summarize values for subgroups within a data set. For example, here we can use group_by to group the states based on which party they voted for in the US election, and then calculate the minimum, maximum and mean commute time for each of the groups. The group_by function takes at least two arguments. The first is the data frame that will be grouped, and the second and onwards are columns to use in the grouping. Here we use only one column for grouping (party), but more than one can also be used. To do this, list additional columns separated by commas. us_commute_time_summary_by_party <- group_by(us_data, party) %>% summarize(min_mean_commute = min(avg_commute), max_mean_commute = max(avg_commute), mean_mean_commute = mean(avg_commute)) ## `summarise()` ungrouping output (override with `.groups` argument) us_commute_time_summary_by_party ## # A tibble: 3 x 4 ## party min_mean_commute max_mean_commute mean_mean_commute ## <chr> <dbl> <dbl> <dbl> ## 1 Democratic 20.8 32.0 25.7 ## 2 Not Applicable 28.4 28.4 28.4 ## 3 Republican 15.7 26.9 21.5 3.7 Additional reading on the dplyr functions We haven't explicitly said this yet, but the tidyverse is actually a meta R package: it installs a collection of R packages that all follow the tidy data philosophy we discussed above. One of the tidyverse packages is dplyr, a data wrangling workhorse. You have already met 6 of the dplyr functions (select, filter, mutate, arrange, summarize, and group_by). To learn more about those six and meet a few more useful functions, read the post at this link. 3.8 Using purrr's map* functions to iterate Where should you turn when you discover that the next step in your data wrangling/cleaning process requires you to apply the same function to each column in a data frame? For example, what if you wanted to know the maximum value of each column?
Well, you could use summarize as discussed above, but this becomes inconvenient when you have many columns, as summarize requires you to type out a column name and a data transformation for each summary statistic that you want to calculate. In cases like this, where you want to apply the same data transformation to all columns, it is more efficient to use purrr's map function to apply it to each column. For example, let's find the maximum value of each column of the mtcars data frame (a built-in data set that comes with R) by using map with the max function. First, let's peek at the data to familiarize ourselves with it: head(mtcars) ## mpg cyl disp hp drat wt qsec vs am gear carb ## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 ## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 ## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 ## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 ## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 ## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 Next, we use map to apply the max function to each column. map takes two arguments: an object (a vector, data frame or list) that you want to apply the function to, and the function that you would like to apply. Here our arguments will be mtcars and max: max_of_columns <- map(mtcars, max) max_of_columns ## $mpg ## [1] 33.9 ## ## $cyl ## [1] 8 ## ## $disp ## [1] 472 ## ## $hp ## [1] 335 ## ## $drat ## [1] 4.93 ## ## $wt ## [1] 5.424 ## ## $qsec ## [1] 22.9 ## ## $vs ## [1] 1 ## ## $am ## [1] 1 ## ## $gear ## [1] 5 ## ## $carb ## [1] 8 Note: purrr is part of the tidyverse, and so like the dplyr and ggplot functions, once we call library(tidyverse) we do not need to separately load the purrr package. Our output looks a bit weird… we passed in a data frame, but our output doesn't look like a data frame. As it so happens, it is not a data frame, but rather a plain vanilla list: typeof(max_of_columns) ## [1] "list" So what do we do? Should we convert this to a data frame? We could, but a simpler alternative is to just use a different map_* function from the purrr package. There are quite a few to choose from; they all work similarly, and their name reflects the type of output you want from the mapping operation: map function Output map() list map_lgl() logical vector map_int() integer vector map_dbl() double vector map_chr() character vector map_df() data frame Let's get the column maximums again, but this time use the map_df function to return the output as a data frame: max_of_columns <- map_df(mtcars, max) max_of_columns ## # A tibble: 1 x 11 ## mpg cyl disp hp drat wt qsec vs am gear carb ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 33.9 8 472 335 4.93 5.42 22.9 1 1 5 8 Which map_* function you choose depends on what you want to do with the output; you don't always have to pick map_df! What if you need to add other arguments to the functions you want to map? For example, what if there were NA values in our columns whose maximum we wanted to know? Then we would also need to add the argument na.rm = TRUE to the max function so that we get a more useful value than NA returned (remember that this is what happens with many of the built-in R statistical functions when NAs are present…). What we need to do in that case is create what is called an "anonymous function" within the map_df function. We do that in the place where we previously specified our max function.
Here we will put the two calls to map_df right after each other so you can see the difference: # no additional arguments to the max function map_df(mtcars, max) versus # adding the na.rm = TRUE argument to the max function map_df(mtcars, function(df) max(df, na.rm = TRUE)) You can see that's quite a bit of extra typing… Because this is so commonly done, the creators of purrr have made a shortcut for it. In the shortcut we replace function(VARIABLE) with a ~ and replace the VARIABLE in the function call with a .; see the example below: # adding the na.rm = TRUE argument to the max function using the shortcut map_df(mtcars, ~ max(., na.rm = TRUE)) 3.8.1 A bit more about the map_* functions The map_* functions are generally quite useful for solving problems involving iteration/repetition. Additionally, their use is not limited to columns of a data frame; map_* functions can be used to apply functions to elements of a vector or list, and even to lists of data frames or nested data frames. 3.9 Additional readings/resources Grolemund & Wickham's R for Data Science has a number of useful sections that provide additional information: Data transformation Tidy data The map_* functions "],
    +["viz.html", "Chapter 4 Effective data visualization 4.1 Overview 4.2 Chapter learning objectives 4.3 Choosing the visualization 4.4 Refining the visualization 4.5 Creating visualizations with ggplot2 4.6 Explaining the visualization 4.7 Saving the visualization", " Chapter 4 Effective data visualization 4.1 Overview This chapter will introduce concepts and tools relating to data visualization beyond what we have seen and practiced so far. We will focus on guiding principles for effective data visualization and explaining visualizations independent of any particular tool or programming language. In the process, we will cover some specifics of creating visualizations (scatter plots, bar charts, line graphs, and histograms) for data using R. There are external references that contain a wealth of additional information on the topic of data visualization: Professor Claus Wilke’s Fundamentals of Data Visualization has more details on general principles of effective visualizations Grolemund & Wickham’s R for Data Science chapter on creating visualizations using ggplot2 has a deeper introduction into the syntax and grammar of plotting with ggplot2 specifically the ggplot2 reference has a useful list of useful ggplot2 functions 4.2 Chapter learning objectives Describe when to use the following kinds of visualizations: scatter plots line plots bar plots histogram plots Given a data set and a question, select from the above plot types to create a visualization that best answers the question Given a visualization and a question, evaluate the effectiveness of the visualization and suggest improvements to better answer the question Identify rules of thumb for creating effective visualizations Define the three key aspects of ggplot objects: aesthetic mappings geometric objects scales Use the ggplot2 library in R to create and refine the above visualizations using: geometric objects: geom_point, geom_line, geom_histogram, geom_bar, geom_vline, geom_hline scales: scale_x_continuous, scale_y_continuous aesthetic mappings: x, y, fill, colour, shape labelling: xlab, ylab, labs font control and legend positioning: theme flipping axes: coord_flip subplots: facet_grid Describe the difference in raster and vector output formats Use ggsave to save visualizations in .png and .svg format 4.3 Choosing the visualization 4.3.0.1 Ask a question, and answer it The purpose of a visualization is to answer a question about a data set of interest. So naturally, the first thing to do before creating a visualization is to formulate the question about the data that you are trying to answer. A good visualization will answer your question in a clear way without distraction; a great visualization will suggest even what the question was itself without additional explanation. Imagine your visualization as part of a poster presentation for your project; even if you aren’t standing at the poster explaining things, an effective visualization will be able to convey your message to the audience. Recall the different types of data analysis question from the very first chapter of this book. With the visualizations we will cover in this chapter, we will be able to answer only descriptive and exploratory questions. Be careful not to try to answer any predictive, inferential, causal or mechanistic questions, as we have not learned the tools necessary to do that properly just yet. 
As with most coding tasks, it is totally fine (and quite common) to make mistakes and iterate a few times before you find the right visualization for your data and question. There are many different kinds of plots available to use. For the kinds we will introduce in this course, the general rules of thumb are: line plots visualize trends with respect to an independent, ordered quantity (e.g. time) histograms visualize the distribution of one quantitative variable (i.e., all its possible values and how often they occur) scatter plots visualize the distribution / relationship of two quantitative variables bar plots visualize comparisons of amounts All types of visualization have their (mis)uses, but there are three kinds that are usually hard to understand or are easily replaced with an often better alternative. In particular, you should avoid pie charts; it is usually better to use bars, as it is easier to compare bar heights than pie slice sizes. You should also not use 3-D visualizations, as they are typically hard to understand when converted to a static 2-D image format. Finally, do not use tables to make numerical comparisons; humans are much better at quickly processing visual information than text and math. Bar plots are again typically a better alternative. 4.4 Refining the visualization 4.4.0.1 Convey the message, minimize noise Just being able to make a visualization in R with ggplot2 (or any other tool for that matter) doesn't mean that it is effective at communicating your message to others. Once you have selected a broad type of visualization to use, you will have to refine it to suit your particular need. Some rules of thumb for doing this are listed below. They generally fall into two classes: you want to make your visualization convey your message, and you want to reduce visual noise as much as possible. Humans have limited cognitive ability to process information; both of these types of refinement aim to reduce the mental load on your audience when viewing your visualization, making it easier for them to quickly understand and remember your message. Convey the message Make sure the visualization answers the question you have asked in the simplest and plainest way possible. Use legends and labels so that your visualization is understandable without reading the surrounding text. Ensure the text, symbols, lines, etc. on your visualization are big enough to be easily read. Make sure the data are clearly visible; don't hide the shape/distribution of the data behind other objects (e.g. a bar). Make sure to use colour schemes that are understandable by those with colourblindness (a surprisingly large fraction of the overall population). For example, colorbrewer.org and the RColorBrewer R library provide the ability to pick such colour schemes (one way to do this in ggplot2 is sketched just after this list), and you can check your visualizations after you have created them by uploading to online tools such as the colour blindness simulator. Redundancy can be helpful; sometimes conveying the same message in multiple ways reinforces it for the audience. Minimize noise Use colours sparingly. Too many different colours can be distracting, create false patterns, and detract from the message. Be wary of overplotting. If your plot has too many dots or lines and it starts to look like a mess, then you need to do something different. Only make the plot area (where the dots, lines, bars are) as big as needed. Simple plots can be made small. Don't adjust the axes to zoom in on small differences. If the difference is small, show that it's small!
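As a minimal sketch of the colour scheme advice above (the data frame plot_df and its columns group and count are hypothetical, purely for illustration), RColorBrewer can list its colour-blind friendly palettes, and ggplot2 can apply one of them with scale_fill_brewer:
library(RColorBrewer)
# show only the palettes designed to be colour-blind friendly
display.brewer.all(colorblindFriendly = TRUE)
# apply one of them ("Dark2") to the fill colours of a bar plot
ggplot(plot_df, aes(x = group, y = count, fill = group)) +
  geom_bar(stat = "identity") +
  scale_fill_brewer(palette = "Dark2")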
4.5 Creating visualizations with ggplot2 4.5.0.1 Build the visualization iteratively This section will cover examples of how to choose and refine a visualization given a data set and a question that you want to answer, and then how to create the visualization in R using ggplot2. To use the ggplot2 library, we need to load the tidyverse metapackage. library(tidyverse) 4.5.1 The Mauna Loa CO2 data set The Mauna Loa CO2 data set, curated by Dr. Pieter Tans, NOAA/GML and Dr. Ralph Keeling, Scripps Institution of Oceanography records the atmospheric concentration of carbon dioxide (CO2, in parts per million) at the Mauna Loa research station in Hawaii from 1959 onwards. Question: Does the concentration of atmospheric CO2 change over time, and are there any interesting patterns to note? # mauna loa carbon dioxide data co2_df <- read_csv("data/mauna_loa.csv") %>% filter(ppm > 0, date_decimal < 2000) head(co2_df) ## # A tibble: 6 x 4 ## year month date_decimal ppm ## <dbl> <dbl> <dbl> <dbl> ## 1 1958 3 1958. 316. ## 2 1958 4 1958. 317. ## 3 1958 5 1958. 318. ## 4 1958 7 1959. 316. ## 5 1958 8 1959. 315. ## 6 1958 9 1959. 313. Since we are investigating a relationship between two variables (CO2 concentration and date), a scatter plot is a good place to start. Scatter plots show the data as individual points with x (horizonal axis) and y (vertical axis) coordinates. Here, we will use the decimal date as the x coordinate and CO2 concentration as the y coordinate. When using the ggplot2 library, we create the plot object with the ggplot function; there are a few basic aspects of a plot that we need to specify: the data: the name of the dataframe object that we would like to visualize here, we specify the co2_df dataframe the aesthetic mapping: tells ggplot how the columns in the dataframe map to properties of the visualization to create an aesthetic mapping, we use the aes function here, we set the plot x axis to the date_decimal variable, and the plot y axis to the ppm variable the geometric object: specifies how the mapped data should be displayed to create a geometric object, we use a geom_* function (see the ggplot reference for a list of geometric objects) here, we use the geom_point function to visualize our data as a scatterplot There are many other possible arguments we could pass to the aesthetic mapping and geometric object to change how the plot looks. For the purposes of quickly testing things out to see what they look like, though, we can just go with the default settings: co2_scatter <- ggplot(co2_df, aes(x = date_decimal, y = ppm)) + geom_point() co2_scatter Certainly the visualization shows a clear upward trend in the atmospheric concentration of CO2 over time. This plot answers the first part of our question in the affirmative, but that appears to be the only conclusion one can make from the scatter visualization. However, since time is an ordered quantity, we can try using a line plot instead using the geom_line function. Line plots require that the data are ordered by their x coordinate, and connect the sequence of x and y coordinates with line segments. Let’s again try this with just the default arguments: co2_line <- ggplot(co2_df, aes(x = date_decimal, y = ppm)) + geom_line() co2_line Aha! There is another interesting phenomenon in the data: in addition to increasing over time, the concentration seems to oscillate as well. 
Given the visualization as it is now, it is still hard to tell how fast the oscillation is, but nevertheless, the line seems to be a better choice for answering the question than the scatter plot was. The comparison between these two visualizations illustrates a common issue with scatter plots: often the points are shown too close together or even on top of one another, muddling information that would otherwise be clear (overplotting). Now that we have settled on the rough details of the visualization, it is time to refine things. This plot is fairly straightforward, and there is not much visual noise to remove. But there are a few things we must do to improve clarity, such as adding informative axis labels and making the font a more readable size. In order to add axis labels we use the xlab and ylab functions. To change the font size we use the theme function with the text argument: co2_line <- ggplot(co2_df, aes(x = date_decimal, y = ppm)) + geom_line() + xlab('Year') + ylab('Atmospheric CO2 (ppm)') + theme(text = element_text(size = 18)) co2_line Finally, let’s see if we can better understand the oscillation by changing the visualization a little bit. Note that it is totally fine to use a small number of visualizations to answer different aspects of the question you are trying to answer. We will accomplish this by using scales, another important feature of ggplot2 that allow us to easily transform the different variables and set limits. We scale the horizontal axis by using the scale_x_continuous function, and the vertical axis with the scale_y_continuous function. We can transform the axis by passing the trans argument, and set limits by passing the limits argument. In particular, here we will use the scale_x_continuous function with the limits argument to zoom in on just five years of data (say, 1990-1995): co2_line <- ggplot(co2_df, aes(x = date_decimal, y = ppm)) + geom_line() + xlab('Year') + ylab('Atmospheric CO2 (ppm)') + scale_x_continuous(limits = c(1990, 1995)) + theme(text = element_text(size = 18)) co2_line Interesting! It seems that each year, the atmospheric CO2 increases until it reaches its peak somewhere around April, decreases until around late September, and finally increases again until the end of the year. In Hawaii, there are two seasons: summer from May through October, and winter from November through April. Therefore, the oscillating pattern in CO2 matches up fairly closely with the two seasons. 4.5.2 The island landmass data set This data set contains a list of Earth’s land masses as well as their area (in thousands of square miles). Question: Are the continents (North / South America, Africa, Europe, Asia, Australia, Antarctica) Earth’s 7 largest landmasses? If so, what are the next few largest landmasses after those? # islands data islands_df <- read_csv("data/islands.csv") head(islands_df) ## # A tibble: 6 x 2 ## landmass size ## <chr> <dbl> ## 1 Africa 11506 ## 2 Antarctica 5500 ## 3 Asia 16988 ## 4 Australia 2968 ## 5 Axel Heiberg 16 ## 6 Baffin 184 Here, we have a list of Earth’s landmasses, and are trying to compare their sizes. The right type of visualization to answer this question is a bar plot, specified by the geom_bar function in ggplot2. However, by default, geom_bar sets the heights of bars to the number of times a value appears in a dataframe (its count); here we want to plot exactly the values in the dataframe, i.e., the landmass sizes. 
So we have to pass the stat = \"identity\" argument to geom_bar: islands_bar <- ggplot(islands_df, aes(x = landmass, y = size)) + geom_bar(stat = "identity") islands_bar Alright, not bad! This is definitely the right kind of visualization, as we can clearly see and compare sizes of landmasses. The major issues are that the sizes of the smaller landmasses are hard to distinguish, and that the names of the landmasses are obscuring each other as they have been squished into too little space. But remember that the question we asked was only about the largest landmasses; let’s make the plot a little bit clearer by keeping only the largest 12 landmasses. We do this using the top_n function. Then to help us make sure the labels have enough space, we’ll use horizontal bars instead of vertical ones. We do this using the coord_flip function, which swaps the x and y coordinate axes: islands_top12 <- top_n(islands_df, 12, size) islands_bar <- ggplot(islands_top12, aes(x = landmass, y = size)) + geom_bar(stat = "identity") + coord_flip() islands_bar This is definitely clearer now, and allows us to answer our question (“are the top 7 largest landmasses continents?”) in the affirmative. But the question could be made clearer from the plot by organizing the bars not by alphabetical order but by size, and to colour them based on whether or not they are a continent. In order to do this, we use mutate to add a column to the data regarding whether or not the landmass is a continent: islands_top12 <- top_n(islands_df, 12, size) continents <- c('Africa', 'Antarctica', 'Asia', 'Australia', 'Europe', 'North America', 'South America') islands_ct <- mutate(islands_top12, is_continent = ifelse(landmass %in% continents, 'Continent', 'Other')) head(islands_ct) ## # A tibble: 6 x 3 ## landmass size is_continent ## <chr> <dbl> <chr> ## 1 Africa 11506 Continent ## 2 Antarctica 5500 Continent ## 3 Asia 16988 Continent ## 4 Australia 2968 Continent ## 5 Baffin 184 Other ## 6 Borneo 280 Other In order to colour the bars, we add the fill argument to the aesthetic mapping. Then we use the reorder function in the aesthetic mapping to organize the landmasses by their size variable. Finally, we use the labs and theme functions to add labels, change the font size, and position the legend: islands_bar <- ggplot(islands_ct, aes(x = reorder(landmass, size), y = size, fill = is_continent)) + geom_bar(stat="identity") + labs(x = 'Landmass', y = 'Size (1000 square mi)', fill = 'Type') + coord_flip() + theme(text = element_text(size = 18), legend.position = c(0.75, 0.45)) islands_bar This is now a very effective visualization for answering our original questions. Landmasses are organized by their size, and continents are coloured differently than other landmasses, making it quite clear that continents are the largest 7 landmasses. 4.5.3 The Old Faithful eruption / waiting time data set This data set contains measurements of the waiting time between eruptions and the subsequent eruption duration (in minutes). Question: Is there a relationship between the waiting time before an eruption to the duration of the eruption? # old faithful eruption time / wait time data head(faithful) ## eruptions waiting ## 1 3.600 79 ## 2 1.800 54 ## 3 3.333 74 ## 4 2.283 62 ## 5 4.533 85 ## 6 2.883 55 Here again we are investigating the relationship between two quantitative variables (waiting time and eruption time). But if you look at the output of the head function, you’ll notice that neither of the columns are ordered. 
So in this case, let’s start again with a scatter plot: faithful_scatter <- ggplot(faithful, aes(x = waiting, y = eruptions)) + geom_point() faithful_scatter We can see that the data tend to fall into two groups: one with a short waiting and eruption times, and one with long waiting and eruption times. Note that in this case, there is no overplotting: the points are generally nicely visually separated, and the pattern they form is clear. In order to refine the visualization, we need only to add axis labels and make the font more readable: faithful_scatter <- ggplot(faithful, aes(x = waiting, y = eruptions)) + geom_point() + labs(x = 'Waiting Time (mins)', y = 'Eruption Duration (mins)') + theme(text = element_text(size = 18)) faithful_scatter 4.5.4 The Michelson speed of light data set This data set contains measurements of the speed of light (in kilometres per second with 299,000 subtracted) from the year 1879 for 5 experiments, each with 20 consecutive runs. Question: Given what we know now about the speed of light (299,792.458 kilometres per second), how accurate were each of the experiments? # michelson morley experimental data head(morley) ## Expt Run Speed ## 001 1 1 850 ## 002 1 2 740 ## 003 1 3 900 ## 004 1 4 1070 ## 005 1 5 930 ## 006 1 6 850 In this experimental data, Michelson was trying to measure just a single quantitative number (the speed of light). The data set contains many measurements of this single quantity. To tell how accurate the experiments were, we need to visualize the distribution of the measurements (i.e., all their possible values and how often each occurs). We can do this using a histogram. A histogram helps us visualize how a particular variable is distributed in a data set by separating the data into bins, and then using vertical bars to show how many data points fell in each bin. To create a histogram in ggplot2 we will use the geom_histogram geometric object, setting the x axis to the Speed measurement variable; and as we did before, let’s use the default arguments just to see how things look: morley_hist <- ggplot(morley, aes(x = Speed)) + geom_histogram() morley_hist This is a great start. However, we cannot tell how accurate the measurements are using this visualization unless we can see what the true value is. In order to visualize the true speed of light, we will add a vertical line with the geom_vline function, setting the xintercept argument to the true value. There is a similar function, geom_hline, that is used for plotting horizontal lines. Note that vertical lines are used to denote quantities on the horizontal axis, while horizontal lines are used to denote quantities on the vertical axis. morley_hist <- ggplot(morley, aes(x = Speed)) + geom_histogram() + geom_vline(xintercept = 792.458, linetype = "dashed", size = 1.0) morley_hist We also still cannot tell which experiments (denoted in the Expt column) led to which measurements; perhaps some experiments were more accurate than others. To fully answer our question, we need to separate the measurements from each other visually. We can try to do this using a coloured histogram, where counts from different experiments are stacked on top of each other in different colours. We create a histogram coloured by the Expt variable by adding it to the fill aesthetic mapping. 
We make sure the different colours can be seen (despite them all sitting on top of each other) by setting the alpha argument in geom_histogram to 0.5 to make the bars slightly translucent: morley_hist <- ggplot(morley, aes(x = Speed, fill = factor(Expt))) + geom_histogram(position = "identity", alpha = 0.5) + geom_vline(xintercept = 792.458, linetype = "dashed", size = 1.0) morley_hist Unfortunately, the attempt to separate out the experiment number visually has created a bit of a mess. All of the colours are blending together, and although it is possible to derive some insight from this (e.g., experiments 1 and 3 had some of the most incorrect measurements), it isn’t the clearest way to convey our message and answer the question. Let’s try a different strategy of creating multiple separate histograms on top of one another. In order to create a plot in ggplot2 that has multiple subplots arranged in a grid, we use the facet_grid function. The argument to facet_grid specifies the variable(s) used to split the plot into subplots. It has the syntax vertical_variable ~ horizontal_variable, where veritcal_variable is used to split the plot vertically, horizontal_variable is used to split horizontally, and . is used if there should be no split along that axis. In our case we only want to split vertically along the Expt variable, so we use Expt ~ . as the argument to facet_grid. morley_hist <- ggplot(morley, aes(x = Speed, fill = factor(Expt))) + geom_histogram(position = "identity") + facet_grid(Expt ~ .) + geom_vline(xintercept = 792.458, linetype = "dashed", size = 1.0) morley_hist The visualization now makes it quite clear how accurate the different experiments were with respect to one another. There are two finishing touches to make this visualization even clearer. First and foremost, we need to add informative axis labels using the labs function, and increase the font size to make it readable using the theme function. Second, and perhaps more subtly, even though it is easy to compare the experiments on this plot to one another, it is hard to get a sense for just how accurate all the experiments were overall. For example, how accurate is the value 800 on the plot, relative to the true speed of light? To answer this question we’ll use the mutate function to transform our data into a relative measure of accuracy rather than absolute measurements: morley_rel <- mutate(morley, relative_accuracy = 100*( (299000 + Speed) - 299792.458 ) / (299792.458)) morley_hist <- ggplot(morley_rel, aes(x = relative_accuracy, fill = factor(Expt))) + geom_histogram(position = "identity") + facet_grid(Expt ~ .) + geom_vline(xintercept = 0, linetype = "dashed", size = 1.0) + labs(x = 'Relative Accuracy (%)', y = '# Measurements', fill = 'Experiment ID') + theme(text = element_text(size = 18)) morley_hist Wow, impressive! These measurements of the speed of light from 1879 had errors around 0.05% of the true speed. This shows you that even though experiments 2 and 5 were perhaps the most accurate, all of the experiments did quite an admirable job given the technology available at the time period. 4.6 Explaining the visualization 4.6.0.1 Tell a story Typically, your visualization will not be shown completely on its own, but rather it will be part of a larger presentation. Further, visualizations can provide supporting information for any part of a presentation, from opening to conclusion. 
For example, you could use an exploratory visualization in the opening of the presentation to motivate your choice of a more detailed data analysis / model, a visualization of the results of your analysis to show what your analysis has uncovered, or even one at the end of a presentation to help suggest directions for future work. Regardless of where it appears, a good way to discuss your visualization is as a story: Establish the setting and scope, and motivate why you did what you did. Pose the question that your visualization answers. Justify why the question is important to answer. Answer the question using your visualization. Make sure you describe all aspects of the visualization (including describing the axes). But you can emphasize different aspects based on what is important to answering your question: trends (lines): Does a line describe the trend well? If so, the trend is linear, and if not, the trend is nonlinear. Is the trend increasing, decreasing, or neither? Is there a periodic oscillation (wiggle) in the trend? Is the trend noisy (does the line “jump around” a lot) or smooth? distributions (scatters, histograms): How spread out are the data? Where are they centered, roughly? Are there any obvious “clusters” or “subgroups”, which would be visible as multiple bumps in the histogram? distributions of two variables (scatters): is there a clear / strong relationship between the variables (points fall in a distinct pattern), a weak one (points fall in a pattern but there is some noise), or no discernible relationship (the data are too noisy to make any conclusion)? amounts (bars): How large are the bars relative to one another? Are there patterns in different groups of bars? Summarize your findings, and use them to motivate whatever you will discuss next. Below are two examples of how might one take these four steps in describing the example visualizations that appeared earlier in this chapter. Each of the steps is denoted by its numeral in parentheses, e.g. (3). Mauna Loa Atmospheric CO2 Measurements: (1) Many current forms of energy generation and conversion—from automotive engines to natural gas power plants—rely on burning fossil fuels and produce greenhouse gases, typically primarily carbon dioxide (CO2), as a byproduct. Too much of these gases in the Earth’s atmosphere will cause it to trap more heat from the sun, leading to global warming. (2) In order to assess how quickly the atmospheric concentration of CO2 is increasing over time, we (3) used a data set from the Mauna Loa observatory from Hawaii, consisting of CO2 measurements from 1959 to the present. We plotted the measured concentration of CO2 (on the vertical axis) over time (on the horizontal axis). From this plot you can see a clear, increasing, and generally linear trend over time. There is also a periodic oscillation that occurs once per year and aligns with Hawaii’s seasons, with an amplitude that is small relative to the growth in the overall trend. This shows that atmospheric CO2 is clearly increasing over time, and (4) it is perhaps worth investigating more into the causes. Michelson Light Speed Experiments: (1) Our modern understanding of the physics of light has advanced significantly from the late 1800s when experiments of Michelson and Morley first demonstrated that it had a finite speed. We now know based on modern experiments that it moves at roughly 299792.458 kilometres per second. 
(2) But how accurately were we first able to measure this fundamental physical constant, and did certain experiments produce more accurate results than others? (3) To better understand this we plotted data from 5 experiments by Michelson in 1879, each with 20 trials, as histograms stacked on top of one another. The horizontal axis shows the accuracy of the measurements relative to the true speed of light as we know it today, expressed as a percentage. From this visualization you can see that most results had relative errors of at most 0.05%. You can also see that experiments 1 and 3 had measurements that were the farthest from the true value, and experiment 5 tended to provide the most consistently accurate result. (4) It would be worth further investigation into the differences between these experiments to see why they produced different results. 4.7 Saving the visualization 4.7.0.1 Choose the right output format for your needs Just as there are many ways to store data sets, there are many ways to store visualizations and images. Which one you choose can depend on a number of factors, such as file size/type limitations (e.g., if you are submitting your visualization as part of a conference paper or to a poster printing shop) and where it will be displayed (e.g., online, in a paper, on a poster, on a billboard, in talk slides). Generally speaking, images come in two flavours: bitmap (or raster) formats and vector (or scalable graphics) formats. Bitmap / Raster images are represented as a 2-D grid of square pixels, each with their own colour. Raster images are often compressed before storing so they take up less space. A compressed format is lossy if the image cannot be perfectly recreated when loading and displaying, with the hope that the change is not noticeable. Lossless formats, on the other hand, allow a perfect display of the original image. Common file types: JPEG (.jpg, .jpeg): lossy, usually used for photographs PNG (.png): lossless, usually used for plots / line drawings BMP (.bmp): lossless, raw image data, no compression (rarely used) TIFF (.tif, .tiff): typically lossless, no compression, used mostly in graphic arts, publishing Open-source software: GIMP Vector / Scalable Graphics images are represented as a collection of mathematical objects (lines, surfaces, shapes, curves). When the computer displays the image, it redraws all of the elements using their mathematical formulas. Common file types: SVG (.svg): general-purpose use EPS (.eps), general-purpose use (rarely used) Open-source software: Inkscape Raster and vector images have opposing advantages and disadvantages. A raster image of a fixed width / height takes the same amount of space and time to load regardless of what the image shows (caveat: the compression algorithms may shrink the image more or run faster for certain images). A vector image takes space and time to load corresponding to how complex the image is, since the computer has to draw all the elements each time it is displayed. For example, if you have a scatter plot with 1 million points stored as an SVG file, it may take your computer some time to open the image. On the other hand, you can zoom into / scale up vector graphics as much as you like without the image looking bad, while raster images eventually start to look “pixellated.” PDF files: The portable document format PDF (.pdf) is commonly used to store both raster and vector graphics formats. 
If you try to open a PDF and it’s taking a long time to load, it may be because there is a complicated vector graphics image that your computer is rendering. Let’s investigate how different image file formats behave with a scatter plot of the Old Faithful data set, which happens to be available in base R under the name faithful: library(svglite) #we need this to save SVG files faithful_plot <- ggplot(data = faithful, aes(x = waiting, y = eruptions))+ geom_point() faithful_plot ggsave('faithful_plot.png', faithful_plot) ggsave('faithful_plot.jpg', faithful_plot) ggsave('faithful_plot.bmp', faithful_plot) ggsave('faithful_plot.tiff', faithful_plot) ggsave('faithful_plot.svg', faithful_plot) print(paste("PNG filesize: ", file.info('faithful_plot.png')['size'] / 1000000, "MB")) ## [1] "PNG filesize: 0.07861 MB" print(paste("JPG filesize: ", file.info('faithful_plot.jpg')['size'] / 1000000, "MB")) ## [1] "JPG filesize: 0.139187 MB" print(paste("BMP filesize: ", file.info('faithful_plot.bmp')['size'] / 1000000, "MB")) ## [1] "BMP filesize: 3.148978 MB" print(paste("TIFF filesize: ", file.info('faithful_plot.tiff')['size'] / 1000000, "MB")) ## [1] "TIFF filesize: 9.443892 MB" print(paste("SVG filesize: ", file.info('faithful_plot.svg')['size'] / 1000000, "MB")) ## [1] "SVG filesize: 0.046145 MB" Wow, that’s quite a difference! Notice that for such a simple plot with few graphical elements (points), the vector graphics format (SVG) is over 100 times smaller than the uncompressed raster images (BMP, TIFF). Also note that the JPG format is twice as large as the PNG format, since the JPG compression algorithm is designed for natural images (not plots). Below, we also show what the images look like when we zoom in to a rectangle with only 3 data points. You can see why vector graphics formats are so useful: because they’re just based on mathematical formulas, vector graphics can be scaled up to arbitrary sizes. This makes them great for presentation media of all sizes, from papers to posters to billboards. Zoomed in faithful, raster (PNG, left) and vector (SVG, right) formats. "],
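One practical note not covered above (our own addition): ggsave also accepts width, height, and dpi arguments, which are often the easiest way to control both the physical dimensions and the file size of a saved raster image. A minimal sketch, with arbitrary numbers:
# save the plot at a fixed physical size (in inches) and resolution
ggsave('faithful_plot.png', faithful_plot, width = 8, height = 5, dpi = 300)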
    +["GitHub.html", "Chapter 5 Version control with GitHub 5.1 Overview 5.2 Videos to learn about version control with GitHub and Git 5.3 Git command cheatsheet 5.4 Terminal cheatsheet", " Chapter 5 Version control with GitHub image source: https://en.wikipedia.org/wiki/File:Page_Under_Construction.png 5.1 Overview We will be using version control with GitHub and Git to share our code on group projects. Here is a list of videos you might want to watch to familiarize yourself further with these tools, as well as a cheatsheet of Git and terminal commands. 5.2 Videos to learn about version control with GitHub and Git 5.2.1 Creating a GitHub repository 5.2.2 Exploring a GitHub repository 5.2.3 Directly editing files on GitHub 5.2.4 Logging changes and pushing them to GitHub 5.3 Git command cheatsheet Because we are writing code on a server, we need to use Git in the terminal to get files from GitHub, and to send back changes to the files that we make on the server. Below is a cheat sheet of the commands you will need and what they are for: 5.3.1 Getting a repository from GitHub onto the server for the first time This is done only once for a repository when you want to copy it to a new computer. git clone https://github.com/USERNAME/GITHUB_REPOSITORY_NAME.git 5.3.2 Logging changes After editing and saving your files (e.g., a Jupyter notebook): git add FILENAME git commit -m "some message about the changes you made" 5.3.3 Sending your changes back to GitHub After logging your changes (as shown above): git push 5.3.4 Getting changes To get the changes your collaborator just sent to GitHub onto your server: git pull 5.4 Terminal cheatsheet We need to run the above Git commands from inside the repository/folder that we cloned from GitHub. To navigate there in the terminal, you will need to use the following commands: 5.4.1 See where you are: pwd 5.4.2 See what is inside the directory where you are: ls 5.4.3 Move to a different directory cd DIRECTORY_PATH "],
    +["classification.html", "Chapter 6 Classification I: training & predicting 6.1 Overview 6.2 Chapter learning objectives 6.3 The classification problem 6.4 Exploring a labelled data set 6.5 Classification with K-nearest neighbours 6.6 K-nearest neighbours with tidymodels 6.7 Data preprocessing with tidymodels 6.8 Putting it together in a workflow", " Chapter 6 Classification I: training & predicting 6.1 Overview Up until this point, we have focused solely on descriptive and exploratory questions about data. This chapter and the next together serve as our first foray into answering predictive questions about data. In particular, we will focus on the problem of classification, i.e., using one or more quantitative variables to predict the value of a third, categorical variable. This chapter will cover the basics of classification, how to preprocess data to make it suitable for use in a classifier, and how to use our observed data to make predictions. The next will focus on how to evaluate how accurate the predictions from our classifier are, as well as how to improve our classifier (where possible) to maximize its accuracy. 6.2 Chapter learning objectives Recognize situations where a classifier would be appropriate for making predictions Describe what a training data set is and how it is used in classification Interpret the output of a classifier Compute, by hand, the straight-line (Euclidean) distance between points on a graph when there are two explanatory variables/predictors Explain the K-nearest neighbour classification algorithm Perform K-nearest neighbour classification in R using tidymodels Explain why one should center, scale, and balance data in predictive modelling Preprocess data to center, scale, and balance a dataset using a recipe Combine preprocessing and model training using a Tidymodels workflow 6.3 The classification problem In many situations, we want to make predictions based on the current situation as well as past experiences. For instance, a doctor may want to diagnose a patient as either diseased or healthy based on their symptoms and the doctor’s past experience with patients; an email provider might want to tag a given email as “spam” or “not spam” depending on past email text data; or an online store may want to predict whether an order is fraudulent or not. These tasks are all examples of classification, i.e., predicting a categorical class (sometimes called a label) for an observation given its other quantitative variables (sometimes called features). Generally, a classifier assigns an observation (e.g. a new patient) to a class (e.g. diseased or healthy) on the basis of how similar it is to other observations for which we know the class (e.g. previous patients with known diseases and symptoms). These observations with known classes that we use as a basis for prediction are called a training set. We call them a “training set” because we use these observations to train, or teach, our classifier so that we can use it to make predictions on new data that we have not seen previously. There are many possible classification algorithms that we could use to predict a categorical class/label for an observation. In addition, there are many variations on the basic classification problem, e.g., binary classification where only two classes are involved (e.g. disease or healthy patient), or multiclass classification, which involves assigning an object to one of several classes (e.g., private, public, or not for-profit organization). 
Here we will focus on the simple, widely used K-nearest neighbours algorithm for the binary classification problem. Other examples you may encounter in future courses include decision trees, support vector machines (SVMs), logistic regression, and neural networks. 6.4 Exploring a labelled data set In this chapter and the next, we will study a data set of digitized breast cancer image features, created by Dr. William H. Wolberg, W. Nick Street, and Olvi L. Mangasarian at the University of Wisconsin, Madison. Each row in the data set represents an image of a tumour sample, including the diagnosis (benign or malignant) and several other measurements (e.g., nucleus texture, perimeter, area, etc.). Diagnosis for each image was conducted by physicians. As with all data analyses, we first need to formulate a precise question that we want to answer. Here, the question is predictive: can we use the tumour image measurements available to us to predict whether a future tumour image (with unknown diagnosis) shows a benign or malignant tumour? Answering this question is important because traditional, non-data-driven methods for tumour diagnosis are quite subjective and dependent upon how skilled and experienced the diagnosing physician is. Furthermore, benign tumours are not normally dangerous; the cells stay in the same place and the tumour stops growing before it gets very large. By contrast, in malignant tumours, the cells invade the surrounding tissue and spread into nearby organs where they can cause serious damage (learn more about cancer here). Thus, it is important to quickly and accurately diagnose the tumour type to guide patient treatment. Loading the data Our first step is to load, wrangle, and explore the data using visualizations in order to better understand the data we are working with. We start by loading the necessary libraries for our analysis. Below you’ll see (in addition to the usual tidyverse) a new library: forcats. The forcats library enables us to easily manipulate factors in R; factors are a special categorical type of variable in R that are often used for class label data. library(tidyverse) library(forcats) In this case, the file containing the breast cancer data set is a simple .csv file with headers. We’ll use the read_csv function with no additional arguments, and then the head function to inspect its contents: cancer <- read_csv("data/wdbc.csv") head(cancer) ## # A tibble: 6 x 12 ## ID Class Radius Texture Perimeter Area Smoothness Compactness Concavity Concave_Points Symmetry ## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 8.42e5 M 1.10 -2.07 1.27 0.984 1.57 3.28 2.65 2.53 2.22 ## 2 8.43e5 M 1.83 -0.353 1.68 1.91 -0.826 -0.487 -0.0238 0.548 0.00139 ## 3 8.43e7 M 1.58 0.456 1.57 1.56 0.941 1.05 1.36 2.04 0.939 ## 4 8.43e7 M -0.768 0.254 -0.592 -0.764 3.28 3.40 1.91 1.45 2.86 ## 5 8.44e7 M 1.75 -1.15 1.78 1.82 0.280 0.539 1.37 1.43 -0.00955 ## 6 8.44e5 M -0.476 -0.835 -0.387 -0.505 2.24 1.24 0.866 0.824 1.00 ## # … with 1 more variable: Fractal_Dimension <dbl> Variable descriptions Breast tumours can be diagnosed by performing a biopsy, a process where tissue is removed from the body and examined for the presence of disease. Traditionally these procedures were quite invasive; modern methods such as fine needle asipiration, used to collect the present data set, extract only a small amount of tissue and are less invasive. 
Based on a digital image of each breast tissue sample collected for this data set, 10 different variables were measured for each cell nucleus in the image (3-12 below), and then the mean for each variable across the nuclei was recorded. As part of the data preparation, these values have been scaled; we will discuss what this means and why we do it later in this chapter. Each image additionally was given a unique ID and a diagnosis for malignance by a physician. Therefore, the total set of variables per image in this data set are: ID number Class: the diagnosis of Malignant or Benign Radius: the mean of distances from center to points on the perimeter Texture: the standard deviation of gray-scale values Perimeter: the length of the surrounding contour Area: the area inside the contour Smoothness: the local variation in radius lengths Compactness: the ratio of squared perimeter and area Concavity: severity of concave portions of the contour Concave Points: the number of concave portions of the contour Symmetry Fractal Dimension A magnified image of a malignant breast fine needle aspiration image. White lines denote the boundary of the cell nuclei. Source Below we use glimpse to preview the data frame. This function is similar to head, but can be easier to read when we have a lot of columns: glimpse(cancer) ## Rows: 569 ## Columns: 12 ## $ ID <dbl> 842302, 842517, 84300903, 84348301, 84358402, 843786, 844359, 84458202, 844981, 84501… ## $ Class <chr> "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", … ## $ Radius <dbl> 1.09609953, 1.82821197, 1.57849920, -0.76823332, 1.74875791, -0.47595587, 1.16987830,… ## $ Texture <dbl> -2.0715123, -0.3533215, 0.4557859, 0.2535091, -1.1508038, -0.8346009, 0.1605082, 0.35… ## $ Perimeter <dbl> 1.26881726, 1.68447255, 1.56512598, -0.59216612, 1.77501133, -0.38680772, 1.13712450,… ## $ Area <dbl> 0.98350952, 1.90703027, 1.55751319, -0.76379174, 1.82462380, -0.50520593, 1.09433201,… ## $ Smoothness <dbl> 1.56708746, -0.82623545, 0.94138212, 3.28066684, 0.28012535, 2.23545452, -0.12302797,… ## $ Compactness <dbl> 3.28062806, -0.48664348, 1.05199990, 3.39991742, 0.53886631, 1.24324156, 0.08821762, … ## $ Concavity <dbl> 2.65054179, -0.02382489, 1.36227979, 1.91421287, 1.36980615, 0.86554001, 0.29980860, … ## $ Concave_Points <dbl> 2.53024886, 0.54766227, 2.03543978, 1.45043113, 1.42723695, 0.82393067, 0.64636637, 0… ## $ Symmetry <dbl> 2.215565542, 0.001391139, 0.938858720, 2.864862154, -0.009552062, 1.004517928, -0.064… ## $ Fractal_Dimension <dbl> 2.25376381, -0.86788881, -0.39765801, 4.90660199, -0.56195552, 1.88834350, -0.7616619… We can see from the summary of the data above that Class is of type character (denoted by <chr>). Since we are going to be working with Class as a categorical statistical variable, we will convert it to factor using the function as_factor. 
cancer <- cancer %>% mutate(Class = as_factor(Class)) glimpse(cancer) ## Rows: 569 ## Columns: 12 ## $ ID <dbl> 842302, 842517, 84300903, 84348301, 84358402, 843786, 844359, 84458202, 844981, 84501… ## $ Class <fct> M, M, M, M, M, M, M, M, M, M, M, M, M, M, M, M, M, M, M, B, B, B, M, M, M, M, M, M, M… ## $ Radius <dbl> 1.09609953, 1.82821197, 1.57849920, -0.76823332, 1.74875791, -0.47595587, 1.16987830,… ## $ Texture <dbl> -2.0715123, -0.3533215, 0.4557859, 0.2535091, -1.1508038, -0.8346009, 0.1605082, 0.35… ## $ Perimeter <dbl> 1.26881726, 1.68447255, 1.56512598, -0.59216612, 1.77501133, -0.38680772, 1.13712450,… ## $ Area <dbl> 0.98350952, 1.90703027, 1.55751319, -0.76379174, 1.82462380, -0.50520593, 1.09433201,… ## $ Smoothness <dbl> 1.56708746, -0.82623545, 0.94138212, 3.28066684, 0.28012535, 2.23545452, -0.12302797,… ## $ Compactness <dbl> 3.28062806, -0.48664348, 1.05199990, 3.39991742, 0.53886631, 1.24324156, 0.08821762, … ## $ Concavity <dbl> 2.65054179, -0.02382489, 1.36227979, 1.91421287, 1.36980615, 0.86554001, 0.29980860, … ## $ Concave_Points <dbl> 2.53024886, 0.54766227, 2.03543978, 1.45043113, 1.42723695, 0.82393067, 0.64636637, 0… ## $ Symmetry <dbl> 2.215565542, 0.001391139, 0.938858720, 2.864862154, -0.009552062, 1.004517928, -0.064… ## $ Fractal_Dimension <dbl> 2.25376381, -0.86788881, -0.39765801, 4.90660199, -0.56195552, 1.88834350, -0.7616619… Factors have what are called “levels”, which you can think of as categories. We can ask for the levels from the Class column by using the levels function. This function should return the name of each category in that column. Given that we only have 2 different values in our Class column (B and M), we only expect to get two names back. Note that the levels function requires a vector argument, while the select function outputs a data frame; so we use the pull function, which converts a single column of a data frame into a vector. cancer %>% select(Class) %>% pull() %>% # turns a data frame into a vector levels() ## [1] "M" "B" Exploring the data Before we start doing any modelling, let’s explore our data set. Below we use the group_by + summarize code pattern we used before to see that we have 357 (63%) benign and 212 (37%) malignant tumour observations. num_obs <- nrow(cancer) cancer %>% group_by(Class) %>% summarize(n = n(), percentage = n() / num_obs * 100) ## # A tibble: 2 x 3 ## Class n percentage ## <fct> <int> <dbl> ## 1 M 212 37.3 ## 2 B 357 62.7 Next, let’s draw a scatter plot to visualize the relationship between the perimeter and concavity variables. Rather than use ggplot's default palette, we define our own here (cbPalette) and pass it as the values argument to the scale_color_manual function. We also make the category labels (“B” and “M”) more readable by changing them to “Benign” and “Malignant” using the labels argument. # colour palette cbPalette <- c("#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7", "#999999") perim_concav <- cancer %>% ggplot(aes(x = Perimeter, y = Concavity, color = Class)) + geom_point(alpha = 0.5) + labs(color = "Diagnosis") + scale_color_manual(labels = c("Malignant", "Benign"), values = cbPalette) perim_concav In this visualization, we can see that malignant observations typically fall in the upper right-hand corner of the plot area. By contrast, benign observations typically fall in the lower left-hand corner of the plot. 
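Although we do not need it right away, this is also a natural place to see the forcats library we loaded earlier in action. As a small illustrative sketch (the relabelled data frame below is not used in the rest of the chapter), fct_recode could be used to give the factor levels more readable names before summarizing:
# relabel the factor levels of Class for readability (illustration only)
cancer %>%
  mutate(Class = fct_recode(Class, "Malignant" = "M", "Benign" = "B")) %>%
  group_by(Class) %>%
  summarize(n = n())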
Suppose we obtain a new observation not in the current data set that has all the variables measured except the label (i.e., an image without the physician’s diagnosis for the tumour class). We could compute the perimeter and concavity values, resulting in values of, say, 1 and 1. Could we use this information to classify that observation as benign or malignant? What about a new observation with perimeter value of -1 and concavity value of -0.5? What about 0 and 1? It seems like the prediction of an unobserved label might be possible, based on our visualization. In order to actually do this computationally in practice, we will need a classification algorithm; here we will use the K-nearest neighbour classification algorithm. 6.5 Classification with K-nearest neighbours To predict the label of a new observation, i.e., classify it as either benign or malignant, the K-nearest neighbour classifier generally finds the \\(K\\) “nearest” or “most similar” observations in our training set, and then uses their diagnoses to make a prediction for the new observation’s diagnosis. To illustrate this concept, we will walk through an example. Suppose we have a new observation, with perimeter of 2 and concavity of 4 (labelled in red on the scatterplot), whose diagnosis “Class” is unknown. We see that the nearest point to this new observation is malignant and located at the coordinates (2.1, 3.6). The idea here is that if a point is close to another in the scatterplot, then the perimeter and concavity values are similar, and so we may expect that they would have the same diagnosis. Suppose we have another new observation with perimeter 0.2 and concavity of 3.3. Looking at the scatterplot below, how would you classify this red observation? The nearest neighbour to this new point is a benign observation at (0.2, 2.7). Does this seem like the right prediction to make? Probably not, if you consider the other nearby points… So instead of just using the one nearest neighbour, we can consider several neighbouring points, say \\(K = 3\\), that are closest to the new red observation to predict its diagnosis class. Among those 3 closest points, we use the majority class as our prediction for the new observation. In this case, we see that the diagnoses of 2 of the 3 nearest neighbours to our new observation are malignant. Therefore we take majority vote and classify our new red observation as malignant. Here we chose the \\(K=3\\) nearest observations, but there is nothing special about \\(K=3\\). We could have used \\(K=4, 5\\) or more (though we may want to choose an odd number to avoid ties). We will discuss more about choosing \\(K\\) in the next chapter. Distance between points How do we decide which points are the \\(K\\) “nearest” to our new observation? We can compute the distance between any pair of points using the following formula: \\[\\mathrm{Distance} = \\sqrt{(x_a -x_b)^2 + (y_a - y_b)^2}\\] This formula – sometimes called the Euclidean distance – is simply the straight line distance between two points on the x-y plane with coordinates \\((x_a, y_a)\\) and \\((x_b, y_b)\\). Suppose we want to classify a new observation with perimeter of 0 and concavity of 3.5. Let’s calculate the distances between our new point and each of the observations in the training set to find the \\(K=5\\) observations in the training data that are nearest to our new point. 
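To see what this formula computes for a single pair of points before we scale it up, here is a minimal sketch in R; the second point below is a made-up training observation with coordinates roughly like those on the scatterplot, not an exact row of the cancer data:
# new observation and one hypothetical training observation, as (perimeter, concavity) pairs
new_point <- c(0, 3.5)
train_point <- c(0.24, 2.65)
# straight-line (Euclidean) distance: square the differences, add them up, take the square root
sqrt(sum((new_point - train_point)^2))
Applying the same calculation to every observation in the data set gives us the distances we need: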
new_obs_Perimeter <- 0 new_obs_Concavity <- 3.5 cancer %>% select(ID, Perimeter, Concavity, Class) %>% mutate(dist_from_new = sqrt((Perimeter - new_obs_Perimeter)^2 + (Concavity - new_obs_Concavity)^2)) %>% arrange(dist_from_new) %>% head(n = 5) ## # A tibble: 5 x 5 ## ID Perimeter Concavity Class dist_from_new ## <dbl> <dbl> <dbl> <fct> <dbl> ## 1 86409 0.241 2.65 B 0.881 ## 2 887181 0.750 2.87 M 0.980 ## 3 899667 0.623 2.54 M 1.14 ## 4 907914 0.417 2.31 M 1.26 ## 5 8710441 -1.16 4.04 B 1.28 From this, we see that 3 of the 5 nearest neighbours to our new observation are malignant, so we classify our new observation as malignant. We circle those 5 in the plot below: It can be difficult sometimes to read code as math, so here we mathematically show the calculation of distance for each of the 5 closest points. Perimeter Concavity Distance Class 0.24 2.65 \\(\\sqrt{(0 - 0.241)^2 + (3.5 - 2.65)^2} =\\) 0.88 B 0.75 2.87 \\(\\sqrt{(0 - 0.750)^2 + (3.5 - 2.87)^2} =\\) 0.98 M 0.62 2.54 \\(\\sqrt{(0 - 0.623)^2 + (3.5 - 2.54)^2} =\\) 1.14 M 0.42 2.31 \\(\\sqrt{(0 - 0.417)^2 + (3.5 - 2.31)^2} =\\) 1.26 M -1.16 4.04 \\(\\sqrt{(0 - (-1.16))^2 + (3.5 - 4.04)^2} =\\) 1.28 B More than two explanatory variables Although the above description is directed toward two explanatory variables / predictors, exactly the same K-nearest neighbour algorithm applies when you have a higher number of explanatory variables (i.e., a higher-dimensional predictor space). Each explanatory variable/predictor can give us new information to help create our classifier. The only difference is the formula for the distance between points. In particular, let’s say we have \\(m\\) predictor variables for two observations \\(u\\) and \\(v\\), i.e., \\(u = (u_{1}, u_{2}, \\dots, u_{m})\\) and \\(v = (v_{1}, v_{2}, \\dots, v_{m})\\). Before, we added up the squared difference between each of our (two) variables, and then took the square root; now we will do the same, except for all of our \\(m\\) variables. In other words, the distance formula becomes \\[\\mathrm{Distance} = \\sqrt{(u_{1} - v_{1})^2 + (u_{2} - v_{2})^2 + \\dots + (u_{m} - v_{m})^2}\\] Click and drag the plot above to rotate it, and scroll to zoom. Note that in general we recommend against using 3D visualizations; here we show the data in 3D only to illustrate what “higher dimensions” look like for learning purposes. Summary In order to classify a new observation using a K-nearest neighbour classifier, we have to: Compute the distance between the new observation and each observation in the training set Sort the data table in ascending order according to the distances Choose the top \\(K\\) rows of the sorted table Classify the new observation based on a majority vote of the neighbour classes 6.6 K-nearest neighbours with tidymodels Coding the K-nearest neighbour algorithm in R ourselves would get complicated if we had to predict the label/class for multiple new observations, or if there were multiple classes and more than two variables. Thankfully, in R, the K-nearest neighbour algorithm is implemented in the parsnip package included in the tidymodels package collection, along with many other models that you will encounter in this and future classes. The tidymodels collection provides tools to help make and use models, such as classifiers. Using the packages in this collection will help keep our code simple, readable and accurate; the less we have to code ourselves, the fewer mistakes we are likely to make. 
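Before we hand this work over to tidymodels, it may help to see how the manual distance calculation above extends to more than two predictors. The following sketch adds Symmetry as an arbitrary third predictor purely for illustration (the new observation's Symmetry value of 1 is made up); the only change is one more squared-difference term inside the square root:
new_obs_Perimeter <- 0
new_obs_Concavity <- 3.5
new_obs_Symmetry <- 1  # hypothetical value, for illustration only
cancer %>%
  select(ID, Perimeter, Concavity, Symmetry, Class) %>%
  # the distance now uses all three predictors
  mutate(dist_from_new = sqrt((Perimeter - new_obs_Perimeter)^2 +
                              (Concavity - new_obs_Concavity)^2 +
                              (Symmetry - new_obs_Symmetry)^2)) %>%
  arrange(dist_from_new) %>%  # sort by distance
  head(n = 5)                 # keep the K = 5 nearest; their majority Class is our prediction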
We start off by loading tidymodels: library(tidymodels) Let’s again suppose we have a new observation with perimeter 0 and concavity 3.5, but its diagnosis is unknown (as in our example above). Suppose we want to use the perimeter and concavity explanatory variables/predictors to predict the diagnosis class of this observation. Let’s pick out our 2 desired predictor variables and class label and store it as a new dataset named cancer_train: cancer_train <- cancer %>% select(Class, Perimeter, Concavity) head(cancer_train) ## # A tibble: 6 x 3 ## Class Perimeter Concavity ## <fct> <dbl> <dbl> ## 1 M 1.27 2.65 ## 2 M 1.68 -0.0238 ## 3 M 1.57 1.36 ## 4 M -0.592 1.91 ## 5 M 1.78 1.37 ## 6 M -0.387 0.866 Next, we create a model specification for K-nearest neighbours classification by calling the nearest_neighbor function, specifying that we want to use \\(K = 5\\) neighbours (we will discuss how to choose \\(K\\) in the next chapter) and the straight-line distance (weight_func = \"rectangular\"). The weight_func argument controls how neighbours vote when classifying a new observation; by setting it to \"rectangular\", each of the \\(K\\) nearest neighbours gets exactly 1 vote as described above. Other choices, which weight each neighbour’s vote differently, can be found on the tidymodels website. We specify the particular computational engine (in this case, the kknn engine) for training the model with the set_engine function. Finally we specify that this is a classification problem with the set_mode function. knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 5) %>% set_engine("kknn") %>% set_mode("classification") knn_spec ## K-Nearest Neighbor Model Specification (classification) ## ## Main Arguments: ## neighbors = 5 ## weight_func = rectangular ## ## Computational engine: kknn In order to fit the model on the breast cancer data, we need to pass the model specification and the dataset to the fit function. We also need to specify what variables to use as predictors and what variable to use as the target. Below, the Class ~ . argument specifies that Class is the target variable (the one we want to predict), and . (everything except Class) is to be used as the predictor. knn_fit <- knn_spec %>% fit(Class ~ ., data = cancer_train) knn_fit ## parsnip model object ## ## Fit time: 37ms ## ## Call: ## kknn::train.kknn(formula = formula, data = data, ks = ~5, kernel = ~"rectangular") ## ## Type of response variable: nominal ## Minimal misclassification: 0.07557118 ## Best kernel: rectangular ## Best k: 5 Here you can see the final trained model summary. It confirms that the computational engine used to train the model was kknn::train.kknn. It also shows the fraction of errors made by the nearest neighbour model, but we will ignore this for now and discuss it in more detail in the next chapter. Finally it shows (somewhat confusingly) that the “best” weight function was “rectangular” and “best” setting of \\(K\\) was 5; but since we specified these earlier, R is just repeating those settings to us here. In the next chapter, we will actually let R tune the model for us. Finally, we make the prediction on the new observation by calling the predict function, passing the fit object we just created. As above when we ran the K-nearest neighbours classification algorithm manually, the knn_fit object classifies the new observation as malignant (“M”). Note that the predict function outputs a data frame with a single variable named .pred_class. 
new_obs <- tibble(Perimeter = 0, Concavity = 3.5) predict(knn_fit, new_obs) ## # A tibble: 1 x 1 ## .pred_class ## <fct> ## 1 M 6.7 Data preprocessing with tidymodels 6.7.1 Centering and scaling When using K-nearest neighbour classification, the scale of each variable (i.e., its size and range of values) matters. Since the classifier predicts classes by identifying the observations that are nearest to the new observation, any variables that have a large scale will have a much larger effect than variables with a small scale. But just because a variable has a large scale doesn’t mean that it is more important for making accurate predictions. For example, suppose you have a data set with two attributes, salary (in dollars) and years of education, and you want to predict the corresponding type of job. When we compute the neighbour distances, a difference of $1000 is huge compared to a difference of 10 years of education. But for understanding and answering the problem, it’s the opposite; 10 years of education is huge compared to a difference of $1000 in yearly salary! In many other predictive models, the center of each variable (e.g., its mean) matters as well. For example, if we had a data set with a temperature variable measured in degrees Kelvin, and the same data set with temperature measured in degrees Celsius, the two variables would differ by a constant shift of 273 (even though they contain exactly the same information). Likewise in our hypothetical job classification example, we would likely see that the center of the salary variable is in the tens of thousands, while the center of the years of education variable is in the single digits. Although this doesn’t affect the K-nearest neighbour classification algorithm, this large shift can change the outcome of using many other predictive models. Standardization: when all variables in a data set have a mean (center) of 0 and a standard deviation (scale) of 1, we say that the data have been standardized. To illustrate the effect that standardization can have on the K-nearest neighbour algorithm, we will read in the original, unscaled Wisconsin breast cancer data set; we have been using a standardized version of the data set up until now. To keep things simple, we will just use the Area, Smoothness, and Class variables: unscaled_cancer <- read_csv("data/unscaled_wdbc.csv") %>% mutate(Class = as_factor(Class)) %>% select(Class, Area, Smoothness) head(unscaled_cancer) ## # A tibble: 6 x 3 ## Class Area Smoothness ## <fct> <dbl> <dbl> ## 1 M 1001 0.118 ## 2 M 1326 0.0847 ## 3 M 1203 0.110 ## 4 M 386. 0.142 ## 5 M 1297 0.100 ## 6 M 477. 0.128 Looking at the unscaled / uncentered data above, you can see that the differences between the values for the area measurements are much larger than those for smoothness, and the mean appears to be much larger too. Will this affect predictions? In order to find out, we will create a scatter plot of these two predictors (coloured by diagnosis) for both the unstandardized data we just loaded, and the standardized version of that same data. In the tidymodels framework, all data preprocessing happens using a recipe. 
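To make this scale problem concrete, here is a small sketch with entirely made-up salary and education values (it does not use the cancer data). Worker B differs from worker A by $1000 in salary only, while worker C differs by 10 years of education only, yet the unstandardized distance makes worker C look one hundred times closer:
# hypothetical workers described by salary (dollars) and education (years)
worker_a <- c(salary = 60000, education = 10)
worker_b <- c(salary = 61000, education = 10)  # differs by $1000 in salary
worker_c <- c(salary = 60000, education = 0)   # differs by 10 years of education
sqrt(sum((worker_a - worker_b)^2))  # 1000: the salary difference dominates
sqrt(sum((worker_a - worker_c)^2))  # 10: the education difference barely registers
Standardizing both variables before computing distances removes this artificial dominance, and that is exactly what the recipe steps below will do for the cancer data.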
Here we will initialize a recipe for the unscaled_cancer data above, specifying that the Class variable is the target, and all other variables are predictors: uc_recipe <- recipe(Class ~ ., data = unscaled_cancer) print(uc_recipe) ## Data Recipe ## ## Inputs: ## ## role #variables ## outcome 1 ## predictor 2 So far, there is not much in the recipe; just a statement about the number of targets and predictors. Let’s add scaling (step_scale) and centering (step_center) steps for all of the predictors so that they each have a mean of 0 and standard deviation of 1. The prep function finalizes the recipe by using the data (here, unscaled_cancer) to compute anything necessary to run the recipe (in this case, the column means and standard deviations): uc_recipe <- uc_recipe %>% step_scale(all_predictors()) %>% step_center(all_predictors()) %>% prep() uc_recipe ## Data Recipe ## ## Inputs: ## ## role #variables ## outcome 1 ## predictor 2 ## ## Training data contained 569 data points and no missing data. ## ## Operations: ## ## Scaling for Area, Smoothness [trained] ## Centering for Area, Smoothness [trained] You can now see that the recipe includes a scaling and centering step for all predictor variables. Note that when you add a step to a recipe, you must specify what columns to apply the step to. Here we used the all_predictors() function to specify that each step should be applied to all predictor variables. However, there are a number of different arguments one could use here, as well as naming particular columns with the same syntax as the select function. For example: all_nominal() and all_numeric(): specify all categorical or all numeric variables all_predictors() and all_outcomes(): specify all predictor or all target variables Area, Smoothness: specify both the Area and Smoothness variable -Class: specify everything except the Class variable You can find a full set of all the steps and variable selection functions on the recipes home page. We finally use the bake function to apply the recipe. scaled_cancer <- bake(uc_recipe, unscaled_cancer) head(scaled_cancer) ## # A tibble: 6 x 3 ## Area Smoothness Class ## <dbl> <dbl> <fct> ## 1 0.984 1.57 M ## 2 1.91 -0.826 M ## 3 1.56 0.941 M ## 4 -0.764 3.28 M ## 5 1.82 0.280 M ## 6 -0.505 2.24 M Now let’s generate the two scatter plots, one for unscaled_cancer and one for scaled_cancer, and show them side-by-side. Each has the same new observation annotated with its \\(K=3\\) nearest neighbours: In the plot for the nonstandardized original data, you can see some odd choices for the three nearest neighbours. In particular, the “neighbours” are visually well within the cloud of benign observations, and the neighbours are all nearly vertically aligned with the new observation (which is why it looks like there is only one black line on this plot). Here the computation of nearest neighbours is dominated by the much larger-scale area variable. On the right, the plot for standardized data shows a much more intuitively reasonable selection of nearest neighbours. Thus, standardizing the data can change things in an important way when we are using predictive algorithms. As a rule of thumb, standardizing your data should be a part of the preprocessing you do before any predictive modelling / analysis. 6.7.2 Balancing Another potential issue in a data set for a classifier is class imbalance, i.e., when one label is much more common than another. 
Since classifiers like the K-nearest neighbour algorithm use the labels of nearby points to predict the label of a new point, if there are many more data points with one label overall, the algorithm is more likely to pick that label in general (even if the “pattern” of data suggests otherwise). Class imbalance is actually quite a common and important problem: from rare disease diagnosis to malicious email detection, there are many cases in which the “important” class to identify (presence of disease, malicious email) is much rarer than the “unimportant” class (no disease, normal email). To better illustrate the problem, let’s revisit the breast cancer data; except now we will remove many of the observations of malignant tumours, simulating what the data would look like if the cancer was rare. We will do this by picking only 3 observations randomly from the malignant group, and keeping all of the benign observations. set.seed(3) rare_cancer <- bind_rows(filter(cancer, Class == "B"), cancer %>% filter(Class == "M") %>% sample_n(3)) %>% select(Class, Perimeter, Concavity) rare_plot <- rare_cancer %>% ggplot(aes(x = Perimeter, y = Concavity, color = Class)) + geom_point(alpha = 0.5) + labs(color = "Diagnosis") + scale_color_manual(labels = c("Malignant", "Benign"), values = cbPalette) rare_plot Note: You will see in the code above that we use the set.seed function. This is because we are using sample_n to artificially pick only 3 of the malignant tumour observations, which uses random sampling to choose which rows will be in the training set. In order to make the code reproducible, we use set.seed to specify where the random number generator starts for this process, which then guarantees the same result, i.e., the same choice of 3 observations, each time the code is run. In general, when your code involves random numbers, if you want the same result each time, you should use set.seed; if you want a different result each time, you should not. Suppose we now decided to use \\(K = 7\\) in K-nearest neighbour classification. With only 3 observations of malignant tumours, the classifier will always predict that the tumour is benign, no matter what its concavity and perimeter are! This is because in a majority vote of 7 observations, at most 3 will be malignant (we only have 3 total malignant observations), so at least 4 must be benign, and the benign vote will always win. For example, look what happens for a new tumour observation that is quite close to two that were tagged as malignant: And if we set the background colour of each area of the plot to the decision the K-nearest neighbour classifier would make, we can see that the decision is always “benign,” corresponding to the blue colour: Despite the simplicity of the problem, solving it in a statistically sound manner is actually fairly nuanced, and a careful treatment would require a lot more detail and mathematics than we will cover in this textbook. For the present purposes, it will suffice to rebalance the data by oversampling the rare class. In other words, we will replicate rare observations multiple times in our data set to give them more voting power in the K-nearest neighbour algorithm. In order to do this, we will add an oversampling step to the earlier uc_recipe recipe with the step_upsample function. 
We show below how to do this, and also use the group_by + summarize pattern we’ve seen before to see that our classes are now balanced: ups_recipe <- recipe(Class ~ ., data = rare_cancer) %>% step_upsample(Class, over_ratio = 1, skip = FALSE) %>% prep() ups_recipe ## Data Recipe ## ## Inputs: ## ## role #variables ## outcome 1 ## predictor 2 ## ## Training data contained 360 data points and no missing data. ## ## Operations: ## ## Up-sampling based on Class [trained] upsampled_cancer <- bake(ups_recipe, rare_cancer) upsampled_cancer %>% group_by(Class) %>% summarize(n = n()) ## # A tibble: 2 x 2 ## Class n ## <fct> <int> ## 1 M 357 ## 2 B 357 Now suppose we train our K-nearest neighbour classifier with \\(K=7\\) on this balanced data. Setting the background colour of each area of our scatter plot to the decision the K-nearest neighbour classifier would make, we can see that the decision is more reasonable; when the points are close to those labelled malignant, the classifier predicts a malignant tumour, and vice versa when they are closer to the benign tumour observations: 6.8 Putting it together in a workflow The tidymodels package collection also provides the workflow, a simple way to chain together multiple data analysis steps without a lot of otherwise necessary code for intermediate steps. To illustrate the whole pipeline, let’s start from scratch with the unscaled_wdbc.csv data. First we will load the data, create a model, and specify a recipe for how the data should be preprocessed: # load the unscaled cancer data and make sure the target Class variable is a factor unscaled_cancer <- read_csv("data/unscaled_wdbc.csv") %>% mutate(Class = as_factor(Class)) # create the KNN model knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 7) %>% set_engine("kknn") %>% set_mode("classification") # create the centering / scaling recipe uc_recipe <- recipe(Class ~ Area + Smoothness, data = unscaled_cancer) %>% step_scale(all_predictors()) %>% step_center(all_predictors()) Note that each of these steps is exactly the same as earlier, except for one major difference: we did not use the select function to extract the relevant variables from the data frame, and instead simply specified the relevant variables to use via the formula Class ~ Area + Smoothness (instead of Class ~ .) in the recipe. You will also notice that we did not call prep() on the recipe; this is unnecessary when it is placed in a workflow. We will now place these steps in a workflow using the add_recipe and add_model functions, and finally we will use the fit function to run the whole workflow on the unscaled_cancer data. Note another difference from earlier here: we do not include a formula in the fit function. 
This is again because we included the formula in the recipe, so there is no need to respecify it: knn_fit <- workflow() %>% add_recipe(uc_recipe) %>% add_model(knn_spec) %>% fit(data = unscaled_cancer) knn_fit ## ══ Workflow [trained] ════════════════════════════════════════════════════════════════════════════════════════════ ## Preprocessor: Recipe ## Model: nearest_neighbor() ## ## ── Preprocessor ────────────────────────────────────────────────────────────────────────────────────────────────── ## 2 Recipe Steps ## ## ● step_scale() ## ● step_center() ## ## ── Model ───────────────────────────────────────────────────────────────────────────────────────────────────────── ## ## Call: ## kknn::train.kknn(formula = formula, data = data, ks = ~7, kernel = ~"rectangular") ## ## Type of response variable: nominal ## Minimal misclassification: 0.112478 ## Best kernel: rectangular ## Best k: 7 As before, the fit object lists the function that trains the model as well as the “best” settings for the number of neighbours and weight function (for now, these are just the values we chose manually when we created knn_spec above). But now the fit object also includes information about the overall workflow, including the centering and scaling preprocessing steps. Let’s visualize the predictions that this trained K-nearest neighbour model will make on new observations. Below you will see how to make the coloured prediction map plots from earlier in this chapter. The basic idea is to create a grid of synthetic new observations using the expand.grid function, predict the label of each, and visualize the predictions with a coloured scatter having a very high transparency (low alpha value) and large point radius. We include the code here as a learning challenge; see if you can figure out what each line is doing! # create the grid of area/smoothness vals, and arrange in a data frame are_grid <- seq(min(unscaled_cancer$Area), max(unscaled_cancer$Area), length.out = 100) smo_grid <- seq(min(unscaled_cancer$Smoothness), max(unscaled_cancer$Smoothness), length.out = 100) asgrid <- as_tibble(expand.grid(Area=are_grid, Smoothness=smo_grid)) # use the fit workflow to make predictions at the grid points knnPredGrid <- predict(knn_fit, asgrid) # bind the predictions as a new column with the grid points prediction_table <- bind_cols(knnPredGrid, asgrid) %>% rename(Class = .pred_class) # plot: # 1. the coloured scatter of the original data # 2. the faded coloured scatter for the grid points wkflw_plot <- ggplot() + geom_point(data = unscaled_cancer, mapping = aes(x = Area, y = Smoothness, color = Class), alpha=0.75) + geom_point(data = prediction_table, mapping = aes(x = Area, y = Smoothness, color = Class), alpha=0.02, size=5.)+ labs(color = "Diagnosis") + scale_color_manual(labels = c("Malignant", "Benign"), values = cbPalette) "],
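As a final, hedged usage sketch for this workflow (the Area and Smoothness values below are made-up measurements, not taken from the data), note that we can pass a brand-new unscaled observation directly to predict; because the centering and scaling recipe lives inside the workflow, it is applied automatically before the K-nearest neighbour vote:
# a hypothetical new tumour image with unscaled measurements
new_observation <- tibble(Area = 500, Smoothness = 0.075)
predict(knn_fit, new_observation)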
    +["classification-continued.html", "Chapter 7 Classification II: evaluation & tuning 7.1 Overview 7.2 Chapter learning objectives 7.3 Evaluating accuracy 7.4 Tuning the classifier 7.5 Splitting data 7.6 Summary", " Chapter 7 Classification II: evaluation & tuning 7.1 Overview This chapter continues the introduction to predictive modelling through classification. While the previous chapter covered training and data preprocessing, this chapter focuses on how to split data, how to evaluate prediction accuracy, and how to choose model parameters to maximize performance. 7.2 Chapter learning objectives By the end of the chapter, students will be able to: Describe what training, validation, and test data sets are and how they are used in classification Split data into training, validation, and test data sets Evaluate classification accuracy in R using a validation data set and appropriate metrics Execute cross-validation in R to choose the number of neighbours in a K-nearest neighbour classifier Describe advantages and disadvantages of the K-nearest neighbour classification algorithm 7.3 Evaluating accuracy Sometimes our classifier might make the wrong prediction. A classifier does not need to be right 100% of the time to be useful, though we don’t want the classifier to make too many wrong predictions. How do we measure how “good” our classifier is? Let’s revisit the breast cancer images example and think about how our classifier will be used in practice. A biopsy will be performed on a new patient’s tumour, the resulting image will be analyzed, and the classifier will be asked to decide whether the tumour is benign or malignant. The key word here is new: our classifier is “good” if it provides accurate predictions on data not seen during training. But then how can we evaluate our classifier without having to visit the hospital to collect more tumour images? The trick is to split up the data set into a training set and test set, and only show the classifier the training set when building the classifier. Then to evaluate the accuracy of the classifier, we can use it to predict the labels (which we know) in the test set. If our predictions match the true labels for the observations in the test set very well, then we have some confidence that our classifier might also do a good job of predicting the class labels for new observations that we do not have the class labels for. Note: if there were a golden rule of machine learning, it might be this: you cannot use the test data to build the model! If you do, the model gets to “see” the test data in advance, making it look more accurate than it really is. Imagine how bad it would be to overestimate your classifier’s accuracy when predicting whether a patient’s tumour is malignant or benign! How exactly can we assess how well our predictions match the true labels for the observations in the test set? One way we can do this is to calculate the prediction accuracy. This is the fraction of examples for which the classifier made the correct prediction. To calculate this we divide the number of correct predictions by the number of predictions made. Other measures for how well our classifier performed include precision and recall; these will not be discussed here, but you will encounter them in other more advanced courses on this topic. This process is illustrated below: In R, we can use the tidymodels library collection not only to perform K-nearest neighbour classification, but also to assess how well our classification worked. 
Let’s start by loading the necessary libraries, reading in the breast cancer data from the previous chapter, and making a quick scatter plot visualization of tumour cell concavity versus smoothness coloured by diagnosis. # load libraries library(tidyverse) library(tidymodels) # load data cancer <- read_csv("data/unscaled_wdbc.csv") %>% mutate(Class = as_factor(Class)) # convert the character Class variable to the factor datatype # colour palette cbPalette <- c("#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7", "#999999") # create scatter plot of tumour cell concavity versus smoothness, # labelling the points by diagnosis class perim_concav <- cancer %>% ggplot(aes(x = Smoothness, y = Concavity, color = Class)) + geom_point(alpha = 0.5) + labs(color = "Diagnosis") + scale_color_manual(labels = c("Malignant", "Benign"), values = cbPalette) perim_concav 1. Create the train / test split Once we have decided on a predictive question to answer and done some preliminary exploration, the very next thing to do is to split the data into the training and test sets. Typically, the training set is between 50 - 100% of the data, while the test set is the remaining 0 - 50%; the intuition is that you want to trade off between training an accurate model (by using a larger training data set) and getting an accurate evaluation of its performance (by using a larger test data set). Here, we will use 75% of the data for training, and 25% for testing. To do this we will use the initial_split function, specifying that prop = 0.75 and that the strata argument should be the target variable Class: set.seed(1) cancer_split <- initial_split(cancer, prop = 0.75, strata = Class) cancer_train <- training(cancer_split) cancer_test <- testing(cancer_split) Note: You will see in the code above that we use the set.seed function again, as discussed in the previous chapter. In this case it is because initial_split uses random sampling to choose which rows will be in the training set. Since we want our code to be reproducible and generate the same train/test split each time it is run, we use set.seed. 
glimpse(cancer_train) ## Rows: 427 ## Columns: 12 ## $ ID <dbl> 842302, 842517, 84300903, 84348301, 84358402, 843786, 844359, 84458202, 84501001, 845… ## $ Class <fct> M, M, M, M, M, M, M, M, M, M, M, M, M, M, M, B, B, M, M, M, M, M, M, M, M, M, M, M, M… ## $ Radius <dbl> 17.990, 20.570, 19.690, 11.420, 20.290, 12.450, 18.250, 13.710, 12.460, 16.020, 15.78… ## $ Texture <dbl> 10.38, 17.77, 21.25, 20.38, 14.34, 15.70, 19.98, 20.83, 24.04, 23.24, 17.89, 24.80, 2… ## $ Perimeter <dbl> 122.80, 132.90, 130.00, 77.58, 135.10, 82.57, 119.60, 90.20, 83.97, 102.70, 103.60, 1… ## $ Area <dbl> 1001.0, 1326.0, 1203.0, 386.1, 1297.0, 477.1, 1040.0, 577.9, 475.9, 797.8, 781.0, 112… ## $ Smoothness <dbl> 0.11840, 0.08474, 0.10960, 0.14250, 0.10030, 0.12780, 0.09463, 0.11890, 0.11860, 0.08… ## $ Compactness <dbl> 0.27760, 0.07864, 0.15990, 0.28390, 0.13280, 0.17000, 0.10900, 0.16450, 0.23960, 0.06… ## $ Concavity <dbl> 0.30010, 0.08690, 0.19740, 0.24140, 0.19800, 0.15780, 0.11270, 0.09366, 0.22730, 0.03… ## $ Concave_Points <dbl> 0.147100, 0.070170, 0.127900, 0.105200, 0.104300, 0.080890, 0.074000, 0.059850, 0.085… ## $ Symmetry <dbl> 0.2419, 0.1812, 0.2069, 0.2597, 0.1809, 0.2087, 0.1794, 0.2196, 0.2030, 0.1528, 0.184… ## $ Fractal_Dimension <dbl> 0.07871, 0.05667, 0.05999, 0.09744, 0.05883, 0.07613, 0.05742, 0.07451, 0.08243, 0.05… glimpse(cancer_test) ## Rows: 142 ## Columns: 12 ## $ ID <dbl> 844981, 84799002, 848406, 849014, 8510426, 8511133, 853401, 854002, 855167, 856106, 8… ## $ Class <fct> M, M, M, M, B, M, M, M, M, M, M, B, B, M, M, M, B, M, M, B, B, B, M, M, B, B, M, B, B… ## $ Radius <dbl> 13.000, 14.540, 14.680, 19.810, 13.540, 15.340, 18.630, 19.270, 13.440, 13.280, 18.22… ## $ Texture <dbl> 21.82, 27.54, 20.13, 22.15, 14.36, 14.26, 25.11, 26.47, 21.58, 20.28, 18.70, 11.79, 1… ## $ Perimeter <dbl> 87.50, 96.73, 94.74, 130.00, 87.46, 102.50, 124.80, 127.90, 86.18, 87.32, 120.30, 54.… ## $ Area <dbl> 519.8, 658.8, 684.5, 1260.0, 566.3, 704.4, 1088.0, 1162.0, 563.0, 545.2, 1033.0, 224.… ## $ Smoothness <dbl> 0.12730, 0.11390, 0.09867, 0.09831, 0.09779, 0.10730, 0.10640, 0.09401, 0.08162, 0.10… ## $ Compactness <dbl> 0.19320, 0.15950, 0.07200, 0.10270, 0.08129, 0.21350, 0.18870, 0.17190, 0.06031, 0.14… ## $ Concavity <dbl> 0.18590, 0.16390, 0.07395, 0.14790, 0.06664, 0.20770, 0.23190, 0.16570, 0.03110, 0.09… ## $ Concave_Points <dbl> 0.093530, 0.073640, 0.052590, 0.094980, 0.047810, 0.097560, 0.124400, 0.075930, 0.020… ## $ Symmetry <dbl> 0.2350, 0.2303, 0.1586, 0.1582, 0.1885, 0.2521, 0.2183, 0.1853, 0.1784, 0.1974, 0.209… ## $ Fractal_Dimension <dbl> 0.07389, 0.07077, 0.05922, 0.05395, 0.05766, 0.07032, 0.06197, 0.06261, 0.05587, 0.06… We can see from glimpse in the code above that the training set contains 427 observations, while the test set contains 142 observations. This corresponds to a train / test split of 75% / 25%, as desired. 2. Pre-process the data As we mentioned last chapter, K-NN is sensitive to the scale of the predictors, and so we should perform some preprocessing to standardize them. An additional consideration we need to take when doing this is that we should create the standardization preprocessor using only the training data. This ensures that our test data does not influence any aspect of our model training. Once we have created the standardization preprocessor, we can then apply it separately to both the training and test data sets. Fortunately, the recipe framework from tidymodels makes it simple to handle this properly. 
Below we construct and prepare the recipe using only the training data (due to data = cancer_train in the first line). cancer_recipe <- recipe(Class ~ Smoothness + Concavity, data = cancer_train) %>% step_scale(all_predictors()) %>% step_center(all_predictors()) 3. Train the classifier Now that we have split our original data set into training and test sets, we can create our K-nearest neighbour classifier with only the training set using the technique we learned in the previous chapter. For now, we will just choose the number \\(K\\) of neighbours to be 3, and use concavity and smoothness as the predictors. set.seed(1) knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 3) %>% set_engine("kknn") %>% set_mode("classification") knn_fit <- workflow() %>% add_recipe(cancer_recipe) %>% add_model(knn_spec) %>% fit(data = cancer_train) knn_fit ## ══ Workflow [trained] ════════════════════════════════════════════════════════════════════════════════════════════ ## Preprocessor: Recipe ## Model: nearest_neighbor() ## ## ── Preprocessor ────────────────────────────────────────────────────────────────────────────────────────────────── ## 2 Recipe Steps ## ## ● step_scale() ## ● step_center() ## ## ── Model ───────────────────────────────────────────────────────────────────────────────────────────────────────── ## ## Call: ## kknn::train.kknn(formula = formula, data = data, ks = ~3, kernel = ~"rectangular") ## ## Type of response variable: nominal ## Minimal misclassification: 0.1264637 ## Best kernel: rectangular ## Best k: 3 Note: Here again you see the set.seed function. In the K-nearest neighbour algorithm, if there is a tie for the majority neighbour class, the winner is randomly selected. Although there is no chance of a tie when \\(K\\) is odd (here \\(K=3\\)), it is possible that the code may be changed in the future to have an even value of \\(K\\). Thus, to prevent potential issues with reproducibility, we have set the seed. Note that in your own code, you only have to set the seed once at the beginning of your analysis. 4. Predict the labels in the test set Now that we have a K-nearest neighbour classifier object, we can use it to predict the class labels for our test set. We use the bind_cols function to add the column of predictions to the original test data, creating the cancer_test_predictions data frame. The Class variable contains the true diagnoses, while the .pred_class variable contains the predicted diagnoses from the model. cancer_test_predictions <- predict(knn_fit, cancer_test) %>% bind_cols(cancer_test) head(cancer_test_predictions) ## # A tibble: 6 x 13 ## .pred_class ID Class Radius Texture Perimeter Area Smoothness Compactness Concavity Concave_Points Symmetry ## <fct> <dbl> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 M 8.45e5 M 13 21.8 87.5 520. 0.127 0.193 0.186 0.0935 0.235 ## 2 M 8.48e7 M 14.5 27.5 96.7 659. 0.114 0.160 0.164 0.0736 0.230 ## 3 B 8.48e5 M 14.7 20.1 94.7 684. 0.0987 0.072 0.0740 0.0526 0.159 ## 4 M 8.49e5 M 19.8 22.2 130 1260 0.0983 0.103 0.148 0.0950 0.158 ## 5 B 8.51e6 B 13.5 14.4 87.5 566. 0.0978 0.0813 0.0666 0.0478 0.188 ## 6 M 8.51e6 M 15.3 14.3 102. 704. 0.107 0.214 0.208 0.0976 0.252 ## # … with 1 more variable: Fractal_Dimension <dbl> 5. Compute the accuracy Finally we can assess our classifier’s accuracy. 
To do this we use the metrics function from tidymodels to get the statistics about the quality of our model, specifying the truth and estimate arguments: cancer_test_predictions %>% metrics(truth = Class, estimate = .pred_class) ## # A tibble: 2 x 3 ## .metric .estimator .estimate ## <chr> <chr> <dbl> ## 1 accuracy binary 0.880 ## 2 kap binary 0.741 This shows that the accuracy of the classifier on the test data was 88%. We can also look at the confusion matrix for the classifier, which shows the table of predicted labels and correct labels, using the conf_mat function: cancer_test_predictions %>% conf_mat(truth = Class, estimate = .pred_class) ## Truth ## Prediction M B ## M 43 7 ## B 10 82 This says that the classifier labelled 43+82 = 125 observations correctly, 10 observations as benign when they were truly malignant, and 7 observations as malignant when they were truly benign. 7.4 Tuning the classifier The vast majority of predictive models in statistics and machine learning have parameters that you have to pick. For example, in the K-nearest neighbour classification algorithm we have been using in the past two chapters, we have had to pick the number of neighbours \\(K\\) for the class vote. Is it possible to make this selection, i.e., tune the model, in a principled way? Ideally what we want is to somehow maximize the performance of our classifier on data it hasn’t seen yet. So we will play the same trick we did before when evaluating our classifier: we’ll split our overall training data set further into two subsets, called the training set and validation set. We will use the newly-named training set for building the classifier, and the validation set for evaluating it! Then we will try different values of the parameter \\(K\\) and pick the one that yields the highest accuracy. Remember: don’t touch the test set during the tuning process. Tuning is a part of model training! 7.4.1 Cross-validation There is an important detail to mention about the process of tuning: we can, if we want to, split our overall training data up in multiple different ways, train and evaluate a classifier for each split, and then choose the parameter based on all of the different results. If we just split our overall training data once, our best parameter choice will depend strongly on whatever data was lucky enough to end up in the validation set. By using multiple different train / validation splits, we’ll get a better estimate of accuracy, which will lead to a better choice of the number of neighbours \\(K\\) for the overall set of training data. Note: you might be wondering why we can’t use the multiple splits to test our final classifier after tuning is done. This is simply because at the end of the day, we will produce a single classifier using our overall training data. If we do multiple train / test splits, we will end up with multiple classifiers, each with their own accuracy evaluated on different test data. Let’s investigate this idea in R! In particular, we will use different seed values in the set.seed function to generate five different train / validation splits of our overall training data, train five different K-nearest neighbour models, and evaluate their accuracy. 
accuracies <- c() for (i in 1:5){ set.seed(i) # makes the random selection of rows reproducible # create the 25/75 split of the training data into training and validation cancer_split <- initial_split(cancer_train, prop = 0.75, strata = Class) cancer_subtrain <- training(cancer_split) cancer_validation <- testing(cancer_split) # recreate the standardization recipe from before (since it must be based on the training data) cancer_recipe <- recipe(Class ~ Smoothness + Concavity, data = cancer_subtrain) %>% step_scale(all_predictors()) %>% step_center(all_predictors()) # fit the knn model (we can reuse the old knn_spec model from before) knn_fit <- workflow() %>% add_recipe(cancer_recipe) %>% add_model(knn_spec) %>% fit(data = cancer_subtrain) # get predictions on the validation data validation_predicted <- predict(knn_fit, cancer_validation) %>% bind_cols(cancer_validation) #compute the accuracy acc <- validation_predicted %>% metrics(truth = Class, estimate = .pred_class) %>% filter(.metric == "accuracy") %>% select(.estimate) %>% pull() accuracies <- append(accuracies, acc) } accuracies ## [1] 0.9150943 0.8679245 0.8490566 0.8962264 0.9150943 With five different shuffles of the data, we get five different values for accuracy. None of these is necessarily “more correct” than any other; they’re just five estimates of the true, underlying accuracy of our classifier built using our overall training data. We can combine the estimates by taking their average (here 0.8886792) to try to get a single assessment of our classifier’s accuracy; this has the effect of reducing the influence of any one (un)lucky validation set on the estimate. In practice, we don’t use random splits, but rather use a more structured splitting procedure so that each observation in the data set is used in a validation set only a single time. The name for this strategy is called cross-validation. In cross-validation, we split our overall training data into \\(C\\) evenly-sized chunks, and then iteratively use \\(1\\) chunk as the validation set and combine the remaining \\(C-1\\) chunks as the training set: In the picture above, \\(C=5\\) different chunks of the data set are used, resulting in 5 different choices for the validation set; we call this 5-fold cross-validation. To do 5-fold cross-validation in R with tidymodels, we use another function: vfold_cv. This function splits our training data into v folds automatically: cancer_vfold <- vfold_cv(cancer_train, v = 5, strata = Class) cancer_vfold ## # 5-fold cross-validation using stratification ## # A tibble: 5 x 2 ## splits id ## <list> <chr> ## 1 <split [341/86]> Fold1 ## 2 <split [341/86]> Fold2 ## 3 <split [341/86]> Fold3 ## 4 <split [342/85]> Fold4 ## 5 <split [343/84]> Fold5 Then, when we create our data analysis workflow, we use the fit_resamples function instead of the fit function for training. This runs cross-validation on each train/validation split. Note: we set the seed when we call train not only because of the potential for ties, but also because we are doing cross-validation. Cross-validation uses a random process to select how to partition the training data. 
set.seed(1) # recreate the standardization recipe from before (since it must be based on the training data) cancer_recipe <- recipe(Class ~ Smoothness + Concavity, data = cancer_train) %>% step_scale(all_predictors()) %>% step_center(all_predictors()) # fit the knn model (we can reuse the old knn_spec model from before) knn_fit <- workflow() %>% add_recipe(cancer_recipe) %>% add_model(knn_spec) %>% fit_resamples(resamples = cancer_vfold) knn_fit ## # 5-fold cross-validation using stratification ## # A tibble: 5 x 4 ## splits id .metrics .notes ## <list> <chr> <list> <list> ## 1 <split [341/86]> Fold1 <tibble [2 × 3]> <tibble [0 × 1]> ## 2 <split [341/86]> Fold2 <tibble [2 × 3]> <tibble [0 × 1]> ## 3 <split [341/86]> Fold3 <tibble [2 × 3]> <tibble [0 × 1]> ## 4 <split [342/85]> Fold4 <tibble [2 × 3]> <tibble [0 × 1]> ## 5 <split [343/84]> Fold5 <tibble [2 × 3]> <tibble [0 × 1]> The collect_metrics function is used to aggregate the mean and standard error of the classifier’s validation accuracy across the folds. The standard error is a measure of how uncertain we are in the mean value. A detailed treatment of this is beyond the scope of this chapter; but roughly, if your estimated mean (that the collect_metrics function gives you) is 0.88 and standard error is 0.02, you can expect the true average accuracy of the classifier to be somewhere roughly between 0.86 and 0.90 (although it may fall outside this range). knn_fit %>% collect_metrics() ## # A tibble: 2 x 5 ## .metric .estimator mean n std_err ## <chr> <chr> <dbl> <int> <dbl> ## 1 accuracy binary 0.883 5 0.0189 ## 2 roc_auc binary 0.923 5 0.0104 We can choose any number of folds, and typically the more we use the better our accuracy estimate will be (lower standard error). However, we are limited by computational power: the more folds we choose, the more computation it takes, and hence the more time it takes to run the analysis. So when you do cross-validation, you need to consider the size of the data, and the speed of the algorithm (e.g., K-nearest neighbour) and the speed of your computer. In practice, this is a trial and error process, but typically \\(C\\) is chosen to be either 5 or 10. Here we show how the standard error decreases when we use 10-fold cross validation rather than 5-fold: cancer_vfold <- vfold_cv(cancer_train, v = 10, strata = Class) workflow() %>% add_recipe(cancer_recipe) %>% add_model(knn_spec) %>% fit_resamples(resamples = cancer_vfold) %>% collect_metrics() ## # A tibble: 2 x 5 ## .metric .estimator mean n std_err ## <chr> <chr> <dbl> <int> <dbl> ## 1 accuracy binary 0.885 10 0.0104 ## 2 roc_auc binary 0.927 10 0.0111 7.4.2 Parameter value selection Using 5- and 10-fold cross-validation, we have estimated that the prediction accuracy of our classifier is somewhere around 88%. Whether 88% is good or not depends entirely on the downstream application of the data analysis. In the present situation, we are trying to predict a tumour diagnosis, with expensive, damaging chemo/radiation therapy or patient death as potential consequences of misprediction. Hence, we’d like to do better than 88% for this application. In order to improve our classifier, we have one choice of parameter: the number of neighbours, \\(K\\). Since cross-validation helps us evaluate the accuracy of our classifier, we can use cross-validation to calculate an accuracy for each value of \\(K\\) in a reasonable range, and then pick the value of \\(K\\) that gives us the best accuracy. 
The tidymodels package collection provides a very simple syntax for tuning models: each parameter in the model to be tuned should be specified as tune() in the model specification rather than given a particular value. knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) %>% set_engine("kknn") %>% set_mode("classification") Then instead of using fit or fit_resamples, we will use the tune_grid function to fit the model for each value in a range of parameter values. Here the grid = 10 argument specifies that the tuning should try 10 values of the number of neighbours \\(K\\) when tuning. We set the seed prior to tuning to ensure results are reproducible: set.seed(1) knn_results <- workflow() %>% add_recipe(cancer_recipe) %>% add_model(knn_spec) %>% tune_grid(resamples = cancer_vfold, grid = 10) %>% collect_metrics() knn_results ## # A tibble: 20 x 6 ## neighbors .metric .estimator mean n std_err ## <int> <chr> <chr> <dbl> <int> <dbl> ## 1 2 accuracy binary 0.864 10 0.0167 ## 2 2 roc_auc binary 0.897 10 0.0127 ## 3 3 accuracy binary 0.885 10 0.0104 ## 4 3 roc_auc binary 0.927 10 0.0111 ## 5 5 accuracy binary 0.890 10 0.0136 ## 6 5 roc_auc binary 0.927 10 0.00960 ## 7 6 accuracy binary 0.890 10 0.0136 ## 8 6 roc_auc binary 0.930 10 0.0111 ## 9 7 accuracy binary 0.890 10 0.00966 ## 10 7 roc_auc binary 0.931 10 0.0106 ## 11 9 accuracy binary 0.887 10 0.0104 ## 12 9 roc_auc binary 0.935 10 0.0132 ## 13 10 accuracy binary 0.887 10 0.0104 ## 14 10 roc_auc binary 0.934 10 0.0130 ## 15 12 accuracy binary 0.882 10 0.0164 ## 16 12 roc_auc binary 0.940 10 0.0113 ## 17 13 accuracy binary 0.885 10 0.0170 ## 18 13 roc_auc binary 0.939 10 0.0118 ## 19 15 accuracy binary 0.873 10 0.0153 ## 20 15 roc_auc binary 0.940 10 0.0103 We can select the best value of the number of neighbours (i.e., the one that results in the highest classifier accuracy estimate) by plotting the accuracy versus \\(K\\): accuracies <- knn_results %>% filter(.metric == 'accuracy') accuracy_vs_k <- ggplot(accuracies, aes(x = neighbors, y = mean)) + geom_point() + geom_line() + labs(x = 'Neighbors', y = 'Accuracy Estimate') accuracy_vs_k This visualization suggests that \\(K = 7\\) provides the highest accuracy. But as you can see, there is no exact or perfect answer here; any selection between \\(K = 3\\) and \\(13\\) would be reasonably justified, as all of these differ in classifier accuracy by less than 1%. Remember: the values you see on this plot are estimates of the true accuracy of our classifier. Although the \\(K=7\\) value is higher than the others on this plot, that doesn’t mean the classifier is actually more accurate with this parameter value! Generally, when selecting \\(K\\) (and other parameters for other predictive models), we are looking for a value where: we get roughly optimal accuracy, so that our model will likely be accurate changing the value to a nearby one (e.g. from \\(K=7\\) to 6 or 8) doesn’t decrease accuracy too much, so that our choice is reliable in the presence of uncertainty the cost of training the model is not prohibitive (e.g., in our situation, if \\(K\\) is too large, predicting becomes expensive!) 7.4.3 Under/overfitting To build a bit more intuition, what happens if we keep increasing the number of neighbours \\(K\\)? In fact, the accuracy actually starts to decrease! Rather than setting grid = 10 and letting tidymodels decide what values of \\(K\\) to try, let’s specify the values explicitly by creating a data frame with a neighbors variable. 
Take a look at the plot below as we vary \\(K\\) from 1 to almost the number of observations in the data set: set.seed(1) k_lots <- tibble(neighbors = seq(from = 1, to = 385, by = 10)) knn_results <- workflow() %>% add_recipe(cancer_recipe) %>% add_model(knn_spec) %>% tune_grid(resamples = cancer_vfold, grid = k_lots) %>% collect_metrics() accuracies <- knn_results %>% filter(.metric == 'accuracy') accuracy_vs_k_lots <- ggplot(accuracies, aes(x = neighbors, y = mean)) + geom_point() + geom_line() + labs(x = 'Neighbors', y = 'Accuracy Estimate') accuracy_vs_k_lots Underfitting: What is actually happening to our classifier that causes this? As we increase the number of neighbours, more and more of the training observations (and those that are farther and farther away from the point) get a “say” in what the class of a new observation is. This causes a sort of “averaging effect” to take place, making the boundary between where our classifier would predict a tumour to be malignant versus benign smooth out and become simpler. If you take this to the extreme, setting \\(K\\) to the total training data set size, then the classifier will always predict the same label regardless of what the new observation looks like. In general, if the model isn’t influenced enough by the training data, it is said to underfit the data. Overfitting: In contrast, when we decrease the number of neighbours, each individual data point has a stronger and stronger vote regarding nearby points. Since the data themselves are noisy, this causes a more “jagged” boundary corresponding to a less simple model. If you take this case to the extreme, setting \\(K = 1\\), then the classifier is essentially just matching each new observation to its closest neighbour in the training data set. This is just as problematic as the large \\(K\\) case, because the classifier becomes unreliable on new data: if we had a different training set, the predictions would be completely different. In general, if the model is influenced too much by the training data, it is said to overfit the data. You can see this effect in the plots below as we vary the number of neighbours \\(K\\) in (1, 7, 20, 200): 7.5 Splitting data Shuffling: When we split the data into train, test, and validation sets, we make the assumption that there is no order to our originally collected data set. However, if we think that there might be some order to the original data set, then we can randomly shuffle the data before splitting it. The tidymodels initial_split and vfold_cv functions do this for us. Stratification: If the data are imbalanced, we also need to be extra careful about splitting the data to ensure that enough of each class ends up in each of the train, validation, and test partitions. The strata argument in the initial_split and vfold_cv functions handles this for us too. 7.6 Summary Classification algorithms use one or more quantitative variables to predict the value of a third, categorical variable. The K-nearest neighbour algorithm in particular does this by first finding the K points in the training data nearest to the new observation, and then returning the majority class vote from those training observations. We can evaluate a classifier by splitting the data randomly into a training and test data set, using the training set to build the classifier, and using the test set to estimate its accuracy. To tune the classifier (e.g., select the K in K-nearest neighbours), we maximize accuracy estimates from cross-validation. 
A typical 10-fold cross-validation data set split. Source: https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6 The overall workflow for performing K-nearest neighbour classification using tidymodels is as follows: Use the initial_split function to split the data into a training and test set. Set the strata argument to the target variable. Put the test set aside for now. Use the vfold_cv function to split up the training data for cross validation. Create a recipe that specifies the target and predictor variables, as well as preprocessing steps for all variables. Pass the training data as the data argument of the recipe. Create a nearest_neighbor model specification, with neighbors = tune(). Add the recipe and model specification to a workflow(), and use the tune_grid function on the train/validation splits to estimate the classifier accuracy for a range of \\(K\\) values. Pick a value of \\(K\\) that yields a high accuracy estimate that doesn’t change much if you change \\(K\\) to a nearby value. Make a new model specification for the best parameter value, and retrain the classifier using the fit function. Evaluate the estimated accuracy of the classifier on the test set using the predict function (a sketch of these final retraining and evaluation steps follows below). Strengths: Simple and easy to understand No assumptions about what the data must look like Works easily for binary (two-class) and multi-class (> 2 classes) classification problems Weaknesses: As data gets bigger and bigger, K-nearest neighbour gets slower and slower, quite quickly Does not perform well with a large number of predictors Does not perform well when classes are imbalanced (when many more observations are in one of the classes compared to the others) "],
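The final retraining and evaluation steps in the summary above are not shown in code in this section, so here is a hedged sketch of what they might look like. It assumes the accuracies data frame from the tuning step above, and a test set cancer_test created earlier in the chapter with initial_split (not shown in this excerpt); best_k, knn_best_spec, and knn_final_fit are our own illustrative names.
# pick the number of neighbours with the highest estimated accuracy
best_k <- accuracies %>%
  arrange(desc(mean)) %>%
  slice(1) %>%
  pull(neighbors)

# retrain on the full training set with that value
knn_best_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = best_k) %>%
  set_engine("kknn") %>%
  set_mode("classification")

knn_final_fit <- workflow() %>%
  add_recipe(cancer_recipe) %>%
  add_model(knn_best_spec) %>%
  fit(data = cancer_train)

# estimate accuracy on the held-out test set
knn_final_fit %>%
  predict(cancer_test) %>%
  bind_cols(cancer_test) %>%
  metrics(truth = Class, estimate = .pred_class)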
    +["regression1.html", "Chapter 8 Regression I: K-nearest neighbours 8.1 Overview 8.2 Chapter learning objectives 8.3 Regression 8.4 Sacremento real estate example 8.5 K-nearest neighbours regression 8.6 Training, evaluating, and tuning the model 8.7 Underfitting and overfitting 8.8 Evaluating on the test set 8.9 Strengths and limitations of K-NN regression 8.10 Multivariate K-NN regression", " Chapter 8 Regression I: K-nearest neighbours 8.1 Overview This chapter will provide an introduction to regression through K-nearest neighbours (K-NN) in a predictive context, focusing primarily on the case where there is a single predictor and single response variable of interest. The chapter concludes with an example of K-nearest neighbours regression with multiple predictors. 8.2 Chapter learning objectives By the end of the chapter, students will be able to: Recognize situations where a simple regression analysis would be appropriate for making predictions. Explain the K-nearest neighbour (K-NN) regression algorithm and describe how it differs from K-NN classification. Interpret the output of a K-NN regression. In a dataset with two or more variables, perform K-nearest neighbour regression in R using a tidymodels workflow Execute cross-validation in R to choose the number of neighbours. Evaluate K-NN regression prediction accuracy in R using a test data set and an appropriate metric (e.g., root means square prediction error). In the context of K-NN regression, compare and contrast goodness of fit and prediction properties (namely RMSE vs RMSPE). Describe advantages and disadvantages of the K-nearest neighbour regression approach. 8.3 Regression Regression, like classification, is a predictive problem setting where we want to use past information to predict future observations. But in the case of regression, the goal is to predict numerical values instead of class labels. For example, we could try to use the number of hours a person spends on exercise each week to predict whether they would qualify for the annual Boston marathon (classification) or to predict their race time itself (regression). As another example, we could try to use the size of a house to predict whether it sold for more than $500,000 (classification) or to predict its sale price itself (regression). We will use K-nearest neighbours to explore this question in the rest of this chapter, using a real estate data set from Sacremento, California. 8.4 Sacremento real estate example Let’s start by loading the libraries we need and doing some preliminary exploratory analysis. The Sacramento real estate data set we will study in this chapter was originally reported in the Sacramento Bee, but we have provided it with this repository as a stable source for the data. library(tidyverse) library(tidymodels) library(gridExtra) sacramento <- read_csv('data/sacramento.csv') head(sacramento) ## # A tibble: 6 x 9 ## city zip beds baths sqft type price latitude longitude ## <chr> <chr> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> ## 1 SACRAMENTO z95838 2 1 836 Residential 59222 38.6 -121. ## 2 SACRAMENTO z95823 3 1 1167 Residential 68212 38.5 -121. ## 3 SACRAMENTO z95815 2 1 796 Residential 68880 38.6 -121. ## 4 SACRAMENTO z95815 2 1 852 Residential 69307 38.6 -121. ## 5 SACRAMENTO z95824 2 1 797 Residential 81900 38.5 -121. ## 6 SACRAMENTO z95841 3 1 1122 Condo 89921 38.7 -121. The purpose of this exercise is to understand whether we can we use house size to predict house sale price in the Sacramento, CA area. 
The columns in this data that we are interested in are sqft (house size, in livable square feet) and price (house price, in US dollars (USD). The first step is to visualize the data as a scatter plot where we place the predictor/explanatory variable (house size) on the x-axis, and we place the target/response variable that we want to predict (price) on the y-axis: eda <- ggplot(sacramento, aes(x = sqft, y = price)) + geom_point(alpha = 0.4) + xlab("House size (square footage)") + ylab("Price (USD)") + scale_y_continuous(labels = dollar_format()) eda Based on the visualization above, we can see that in Sacramento, CA, as the size of a house increases, so does its sale price. Thus, we can reason that we may be able to use the size of a not-yet-sold house (for which we don’t know the sale price) to predict its final sale price. 8.5 K-nearest neighbours regression Much like in the case of classification, we can use a K-nearest neighbours-based approach in regression to make predictions. Let’s take a small sample of the data above and walk through how K-nearest neighbours (knn) works in a regression context before we dive in to creating our model and assessing how well it predicts house price. This subsample is taken to allow us to illustrate the mechanics of K-NN regression with a few data points; later in this chapter we will use all the data. To take a small random sample of size 30, we’ll use the function sample_n. This function takes two arguments: tbl (a data frame-like object to sample from) size (the number of observations/rows to be randomly selected/sampled) set.seed(1234) small_sacramento <- sample_n(sacramento, size = 30) Next let’s say we come across a 2,000 square-foot house in Sacramento we are interested in purchasing, with an advertised list price of $350,000. Should we offer to pay the asking price for this house, or is it overpriced and we should offer less? Absent any other information, we can get a sense for a good answer to this question by using the data we have to predict the sale price given the sale prices we have already observed. But in the plot below, we have no observations of a house of size exactly 2000 square feet. How can we predict the price? small_plot <- ggplot(small_sacramento, aes(x = sqft, y = price)) + geom_point() + xlab("House size (square footage)") + ylab("Price (USD)") + scale_y_continuous(labels=dollar_format()) + geom_vline(xintercept = 2000, linetype = "dotted") small_plot We will employ the same intuition from the classification chapter, and use the neighbouring points to the new point of interest to suggest/predict what its price should be. For the example above, we find and label the 5 nearest neighbours to our observation of a house that is 2000 square feet: nearest_neighbours <- small_sacramento %>% mutate(diff = abs(2000 - sqft)) %>% arrange(diff) %>% head(5) nearest_neighbours ## # A tibble: 5 x 10 ## city zip beds baths sqft type price latitude longitude diff ## <chr> <chr> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> ## 1 GOLD_RIVER z95670 3 2 1981 Residential 305000 38.6 -121. 19 ## 2 ELK_GROVE z95758 4 2 2056 Residential 275000 38.4 -121. 56 ## 3 ELK_GROVE z95624 5 3 2136 Residential 223058 38.4 -121. 136 ## 4 RANCHO_CORDOVA z95742 4 2 1713 Residential 263500 38.6 -121. 287 ## 5 RIO_LINDA z95673 2 2 1690 Residential 136500 38.7 -121. 310 Now that we have the 5 nearest neighbours (in terms of house size) to our new 2,000 square-foot house of interest, we can use their values to predict a selling price for the new home. 
Specifically, we can take the mean (or average) of these 5 values as our predicted value. prediction <- nearest_neighbours %>% summarise(predicted = mean(price)) prediction ## # A tibble: 1 x 1 ## predicted ## <dbl> ## 1 240612. Our predicted price is $240612 (shown as a red point above), which is much less than $350,000; perhaps we might want to offer less than the list price at which the house is advertised. But this is only the very beginning of the story. We still have all the same unanswered questions here with K-NN regression that we had with K-NN classification: which \\(K\\) do we choose, and is our model any good at making predictions? In the next few sections, we will address these questions in the context of K-NN regression. 8.6 Training, evaluating, and tuning the model As usual, we must start by putting some test data away in a lock box that we will come back to only after we choose our final model. Let’s take care of that now. Note that for the remainder of the chapter we’ll be working with the entire Sacramento data set, as opposed to the smaller sample of 30 points above. set.seed(1234) sacramento_split <- initial_split(sacramento, prop = 0.6, strata = price) sacramento_train <- training(sacramento_split) sacramento_test <- testing(sacramento_split) Next, we’ll use cross-validation to choose \\(K\\). In K-NN classification, we used accuracy to see how well our predictions matched the true labels. Here in the context of K-NN regression we will use root mean square prediction error (RMSPE) instead. The mathematical formula for calculating RMSPE is: \\[\\text{RMSPE} = \\sqrt{\\frac{1}{n}\\sum\\limits_{i=1}^{n}(y_i - \\hat{y}_i)^2}\\] Where: \\(n\\) is the number of observations \\(y_i\\) is the observed value for the \\(i^\\text{th}\\) observation \\(\\hat{y}_i\\) is the forecasted/predicted value for the \\(i^\\text{th}\\) observation A key feature of the formula for RMSPE is the squared difference between the observed target/response variable value, \\(y_i\\), and the predicted target/response variable value, \\(\\hat{y}_i\\), for each observation (from \\(i = 1\\) to \\(n\\)). If the predictions are very close to the true values, then RMSPE will be small. If, on the other hand, the predictions are very different from the true values, then RMSPE will be quite large. When we use cross validation, we will choose the \\(K\\) that gives us the smallest RMSPE. RMSPE versus RMSE When using many code packages (tidymodels included), the evaluation output we will get to assess the prediction quality of our K-NN regression models is labelled “RMSE”, or “root mean squared error”. Why is this so, and why not just RMSPE? In statistics, we try to be very precise with our language to indicate whether we are calculating the prediction error on the training data (in-sample prediction) versus on the testing data (out-of-sample prediction). When predicting and evaluating prediction quality on the training data, we say RMSE. By contrast, when predicting and evaluating prediction quality on the testing or validation data, we say RMSPE. The equation for calculating RMSE and RMSPE is exactly the same; all that changes is whether the \\(y\\)s are training or testing data. But many people just use RMSE for both, and rely on context to denote which data the root mean squared error is being calculated on. Now that we know how we can assess how well our model predicts a numerical value, let’s use R to perform cross-validation and to choose the optimal \\(K\\). 
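As a brief aside before building the model, here is a toy numeric illustration of the RMSPE formula above; the observed and predicted values are invented purely for illustration.
observed  <- c(200000, 250000, 300000)
predicted <- c(210000, 240000, 330000)
sqrt(mean((observed - predicted)^2))  # about 19149, in the same units as the response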
First, we will create a model specification for K-nearest neighbours regression, as well as a recipe for preprocessing our data. Note that we use set_mode(\"regression\") now in the model specification to denote a regression problem, as opposed to the classification problems from the previous chapters. Note also that we include standardization in our preprocessing to build good habits, but since we only have one predictor it is technically not necessary; there is no risk of comparing two predictors of different scales. sacr_recipe <- recipe(price ~ sqft, data = sacramento_train) %>% step_scale(all_predictors()) %>% step_center(all_predictors()) sacr_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) %>% set_engine("kknn") %>% set_mode("regression") sacr_vfold <- vfold_cv(sacramento_train, v = 5, strata = price) sacr_wkflw <- workflow() %>% add_recipe(sacr_recipe) %>% add_model(sacr_spec) sacr_wkflw ## ══ Workflow ══════════════════════════════════════════════════════════════════════════════════════════════════════ ## Preprocessor: Recipe ## Model: nearest_neighbor() ## ## ── Preprocessor ────────────────────────────────────────────────────────────────────────────────────────────────── ## 2 Recipe Steps ## ## ● step_scale() ## ● step_center() ## ## ── Model ───────────────────────────────────────────────────────────────────────────────────────────────────────── ## K-Nearest Neighbor Model Specification (regression) ## ## Main Arguments: ## neighbors = tune() ## weight_func = rectangular ## ## Computational engine: kknn The major difference you can see in the above workflow compared to previous chapters is that we are running regression rather than classification. The fact that we use set_mode(\"regression\") essentially tells tidymodels that we need to use different metrics (RMSPE, not accuracy) for tuning and evaluation. You can see this in the following code, which tunes the model and returns the RMSPE for each number of neighbours. gridvals <- tibble(neighbors = seq(1,200)) sacr_results <- sacr_wkflw %>% tune_grid(resamples = sacr_vfold, grid = gridvals) %>% collect_metrics() # show all the results sacr_results ## # A tibble: 400 x 6 ## neighbors .metric .estimator mean n std_err ## <int> <chr> <chr> <dbl> <int> <dbl> ## 1 1 rmse standard 116654. 5 7186. ## 2 1 rsq standard 0.372 5 0.0463 ## 3 2 rmse standard 101103. 5 5872. ## 4 2 rsq standard 0.453 5 0.0386 ## 5 3 rmse standard 96708. 5 4211. ## 6 3 rsq standard 0.483 5 0.0277 ## 7 4 rmse standard 93927. 5 3502. ## 8 4 rsq standard 0.503 5 0.0298 ## 9 5 rmse standard 89642. 5 3768. ## 10 5 rsq standard 0.543 5 0.0303 ## # … with 390 more rows We take the minimum RMSPE to find the best setting for the number of neighbours: # show only the row of minimum RMSPE sacr_min <- sacr_results %>% filter(.metric == 'rmse') %>% filter(mean == min(mean)) sacr_min ## # A tibble: 1 x 6 ## neighbors .metric .estimator mean n std_err ## <int> <chr> <chr> <dbl> <int> <dbl> ## 1 14 rmse standard 84356. 5 4050. Here we can see that the smallest RMSPE occurs when \\(K =\\) 14. 8.7 Underfitting and overfitting Similar to the setting of classification, by setting the number of neighbours to be too small or too large, we cause the RMSPE to increase: What is happening here? To visualize the effect of different settings of \\(K\\) on the regression model, we will plot the predicted values for house price from our K-NN regression models for 6 different values for \\(K\\). 
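As an aside before looking at those prediction plots, the tuning results themselves can be visualized much like the accuracy-versus-\\(K\\) plot from the classification chapter. This is a sketch using the sacr_results data frame created above; rmspe_vs_k is our own name.
sacr_rmspe <- sacr_results %>%
  filter(.metric == "rmse")

rmspe_vs_k <- ggplot(sacr_rmspe, aes(x = neighbors, y = mean)) +
  geom_point() +
  geom_line() +
  labs(x = "Neighbors", y = "RMSPE estimate")
rmspe_vs_k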
For each model, we predict a price for every possible home size across the range of home sizes we observed in the data set (here 500 to 4250 square feet) and we plot the predicted prices as a blue line: Based on the plots above, we see that when \\(K\\) = 1, the blue line runs perfectly through almost all of our training observations. This happens because our predicted values for a given region depend on just a single observation. A model like this has high variance and low bias (intuitively, it provides unreliable predictions). It has high variance because the flexible blue line follows the training observations very closely, and if we were to change any one of the training observation data points we would change the flexible blue line quite a lot. This means that the blue line matches the data we happen to have in this training data set; however, if we were to collect another training data set from the Sacramento real estate market it likely wouldn’t match those observations as well. Another term that we use to collectively describe this phenomenon is overfitting. What about the plot where \\(K\\) is quite large, say \\(K\\) = 450, for example? When \\(K\\) = 450 for this data set, the blue line is extremely smooth, and almost flat. This happens because our predicted values for a given x value (here home size) depend on many, many (450) neighbouring observations. A model like this has low variance and high bias (intuitively, it provides very reliable, but generally very inaccurate predictions). It has low variance because the smooth, inflexible blue line does not follow the training observations very closely, and if we were to change any one of the training observation data points it really wouldn’t affect the shape of the smooth blue line at all. This means that although the blue line does not match the data we happen to have in this particular training data set perfectly, if we were to collect another training data set from the Sacramento real estate market it likely would match those observations equally as well as it matches those in this training data set. Another term that we use to collectively describe this kind of model is underfitting. Ideally, what we want is neither of the two situations discussed above. Instead, we would like a model with low variance (so that it will transfer/generalize well to other data sets, and isn’t too dependent on the observations that happen to be in the training set) and low bias (where the model does not completely ignore our training data). If we explore the other values for \\(K\\), in particular \\(K\\) = 14 (as suggested by cross-validation), we can see it has a lower bias than our model with a very high \\(K\\) (e.g., 450), and thus the model/predicted values better match the actual observed values than the high \\(K\\) model. Additionally, it has lower variance than our model with a very low \\(K\\) (e.g., 1) and thus it should better transfer/generalize to other data sets compared to the low \\(K\\) model. All of this is similar to how the choice of \\(K\\) affects K-NN classification (discussed in the previous chapter). 8.8 Evaluating on the test set To assess how well our model might do at predicting on unseen data, we will assess its RMSPE on the test data. To do this, we will first re-train our K-NN regression model on the entire training data set, using \\(K =\\) 14 neighbours. Then we will use predict to make predictions on the test data, and use the metrics function again to compute the summary of regression quality. 
Because we specify that we are performing regression in set_mode, the metrics function knows to output a quality summary related to regression, and not, say, classification. set.seed(1234) kmin <- sacr_min %>% pull(neighbors) sacr_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = kmin) %>% set_engine("kknn") %>% set_mode("regression") sacr_fit <- workflow() %>% add_recipe(sacr_recipe) %>% add_model(sacr_spec) %>% fit(data = sacramento_train) sacr_summary <- sacr_fit %>% predict(sacramento_test) %>% bind_cols(sacramento_test) %>% metrics(truth = price, estimate = .pred) sacr_summary ## # A tibble: 3 x 3 ## .metric .estimator .estimate ## <chr> <chr> <dbl> ## 1 rmse standard 87737. ## 2 rsq standard 0.546 ## 3 mae standard 65400. Our final model’s test error as assessed by RMSPE is 87737. But what does this RMSPE score mean? When we calculated test set prediction accuracy in K-NN classification, the highest possible value was 1 and the lowest possible value was 0. If we got a value close to 1, our model was “good;” and otherwise, the model was “not good.” What about RMSPE? Unfortunately there is no default scale for RMSPE. Instead, it is measured in the units of the target/response variable, and so it is a bit hard to interpret. For now, let’s consider this approach to thinking about RMSPE from our testing data set: as long as it’s not significantly worse than the cross-validation RMSPE of our best model, then we can say that we’re not doing too much worse on the test data than we did on the training data. So the model appears to be generalizing well to a new data set it has never seen before. In future courses on statistical/machine learning, you will learn more about how to interpret RMSPE from testing data and other ways to assess models. Finally, what does our model look like when we predict across all possible house sizes we might encounter in the Sacramento area? We plotted it above where we explored how \\(K\\) affects K-NN regression, but we show it again now, along with the code that generated it: set.seed(1234) sacr_preds <- sacr_fit %>% predict(sacramento_train) %>% bind_cols(sacramento_train) plot_final <- ggplot(sacr_preds, aes(x = sqft, y = price)) + geom_point(alpha = 0.4) + xlab("House size (square footage)") + ylab("Price (USD)") + scale_y_continuous(labels = dollar_format()) + geom_line(data = sacr_preds, aes(x = sqft, y = .pred), color = "blue") + ggtitle(paste0("K = ", kmin)) plot_final 8.9 Strengths and limitations of K-NN regression As with K-NN classification (or any prediction algorithm for that matter), K-NN regression has both strengths and weaknesses. Some are listed here: Strengths of K-NN regression Simple and easy to understand No assumptions about what the data must look like Works well with non-linear relationships (i.e., if the relationship is not a straight line) Limitations of K-NN regression As data gets bigger and bigger, K-NN gets slower and slower, quite quickly Does not perform well with a large number of predictors unless the size of the training set is exponentially larger Does not predict well beyond the range of values input in your training data 8.10 Multivariate K-NN regression As in K-NN classification, in K-NN regression we can have multiple predictors. When we have multiple predictors in K-NN regression, we have the same concern regarding the scale of the predictors. 
This is because once again, predictions are made by identifying the \\(K\\) observations that are nearest to the new point we want to predict, and any variables that are on a large scale will have a much larger effect than variables on a small scale. Since the recipe we built above scales and centers all predictor variables, this is handled for us. We will now demonstrate a multivariate K-NN regression analysis of the Sacramento real estate data using tidymodels. This time we will use house size (measured in square feet) as well as number of bedrooms as our predictors, and continue to use house sale price as our outcome/target variable that we are trying to predict. It is always a good practice to do exploratory data analysis, such as visualizing the data, before we start modeling the data. Thus the first thing we will do is use ggpairs (from the GGally package) to plot all the variables we are interested in using in our analyses: library(GGally) plot_pairs <- sacramento %>% select(price, sqft, beds) %>% ggpairs() plot_pairs From this we can see that generally, as both house size and number of bedrooms increase, so does price. Does adding the number of bedrooms to our model improve our ability to predict house price? To answer that question, we will have to come up with the test error for a K-NN regression model using house size and number of bedrooms, and then we can compare it to the test error for the model we previously came up with that only used house size to see if it is smaller (decreased test error indicates increased prediction quality). Let’s do that now! First we’ll build a new model specification and recipe for the analysis. Note that we use the formula price ~ sqft + beds to denote that we have two predictors, and set neighbors = tune() to tell tidymodels to tune the number of neighbours for us. sacr_recipe <- recipe(price ~ sqft + beds, data = sacramento_train) %>% step_scale(all_predictors()) %>% step_center(all_predictors()) sacr_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) %>% set_engine("kknn") %>% set_mode("regression") Next, we’ll use 5-fold cross-validation to choose the number of neighbours via the minimum RMSPE: gridvals <- tibble(neighbors = seq(1,200)) sacr_k <- workflow() %>% add_recipe(sacr_recipe) %>% add_model(sacr_spec) %>% tune_grid(sacr_vfold, grid = gridvals) %>% collect_metrics() %>% filter(.metric == 'rmse') %>% filter(mean == min(mean)) %>% pull(neighbors) sacr_k ## [1] 14 Here we see that the smallest RMSPE occurs when \\(K =\\) 14. Now that we have chosen \\(K\\), we need to re-train the model on the entire training data set, and after that we can use that model to predict on the test data to get our test error. sacr_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = sacr_k) %>% set_engine("kknn") %>% set_mode("regression") knn_mult_fit <- workflow() %>% add_recipe(sacr_recipe) %>% add_model(sacr_spec) %>% fit(data = sacramento_train) knn_mult_preds <- knn_mult_fit %>% predict(sacramento_test) %>% bind_cols(sacramento_test) knn_mult_mets <- metrics(knn_mult_preds, truth = price, estimate = .pred) knn_mult_mets ## # A tibble: 3 x 3 ## .metric .estimator .estimate ## <chr> <chr> <dbl> ## 1 rmse standard 85152. ## 2 rsq standard 0.572 ## 3 mae standard 63575. This time when we performed K-NN regression on the same data set, but also included number of bedrooms as a predictor we obtained a RMSPE test error of 85152. 
This compares to a RMSPE test error of 87737 when we used only house size as the single predictor. Thus in this case, we did not improve the model by a large amount by adding this additional predictor. We can also visualize the model’s predictions overlaid on top of the data. This time the predictions will be a surface in 3-D space, instead of a line in 2-D space, as we have 2 predictors instead of 1. We can see that the predictions in this case, where we have 2 predictors, form a surface instead of a line. Because the newly added predictor, number of bedrooms, is correlated with price (USD) (meaning as price changes, so does number of bedrooms) and not totally determined by house size (our other predictor), we get additional and useful information for making our predictions. For example, in this model we would predict that the cost of a house with a size of 2,500 square feet generally increases slightly as the number of bedrooms increases. Without having the additional predictor of number of bedrooms, we would predict the same price for these two houses. "],
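To make the last point above concrete, here is a sketch that predicts the price of a hypothetical 2,500 square-foot house as the number of bedrooms varies, using the knn_mult_fit workflow fitted above; the new_houses data frame is invented purely for illustration.
new_houses <- tibble(sqft = 2500, beds = c(2, 3, 4, 5))

knn_mult_fit %>%
  predict(new_houses) %>%
  bind_cols(new_houses)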
    +["regression2.html", "Chapter 9 Regression II: linear regression 9.1 Overview 9.2 Chapter learning objectives 9.3 Simple linear regression 9.4 Linear regression in R 9.5 Comparing simple linear and K-NN regression 9.6 Multivariate linear regression 9.7 The other side of regression 9.8 Additional readings/resources", " Chapter 9 Regression II: linear regression 9.1 Overview This chapter provides an introduction to linear regression models in a predictive context, focusing primarily on the case where there is a single predictor and single response variable of interest, as well as comparison to K-nearest neighbours methods. The chapter concludes with a discussion of linear regression with multiple predictors. 9.2 Chapter learning objectives By the end of the chapter, students will be able to: Perform linear regression in R using tidymodels and evaluate it on a test dataset. Compare and contrast predictions obtained from K-nearest neighbour regression to those obtained using simple ordinary least squares regression from the same dataset. In R, overlay regression lines from geom_smooth on a single plot. 9.3 Simple linear regression K-NN is not the only type of regression; another quite useful, and arguably the most common, type of regression is called simple linear regression. Simple linear regression is similar to K-NN regression in that the target/response variable is quantitative. However, one way it varies quite differently is how the training data is used to predict a value for a new observation. Instead of looking at the \\(K\\)-nearest neighbours and averaging over their values for a prediction, in simple linear regression all the training data points are used to create a straight line of best fit, and then the line is used to “look up” the predicted value. Note: for simple linear regression there is only one response variable and only one predictor. Later in this chapter we introduce the more general linear regression case where more than one predictor can be used. For example, let’s revisit the smaller version of the Sacramento housing data set. Recall that we have come across a new 2,000-square foot house we are interested in purchasing with an advertised list price of $350,000. Should we offer the list price, or is that over/undervalued? To answer this question using simple linear regression, we use the data we have to draw the straight line of best fit through our existing data points: The equation for the straight line is: \\[\\text{house price} = \\beta_0 + \\beta_1 \\cdot (\\text{house size}),\\] where \\(\\beta_0\\) is the vertical intercept of the line (the value where the line cuts the vertical axis) \\(\\beta_1\\) is the slope of the line Therefore using the data to find the line of best fit is equivalent to finding coefficients \\(\\beta_0\\) and \\(\\beta_1\\) that parametrize (correspond to) the line of best fit. Once we have the coefficients, we can use the equation above to evaluate the predicted price given the value we have for the predictor/explanatory variable—here 2,000 square feet. ## [1] 287961.9 By using simple linear regression on this small data set to predict the sale price for a 2,000 square foot house, we get a predicted value of $287962. But wait a minute…how exactly does simple linear regression choose the line of best fit? Many different lines could be drawn through the data points. 
We show some examples below: Simple linear regression chooses the straight line of best fit by choosing the line that minimizes the average vertical distance between itself and each of the observed data points. From the lines shown above, that is the blue line. What exactly do we mean by the vertical distance between the predicted values (which fall along the line of best fit) and the observed data points? We illustrate these distances in the plot below with a red line: To assess the predictive accuracy of a simple linear regression model, we use RMSPE—the same measure of predictive performance we used with K-NN regression. 9.4 Linear regression in R We can perform simple linear regression in R using tidymodels in a very similar manner to how we performed K-NN regression. To do this, instead of creating a nearest_neighbor model specification with the kknn engine, we use a linear_reg model specification with the lm engine. Another difference is that we do not need to choose \\(K\\) in the context of linear regression, and so we do not need to perform cross validation. Below we illustrate how we can use the usual tidymodels workflow to predict house sale price given house size using a simple linear regression approach on the full Sacramento real estate data set. An additional difference that you will notice below is that we do not standardize (i.e., scale and center) our predictors. In K-nearest neighbours models, recall that the model fit changes depending on whether we standardize first or not. In linear regression, standardization does not affect the fit (it does affect the coefficients in the equation, though!). So you can standardize if you want—it won’t hurt anything—but if you leave the predictors in their original form, the best fit coefficients are usually easier to interpret afterward. As usual, we start by putting some test data away in a lock box that we can come back to after we choose our final model. Let’s take care of that now. set.seed(1234) sacramento_split <- initial_split(sacramento, prop = 0.6, strata = price) sacramento_train <- training(sacramento_split) sacramento_test <- testing(sacramento_split) Now that we have our training data, we will create the model specification and recipe, and fit our simple linear regression model: lm_spec <- linear_reg() %>% set_engine("lm") %>% set_mode("regression") lm_recipe <- recipe(price ~ sqft, data = sacramento_train) lm_fit <- workflow() %>% add_recipe(lm_recipe) %>% add_model(lm_spec) %>% fit(data = sacramento_train) lm_fit ## ══ Workflow [trained] ══════════════════════════════════════════════════ ## Preprocessor: Recipe ## Model: linear_reg() ## ## ── Preprocessor ──────────────────────────────────────────────────────── ## 0 Recipe Steps ## ## ── Model ─────────────────────────────────────────────────────────────── ## ## Call: ## stats::lm(formula = formula, data = data) ## ## Coefficients: ## (Intercept) sqft ## 15059 138 Our coefficients are (intercept) \\(\\beta_0=\\) 15059 and (slope) \\(\\beta_1=\\) 138. This means that the equation of the line of best fit is \\[\\text{house price} = 15059 + 138\\cdot (\\text{house size}),\\] and that the model predicts that houses start at $15059 for 0 square feet, and that every extra square foot increases the cost of the house by $138. 
Finally, we predict on the test data set to assess how well our model does: lm_test_results <- lm_fit %>% predict(sacramento_test) %>% bind_cols(sacramento_test) %>% metrics(truth = price, estimate = .pred) lm_test_results ## # A tibble: 3 x 3 ## .metric .estimator .estimate ## <chr> <chr> <dbl> ## 1 rmse standard 85161. ## 2 rsq standard 0.572 ## 3 mae standard 62608. Our final model’s test error as assessed by RMSPE is 85161. Remember that this is in units of the target/response variable, and here that is US Dollars (USD). Does this mean our model is “good” at predicting house sale price based on the predictor of home size? Again, this is tricky to answer and requires using domain knowledge and thinking about the application you are using the prediction for. To visualize the simple linear regression model, we can plot the predicted house price across all possible house sizes we might encounter superimposed on a scatter plot of the original housing price data. There is a plotting function in the tidyverse, geom_smooth, that allows us to do this easily by adding a layer on our plot with the simple linear regression predicted line of best fit. The default for this adds a plausible range to this line that we are not interested in at this point, so to avoid plotting it, we provide the argument se = FALSE in our call to geom_smooth. lm_plot_final <- ggplot(sacramento_train, aes(x = sqft, y = price)) + geom_point(alpha = 0.4) + xlab("House size (square footage)") + ylab("Price (USD)") + scale_y_continuous(labels = dollar_format()) + geom_smooth(method = "lm", se = FALSE) lm_plot_final We can extract the coefficients from our model by accessing the fit object that is output by the fit function; we first have to extract it from the workflow using the pull_workflow_fit function, and then apply the tidy function to convert the result into a data frame: coeffs <- tidy(pull_workflow_fit(lm_fit)) coeffs ## # A tibble: 2 x 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 15059. 8745. 1.72 8.56e- 2 ## 2 sqft 138. 4.77 28.9 3.13e-113 9.5 Comparing simple linear and K-NN regression Now that we have a general understanding of both simple linear and K-NN regression, we can start to compare and contrast these methods as well as the predictions made by them. To start, let’s look at the visualization of the simple linear regression model predictions for the Sacramento real estate data (predicting price from house size) and the “best” K-NN regression model obtained from the same problem: What differences do we observe from the visualization above? One obvious difference is the shape of the blue lines. In simple linear regression we are restricted to a straight line, whereas in K-NN regression our line is much more flexible and can be quite wiggly. But there is a major interpretability advantage in limiting the model to a straight line. A straight line can be defined by two numbers, the vertical intercept and the slope. The intercept tells us what the prediction is when all of the predictors are equal to 0; and the slope tells us what unit increase in the target/response variable we predict given a unit increase in the predictor/explanatory variable. K-NN regression, as simple as it is to implement and understand, has no such interpretability from its wiggly line. 
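To make the interpretation above concrete, here is a small sketch that plugs a house size into the fitted equation using the coefficients extracted above; the 2,000 square-foot value and the beta0/beta1 names are our own illustrative choices.
beta0 <- coeffs %>% filter(term == "(Intercept)") %>% pull(estimate)
beta1 <- coeffs %>% filter(term == "sqft") %>% pull(estimate)
beta0 + beta1 * 2000  # predicted price for a 2,000 square-foot house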
There can however also be a disadvantage to using a simple linear regression model in some cases, particularly when the relationship between the target and the predictor is not linear, but instead some other shape (e.g. curved or oscillating). In these cases the prediction model from a simple linear regression will underfit (have high bias), meaning that the model/predicted values do not match the actual observed values very well. Such a model would probably have a quite high RMSE when assessing model goodness of fit on the training data and a quite high RMSPE when assessing model prediction quality on a test data set. On such a data set, K-NN regression may fare better. Additionally, there are other types of regression you can learn about in future courses that may do even better at predicting with such data. How do these two models compare on the Sacramento house prices data set? On the visualizations above we also printed the RMSPE as calculated from predicting on the test data set that was not used to train/fit the models. The RMSPE for the simple linear regression model is slightly lower than the RMSPE for the K-NN regression model. Considering that the simple linear regression model is also more interpretable, if we were comparing these in practice we would likely choose to use the simple linear regression model. Finally, note that the K-NN regression model becomes “flat” at the left and right boundaries of the data, while the linear model predicts a constant slope. Predicting outside the range of the observed data is known as extrapolation; K-NN and linear models behave quite differently when extrapolating. Depending on the application, the flat or constant slope trend may make more sense. For example, if our housing data were slightly different, the linear model may have actually predicted a negative price for a small house (if the intercept \\(\\beta_0\\) was negative), which obviously does not match reality. On the other hand, the trend of increasing house size corresponding to increasing house price probably continues for large houses, so the “flat” extrapolation of K-NN likely does not match reality. 9.6 Multivariate linear regression As in K-NN classification and K-NN regression, we can move beyond the simple case of one response variable and only one predictor and perform multivariate linear regression where we can have multiple predictors. In this case we fit a plane to the data, as opposed to a straight line. To do this, we follow a very similar approach to what we did for K-NN regression; but recall that we do not need to use cross-validation to choose any parameters, nor do we need to standardize (i.e., center and scale) the data for linear regression. We demonstrate how to do this below using the Sacramento real estate data with both house size (measured in square feet) as well as number of bedrooms as our predictors, and continue to use house sale price as our outcome/target variable that we are trying to predict. 
We will start by changing the formula in the recipe to include both the sqft and beds variables as predictors: lm_recipe <- recipe(price ~ sqft + beds, data = sacramento_train) Now we can build our workflow and fit the model: lm_fit <- workflow() %>% add_recipe(lm_recipe) %>% add_model(lm_spec) %>% fit(data = sacramento_train) lm_fit ## ══ Workflow [trained] ════════════════════════════════════════════════════════════════════════════════════════════ ## Preprocessor: Recipe ## Model: linear_reg() ## ## ── Preprocessor ────────────────────────────────────────────────────────────────────────────────────────────────── ## 0 Recipe Steps ## ## ── Model ───────────────────────────────────────────────────────────────────────────────────────────────────────── ## ## Call: ## stats::lm(formula = formula, data = data) ## ## Coefficients: ## (Intercept) sqft beds ## 52690.1 154.8 -20209.4 And finally, we predict on the test data set to assess how well our model does: lm_mult_test_results <- lm_fit %>% predict(sacramento_test) %>% bind_cols(sacramento_test) %>% metrics(truth = price, estimate = .pred) lm_mult_test_results ## # A tibble: 3 x 3 ## .metric .estimator .estimate ## <chr> <chr> <dbl> ## 1 rmse standard 82835. ## 2 rsq standard 0.596 ## 3 mae standard 61008. In the case of two predictors, our linear regression creates a plane of best fit, shown below: We see that the predictions from linear regression with two predictors form a flat plane. This is the hallmark of linear regression, and differs from the wiggly, flexible surface we get from other methods such as K-NN regression. As discussed this can be advantageous in one aspect, which is that for each predictor, we can get slopes/intercept from linear regression, and thus describe the plane mathematically. We can extract those slope values from our model object as shown below: coeffs <- tidy(pull_workflow_fit(lm_fit)) coeffs ## # A tibble: 3 x 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 52690. 13745. 3.83 1.41e- 4 ## 2 sqft 155. 6.72 23.0 4.46e-83 ## 3 beds -20209. 5734. -3.52 4.59e- 4 And then use those slopes to write a mathematical equation to describe the prediction plane: \\[\\text{house price} = \\beta_0 + \\beta_1\\cdot(\\text{house size}) + \\beta_2\\cdot(\\text{number of bedrooms}),\\] where: \\(\\beta_0\\) is the vertical intercept of the hyperplane (the value where it cuts the vertical axis) \\(\\beta_1\\) is the slope for the first predictor (house size) \\(\\beta_2\\) is the slope for the second predictor (number of bedrooms) Finally, we can fill in the values for \\(\\beta_0\\), \\(\\beta_1\\) and \\(\\beta_2\\) from the model output above to create the equation of the plane of best fit to the data: \\[\\text{house price} = 52690 + 155\\cdot (\\text{house size}) -20209 \\cdot (\\text{number of bedrooms})\\] This model is more interpretable than the multivariate K-NN regression model; we can write a mathematical equation that explains how each predictor is affecting the predictions. But as always, we should look at the test error and ask whether linear regression is doing a better job of predicting compared to K-NN regression in this multivariate regression case. To do that we can use this linear regression model to predict on the test data to get our test error. lm_mult_test_results ## # A tibble: 3 x 3 ## .metric .estimator .estimate ## <chr> <chr> <dbl> ## 1 rmse standard 82835. ## 2 rsq standard 0.596 ## 3 mae standard 61008. 
We get an RMSPE for the multivariate linear regression model of 82835.42. This prediction error is less than the prediction error for the multivariate K-NN regression model, indicating that we should likely choose linear regression for predictions of house price on this data set. But we should also ask if this more complex model is doing a better job of predicting compared to our simple linear regression model with only a single predictor (house size). Revisiting the last section, we see that our RMSPE for our simple linear regression model with only a single predictor was 85160.85, which is slightly more than that of our more complex model. Our model with two predictors provided a slightly better fit on test data than our model with just one. But should we always end up choosing a model with more predictors than fewer? The answer is no; you never know which model will be the best until you go through the process of comparing their performance on held-out test data. Exploratory data analysis can give you some hints, but until you look at the prediction errors to compare the models you don’t really know. Additionally, here we compare test errors purely for the purposes of teaching. In practice, when you want to compare several regression models with differing numbers of predictor variables, you should use cross-validation on the training set only; in this case choosing the model is part of tuning, so you cannot use the test data. There are several well-known and more advanced methods to do this that are beyond the scope of this course, and they include backward or forward selection, and L1 or L2 regularization (also known as Lasso and ridge regression, respectively). 9.7 The other side of regression So far in this textbook we have used regression only in the context of prediction. However, regression is also a powerful method to understand and/or describe the relationship between a quantitative response variable and one or more explanatory variables. Extending the case we have been working with in this chapter (where we are interested in house price as the outcome/response variable), we might also be interested in describing the individual effects of house size and the number of bedrooms on house price, quantifying how big each of these effects are, and assessing how accurately we can estimate each of these effects. This side of regression is the topic of many follow-on statistics courses and beyond the scope of this course. 9.8 Additional readings/resources Pages 59-71 of An Introduction to Statistical Learning with Applications in R by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani Pages 104 - 109 of An Introduction to Statistical Learning with Applications in R by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani The caret Package Chapters 6 - 11 of Modern Dive: Statistical Inference via Data Science by Chester Ismay and Albert Y. Kim "],
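To illustrate the cross-validation-based model comparison described in the chapter above, here is a hedged sketch that estimates the RMSPE of the one-predictor and two-predictor linear regressions on the training set only, reusing the lm_spec model specification and the sacr_vfold folds created earlier; the recipe names here are our own.
lm_recipe_1 <- recipe(price ~ sqft, data = sacramento_train)
lm_recipe_2 <- recipe(price ~ sqft + beds, data = sacramento_train)

# cross-validation estimates for the single-predictor model
workflow() %>%
  add_recipe(lm_recipe_1) %>%
  add_model(lm_spec) %>%
  fit_resamples(resamples = sacr_vfold) %>%
  collect_metrics()

# cross-validation estimates for the two-predictor model
workflow() %>%
  add_recipe(lm_recipe_2) %>%
  add_model(lm_spec) %>%
  fit_resamples(resamples = sacr_vfold) %>%
  collect_metrics()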
    +["clustering.html", "Chapter 10 Clustering 10.1 Overview 10.2 Chapter learning objectives 10.3 Clustering 10.4 K-means 10.5 K-means in R 10.6 Additional readings:", " Chapter 10 Clustering 10.1 Overview As part of exploratory data analysis, it is often helpful to see if there are meaningful subgroups (or clusters) in the data; this grouping can be used to for many purposes, such as generating new questions or improving predictive analyses. This chapter provides an introduction to clustering using the K-means algorithm, including techniques to choose the number of clusters as well as other practical considerations (such as scaling). 10.2 Chapter learning objectives By the end of the chapter, students will be able to: Describe a case where clustering is appropriate, and what insight it might extract from the data. Explain the K-means clustering algorithm. Interpret the output of a K-means analysis. Perform kmeans clustering in R using kmeans. Visualize the output of K-means clustering in R using pair-wise scatter plots. Identify when it is necessary to scale variables before clustering and do this using R. Use the elbow method to choose the number of clusters for K-means. Describe advantages, limitations and assumptions of the K-means clustering algorithm. 10.3 Clustering Clustering is a data analysis task involving separating a data set into subgroups of related data. For example, we might use clustering to separate a dataset of documents into groups that correspond to topics, a dataset of human genetic information into groups that correspond to ancestral subpopulations, or a dataset of online customers into groups that correspond to purchasing behaviours. Once the data are separated we can, for example, use the subgroups to generate new questions about the data and follow up with a predictive modelling exercise. In this course, clustering will be used only for exploratory analysis, i.e., uncovering patterns in the data that we have. Note that clustering is a fundamentally different kind of task than classification or regression. Most notably, both classification and regression are supervised tasks where there is a predictive target (a class label or value), and we have examples of past data with labels/values that help us predict those of future data. By contrast, clustering is an unsupervised task, as we are trying to understand and examine the structure of data without any labels to help us. This approach has both advantages and disadvantages. Clustering requires no additional annotation or input on the data; for example, it would be nearly impossible to annotate all the articles on Wikipedia with human-made topic labels, but we can still cluster the articles without this information to automatically find groupings corresponding to topics. However, because there is no predictive target, it is not as easy to evaluate the “quality” of a clustering. With classification, we are able to use a test data set to assess prediction performance. In clustering, there is not a single good choice for evaluation. In this book, we will use visualization to ascertain the quality of a clustering, and leave rigorous evaluation for more advanced courses. There are also so-called semisupervised tasks, where only some of the data come with labels / annotations, but the vast majority don’t. The goal is to try to uncover underlying structure in the data that allows one to guess the missing labels. 
This sort of task is very useful, for example, when one has an unlabelled data set that is too large to manually label, but one is willing to provide a few informative example labels as a “seed” to guess the labels for all the data. An illustrative example Suppose we have customer data with two variables measuring customer loyalty and satisfaction, and we want to learn whether there are distinct “types” of customer. Understanding this might help us come up with better products or promotions to improve our business in a data-driven way. head(marketing_data) ## # A tibble: 6 x 2 ## loyalty csat ## <dbl> <dbl> ## 1 7 1 ## 2 7.5 1 ## 3 8 2 ## 4 7 2 ## 5 8 3 ## 6 3 2 Figure 10.1: Modified from http://www.segmentationstudyguide.com/using-cluster-analysis-for-market-segmentation/ Based on this visualization, we might suspect there are a few subtypes of customer, selected from combinations of high/low satisfaction and high/low loyalty. How do we find this grouping automatically, and how do we pick the number of subtypes to look for? The way to rigorously separate the data into groups is to use a clustering algorithm. In this chapter, we will focus on the K-means algorithm, a widely-used and often very effective clustering method, combined with the elbow method for selecting the number of clusters. This procedure will separate the data into the following groups denoted by colour: What are the labels for these groups? Unfortunately, we don’t have any. K-means, like almost all clustering algorithms, just outputs meaningless “cluster labels” that are typically whole numbers: 1, 2, 3, etc. But in a simple case like this, where we can easily visualize the clusters on a scatter plot, we can give human-made labels to the groups using their positions on the plot: low loyalty and low satisfaction (green cluster), high loyalty and low satisfaction (pink cluster), and high loyalty and high satisfaction (blue cluster). Once we have made these determinations, we can use them to inform our future business decisions, or to ask further questions about our data. For example, here we might notice based on our clustering that there aren’t any customers with high satisfaction but low loyalty, and generate new analyses or business strategies based on this information. 10.4 K-means 10.4.1 Measuring cluster quality The K-means algorithm is a procedure that groups data into K clusters. It starts with an initial clustering of the data, and then iteratively improves it by making adjustments to the assignment of data to clusters until it cannot improve any further. But how do we measure the “quality” of a clustering, and what does it mean to improve it? In K-means clustering, we measure the quality of a cluster by its within-cluster sum-of-squared-distances (WSSD). Computing this involves two steps. First, we find the cluster centers by computing the mean of each variable over data points in the cluster. For example, suppose we have a cluster containing 3 observations, and we are using two variables, \\(x\\) and \\(y\\), to cluster the data. Then we would compute the \\(x\\) and \\(y\\) variables, \\(\\mu_x\\) and \\(\\mu_y\\), of the cluster center via \\[\\mu_x = \\frac{1}{3}(x_1+x_2+x_3) \\quad \\mu_y = \\frac{1}{3}(y_1+y_2+y_3).\\] In the first cluster from the customer satisfaction/loyalty example, there are 5 data points. These are shown with their cluster center (csat = 1.8 and loyalty = 7.5) highlighted below. Figure 10.2: Cluster 1 from the toy example, with center highlighted. 
The second step in computing the WSSD is to add up the squared distance between each point in the cluster and the cluster center. We use the straight-line / Euclidean distance formula that we learned about in the classification chapter. In the 3-observation cluster example above, we would compute the WSSD \\(S^2\\) via \\[S^2 = \\left((x_1 - \\mu_x)^2 + (y_1 - \\mu_y)^2\\right) + \\left((x_2 - \\mu_x)^2 + (y_2 - \\mu_y)^2\\right) +\\left((x_3 - \\mu_x)^2 + (y_3 - \\mu_y)^2\\right).\\] These distances are denoted by lines for the first cluster of the customer satisfaction/loyalty data example below. Figure 10.3: Cluster 1 from the toy example, with distances to the center highlighted. The larger the value of \\(S^2\\), the more spread-out the cluster is, since large \\(S^2\\) means that points are far away from the cluster center. Note, however, that “large” is relative to both the scale of the variables for clustering and the number of points in the cluster; a cluster where points are very close to the center might still have a large \\(S^2\\) if there are many data points in the cluster. 10.4.2 The clustering algorithm The K-means algorithm is quite simple. We begin by picking K, and uniformly randomly assigning data to the K clusters. Then K-means consists of two major steps that attempt to minimize the sum of WSSDs over all the clusters, i.e. the total WSSD: Center update: Compute the center of each cluster. Label update: Reassign each data point to the cluster with the nearest center. These two steps are repeated until the cluster assignments no longer change. For example, in the customer data example from earlier, our initialization might look like this: Figure 10.4: Random initialization of labels. And the first three iterations of K-means would look like (each row corresponds to an iteration, where the left column depicts the center update, and the right column depicts the reassignment of data to clusters): Center Update Label Update Note that at this point we can terminate the algorithm, since none of the assignments changed in the third iteration; both the centers and labels will remain the same from this point onward. Is K-means guaranteed to stop at some point, or could it iterate forever? As it turns out, the answer is thankfully that K-means is guaranteed to stop after some number of iterations. For the interested reader, the logic for this has three steps: (1) both the label update and the center update decrease total WSSD in each iteration, (2) the total WSSD is always greater than or equal to 0, and (3) there are only a finite number of possible ways to assign the data to clusters. So at some point, the total WSSD must stop decreasing, which means none of the assignments are changing and the algorithm terminates. 10.4.3 Random restarts K-means, unlike the classification and regression models we studied in previous chapters, can get “stuck” in a bad solution. For example, if we were unlucky and initialized K-means with the following labels: Figure 10.5: Random initialization of labels. Then the iterations of K-means would look like: Center Update Label Update This looks like a relatively bad clustering of the data, but K-means cannot improve it. To solve this problem when clustering data using K-means, we should randomly re-initialize the labels a few times, run K-means for each initialization, and pick the clustering that has the lowest final total WSSD. 10.4.4 Choosing K In order to cluster data using K-means, we also have to pick the number of clusters, K. 
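To make the two-step procedure concrete, here is a minimal, hand-rolled sketch of the K-means loop in R. It is for illustration only (in practice we use the built-in kmeans function, as shown later in the chapter), and it uses a hypothetical toy data frame rather than the customer data:

```r
library(tidyverse)

set.seed(2021)
dat <- tibble(x = rnorm(30), y = rnorm(30))        # hypothetical toy data
K <- 3
labels <- sample(1:K, nrow(dat), replace = TRUE)   # random initial assignment

repeat {
  # Center update: compute the mean of each variable within each cluster
  centers <- dat %>%
    mutate(label = labels) %>%
    group_by(label) %>%
    summarize(center_x = mean(x), center_y = mean(y), .groups = "drop")

  # Label update: reassign each point to the cluster with the nearest center
  new_labels <- map_int(seq_len(nrow(dat)), function(i) {
    dists <- (dat$x[i] - centers$center_x)^2 + (dat$y[i] - centers$center_y)^2
    centers$label[which.min(dists)]
  })

  # Stop when the assignments no longer change
  if (all(new_labels == labels)) break
  labels <- new_labels
}

# Total WSSD of the final clustering: sum of squared distances to cluster centers
dat %>%
  mutate(label = labels) %>%
  left_join(centers, by = "label") %>%
  summarize(total_WSSD = sum((x - center_x)^2 + (y - center_y)^2))
```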
But unlike in classification, we have no data labels and cannot perform cross-validation with some measure of model prediction error. Further, if K is chosen too small, then multiple clusters get grouped together; if K is too large, then clusters get subdivided. In both cases, we will potentially miss interesting structure in the data. For example, take a look below at the K-means clustering of our customer satisfaction and loyalty data for a number of clusters ranging from 1 to 9. Figure 10.6: Clustering of the customer data for # clusters ranging from 1 to 9. If we set K less than 3, then the clustering merges separate groups of data; this causes a large total WSSD, since the cluster center (denoted by an “x”) is not close to any of the data in the cluster. On the other hand, if we set K greater than 3, the clustering subdivides subgroups of data; this does indeed still decrease the total WSSD, but by only a diminishing amount. If we plot the total WSSD versus the number of clusters, we see that the decrease in total WSSD levels off (or forms an “elbow shape”) when we reach roughly the right number of clusters. Figure 10.7: Total WSSD for # clusters ranging from 1 to 9. 10.5 K-means in R To perform K-means clustering in R, we use the kmeans function. It takes at least two arguments: the data frame containing the data you wish to cluster, and K, the number of clusters (here we choose K = 3). Note that since the K-means algorithm uses a random initialization of assignments, we need to set the random seed to make the clustering reproducible. set.seed(1234) marketing_clust <- kmeans(marketing_data, centers = 3) marketing_clust ## K-means clustering with 3 clusters of sizes 2, 10, 7 ## ## Cluster means: ## loyalty csat label ## 1 9.000000 8.500000 2.0 ## 2 4.950000 2.500000 1.5 ## 3 6.142857 7.142857 2.0 ## ## Clustering vector: ## [1] 2 2 2 2 2 2 2 2 2 3 3 3 3 3 1 3 1 3 2 ## ## Within cluster sum of squares by cluster: ## [1] 0.50000 84.22500 11.71429 ## (between_SS / total_SS = 60.6 %) ## ## Available components: ## ## [1] "cluster" "centers" "totss" "withinss" "tot.withinss" "betweenss" "size" ## [8] "iter" "ifault" As you can see above, the clustering object returned has a lot of information about our analysis that we need to explore. Let’s take a look at it now. To do this, we will call in help from the broom package so that we get the model output back in a tidy data format. Let’s start by getting the cluster identification for each point and plotting that on the scatter plot. To do that we use the augment function. Augment takes in the model and the original data frame, and returns a data frame with the data and the cluster assignments for each point: clustered_data <- augment(marketing_clust, marketing_data) head(clustered_data) ## # A tibble: 6 x 4 ## loyalty csat label .cluster ## <dbl> <dbl> <chr> <fct> ## 1 7 1 2 2 ## 2 7.5 1 2 2 ## 3 8 2 2 2 ## 4 7 2 2 2 ## 5 8 3 2 2 ## 6 3 2 1 2 Now that we have this data frame, we can easily plot the data (i.e., cluster assignments of each point): cluster_plot <- ggplot(clustered_data, aes(x = csat, y = loyalty, colour = .cluster), size=2) + geom_point() + labs(x = 'Customer satisfaction', y = 'Loyalty', colour = 'Cluster') cluster_plot As mentioned above, we need to choose a K to perform K-means clustering by finding where the “elbow” occurs in the plot of total WSSD versus number of clusters. We can get at the total WSSD (tot.withinss) from our clustering using broom’s glance function (it gives model-level statistics). 
For example: glance(marketing_clust) ## # A tibble: 1 x 4 ## totss tot.withinss betweenss iter ## <dbl> <dbl> <dbl> <int> ## 1 245. 96.4 148. 2 To calculate the total WSSD for a variety of Ks, we will create a data frame with a column named k with rows containing the numbers of clusters we want to run K-means with (here, 1 to 9). Then we use map to apply the kmeans function to each K. We also use map to then apply glance to each of the clusterings. This results in a complex data frame with 3 columns, one for K, one for the models, and one for the model statistics (output of glance, which is a data frame): marketing_clust_ks <- tibble(k = 1:9) %>% mutate(marketing_clusts = map(k, ~kmeans(marketing_data, .x)), glanced = map(marketing_clusts, glance)) head(marketing_clust_ks) ## # A tibble: 6 x 3 ## k marketing_clusts glanced ## <int> <list> <list> ## 1 1 <kmeans> <tibble [1 × 4]> ## 2 2 <kmeans> <tibble [1 × 4]> ## 3 3 <kmeans> <tibble [1 × 4]> ## 4 4 <kmeans> <tibble [1 × 4]> ## 5 5 <kmeans> <tibble [1 × 4]> ## 6 6 <kmeans> <tibble [1 × 4]> We now extract the total WSSD from the glanced column. Given that each item in this column is a data frame, we will need to use the unnest function to unpack the data frames. clustering_statistics <- marketing_clust_ks %>% unnest(glanced) head(clustering_statistics) ## # A tibble: 6 x 6 ## k marketing_clusts totss tot.withinss betweenss iter ## <int> <list> <dbl> <dbl> <dbl> <int> ## 1 1 <kmeans> 245. 245. -1.42e-13 1 ## 2 2 <kmeans> 245. 144. 1.01e+ 2 1 ## 3 3 <kmeans> 245. 39.6 2.05e+ 2 1 ## 4 4 <kmeans> 245. 35.4 2.09e+ 2 1 ## 5 5 <kmeans> 245. 21.5 2.23e+ 2 2 ## 6 6 <kmeans> 245. 15.8 2.29e+ 2 2 Now that we have tot.withinss and k as columns in a data frame, we can make a line plot and search for the “elbow” to find which value of K to use. elbow_plot <- ggplot(clustering_statistics, aes(x = k, y = tot.withinss)) + geom_point() + geom_line() + xlab("K") + ylab("Total within-cluster sum of squares")+ scale_x_continuous(breaks = 1:9) elbow_plot It looks like 3 clusters is the right choice for this data. But why is there a “bump” in the total WSSD plot here? Shouldn’t total WSSD always decrease as we add more clusters? Technically yes, but remember: K-means can get “stuck” in a bad solution. Unfortunately, for K = 6 we had an unlucky initialization and found a bad clustering! We can help prevent finding a bad clustering by trying a few different random initializations via the nstart argument (here we use 10 restarts). marketing_clust_ks <- tibble(k = 1:9) %>% mutate(marketing_clusts = map(k, ~kmeans(marketing_data, nstart = 10, .x)), glanced = map(marketing_clusts, glance)) clustering_statistics <- marketing_clust_ks %>% unnest(glanced) elbow_plot <- ggplot(clustering_statistics, aes(x = k, y = tot.withinss)) + geom_point() + geom_line() + xlab("K") + ylab("Total within-cluster sum of squares")+ scale_x_continuous(breaks = 1:9) elbow_plot 10.6 Additional readings: For more about clustering and K-means, refer to pages 385-390 and 404-405 of Introduction to Statistical Learning with Applications in R by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani, as well as the companion video linked to below: "],
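One learning objective above is to scale variables before clustering, but the chapter's example uses variables that are already on comparable scales, so no scaling code is shown. Below is a minimal sketch, under the assumption that the chapter's marketing_data frame (or any data frame with numeric columns) is the input, of how standardization could be done before calling kmeans:

```r
library(tidyverse)

# Standardize every numeric column to mean 0 and standard deviation 1 so that
# no single variable dominates the Euclidean distances used by K-means
scaled_data <- marketing_data %>%
  select(where(is.numeric)) %>%
  mutate(across(everything(), ~ (.x - mean(.x)) / sd(.x)))

set.seed(1234)
scaled_clust <- kmeans(scaled_data, centers = 3, nstart = 10)
```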
    +["inference.html", "Chapter 11 Introduction to Statistical Inference 11.1 Overview 11.2 Chapter learning objectives 11.3 Why do we need sampling? 11.4 Sampling distributions 11.5 Bootstrapping 11.6 Additional readings", " Chapter 11 Introduction to Statistical Inference 11.1 Overview A typical data analysis task in practice is to draw conclusions about some unknown aspect of a population of interest based on observed data sampled from that population; we typically do not get data on the entire population. Data analysis questions regarding how summaries, patterns, trends, or relationships in a data set extend to the wider population are called inferential questions. This chapter will start with the fundamental ideas of sampling from populations and then introduce two common techniques in statistical inference: point estimation and interval estimation. 11.2 Chapter learning objectives By the end of the chapter, students will be able to: Describe real-world examples of questions that can be answered with the statistical inference. Define common population parameters (e.g. mean, proportion, standard deviation) that are often estimated using sampled data, and estimate these from a sample. Define the following statistical sampling terms (population, sample, population parameter, point estimate, sampling distribution). Explain the difference between a population parameter and sample point estimate. Use R to draw random samples from a finite population. Use R to create a sampling distribution from a finite population. Describe how sample size influences the sampling distribution. Define bootstrapping. Use R to create a bootstrap distribution to approximate a sampling distribution. Contrast the bootstrap and sampling distributions. 11.3 Why do we need sampling? Statistical inference can help us decide how quantities we observe in a subset of data relate to the same quantities in the broader population. Suppose a retailer is considering selling iPhone accessories, and they want to estimate how big the market might be. Additionally, they want to strategize how they can market their products on North American college and university campuses. This retailer might use statistical inference to answer the question: What proportion of all undergraduate students in North America own an iPhone? In the above question, we are interested in making a conclusion about all undergraduate students in North America; this is our population. In general, the population is the complete collection of individuals or cases we are interested in studying. Further, in the above question, we are interested in computing a quantity—the proportion of iPhone owners—based on the entire population. This is our population parameter. In general, a population parameter is a numerical characteristic of the entire population. To compute this number in the example above, we would need to ask every single undergraduate in North America whether or not they own an iPhone. In practice, directly computing population parameters is often time-consuming and costly, and sometimes impossible. A more practical approach would be to collect measurements for a sample: a subset of individuals collected from the population. We can then compute a sample estimate—a numerical characteristic of the sample—that estimates the population parameter. For example, suppose we randomly selected 100 undergraduate students across North America (the sample) and computed the proportion of those students who own an iPhone (the sample estimate). 
In that case, we might suspect that that proportion is a reasonable estimate of the proportion of students who own an iPhone in the entire population. Figure 11.1: Population versus sample Note that proportions are not the only kind of population parameter we might be interested in. Suppose an undergraduate student studying at the University of British Columbia in Vancouver, British Columbia, is looking for an apartment to rent. They need to create a budget, so they want to know something about studio apartment rental prices in Vancouver, BC. This student might use statistical inference to tackle the question: What is the average price-per-month of studio apartment rentals in Vancouver, Canada? The population consists of all studio apartment rentals in Vancouver, and the population parameter is the average price-per-month. Here we used the average as a measure of center to describe the “typical value” of studio apartment rental prices. But even within this one example, we could also be interested in many other population parameters. For instance, we know that not every studio apartment rental in Vancouver will have the same price-per-month. The student might be interested in how much monthly prices vary and want to find a measure of the rentals’ spread (or variability), such as the standard deviation. We might be interested in the fraction of studio apartment rentals that cost more than $1000 per month. And the list of population parameters we might want to calculate goes on. The question we want to answer will help us determine the parameter we want to estimate. If we were somehow able to observe the whole population of studio apartment rental offerings in Vancouver, we could compute each of these numbers exactly; therefore, these are all population parameters. There are many kinds of observations and population parameters that you will run into in practice, but in this chapter, we will focus on two settings: Using categorical observations to estimate the proportion of each category Using quantitative observations to estimate the average (or mean) 11.4 Sampling distributions 11.4.1 Sampling distributions for proportions Let’s start with an illustrative (and tasty!) example. Timbits are bite-sized doughnuts sold at Tim Hortons, a popular Canadian-based fast-food restaurant chain founded in Hamilton, Ontario, Canada. Figure 11.2: Timbits. Source: wikimedia.org Suppose we wanted to estimate the true proportion of chocolate doughnuts at Tim Hortons restaurants. Now, of course, we (the authors!) do not have access to the true population. So for this chapter, we created a fictitious box of 10,000 Timbits with two flavours—old-fashioned and chocolate—as our population, and use this to illustrate inferential concepts. Below we have a tibble() called virtual_box with a Timbit ID and flavour as our columns. We have also loaded our necessary packages: tidyverse and the infer package, which we will need to perform sampling later in the chapter. library(tidyverse) library(infer) virtual_box ## # A tibble: 10,000 x 2 ## timbit_id flavour ## <dbl> <fct> ## 1 1 chocolate ## 2 2 chocolate ## 3 3 chocolate ## 4 4 chocolate ## 5 5 old fashioned ## 6 6 old fashioned ## 7 7 chocolate ## 8 8 chocolate ## 9 9 old fashioned ## 10 10 chocolate ## # … with 9,990 more rows From our simulated box, we can see that the proportion of chocolate Timbits is 0.63. This value, 0.63, is the population parameter. Note that this parameter value is usually unknown in real data analysis problems. 
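The chapter uses the simulated population virtual_box without showing how it was built. One possible construction (a sketch only; the seed and exact code are assumptions, not the authors' own) that reproduces the flavour counts reported below is:

```r
library(tidyverse)

set.seed(2020)  # hypothetical seed; any seed gives a similar population

# A population of 10,000 Timbits: 6,295 chocolate and 3,705 old fashioned,
# shuffled into a random order
virtual_box <- tibble(
  timbit_id = 1:10000,
  flavour = factor(sample(c(rep("chocolate", 6295),
                            rep("old fashioned", 3705))))
)
```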
virtual_box %>% group_by(flavour) %>% summarize(n = n(), proportion = n() / 10000) ## # A tibble: 2 x 3 ## flavour n proportion ## <fct> <int> <dbl> ## 1 old fashioned 3705 0.370 ## 2 chocolate 6295 0.630 What would happen if we were to buy a box of 40 randomly-selected Timbits and count the number of chocolate Timbits (i.e., take a random sample of size 40 from our Timbits population)? Let’s use R to simulate this using our virtual_box population. We can do this using the rep_sample_n function from the infer package. The arguments of rep_sample_n are (1) the data frame (or tibble) to sample from, and (2) the size of the sample to take. set.seed(1) samples_1 <- rep_sample_n(tbl = virtual_box, size = 40) choc_sample_1 <- summarize(samples_1, n = sum(flavour == "chocolate"), prop = sum(flavour == "chocolate") / 40) choc_sample_1 ## # A tibble: 1 x 3 ## replicate n prop ## <int> <int> <dbl> ## 1 1 29 0.725 Here we see that the proportion of chocolate Timbits in this random sample is 0.72. This value is our estimate — our best guess of our population parameter using this sample. Given that it is a single value that we are estimating, we often refer to it as a point estimate. Now imagine we took another random sample of 40 Timbits from the population. Do you think we would get the same proportion? Let’s try sampling from the population again and see what happens. set.seed(2) samples_2 <- rep_sample_n(virtual_box, size = 40) choc_sample_2 <- summarize(samples_2, n = sum(flavour == "chocolate"), prop = sum(flavour == "chocolate") / 40) choc_sample_2 ## # A tibble: 1 x 3 ## replicate n prop ## <int> <int> <dbl> ## 1 1 27 0.675 Notice that we get a different value for our estimate this time. The proportion of chocolate Timbits in this sample is 0.68. If we were to do this again, another random sample could also give a different result. Estimates vary from sample to sample due to sampling variability. But just how much should we expect the estimates of our random samples to vary? In order to understand this, we will simulate taking more samples of size 40 from our population of Timbits, and calculate the proportion of chocolate Timbits in each sample. We can then visualize the distribution of sample proportions we calculate. The distribution of the estimate for all possible samples of a given size (which we commonly refer to as \\(n\\)) from a population is called a sampling distribution. The sampling distribution will help us see how much we would expect our sample proportions from this population to vary for samples of size 40. Below we again use the rep_sample_n to take samples of size 40 from our population of Timbits, but we set the reps argument to specify the number of samples to take, here 15,000. We will use the function head() to see the first few rows and tail() to see the last few rows of our samples data frame. 
samples <- rep_sample_n(virtual_box, size = 40, reps = 15000) head(samples) ## # A tibble: 6 x 3 ## # Groups: replicate [1] ## replicate timbit_id flavour ## <int> <dbl> <fct> ## 1 1 9054 chocolate ## 2 1 4322 old fashioned ## 3 1 1685 chocolate ## 4 1 3958 chocolate ## 5 1 2765 old fashioned ## 6 1 358 old fashioned tail(samples) ## # A tibble: 6 x 3 ## # Groups: replicate [1] ## replicate timbit_id flavour ## <int> <dbl> <fct> ## 1 15000 4633 old fashioned ## 2 15000 552 chocolate ## 3 15000 7998 old fashioned ## 4 15000 8649 chocolate ## 5 15000 2974 chocolate ## 6 15000 7811 old fashioned Notice the column replicate is indicating the replicate, or sample, with which each Timbit belongs. Since we took 15,000 samples of size 40, there are 15,000 replicates. Now that we have taken 15,000 samples, to create a sampling distribution of sample proportions for samples of size 40, we need to calculate the proportion of chocolate Timbits for each sample, \\(\\hat{p}_\\text{chocolate}\\): sample_estimates <- samples %>% group_by(replicate) %>% summarise(sample_proportion = sum(flavour == "chocolate") / 40) head(sample_estimates) ## # A tibble: 6 x 2 ## replicate sample_proportion ## <int> <dbl> ## 1 1 0.625 ## 2 2 0.675 ## 3 3 0.7 ## 4 4 0.675 ## 5 5 0.45 ## 6 6 0.425 tail(sample_estimates) ## # A tibble: 6 x 2 ## replicate sample_proportion ## <int> <dbl> ## 1 14995 0.575 ## 2 14996 0.6 ## 3 14997 0.45 ## 4 14998 0.675 ## 5 14999 0.7 ## 6 15000 0.525 Now that we have calculated the proportion of chocolate Timbits for each sample, \\(\\hat{p}_\\text{chocolate}\\), we can visualize the sampling distribution of sample proportions for samples of size 40: sampling_distribution <- ggplot(sample_estimates, aes(x = sample_proportion)) + geom_histogram(fill="dodgerblue3", color="lightgrey", bins = 12) + xlab("Sample proportions") sampling_distribution Figure 11.3: Sampling distribution of the sample proportion for sample size 40 The sampling distribution appears to be bell-shaped with one peak. It is centered around 0.6 and the sample proportions range from about 0.3 to about 0.9. In fact, we can calculate the mean of the sample proportions. sample_estimates %>% summarise(mean = mean(sample_proportion)) ## # A tibble: 1 x 1 ## mean ## <dbl> ## 1 0.629 We notice that the sample proportions are centred around the population proportion value, 0.63! In general, the mean of the distribution of \\(\\hat{p}\\) should be equal to \\(p\\), which is good because that means the sample proportion is neither an overestimate nor an underestimate of the population proportion. So what can we learn from this sampling distribution? This distribution tells us what we might expect from proportions from samples of size \\(40\\) when our population proportion is 0.63. In practice, we usually don’t know the proportion of our population, but if we can use what we know about the sampling distribution, we can use it to make inferences about our population when we only have a single sample. 11.4.2 Sampling distributions for means In the previous section, our variable of interest—Timbit flavour—was categorical, and the population parameter of interest was the proportion of chocolate Timbits. As mentioned in the introduction to this chapter, there are many choices of population parameter for each type of observed variable. What if we wanted to infer something about a population of quantitative variables instead? 
For instance, a traveller visiting Vancouver, BC may wish to know about the prices of staying somewhere using Airbnb, an online marketplace for arranging places to stay. Particularly, they might be interested in estimating the population mean price per night of Airbnb listings in Vancouver, BC. This section will study the case where we are interested in the population mean of a quantitative variable. We will look at an example using data from Inside Airbnb. The data set contains Airbnb listings for Vancouver, Canada, in September 2020. Let’s imagine (for learning purposes) that our data set represents the population of all Airbnb rental listings in Vancouver, and we are interested in the population mean price per night. Our data contains an ID number, neighbourhood, type of room, the number of people the rental accommodates, number of bathrooms, bedrooms, beds, and the price per night. ## # A tibble: 6 x 8 ## id neighbourhood room_type accommodates bathrooms bedrooms beds price ## <int> <chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl> ## 1 1 Downtown Entire home/apt 5 2 baths 2 2 150 ## 2 2 Downtown Eastside Entire home/apt 4 2 baths 2 2 132 ## 3 3 West End Entire home/apt 2 1 bath 1 1 85 ## 4 4 Kensington-Cedar Cottage Entire home/apt 2 1 bath 1 0 146 ## 5 5 Kensington-Cedar Cottage Entire home/apt 4 1 bath 1 2 110 ## 6 6 Hastings-Sunrise Entire home/apt 4 1 bath 2 3 195 We can visualize the population distribution of the price per night with a histogram. population_distribution <- ggplot(airbnb, aes(x = price)) + geom_histogram(fill="dodgerblue3", color="lightgrey") + xlab("Price per night ($)") population_distribution Figure 11.4: Population distribution of price per night ($) for all Airbnb listings in Vancouver, Canada We see that the distribution has one peak and is skewed—most of the listings are less than $250 per night, but a small proportion of listings cost more than that, creating a long tail on the histogram’s right side. We can also calculate the population mean, the average price per night for all the Airbnb listings. population_parameters <- airbnb %>% summarize(pop_mean = mean(price)) population_parameters ## # A tibble: 1 x 1 ## pop_mean ## <dbl> ## 1 155. The price per night of all Airbnb rentals in Vancouver, BC is $154.51, on average. This value is our population parameter since we are calculating it using the population data. Suppose that we did not have access to the population data, yet we still wanted to estimate the mean price per night. We could answer this question by taking a random sample of as many Airbnb listings as we had time to, let’s say we could do this for 40 listings. What would such a sample look like? Let’s take advantage of the fact that we do have access to the population data and simulate taking one random sample of 40 listings in R, again using rep_sample_n. After doing this we create a histogram to visualize the distribution of observations in the sample, and calculate the mean of our sample. This number is a point estimate for the mean of the full population. 
sample_1 <- airbnb %>% rep_sample_n(40) head(sample_1) ## # A tibble: 6 x 9 ## # Groups: replicate [1] ## replicate id neighbourhood room_type accommodates bathrooms bedrooms beds price ## <int> <int> <chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl> ## 1 1 436 Kensington-Cedar Cottage Entire home/apt 7 2 baths 3 6 140 ## 2 1 2794 Riley Park Entire home/apt 3 1 bath 1 1 100 ## 3 1 4423 Mount Pleasant Entire home/apt 6 1 bath 3 3 207 ## 4 1 853 Kensington-Cedar Cottage Private room 2 1 shared bath 1 1 45 ## 5 1 1545 Kensington-Cedar Cottage Entire home/apt 4 1 bath 2 4 80 ## 6 1 2505 Oakridge Private room 2 1 private bath 1 1 154 sample_distribution <- ggplot(sample_1, aes(price)) + geom_histogram(fill="dodgerblue3", color="lightgrey") + xlab("Price per night ($)") sample_distribution Figure 11.5: Distribution of price per night ($) for sample of 40 Airbnb listings estimates <- sample_1 %>% summarize(sample_mean = mean(price)) estimates ## # A tibble: 1 x 2 ## replicate sample_mean ## <int> <dbl> ## 1 1 128. Recall that the population mean was $154.51. We see that our point estimate for the mean is $127.8. So our estimate was actually quite close to the population parameter: the mean was about 17.3% off. Note that in practice, we usually cannot compute the accuracy of the estimate, since we do not have access to the population parameter; if we did, we wouldn’t need to estimate it! Also recall from the previous section that the point estimate can vary; if we took another random sample from the population, then the value of our estimate may change. So then did we just get lucky with our point estimate above? How much does our estimate vary across different samples of size 40 in this example? Again, since we have access to the population, we can take many samples and plot the sampling distribution of sample means for samples of size 40 to get a sense for this variation. In this case, we’ll use 15,000 samples of size 40. samples <- rep_sample_n(airbnb, size = 40, reps = 15000) head(samples) ## # A tibble: 6 x 9 ## # Groups: replicate [1] ## replicate id neighbourhood room_type accommodates bathrooms bedrooms beds price ## <int> <int> <chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl> ## 1 1 3101 West End Entire home/apt 5 2 baths 2 2 152 ## 2 1 281 Renfrew-Collingwood Private room 2 1 shared bath 1 1 40 ## 3 1 3343 Oakridge Entire home/apt 4 1 bath 1 2 101 ## 4 1 3394 Mount Pleasant Entire home/apt 6 1 bath 2 2 126 ## 5 1 3339 Downtown Eastside Entire home/apt 4 1 bath 1 2 113 ## 6 1 1908 Riley Park Entire home/apt 3 1 bath 2 3 102 sample_estimates <- samples %>% group_by(replicate) %>% summarise(sample_mean = mean(price)) head(sample_estimates) ## # A tibble: 6 x 2 ## replicate sample_mean ## <int> <dbl> ## 1 1 136. ## 2 2 145. ## 3 3 111. ## 4 4 173. ## 5 5 131. ## 6 6 174. sampling_distribution_40 <- ggplot(sample_estimates, aes(x = sample_mean)) + geom_histogram(fill="dodgerblue3", color="lightgrey") + xlab("Sample mean price per night ($)") sampling_distribution_40 Figure 11.6: Sampling distribution of the sample means for sample size of 40 Here we see that the sampling distribution of the mean has one peak and is bell-shaped. Most of the estimates are between about $140 and $170; but there are a good fraction of cases outside this range (i.e., where the point estimate was not close to the population parameter). So it does indeed look like we were quite lucky when we estimated the population mean with only 17.3% error. 
Let’s visualize the population distribution, distribution of the sample, and the sampling distribution on one plot to compare them. Figure 11.7: Comparison of population distribution, sample distribution and sampling distribution Given that there is quite a bit of variation in the sampling distribution of the sample mean—i.e., the point estimate that we obtain is not very reliable—is there any way to improve the estimate? One way to improve a point estimate is to take a larger sample. To illustrate what effect this has, we will take many samples of size 20, 50, 100, and 500, and plot the sampling distribution of the sample mean below. Figure 11.8: Comparison of sampling distributions Based on the visualization, two points about the sample mean become clear. First, the mean of the sample mean (across samples) is equal to the population mean. Second, increasing the size of the sample decreases the spread (i.e., the variability) in the sample mean point estimate of the population mean. Therefore, a larger sample size results in a more reliable point estimate of the population parameter. 11.4.3 Summary A point estimate is a single value computed using a sample from a population (e.g., a mean or proportion). The sampling distribution of an estimate is the distribution of the estimate for all possible samples of a fixed size from the same population. The sample means and proportions calculated from samples are centered around the population mean and proportion, respectively. The spread of the sampling distribution is related to the sample size. As the sample size increases, the spread of the sampling distribution decreases. The sampling distribution is usually bell-shaped with one peak and centred at the population mean or proportion. Why all this emphasis on sampling distributions? Usually, we don’t have access to the population data, so we cannot construct the sampling distribution as we did in this section. As we saw, our sample estimate’s value will likely not equal the population parameter value exactly. We saw from the sampling distribution just how much our estimates can vary. So reporting a single point estimate for the population parameter alone may not be enough. Using simulations, we can see what the sample estimate’s sampling distribution would look like for a sample of a given size. We can use these patterns to approximate the sampling distribution when we only have one sample, which is the realistic case. If we can “predict” what the sampling distribution would look like for a sample, we could construct a range of values within which we think the population parameter’s value might lie. We can use our single sample and its properties that influence sampling distributions, such as the spread and sample size, to approximate the sampling distribution as best as we can. There are several methods to do this; in this book, we will use the bootstrap method, as we will see in the next section. 
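The chapter shows the sampling distributions for samples of size 20, 50, 100, and 500 (Figure 11.8) but not the code used to produce them. A sketch of how they could be generated, assuming the airbnb data frame and the infer package are available as above:

```r
library(tidyverse)
library(infer)

sample_sizes <- c(20, 50, 100, 500)

# For each sample size, take 15,000 samples and compute the sample mean price
sampling_dists <- map_dfr(sample_sizes, function(n) {
  rep_sample_n(airbnb, size = n, reps = 15000) %>%
    group_by(replicate) %>%
    summarize(sample_mean = mean(price), .groups = "drop") %>%
    mutate(sample_size = n)
})

# Plot one histogram per sample size to compare the spread of the estimates
ggplot(sampling_dists, aes(x = sample_mean)) +
  geom_histogram(fill = "dodgerblue3", color = "lightgrey") +
  facet_wrap(~sample_size, scales = "free_y") +
  xlab("Sample mean price per night ($)")
```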
So how do we get a sense for how variable our point estimate is when we only have one sample to work with? In this section, we will discuss interval estimation and construct confidence intervals using just a single sample from a population. A confidence interval is a range of plausible values for our population parameter. Here is the key idea. First, if you take a big enough sample, it looks like the population. Notice the histograms’ shapes for samples of different sizes taken from the population in the picture below. We see that for a large enough sample, the sample’s distribution looks like that of the population. Figure 11.9: Comparison of samples of different sizes from the population In the previous section, we took many samples of the same size from our population to get a sense for the variability of a sample estimate. But if our sample is big enough that it looks like our population, we can pretend that our sample is the population, and take more samples (with replacement) of the same size from it instead! This very clever technique is called the bootstrap. Note that by taking many samples from our single, observed sample, we do not obtain the true sampling distribution, but rather an approximation that we call the bootstrap distribution. Note that we need to sample with replacement when using the bootstrap. Otherwise, if we had a sample of size \\(n\\), and obtained a sample from it of size \\(n\\) without replacement, it would just return our original sample. This section will explore how to create a bootstrap distribution from a single sample using R. For a sample of size \\(n\\), the process we will go through is as follows: Randomly select an observation from the original sample, which was drawn from the population Record the observation’s value Replace that observation Repeat steps 1 - 3 (sampling with replacement) until you have \\(n\\) observations, which form a bootstrap sample Calculate the bootstrap point estimate (e.g., mean, median, proportion, slope, etc.) of the \\(n\\) observations in your bootstrap sample Repeat steps (1) - (5) many times to create a distribution of point estimates (the bootstrap distribution) Calculate the plausible range of values around our observed point estimate Figure 11.10: Overview of the bootstrap process 11.5.2 Bootstrapping in R Let’s continue working with our Airbnb data. Once again, let’s say we are interested in estimating the population mean price per night of all Airbnb listings in Vancouver, Canada using a single sample of size 40 that we collected. To simulate doing this in R, we will use rep_sample_n to take a random sample from our population. In real life we wouldn’t do this step in R; we would instead simply load into R the data that we, or our collaborators, collected. After we have our sample, we will visualize its distribution and calculate our point estimate, the sample mean. one_sample <- airbnb %>% rep_sample_n(40) %>% ungroup() %>% # ungroup the data frame select(price) # drop the replicate column head(one_sample) ## # A tibble: 6 x 1 ## price ## <dbl> ## 1 250 ## 2 106 ## 3 150 ## 4 357 ## 5 50 ## 6 110 one_sample_dist <- ggplot(one_sample, aes(price)) + geom_histogram(fill="dodgerblue3", color="lightgrey") + xlab("Price per night ($)") one_sample_dist Figure 11.11: Histogram of price per night ($) for one sample of size 40 one_sample_estimates <- one_sample %>% summarise(sample_mean = mean(price)) one_sample_estimates ## # A tibble: 1 x 1 ## sample_mean ## <dbl> ## 1 166. 
The sample distribution is skewed with a few observations out to the right. The mean of the sample is $165.62. Remember, in practice, we usually only have one sample from the population. So this sample and estimate are the only data we can work with. We now perform steps (1) - (5) listed above to generate a single bootstrap sample in R using the sample we just took, and calculate the bootstrap estimate for that sample. We will use the rep_sample_n function as we did when we were creating our sampling distribution. Since we want to sample with replacement, we change the argument for replace from its default value of FALSE to TRUE. boot1 <- one_sample %>% rep_sample_n(size = 40, replace = TRUE, reps = 1) head(boot1) ## # A tibble: 6 x 2 ## # Groups: replicate [1] ## replicate price ## <int> <dbl> ## 1 1 201 ## 2 1 199 ## 3 1 127. ## 4 1 85 ## 5 1 169 ## 6 1 60 boot1_dist <- ggplot(boot1, aes(price)) + geom_histogram(fill="dodgerblue3", color="lightgrey") + xlab("Price per night ($)") boot1_dist Figure 11.12: Bootstrap distribution summarise(boot1, mean = mean(price)) ## # A tibble: 1 x 2 ## replicate mean ## <int> <dbl> ## 1 1 152. Notice that our bootstrap distribution has a similar shape to the original sample distribution. Though the shapes of the distributions are similar, they are not identical. You’ll also notice that the original sample mean and the bootstrap sample mean differ. How might that happen? Remember that we are sampling with replacement from the original sample, so we don’t end up with the same sample values again. We are trying to mimic drawing another sample from the population without actually having to do that. Let’s now take 15,000 bootstrap samples from the original sample we drew from the population (one_sample) using rep_sample_n and calculate the means for each of those replicates. Recall that this assumes that one_sample looks like our original population; but since we do not have access to the population itself, this is often the best we can do. boot15000 <- one_sample %>% rep_sample_n(size = 40, replace = TRUE, reps = 15000) head(boot15000) ## # A tibble: 6 x 2 ## # Groups: replicate [1] ## replicate price ## <int> <dbl> ## 1 1 200 ## 2 1 176 ## 3 1 105 ## 4 1 105 ## 5 1 105 ## 6 1 132 tail(boot15000) ## # A tibble: 6 x 2 ## # Groups: replicate [1] ## replicate price ## <int> <dbl> ## 1 15000 357 ## 2 15000 49 ## 3 15000 115 ## 4 15000 169 ## 5 15000 145 ## 6 15000 357 Let’s take a look at histograms of the first six replicates of our bootstrap samples. six_bootstrap_samples <- boot15000 %>% filter(replicate <= 6) ggplot(six_bootstrap_samples, aes(price)) + geom_histogram(fill="dodgerblue3", color="lightgrey") + xlab("Price per night ($)") + facet_wrap(~replicate) We see in the graph above how the bootstrap samples differ. We can also calculate the sample mean for each of these six replicates. six_bootstrap_samples %>% group_by(replicate) %>% summarize(mean = mean(price)) ## # A tibble: 6 x 2 ## replicate mean ## <int> <dbl> ## 1 1 154. ## 2 2 162. ## 3 3 151. ## 4 4 163. ## 5 5 158. ## 6 6 156. We can see that the bootstrap sample distributions and the sample means are different. This is because we are sampling with replacement. We will now calculate point estimates for our 15,000 bootstrap samples and generate a bootstrap distribution of our point estimates. The bootstrap distribution suggests how we might expect our point estimate to behave if we took another sample. 
boot15000_means <- boot15000 %>% group_by(replicate) %>% summarize(mean = mean(price)) head(boot15000_means) ## # A tibble: 6 x 2 ## replicate mean ## <int> <dbl> ## 1 1 154. ## 2 2 162. ## 3 3 151. ## 4 4 163. ## 5 5 158. ## 6 6 156. tail(boot15000_means) ## # A tibble: 6 x 2 ## replicate mean ## <int> <dbl> ## 1 14995 155. ## 2 14996 148. ## 3 14997 139. ## 4 14998 156. ## 5 14999 158. ## 6 15000 176. boot_est_dist <- ggplot(boot15000_means, aes(x = mean)) + geom_histogram(fill="dodgerblue3", color="lightgrey") + xlab("Sample mean price per night ($)") Let’s compare our bootstrap distribution with the true sampling distribution (taking many samples from the population). Figure 11.13: Comparison of distribution of the bootstrap sample means and sampling distribution There are two essential points that we can take away from these plots. First, the shape and spread of the true sampling distribution and the bootstrap distribution are similar; the bootstrap distribution lets us get a sense of the point estimate’s variability. The second important point is that the means of these two distributions are different. The sampling distribution is centred at $154.51, the population mean value. However, the bootstrap distribution is centred at the original sample’s mean price per night, $165.56. Because we are resampling from the original sample repeatedly, we see that the bootstrap distribution is centred at the original sample’s mean value (unlike the sampling distribution of the sample mean, which is centred at the population parameter value). The idea here is that we can use this distribution of bootstrap sample means to approximate the sampling distribution of the sample means when we only have one sample. Since the bootstrap distribution pretty well approximates the sampling distribution spread, we can use the bootstrap spread to help us develop a plausible range for our population parameter along with our estimate! Figure 11.14: Summary of bootstrapping process 11.5.3 Using the bootstrap to calculate a plausible range Now that we have constructed our bootstrap distribution let’s use it to create an approximate bootstrap confidence interval, a range of plausible values for the population mean. We will build a 95% percentile bootstrap confidence interval and find the range of values that cover the middle 95% of the bootstrap distribution. A 95% confidence interval means that if we were to repeat the sampling process and calculate 95% confidence intervals each time and repeat this process many times, then 95% of the intervals would capture the population parameter’s value. Note that there’s nothing particularly special about 95%, we could have used other confidence levels, such as 90% or 99%. There is a balance between our level of confidence and precision. A higher confidence level corresponds to a wider range of the interval, and a lower confidence level corresponds to a narrower range. Therefore the level we choose is based on what chance we are willing to take of being wrong based on the implications of being wrong for our application. In general, we choose confidence levels to be comfortable with our level of uncertainty, but not so strict that the interval is unhelpful. For instance, if our decision impacts human life and the implications of being wrong are deadly, we may want to be very confident and choose a higher confidence level. 
To calculate our 95% percentile bootstrap confidence interval, we will do the following: Arrange the observations in the bootstrap distribution in ascending order Find the value such that 2.5% of observations fall below it (the 2.5% percentile). Use that value as the lower bound of the interval Find the value such that 97.5% of observations fall below it (the 97.5% percentile). Use that value as the upper bound of the interval To do this in R, we can use the quantile() function: bounds <- boot15000_means %>% select(mean) %>% pull() %>% quantile(c(0.025, 0.975)) bounds ## 2.5% 97.5% ## 134.0778 200.2759 Our interval, $134.08 to $200.28, captures the middle 95% of the sample mean prices in the bootstrap distribution. We can visualize the interval on our distribution in the picture below. Figure 11.15: Distribution of the bootstrap sample means with percentile lower and upper bounds To finish our estimation of the population parameter, we would report the point estimate and our confidence interval’s lower and upper bounds. Here the sample mean price-per-night of our 40 Airbnb listings was $165.62, and we are 95% “confident” that the true population mean price-per-night for all Airbnb listings in Vancouver is between $134.08 and $200.28. Notice that our interval does indeed contain the true population mean value, $154.51! However, in practice, we would not know whether or not our interval captured the population parameter, because we usually only have a single sample, not the entire population. Nonetheless, this is the best we can do when we only have one sample! This chapter is only the beginning of the journey into statistical inference. We can extend the concepts learned here to do much more than report point estimates and confidence intervals, such as hypothesis testing for differences between populations, tests for associations between variables, and so much more! We have just scratched the surface of statistical inference; however, the material presented here will serve as the foundation for more advanced statistical techniques you may learn about in the future! 11.6 Additional readings For more about statistical inference and bootstrapping, refer to Chapters 7 - 8 of Modern Dive: Statistical Inference via Data Science by Chester Ismay and Albert Y. Kim Chapters 4 - 7 of OpenIntro Statistics - Fourth Edition by David M. Diez, Christopher D. Barr and Mine Cetinkaya-Rundel "]
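As an aside, the infer package (already loaded earlier in the chapter) also provides a pipeline that produces the same kind of percentile bootstrap interval. The sketch below is an assumption about how it could be applied to one_sample, not part of the chapter's own workflow; check the infer documentation for details:

```r
library(infer)

one_sample %>%
  specify(response = price) %>%                 # declare the variable of interest
  generate(reps = 15000, type = "bootstrap") %>% # draw bootstrap resamples
  calculate(stat = "mean") %>%                   # compute the mean of each resample
  get_confidence_interval(level = 0.95, type = "percentile")
```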
    +>>>>>>> dev
     ]
    diff --git a/docs/viz.html b/docs/viz.html
    index 37c21f94f..aed2e8e42 100644
    --- a/docs/viz.html
    +++ b/docs/viz.html
    @@ -26,7 +26,7 @@
     
     
     
    -
    +
     
       
       
    @@ -394,8 +394,8 @@ 

    4.2 Chapter learning objectives

    4.3 Choosing the visualization

    -
    -

    Ask a question, and answer it

    +
    +

    4.3.0.1 Ask a question, and answer it

    The purpose of a visualization is to answer a question about a data set of interest. So naturally, the first thing to do before creating a visualization is to formulate the question about the data that you are trying to answer. A good visualization will answer your question in a clear way without distraction; a great visualization will suggest even what the question @@ -422,8 +422,8 @@

    Ask a question, and answer it

    4.4 Refining the visualization

    -
    -

    Convey the message, minimize noise

    +
    +

    4.4.0.1 Convey the message, minimize noise

    Just being able to make a visualization in R with ggplot2 (or any other tool for that matter) doesn’t mean that it is effective at communicating your message to others. Once you have selected a broad type of visualization to use, you will have to refine it to suit your particular need. Some rules of thumb for doing this are listed below. They generally fall into two classes: you want to make your @@ -453,20 +453,20 @@

    Convey the message, minimize noise

    4.5 Creating visualizations with ggplot2

    -
    -

    Build the visualization iteratively

    +
    +

    4.5.0.1 Build the visualization iteratively

    This section will cover examples of how to choose and refine a visualization given a data set and a question that you want to answer, and then how to create the visualization in R using ggplot2. To use the ggplot2 library, we need to load the tidyverse metapackage.

    -
    library(tidyverse)
    +
    library(tidyverse)

    4.5.1 The Mauna Loa CO2 data set

    The Mauna Loa CO2 data set, curated by Dr. Pieter Tans, NOAA/GML and Dr. Ralph Keeling, Scripps Institution of Oceanography records the atmospheric concentration of carbon dioxide (CO2, in parts per million) at the Mauna Loa research station in Hawaii from 1959 onwards. Question: Does the concentration of atmospheric CO2 change over time, and are there any interesting patterns to note?

    -
    # mauna loa carbon dioxide data 
    -co2_df <- read_csv("data/mauna_loa.csv") %>%
    -        filter(ppm > 0, date_decimal < 2000)
    -head(co2_df)
    +
    # mauna loa carbon dioxide data 
    +co2_df <- read_csv("data/mauna_loa.csv") %>%
    +        filter(ppm > 0, date_decimal < 2000)
    +head(co2_df)
    ## # A tibble: 6 x 4
     ##    year month date_decimal   ppm
     ##   <dbl> <dbl>        <dbl> <dbl>
    @@ -499,18 +499,18 @@ 

    4.5.1 The Mauna Loa CO2 data set<

    There are many other possible arguments we could pass to the aesthetic mapping and geometric object to change how the plot looks. For the purposes of quickly testing things out to see what they look like, though, we can just go with the default settings:

    -
    co2_scatter <- ggplot(co2_df, aes(x = date_decimal, y = ppm)) + 
    -        geom_point() 
    -co2_scatter
    +
    co2_scatter <- ggplot(co2_df, aes(x = date_decimal, y = ppm)) + 
    +        geom_point() 
    +co2_scatter

    Certainly the visualization shows a clear upward trend in the atmospheric concentration of CO2 over time. This plot answers the first part of our question in the affirmative, but that appears to be the only conclusion one can make from the scatter visualization. However, since time is an ordered quantity, we can try using a line plot instead using the geom_line function. Line plots require that the data are ordered by their x coordinate, and connect the sequence of x and y coordinates with line segments. Let’s again try this with just the default arguments:

    -
    co2_line <- ggplot(co2_df, aes(x = date_decimal, y = ppm)) + 
    -        geom_line() 
    -co2_line
    +
    co2_line <- ggplot(co2_df, aes(x = date_decimal, y = ppm)) + 
    +        geom_line() 
    +co2_line

    Aha! There is another interesting phenomenon in the data: in addition to increasing over time, the concentration seems to oscillate as well. Given the visualization as it is now, it is still hard to tell how fast the oscillation is, but nevertheless, the line seems to @@ -519,12 +519,12 @@

    4.5.1 The Mauna Loa CO2 data set<

    Now that we have settled on the rough details of the visualization, it is time to refine things. This plot is fairly straightforward, and there is not much visual noise to remove. But there are a few things we must do to improve clarity, such as adding informative axis labels and making the font a more readable size. In order to add axis labels we use the xlab and ylab functions. To change the font size we use the theme function with the text argument:

    -
    co2_line <- ggplot(co2_df, aes(x = date_decimal, y = ppm)) + 
    -                   geom_line() +
    -                   xlab('Year') +
    -                   ylab('Atmospheric CO2 (ppm)') + 
    -                   theme(text = element_text(size = 18))
    -co2_line
    +
    co2_line <- ggplot(co2_df, aes(x = date_decimal, y = ppm)) + 
    +                   geom_line() +
    +                   xlab('Year') +
    +                   ylab('Atmospheric CO2 (ppm)') + 
    +                   theme(text = element_text(size = 18))
    +co2_line

    Finally, let’s see if we can better understand the oscillation by changing the visualization a little bit. Note that it is totally fine to use a small number of visualizations to answer different aspects of the question you are trying to answer. We will accomplish @@ -532,13 +532,13 @@

    4.5.1 The Mauna Loa CO2 data set< We scale the horizontal axis by using the scale_x_continuous function, and the vertical axis with the scale_y_continuous function. We can transform the axis by passing the trans argument, and set limits by passing the limits argument. In particular, here we will use the scale_x_continuous function with the limits argument to zoom in on just five years of data (say, 1990-1995):

    -
    co2_line <- ggplot(co2_df, aes(x = date_decimal, y = ppm)) + 
    -                   geom_line() +
    -                   xlab('Year') +
    -                   ylab('Atmospheric CO2 (ppm)') + 
    -                   scale_x_continuous(limits = c(1990, 1995)) +
    -                   theme(text = element_text(size = 18))
    -co2_line
    +
    co2_line <- ggplot(co2_df, aes(x = date_decimal, y = ppm)) + 
    +                   geom_line() +
    +                   xlab('Year') +
    +                   ylab('Atmospheric CO2 (ppm)') + 
    +                   scale_x_continuous(limits = c(1990, 1995)) +
    +                   theme(text = element_text(size = 18))
    +co2_line

    Interesting! It seems that each year, the atmospheric CO2 increases until it reaches its peak somewhere around April, decreases until around late September, and finally increases again until the end of the year. In Hawaii, there are two seasons: summer from May through October, and winter from November through April. @@ -547,9 +547,9 @@

    4.5.1 The Mauna Loa CO2 data set<

    4.5.2 The island landmass data set

    This data set contains a list of Earth’s land masses as well as their area (in thousands of square miles). Question: Are the continents (North / South America, Africa, Europe, Asia, Australia, Antarctica) Earth’s 7 largest landmasses? If so, what are the next few largest landmasses after those?

    -
    # islands data 
    -islands_df <- read_csv("data/islands.csv")
    -head(islands_df)
    +
    # islands data 
    +islands_df <- read_csv("data/islands.csv")
    +head(islands_df)
    ## # A tibble: 6 x 2
     ##   landmass      size
     ##   <chr>        <dbl>
    @@ -563,9 +563,9 @@ 

    4.5.2 The island landmass data se question is a bar plot, specified by the geom_bar function in ggplot2. However, by default, geom_bar sets the heights of bars to the number of times a value appears in a dataframe (its count); here we want to plot exactly the values in the dataframe, i.e., the landmass sizes. So we have to pass the stat = "identity" argument to geom_bar:

    -
    islands_bar <- ggplot(islands_df, aes(x = landmass, y = size)) + 
    -            geom_bar(stat = "identity")
    -islands_bar
    +
    islands_bar <- ggplot(islands_df, aes(x = landmass, y = size)) + 
    +            geom_bar(stat = "identity")
    +islands_bar

    Alright, not bad! This is definitely the right kind of visualization, as we can clearly see and compare sizes of landmasses. The major issues are that the sizes of the smaller landmasses are hard to distinguish, and that the names of the landmasses @@ -573,20 +573,20 @@

    4.5.2 The island landmass data se landmasses; let’s make the plot a little bit clearer by keeping only the largest 12 landmasses. We do this using the top_n function. Then to help us make sure the labels have enough space, we’ll use horizontal bars instead of vertical ones. We do this using the coord_flip function, which swaps the x and y coordinate axes:

    -
    islands_top12 <- top_n(islands_df, 12, size)
    -islands_bar <- ggplot(islands_top12, aes(x = landmass, y = size)) + 
    -        geom_bar(stat = "identity") + 
    -        coord_flip()
    -islands_bar
    +
    islands_top12 <- top_n(islands_df, 12, size)
    +islands_bar <- ggplot(islands_top12, aes(x = landmass, y = size)) + 
    +        geom_bar(stat = "identity") + 
    +        coord_flip()
    +islands_bar

    This is definitely clearer now, and allows us to answer our question (“are the top 7 largest landmasses continents?”) in the affirmative. But the answer could be made clearer still by organizing the bars by size rather than alphabetical order, and by colouring them based on whether or not they are a continent. In order to do this, we use mutate to add a column to the data regarding whether or not the landmass is a continent:

islands_top12 <- top_n(islands_df, 12, size)
continents <- c('Africa', 'Antarctica', 'Asia', 'Australia', 'Europe', 'North America', 'South America')
islands_ct <- mutate(islands_top12, is_continent = ifelse(landmass %in% continents, 'Continent', 'Other'))
head(islands_ct)
    ## # A tibble: 6 x 3
     ##   landmass    size is_continent
     ##   <chr>      <dbl> <chr>       

    In order to colour the bars, we add the fill argument to the aesthetic mapping. Then we use the reorder function in the aesthetic mapping to organize the landmasses by their size variable. Finally, we use the labs and theme functions to add labels, change the font size, and position the legend:

islands_bar <- ggplot(islands_ct, aes(x = reorder(landmass, size), y = size, fill = is_continent)) + 
  geom_bar(stat = "identity") +
  labs(x = 'Landmass', y = 'Size (1000 square mi)', fill = 'Type') +
  coord_flip() +
  theme(text = element_text(size = 18), legend.position = c(0.75, 0.45))
islands_bar

    This is now a very effective visualization for answering our original questions. Landmasses are organized by their size, and continents are coloured differently than other landmasses, making it quite clear that continents are the largest 7 landmasses.


    4.5.3 The Old Faithful eruption / waiting time data set

    This data set contains measurements of the waiting time between eruptions and the subsequent eruption duration (in minutes). Question: Is there a relationship between the waiting time before an eruption to the duration of the eruption?

# old faithful eruption time / wait time data
head(faithful)
    ##   eruptions waiting
     ## 1     3.600      79
     ## 2     1.800      54

Here again we are investigating the relationship between two quantitative variables (waiting time and eruption time). But if you look at the output of the head function, you'll notice that neither of the columns is ordered. So in this case, let's start again with a scatter plot:

faithful_scatter <- ggplot(faithful, aes(x = waiting, y = eruptions)) + 
  geom_point()
faithful_scatter

We can see that the data tend to fall into two groups: one with short waiting and eruption times, and one with long waiting and eruption times. Note that in this case, there is no overplotting: the points are generally nicely visually separated, and the pattern they form is clear. To refine the visualization, we need only add axis labels and make the font more readable:

faithful_scatter <- ggplot(faithful, aes(x = waiting, y = eruptions)) + 
  geom_point() +
  labs(x = 'Waiting Time (mins)', y = 'Eruption Duration (mins)') +
  theme(text = element_text(size = 18))
faithful_scatter

4.5.4 The Michelson speed of light data set

    This data set contains measurements of the speed of light (in kilometres per second with 299,000 subtracted) from the year 1879 for 5 experiments, each with 20 consecutive runs. Question: Given what we know now about the speed of light (299,792.458 kilometres per second), how accurate were each of the experiments?

# michelson morley experimental data
head(morley)
    ##     Expt Run Speed
     ## 001    1   1   850
     ## 002    1   2   740

To answer this question, we will use a histogram. A histogram helps us visualize how a particular variable is distributed in a data set by separating the data into bins, and then using vertical bars to show how many data points fell in each bin. To create a histogram in ggplot2 we will use the geom_histogram geometric object, setting the x axis to the Speed measurement variable; as we did before, let's use the default arguments just to see how things look:

morley_hist <- ggplot(morley, aes(x = Speed)) + 
  geom_histogram()
morley_hist

    This is a great start. However, we cannot tell how accurate the measurements are using this visualization unless we can see what the true value is. In order to visualize the true speed of light, we will add a vertical line with the geom_vline function, setting the xintercept argument to the true value. There is a similar function, geom_hline, that is used for plotting horizontal lines. Note that vertical lines are used to denote quantities on the horizontal axis, while horizontal lines are used to denote quantities on the vertical axis.

morley_hist <- ggplot(morley, aes(x = Speed)) + 
  geom_histogram() +
  geom_vline(xintercept = 792.458, linetype = "dashed", size = 1.0)
morley_hist
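
As a quick illustration of geom_hline (a sketch, not part of the original analysis), we could add a horizontal dashed line to the earlier Old Faithful scatter plot at an eruption duration of 3 minutes, a value chosen here purely to show the syntax:

# add a horizontal dashed line at eruptions = 3 minutes to the saved
# faithful_scatter plot (the 3-minute threshold is arbitrary)
faithful_scatter +
  geom_hline(yintercept = 3, linetype = "dashed", size = 1.0)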

We also still cannot tell which experiments (denoted in the Expt column) led to which measurements; perhaps some experiments were more accurate than others. To fully answer our question, we need to separate the measurements from each other visually. We can do this by colouring the bars according to the experiment.

To do so, we treat the Expt variable as a categorical variable with the factor function and map it to the fill aesthetic mapping. We make sure the different colours can be seen (despite them all sitting on top of each other) by setting the alpha argument in geom_histogram to 0.5 to make the bars slightly translucent:

morley_hist <- ggplot(morley, aes(x = Speed, fill = factor(Expt))) + 
  geom_histogram(position = "identity", alpha = 0.5) +
  geom_vline(xintercept = 792.458, linetype = "dashed", size = 1.0)
morley_hist

Unfortunately, the attempt to separate out the experiment number visually has created a bit of a mess. All of the colours are blending together, and although it is possible to derive some insight from this (e.g., experiments 1 and 3 had some of the most extreme measurements), this is not the clearest way to convey our message. A better alternative is to use the facet_grid function to create a separate subplot for each experiment.

facet_grid takes a formula of the form vertical_variable ~ horizontal_variable, where vertical_variable is used to split the plot vertically, horizontal_variable is used to split horizontally, and . is used if there should be no split along that axis. In our case we only want to split vertically along the Expt variable, so we use Expt ~ . as the argument to facet_grid.

morley_hist <- ggplot(morley, aes(x = Speed, fill = factor(Expt))) + 
  geom_histogram(position = "identity") +
  facet_grid(Expt ~ .) +
  geom_vline(xintercept = 792.458, linetype = "dashed", size = 1.0)
morley_hist
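
For comparison only (a sketch, not part of the analysis that follows), splitting horizontally instead would use the formula . ~ Expt, producing one column of panels per experiment:

# split the plot horizontally: one panel per experiment, side by side
morley_hist_horiz <- ggplot(morley, aes(x = Speed, fill = factor(Expt))) + 
  geom_histogram(position = "identity") +
  facet_grid(. ~ Expt) +
  geom_vline(xintercept = 792.458, linetype = "dashed", size = 1.0)
morley_hist_horiz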

The visualization now makes it quite clear how accurate the different experiments were with respect to one another. There are two finishing touches to make this visualization even clearer. First and foremost, we need to add informative axis labels and increase the font size so the text is readable.

Second, while it is easy to compare the experiments on this plot to one another, it is hard to get a sense for just how accurate all the experiments were overall. For example, how accurate is the value 800 on the plot, relative to the true speed of light? To answer this question we'll use the mutate function to transform our data into a relative measure of accuracy rather than absolute measurements:

morley_rel <- mutate(morley, relative_accuracy = 100 * ((299000 + Speed) - 299792.458) / 299792.458)
morley_hist <- ggplot(morley_rel, aes(x = relative_accuracy, fill = factor(Expt))) + 
  geom_histogram(position = "identity") +
  facet_grid(Expt ~ .) +
  geom_vline(xintercept = 0, linetype = "dashed", size = 1.0) + 
  labs(x = 'Relative Accuracy (%)', y = '# Measurements', fill = 'Experiment ID') + 
  theme(text = element_text(size = 18))
morley_hist
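
To make the transformation concrete (a worked example added here, not in the original text): a recorded value of 800 corresponds to a measured speed of 299,000 + 800 = 299,800 km/s, which is about 0.0025% above the true value:

# relative accuracy of a recorded value of 800: roughly +0.0025 (%)
100 * ((299000 + 800) - 299792.458) / 299792.458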

Wow, impressive! These measurements of the speed of light from 1879 had errors around 0.05% of the true speed. This shows you that even though experiments 2 and 5 were perhaps the most accurate, all of the experiments did quite an admirable job given the technology available at the time period.


    4.6 Explaining the visualization

    4.6.0.1 Tell a story

Typically, your visualization will not be shown completely on its own, but rather it will be part of a larger presentation. Further, visualizations can provide supporting information for any part of a presentation, from opening to conclusion. For example, you could use an exploratory visualization in the opening of the presentation to motivate your choice of question to investigate.


    4.7 Saving the visualization

    4.7.0.1 Choose the right output format for your needs

Just as there are many ways to store data sets, there are many ways to store visualizations and images. Which one you choose can depend on a number of factors, such as file size/type limitations (e.g., if you are submitting your visualization as part of a conference paper or to a poster printing shop).


    Let’s investigate how different image file formats behave with a scatter plot of the Old Faithful data set, which happens to be available in base R under the name faithful:

library(svglite)  # we need this to save SVG files
faithful_plot <- ggplot(data = faithful, aes(x = waiting, y = eruptions)) +
  geom_point()

faithful_plot

ggsave('faithful_plot.png', faithful_plot)
ggsave('faithful_plot.jpg', faithful_plot)
ggsave('faithful_plot.bmp', faithful_plot)
ggsave('faithful_plot.tiff', faithful_plot)
ggsave('faithful_plot.svg', faithful_plot)

print(paste("PNG filesize: ", file.info('faithful_plot.png')['size'] / 1000000, "MB"))
## [1] "PNG filesize:  0.07861 MB"
print(paste("JPG filesize: ", file.info('faithful_plot.jpg')['size'] / 1000000, "MB"))
## [1] "JPG filesize:  0.139187 MB"
print(paste("BMP filesize: ", file.info('faithful_plot.bmp')['size'] / 1000000, "MB"))
## [1] "BMP filesize:  3.148978 MB"
print(paste("TIFF filesize: ", file.info('faithful_plot.tiff')['size'] / 1000000, "MB"))
## [1] "TIFF filesize:  9.443892 MB"
print(paste("SVG filesize: ", file.info('faithful_plot.svg')['size'] / 1000000, "MB"))
## [1] "SVG filesize:  0.046145 MB"

Wow, that's quite a difference! Notice that for such a simple plot with few graphical elements (points), the vector graphics format (SVG) is over 100 times smaller than the uncompressed raster images (BMP, TIFF). Also note that the JPG format is twice as large as the PNG format, since the JPG compression algorithm is designed for natural images (not plots). Below, we also show what the images look like when we zoom in to a rectangle with only 3 data points.
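
As a practical aside (not in the original text): if you need to meet a specific size or resolution requirement, for example for a print shop, ggsave also accepts explicit dimensions and resolution. A sketch with arbitrary example values:

# save an 8 x 5 inch, 300 DPI PNG (dimensions chosen only for illustration)
ggsave('faithful_plot_print.png', faithful_plot,
       width = 8, height = 5, units = "in", dpi = 300)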


    3.4.3 Going from wide to long (or tidy!) using gather

One common task in getting data into a tidy format is to combine columns that are really part of the same variable but are currently stored in separate columns. To do this we can use the function gather. gather acts to combine columns, and thus makes the data frame narrower.

Data is often stored in a wide (not tidy) format because that format is often more intuitive for human readability and understanding, and humans create data sets. An example of this is shown below:

library(tidyverse)
hist_vote_wide <- read_csv("data/us_vote.csv") 
hist_vote_wide <- select(hist_vote_wide, election_year, winner, runnerup)
hist_vote_wide <- tail(hist_vote_wide, 10)
hist_vote_wide
    ## # A tibble: 10 x 3
     ##    election_year winner            runnerup         
     ##            <dbl> <chr>             <chr>            

    For the above example, we use gather to combine the winner and runnerup columns into a single column called candidate, and create a column called result that contains the outcome of the election for each candidate:

hist_vote_tidy <- gather(hist_vote_wide, 
                         key = result, 
                         value = candidate, 
                         winner, runnerup)
hist_vote_tidy
    ## # A tibble: 20 x 3
     ##    election_year result   candidate        
     ##            <dbl> <chr>    <chr>            
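
As an aside: in tidyr 1.0.0 and later, gather is superseded by pivot_longer, which does the same job with more descriptive argument names. A sketch of the equivalent call, assuming a recent tidyr:

# equivalent to the gather call above, using pivot_longer
hist_vote_tidy <- pivot_longer(hist_vote_wide,
                               cols = c(winner, runnerup),
                               names_to = "result",
                               values_to = "candidate")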

3.4.4 Using separate to deal with multiple delimiters

In addition to the previous untidiness we addressed in the earlier version of this data set, the one we show below is even messier: the winner and runnerup columns contain both the candidate's name as well as their political party. To make this messy data tidy we'll have to fix both of these issues.

hist_vote_party <- read_csv("data/historical_vote_messy.csv")
hist_vote_party
    ## # A tibble: 10 x 3
     ##    election_year winner             runnerup           
     ##            <dbl> <chr>              <chr>              

    First we’ll use gather to create the result and candidate columns, as we did previously:

hist_vote_party_gathered <- gather(hist_vote_party, 
                                   key = result, 
                                   value = candidate, 
                                   winner, runnerup)
hist_vote_party_gathered
    ## # A tibble: 20 x 3
     ##    election_year result   candidate          
     ##            <dbl> <chr>    <chr>              

    Then we’ll use separate to split the candidate column into two columns, one that contains only the candidate’s name (“candidate”), and one that contains a short identifier for which political party the candidate belonged to (“party”):

hist_vote_party_tidy <- separate(hist_vote_party_gathered,
                                 col = candidate, 
                                 into = c("candidate", "party"), 
                                 sep = "/") 
hist_vote_party_tidy
    ## # A tibble: 20 x 4
     ##    election_year result   candidate       party
     ##            <dbl> <chr>    <chr>           <chr>

We can see that this data now satisfies all 3 criteria, making it easier to analyze. For example, we could visualize the number of winning candidates for each party over this time span:

ggplot(hist_vote_party_tidy, aes(x = result, fill = party)) +
  geom_bar() +
  scale_fill_manual(values = c("blue", "red")) +
  xlab("US Presidential election result") +
  ylab("Number of US Presidential candidates") +
  ggtitle("US Presidential candidates (1980 - 2016)") 

From this visualization, we can see that between 1980 and 2016 (inclusive) the Republican party has won more US Presidential elections than the Democratic party.


3.5 Combining functions using the pipe operator, %>%

    3.5.1 Using %>% to combine filter and select

    Recall the US state-level property, income, population, and voting data that we explored in chapter 1:

us_data <- read_csv("data/state_property_vote.csv")
us_data
    ## # A tibble: 52 x 6
     ##    state                     pop med_prop_val med_income avg_commute party     
     ##    <chr>                   <dbl>        <dbl>      <dbl>       <dbl> <chr>     

Suppose we want to look at only the median income and median property value for the state of California. To do this, we can use the functions filter and select. First we use filter to create a data frame called ca_prop_data that contains only values for the state of California. We then use select on this data frame to keep only the median income and median property value variables:

ca_prop_data <- filter(us_data, state == "California")
ca_inc_prop <- select(ca_prop_data, med_income, med_prop_val)
ca_inc_prop
    ## # A tibble: 1 x 2
     ##   med_income med_prop_val
     ##        <dbl>        <dbl>

    Although this is valid code, there is a more readable approach we could take by using the pipe, %>%. With the pipe, we do not need to create an intermediate object to store the output from filter. Instead we can directly send the output of filter to the input of select:

ca_inc_prop <- filter(us_data, state == "California") %>% 
  select(med_income, med_prop_val)
ca_inc_prop
    ## # A tibble: 1 x 2
     ##   med_income med_prop_val
     ##        <dbl>        <dbl>
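
To see why this helps readability, recall that %>% simply passes the object on its left as the first argument of the function on its right. Without the pipe, avoiding the intermediate object would require a nested call; a sketch of the equivalent code:

# equivalent nested version: the filter result becomes select's first argument
ca_inc_prop <- select(filter(us_data, state == "California"),
                      med_income, med_prop_val)
ca_inc_prop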

3.5.2 Using %>% with more than two functions

    The %>% can be used with any function in R. Additionally, we can pipe together more than two functions. For example, we can pipe together three functions to order the states by commute time for states whose population is less than 1 million people:

small_state_commutes <- filter(us_data, pop < 1000000) %>% 
  select(state, avg_commute) %>% 
  arrange(avg_commute)
small_state_commutes
    ## # A tibble: 7 x 2
     ##   state                avg_commute
     ##   <chr>                      <dbl>

3.6.1 Calculating summary statistics

To calculate summary statistics on a data frame, we can use the function summarize. Examples of summary statistics we might want to calculate are the number of observations, the average/mean value for a column, the minimum value for a column, etc. Below we show how to use the summarize function to calculate the minimum, maximum and mean commute time for all US states:

us_commute_time_summary <- summarize(us_data, 
                                     min_mean_commute = min(avg_commute),
                                     max_mean_commute = max(avg_commute),
                                     mean_mean_commute = mean(avg_commute))
us_commute_time_summary
    ## # A tibble: 1 x 3
     ##   min_mean_commute max_mean_commute mean_mean_commute
     ##              <dbl>            <dbl>             <dbl>

3.6.2 Calculating group summary statistics

The group_by function takes at least two arguments. The first is the data frame that will be grouped, and the second and onwards are columns to use in the grouping. Here we use only one column for grouping (party), but more than one can also be used; to do this, list additional columns separated by commas, as sketched after the output below.

us_commute_time_summary_by_party <- group_by(us_data, party) %>% 
  summarize(min_mean_commute = min(avg_commute),
            max_mean_commute = max(avg_commute),
            mean_mean_commute = mean(avg_commute))
## `summarise()` ungrouping output (override with `.groups` argument)

us_commute_time_summary_by_party
    ## # A tibble: 3 x 4
     ##   party          min_mean_commute max_mean_commute mean_mean_commute
     ##   <chr>                     <dbl>            <dbl>             <dbl>
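
As a sketch of the syntax for multiple grouping columns (the columns here are chosen purely for illustration; each state appears only once in this data set, so every group contains a single row):

# group by both party and state, just to show listing several grouping columns
us_commute_by_party_state <- group_by(us_data, party, state) %>% 
  summarize(mean_commute = mean(avg_commute))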

3.8 Using purrr's map* functions

In cases like this, where you want to apply the same data transformation to all columns, it is more efficient to use purrr's map function to apply it to each column. For example, let's find the maximum value of each column of the mtcars data frame (a built-in data set that comes with R) by using map with the max function. First, let's peek at the data to familiarize ourselves with it:

head(mtcars)
    ##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
     ## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
     ## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4

    Next, we use map to apply the max function to each column. map takes two arguments, an object (a vector, data frame or list) that you want to apply the function to, and the function that you would like to apply. Here our arguments will be mtcars and max:

max_of_columns <- map(mtcars, max)
max_of_columns
    ## $mpg
     ## [1] 33.9
     ## 

    Our output looks a bit weird…we passed in a data frame, but our output doesn’t look like a data frame. As it so happens, it is not a data frame, but rather a plain vanilla list:

typeof(max_of_columns)
    ## [1] "list"

So what do we do? Should we convert this to a data frame? We could, but a simpler alternative is to just use a different map_* function from the purrr package. There are quite a few to choose from; they all work similarly, and their names reflect the type of output you want from the function.
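
For example (a sketch, not in the original text), since max returns a single number for each column, map_dbl collects the results into a plain named numeric vector instead of a list:

# map_dbl returns a named double vector rather than a list
max_of_columns_dbl <- map_dbl(mtcars, max)
max_of_columns_dbl["mpg"]  # the named mpg element, 33.9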


Let's get the column maximums again, but this time use the map_df function to return the output as a data frame:

max_of_columns <- map_df(mtcars, max)
max_of_columns
    ## # A tibble: 1 x 11
     ##     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
     ##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>