# MATH1005 Wednesday 10am Semester 2

### Assessment Dates

Rough assessment dates are provided below, purely to give you an idea of what is coming up. These dates should not be taken as fact. The onus is on students to check Canvas, EdStem and the Unit of Study outline.

### Extra Help/Staff Contact

Any questions about course material should be asked on EdStem. Students are free to email about personal issues, but questions about content will be redirected to EdStem.

### Important Links

### Calendar

Week | Slides | Class Notes | Misc. | Further Learning | Assessments |
---|---|---|---|---|---|
Week 1 (Aug 2) | Introduction | Lab sheet from class | – | Britannica Simpson’s Paradox Article; R Markdown Cheat Sheet | RQuiz1: Design of Experiments (Aug 6) |
Week 2 (Aug 9) | Lab 2 | Lab sheet from class. Please read clarification 3. | Please see below for notes and clarifications from today’s lab. | Article on how to pick the right chart type; Interpreting skewness from a boxplot | RQuiz2: Data & Graphical Summaries (Aug 13) |
Week 3 (Aug 16) | Lab 3; Code for graph in slides | Lab sheet from class | – | – | RQuiz3: Numerical Summaries (Aug 20) |
Week 4 (Aug 23) | Lab 4 | Lab sheet from class | Download the lab sheet for some tips on how to approach manual calculations for the normal distribution. | – | RQuiz4: Normal Model (Aug 27) |
Week 5 (Aug 30) | No slides this week. Please scroll below for week 5 notes. | Lab sheet from class | – | – | RQuiz5: Linear Model (Sep 3) |
Census Date (Aug 31) | – | – | Last day to drop a unit without incurring financial or academic penalty. | – | – |
Week 6 (Sep 6) | – | Lab sheet from class | – | Can computers generate random numbers? | RQuiz6: Understanding Chance (Sep 10) |
Week 7 (Sep 13) | – | Lab sheet from class; Box model question 1.1 | – | See Week 7 Notes! | First Group Project Peer Review (Sep 14); RQuiz7: Chance Variability (Sep 17) |
Week 8 (Sep 20) | – | See Lab8Solution.html on Canvas | – | – | Second Group Project Peer Review (Sep 22); RQuiz8: Normal Approximation (Sep 24) |
Assessment (Sep 22) | – | – | – | – | Group Project 1 Due |
Mid Semester Break (Sep 25-29) | – | – | No class. | – | – |
Week 9 (Oct 4) | – | See Lab9Solution.html on Canvas | – | – | RQuiz9: Sample Survey + Bias (Oct 8) |
Week 10 (Oct 11) | – | z-test question; Uber question | – | – | RQuiz10: z-test (Oct 15) |
Week 11 (Oct 18) | – | Caffeine Question; Vitamin-C Question; See Lab11Solution.html on Canvas | How to include figure captions in RMarkdown | – | RQuiz11: t-test (Oct 22) |
Assessment (Oct 20) | – | – | – | – | Individual Project 2 Due |
Week 12 (Oct 25) | – | – | – | – | – |
Week 13 (Nov 1) | – | – | Last week of classes. | – | – |
STUVAC (Nov 6-10) | – | – | Study vacation. | – | – |
Exam Period (Nov 13-25) | – | – | Exam dates to be released by the University. | – | – |

### **Week 2: Notes/Clarifications**

__Clarification 1__

In the week 2 slides, on slide 10, the “qualitative” and “quantitative” labels in the flow chart were in the wrong location. The slides have been re-uploaded with the “qualitative” and “quantitative” labels in the correct position.

__Clarification 2__

In the tutorial, I mentioned that we would soon be learning how to use ggplot to create our plots. This is actually **not** the case in MATH1005. If you are enjoying R and think that data science is something you would like to pursue, I would definitely recommend checking out ggplot!

__Clarification 3__

I was asked in class what `las=2` inside the barplot code does. This changes the x-axis category names to be vertical.

I said in class that this removes the column without a category name. This is incorrect! The reason there appeared to be a missing title was that the string “Wednesday” was too long to be shown horizontally.
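If you would like to see the effect for yourself, here is a minimal sketch. The `day_counts` vector is made up for illustration and is not the data from the lab. It draws the same bar plot twice, once with the default label orientation and once with `las = 2`.

```r
# Hypothetical counts per day (made-up numbers, not the lab data)
day_counts <- c(Monday = 12, Tuesday = 9, Wednesday = 15, Thursday = 7)

# Put two plots side by side so they are easy to compare
old_par <- par(mfrow = c(1, 2))

# Default: x-axis labels are horizontal, so long names may be dropped
barplot(day_counts, main = "Default labels")

# las = 2: labels are drawn perpendicular to the axis (vertical on the x-axis)
mids <- barplot(day_counts, las = 2, main = "las = 2")

# Restore the previous plotting layout
par(old_par)
```

Depending on the width of your plot window, the default version may skip a label such as “Wednesday”, while the `las = 2` version shows every category.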

__Notes - Complex Boxplot__

In class, we didn’t quite get enough time to finish some of the harder plotting questions. One such question asked us to investigate the pattern between “age” and “crash type”. Here we are considering one quantitative variable (age) and one qualitative variable (crash type). This means that boxplots would be a good choice.

The first thing we do is select the “age” data: `age = road$Age`

Now, if we were to run `class(age)`, we would see that the age vector is of type character (which implies it is currently being treated as a qualitative variable). We actually want to change it to a numeric variable, and we can do that by running `ageN = as.numeric(age)`. We now have a variable called `ageN` which stores the ages as a numeric type.
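Here is a small sketch of that conversion on a made-up vector (the values below are hypothetical stand-ins for `road$Age`). It also shows what happens to entries that are not numbers: `as.numeric` turns them into `NA` and gives a warning, which I have silenced here with `suppressWarnings`.

```r
# Hypothetical ages stored as text (stand-in for road$Age)
age <- c("23", "41", "unknown", "67")
class(age)            # "character"

# Convert to numbers; text that is not a number becomes NA
ageN <- suppressWarnings(as.numeric(age))
class(ageN)           # "numeric"

# Count how many entries could not be converted
sum(is.na(ageN))      # 1
```

It is worth checking the number of `NA` values after converting, so you know how much data was not usable as a number.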

We also need to extract the crash type, which we can do by: `crash_type = road$Crash_Type`

Now we can create the boxplot using the following code: `boxplot(ageN ~ crash_type, horizontal=T, col=c("light blue", "light green", "light pink"), main = "Age distribution by crash type")`

How does this work? Well, the main thing is that we need to tell R how we want the boxplots to be formed. `ageN ~ crash_type` informs R that we want “ageN” to be the value that we are finding the distribution of, and that we want to separate the ages by `crash_type`. The other parameters should be fairly straightforward to understand - if they’re not, change them and see what happens!
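If you don’t have the road data loaded, the sketch below runs the same steps on a tiny made-up data frame. `road_demo` and its values are hypothetical; only the column names `Age` and `Crash_Type` are taken from the lab.

```r
# Tiny made-up data frame standing in for the road data
road_demo <- data.frame(
  Age = c("19", "25", "33", "47", "52", "61", "28", "39", "70"),
  Crash_Type = c("Single", "Single", "Single",
                 "Multiple", "Multiple", "Multiple",
                 "Pedestrian", "Pedestrian", "Pedestrian")
)

# Same steps as above: extract, convert to numeric, then plot
ageN <- as.numeric(road_demo$Age)
crash_type <- road_demo$Crash_Type

# One box per crash type: "ageN split up by crash_type"
box_stats <- boxplot(ageN ~ crash_type,
                     horizontal = TRUE,
                     col = c("lightblue", "lightgreen", "lightpink"),
                     main = "Age distribution by crash type (demo data)")

# The groups R found, in the order the boxes are drawn
box_stats$names
```

R orders the groups alphabetically, so the colours in `col` are matched to the crash types in that order.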

To visualise what this plot looks like you can check out 2.0.2 of the “Lab Sheet From Class” in the “Class Notes” section of week 2.

There is one more boxplot in the explore section, but try this one out yourself! We’ll go through it next tutorial.

### **Week 5: Notes**

For the first part of the tutorial, we went through the group project. Here, I wanted to provide some broad thoughts about how to go about sourcing additional research in scientific reports.

Whenever we write something in a scientific report that did not come from us, it’s important that we also include a source to add some weight behind what we have just claimed. Citing is really important, because it shows that we didn’t simply make up what we claimed.

Here is a short excerpt (that has been slightly modified) from a report I wrote for one of my university classes:

Despite concerns that in industry, women only account for 25% of computer-science related jobs (Daley, 2021), this does not hold among the people which responded to the DATA2x02 survey according to the Chi-squared goodness of fit test. In fact, 36% of DATA2x02 students identify as female, which is much larger than 25%.

Now, I’m not claiming that my above excerpt is a masterpiece, but I do think it does a good job of showing how we can weave additional research into our reports. In the above excerpt, I am making a claim that in “industry, women only account for 25% of computer-science related jobs.” But I am no expert in this field, and so I have to share where I got this “25%” value from.

I do this by including what is called an in-text citation: the “Daley, 2021” in brackets. This comes straight after I mention the “25%” figure, which I gathered from a source online. I include the citation to indicate that this figure is not my own observation, but comes from someone else (who would know much more about this than me).

In the reference list later in my report, I would have a full citation for Daley, written as:

Daley, S. (2021) Women in Tech Statistics Show the Industry Has a Long Way to Go. https://builtin.com/women-tech/women-in-tech-workplace-statistics

In a nutshell, whenever we are making a claim that does not come from our own research, we have to include a reference to where we found that claim.

There are many different referencing styles that you can use, but in this course we require APA citations. You can find more information about how to write APA citations and in-text citations here:

### **Week 7: Notes**

In the original lab sheet, we were told to use the `multicon` package to work out the population standard deviation. However, this package has become outdated, and is not supported in the newest versions of R (the versions we are using). Hence, we need to find some other way to work out the population sd. Here are some methods:

**Method 1: Using the rafalib Library**

I did not know that this option existed until we used it in another class that I teach. The `rafalib` library allows us to calculate the population standard deviation directly (it is a very easy method)!

Before using this method, you first have to install the `rafalib` package. To do this, type the following into the console, and then press enter:

`install.packages("rafalib")`

Notice here the use of quotation marks around `rafalib` when installing the package.

Now, to load `rafalib`, type the following in an R chunk towards the top of your R Markdown document:

`library(rafalib)`

Notice that this time we don’t include the quotation marks around `rafalib`.

To find the population sd of a list/vector/column of data, we type:

`popsd(variable)`

This is the method that I used in the “lab sheet from class” in the week 7 class notes section.

When we have access to R, I would definitely recommend method 1 as the best of the three methods!

**Method 2: Working out the Population SD by Hand**

This is probably the most tiring of the options, but you could work out the population sd by using the formula (see this article - the population sd part).

From the example from class where we have a list of data with the numbers 0,0,0,1, the following would yield the population sd:

`sqrt( ( (1-0.25)^2+ (0-0.25)^2+ (0-0.25)^2+ (0-0.25)^2 ) / 4 )`
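Typing out every term gets painful for longer lists, so here is a sketch of the same by-hand calculation written so that R does the repetitive part. The variable names `population`, `mu` and `pop_sd` are my own choices for illustration.

```r
# The population from class
population <- c(0, 0, 0, 1)

# Step 1: the population mean (0.25 here)
mu <- mean(population)

# Step 2: squared deviations from the mean, averaged over all n values,
# then square-rooted
pop_sd <- sqrt(mean((population - mu)^2))
pop_sd   # 0.4330127
```

This gives the same answer as the long expression above, because `mean(...)` is just “add them up and divide by 4”.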

**Method 3: Multiply the Sample SD by a Factor**

Another solution that you could use is to multiply the sample standard deviation (which is built into R as `sd()`) by the factor `sqrt( (n-1) /n )`, where `n` is the number of elements in our population.

Using the example from class with the population 0,0,0,1, in R, we can find the population sd by doing:

`sqrt((4-1)/4) * sd(population)`

Here, we use `n = 4` as we have four elements in our population.
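Why does the factor work? The sample sd divides the sum of squared deviations by `n - 1`, whereas the population sd divides by `n`, so multiplying by `sqrt((n-1)/n)` swaps one divisor for the other. The sketch below checks this on the class example by comparing Method 3 against the by-hand calculation. The variable names are my own, and `length(population)` is used so that `n` does not have to be typed in manually.

```r
# The population from class
population <- c(0, 0, 0, 1)
n <- length(population)          # 4

# Method 3: rescale the built-in sample sd
sample_sd <- sd(population)      # 0.5
pop_sd_factor <- sqrt((n - 1) / n) * sample_sd

# By-hand version for comparison (divide by n instead of n - 1)
pop_sd_by_hand <- sqrt(mean((population - mean(population))^2))

# Both approaches should agree (up to rounding error)
all.equal(pop_sd_factor, pop_sd_by_hand)   # TRUE
```

Using `length(population)` rather than a hard-coded 4 means the same code keeps working if the population changes.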