R Notes and snippets

Installing R on debian

    cat >> /etc/apt/sources.list << EOF
    deb http://cran.rstudio.com/bin/linux/debian stretch-cran34/
    EOF

Install dirmngr

sudo apt install dirmngr

Receive debian key

sudo apt-key adv --keyserver keys.gnupg.net --recv-key 'E19F5F87128899B192B1A2C2AD5F960A256A04AF'

Update the repo and then install R

sudo apt update
sudo apt install r-base

Lubridate - introductory technical paper

This paper (Grolemund and Wickham) offers a good introduction and comparison between using lubridate and not using it, as well as several examples of using the library. It also offers some case studies which can serve as useful drill exercises.

Importing multiple excel sheets from multiple excel files

This is one approach to importing multiple sheets from multiple excel files into a list of tibbles. The goal is that each sheet is imported as a separate tibble.

Loading the libraries: in addition to the tidyverse, this approach uses the rio package.

## install the rio library.
## Rio  makes data import a little easier for different file types.

## install.packages("rio")
library("rio")
library("tidyverse")

User input for the path: this points to a folder that is expected to contain multiple excel files.

## Note that patterns can be provided as an argument to filter file types.
folder_path <- c("~/temp/bsu_test/")

The contents of the directory can be listed with the fs::dir_info function. Pulling the path column from its output gives the paths to the excel files found.

excel_paths_tbl <- fs::dir_info(folder_path)

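paths_chr <- excel_paths_tbl %>%
  pull(path)

## import_list() returns one tibble per sheet for each file.
## unlist(recursive = FALSE) flattens the nested result by one level into a
## single list of tibbles (dplyr::combine() is deprecated).
excel_data <- paths_chr %>%
  map(~ import_list(., setclass = "tbl")) %>%
  unlist(recursive = FALSE)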

glimpse(excel_data)

References

  1. BSU Course DSB-101-R
  2. I learnt about the Rio package in this Stack Overflow discussion

TODO Data Explorer package

The DataExplorer package aims to provide tools for EDA, feature engineering and data reporting. It is handy for getting a quick overview of the data from multiple perspectives.

Installation

install.packages("DataExplorer")

Salient points:

  1. A list of data frames can be provided as the input.
  2. plot_str : displays a network graph linking the variables, their types and the data frames in the list. This is displayed in the browser. The type = "r" argument can be used for a radial network.
  3. introduce : provides a table of counts rather than percentages, such as the number of rows, columns and missing values.
  4. plot_intro : Visualises the output of introduce.
  5. plot_missing : useful to know the percentage of missing values in each feature.

Devtools package

…devtools package, which is the public face for a suite of R functions that automate common development tasks.
R Packages (book)

Official details of package development : link

Basic libraries to aid package development

install.packages(c("devtools", "roxygen2", "testthat", "knitr"))

Visdat : preprocessing visualisation link

This package could be very useful in exploring new data or looking at how the data is changing after a wrangling operation. It could save repeatedly looking at the CSV file manually to make sure the change is implemented.

Installing Visdat

library("easypackages")
packages("visdat")

Main Functions:

vis_dat
vis_miss
vis_compare
vis_expect
vis_cor
vis_guess

General Exploration

Note: typical_data is a dataset that is included with the package and is useful to explore the functions.

libraries("tidyverse", "visdat")
vis_dat(typical_data)
vis_miss(typical_data)

Clustering the missing data in the columns

vis_miss(typical_data,
         cluster =  TRUE)

Long <-> Wide formats : example for gathering

library("tidyverse")

## Defining a sample tribble with several duplicates.
## Missing client ids are entered as NA so that each column keeps a single (numeric) type.
a <- tribble(
    ~IDS, ~"client id 1", ~"client id 2", ~"client id 3", ~"client id 4", ~"old app", ~"new app",
    123, 767, 888, NA, NA, "yes", "no",
    222, 333, 455, 55, 677, "no", "yes",
    222, 333, 343, 55, 677, "no", "yes"
)


## Defining vector to form column names
vec1 <- 1:4
vec2 <- "client id"
vec3 <- as.character(str_glue("{vec2} {vec1}"))

## Gathering and removing duplicates.
## all_of() selects the columns named in the external vector vec3.
a %>%
    gather(
        key = "Client number",
        value = "client ID",
        all_of(vec3)
    ) %>%
    unique()

Matrix

Defining a matrix

A matrix is a collection of elements of the same data type (numeric, character, or logical) arranged into a fixed number of rows and columns.

A matrix is called two-dimensional, since there are rows and columns. It is constructed using the matrix() function.

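Arguments: the data used to fill the matrix, byrow (fill row by row instead of column by column) and nrow or ncol (the number of rows or columns). The length of the data should be a multiple of the number of rows.

matrix(1:10, byrow = TRUE, nrow = 5)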

Demonstrating the difference of not using byrow

matrix(1:10, ncol = 2, nrow = 5)
matrix(1:10, ncol = 2, nrow= 5 , byrow = TRUE)
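
The effect of byrow can also be checked by looking at individual rows. A minimal sketch in base R (the names m_col and m_row are only used for this illustration):

```r
# Same data, filled column-wise (the default) and row-wise
m_col <- matrix(1:10, ncol = 2, nrow = 5)
m_row <- matrix(1:10, ncol = 2, nrow = 5, byrow = TRUE)

# The first row holds different values depending on the fill order
m_col[1, ]  # 1 6
m_row[1, ]  # 1 2

# A row-wise fill equals the transpose of a column-wise fill with swapped dimensions
identical(m_row, t(matrix(1:10, nrow = 2, ncol = 5)))
```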

Naming the rows and the columns

rownames() and colnames() can be used.

#Defining the row data
row_1 <- c(250, 300)
row_2 <- c(55, 350)

# Defining the matrix
my_matrix <- matrix(c(row_1, row_2), byrow = TRUE, nrow = 2)

# Defining row and column names
my_rownames <- c("test_row1", "test_row2")
my_colnames <- c("test_col1", "test_col2")

# Attaching row and column names to the created matrix
rownames(my_matrix) <- my_rownames
colnames(my_matrix) <- my_colnames

my_matrix

Sums - rowSums() and colSums(), adding rows - rbind() and columns - cbind()

my_rowsums <-  rowSums(my_matrix)

# Adding a new column of the calculated sums
my_new_matrix <- cbind(my_matrix, my_rowsums)
my_new_matrix

# Adding a new row and calculating the sums again
row_3 <- c(200, 100 )
my_newest_matrix <- rbind(my_matrix, row_3)
my_new_rowsums <- rowSums(my_newest_matrix)
my_newest_matrix <- cbind(my_newest_matrix, my_new_rowsums)

my_newest_matrix
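
colSums() is mentioned in the heading but not used above. A small sketch of how it could be combined with rbind(), with the matrix redefined so that the snippet runs on its own (the names my_colsums and my_matrix_with_totals are only for illustration):

```r
# A small matrix, redefined here so the example runs on its own
my_matrix <- matrix(c(250, 300, 55, 350), byrow = TRUE, nrow = 2,
                    dimnames = list(c("test_row1", "test_row2"),
                                    c("test_col1", "test_col2")))

# Column totals: one value per column
my_colsums <- colSums(my_matrix)
my_colsums

# Appending the totals as a new row named "total"
my_matrix_with_totals <- rbind(my_matrix, total = my_colsums)
my_matrix_with_totals
```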

Dates

# Using the system date and time
todays_date <- Sys.Date()
todays_time <- Sys.time()
todays_date
todays_time

# Class of defined date and time
class(todays_date)
class(todays_time)

# Reading alternate formats of dates
test_date_alt_format <- "23/02/2019"
as.Date(test_date_alt_format, format = "%d/%m/%Y")

test2_date_alt_format <- "Sep 25,2020"
as.Date(test2_date_alt_format, format = "%B %d,%Y")

# Extractor functions
weekdays(as.Date(test2_date_alt_format, format = "%B %d,%Y"))

# Subtracting dates
date1 <- as.Date("2030-02-20")
date2 <- as.Date("2040-03-30")
date2 - date1
difftime(date2, date1, units = 'secs')
difftime(date1, date2, units = 'mins')

# Setting the weekdays as names()
dates3 <- c(date1, date2, as.Date(c("2025-03-23", "2015-04-25")))
names(dates3) <- weekdays(dates3)
dates3

# Syntax example of using Not (relational operators)
a <- c(100,140,2,240, 300)
# checking where a is Not greater than 200
!(a > 200)

# Testing runif()

Vectors

a <- c("This is a character type vector", "which contains 2 strings")
a
length(a) # the result will be 2 because there are 2 elements
nchar(a)  # Actual number of characters in each string

Vectorised functions

Most functions in R are vectorised: given a vector, the function is applied to each of its elements. This concept is important to understand, especially when progressing to tidyeval style functions.

Example of multiple substitutions with the assignment operator, which is also vectorised.

languages <- c("English", "Italian", "Urdu")
print(languages)
languages[c(2,3)] <- c("Norwegian", "Latin")
print(languages)
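
A further sketch of vectorisation in base R, comparing a vectorised operation with the equivalent explicit loop. The vector heights_cm and its values are made up for illustration:

```r
# Arithmetic and comparison operators act element by element
heights_cm <- c(150, 167, 182)
heights_m <- heights_cm / 100
heights_m > 1.6

# Many functions are vectorised in the same way
nchar(c("English", "Italian", "Urdu"))

# The equivalent explicit loop, for comparison
heights_loop <- numeric(length(heights_cm))
for (i in seq_along(heights_cm)) {
  heights_loop[i] <- heights_cm[i] / 100
}
identical(heights_m, heights_loop)
```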

Lists

                                        # Creating a simple list of 4 elements: name, age, height, horn.size

my.list <- list(
  name = "Shreyas",
  age = 776,
  height = 167,
  horn.size = 25
)

my.list
                                        # the tag names can be extracted using the names()
names(my.list)
people <- c("shreyas", "tom", "harry")
lapply(people, toupper)
                                        # the first argument is the list and the 2nd argument is the function. Additional arguments to the function can also be supplied. This returns a new list and the old list remains unmodified.
lapply(people, paste, "hello")
people

Examples using lapply and other list and vector Manipulation

                                        # Creating vectors of meals  and meal items

breakfast <- c("eggs", "bread", "orange juice")
lunch <- c("pasta", "coffee")
meals <- list(breakfast = breakfast, lunch =  lunch)
meals
meals <- c(meals, list(dinner = c("noodles", "bread")))
meals
names(meals)

                                        # Extracting dinner
dinner <- meals$dinner

                                        # Adding earlier meals to a separate list
early_meals <- c(meals["breakfast"], meals["lunch"])
early_meals

                                        # Finding the number of items in each meal.
number_items_meal <- lapply(meals , length)
number_items_meal

                                        # Write a function `add_pizza` that adds pizza to a given meal vector, and  returns the pizza-fied vector
add_pizza <- function(vector, string = "pizza") {
  pizzafied <- paste(vector, string, sep = "-")
  return(pizzafied)
  }

add_pizza(breakfast)

                                        # Create a vector `updated_meals` that is all your meals, but with pizza!
updated_meals <- c(add_pizza(breakfast),
                   add_pizza(lunch),
                   add_pizza(dinner)
                   )
updated_meals

Factors

Working with categorical data:

ranking <- c(1:20)
head(ranking)
buckets <- c(0, 5, 10, 15, 20)
ranking_grouped <- cut(ranking, breaks = buckets)
head(ranking_grouped)
ranking_grouped
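
The output of cut() is a factor. A short sketch of inspecting it and of giving the buckets readable names; the label strings are arbitrary choices for this example:

```r
ranking <- 1:20
buckets <- c(0, 5, 10, 15, 20)

# Custom labels replace the default interval names such as "(0,5]"
ranking_labelled <- cut(ranking, breaks = buckets,
                        labels = c("top 5", "6-10", "11-15", "16-20"))

# The result is a factor; levels() lists the categories and table() counts them
class(ranking_labelled)
levels(ranking_labelled)
table(ranking_labelled)
```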

Dataframe

Used to store a table of data. Multiple data types can be stored in a single dataframe. A matrix can store only a single data type.
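
A minimal sketch of the difference, using a made-up data frame (people_df and its values are hypothetical):

```r
# A data frame keeps a separate type for each column
people_df <- data.frame(
  name = c("tom", "harry"),
  age = c(31, 45),
  member = c(TRUE, FALSE),
  stringsAsFactors = FALSE
)
sapply(people_df, class)

# Converting to a matrix forces every value to one common type (here character)
people_matrix <- as.matrix(people_df)
class(people_matrix[, "age"])
```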

TODO Dataframe peek function in R

head()
tail()
str()
desc()
glimpse()
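
A quick sketch of the base R functions from this list, tried on the built-in mtcars dataset. summary() is not in the list above and is added here only as a related base R helper; glimpse() comes from the tidyverse and is left out so that the snippet needs no extra packages:

```r
# mtcars ships with R in the datasets package
head(mtcars, 3)   # first 3 rows
tail(mtcars, 3)   # last 3 rows
str(mtcars)       # compact structure: column names, types and first values
summary(mtcars)   # per-column summary statistics
```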

Package installation (especially for data science and ML)

The easypackages package enables quickly loading or installing multiple libraries with a single call. The snippet below installs several packages at once. In general, it is better to install packages one by one; they can however be loaded together.

install.packages("easypackages")
library("easypackages")
packages("tidyverse", "tidyquant", "glmnet", "rpart", "rpart.plot", "ranger", "randomForest", "xgboost", "kernlab", "visdat")

Basic Statistics concepts

Median

##' Source: Conway, Drew; White, John Myles. Machine Learning for Hackers: Case Studies and Algorithms to Get You Started (p. 39). O'Reilly Media. Kindle Edition.
##' Additional comments are my own.
##' Function to illustrate how a median is calculated for odd and even datasets

my.median  <- function(x){
  ## Step 1: sort x in ascending order
  sorted.x  <- sort(x)
  ## Step 2: check whether the length of x is even or odd.
  ## If even: there are 2 middle values and the median is their mean.
  ## If odd: there is a single middle value.
  if (length(x) %% 2 == 0) {
    ## These numbers are used as indices into the sorted vector.
    indices  <- c(length(x) / 2, length(x) / 2 + 1)
    return(mean(sorted.x[indices]))
  } else {
    index  <- ceiling(length(x) / 2)
    return(sorted.x[index])
  }
}
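
As a quick check of the odd and even cases, the built-in median() can be applied to two small vectors (both vectors are made up for illustration):

```r
odd_sample <- c(5, 1, 3)      # sorted: 1 3 5, the middle value is the median
even_sample <- c(4, 1, 3, 2)  # sorted: 1 2 3 4, the median is the mean of 2 and 3

median(odd_sample)   # 3
median(even_sample)  # 2.5
```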

Quantile

                                        # Defining a sample of numbers to calculate quantile.
a  <- c(seq(from = 1, to = 30), seq(from = 40, to = 50, by = 0.2))
quantile(a)

                                        # Defining bins or cuts for quantile. The default step is 0.25 (quartiles).
quantile(a, probs =  seq(0,1,by = 0.2))

promptData() : generate shell documentation of dataset

If the filename argument is given as NA (the missing value, not the string "NA"), the output is returned as a list of the documentation pieces instead of being written to a file. If no filename is specified, an .Rd file will be created in the current working directory.

promptData(sunspots, filename = NA)

Downloading a file to specific location

With wget: -P is the flag for the prefix directory into which the file is downloaded. The path will be created if it does not exist. If the file already exists, a duplicate will be created with the '.1' suffix. Since the command is passed to wget as a string, the " and other special characters have to be explicitly escaped.

## Download file to specific location
system("wget \"https://raw.githubusercontent.com/amrrs/sample_revenue_dashboard_shiny/master/recommendation.csv\" -P ./sales-rev-app/")

Removing user installed packages alone

Sometimes, it is not possible to remove R completely. This is a nice snippet from an R-bloggers post to remove the user installed packages alone.

# create a list of all installed packages
 ip <- as.data.frame(installed.packages())
 head(ip)
# if you use MRO, make sure that no packages in this library will be removed
 ip <- subset(ip, !grepl("MRO", ip$LibPath))
# we don't want to remove base or recommended packages either
 ip <- ip[!(ip[,"Priority"] %in% c("base", "recommended")),]
# determine the library where the packages are installed
 path.lib <- unique(ip$LibPath)
# create a vector with all the names of the packages you want to remove
 pkgs.to.remove <- ip[,1]
 head(pkgs.to.remove)
# remove the packages
 sapply(pkgs.to.remove, remove.packages, lib = path.lib)

Rprofile and user files

To find the installation location of R, use the R.home() function with component specified as shown below. More information.

R.home(component='home')
R.home(component='etc')
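
Building on R.home(), a sketch of where the profile files would usually be found. The file names Rprofile.site and .Rprofile are the ones described in R's Startup documentation; whether they actually exist depends on the installation, and the variable names are only for illustration:

```r
# Site-wide profile, inside the 'etc' directory of the R installation
site_profile <- file.path(R.home(component = "etc"), "Rprofile.site")

# Per-user profile in the home directory
user_profile <- path.expand("~/.Rprofile")

# Either file may be absent, so check before trying to read it
file.exists(c(site_profile, user_profile))
```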

Jupytext for conversion to Rmd

Jupytext can save Jupyter notebooks as:
- Markdown and R Markdown Documents,
- Julia, Python, R, Bash, Scheme, Clojure, Matlab, Octave, C++ and q/kdb+ scripts.
Jupytext package

This is a convenient tool to convert a Jupyter notebook into multiple formats, and it also enables collaboration across documents.

Installing Jupytext using conda:

conda install -c conda-forge jupytext

My most common usage of this tool is to convert Jupyter notebooks (.ipynb) to R Markdown (.Rmd). Deploying jupytext as a Library of Babel (LOB) ingest makes it easy to call from anywhere in Emacs.

jupytext $jup_notebook --to rmarkdown

Installing the R kernel for Jupyter notebooks

Reference: link

The easiest way for me to export org files to a notebook format is the IPython notebook export available in Scimax. Installing the R kernel for Jupyter notebooks is as simple as installing an R package:

install.packages('IRkernel')

To register the kernel in the current R installation:

IRkernel::installspec()

By default, IRkernel::installspec() will install a kernel with the name “ir” and a display name of “R”. To have multiple versions of R available as kernels:

# in R 3.3
IRkernel::installspec(name = 'ir33', displayname = 'R 3.3')
# in R 3.2
IRkernel::installspec(name = 'ir32', displayname = 'R 3.2')

It is possible to install the IRkernel package via Docker.

Note: Some additional packages may be required before installing IRkernel. Try the following:

install.packages(c('repr', 'IRdisplay', 'evaluate', 'crayon', 'pbdZMQ', 'devtools', 'uuid', 'digest'))
devtools::install_github('IRkernel/IRkernel')

Troubleshooting with R version.

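## In an R session: print the version details of the running R
version
## In a shell: show which R executable is found on the PATH
which R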

How an R session starts

Source: <https://stat.ethz.ch/R-manual/R-devel/library/base/html/Startup.html>

Upgrading packages in R (R session)

Source: Arch wiki

When you also need to rebuild packages which were built for an older version:

update.packages(ask=FALSE,checkBuilt=TRUE)

When you also need to select a specific mirror (<https://cran.r-project.org/mirrors.html>) to download the packages from (changing the URL as needed):

update.packages(ask=FALSE,checkBuilt=TRUE,repos="https://cran.cnr.berkeley.edu/")

You can use Rscript, which comes with R, to update packages from a shell:

Rscript -e "update.packages()"

Installing R on Debian

# sudo does not apply to a shell redirection (>>), so append through tee instead
sudo tee -a /etc/apt/sources.list << EOF
# adding mirror for installation of R
deb http://cran.rstudio.com/bin/linux/debian stretch-cran34/
EOF

sudo apt-get update

Using Debian's GPG Key

sudo apt install dirmngr
sudo apt-key adv --keyserver keys.gnupg.net --recv-key 'E19F5F87128899B192B1A2C2AD5F960A256A04AF'

Installing r-base

sudo apt-get install r-base

Some pre-requisite libraries are required for installing various R Packages

sudo apt-get install libcurl4-openssl-dev