Course: R Programming Language

R Programming Language

  • Life Time Access
  • Certificate on Completion
  • Access on Android and iOS App
  • Self-Paced
About this Course

This course is designed for software programmers, statisticians and data miners who want to develop statistical software using R. If you are approaching the R programming language as a beginner, this tutorial will give you a solid understanding of almost all the concepts of the language, from which you can take yourself to higher levels of expertise.

Before proceeding with this course, you should have a basic understanding of computer programming terminology. A basic understanding of any programming language will help you grasp the R programming concepts and move quickly along the learning track.

Who this course is for:

  • All graduates and pursuing students
Basic knowledge
  • Before proceeding with this course, you should have a basic understanding of computer programming terminology. A basic understanding of any programming language will help you grasp the R programming concepts and move quickly along the learning track
What you will learn
  • R Programming Language for Statistical Computing and Graphical Representation
Number of Lectures: 82
Total Duration: 68:52:55
R Programming Language
  • Introduction to R Programming  


    • R is a programming language
    • Free software
    • Statistical computing, graphical representation and reporting.
    • Designed by: Ross Ihaka, Robert Gentleman; developed at the University of Auckland
    • Derived from the S language (S-PLUS is its commercial implementation)
    • Typing discipline: Dynamic
    • Stable release: 3.5.2 ("Eggshell Igloo") / December 20, 2018
    • First appeared: August 1993
    • License: GNU GPL
    • Functional language
    • Interpreted programming language
    • Distributed by CRAN (Comprehensive R Archive Network)
    • Open source product (R-Community)
    • Functions are available as packages
    • Default packages are already attached to the R console, e.g. base, utils, stats, graphics etc.
    • Attach other packages to the R application as needed
    • Install add-on packages from CRAN mirrors.

    Write a program to print HELLO WORLD in C language:

    #include <stdio.h>

    int main()
    {
        printf("HELLO WORLD");
        return 0;
    }

    Write a program to print HELLO WORLD in Java:

    class Hello
    {
        public static void main(String args[])
        {
            System.out.println("HELLO WORLD");
        }
    }



    Write a program to print HELLO WORLD in R:

    print("HELLO WORLD")

    NOTE: The R programming language is very simple to learn when compared to traditional programming languages (C, C++, C#, Java).

  • R Installation & Setting R Environment  

    How to Download & Install R:

    • Go to the official website of R, i.e.,
    • (or)
    • Search "R" in Google and click on the first link (The R Project for Statistical Computing).
    • Click on "Download R".
    • Click on any one of the CRAN mirrors.
    • Click on "Download R for Windows".
    • Click on "Install R for the first time".
    • Finally, click on "Download R 3.5.1 for Windows (32/64 bit)".

    Setting R Environment:

    • R comes with a lot of packages.
    • By default, only some packages are attached to the R environment.
    1. search() - displays the currently attached packages
    2. installed.packages() - displays the packages installed on the machine
    3. library(package_name) / require(package_name) - attaches a package to the R application
    4. install.packages("package_name") - installs add-on packages from CRAN
    5. detach(package:package_name) - detaches a package from the R environment

    Package - Help

    • library(help = "package_name")

    Function - Help

    • help(function_name)
    • (or)
    • ?function_name
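These commands can be tried in one short session. A minimal sketch, assuming the standard splines package (which ships with R but is not attached by default):

```r
# List the packages currently attached to the search path
search()

# Names of the packages installed on this machine
head(installed.packages()[, "Package"])

# Attach a standard package that is not loaded by default
library(splines)
"package:splines" %in% search()   # TRUE once attached

# Detach it again
detach(package:splines)
"package:splines" %in% search()   # FALSE
```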

  • Variables, Operators & Data types  
  • Structures  

    Comments in R:


    --> A single-line comment is written using # at the beginning of the statement.

    # Comments are like helping text in your R program

    --> Multi-line comments are written using if(FALSE)

    if(FALSE) {

    "We put such comments inside,

    either single or double quotes" }

    Variable Assignment:


    1. print()

    2. cat()



    --> print() function is used to print the value stored in a variable

    a <- 10

    print(a)

    --> cat() function is used to combine multiple items into a continuous print output.


    a <- "DataHills"

    cat("Welcome to ", a)
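A short sketch contrasting the two functions (the output comments assume an interactive session):

```r
a <- "DataHills"

print(a)   # quotes the value: [1] "DataHills"

# cat() joins its arguments into one continuous output
cat("Welcome to", a, "\n")     # Welcome to DataHills
cat(10, 20, 30, sep = "-")     # sep= controls the separator: 10-20-30
```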

    Datatype of a Variable:


    1. typeof()

    2. class()

    3. mode()

    1. typeof(var_name/value)

    --> typeof determines the (R internal) type or storage mode of any object

    2. class(var_name/value)

    --> R possesses a simple generic function mechanism which can be used for an object-oriented style of programming.

    --> Method dispatch takes place based on the class of the first argument to the generic function.

    3. mode(var_name/value)

    --> Gets or sets the type or storage mode of an object.
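The three functions can disagree for the same value, which is worth seeing side by side; a quick sketch:

```r
a <- 10
typeof(a)   # "double"  - internal storage type
class(a)    # "numeric" - class used for method dispatch
mode(a)     # "numeric" - storage mode (an older S notion)

b <- 20L
typeof(b)   # "integer"
class(b)    # "integer"

m <- matrix(1:4, nrow = 2)
typeof(m)   # "integer" - storage of the elements
class(m)    # "matrix" (plus "array" from R 4.0 onwards)
```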




    Displaying & Deleting Variables in R:


    1. ls()

    2. rm()

    1. ls():


    --> ls() function is used to display all the variables currently available in the R environment.

    ls()

    --> ls() function can also match variable names against a pattern by using the pattern argument.

    # Display the variables starting with the pattern "a"
    ls(pattern = "a")

    --> ls() function can also display hidden variables, i.e., variables starting with a dot (.), by using all.names=TRUE.

    Ex: Display the variables which are hidden
    ls(all.names = TRUE)

    --> rm() function is used to delete a variable.

    rm(a)

    --> rm() and ls() functions can be used together to delete all the variables.

    Ex: Remove all the variables at a time
    rm(list = ls())


    Structures/Objects in R:


    1. Vectors

    2. Lists

    3. Matrices

    4. Data Frames

    5. Arrays

    6. Factors

  • Vectors  



    --> Single dimensional object with homogeneous data types.

    --> To create a vector use the function c()

    --> Here "c" means combine

    # if we try like this

    a <- 10,20,30,40

    it gives an error.

    # then combine all these values by using c()

    a <- c(10,20,30,40)

    # to check the internal storage of a

    typeof(a)

    # to check the internal storage of each value in a




    lapply(a,typeof) # list of values

    sapply(a,typeof) # vector of values

    --> Vectors are the most basic R structures/objects

    --> The types of atomic vectors are:

    1. logical

    2. integer

    3. double

    4. complex

    5. character

    Vector Creation:


    --> We can create vectors with single element and multiple elements.

    --> They are

    1. Single Element Vector

    2. Multiple Elements Vector

    Single Element Vector:


    --> When we assign a single value to a variable, it becomes a vector of length 1 and belongs to one of the above vector types.


    a <- 10

    b <- 20L

    c <- "DataHills"

    d <- TRUE

    e <- 2+3i

    Multiple Elements Vector:


    --> When we assign multiple values to a variable, it becomes a vector of length n

    and belongs to one of the above vector types.


    a <- c(10,20,30,40,50)

    b <- c(20L,40L,60L,80L)

    c <- c("Srinivas","DataHills","DataScience","MachineLearning")

    d <- c(T,FALSE,TRUE,F,T,F)

    e <- c(2+3i,4+4i,5+6i)

    # Heterogeneous data type values are converted into homogeneous data type values:

    a <- c(10,20,30,40,"DataHills")


    "10" "20" "30" "40" "DataHills"

    # The double and character values are converted into characters.

    Observe with some examples:

    a <- c(10L,20)

    a <- c(T,5)

    a <- c(2+3i,"DataHills")

    a <- c(9L,30,4+5i)

    Here data types have a priority; values are coerced from lower data types to higher data types:

    1. CHARACTER

    2. COMPLEX

    3. DOUBLE

    4. INTEGER

    5. LOGICAL

    a <- c(TRUE,30,20L,2+3i,"DataHills")

    a <- c(TRUE,30,20L,2+3i)

    a <- c(TRUE,30,20L)

    a <- c(TRUE,20L)
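Checking the class of each mix above confirms the priority order; a minimal sketch:

```r
# Each mix collapses to the highest-priority type present
class(c(TRUE, 20L))                         # "integer"   (logical -> integer)
class(c(TRUE, 30, 20L))                     # "numeric"   (everything -> double)
class(c(TRUE, 30, 20L, 2+3i))               # "complex"
class(c(TRUE, 30, 20L, 2+3i, "DataHills"))  # "character"

c(T, 5)      # TRUE becomes 1:          1 5
c(10L, 20)   # integer becomes double: 10 20
```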

    To generate a sequence of numeric values






    # by using seq() function

    Syntax: seq(from=VALUE,to=VALUE,by=VALUE)

    Ex:   seq(from=1,to=10,by=1)








    seq(10,1,1) # Error
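A few seq() variants; note that a descending sequence needs a negative by, which is why seq(10,1,1) fails:

```r
seq(1, 10)            # same as 1:10
seq(1, 10, by = 2)    # 1 3 5 7 9
seq(10, 1, by = -1)   # descending: 10 9 8 ... 1
seq(0, 1, by = 0.25)  # works for doubles too: 0.00 0.25 0.50 0.75 1.00
```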






  • Vector Manipulation & Sub-Setting  

    # length.out --> desired length of the sequence,

    'length.out' must be a non-negative number.

    seq_len is much faster.






    # along.with --> take the length from the length of this argument,

    it generates the integer sequence 1,2,....

    seq_along is much faster.





    a <- seq(along.with=c("Data",T,2,3,4,5,6,7,8,9,4))
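A short sketch of both shortcuts next to their seq() equivalents:

```r
seq_len(5)   # 1 2 3 4 5
seq_len(0)   # integer(0) - safe for empty cases, unlike 1:0

x <- c("Data", "Hills", "R")
seq_along(x)          # 1 2 3 - one index per element
seq(along.with = x)   # same result; seq_along() is the faster form
```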



    Vector Manipulation:


    a <- c(4,7,9,12,8,3)

    b <- c(2,3,5,7,8,5)



    add <- a+b

    sub <- a-b

    mul <- a*b

    div <- a/b

    # if we apply arithmetic operators to two vectors of unequal length, the elements of the shorter vector are recycled to complete the operation.

    a <- c(4,7,9,12,8,3)

    b <- c(2,3)

    add <- a+b

    sub <- a-b

    mul <- a*b

    div <- a/b
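Working the shorter-vector example through by hand shows what recycling does:

```r
a <- c(4, 7, 9, 12, 8, 3)
b <- c(2, 3)   # length 2 divides length 6, so b recycles silently

a + b   # effectively a + c(2,3,2,3,2,3):  6 10 11 15 10  6
a * b   # 8 21 18 36 16  9

# When the longer length is NOT a multiple of the shorter one,
# R still recycles but emits a warning:
c(1, 2, 3, 4) + c(10, 20, 30)   # 11 22 33 14, with a warning
```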

    # Elements in a vector can be sorted using the sort() function.

    a <- c(9,3,5,8,1,6,5)

    sort <- sort(a)

    rev_sort <- sort(a,decreasing=T)

    a <- c("Srinivas","DataHills","Analysis","MachineLearning")

    sort <- sort(a)

    rev_sort <- sort(a,decreasing=TRUE)

    Sub-setting the Data in Vectors:


    --> Extracting the required fields or rows from an R object.

    vector[position/logical index/negative index/name]


    a <- c("DataScience","DataAnalysis","MachineLearning","R","Python","Weka")

    # Accessing vector elements using position

    # Here [ ] brackets are used for indexing.

    # Indexing starts with position 1.


    a[2,4] # Error



    course <- a[c(1,4,5)]

    # Accessing vector elements using negative indexing


    a[-3,-5] # Error




    course <- a[-c(4,5,6)]

    # Accessing vector elements using logical indexing







    # Accessing vector elements using name

    a <- c(a="DataScience",b="DataAnalysis",c="MachineLearning",d="R",e="Python",f="Weka")


    a[b] # Error


    a["d","e"] # Error


    a[c("-d","-e")] # Error

    a[c(-"d",-"e")] # Error

    a[-c("d","e")] # Error
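For contrast with the error cases above, the working forms of name and logical subsetting; a minimal sketch:

```r
a <- c(a="DataScience", b="DataAnalysis", c="MachineLearning",
       d="R", e="Python", f="Weka")

a["d"]          # a single element by name
a[c("d", "e")]  # several names must be wrapped in c()

a[c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE)]  # logical index keeps 1st, 3rd, 5th
a[a != "R"]                                  # condition on the values

# There is no negative indexing by name; drop by position instead
a[-c(4, 5)]
```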

  • Constants  



    R has a small number of built-in constants.

    The following constants are available:

    1. LETTERS: the 26 upper-case letters of the Roman alphabet;

    2. letters: the 26 lower-case letters of the Roman alphabet;

    3. month.abb: the three-letter abbreviations for the English month names;

    4. month.name: the English names for the months of the year;

    5. pi: the ratio of the circumference of a circle to its diameter.


    > LETTERS

    [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"

    [20] "T" "U" "V" "W" "X" "Y" "Z"

    > letters

    [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"

    [20] "t" "u" "v" "w" "x" "y" "z"

    > month.name

    [1] "January"   "February"  "March"     "April"     "May"       "June"

    [7] "July"      "August"    "September" "October"   "November"  "December"

    > month.abb

    [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"

    > pi

    [1] 3.141593

    But it is not good to rely on these, as they are implemented as variables whose values can be changed.

    > pi

    [1] 3.141593


    > pi <- 10

    > pi

    [1] 10



    LETTERS[2,3,4,5] # Error








    a <- c(10,20,30,40,50,60)

    names(a) <- c("A","B","C","D","E","F")

    b <- c(70,80,90,100,110,120)


    names(b) <- LETTERS[21:26]

    sales_1 <- c(100,200,300)

    names(sales_1) <- c("Jan","Feb","Mar")

    names(sales_1) <- month.abb # Error

    names(sales_1) <- month.abb[1:3]

    names(sales_1) <- month.abb[10:12]

    names(sales_1) <- month.abb[c(1,5,10)]

    names(sales_1) <- month.abb[seq(1,12,4)]

    sales_2 <- c(100,200,300,400,150,250,350,450,120,220,320,420)

    names(sales_2) <- month.abb

    names(sales_2) <- month.name

  • RStudio Installation & Lists Part 1  



    --> RStudio is a free and open-source integrated development environment (IDE) for R.

    --> RStudio requires R 3.0.1+. If you don't already have R, download it.

    --> RStudio makes R easier to use.

    --> It includes a code editor, debugging & visualization tools.

    --> RStudio is a separate piece of software that works with R to make R much more user friendly and also adds some helpful features.

    --> RStudio was founded by JJ Allaire.

    --> RStudio is written in the C++ programming language.

    --> Initial release: 28 February 2011 - 7 years ago

    --> Stable release: 1.1.456 / 19 July 2018 - 52 days ago

    Downloading & Installing RStudio:

    --> Go to the official website of RStudio, i.e.,

    --> Click on RStudio Download

    --> Click on RStudio Desktop Open Source License (FREE) Download

    --> Click on RStudio 1.1.456 - Windows Vista/7/8/10 (85.8 MB Size)

    --> Automatically file will be downloaded in our system

    --> Installation is easy, it takes less than 2 min to install.



    --> Single dimensional object with heterogeneous data types.

    --> To create a list use function list().

    # Create a list containing character, complex, double, integer and logical.

    a <- list("DataHills",2+3i,10,20L,TRUE)

    # to check the internal storage of a

    typeof(a)

    # to check the internal storage of each value in a




    lapply(a,typeof) # list of values

    sapply(a,typeof) # vector of values

  • Lists Part 2  

    --> Lists are the R objects which contain elements of different types like

    numbers,

    strings,

    vectors,

    matrices,

    a function and

    another list inside it.

    # Create a list containing vectors

    a <- list(c(1,2,3),c("A","B","C"),c("R","Python","Weka"),c(10000,8000,6000))




    # Create a list containing characters, vector, double

    b <- list("DataHills","Srinivas",c(10,20,30),15.5)




    # Create a list containing a vector, matrix, function and list.

    c <- list(c(10,20,30),matrix(c(1,2,3,4),nrow=2),search(),list("DataHills",9292005440))




    Naming List Elements:


    --> The list elements can be given names and they can be accessed using these names.

    b <- list(Name1="DataHills",Name2="Srinivas",vector_values=c(10,20,30),single_value=15.5)


    c <- list(c(10,20,30),matrix(c(1,2,3,4),nrow=2),search(),list("DataHills",9292005440))

    names(c) <- c("values","mat","fun","inner_list")

  • List Manipulation, Sub-Setting & Merging  
  • List to Vector & Matrix Part 1  
  • Matrix Part 2  

    matrix(c(1,2,3,4,5,6,7,8,9,10), nrow=5)

    matrix(1:10, nrow=5)

    # Elements are arranged by row

    matrix(1:10, nrow=5, byrow=TRUE)

    # Elements are arranged by column

    matrix(1:10, nrow=5, byrow=FALSE)

    matrix(1:10, ncol=5, byrow=T)

    # Create a matrix with row names and column names

    matrix(1:10, ncol=5, byrow=TRUE, dimnames=list(c("A","B"),c("C","D","E","F","G")))

    matrix(1:10, ncol=5, byrow=TRUE, dimnames=list(LETTERS[1:2],LETTERS[3:7]))

    # To check or define or update or delete the names of rows and columns,

    we have to use the functions rownames() and colnames()



    a <- matrix(1:10, ncol=5, byrow=TRUE, dimnames=list(LETTERS[1:2],LETTERS[3:7]))



    rownames(a) <- c("row1","row2")

    colnames(a) <- c("col1","col2","col3","col4","col5")




    a <- matrix(1:10, ncol=5, byrow=TRUE)

    rownames(a) <- LETTERS[20:21]

    colnames(a) <- LETTERS[22:26]


    a <- matrix(11:20, ncol=5, byrow=TRUE)

    x <- c("r1","r2")

    y <- c("c1","c2","c3","c4","c5")

    rownames(a) <- x

    colnames(a) <- y


    a <- matrix(21:30, ncol=5, byrow=TRUE, dimnames=list(x,y))


    # Remove the row names and column names

    rownames(a) <- NULL


    colnames(a) <- NULL

    # Create a matrix without argument names








    The transpose (reversing rows and columns) is perhaps the simplest method of reshaping a dataset. Use the t() function to transpose a matrix.

    a <- matrix(1:9,nrow=3)



    a <- matrix(1:10, ncol=5)



    a <- t(a)


  • Matrix Accessing  

    a <- matrix(1:10,2)

    a <- matrix(1:10,2,T) #Here it prints only first 2 elements

    a <- matrix(1:10,nrow=2,T) #Same result

    a <- matrix(1:10,2,5,T)

    # Create some matrices with heterogeneous datatype elements and observe

    a <- matrix(c(1,2,3,"A","B","C"),2)



    a <- matrix(c("Data",2+3i,TRUE,20,30L,FALSE),3)



    a <- matrix(c(TRUE,20,30L,FALSE),2)



    # Create a matrix which is not multiple of the no. of rows and columns

    a <- matrix(1:3,3,3)

    a <- matrix(1:3,3,3,T)

    a <- matrix(1:5,2,3) # Warning Message

    a <- matrix(1:5,2,5)


    a <- matrix(1:10,2,5)

    a <- matrix(1:10,3,4) # Warning Message

    a <- matrix(1:10,5,4)

    Dimensions of a Matrix:


    --> Retrieve or set the dimensions of an object.

    --> We have to use the dim() function

    dim(x)

    dim(x) <- value

    # to check the dimensions of a matrix

    dim(a)
    x <- 1:12

    dim(x) <- c(3,4)


    Accessing Matrix Elements:


    --> Elements of a matrix can be accessed by using the row and column index (position) of the element.

    a <- matrix(1:12,3)

    # Access the element at 1st column and 1st row
    a[1,1]

    # Access the element at 2nd column and 3rd row
    a[3,2]

    # Access the element at 2nd column, 1st row, 2nd row and 3rd row
    a[c(1,2,3),2]

    # Access the element at 2nd column, 1st row and 3rd row
    a[c(1,3),2]

    # Access the element at 1st row, 2nd & 3rd column
    a[1,c(2,3)]

    # Access the element at 2nd & 3rd row, 2nd & 3rd column
    a[c(2,3),c(2,3)]

    # Access the element at 1st & 3rd row, 1st & 3rd column
    a[c(1,3),c(1,3)]

    # Access only the 1st row
    a[1,]

    # Access only the 3rd column
    a[,3]

    # Access the element at 2nd & 3rd column, all rows except 2nd row
    a[-2,c(2,3)]

    --> Elements of a matrix can be accessed by using the row and column names of the element.

    rownames(a) <- LETTERS[1:3]

    colnames(a) <- LETTERS[23:26]

    a["A","W"]
    a["B",]
    a[,"Z"]

    --> Elements of a matrix can be accessed by using row and column logical indexes.

    a[c(T,F,T),c(T,F,T,F)]

    a[c(F,F,F),c(T,T,T,T)] # only colnames will access


  • Matrix Manipulation, rep function & Data Frame  
  • Data Frame Accessing  

    Convert the columns of a data.frame to characters:


    --> By default, data frames convert characters to factors.

    --> The default behavior can be changed with the stringsAsFactors parameter

    --> Here stringsAsFactors can be set to FALSE.

    --> If the data has already been created, factor columns can be converted to character columns as shown below.

    # Convert all columns to character

    students[] <- lapply(students, as.character)


    # Create a students data frame without stringsAsFactors

    students <- data.frame(

















    # creating a emp data frame

    emp <- data.frame(













    # creating a course data frame

    cid <- c(10,20,30)
    cname <- c("DataScience","DataAnalytics","MachineLearning")
    cfee <- c(10000,8000,10000)

    course <- data.frame(cid,cname,cfee)

    course <- data.frame(cid,cname,cfee,stringsAsFactors=F)

    Extract Data from Data Frame:


    --> Syntax for accessing rows and columns: [, [[, and $

    --> Like a matrix with single brackets data[rows, columns]

    Using row and column numbers

    Using column and row names

    --> Like a list:

    With single brackets data[columns] to get a data frame

    With double brackets data[[one_column]] to get a vector

    With $ for a single column data$column_name

    --> Here we can extract the specific column from a data frame using column name.

    # Extract Specific columns on emp data frame.

    emp1 <- data.frame(emp$ename,emp$salary)

    # Extract first three rows.
    emp[1:3,]

    # Extract 2nd and 4th row with 1st and 4th column.
    emp[c(2,4),c(1,4)]















    Expand Data Frame:


    --> Data frame can be expanded by adding columns and rows.

    Add Column:


    --> Add the column using a new column name.

    # Add the "contact" column.

    emp$contact <- c(9292005440,9898989898,9696969696,9595959595,9191919191)
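The bracket forms described above can be compared on a small stand-in data frame (the course's own emp values are not shown here, so these rows are made up for illustration):

```r
emp <- data.frame(eid    = 101:103,
                  ename  = c("Sreenu", "Vasu", "Nivas"),
                  salary = c(40000, 35000, 30000),
                  stringsAsFactors = FALSE)

emp[2, ]         # matrix style: 2nd row, all columns (a data frame)
emp[, "ename"]   # one column by name (a vector)
emp["ename"]     # list style, single brackets: a one-column data frame
emp[["ename"]]   # double brackets: the column as a vector
emp$salary       # $ form: 40000 35000 30000
```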

  • Column Bind & Row Bind  

    column bind and row bind:


    --> Combine R Objects by Rows or Columns

    --> Take a sequence of vector, matrix or data-frame arguments and combine them by columns or rows

    --> Use c() to combine vectors as vectors

    --> Use data.frame() to combine vectors and matrices as a data frame.

    --> The functions to bind rows and columns are cbind() and rbind()

    c <- cbind(1:5, 1:5)


    c <- cbind(1, 1:5) # Here 1 is recycled


    c <- cbind(c, 6:10) # insert a column at last


    c <- cbind(c,11:15)[,c(1,3,4,2)] # insert a column at required position


    r <- rbind(1:5,1:5)


    r <- rbind(1,1:5) # Here 1 is recycled


    r <- rbind(a = 1, b = 1:5)


    # deparse.level

    --> deparse.level = 0 by default it will not construct labels

    --> deparse.level = 1 or 2 constructs labels from the argument names

    a <- 40

    rbind(1:5, b = 20, Data = 30, a, deparse.level = 0) # middle 2 rownames

    rbind(1:5, b = 20, Data = 30, a, deparse.level = 1) # 3 rownames (default)

    rbind(1:5, b = 20, Data = 30, a)

    rbind(1:5, b = 20, Data = 30, a, deparse.level = 2) # 4 rownames

    Add Row:


    --> Adding a Single Observation (row)

    emp <- rbind(emp,c(106,"Data",45000,"Java","Hyd",9589695895))

    --> Adding Many Observations (rows)

    emp <- rbind(emp,c(107,"Data",45000,"Java","Hyd",9589695895),
                 c(108,"Hills",50000,"R","Hyd",9595959596))

    --> To add more rows to an existing data frame, we need to bring the new rows into the same structure as the existing data frame and use the rbind() function.

    --> Here we create a data frame with new rows and merge it with the existing data frame to create the final data frame.

    # Create the second data frame

    emp_2 <- data.frame(








    # we can merge two data frames using rbind() function

    # Bind the two data frames.

    emp_final <- rbind(emp,emp_2)


  • Merging Data Frames Part 1  

    --> We can join multiple vectors by using the cbind() function.

    cid = c(10, 20, 30)

    cname = c("DataScience", "DataAnalytics", "MachineLearning")

    cfee = c(10000, 8000, 10000)

    # Combine above three vectors.

    course <- cbind(cid, cname, cfee)





    course <- data.frame(cid, cname, cfee)





    course <- data.frame(cid, cname, cfee, stringsAsFactors = FALSE)



    # Create another data frame with similar columns

    course_new <- data.frame(

    cid = c(40, 50, 60),

    cname = c("DataScience", "DataAnalytics", "MachineLearning"),

    cfee = c(10000, 8000, 10000),

    stringsAsFactors = F)





    # Combine rows from both the data frames.

    course_final <- rbind(course, course_new)





    Merging Data Frames:


    --> Merge two data frames by common columns or row names

    --> Merge is similar to the join operations in databases

    --> Joins are used to retrieve data from multiple tables.

    --> We can merge two data frames by using the merge() function.

    --> The column names should be the same when merging the data frames.


    merge(x, y, by, by.x, by.y, all, all.x, all.y, sort)



    x, y: the data frames or objects to merge
    by, by.x, by.y: specify the common columns.
    all, all.x, all.y: determine the type of merge.
    sort: logical; by default it is TRUE.

    student_details <- data.frame(

      s_name = c("Sreenu","Vasu","Nivas","Reddy","Sai"),

      address = c("Hyd","Bang","Chennai","Pune","Mumbai"),

      contact = c(9292005440,9898989898,9696969696,9595959595,9292929292),

      stringsAsFactors = FALSE)

    course_details <- data.frame(

      s_name = c("Sreenu","Vasu","Nivas","Reddy","Sai"),

      course = c("DataScience","MachineLearning","DataAnalytics","R","Python"),

      fee = c(20000,15000,10000,8000,10000),

      stringsAsFactors = FALSE)

    by, by.x, by.y:


    --> The names of the columns that are common to both x and y.

    --> The default is to use the columns with common names between the two data frames.

    merge(student_details, course_details, by="s_name")

    merge(course_details, student_details, by="s_name")

    merge(student_details, course_details) #Here by is optional

    --> When both data frames contain more than one common column name, the merge is based on all the common column names.

    course_details$address = c("Hyd","Bang","Chennai","Pune","Mumbai")

    merge(student_details, course_details)

    course_details$address[3:4] <- c("Delhi","Hyd")

    merge(student_details, course_details)

    course_details$address = NULL

    merge(student_details, course_details)

    colnames(student_details)[1] <- "student_name"


    names(student_details)[1] <- "student_name"



    merge(student_details, course_details, by.x="student_name", by.y="s_name")

    merge(course_details, student_details, by.x="s_name", by.y="student_name")

    merge(student_details, course_details) # Without by, it gives the cross merge

    student_details[c(3,5),1] <- c("Rama","Sita")

    merge(student_details, course_details, by.x="student_name", by.y="s_name")

    merge(course_details, student_details, by.x="s_name", by.y="student_name")

    # Here by default sort is TRUE, set as FALSE

    merge(course_details, student_details, by.x="s_name", by.y="student_name",
          sort=FALSE)

  • Merging Data Frames Part 2  
  • Melting & Casting  

    Melting and Casting:


    --> Melting and casting are used to change the shape of the data in multiple steps to get a desired shape.

    --> The functions are melt() and cast().

    --> First install the package "reshape".

    --> The "reshape" package is used for restructuring and aggregating datasets.



    mydata <- data.frame(ID=c(1,1,2,2),Time=c(1,2,1,2),
                         X1=c(5,3,6,2),X2=c(6,5,1,4))

    Melt the Data:


    --> Melt an object into a form suitable for easy casting.

    --> When we melt a dataset, we restructure it into a format where each measured variable is in its own row, along with the ID variables needed to uniquely identify it.

    md <- melt(mydata, id=c("ID","Time"))

    --> Note: We must specify the variables needed to uniquely identify each measurement (ID and Time); the variable indicating the measurement variable names (X1 or X2) is created automatically.

    --> Now that the data is in a melted form, you can recast it into any shape using the cast() function.

    Cast the Melted Data:


    --> cast() function starts with melted data and reshapes it using a formula that we provide and an (optional) function used to aggregate the data.

    --> The format is

    newdata <- cast(md, formula, FUN)

    --> Where md is the melted data,

    formula describes the desired end result, and

    FUN is the (optional) aggregating function.

    --> The formula takes the form

    rowvar1 + rowvar2 + … ~ colvar1 + colvar2 + …

    --> In this formula,

    rowvar1 + rowvar2 + … define the set of crossed variables that define the rows, and colvar1 + colvar2 + … define the set of crossed variables that define the columns.

    With Aggregation:


    cast(md, ID~variable, mean)

    cast(md, Time~variable, mean)

    cast(md, ID~Time, mean)

    Without Aggregation:


    cast(md, ID+Time~variable)

    cast(md, ID+variable~Time)

    cast(md, ID~variable+Time)

    We consider the dataset called ships present in the library called "MASS".



    ships - ships damage data

    # Data frame giving the number of damage incidents and aggregate months of service by ship type, year of construction, and period of operation.


    type: ship type, "A" to "E".

    year: year of construction: 1960–64, 65–69, 70–74, 75–79 (coded as "60", "65", "70", "75").

    period: period of operation: 1960–74, 75–79.

    service: aggregate months of service.

    incidents: number of damage incidents.

    Now we melt the data to organize it, converting all columns other than type and year into multiple rows.

    molten.ships <- melt(ships, id = c("type","year"))


    We can cast the molten data into a new form where the aggregate of each type of ship for each year is created.

    cast(molten.ships, type~variable,sum)

    cast(molten.ships, year~variable,sum)

    cast(molten.ships, type+year~variable,sum)

  • Arrays  



    --> Multidimensional object with homogeneous data types.

    --> An array can have one, two or more dimensions.

    --> It is similar to a vector

    --> A one-dimensional array looks like a vector

    --> A two-dimensional array looks like a matrix

    --> An array is created using the array() function


    array(data, dim, dimnames)

    data --> a vector of data to fill the array.

    dim --> the dim attribute, which creates the required number of dimensions.

    dimnames --> either NULL or the names for the dimensions.

    # Create an array with two elements which are 3x3 matrices each.

    a <- array(c("Data", "Hills"), dim = c(3,3,2))

    # it creates 2 rectangular matrices each with 3 rows and 3 columns




    typeof(a) # character

    class(a) # array

    # Create two vectors of different lengths.

    a <- 1:3

    b <- 4:9

    # Take these vectors as input to the array.

    c <- array(c(a,b),dim = c(3,3,2))


    Column Names and Row Names:


    --> By using the dimnames parameter we can give names to the rows, columns and matrices in the array.

    # Create two vectors of different lengths.

    a <- 1:3

    b <- 4:9

    c <- array(c(a,b),dim = c(3,3,2),dimnames = list(c("R1","R2","R3"),
    c("C1","C2","C3"), c("M1","M2")))

    x <- c("ROW1","ROW2","ROW3")

    y <- c("COL1","COL2","COL3")

    z <- c("Matrix1","Matrix2")

    # Take these vectors as input to the array.

    c <- array(c(a,b),dim = c(3,3,2),dimnames = list(x,y,z))




    Accessing Array Elements:


    # Print the first row of the second matrix of the array.
    c[1,,2]

    # Print the second column of the first matrix of the array.
    c[,2,1]

    # Print the element in the 1st row and 2nd column of the 1st matrix.
    c[1,2,1]

    # Print the element in the 3rd row and 2nd column of the 2nd matrix.
    c[3,2,2]

    # Print the 1st Matrix.
    c[,,1]

    # Print the 2nd Matrix.
    c[,,2]

    Manipulating Array Elements:


    --> An array is made up of matrices in multiple dimensions, so operations on the elements of an array are carried out by accessing elements of its matrices.

    # create matrices from these arrays.

    m1 <- c[,,1]

    m2 <- c[,,2]

    # Add the matrices.

    result <- m1+m2


    d <- array(11:22,dim = c(3,3,2))

    # create matrices from these arrays.

    m1 <- c[,,2]

    m2 <- d[,,2]

    # Add the matrices.

    result <- m1+m2


    Calculations across Array Elements:


    --> By using the apply() function we can do calculations across the elements in an array.

    --> Syntax:

    apply(X, MARGIN, FUN)



    --> X is an array.

    --> MARGIN indicates the dimension(s) over which FUN is applied.

       E.g., for a matrix, 1 indicates rows, 2 indicates columns, c(1,2) indicates rows and columns

    --> FUN is the function to be applied across the elements of the array.

    # Use apply to calculate the sum of the rows across all the matrices.

    d <- apply(c, MARGIN=1, FUN=sum)

    d <- apply(c, 1, sum) #1 indicates rows


    e <- apply(c, 2, sum) #2 indicates columns


    f <- apply(c, c(1,2), sum) #c(1, 2) indicates rows and columns


    g <- apply(c, c(2,1), sum) #c(2,1) indicates columns and rows
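Assuming the 3×3×2 array c from earlier, filled here with 1:18 so the sums are easy to check by hand; a quick sketch of how MARGIN picks the dimension:

```r
c <- array(1:18, dim = c(3, 3, 2))

apply(c, 1, sum)        # MARGIN=1: one sum per row, across both matrices: 51 57 63
apply(c, 2, sum)        # MARGIN=2: one sum per column
apply(c, 3, sum)        # MARGIN=3: one total per matrix: 45 126
apply(c, c(1, 2), sum)  # 3x3 result: each cell summed across the 2 matrices
```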


  • Factors  



    --> Factors represent categorical values

    --> Factors are the data objects which are used to categorize the data and store it as levels.

    --> By default, if the levels are not supplied by the user, then R will generate the set of unique values in the vector, sort these values alphanumerically, and use them as the levels.

    --> Factors can store both strings and integers.

    --> Factors are useful in the columns which have a limited number of unique values, like "Yes", "No" and "Good", "Bad" etc.

    --> Factors are useful in data analysis for statistical modeling.

    --> Factors are created using the factor() function

    Factors in Data Frame


    --> After creating a data frame with character data type elements, R treats the character column as categorical data and creates factors on it.

    # Create the vectors for data frame.

    h <- c(5.6,6.0,5.10,6.2,5.5,5.2,5.8)

    w <- c(50,55,58,65,70,65,60)

    g <- c("M","F","F","F","M","F","M")

    # Create the data frame.

    hwg <- data.frame(h,w,g)


    # Test if the gender column is a factor.
    is.factor(hwg$g)

    # Print the gender column to see the levels.
    print(hwg$g)


    # Create a vector as input.

    gender <- c("M","F","M","M","F","F","F","M","F","M")



    typeof(gender) #character

    class(gender) #character

    # Apply the factor function.

    gender_f <- factor(gender)


    typeof(gender_f) #integer

    class(gender_f) #factor





    Changing Order Levels & Creating Labels:


    #If we want to change the ordering of the levels, one option is to specify the levels manually:

    gender_f <- factor(gender,levels=c("M","F"))




    gender_f <- factor(gender,levels=c("M","F"),ordered=TRUE)



    gender_f <- factor(gender,levels=c("M","F"),labels=c("Male","Female"))


    gender_f <- factor(gender,levels=c("M","F"),labels=c("Male","Female"),ordered=T)


    Generating Factor Levels


    --> We can generate factor levels by using the gl() function.

    --> It takes two integers as input: how many levels, and how many times each level is repeated.


    gl(n, k, labels)



    n is an integer giving the number of levels.

    k is an integer giving the number of replications (no. of times each level is repeated).

    labels is a vector of labels for the resulting factor levels.

    a <- gl(2, 4, labels = c("Male", "Female"))


    speed <- c("high","low","medium","low","high","low")

    typeof(speed) #character


    speed_f <- factor(speed)



    speed_f <- factor(speed,levels=c("low","medium","high"))

    speed_f <- factor(speed,levels=c("low","medium","high"),ordered=T)


    data <- c("E","W","E","N","N","E","W","W","W","E","N")



    data_f <- factor(data)



    # Apply the factor function with required order of the level.

    data_f2 <- factor(data_f,levels = c("E","W","N"))


    grades <- c(1,2,3,4,4,3,1,2,1,2,3)

    grades_f <- factor(grades)




    grades_f <- factor(grades,levels=c(3,1,4,2),ordered=TRUE)


    is.factor(grades) #FALSE

    is.factor(grades_f) #TRUE

    Weekdays <- factor(c("Sunday","Monday", "Tuesday", "Wednesday","Thursday","Friday", "Saturday"))


    Weekdays <- factor(Weekdays, levels=c("Sunday","Monday", "Tuesday", "Wednesday", "Thursday", "Friday","Saturday"), ordered=TRUE)


    Weekend <- subset(Weekdays, Weekdays == "Saturday" | Weekdays == "Sunday")


    # When a level of the factor is no longer used,
    # we can drop it using the droplevels() function:

    Weekend <- droplevels(Weekend)


  • Functions & Control Flow Statements  


    --> A self-contained block of one or more statements designed for a specific task is called a "Function".

    --> A programmer builds a function to avoid repeating the same task, or to reduce complexity.

    --> R function is created by using the keyword function

    --> Functions are classified into 2 types. i.e

    1. Built-in functions / Predefined functions

    2. User defined functions

    Built-in Functions:


    --> There are a lot of built-in functions in R.

    Examples of built-in functions are search(), seq(), rep(), c(), sum(), etc.






    --> It is possible to see the source code of a function by running the name of the function itself in the console.






    User-defined Function:


    --> We need to write our own function when we have to accomplish a particular task and no ready-made function exists.

    --> User-defined functions are specific to what a user wants, and once created they can be used like the built-in functions.

    --> Give a user-defined function a name different from any built-in function. It avoids confusion.

    --> The syntax to create a new function is

    function_name <- function(arg1, arg2, ... ){
        # statements (function body)
        return(value)
    }




    Function Components:

    Function Name


    Function Body

    Return Value

    # Create a function without an argument.

    addnum <- function() {

    a <- 10 # with no arguments, the values are fixed inside the function

    b <- 20

    result <- a+b

    print(result)

    }



    # Call the function without an argument.

    addnum()


    # Create a function with single argument.

    pownum <- function(a) {

    result <- a^2

    print(result)

    }



    # Call the function pownum supplying 10 as an argument.

    pownum(10)



    pownum <- function(a) {

    result <- a^2

    print(result)

    }



    # Call the function pownum supplying a vector 'x' as an argument.

    x <- c(2, 4, 6) # example vector

    pownum(x)


    # Create a function with multiple arguments

    addnum <- function(a,b) {

    result <- a+b

    print(result)

    }



    # Call the function by position of arguments.

    addnum(10, 20)


    # Call the function by name of arguments.

    addnum(b = 20, a = 10)


    # Create a function with default arguments

    subnum <- function(a=10,b=20) {

    result <- a-b

    print(result)

    }



    # Call the function without giving any argument.

    subnum() # uses the defaults: -10


    # Call the function with new values for the arguments.

    subnum(50, 20) # 30


    addnum <- function(a,b,c) {

    result <- a+b # note: c is never used in the body

    print(result)

    }





    # Evaluate the function without supplying one of the arguments.

    addnum(5, 10)


    # This is called Lazy Evaluation: arguments are evaluated only when they are needed by the function body.
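A runnable sketch of lazy evaluation (the function names here are illustrative, not from the lecture):

```r
# Arguments are evaluated only when the function body uses them
lazy_demo <- function(a, b, c) {
  a + b  # 'c' is never touched, so it is never evaluated
}

print(lazy_demo(5, 10))  # 15 - no error, although c was not supplied

strict_demo <- function(a, b, c) {
  a + b + c  # here c is needed
}
# strict_demo(5, 10)  # error: argument "c" is missing, with no default
```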

    Control Flow Statements:


    --> These statements control the execution flow of the program.

    --> Types of control flow statements are

    1. Decision making statements / selection statements

    2. Looping statements / Iteration statements

    3. Loop control statements

    Decision Making:


    --> By using decision making statements we can create condition-oriented blocks, i.e. depending on the condition, the interpreter decides whether the block will be executed or not.

    --> In decision making statements, if the condition is TRUE the block is executed; if the condition is FALSE the block is not executed.

    --> R provides the following types of decision making statements.

    if statement

    if..else statement

    if..else..if..else statement

    switch statement
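The switch statement from the list above, as a minimal sketch (the day names and labels are illustrative):

```r
day_type <- function(day) {
  switch(day,
         "Sat" = "Weekend",
         "Sun" = "Weekend",
         "Weekday")  # an unnamed last value acts as the default
}

print(day_type("Sun"))  # "Weekend"
print(day_type("Mon"))  # "Weekday"
```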






     # single statement: braces are optional
     # (the condition below is illustrative; the original was lost)
     x <- 15

     if(x > 10)
      print("Welcome to")

     # multiple statements: braces are required
     if(x > 10) {
      print("Welcome to")
      print("Data Science")
     }
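The list above also names if..else and if..else..if; a minimal sketch (the variable and cutoffs are illustrative):

```r
marks <- 65

if (marks >= 75) {
  print("Distinction")
} else if (marks >= 50) {
  print("Pass")
} else {
  print("Fail")
}
# prints "Pass"
```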




    --> A set of instructions given to the interpreter to execute a set of statements until a condition becomes FALSE is called a Loop.

    --> The basic purpose of a loop is code repetition.

    --> R provides the following types of loop to handle looping requirements.

    repeat Loop

    while loop

    for loop

    --> Loop control statements are

    break statement

    next statement
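The repeat and while loops from the list above, together with break and next, as a minimal sketch:

```r
# while: runs as long as the condition is TRUE
i <- 1
while (i <= 3) {
  print(i)  # prints 1, 2, 3
  i <- i + 1
}

# repeat: has no condition of its own, so break is mandatory
j <- 1
repeat {
  print(j)
  j <- j + 1
  if (j > 3) break
}

# next: skip the rest of the current iteration
for (k in 1:5) {
  if (k %% 2 == 0) next
  print(k)  # prints 1, 3, 5
}
```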


    for(i in 1:5) {

    b <- i^2

    print(b)

    }



    # Create a function to print squares of numbers in sequence.

    powseq <- function(a) {

    for(i in 1:a) {

    b <- i^2

    print(b)

    }

    }




    # Call the function powseq supplying 5 as an argument.

    powseq(5)


  • Strings & String Manipulation with Base Package  



    --> A string can be created using single quotes or double quotes.

    --> Internally R stores every string within double quotes, even when we create them with single quotes.

    --> The class of an object that holds character strings is “character”.

    --> R has several built-in functions that can be used to print or display information, but print() and cat() functions are the most basic.

    print("Hello World") #"Hello World"

    cat("Hello World\n") #Hello World

    # Without the new-line character (\n) the output would be

    cat("Hello World") #Hello World>

    --> cat() function takes one or more character vectors as arguments.

    --> If the character vector has a length greater than 1, arguments are separated by a space (by default)

    cat(c("hello", "world", "\n")) #hello world

    Valid and Invalid strings:


    chr <- 'this is a string'

    chr <- "this is a string"

    chr <- "this 'is' valid"

    chr <- 'this "is" valid'

    chr <- "this is "not" valid"   # syntax error: unescaped double quotes

    chr <- 'this is 'not' valid'   # syntax error: unescaped single quotes

    chr <- "this is \"escaped\" and valid" # escaping the quotes makes it legal

    --> We can create an empty string with empty_str = "" or an empty character vector with empty_chr = character(0).

    --> Both have class “character” but the empty string has length equal to 1 while the empty character vector has length equal to zero.

    empty_str <- ""

    empty_chr <- character(0)

    class(empty_str) #character

    class(empty_chr) #character

    length(empty_str) #1

    length(empty_chr) #0

    --> The function character() will create a character vector with as many empty strings as we want.

    --> We can add new components to the character vector just by assigning it to an index outside the current valid range.

    --> The index does not need to be consecutive, in which case R will auto-complete it with NA elements.

    chr_vector <- character(2) # create char vector

    chr_vector # "" ""

    chr_vector[3] <- "Three" # add new element

    chr_vector # ""   ""   "Three"

    chr_vector[5] <- "Five" # do not need to be consecutive

    chr_vector # ""   ""   "Three" NA   "Five"

    String Manipulation with "base" package:


    --> Some of the string manipulation functions which belongs to base package are









    --> paste() function is used to concatenate (combine) strings.


    paste(..., sep = " ", collapse = NULL)



    ... represents any number of arguments to be combined.

    sep represents any separator between the arguments. It is optional.

    collapse is used to eliminate the space in between two strings. But not the space within two words of one string.

    a <- "Heartly"

    b <- 'Welcome to'

    c <- "DataHills! "


    paste(a,b,c, sep = "$")

    paste(a,b,c, sep = "", collapse = "")
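The collapse argument above is easiest to see on a character vector (the strings are illustrative):

```r
v <- c("Data", "Hills", "Online")

paste(v, collapse = "")   # "DataHillsOnline" - one string, no separator
paste(v, collapse = "-")  # "Data-Hills-Online"
paste(v, sep = "-")       # sep alone has no effect with a single vector:
                          # the elements are returned unchanged
```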



    --> format() function is used to format numbers and strings to a specific style.


    format(x, digits, nsmall, scientific, width, justify = c("left", "right", "centre", "none"))



    x is the vector input.

    digits is the total number of digits displayed.

    nsmall is the minimum number of digits to the right of the decimal point.

    scientific is set to TRUE to display scientific notation.

    width indicates the minimum width to be displayed by padding blanks in the beginning.

    justify is the display of the string to left, right or center.

    # Total number of digits displayed. Last digit rounded off.

    format(10.123456789, digits = 9)

    # Display numbers in scientific notation.

    format(c(7, 10.12345), scientific = TRUE)

    # The minimum number of digits to the right of the decimal point.

    format(10.12, nsmall = 5)

    # Format treats everything as a string.

    format(6) # "6"


    # Numbers are padded with blank in the beginning for width.

    format(10.5, width = 7)

    # Left justify strings.

    format("DataHills", width = 20, justify = "l")

    # Justfy string with center.

    format("DataHills", width = 20, justify = "c")



    --> toupper() function is used to convert the characters of a string into upper case.

    toupper(x)

    Here x is the input vector

    toupper("Welcome to DataHills") #"WELCOME TO DATAHILLS"



    --> tolower() function is used to convert the characters of a string into lower case.

    tolower(x)

    Here x is the input vector

    tolower("Welcome to DataHills") #"welcome to datahills"



    --> substring() function is used to extract part of a string.


    substring(x, first, last)



    x is the character vector input

    first is the position of the first character to be extracted

    last is the position of the last character to be extracted

    substring("DataHills", 5, 9) #"Hills"



    --> nchar() function is used to count the number of characters including spaces in a string.

    nchar(x)

    Here x is the input vector.

    nchar("Welcome to DataHills") #20

    nchar(9292005440) #10


  • String Manipulation with Stringi Package Part 1  

    String manipulation with "stringi" package:


    --> String functions which belong to the base package are good only for simple text processing.

    --> The stringi package contains advanced string processing functions.

    --> For dealing with more complex problems, such as natural language processing, we need the stringi package.

    --> Features of stringi package are

    text sorting,

    text comparing,

    extracting words,

    sentences and characters,

    text transliteration,

    replacing strings, etc.
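Two of the features listed above, text sorting and text comparing, as a minimal sketch (requires the stringi package to be installed):

```r
library(stringi)

stri_sort(c("hills", "data", "science"))  # "data" "hills" "science"
stri_cmp_eq("data", "DATA")               # FALSE - exact comparison
stri_cmp_eq("data", "data")               # TRUE
```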

    #Install and load stringi

    install.packages("stringi")

    library(stringi)



    data <- "Welcome to DataHills.

         DataHills provides online training."

    # To avoid \n write %s+%

    data <- "Welcome to DataHills. " %s+%

        "DataHills provides online training."

    data <- "Welcome to DataHills. DataHills provides online training on Data Science, Data Analytics, Machine Learning, R Programming, Python, Weka, Pega, SqlServer, MySql, SSIS, SSRS, SSAS and PowerBI. For details contact 9292005440 or"



    --> stri_split_boundaries() function is used to extract words.




    str - the single string input

    skip_word_none - logical; perform no action for "words" that do not fit into any other categories

    skip_word_number - logical; perform no action for words that appear to be numbers

    skip_word_letter - logical; perform no action for words that contain letters, excluding hiragana, katakana, or ideographic characters


    stri_split_boundaries(data, type="line")

    stri_split_boundaries(data, type="word")

    stri_split_boundaries(data, type="word", skip_word_none=TRUE)

    stri_split_boundaries(data, type="word", skip_word_letter=TRUE)

    stri_split_boundaries(data, type="word", skip_word_none=TRUE, skip_word_letter=TRUE)

    stri_split_boundaries(data, type="word", skip_word_number=TRUE)

    stri_split_boundaries(data, type="word", skip_word_none=TRUE, skip_word_number=TRUE)

    stri_split_boundaries(data, type="sentence")

    stri_split_boundaries(data, type="character")



    --> Count the number of text boundaries (like character, word, line, or sentence boundaries) in a string.


    stri_count_boundaries(data, type="line")

    stri_count_boundaries(data, type="word")

    stri_count_boundaries(data, type="sentence")

    stri_count_boundaries(data, type="character")




    stri_startswith & stri_endswith:


    --> stri_startswith_* and stri_endswith_* determine whether a string starts or ends with a given pattern.

    stri_startswith_fixed(c("srinu", "data", "science", "statistics", "hills"), "s")

    stri_startswith_fixed(c("srinu", "data", "science", "statistics", "hills"), "d")

    stri_startswith_fixed(c("srinu", "data", "science", "Statistics", "hills"), "s")

    stri_startswith_coll(c("srinu", "data", "science", "Statistics", "hills"), "s", strength=1)

    stri_endswith_fixed(c("srinu", "data", "science", "statistics", "hills"), "s")

    stri_detect_regex(c("srinu", "data", "science", "statistics", "hills"), "^s")

    stri_detect_regex(c("srinu", "data", "science", "statistics", "hills"), "s")

    stri_startswith_fixed("datahills", "hill")

    stri_startswith_fixed("datahills", "hill", from=5)



    --> stri_replace_all() function replaces a word with another word based on conditions

    --> The vectorize_all parameter defaults to TRUE.

    stri_replace_all_fixed(data, " ", "#")

    stri_replace_all_fixed(data, "a", "A")

    stri_replace_all_fixed(data,c("DataHills","provides"), c("Information","offers"), vectorize_all=FALSE)

    stri_replace_all_fixed(data,c("DataHills","provides"), c("Information","offers"), vectorize_all=TRUE)

    stri_replace_all_fixed(data,c("DataHills","provides"), c("Information","offers"))

    stri_replace_all_fixed(data,c("Data","provides"), c("Information","offers"), vectorize_all=FALSE)

    stri_replace_all_fixed(data,c("Data","provides"), c("Information","offers"), vectorize_all=TRUE)

    stri_replace_all_fixed(data,c("Data","provides"), c("Information","offers"), vectorize_all=FALSE)

    stri_replace_all_regex(data,"\\b"%s+%c("Data","provides")%s+%"\\b", c("Information","offers"),vectorize_all=FALSE)


  • String Manipulation with Stringi Package Part 2 & Date and Time Part 1  



    --> stri_split() is used to split sentences based on ; , _ or any other metric

    stri_split_fixed(data, " ")

    stri_split_fixed("a_b_c_d", "_")

    stri_split_fixed("a_b_c__d", "_")

    stri_split_fixed("a_b_c__d", "_", omit_empty=FALSE)

    stri_split_fixed("a_b_c__d", "_", omit_empty=TRUE)

    stri_split_fixed("a_b_c__d", "_", n=2) # "a" & remainder

    stri_split_fixed("a_b_c__d", "_", n=2, tokens_only=FALSE)

    stri_split_fixed("a_b_c__d", "_", n=2, tokens_only=TRUE) # "a" & "b" only

    stri_split_fixed("a_b_c__d", "_", n=4, tokens_only=TRUE)

    stri_split_fixed("a_b_c__d", "_", n=4, omit_empty=TRUE, tokens_only=TRUE)

    stri_split_fixed("a_b_c__d", "_", omit_empty=NA)

    stri_split_fixed(c("ab_c", "d_ef_g", "h", ""),"_")

    stri_split_fixed(c("ab_c", "d_ef_g", "h", ""), "_", n=1, tokens_only=TRUE, omit_empty=TRUE)

    stri_split_fixed(c("ab_c", "d_ef_g", "h", ""), "_", n=2, tokens_only=TRUE, omit_empty=TRUE)

    stri_split_fixed(c("ab_c", "d_ef_g", "h", ""), "_", n=3, tokens_only=TRUE, omit_empty=TRUE)



    --> stri_list2matrix() is used to convert lists of atomic vectors to character matrices

    stri_split_fixed(c("ab,c", "d,ef,g", ",h", ""), ",", omit_empty=TRUE)

    stri_list2matrix(stri_split_fixed(c("ab,c", "d,ef,g", ",h", ""), ",", omit_empty=TRUE))

    stri_split_fixed(c("ab,c", "d,ef,g", ",h", ""), ",", omit_empty=TRUE, simplify=FALSE)

    stri_split_fixed(c("ab,c", "d,ef,g", ",h", ""), ",", omit_empty=TRUE, simplify=TRUE)

    stri_split_fixed(c("ab,c", "d,ef,g", ",h", ""), ",", omit_empty=NA, simplify=NA)



    --> stri_trans() functions transform strings either to lower case, UPPER CASE, or to Title Case.

    stri_trans_toupper(data) #toupper(data)

    stri_trans_tolower(data) #tolower(data)

    stri_trans_totitle(data) # Title Case





    Date and Time:


    --> R is able to access the current date, time and time zone

    --> Sys.Date() and Sys.time() functions are used to get the current date and time.




    x <- as.Date("2018-10-12")


    typeof(x) # double

    class(x) # Date

    as.Date(c('2018-10-11', '2018-10-12'))

    a <- Sys.Date()

    b <- Sys.time()

    typeof(a) # "double"

    typeof(b) # "double"

    class(a) # "Date"

    class(b) # "POSIXct" "POSIXt"


  • Date and Time Part 2  



    --> DateTimeClasses is the description of the classes "POSIXlt" and "POSIXct" representing calendar dates and times.

    --> To format Dates we use the format(date, format="%Y-%m-%d") function with either the POSIXct (given from as.POSIXct()) or POSIXlt (given from as.POSIXlt())

    --> Codes for specifying the formats to the as.Date() function.

    Format  Code_Meaning

    ------  -----------

     %d day

     %m month

     %y year in 2-digits

     %Y year in 4-digits

     %b abbreviated month in 3 chars

     %B full name of the month

    # It tries to interpret the string as %Y-%m-%d

    as.Date("2018-10-15") # no problem

    as.Date("2018/10/15") # no problem

    as.Date("  2018-10-15 datahills") # leading whitespace and all trailing characters are ignored

    as.Date("15-10-2018") # still tried as "%Y-%m-%d", so this fails

    as.Date("15/10/2018") # again tried as "%Y-%m-%d", so this fails

    as.Date("2018-10-15", format = "%Y-%m-%d")

    as.Date("2018-10-15") # in ISO format, so does not require formatting string

    as.Date("10/15/18", format = "%m/%d/%y")

    as.Date("October 15, 2018", "%B %d, %Y")

    as.Date("October 15th, 2018", "%B %dth, %Y") # add separators and literals to format


    as.Date("15-10-2018", "%d-%m-%Y")

    as.Date("15 Oct, 2018","%d %b, %Y")


    as.Date("15 October, 2018", "%d %B, %Y")

    Formatting and printing date-time objects:


    # test date-time object

    d = as.POSIXct("2018-10-15 06:30:10.10", tz = "UTC")

    format(d,"%S") # 00-61 Second as integer

    format(d,"%OS") # 00-60.99… Second as fractional

    format(d,"%M") # 00-59 Minute

    format(d,"%H") # 00-23 Hours

    format(d,"%I") # 01-12 Hours

    format(d,"%p") # AM/PM Indicator

    format(d,"%Z") # Time Zone Abbreviation

    # To add/subtract time, use POSIXct, since it stores times in seconds


    # adding/subtracting times - 60 seconds

    as.POSIXct("2018-10-15") + 60

    # adding 5 hours, 30 minutes, 10 seconds

    as.POSIXct("2018-10-14") + ( (5 * 60 * 60) + (30 * 60) + 10)

    # as.difftime can be used to add time periods to a date.

    as.POSIXct("2018-10-14") +

    as.difftime(5, units="hours") +

    as.difftime(30, units="mins") +

    as.difftime(10, units="secs")

    # To find the difference between dates/times use difftime() for differences in seconds, minutes, hours, days or weeks.

    # using POSIXct objects

    difftime(
    as.POSIXct("2018-10-14 12:00:00"),
    as.POSIXct("2018-10-14 11:59:50"),
    units = "secs")

    as.POSIXct("07:30", format = "%H:%M") # time, formatting string

    strptime("07:30", format = "%H:%M") # identical, but makes a POSIXlt object

    as.POSIXct("07 AM", format = "%I %p")

    as.POSIXct("07:30:10",
    format = "%H:%M:%S",
    tz = "Asia/Calcutta") # time string without timezone & set time zone

    as.POSIXct("2018-10-15 07:30:10",
    format = "%F %T") # shortcut tokens for "%Y-%m-%d" and "%H:%M:%S"

  • Data Extraction from CSV File  

    Loading Data into R Objects:


    --> R can read and write into various file formats like

    Data Extraction from CSV

    Data Extraction from URL

    Data Extraction from CLIPBOARD

    Data Extraction from EXCEL

    Data Extraction from DATABASES

    Working Directory:


    --> getwd() function is used to get current working directory.

    --> setwd() function is used to set a new working directory.

    getwd() # C:/Users/Sreenu/Documents

    setwd("C:/Users/Sreenu")

    getwd() # C:/Users/Sreenu

    Data Extraction from CSV:


    --> read.csv() function is used to import the Comma separated value files (CSVs)

    --> Use sep = "," to set the delimiter to a comma.

    Parameter     Details

    ---------     -------

    file          name of the CSV file to read

    header        logical: does the .csv file contain a header row with column names?

    sep           character: symbol that separates the cells on each row

    quote         character: symbol used to quote character strings

    dec           character: symbol used as decimal separator

    fill          logical: when TRUE, rows of unequal length are filled with blanks

    comment.char  character: character used as comment in the csv file

    Reading a CSV File Separated by ",":


    --> Read a CSV file available in current working directory

    read.csv("emp10.csv", sep=",", stringsAsFactors=TRUE)

    emp <- read.csv("emp10.csv")

    --> Read a CSV file available in another directory

    emp <- read.csv("c:\Users\Sreenu\Desktop\MLDataSets\emp10.csv") # Error

    emp <- read.csv("c:\\Users\\Sreenu\\Desktop\\MLDataSets\\emp10.csv")

    emp <- read.csv("c:/Users/Sreenu/Desktop/MLDataSets/emp10.csv")

    Analyzing the CSV File:

    ----------------------- # TRUE

    typeof(emp) # list

    mode(emp) # list

    class(emp) # data.frame



    dim(emp) # dimensions of the emp file

    names(emp) # names of the attributes

    str(emp) # structure of the attributes

    emp[1:6,] # first 6 rows and all columns

    # head() and tail() functions are used to return the first or last records

    head(emp) # return first 6 records by default

    head(emp, n=3) # return n no. of records

    head(emp, 3)

    head(emp, -3)

    tail(emp) # return last 6 records by default

    tail(emp, 3)

    tail(emp, -3)

    # max salary from emp.

    max(emp$sal)


    # emp details having max salary.

    subset(emp, sal == max(sal))

    # all the employees working as data analyst

    subset(emp, desig == "data analyst")

    # Data Analyst whose sal is greater than 65000

    subset(emp, sal > 65000 & desig == "data analyst")

    # select only emp desig whose sal is greater than 60000

    subset(emp, sal > 60000, select = desig)

    # select all columns except desig whose sal is greater than 60000

    subset(emp, sal > 60000, select = -desig)

    # employees who joined after 2013-01-01

    subset(emp, as.Date(doj) > as.Date("2013-01-01"))

    recent_join <- subset(emp, as.Date(doj) > as.Date("2013-01-01"))


  • Data Extraction from EXCEL File  

    Writing into a CSV File:


    --> write.csv() function is used to create the csv file.

    --> This file gets created in the current working directory.

    recent_join <- subset(emp, as.Date(doj) > as.Date("2013-01-01"))


    # Write filtered data into a new file.

    write.csv(recent_join, "emp6.csv")

    newemp <- read.csv("emp6.csv")


    write.csv(newemp, "emp6_1.csv")

    new <- read.csv("emp6_1.csv")


    # By default row names are written as an extra column, which shows up as X on re-import.

    write.csv(recent_join,"emp6.csv", row.names = FALSE)

    newemp <- read.csv("emp6.csv")


    Reading a CSV File Separated by ";":


    wines <- read.csv("c:/Users/Sreenu/Desktop/MLDataSets/winequality-red.csv")



    wines <- read.csv("c:/Users/Sreenu/Desktop/MLDataSets/winequality-red.csv",sep=";")




    wines <- read.csv("c:/Users/Sreenu/Desktop/MLDataSets/winequality-red.csv")


    wines <- read.csv2("c:/Users/Sreenu/Desktop/MLDataSets/winequality-red.csv")


    Data Extraction from EXCEL:


    --> R can read and write the excel files using xlsx package.

    --> Excel is the most widely used spreadsheet program which stores data in the .xls or .xlsx format.

    --> Note that the xlsx package depends on the rJava and xlsxjars R packages.

    # First install Java on our system, otherwise xlsx will not load into R

    install.packages("xlsx")

    library(xlsx)



    Reading the Excel File:


    --> read.xlsx() and read.xlsx2() functions are used to import excel files.

    --> read.xlsx2() is faster on big files compared to the read.xlsx() function.


    read.xlsx(file, sheetIndex, header=TRUE)




    file - the path to the file to read

    sheetIndex - a number indicating the index of the sheet to read;

    Ex:- use sheetIndex=1 to read the first sheet

    header - a logical value. If TRUE, the first row is used as the names of the variables

    # Read the first worksheet in the file emp10.xlsx.

    emp <- read.xlsx("c:/Users/Sreenu/Desktop/MLDataSets/emp10.xlsx") #Error

    emp <- read.xlsx("c:/Users/Sreenu/Desktop/MLDataSets/emp10.xlsx", sheetIndex = 1)







    emp <- read.xlsx("c:/Users/Sreenu/Desktop/MLDataSets/emp10.xlsx", 2)


    Writing the Excel file:


    --> write.xlsx() and write.xlsx2() functions are used to export data from R to an Excel file.

    --> write.xlsx2() achieves better performance compared to write.xlsx() for very large data frames (with more than 100,000 records).




    write.xlsx(x, file, sheetName, col.names=TRUE, row.names=TRUE, append=FALSE)

    x - a data.frame to be written into the workbook

    file - the path to the output file

    sheetName - a character string to use for the sheet name.

    col.names, row.names:

    a logical value specifying whether the column names/row names of x are to be written to the file

    append - a logical value indicating if x should be appended to an existing file.

    emp <- read.xlsx("c:/Users/Sreenu/Desktop/MLDataSets/emp10.xlsx", 1)

    a <- head(emp)

    b <- tail(emp)

    # Write the first data set in a new workbook

    write.xlsx(a, file="write_emp.xlsx", sheetName="first6", append=FALSE)

    # Add a second data set in a new worksheet

    write.xlsx(b, file="write_emp.xlsx", sheetName="last6", append=TRUE)






    # practice on emp100, emp1000 files

    emp100 <- read.csv("c:/Users/Sreenu/Desktop/MLDataSets/emp100.csv")


    emp1000 <- read.csv("c:/Users/Sreenu/Desktop/MLDataSets/emp1000.csv")


  • Data Extraction from CLIPBOARD, URL, XML & JSON Files  

    Data Extraction from CLIPBOARD:


    --> read.delim("clipboard") function is used to import the copied data.

    emp <- read.delim("clipboard")







    Data Extraction from URL:


    read.csv(url("url address"))

    wine_red <- read.csv(url(""))

    dim(wine_red) # 1599 1

    wine_red <- read.csv(url(""),sep=";")

    dim(wine_red) # 1599 12

    wine_white <- read.csv(url(""),sep=";")

    dim(wine_white) # 4898 12

    Data Extraction from XML:


    --> Extensible Markup Language (XML) is a file format which shares the data on the world wide web.

    --> XML is similar to HTML; both contain markup tags.

    --> We can extract XML files using the "XML" package.

    install.packages("XML")

    library(XML)



    Reading XML File:


    --> Read XML file by using xmlParse() function.

    emp_xml <- xmlParse("C:/Users/Sreenu/Desktop/MLDataSets/emp10.xml")


    # Extract the root node from the xml file.

    emp_root <- xmlRoot(emp_xml)

    # Extract the details of the first node

    emp_root[[1]]

    # Get the first element of the first node.

    emp_root[[1]][[1]]

    # Get the fifth element of the first node.

    emp_root[[1]][[5]]

    # Get the second element of the third node.

    emp_root[[3]][[2]]

    # Find number of nodes in the root.

    emp_size <- xmlSize(emp_root)


    XML to Data Frame:


    --> For data analysis it is better to convert the xml file into a data frame.

    --> We have to use xmlToDataFrame() function to convert into data frame.

    emp_df <- xmlToDataFrame("C:/Users/Sreenu/Desktop/MLDataSets/emp10.xml")



    Data Extraction from JSON:


    --> JavaScript Object Notation (JSON) files can be read by using the rjson package.

    install.packages("rjson")

    library(rjson)



    Read the JSON File:


    --> Read the JSON file by using fromJSON() function.

    a <- fromJSON(file = "file_name.json")

    JSON to Data Frame:


    --> For data analysis it is better to convert the JSON file to a data frame.

    --> We have to use the as.data.frame() function to convert it into a data frame.

    b <- as.data.frame(a)


  • Introduction to DBMS  



    --> Data is a raw fact (collection of characters, numeric values, special characters etc)

    --> Whatever we input from the keyboard is known as data.

    --> Data will not provide any meaningful statements to the user.

    --> Ex:- ec@mw2lo1e3



    --> Processed data is called information.

    --> Information always provides meaningful statements to the user.

    --> Ex:- welcome@123



    --> A database is a collection of information, written in a predetermined manner and saved at a particular location.

    Database Management System (DBMS):


    --> DBMS is a tool which can be used to maintain & manage the data with in the database.

    --> DBMS is used for storing the information, accessing the information, sharing the information and providing security to the information.

    Models of DBMS:


    --> DBMS contains 6 models:

    1. File management system (FMS)

    2. Hierarchy management system (HMS)

    3. Network database management system (NDBMS)

    4. Relational database management system (RDBMS)

    5. Object relational database management system (ORDBMS)

    6. Object oriented relational database management system (OORDBMS)

    File Management System:


    --> FMS is the first model of DBMS, designed & developed in the 1950's.

    --> In this model, the data is stored in a sequential manner, as a continuous stream of characters.



    Drawbacks of FMS:

    --> Costly in maintenance

    --> Requires more man power

    --> Accessing the data is a time-consuming process

    --> It is difficult to maintain a large amount of data

    --> There is no security

    --> It is not possible to share the information with multiple programmers

    --> Programmers will get a delayed response.

    Hierarchy Management System:


    --> HMS is the second model of DBMS, designed & developed by IBM in the 1960's while developing a project called IMS (Information Management System).

    --> In this model the data will be stored in the form of tree structure or level manner.

    --> In tree structure the user has to maintain the following levels those are

    Root level will represent Database Name,

    Parent level will represent Table's Name,

    Child level will represent Column names of a table,

    Leaf level will represent Additional columns.

    --> The main advantage of HMS model is to access the data from the location without taking much time.



    --> In this model only one programmer can interact with the data at a time.

    --> There is no security for database information

    --> It is not possible to share the database with multiple programmers or locations.

    Network Database Management System:


    --> NDBMS is the third model of DBMS, designed & developed by IBM in 1969 while enhancing the features of the IMS project.

    --> In this model the data is stored in the form of a tree structure and located within a network environment.

    --> The main advantage of NDBMS is that the required database can be shared with multiple programmers at a time, all communicating with the same database.



    --> There is no proper security for centralized database system

    --> Database redundency will be increased (duplicate values)

    --> It occupies more memory

    --> Application performance will be reduced

    --> User will get delay responses

    NOTE: The above 3 models are outdated.

    Relational Database Management System:


    --> RDBMS is the 4th model of DBMS, designed & developed by the English computer scientist E.F. Codd (working at IBM) in 1970.

    --> E.F. Codd defined 13 rules (Rule 0 to Rule 12), known as Codd's rules:

    Rule 0: Foundation rule

    Rule 1: Information rule

    Rule 2: Guaranteed access rule

    Rule 3: Systematic treatment of null values

    Rule 4: Active online catalog

    Rule 5: Comprehensive data sub-language rule

    Rule 6: View updating rule

    Rule 7: High-level insert, update and delete

    Rule 8: Physical data independence

    Rule 9: Logical data independence

    Rule 10: Integrity independence

    Rule 11: Distribution independence

    Rule 12: Non-subversion rule

    --> If a database satisfies at least 6 of Codd's rules, then the DBMS is called an RDBMS product.

    --> Here a relation can be defined as a commonness between the objects.

    --> Relations are again classified into 3 types:

    1. one to one relation

    2. one to many relation

    3. many to many relation

    --> An object having a relationship with one other object is known as a one to one relation.

    Ex:- student <-> sid

    --> An object having a relationship with many other objects is known as a one to many relation.

    Ex:- student <-> C, C++, Java

    --> Many objects having relationships with many other objects is known as a many to many relation.

    Ex:- vendor1, vendor2, vendor3 <-> product1, product2, product3

    --> E.F. Codd designed the above relations based on a mathematical concept called "Relational Algebra".

    --> The above 3 relations are called the Degrees of Relationship.

    Features of RDBMS:


    --> In the RDBMS model, data is stored in the form of tables.

    --> A table is a collection of rows and columns.

    --> A vertical line is called a column, field or attribute, and a horizontal line is called a row, record or tuple.

    --> The intersection between a row & a column is called a CELL (an atomic value).

    --> RDBMS provides strong security for the database information.

    --> Accessing the data is very easy and user friendly.

    --> RDBMS provides the facility of sharing the database from one location to another without losing data.

    --> It is easy to perform manipulations on table data.

    --> RDBMS improves application performance and avoids database redundancy problems.

  • Structured Query Language, MySQL Installation & Normalization  
  • Data Definition Language Commands  
  • Data Manipulation Language Commands  
  • Sub Queries & Constraints  
  • Aggregate Functions, Clauses & Views  
  • Data Extraction from Databases Part 1  

    # For data analysis we mostly have to read data from various databases like MySQL, Microsoft SQL Server, Oracle Database Server, etc.

    # So, as data scientists we should be able to write queries to get data on various criteria from the database.

    Data Extraction from Databases:


    --> R connects with many relational databases like MySQL, Oracle, SQL Server, PostgreSQL, SQLite, etc.

    --> R can fetch records from databases as a data frame.

    --> Once the data is available in the R environment, then it becomes easy to manipulate or analyze using packages and functions.

    Different R Packages:


    MySQL - RMySQL

    PostgreSQL - RPostgreSQL

    Oracle - ROracle

    --> Here I am using MySQL for connecting to R.

    --> To work with MySQL we have to install & load the RMySQL package.


    # Automatically DBI Package is also installed




    # library(RMySQL) not required
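    As a quick sketch (package installation requires an internet connection), the install-and-load step looks like this:

    ```r
    # Installing RMySQL automatically installs its dependency, DBI
    install.packages("RMySQL")

    # Loading RMySQL also attaches DBI, so library(DBI) is not required
    library(RMySQL)
    ```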

    Connecting R to MySQL:


    --> For connecting R to MySQL, it takes the username, password, database name and host name as input.

    --> dbConnect() function is used to create a connection to a DBMS.


    dbConnect(drv, ...) # drv is a DBI driver object, e.g. MySQL()



    user:- for the user name (default: current user)

    password:- for the password

    dbname:- for the name of the database

    host:- for the host name (default: local connection)

    # Here, I am connecting with the "roadway_travels" database

    travels = dbConnect(MySQL(), user = 'root', password = 'datahills', dbname = 'roadway_travels', host = 'localhost')

    # dbListTables() function is used to display the list of tables available in this database.
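    A sketch, assuming the `travels` connection created above; the table names shown are illustrative and depend on the actual database:

    ```r
    # List the tables available in the connected database
    dbListTables(travels)
    # e.g. "bus" "reservation" ... (actual names depend on the database)
    ```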


    Querying the Tables:


    --> dbSendQuery() function is used to execute a query on a given database connection.

    --> fetch() function is used to store the result set as a data frame in R.

    # Query the "bus" table to get all the rows.

    busdb = dbSendQuery(travels, "select * from bus")

    # Store the result in an R data frame object. n = 5 is used to fetch the first 5 rows.

    bus5 = fetch(busdb, n = 5)



    # Query with Filter Clause:

    bushyd = dbSendQuery(travels, "select * from bus where SOURCE='HYDERABAD' ")

    # Fetch all the records(with n = -1) and store it as a data frame.

    busall = fetch(bushyd, n = -1)



  • Data Extraction from Databases Part 2 & DPlyr Package Part 1  

    # Updating Rows in the Tables

    dbSendQuery(travels, "update bus set SOURCE = 'DELHI' where BNO = 50")

    # After executing the above code we can see the updated table in MySQL.

    # Inserting Data into the Tables

    dbSendQuery(travels,"insert into bus values (70,'ORANGE','PUNE','DELHI','20:00:00','06:30:00')")

    # After executing the above code we can see the row inserted into the table in MySQL.

    # Dropping Tables in MySQL

    dbSendQuery(travels, 'drop table reservation')

    # After executing the above code we can see the table is dropped in MySQL.

    Creating Tables in MySQL:


    --> dbWriteTable() function is used to create tables in MySQL.

    --> It takes a data frame as input.

    --> It overwrites the table if it already exists.

    # Use the R data frame "mtcars" to create the table in MySQL.

    dbWriteTable(travels, "mtcars", mtcars[,], overwrite = TRUE)

    # After executing the above code we can see the table created in MySQL.

    Closing connections:


    --> dbDisconnect() function is used to close the connection created with MySQL.
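    A minimal sketch, assuming the `travels` connection from earlier:

    ```r
    # Close the connection when done; returns TRUE on success
    dbDisconnect(travels)
    ```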




    --> dplyr is a powerful R package to manipulate, clean and summarize data.

    --> The dplyr package contains many functions that perform commonly used data manipulation operations, such as

    applying filter,

    selecting specific columns,

    sorting data,

    adding or deleting columns and

    aggregating data.

    --> This package was written by Hadley Wickham who has written many useful R packages such as ggplot2, tidyr etc.

    --> dplyr functions are similar to SQL commands, such as

    select() for selecting columns,

    group_by() - group data by a grouping column,

    join() - joining two data sets.

    Also includes inner_join() and left_join().

    It also supports the sub-queries that SQL is popular for.

    --> But SQL was never designed to perform data analysis; it was designed for querying and managing data.

    --> There are many data analysis operations where SQL fails or makes simple things difficult.

    --> For example, calculating median for multiple columns, converting wide format data to long format etc.

    --> Whereas, dplyr package was designed to do data analysis.

    # install and load dplyr package
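    For reference, a minimal install-and-load sketch (requires an internet connection):

    ```r
    # One-time installation from CRAN, then load the package
    install.packages("dplyr")
    library(dplyr)
    ```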



    # dplyr Functions:-

    dplyr Function   Description                     Equivalent SQL
    ==============   ===========                     ==============
    select()         Selecting columns (variables)   SELECT
    filter()         Filter (subset) rows            WHERE
    group_by()       Group the data                  GROUP BY
    summarise()      Summarise (or aggregate) data   -
    arrange()        Sort the data                   ORDER BY
    join()           Joining data frames (tables)    JOIN
    mutate()         Creating new columns            COLUMN ALIAS

    # I am using the sampledata.csv file which contains income generated by states from year 2002 to 2015.

    mydata = read.csv("C:/Users/Sreenu/Desktop/MLDataSets/sampledata.csv")

    dim(mydata) # 51 observations (rows) and 16 variables (columns)

    # Selecting Random N Rows

    --> sample_n() function selects random rows from a data frame (or table).

    --> The second parameter of the function tells R the number of rows to select.
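    As a self-contained illustration (using the built-in mtcars data frame instead of sampledata.csv):

    ```r
    library(dplyr)

    # Select 5 random rows from mtcars; all original columns are retained
    random5 <- sample_n(mtcars, 5)
    nrow(random5)  # 5
    ```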


    # Selecting Random Fraction of Rows

    --> sample_frac() function returns a random N% of rows.

    sample_frac(mydata, 0.1) # returns a random 10% of rows


  • DPlyr Package Part 2  

    # Remove Duplicate Rows based on all the columns (Complete Row)

    --> distinct() function is used to eliminate duplicates.

    x1 = distinct(mydata)


    # In this dataset, there is not a single duplicate row, so it returns the same number of rows as mydata.

    # Remove Duplicate Rows based on a column

    --> .keep_all argument is used to retain all other columns in the output data frame.

    x2 = distinct(mydata, Index, .keep_all= TRUE)


    # Remove Duplicate Rows based on multiple columns

    --> We use two columns, Index and Y2010, to determine uniqueness.

    x2 = distinct(mydata, Index, Y2010, .keep_all= TRUE)




    select():

    --> It is used to select only the desired columns.


    select(data, ...)

    data : Data Frame

    ... : columns by name or by helper function

    # Selecting Columns

    --> Selects the column "State" and the columns from "Y2006" to "Y2008".

    mydata2 = select(mydata, State, Y2006:Y2008)

    # Dropping columns

    --> The minus sign before a column tells R to drop the variable.

    mydata2 = select(mydata, -Index, -State)

    --> The above code can also be written like :

    mydata2 = select(mydata, -c(Index,State))

    # Selecting or Dropping columns that start with 'Y'

    --> starts_with() function is used to select columns that start with a given prefix.

    mydata3 = select(mydata, starts_with("Y"))


    --> Adding a negative sign before starts_with() drops the columns starting with 'Y'.

    mydata33 = select(mydata, -starts_with("Y"))


    The following functions help you to select columns based on their names:

    Helpers         Description
    =======         ===========
    starts_with()   Starts with a prefix
    ends_with()     Ends with a suffix
    contains()      Contains a literal string
    matches()       Matches a regular expression
    num_range()     Numerical range like x01, x02, x03
    one_of()        Columns in a character vector
    everything()    All columns

    # Selecting columns contain 'I' in their names

    mydata4 = select(mydata, contains("I"))

    # Reorder columns

    --> Puts the column 'State' in front; the remaining columns follow it.

    mydata5 = select(mydata, State, everything())



    rename():

    --> It is used to change a column name.


    rename(data, new_name = old_name)

    data : Data Frame

    new_name : New column name you want to keep

    old_name : Existing column name

    # Rename Columns

    --> The rename function can be used to rename columns.

    --> we are renaming 'Index' column to 'Index1'.

    mydata6 = rename(mydata, Index1=Index)



    filter():

    --> It is used to subset data with matching logical conditions.

    syntax : filter(data, ...)

    data : Data Frame

    ... : logical condition

    # Filter Rows

    --> To filter rows and retain only those in which Index is equal to "A".

    mydata7 = filter(mydata, Index == "A")

    # Multiple Selection Criteria

    --> The %in% operator can be used to select multiple items.

    --> Select rows against 'A' and 'C' in column 'Index'.

    mydata7 = filter(mydata, Index %in% c("A", "C"))

    # 'AND' Condition in Selection Criteria

    --> Filtering data for 'A' and 'C' in the column 'Index' and income of at least 13 lakh (1,300,000) in Year 2002.

    mydata8 = filter(mydata, Index %in% c("A", "C") & Y2002 >= 1300000)

    # 'OR' Condition in Selection Criteria

    --> | (OR) in the logical condition means that either of the two conditions may hold.

    mydata9 = filter(mydata, Index %in% c("A", "C") | Y2002 >= 1300000)

    # NOT Condition

    --> The "!" sign is used to reverse the logical condition.

    mydata10 = filter(mydata, !Index %in% c("A", "C"))

    # CONTAINS Condition

    --> The grepl() function is used for pattern matching.

    --> We are looking for records in which the column State contains 'Ar' in its name.

    mydata10 = filter(mydata, grepl("Ar", State))

  • DPlyr Functions on Air Quality Data Set  



    summarise():

    --> It is used to summarize data.

    syntax : summarise(data, ...)

    data : Data Frame

    ... : summary functions such as mean, median, etc.

    # Summarize selected columns

    --> we are calculating mean and median for the column Y2015.

    summarise(mydata, Y2015_mean = mean(Y2015), Y2015_med=median(Y2015))

    # Summarize Multiple Columns

    --> We are calculating the number of records, mean and median for the columns Y2005 and Y2006.

    --> summarise_at() function allows us to select multiple columns by their names.

    summarise_at(mydata, vars(Y2005, Y2006), funs(n(), mean, median))

    # Note: funs() is deprecated in dplyr >= 0.8; the modern equivalent is
    # summarise_at(mydata, vars(Y2005, Y2006), list(n = ~n(), mean = mean, median = median))

    Working on another dataset:


    # I am using the airquality dataset from the datasets package.

    # The airquality dataset contains information about air quality measurements in New York from May 1973 – September 1973.




    sample_n(airquality, size = 10)

    sample_frac(airquality, size = 0.1)

    # we can return all rows with Temp greater than 70 as follows:

    filter(airquality, Temp > 70)

    # return all rows with Temp larger than 80 and Month higher than 5.

    filter(airquality, Temp > 80 & Month > 5)

    # adds a new column that displays the temperature in Celsius.

    mutate(airquality, TempInC = (Temp - 32) * 5 / 9)

    summarise(airquality, mean(Temp, na.rm = TRUE))

    summarise(airquality, Temp_mean = mean(Temp, na.rm = TRUE))

    # Group By

    --> The group_by function is used to group data by one or more columns.

    --> we can group the data together based on the Month, and then use the summarise function to calculate and display the mean temperature for each month.

    summarise(group_by(airquality, Month), mean(Temp, na.rm = TRUE))

    # Count

    --> The count function calculates the no. of observations based on a group.

    --> It is slightly similar to the table function in the base package.

    count(airquality, Month)

    --> This means that there are 31 rows with Month = 5, 30 rows with Month = 6, and so on.

    # Arrange

    --> The arrange function is used to arrange rows by columns.

    --> Currently, the airquality dataset is arranged based on Month, and then Day.

    --> We can use the arrange function to arrange the rows in the descending order of Month, and then in the ascending order of Day.

    arrange(airquality, desc(Month), Day)

    # Pipe

    --> The pipe operator in R, represented by %>%, can be used to chain code together.

    --> It is very useful when you are performing several operations on data, and don’t want to save the output at each intermediate step.

    --> For example, let’s say we want to remove all the data corresponding to Month = 5, group the data by month, and then find the mean of the temperature each month.

    --> The conventional way to write the code for this would be:

    filteredData <- filter(airquality, Month != 5)

    groupedData <- group_by(filteredData, Month)

    summarise(groupedData, mean(Temp, na.rm = TRUE))

    --> With piping, the above code can be rewritten as:

    airquality %>%

      filter(Month != 5) %>%

      group_by(Month) %>%

      summarise(mean(Temp, na.rm = TRUE))

  • Plyr Package for Data Analysis  



    --> The plyr package is a tool for doing split-apply-combine (SAC) procedures.

    --> This is an extremely common pattern in data analysis:

    we solve a complex problem by breaking it down into small pieces, doing something to each piece and then combining the results back together again.

    Install and Load plyr:
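    A minimal install-and-load sketch (requires an internet connection):

    ```r
    # One-time installation from CRAN, then load the package
    install.packages("plyr")
    library(plyr)
    ```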



    Plyr provides a set of functions for common data analysis problems:

    arrange():
    re-order the rows of a data frame by specifying the columns to order by

    mutate():
    add new columns or modify existing columns, like transform, but new columns can refer to other columns that you just created

    summarise():
    like mutate, but creates a new data frame, not preserving any columns in the old data frame

    join():
    an adaptation of merge which is more similar to SQL, and has a much faster implementation if you only want to find the first match

    match_df():
    a version of join that, instead of returning the two tables combined together, only returns the rows in the first table that match the second

    colwise():
    make any function work column-wise on a data frame

    rename():
    easily rename columns in a data frame

    round_any():
    round a number to any degree of precision

    count():
    quickly count unique combinations and return the result as a data frame

    plyr vs dplyr:


    --> dplyr is a new package which provides a set of tools for efficiently manipulating datasets in R.

    --> dplyr is the next iteration of plyr, focusing only on data frames.

    --> dplyr is faster, has a more consistent API and should be easier to use.

    --> Lets compare plyr and dplyr with a little example, using the Batting dataset from the fantastic Lahman package which makes the complete Lahman baseball database easily accessible from R.

    --> Pretend we want to find the five players who have batted in the most games in all of baseball history.

    Install and Load Lahman:
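    A minimal install-and-load sketch; the Lahman package provides the Batting data frame used below (requires an internet connection):

    ```r
    # Install the Lahman baseball database package, then load it
    install.packages("Lahman")
    library(Lahman)
    ```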



    --> The basic format is two letters followed by ply().

    --> The first letter refers to the format in and the second to the format out.

    --> The three main letters are:

    d = data frame

    a = array (includes matrices)

    l = list

    --> So, ddply means: take a data frame, split it up, do something to it, and return a data frame.

    --> ldply means: take a list, split it up, do something to it, and return a data frame.

    --> This extends to all combinations.

    --> In the following table, the columns are the input formats and the rows are the output formats:

    Output \ Input   data frame   list    array
    ==============   ==========   =====   =====
    data frame       ddply        ldply   adply
    list             dlply        llply   alply
    array            daply        laply   aaply
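    A small illustration of the naming scheme on the built-in mtcars data (a sketch assuming plyr is installed; only the output types matter here):

    ```r
    library(plyr)

    # ddply: data frame in, data frame out - mean mpg per cylinder group
    d_out <- ddply(mtcars, "cyl", summarise, mean_mpg = mean(mpg))

    # dlply: data frame in, list out - one element per cylinder group
    l_out <- dlply(mtcars, "cyl", function(df) mean(df$mpg))

    # daply: data frame in, array out - a named numeric array
    a_out <- daply(mtcars, "cyl", function(df) mean(df$mpg))

    is.data.frame(d_out)  # TRUE
    is.list(l_out)        # TRUE
    ```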

  • Tidyr Package with Functions  

    In plyr, we might write code like this:

    games <- ddply(Batting, "playerID", summarise, total = sum(G))

    head(arrange(games, desc(total)), 5)

    --> We use ddply() to break up the Batting dataframe into pieces according to the playerID column, then apply summarise() to reduce the player data to a single row.

    --> Each row in Batting represents one year of data for one player, so we figure out the total number of games with sum(G) and save it in a new column called total.

    --> We sort the result so the most games come at the top and then use head() to pull off the first five.

    # If you need functions from both plyr and dplyr, please load plyr first, then dplyr.

    # If we load plyr after dplyr, it is likely to cause problems.

    In dplyr, the code is similar:

    players <- group_by(Batting, playerID)

    games <- summarise(players, total = sum(G))

    head(arrange(games, desc(total)), 5)

    --> Grouping is a top level operation performed by group_by(), and summarise() works directly on the grouped data, rather than being called from inside another function.

    --> The other big difference is speed. plyr took about 9 seconds on my computer, and dplyr took 0.2s, a 35x speed-up.

    --> This is common when switching from plyr to dplyr, and for many operations you’ll see a 20x-1000x speedup.

    --> dplyr provides another innovation over plyr: the ability to chain operations together from left to right with the %>% operator.

    This makes dplyr behave a little like a grammar of data manipulation:

    Batting %>%

     group_by(playerID) %>%

     summarise(total = sum(G)) %>%

     arrange(desc(total)) %>%

     head(5)



    --> tidyr package is an evolution of reshape2 (2010-2014) and reshape (2005-2010) packages.

    --> It's designed specifically for data tidying (not general reshaping or aggregating)

    --> tidyr is a new package that makes it easy to "tidy" your data.

    --> Tidy data is easy to munge (with dplyr), visualise (with ggplot2 or ggvis) and model (with modelling packages).

    Install and Load tidyr:
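    A minimal install-and-load sketch (requires an internet connection):

    ```r
    # One-time installation from CRAN, then load the package
    install.packages("tidyr")
    library(tidyr)
    ```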



    Translation between the terminology used in different places:

    tidyr          gather    spread
    =====          ======    ======
    reshape(2)     melt      cast
    spreadsheets   unpivot   pivot
    databases      fold      unfold

    # I will use the mtcars dataset from the datasets library.



    # Let us include the names of the cars in a column called car for easier manipulation.

    mtcars$car <- rownames(mtcars)



    mtcars <- mtcars[, c(12, 1:11)]




    --> gather() function converts wide data to a longer format.

    --> It is analogous to the melt function from reshape2.


    gather(data, key, value, ..., na.rm = FALSE, convert = FALSE)

    where ... is the specification of the columns to gather.

    # We can replicate what melt does as follows:

    mtcarsNew <- mtcars %>% gather(attribute, value, -car)




    --> As we can see, it gathers all the columns except car and places their names and values into the attribute and value columns respectively.

    --> The great thing about tidyr is that you can gather only certain columns and leave the others alone.

    --> If we want to gather all the columns from mpg to gear and leave the carb and car columns as they are, we can do it as follows:

    mtcarsNew <- mtcars %>% gather(attribute, value, mpg:gear)





    --> spread() function converts long data to a wider format.

    --> It is analogous to the cast function from reshape2.


    spread(data, key, value, fill = NA, convert = FALSE, drop = TRUE)

    We can replicate what cast does as follows:

    mtcarsSpread <- mtcarsNew %>% spread(attribute, value)


  • Factor Analysis  



    --> unite() function combines two or more columns into a single column.


    unite(data, col, ..., sep = "_", remove = TRUE)

    where ... represents the columns to unite and col represents the column to add.

    # Let us create some dummy data:

    date <- as.Date('2016-01-01') + 0:14

    hour <- sample(1:24, 15)

    min <- sample(1:60, 15)

    second <- sample(1:60, 15)

    event <- sample(letters, 15)

    data <- data.frame(date, hour, min, second, event)


    # Now, let us combine the date, hour, min, and second columns into a new column called datetime.

    # Usually, datetime in R is of the form Year-Month-Day Hour:Min:Second.

    dataNew <- data %>%

     unite(time, hour, min, second, sep = ':')


    dataNew <- data %>%

     unite(time, hour, min, second, sep = ':') %>%

     unite(datetime, date, time, sep = ' ')




    --> separate() splits one column into two or more columns.


    separate(data, col, into, sep, remove, convert, extra , fill , ...)

    # We can get back the original data we created using separate as follows:

    data1 <- dataNew %>%

     separate(datetime, c('date', 'time'), sep = ' ') %>%

     separate(time, c('hour', 'min', 'second'), sep = ':')


    --> It first splits the datetime column into date and time, and then splits time into hour, min, and second.

    Factor Analysis:




    table():

    --> table() returns the count for each categorical value.

    --> table uses the cross-classifying factors to build a contingency table of the counts at each combination of factor levels.

    cars <- read.csv("C:/Users/Sreenu/Desktop/MLDataSets/usedcars.csv",stringsAsFactors=TRUE)












    prop.table():

    --> Expresses table entries as fractions of the marginal table (proportionality).

    --> In mathematics, two variables are proportional if there is always a constant ratio between them.

    --> The constant is called the coefficient of proportionality or proportionality constant.

    prop.table(cars$model) # not possible: prop.table() expects a table object, so wrap with table() first


    prop.table(table(cars$model))*100 # result in percentage
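    The same idea on a small self-contained vector (illustrative data, not the usedcars file):

    ```r
    # A small categorical vector
    colors <- c("red", "blue", "red", "green", "red")

    # Counts per category
    tab <- table(colors)      # blue: 1, green: 1, red: 3

    # Proportions, then proportions as percentages
    prop.table(tab)           # blue 0.2, green 0.2, red 0.6
    prop.table(tab) * 100     # blue 20, green 20, red 60
    ```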

  • Prob.Table & CrossTable  























    --> CrossTable() function belongs to "gmodels" package (for more analysis)



    --> Cross Tabulation With Tests For Factor Independence

    --> The CrossTable( ) function in the gmodels package produces crosstabulations modeled after PROC FREQ in SAS or CROSSTABS in SPSS. It has a wealth of options.

    --> We can control whether to show

    row percentages (prop.r),

    column percentages (prop.c),

    table percentages (prop.t) and

    chi-square contributions (prop.chisq) by setting them to TRUE.
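    A sketch, assuming gmodels is installed and the cars data frame read earlier is available:

    ```r
    library(gmodels)

    # Cross-tabulate model against color, showing only row percentages
    CrossTable(cars$model, cars$color,
               prop.r = TRUE, prop.c = FALSE,
               prop.t = FALSE, prop.chisq = FALSE)
    ```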






  • Statistical Observations Part 1  

    Statistical Observations:


    --> Statistical analysis in R is performed by using many in-built functions.

    --> Most of these functions are part of the R base package & stats package.

    --> These functions take R vector as an input along with the arguments and give the result.






    min() & max():


    --> Returns the (regular or parallel) maxima and minima of the input values.


    max(..., na.rm = FALSE)

    min(..., na.rm = FALSE)



    --> ... numeric or character arguments

    --> na.rm is a logical indicating whether missing values should be removed.

    # Create a vector.

    x <- c(12,41,21,-32,23,24,65,-12,10,-8)
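    With this vector, for example:

    ```r
    x <- c(12, 41, 21, -32, 23, 24, 65, -12, 10, -8)

    min(x)    # -32
    max(x)    # 65
    range(x)  # -32 65, both extremes in one call
    ```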



    car <- read.csv("C:/Users/Sreenu/Desktop/MLDataSets/usedcars.csv")





    --> It is calculated by taking the sum of the values and dividing by the number of values in a data series.

    --> mean() function is used to calculate this value.


    mean(x, trim = 0, na.rm = FALSE, ...)



    --> x is the input vector.

    --> trim is used to drop some observations from both end of the sorted vector.

    --> na.rm is used to remove the missing values from the input vector.

    # Find Mean.

    a <- mean(x)



    Applying Trim Option:


    --> When the trim parameter is supplied, the values in the vector are sorted and then the required number of observations is dropped from each end before calculating the mean.

    --> When trim = 0.3, 3 values from each end will be dropped from the calculation of the mean.

    --> In this case the sorted vector is (-32 -12 -8 10 12 21 23 24 41 65) and the values removed from the vector for calculating mean are (-32 -12 -8) from left and (24 41 65) from right.

    a <- mean(x,trim = 0.3)
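    A worked check of the trim logic on the same vector:

    ```r
    x <- c(12, 41, 21, -32, 23, 24, 65, -12, 10, -8)

    # trim = 0.3 drops 3 values from each end of the sorted vector,
    # leaving 10, 12, 21, 23
    mean(x, trim = 0.3)      # 16.5
    mean(c(10, 12, 21, 23))  # 16.5, the same result computed by hand
    ```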


    mean(car$price, trim = 0.3)

    Applying NA Option:


    --> If there are missing values, then the mean function returns NA.

    --> To drop the missing values from the calculation use na.rm = TRUE, which means remove the NA values.

    # Create a vector.

    x <- c(12,41,21,-32,23,24,65,-12,10,-8,NA)

    # Find mean.

    a <- mean(x)

    print(a) # NA

    # Find mean dropping NA values.

    a <- mean(x,na.rm = TRUE)


    mean(car$price, na.rm = TRUE)



    --> The middle most value in a data series is called the median.

    --> median() function is used to calculate this value.


    median(x, na.rm = FALSE)



    --> x is the input vector.

    --> na.rm is used to remove the missing values from the input vector.

    # Find the median.

    b <- median(x)





    --> quantile produces sample quantiles corresponding to the given probabilities.

    --> The smallest observation corresponds to a probability of 0 and the largest to a probability of 1.





    --> it gives all the above 5 statistical observations.

    --> summary is a generic function used to produce result summaries of the results of various model fitting functions.

    --> The function invokes particular methods which depend on the class of the first argument.
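    For example, on a simple vector:

    ```r
    v <- c(1, 2, 3, 4, 5)

    # One call reports Min, 1st Qu., Median, Mean, 3rd Qu. and Max
    summary(v)  # Min 1, 1st Qu. 2, Median 3, Mean 3, 3rd Qu. 4, Max 5
    ```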





    --> The mode is the value that has highest number of occurrences in a set of data.

    --> Unlike mean and median, mode can have both numeric and character data.

    --> R does not have a standard in-built function to calculate mode.

    --> So we create a user function to calculate mode of a data set in R.

    --> This function takes the vector as input and gives the mode value as output.

    # Create the function.

    # unique is used to Extract Unique Elements

    # which.max determines the location, i.e., the index of the (first) maximum of a numeric (or logical) vector.

    # tabulate takes the integer-valued vector bin and counts the number of times each integer occurs in it.

    # match returns a vector of the positions of (first) matches of its first argument in its second.

    mod <- function(v) {

    uniqv <- unique(v)

    uniqv[which.max(tabulate(match(v, uniqv)))]

    }


    # Create the vector with numbers.

    v <- c(2,7,5,3,7,6,1,7,2,5,7,9,7,6,0,7,5)

    # Calculate the mode using the user function.

    a <- mod(v)




    # Create the vector with characters.

    charv <- c("Analysis","DataHills","DataScience","DataHills")

    # Calculate the mode using the user function.

    a <- mod(charv)
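    Putting the pieces together as one runnable whole (the same logic as above):

    ```r
    # Mode: the most frequent value; works for numeric and character vectors
    mod <- function(v) {
      uniqv <- unique(v)
      uniqv[which.max(tabulate(match(v, uniqv)))]
    }

    v <- c(2, 7, 5, 3, 7, 6, 1, 7, 2, 5, 7, 9, 7, 6, 0, 7, 5)
    mod(v)      # 7 (occurs 6 times)

    charv <- c("Analysis", "DataHills", "DataScience", "DataHills")
    mod(charv)  # "DataHills"
    ```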




  • Statistical Observations Part 2  

    mean = median


    --> No skew

    --> Normal distribution

    --> Data is evenly distributed

    mean > median


    --> Right skewed

    --> Data is extended more on the right hand side

    --> Positive skewness

    mean < median


    --> Left skewed

    --> Data is extended more on the left hand side

    --> Negative skewness

    cars <-read.csv("C:/Users/Sreenu/Desktop/MLDataSets/usedcars.csv")








    mean(cars$price) #12961.93

    median(cars$price) #13591.5



    # For a <- 1:1000, the data is symmetric, so the mean and median are equal

    a <- 1:1000

    mean(a) # 500.5

    median(a) # 500.5


    median(cars$price) #13591

    mean(cars$price) #12961

    13591-3800 # 9791 (median - min: spread on the left side)

    21992-13591 # 8401 (max - median: spread on the right side)

    # to check the left skewed data in the graph

    boxplot(cars$price, horizontal=T)

    mean(cars$mileage) #44260

    median(cars$mileage) #36385

    range(cars$mileage) #4867 151479

    36385-4867 #31518

    151479-36385 #115094

    boxplot(cars$mileage, horizontal=T)

  • Statistical Analysis on Credit Data set  

    credit <- read.csv("C:/Users/Sreenu/Desktop/MLDataSets/credit.csv")




    range(credit$age) #19 75

    mean(credit$age) #35.546

    median(credit$age) #33


    # quantile

    0%(min)   25%(Q1)   50%(Q2, median)   75%(Q3)   100%(max)

    cars <- read.csv("C:/Users/Sreenu/Desktop/MLDataSets/usedcars.csv")

    quantile(cars$price) #it gives the values of 0%,25%,50%,75%,100%



    quantile(cars$price,seq(0.1,1,0.1)) #it gives 10%,20%,.......100%

    quantile(cars$price,seq(0.25,0.75,0.25)) #it gives 25%,50%,75%


    summary(cars$price) # it gives Min, 1st Qu., Median, Mean, 3rd Qu., Max


    hist(cars$price) #in this histogram we can observe left skewed data

    hist(cars$mileage) #in this histogram we can observe right skewed data





    --> IQR(x, na.rm = FALSE)

    --> Q3-Q1 i.e., middle 50% data

    --> Computes interquartile range of the x values.



    --> Variance (symbolized by S²)

    --> It measures how much the values vary from the mean.

    --> It is calculated as the average squared deviation of each number from the mean of a data set.

    --> For example,

    a <- c(1,2,3)

    mean(a) # 2

    var(a) # 1

    for the numbers 1, 2, and 3, the mean is 2 and R's var() returns 1, because var() computes the sample variance, dividing by n - 1:

    [(1 - 2)² + (2 - 2)² + (3 - 2)²] ÷ (3 - 1) = 1

    [sum of squared deviations from the mean] ÷ (number of observations - 1) = sample variance

    (Dividing by n instead gives the population variance, 0.667.)


    var(x, na.rm = FALSE)

    var(1:10) # 9.166667

    var(1:5, 1:5) # 2.5



    --> standard deviation (the square root of the variance, symbolized by S)

    --> This function computes the standard deviation of the values in x.

    --> If na.rm is TRUE then missing values are removed before computation proceeds.


    sd(x, na.rm = FALSE)


    sd(1:2) ^ 2

    # IQR (Interquartile Range):

    IQR(cars$price) # it gives middle 50%

    14904-10995 # manually Q3-Q1 value


    marks <- c(76,80,72,78)

    mean(marks) # 76.5


    sd(marks) # 3.415 : mean + sd is approximately the max value and mean - sd is approximately the min value

    var(marks) # 11.666

    marks <- c(76.5,76.5,76.5,76.5)

    mean(marks) # 76.5

    sd(marks) # 0

    var(marks) # 0

    sd(cars$price) # 3122.482

mean(cars$price) # 12961.93

    12961-3122 # 9839

    12961+3122 # 16083


# check the values from 9839 to 16083: 38 of the 150 values fall outside this range, i.e., 112 values lie within one sd of the mean

    mean(cars$mileage) # 44260.65

    sd(cars$mileage) # 26982.1

    44260-26982 # 17278

    44260+26982 # 71242


    cars_new <- subset(cars,mileage>=17278 & mileage<=71242)



    mean(cars_new$mileage) # 38356.13

    median(cars_new$mileage) # 36124

    mean(cars$mileage) # 44260.65

    median(cars$mileage) # 36385

    sd(cars_new$mileage) # 12265.27

    sd(cars$mileage) # 26982

hist(cars$mileage) # right-skewed distribution

hist(cars_new$mileage) # close to a normal distribution
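The manual steps above (compute the mean and sd, then subset to mean ± sd) can be wrapped in a small helper; trim_1sd is a hypothetical name invented for this sketch, not part of the course code:

```r
# Keep only the values within one standard deviation of the mean.
# trim_1sd is a hypothetical helper name, not from the course material.
trim_1sd <- function(x) {
  m <- mean(x, na.rm = TRUE)
  s <- sd(x, na.rm = TRUE)
  x[x >= m - s & x <= m + s]
}

v <- c(10, 12, 11, 13, 100)  # 100 is an obvious outlier
trim_1sd(v)                  # 10 12 11 13
```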

  • Data Visualization, Pie Charts, 3D Pie Charts & Bar Charts  

    Data Visualization / Plotting:


    1. High level plotting

    --> Generates a new plot

    2. Low level plotting

    --> Editing the existing plot

    --> R Programming language has many libraries to create charts and graphs.

    --> Data can be visualized in the form of

    Pie Charts

    Bar Charts



    Line Graphs


    Pie Charts:


    --> A pie-chart is a representation of values as slices of a circle with different colors.

--> The slices are labeled and the numbers corresponding to each slice are also represented in the chart.

    --> In R the pie chart is created using the pie() function which takes positive numbers as a vector input.

    --> The additional parameters are used to control labels, color, title etc.


    pie(x, labels, radius, main, col, clockwise)



    --> x is a vector containing the numeric values used in the pie chart.

    --> labels is used to give description to the slices.

--> radius indicates the radius of the circle of the pie chart (a value between -1 and +1).

    --> main indicates the title of the chart.

    --> col indicates the color palette.

    --> clockwise is a logical value indicating if the slices are drawn clockwise or anti clockwise.

    --> Simple pie-chart using the input vector and labels.

    --> It will create and save the pie chart in the current R working directory.

    # Create data for the graph.

    x <- c(21, 62, 10, 53)

    labels <- c("London", "New York", "Singapore", "Mumbai")

    # Give the chart file a name.

    # png - Graphics devices for BMP, JPEG, PNG and TIFF format bitmap files.

    png(file = "city.jpg")

# Plot the chart.
pie(x, labels)

# Save the file.
# dev.off() provides control over multiple graphics devices.
dev.off()

    Pie Chart Title and Colors:


    --> We can expand the features of the chart by adding more parameters to the function.

--> We will use the parameter main to add a title to the chart, and the parameter col to make use of the rainbow colour palette while drawing the chart.

--> The length of the palette should be the same as the number of values we have for the chart.

--> Hence we use length(x).

    # Give the chart file a name.

    png(file = "city_title_colours.jpg")

    # Plot the chart with title and rainbow color pallet.

    pie(x, labels, main = "City pie chart", col = rainbow(length(x)))

# Save the file.
dev.off()

    Slice Percentages and Chart Legend:


    --> We can add slice percentage and a chart legend by creating additional chart variables.

    piepercent<- round(100*x/sum(x), 1)

    # Give the chart file a name.

    png(file = "city_percentage_legends.jpg")

    # Plot the chart.

    pie(x, labels = piepercent, main = "City pie chart",col = rainbow(length(x)))

    # legend --> used to add legends to plots

    legend("topright", c("London","New York","Singapore","Mumbai"), cex = 1.0,

    fill = rainbow(length(x)))

# Save the file.
dev.off()

    3D Pie Chart:


    --> A pie chart with 3 dimensions can be drawn using additional packages. The package plotrix has a function called pie3D() that is used for this.

# Install & Load plotrix package.
install.packages("plotrix")
library(plotrix)

# Create the data and labels for the chart.
x <- c(21, 62, 10, 53)
lbl <- c("London", "New York", "Singapore", "Mumbai")

# Give the chart file a name.
png(file = "3d_pie_chart.jpg")

# Plot the chart.
pie3D(x,labels = lbl,explode = 0.1, main = "Pie Chart of Countries ")

# Save the file.
dev.off()

    Bar Charts:


    --> A bar chart represents data in rectangular bars with length of the bar proportional to the value of the variable.

    --> R uses the function barplot() to create bar charts.

--> R can draw both vertical and horizontal bars in the bar chart. In a bar chart each of the bars can be given different colors.


    barplot(H,xlab,ylab,main, names.arg,col)



    --> H is a vector or matrix containing numeric values used in bar chart.

    --> xlab is the label for x axis.

    --> ylab is the label for y axis.

    --> main is the title of the bar chart.

    --> names.arg is a vector of names appearing under each bar.

    --> col is used to give colors to the bars in the graph.


  • Box Plots  

    --> Creating a bar chart using the input vector and the name of each bar.

    # Create the data for the chart

    H <- c(5,10,30,3,40)

    # Give the chart file a name

    png(file = "barchart.png")

# Plot the bar chart
barplot(H)

# Save the file
dev.off()

    Bar Chart Labels, Title and Colors:


    --> The features of the bar chart can be expanded by adding more parameters.

    --> The main parameter is used to add title.

    --> The col parameter is used to add colors to the bars.

--> names.arg is a vector having the same number of values as the input vector to describe the meaning of each bar.

    # Create the data for the chart

    H <- c(5,10,30,3,40)

    M <- c("Jan","Feb","Mar","Apr","May")

    # Give the chart file a name

    png(file = "barchart_months_revenue.png")

    # Plot the bar chart

barplot(H,names.arg=M,xlab="Month",ylab="Revenue",col="blue",main="Revenue chart", border="red")

# Save the file
dev.off()

    Group Bar Chart and Stacked Bar Chart:


    --> We can create bar chart with groups of bars and stacks in each bar by using a matrix as input values.

    --> More than two variables are represented as a matrix which is used to create the group bar chart and stacked bar chart.

    # Create the input vectors.

    colors = c("red","blue","green")

    months <- c("Jan","Feb","Mar","Apr","May")

    regions <- c("East","West","North")

    # Create the matrix of the values.

    Values <- matrix(c(3,9,4,13,8,4,9,7,3,15,8,2,7,11,12), nrow = 3, ncol = 5, byrow = TRUE)

    # Give the chart file a name

    png(file = "barchart_stacked.png")

    # Create the bar chart

    barplot(Values, main = "total revenue", names.arg = months, xlab = "month", ylab = "revenue", col = colors)

    # Add the legend to the chart

    legend("topleft", regions, cex = 1.3, fill = colors)

# Save the file
dev.off()



--> Boxplots give a measure of how well the data in a data set is distributed.

--> A boxplot splits the data set into quartiles.

--> This graph represents the minimum, maximum, median, first quartile and third quartile of the data set.

    --> It is also useful in comparing the distribution of data across data sets by drawing boxplots for each of them.

    --> Boxplots are created in R by using the boxplot() function.


    boxplot(x, data, notch, varwidth, names, main)



    --> x is a vector or a formula.

    --> data is the data frame.

    --> notch is a logical value. Set as TRUE to draw a notch.

--> varwidth is a logical value. Set as TRUE to draw the width of the box proportionate to the sample size.

    --> names are the group labels which will be printed under each boxplot.

    --> main is used to give a title to the graph.

    --> Use the data set "mtcars" available in the R environment to create a basic boxplot.

# Let's look at the columns "mpg" and "cyl" in mtcars.
head(mtcars[, c("mpg", "cyl")])


    # Creating the Boxplot:

    --> The below script will create a boxplot graph for the relation between mpg (miles per gallon) and cyl (number of cylinders).

    # Give the chart file a name.

    png(file = "boxplot.png")

    # Plot the chart.

    boxplot(mpg ~ cyl, data = mtcars, xlab = "Number of Cylinders",

      ylab = "Miles Per Gallon", main = "Mileage Data")

# Save the file.
dev.off()

    Boxplot with Notch:


    --> We can draw boxplot with notch to find out how the medians of different data groups match with each other.

--> The below script will create a boxplot graph with notch for each of the data groups.

    # Give the chart file a name.

    png(file = "boxplot_with_notch.png")

    # Plot the chart.

    boxplot(mpg ~ cyl, data = mtcars,

      xlab = "Number of Cylinders",

      ylab = "Miles Per Gallon",

      main = "Mileage Data",

      notch = TRUE,

      varwidth = TRUE,

      col = c("green","yellow","purple"),

  names = c("High","Medium","Low")
)

# Save the file.
dev.off()

  • Histograms & Line Graphs  



    --> A histogram represents the frequencies of values of a variable bucketed into ranges.

--> A histogram is similar to a bar chart, but the difference is that it groups the values into continuous ranges.

--> The height of each bar in a histogram represents the number of values present in that range.

--> R creates a histogram using the hist() function.


hist(v, main, xlab, xlim, ylim, breaks, col, border)





    --> v is a vector containing numeric values used in histogram.

    --> main indicates title of the chart.

    --> col is used to set color of the bars.

    --> border is used to set border color of each bar.

    --> xlab is used to give description of x-axis.

    --> xlim is used to specify the range of values on the x-axis.

    --> ylim is used to specify the range of values on the y-axis.

    --> breaks is used to mention the width of each bar.

    --> Creating a histogram using input vector, label, col and border parameters.

    # Create data for the graph.

    v <- c(8,14,23,9,38,21,14,44,34,31,17)

    # Give the chart file a name.

    png(file = "histogram.png")

    # Create the histogram.

    hist(v,xlab = "Weight",col = "yellow",border = "blue")

# Save the file.
dev.off()

    Range of X and Y values:


    --> To specify the range of values allowed in X axis and Y axis, we can use the xlim and ylim parameters.

--> The width of each bar can be decided by using breaks.

    # Give the chart file a name.

    png(file = "histogram_lim_breaks.png")

    # Create the histogram.

    hist(v,xlab = "Weight",col = "green",border = "red", xlim = c(0,40), ylim = c(0,5),

      breaks = 5)

# Save the file.
dev.off()

    Line Graphs:


    --> A line chart is a graph that connects a series of points by drawing line segments between them.

--> These points are ordered by one of their coordinates (usually the x-coordinate).

    --> Line charts are usually used in identifying the trends in data.

--> The plot() function in R is used to create the line graph.


plot(v, type, col, xlab, ylab, main)





    --> v is a vector containing the numeric values.

    --> type takes the value "p" to draw only the points, "l" to draw only the lines and "o" to draw both points and lines.

    --> xlab is the label for x axis.

    --> ylab is the label for y axis.

    --> main is the Title of the chart.

    --> col is used to give colors to both the points and lines.

--> Creating a line chart using the input vector and the type parameter as "o".

    # Create the data for the chart.

    v <- c(5,10,30,3,40)

    # Give the chart file a name.

    png(file = "line_chart.jpg")

    # Plot the bar chart.

    plot(v,type = "o")

# Save the file.
dev.off()

    Line Chart Title, Color and Labels:


    --> The features of the line chart can be expanded by using additional parameters.

    --> We add color to the points and lines, give a title to the chart and add labels to the axes.

    # Give the chart file a name.

    png(file = "line_chart_label_colored.jpg")

    # Plot the bar chart.

    plot(v,type = "o", col = "red", xlab = "Month", ylab = "Rain fall",

      main = "Rain fall chart")

# Save the file.
dev.off()

    Multiple Lines in a Line Chart:


    --> More than one line can be drawn on the same chart by using the lines() function.

--> After the first line is plotted, the lines() function can use an additional vector as input to draw the second line in the chart.

    # Create the data for the chart.

    v <- c(5,10,30,3,40)

    t <- c(15,9,8,25,5)

    # Give the chart file a name.

    png(file = "line_chart_2lines.jpg")

    # Plot the bar chart.

    plot(v,type = "o",col = "red", xlab = "Month", ylab = "Rain fall",

      main = "Rain fall chart")

    lines(t, type = "o", col = "blue")

# Save the file.
dev.off()

  • Scatter Plots & Scatter plot Matrices  



    --> Scatterplots show many points plotted in the Cartesian plane.

    --> Each point represents the values of two variables.

    --> One variable is chosen in the horizontal axis and another in the vertical axis.

    --> Scatterplot is created using the plot() function.


    plot(x, y, main, xlab, ylab, xlim, ylim, axes)



    --> x is the data set whose values are the horizontal coordinates.

    --> y is the data set whose values are the vertical coordinates.

--> main is the title of the graph.

    --> xlab is the label in the horizontal axis.

    --> ylab is the label in the vertical axis.

    --> xlim is the limits of the values of x used for plotting.

    --> ylim is the limits of the values of y used for plotting.

    --> axes indicates whether both axes should be drawn on the plot.

    --> Using the data set "mtcars" available in the R environment to create a basic scatterplot.

# Let's use the columns "wt" and "mpg" in mtcars.
head(mtcars[, c("wt", "mpg")])


    Creating the Scatterplot

    The below script will create a scatterplot graph for the relation between wt(weight) and mpg(miles per gallon).

    # Give the chart file a name.

    png(file = "scatterplot.png")

    # Plot the chart for cars with weight between 2.5 to 5 and mileage between 15 and 30.

plot(x = mtcars$wt,y = mtcars$mpg,
  xlab = "Weight",
  ylab = "Mileage",
  xlim = c(2.5,5),
  ylim = c(15,30),
  main = "Weight vs Mileage"
)

# Save the file.
dev.off()

    Scatterplot Matrices:


    --> When we have more than two variables and we want to find the correlation between one variable versus the remaining ones we use scatterplot matrix.

    --> We use pairs() function to create matrices of scatterplots.


    pairs(formula, data)



    --> formula represents the series of variables used in pairs.

    --> data represents the data set from which the variables will be taken.

    --> Each variable is paired up with each of the remaining variable.

    --> A scatterplot is plotted for each pair.

    # Give the chart file a name.

    png(file = "scatterplot_matrices.png")

    # Plot the matrices between 4 variables giving 12 plots.

    # One variable with 3 others and total 4 variables.

    pairs(~wt+mpg+disp+cyl, data = mtcars, main = "Scatterplot Matrix")

# Save the file.
dev.off()

    Box Plots: view quantiles


    cars <- read.csv("C:/Users/Sreenu/Desktop/MLDataSets/usedcars.csv")



    boxplot(cars$price, outline=FALSE)

    boxplot(cars$price, outline=FALSE, col="blue")

    boxplot(cars$price, outline=FALSE, col="blue", border="red")

    boxplot(cars$price, col="blue", border = c("red","yellow","pink"))

    # it contains only one plot, so it takes only one border.

# IQR --> Q3-Q1. The whiskers extend from Q1 - 1.5*IQR up to Q3 + 1.5*IQR.

# Read about the 68-95-99.7 rule once and understand it.

    IQR(cars$price) #3909.5

    1.5*IQR(cars$price) # 5864.25

    10995-5864 # 5131 Q1-1.5*IQR

    14904+5864 # 20768 Q3+1.5*IQR


# check the outlier values: 2 below the lower fence and 2 above the upper fence.
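The fence arithmetic above generalizes to a small helper; iqr_fences is a hypothetical name used only for this sketch, not part of the course code:

```r
# Outlier fences: Q1 - 1.5*IQR (lower) and Q3 + 1.5*IQR (upper).
# iqr_fences is a hypothetical helper name, not from the course material.
iqr_fences <- function(x) {
  q <- unname(quantile(x, c(0.25, 0.75), na.rm = TRUE))
  spread <- 1.5 * IQR(x, na.rm = TRUE)
  c(lower = q[1] - spread, upper = q[2] + spread)
}

iqr_fences(1:100)  # values outside these fences count as outliers
```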

  • Low Level Plotting  

    cars <- read.csv("C:/Users/Sreenu/Desktop/MLDataSets/usedcars.csv")



    # Low level plotting:


boxplot(cars$price)
title(main="Cars Price")

# First generate the box plot, then apply low-level plotting functions like title(); otherwise it gives an error


    boxplot(price ~ model, data=cars) #it generates 3 boxplots for 3 models

    boxplot(price ~ model, data=cars, border=c("red","yellow","blue"))

    boxplot(price ~ color, data=cars, border=c("red","yellow","blue"))

    boxplot(price ~ color, data=cars, border=c("black","blue","gold","gray","green","black","yellow","black","yellow"))

    boxplot(price ~ transmission, data=cars)

    credit <- read.csv("C:/Users/Sreenu/Desktop/MLDataSets/credit.csv")


    boxplot(amount ~ purpose, data=credit)

    boxplot(amount ~ default, data=credit)

hist --> histograms are used to display frequencies










    plot --> scatter plots






#check pch values: they run from 0 to 25 and the default is 1 (open circle). practice all the pch symbols
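To see every symbol at once, the note above can be turned into a one-line chart (a sketch; the numeric labels are added with text() for readability):

```r
# Draw all 26 point characters (pch 0 to 25) with their numbers above them.
plot(0:25, rep(1, 26), pch = 0:25, cex = 2,
     ylim = c(0.5, 1.5), yaxt = "n",
     xlab = "pch value", ylab = "",
     main = "R plotting symbols: pch 0 to 25")
text(0:25, rep(1.25, 26), labels = 0:25, cex = 0.7)
```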



    a <- c(2,10,5,20,15,6,30)











    x <- c(1,2,3,4)

    y <- c(10,20,30,40)








pie(cars$model) # ERROR: 'x' values must be positive numeric; use pie(table(cars$model)) to plot the counts




  • Bar Plot & Density Plot  



--> Consider the following data preparation:

grades <- c("A","A+","B","B+","C")

Marks <- sample(grades,40,replace=T,prob=c(.2,.3,.25,.15,.1))




    --> sample takes a sample of the specified size from the elements of x using either with or without replacement.


    sample(x, size, replace = FALSE, prob = NULL)



    --> x: either a vector of one or more elements from which to choose, or a positive integer.

    --> size: a non-negative integer giving the number of items to choose.

    --> replace: should sampling be with replacement?

    --> prob: a vector of probability weights for obtaining the elements of the vector being sampled.
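A short sketch of sample() with a fixed seed, so the draw is reproducible (the grade labels follow the vector used later in this lecture):

```r
grades <- c("A", "A+", "B", "B+", "C")
set.seed(1)                              # fix the seed so the draw repeats
Marks <- sample(grades, 40, replace = TRUE,
                prob = c(.2, .3, .25, .15, .1))
length(Marks)   # 40
table(Marks)    # frequency of each grade actually drawn
```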

    # A bar chart of the Marks vector is obtained from

    barplot(table(Marks), main="Mid-Marks")

--> Notice that the barplot() function places the factor levels on the x-axis in the lexicographical order of the levels.

    --> Using the parameter names.arg, the bars in plot can be placed in the order as stated in the vector, grades.

# plot with the desired horizontal axis label order

    barplot(table(Marks), names.arg=grades, main="Mid-Marks")

    # Colored bars can be drawn using the col= parameter.

    barplot(table(Marks),names.arg=grades,col = c("lightblue", "lightcyan", "lavender", "mistyrose", "cornsilk"), main="Mid-Marks")

    # A bar chart with horizontal bars can be obtained as follows:

    barplot(table(Marks),names.arg=grades,horiz=TRUE,col = c("lightblue","lightcyan", "lavender", "mistyrose", "cornsilk"), main="Mid-Marks")

    # A bar chart with proportions on the y-axis can be obtained as follows:

    barplot(prop.table(table(Marks)),names.arg=grades,col = c("lightblue","lightcyan", "lavender", "mistyrose", "cornsilk"), main="Mid-Marks")

    # The sizes of the factor-level names on the x-axis can be increased using cex.names parameter.

    barplot(prop.table(table(Marks)),names.arg=grades,col = c("lightblue", "lightcyan", "lavender", "mistyrose", "cornsilk"),main="Mid-Marks",cex.names=2)

    --> The heights parameter of the barplot() could be a matrix.

--> For example, it could be a matrix where the columns are the various subjects taken in a course and the rows are the grade labels.

    # Consider the following matrix:

    gradTab <- matrix(c(13,10,4,8,5,10,7,2,19,2,7,2,14,12,5), ncol = 3, byrow = TRUE)

    rownames(gradTab) <- c("A","A+","B","B+","C")

    colnames(gradTab) <- c("DataScience", "DataAnalytics","MachineLearning")


    # To draw a stacked bar, simply use the command:

    barplot(gradTab,col = c("lightblue","lightcyan", "lavender", "mistyrose", "cornsilk"),legend.text = grades, main="Mid-Marks")

# To draw juxtaposed bars, use the beside parameter, as given under:

    barplot(gradTab,beside = T,col = c("lightblue","lightcyan", "lavender", "mistyrose", "cornsilk"),legend.text = grades, main="Mid-Marks")

    # A horizontal bar chart can be obtained using horiz=T parameter:

    barplot(gradTab,beside = T,horiz=T,col = c("lightblue","lightcyan", "lavender", "mistyrose", "cornsilk"),legend.text = grades, cex.names=.75, main="Mid-Marks")

    Density plot:


    --> A very useful and logical follow-up to histograms would be to plot the smoothed density function of a random variable.

    # A basic plot produced by the command

    plot(density(rnorm(100)),main="Normal density",xlab="x")

# We can overlay a histogram and a density curve with
x <- rnorm(100)
hist(x,prob=TRUE,main="Normal density + histogram")
lines(density(x))


  • Combining Plots  

    Combining Plots:


    --> It's often useful to combine multiple plot types in one graph (for example a Barplot next to a Scatterplot.)

    --> R makes this easy with the help of the functions par() and layout().



    --> par uses the arguments mfrow or mfcol to create a matrix of nrows and ncols c(nrows, ncols) which will serve as a grid for your plots.

# The following example shows how to combine four plots in one graph:
par(mfrow = c(2, 2)) # 2 rows x 2 columns grid for the plots below


    plot(cars, main="Speed vs. Distance")

    hist(cars$speed, main="Histogram of Speed")

    boxplot(cars$dist, main="Boxplot of Distance")

    boxplot(cars$speed, main="Boxplot of Speed")



    --> The layout() is more flexible and allows you to specify the location and the extent of each plot within the final combined graph.

    # This function expects a matrix object as an input:

    layout(matrix(c(1,1,2,3), 2,2, byrow=T))

    hist(cars$speed, main="Histogram of Speed")

    boxplot(cars$dist, main="Boxplot of Distance")

    boxplot(cars$speed, main="Boxplot of Speed")

  • Analysis with Scatter Plot, Box Plot, Histograms, Pie Charts & Basic Plot  

    Getting Started with R_Plots:




    # We have two vectors and we want to plot them.

    x_values <- rnorm(n = 20 , mean = 5 , sd = 8) #20 values generated from Normal(5,8)

    y_values <- rbeta(n = 20 , shape1 = 500 , shape2 = 10) #20 values generated from Beta(500,10)

    # If we want to make a plot which has the y_values in vertical axis and the x_values in horizontal axis, we can use the following commands:

    plot(x = x_values, y = y_values, type = "p") # standard scatter-plot

    plot(x = x_values, y = y_values, type = "l") # plot with lines

    plot(x = x_values, y = y_values, type = "n") # empty plot

    # We can type ?plot in the console to read about more options.



    # We have some variables and we want to examine their Distributions

    # boxplot is an easy way to see if we have some outliers in the data.


    y_values[c(19 , 20)] <- c(0.95 , 1.05) # replace the two last values with outliers


boxplot(y_values) # The points drawn outside the box are the outliers of y_values.



    # Easy way to draw histograms

    hist(x = x_values) # Histogram for x vector

    hist(x = x_values, breaks = 3) #use breaks to set the numbers of bars you want



    # If we want to visualize the frequencies of a variable just draw pie

    # First we have to generate data with frequencies, for example :

    P <- c(rep('A' , 3) , rep('B' , 10) , rep('C' , 7) )


    t <- table(P) # this is a frequency matrix of variable P


    pie(t) # And this is a visual version of the matrix above
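Building on the frequency table above, prop.table() converts the counts to percentages, which can then label the slices (a sketch with the same vector P):

```r
P <- c(rep('A', 3), rep('B', 10), rep('C', 7))
pct <- round(100 * prop.table(table(P)), 1)   # A 15, B 50, C 35
pie(table(P),
    labels = paste0(names(pct), " (", pct, "%)"),
    main = "Frequencies of P")
```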

    Basic Plot:


    --> A basic plot is created by calling plot().

    --> Here we use the built-in cars data frame that contains the speed of cars

    and the distances taken to stop in the 1920s.

    --> To find out more about the dataset, use help(cars).

    plot(x = cars$speed, y = cars$dist, pch = 1, col = 1,

    main = "Distance vs Speed of Cars",

    xlab = "Speed", ylab = "Distance")

    --> We can use many other variations in the code to get the same result.

    --> We can also change the parameters to obtain different results.



--> with() evaluates an R expression in an environment constructed from data, possibly modifying (a copy of) the original data.

    --> Syntax: with(data, expr, ...)

    with(cars, plot(dist~speed, pch = 2, col = 3,

    main = "Distance to stop vs Speed of Cars",

    xlab = "Speed", ylab = "Distance"))

    --> Additional features can be added to this plot by calling points(), text(), mtext(), lines(), grid(), etc.

    plot(dist~speed, pch = "*", col = "magenta", data=cars,

    main = "Distance to stop vs Speed of Cars",

    xlab = "Speed", ylab = "Distance")

    mtext("In the 1920s.")




    --> Histograms allow for a pseudo-plot of the underlying distribution of the data.


    hist(ldeaths, breaks = 20, freq = F, col = 3)

# ldeaths belongs to the UKLungDeaths data sets (in the built-in datasets package)

    # Monthly Deaths from Lung Diseases in the UK

    # Three time series giving the monthly deaths from bronchitis, emphysema and asthma in the UK, 1974–1979, both sexes (ldeaths), males (mdeaths) and females (fdeaths).


  • Mat Plot, ECDF & Box Plot with IRIS Data set  



    --> matplot is useful for quickly plotting multiple sets of observations from the same object, particularly from a matrix, on the same graph.

    --> Here is an example of a matrix containing four sets of random draws, each with a different mean.

    xmat <- cbind(rnorm(100, -3), rnorm(100, -1), rnorm(100, 1), rnorm(100, 3))


    --> One way to plot all of these observations on the same graph is to do one plot call followed by three more points or lines calls.

    plot(xmat[,1], type = 'l')

    lines(xmat[,2], col = 'red')

    lines(xmat[,3], col = 'green')

    lines(xmat[,4], col = 'blue')

    --> However, this is both tedious, and causes problems because, among other things, by default the axis limits are fixed by plot to fit only the first column.

    --> Much more convenient in this situation is to use the matplot function, which only requires one call and automatically takes care of axis limits and changing the aesthetics for each column to make them distinguishable.

    matplot(xmat, type = 'l')

    --> Note that, by default, matplot varies both color (col) and linetype (lty) because this increases the number of possible combinations before they get repeated.

    --> However, any (or both) of these aesthetics can be fixed to a single value...

    matplot(xmat, type = 'l', col = 'black')

    --> ...or a custom vector (which will recycle to the number of columns, following standard R vector recycling rules).

    matplot(xmat, type = 'l', col = c('red', 'green', 'blue', 'orange'))

--> Standard graphical parameters, including main, xlab and xlim, work exactly the same way as for plot.

    --> For more on those, see ?par.

    --> Like plot, if given only one object, matplot assumes it's the y variable and uses the indices for x.

    --> However, x and y can be specified explicitly.

    matplot(x = seq(0, 10, length.out = 100), y = xmat, type='l')

    # In fact, both x and y can be matrices.

    xes <- cbind(seq(0, 10, length.out = 100),

    seq(2.5, 12.5, length.out = 100),

    seq(5, 15, length.out = 100),

    seq(7.5, 17.5, length.out = 100))

    matplot(x = xes, y = xmat, type = 'l')

    Empirical Cumulative Distribution Function:


    --> A very useful and logical follow-up to histograms and density plots would be the Empirical Cumulative Distribution Function.

    --> We can use the function ecdf() for this purpose.

    # A basic plot produced by the command

    plot(ecdf(rnorm(100)),main="Cumulative distribution",xlab="x")

    Create a box-and-whisker plot with boxplot()


# This example uses the default boxplot() function and the iris data frame.


    --> The iris dataset has been used for classification in many research publications.

--> It consists of 50 samples from each of three classes of iris flowers.


    --> One class is linearly separable from the other two, while the latter are not linearly separable from each other.

    --> There are five attributes in the dataset:

    sepal length in cm,

    sepal width in cm,

    petal length in cm,

    petal width in cm, and

    class: Iris Setosa, Iris Versicolour, and Iris Virginica.

--> Detailed description of the dataset and research publications citing it can be found at the UCI Machine Learning Repository.

--> Below we have a look at the structure of the dataset with str().

--> Note that all variable names, package names and function names in R are case sensitive.

str(iris)


    --> From the output, we can see that there are 150 observations (records, or rows) and 5 variables (or columns) in the dataset.

    --> The first four variables are numeric.

--> The last one, Species, is categorical (called a "factor" in R) and has three levels of values.

    # Simple boxplot (Sepal.Length)

    # Create a box-and-whisker graph of a numerical variable

boxplot(iris[,1],xlab="Sepal.Length",ylab="Length (in centimeters)",

main="Summary Characteristics of Sepal.Length (Iris Data)")

    # Boxplot of sepal length grouped by species

    # Create a boxplot of a numerical variable grouped by a categorical variable

    boxplot(Sepal.Length~Species,data = iris)


  • Additional Box Plot Style Parameters  

    # Bring order

    # To change order of the box in the plot you have to change the order of the categorical variable's levels.

    # For example if we want to have the order virginica - versicolor - setosa

    newSpeciesOrder <- factor(iris$Species, levels=c("virginica","versicolor","setosa"))

    boxplot(Sepal.Length~newSpeciesOrder, data = iris)

    # Change groups names

# If you want to specify a better name for your groups you can use the names parameter.

# It takes a vector with one name per level of the categorical variable

    boxplot(Sepal.Length~newSpeciesOrder,data = iris,names=c("name1","name2","name3"))

    Small improvements:


    # Color

    # col : add a vector of the size of the levels of categorical variable

    boxplot(Sepal.Length~Species,data = iris,col=c("green","yellow","orange"))

    # Proximity of the box

# boxwex: scale the width of the boxes (smaller values leave more space between them).

    boxplot(Sepal.Length~Species,data = iris,boxwex = 0.1)

    boxplot(Sepal.Length~Species,data = iris,boxwex = 1)

See the summaries on which the boxplots are based with plot=FALSE:


--> To see the summaries, set the parameter plot to FALSE.

    boxplot(Sepal.Length~newSpeciesOrder,data = iris,plot=FALSE)

    $stats #summary of the numerical variable for the 3 groups

    $n #number of observations in each groups

$conf #extreme values of the notches

$out #extreme values (outliers)

$group #group to which each extreme value belongs

    $names #groups names
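The list returned when plot=FALSE can also be captured and inspected directly; a short sketch with the iris data used above:

```r
b <- boxplot(Sepal.Length ~ Species, data = iris, plot = FALSE)
b$names   # "setosa" "versicolor" "virginica"
b$n       # 50 50 50: observations per group
b$stats   # 5-row matrix: lower whisker, Q1, median, Q3, upper whisker
```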

    Additional boxplot style parameters:




    boxlty - box line type

    boxlwd - box line width

    boxcol - box line color

    boxfill - box fill colors



    medlty - median line type ("blank" for no line)

medlwd - median line width

    medcol - median line color

    medpch - median point (NA for no symbol)

    medcex - median point size

    medbg - median point background color



    whisklty - whisker line type

    whisklwd - whisker line width

    whiskcol - whisker line color



    staplelty - staple line type

    staplelwd - staple line width

    staplecol - staple line color



    outlty - outlier line type ("blank" for no line)

    outlwd - outlier line width

    outcol - outlier line color

    outpch - outlier point type (NA for no symbol)

    outcex - outlier point size

    outbg - outlier point background color



    --> Default and heavily modified plots side by side


    # Default

    boxplot(Sepal.Length ~ Species, data=iris)

    # Modified

    boxplot(Sepal.Length ~ Species, data=iris,

    boxlty=2, boxlwd=3, boxfill="cornflowerblue", boxcol="darkblue",

    medlty=2, medlwd=2, medcol="red", medpch=21, medcex=1, medbg="white",

    whisklty=2, whisklwd=3, whiskcol="darkblue",

    staplelty=2, staplelwd=2, staplecol="red",

outlty=3, outlwd=3, outcol="grey", outpch=NA
)
    Displaying multiple plots:


    --> Display multiple plots in one image with the different facet functions.

    --> An advantage of this method is that all axes share the same scale across charts, making it easy to compare them at a glance.

    --> We'll use the mpg dataset included in ggplot2.

    # Wrap charts line by line (attempts to create a square layout):

    ggplot(mpg, aes(x = displ, y = hwy)) +

    geom_point() +

    facet_wrap(~class) # faceting variable assumed to be class


    # Display multiple charts on one row, multiple columns:

    ggplot(mpg, aes(x = displ, y = hwy)) +

    geom_point() +

    facet_wrap(~class, nrow = 1) # faceting variable assumed; all panels on one row


    # Display multiple charts on one column, multiple rows:

    ggplot(mpg, aes(x = displ, y = hwy)) +

    geom_point() +

    facet_wrap(~class, ncol = 1) # faceting variable assumed; all panels in one column


    # Display multiple charts in a grid by 2 variables:

    ggplot(mpg, aes(x = displ, y = hwy)) +

    geom_point() +

    facet_grid(trans~class) #"row" parameter, then "column" parameter

  • Set.Seed Function & Preparing Data for Plotting  



    --> Random Number Generation

    --> runif will not generate either of the extreme values unless max = min or max-min is small compared to min, and in particular not for the default arguments.

    # Now put the data into random order; observe the example and try it

    a = c(1,2,3,4,5)

    print(a) # 1,2,3,4,5

    runif(5) # it gives random values

    runif(5) # again it gives random values

    runif(5) # again it gives random values

    sort(runif(5)) # sorts the generated values in ascending order

    sort(runif(5)) # again, a different set of sorted values

    order(runif(5)) # gives a random permutation of positions # 1 3 4 5 2

    order(runif(5)) # another random permutation of positions # 4 1 2 3 5

    a = c(10,20,30,40,50)

    a[order(runif(5))] # 20 10 40 50 30

    a[order(runif(5))] # 50 10 30 20 40

    # to reproduce the same random order, we have the seeding technique: set.seed(1)




    set.seed(1)

    runif(5) # generates a fixed, reproducible set of values

    runif(5) # generates different values (the random stream moves on)

    set.seed(1)

    runif(5) # repeats the values generated right after the first set.seed(1)





    # now use this technique on our credit.csv data

    credit <- read.csv("C:/Users/Sreenu/Desktop/MLDataSets/credit.csv")


    credit_n <- credit[order(runif(1000)),]


    head(credit_n) # check the data before and after randomizing



    summary(credit_n$amount) # the data is the same; we only randomized the row order

    Prepare your data for plotting:


    --> ggplot2 works best with a long data frame.

    --> The following sample data represents the prices of sweets on 20 different days, in a format described as wide because each category has its own column.


    sweetsWide <- data.frame(date = 1:20,

    chocolate = runif(20, min = 2, max = 4),

    iceCream = runif(20, min = 0.5, max = 1),

    candy = runif(20, min = 1, max = 3))


    --> To convert sweetsWide to long format for use with ggplot2, several useful functions from base R, and the packages reshape2, data.table and tidyr (in chronological order) can be used:

    # reshape from base R

    sweetsLong <- reshape(sweetsWide, idvar = 'date', direction = 'long',

    varying = list(2:4), new.row.names = NULL, times = names(sweetsWide)[-1])



    # melt from 'reshape2'

    library(reshape2)

    sweetsLong <- melt(sweetsWide, id.vars = 'date')



    # melt from 'data.table'

    # which is an optimized & extended version of 'melt' from 'reshape2'

    library(data.table)

    sweetsLong <- melt(setDT(sweetsWide), id.vars = 'date')



    # gather from 'tidyr'

    library(tidyr)

    sweetsLong <- gather(sweetsWide, sweet, price, chocolate:candy)



    --> See also Reshaping data between long and wide forms for details on converting data between long and wide format.

    --> The resulting sweetsLong has one column of prices and one column describing the type of sweet.

    --> Now plotting is much simpler:


    ggplot(sweetsLong, aes(x = date, y = price, colour = sweet)) + geom_line()

    --> Add horizontal and vertical lines to plot

    --> Add one common horizontal line for all categorical variables

    # sample data

    df <- data.frame(x = c('A', 'B'), y = c(3, 4))

    p1 <- ggplot(df, aes(x=x, y=y)) +

    geom_bar(position = "dodge", stat = 'identity') + theme_bw()

    p1 + geom_hline(aes(yintercept=5), colour="#990000", linetype="dashed")

    --> Add one horizontal line for each categorical variable

    # sample data

    df <- data.frame(x = c('A', 'B'), y = c(3, 4))

    # add horizontal levels for drawing lines

    df$hval <- df$y + 2

    p1 <- ggplot(df, aes(x=x, y=y)) +

    geom_bar(position = "dodge", stat = 'identity') + theme_bw()

    p1 + geom_errorbar(aes(y=hval, ymax=hval, ymin=hval), colour="#990000", width=0.75)

    --> Add horizontal line over grouped bars

    # sample data

    df <- data.frame(x = rep(c('A', 'B'), times=2),

    group = rep(c('G1', 'G2'), each=2),

    y = c(3, 4, 5, 6),

    hval = c(5, 6, 7, 8))

    p1 <- ggplot(df, aes(x=x, y=y, fill=group)) +

    geom_bar(position="dodge", stat="identity")

    p1 + geom_errorbar(aes(y=hval, ymax=hval, ymin=hval),


    position = "dodge",

    linetype = "dashed")

    --> Add vertical line

    # sample data

    df <- data.frame(group=rep(c('A', 'B'), each=20),

    x = rnorm(40, 5, 2),

    y = rnorm(40, 10, 2))

    p1 <- ggplot(df, aes(x=x, y=y, colour=group)) + geom_point()

    p1 + geom_vline(aes(xintercept=5), color="#990000", linetype="dashed")


  • QPlot, ViolinPlot, Statistical Methods & Correlation Analysis  

    Scatter Plots:


    --> We plot a simple scatter plot using the built-in iris data set as follows:

    ggplot(iris, aes(x = Petal.Width, y = Petal.Length, color = Species)) +

    geom_point()


    Produce basic plots with qplot:


    --> qplot is intended to be similar to the base R plot() function, trying to always plot your data without requiring too many specifications.

    # basic qplot

    qplot(x = disp, y = mpg, data = mtcars)

    # adding colors

    qplot(x = disp, y = mpg, colour = cyl,data = mtcars)

    # adding a smoother

    qplot(x = disp, y = mpg, geom = c("point", "smooth"), data = mtcars)

    Vertical and Horizontal Bar Chart:


    ?diamonds # Prices of 50,000 round cut diamonds

    ggplot(data = diamonds, aes(x = cut, fill =color)) +

    geom_bar(stat = "count", position = "dodge")

    --> It is possible to obtain a horizontal bar chart by simply adding the coord_flip() layer to the ggplot object:

    ggplot(data = diamonds, aes(x = cut, fill =color)) +

    geom_bar(stat = "count", position = "dodge") +

    coord_flip()


    Violin plot:


    --> Violin plots are kernel density estimates mirrored in the vertical plane.

    --> They can be used to visualize several distributions side-by-side, with the mirroring helping to highlight any differences.

    ggplot(diamonds, aes(cut, price)) +

    geom_violin()


    --> Violin plots are named for their resemblance to the musical instrument; this is particularly visible when they are coupled with an overlaid boxplot.

    --> This visualisation then describes the underlying distributions both in terms of

    Tukey's 5 number summary (as boxplots) and full continuous density estimates (violins).

    ggplot(diamonds, aes(cut, price)) +

    geom_violin() +

    geom_boxplot(width = .1, fill = "black", outlier.shape = NA) +

    stat_summary(fun.y = "median", geom = "point", col = "white")

    Statistical Methods:


    --> When analyzing data, it is possible to have a statistical approach.

    --> The basic tools that are needed to perform basic analysis are:

    Correlation analysis

    Chi-squared Test


    Analysis of Variance

    Analysis of Covariance

    Hypothesis Testing

    Time Series Analysis

    Survival Analysis

    --> With large datasets this is rarely a problem, as these methods are not computationally intensive, with the exception of Correlation Analysis.

    --> In this case, it is always possible to take a sample and the results should be robust.

    Correlation Analysis:


    --> Correlation Analysis seeks to find linear relationships between numeric variables.

    --> This can be of use in different circumstances.

    --> First of all, the correlation metric used in the mentioned example is based on the Pearson coefficient.

    --> There is, however, another interesting metric of correlation that is not affected by outliers.

    --> This metric is called the Spearman correlation.

    --> The Spearman correlation is more robust to the presence of outliers than the Pearson method and gives better estimates of linear relations between numeric variables when the data is not normally distributed.
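    As a small illustrative sketch (not from the course material, data made up here), injecting one extreme outlier shifts the Pearson estimate while leaving the rank-based Spearman estimate almost unchanged:

    ```r
    set.seed(1)
    x <- 1:20
    y <- 2 * x + rnorm(20)   # strong linear relation
    y[20] <- 200             # inject one extreme outlier
    cor(x, y, method = "pearson")   # noticeably pulled by the outlier
    cor(x, y, method = "spearman")  # rank-based, barely affected
    ```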

    --> Correlation is a statistical tool which studies the relationship between two variables.

    --> The coefficient of correlation gives the degree (amount) of correlation between the two variables.

    Formula: r = Σ(dx·dy) / √(Σdx² × Σdy²)


    1. Denote one series by X and the other series by Y

    2. Calculate the means x' and y'

    3. Calculate dx and dy [i.e. the deviations]

    dx = x - x'   dy = y - y'

    4. Square these deviations, i.e. dx² and dy²

    5. Multiply the respective dx and dy

    6. Apply the formula to calculate r.

    Calculate the coefficient of correlation between X and Y for the following data:

    X: 10,6,9,10,12,13,11,9

    Y: 9,4,6,9,11,13,8,4

    Solution: build a table with columns X, dx (x - x'), dx², Y, dy (y - y'), dy², dx·dy
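    The same coefficient can be checked in R; cor() defaults to the Pearson method, which matches the manual calculation (Σdx·dy = 43, Σdx² = 32, Σdy² = 72):

    ```r
    x <- c(10, 6, 9, 10, 12, 13, 11, 9)
    y <- c(9, 4, 6, 9, 11, 13, 8, 4)
    # sum(dx*dy) = 43, sum(dx^2) = 32, sum(dy^2) = 72
    cor(x, y)  # 43 / sqrt(32 * 72) = 43/48 ≈ 0.896
    ```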


    # Select variables that are interesting to compare pearson and spearman correlation methods.

    x = diamonds[, c('x', 'y', 'z', 'price')]

    # From the histograms we can expect differences in the correlations of both metrics.

    # In this case as the variables are clearly not normally distributed, the spearman correlation

    # is a better estimate of the linear relation among numeric variables.

    par(mfrow = c(2,2))

    colnm = names(x)

    for(i in 1:4) {

    hist(x[[i]], col = 'deepskyblue3', main = sprintf('Histogram of %s', colnm[i]))

    }


    --> From the histograms in the following figure, we can expect differences in the correlations of both metrics.

    --> In this case, as the variables are clearly not normally distributed, the spearman correlation is a better estimate of the linear relation among numeric variables.

    # Correlation Matrix - Pearson and spearman

    cor_pearson <- cor(x, method = 'pearson')

    cor_spearman <- cor(x, method = 'spearman')

    # Pearson Correlation

    cor_pearson

    # Spearman Correlation

    cor_spearman


  • Chi-Squared Test, T-Test, ANOVA, ANCOVA, Time Series Analysis & Survival Analysis  
  • Data Exploration and Visualization  

    Data Exploration and Visualization:


    Have a Look at iris Data:






    iris[1:5, ]



    # draw a sample of 5 rows

    a <- sample(1:nrow(iris), 5)


    iris[a, ]

    iris[1:10, "Sepal.Length"]

    iris[1:10, 1]


    Explore Individual Variables:




    quantile(iris$Sepal.Length, c(0.1, 0.3, 0.65))








    Explore Multiple Variables:


    # Calculate covariance and correlation between variables with cov() and cor().

    cov(iris$Sepal.Length, iris$Petal.Length)


    cor(iris$Sepal.Length, iris$Petal.Length)


    # Compute the stats of Sepal.Length of every Species with aggregate()

    aggregate(Sepal.Length ~ Species, summary, data=iris)

    boxplot(Sepal.Length ~ Species, data=iris, xlab="Species", ylab="Sepal.Length")

    with(iris, plot(Sepal.Length, Sepal.Width, col=Species, pch=as.numeric(Species)))

    ## same function as above

    # plot(iris$Sepal.Length, iris$Sepal.Width, col=iris$Species, pch=as.numeric(iris$Species))

    # When there are many points, some of them may overlap. We can use jitter() to add a small amount of noise to the data before plotting.

    plot(jitter(iris$Sepal.Length), jitter(iris$Sepal.Width))

    # A smooth scatter plot can be plotted with function smoothScatter(), which produces a smoothed color density representation of the scatterplot, obtained through a kernel density estimate.

    smoothScatter(iris$Sepal.Length, iris$Sepal.Width)

    # A matrix of scatter plots can be produced with pairs()

    pairs(iris[, 1:4])


    More Explorations:


    --> A 3D scatter plot can be produced with package scatterplot3d

    library(scatterplot3d)

    scatterplot3d(iris$Petal.Width, iris$Sepal.Length, iris$Sepal.Width)

    --> Package rgl supports interactive 3D scatter plots with plot3d().

    library(rgl)

    plot3d(iris$Petal.Width, iris$Sepal.Length, iris$Sepal.Width)

    --> A heat map presents a 2D display of a data matrix, which can be generated with heatmap() in R.

    --> With the code below, we calculate the similarity between different flowers in the iris data with dist() and then plot it with a heat map.

    distMatrix <- as.matrix(dist(iris[,1:4]))

    heatmap(distMatrix)


    --> A level plot can be produced with function levelplot() in package lattice.

    --> Function grey.colors() creates a vector of gamma-corrected gray colors.

    --> A similar function is rainbow(), which creates a vector of contiguous colors.


    library(lattice)

    levelplot(Petal.Width~Sepal.Length*Sepal.Width, iris, cuts=9, col.regions=grey.colors(10)[10:1])

    --> Contour plots can be plotted with contour() and filled.contour() in package graphics, and with contourplot() in package lattice.

    ?volcano # Understand the volcano dataset.

    filled.contour(volcano, color=terrain.colors, asp=1, plot.axes=contour(volcano, add=T))

    --> Another way to illustrate a numeric matrix is a 3D surface plot shown as below, which is generated with function persp().

    persp(volcano, theta=25, phi=30, expand=0.5, col="lightblue")

    --> Parallel coordinates provide nice visualization of multiple dimensional data.

    --> A parallel coordinates plot can be produced with parcoord() in package MASS, and with parallelplot() in package lattice.


    library(MASS)

    parcoord(iris[1:4], col=iris$Species)


    library(lattice)

    parallelplot(~iris[1:4] | Species, data=iris)


    qplot(Sepal.Length, Sepal.Width, data=iris, facets=Species ~.)

    Save Charts into Files:


    --> Save charts into PDF and PS files respectively with functions pdf() and postscript().

    --> Picture files of BMP, JPEG, PNG and TIFF formats can be generated respectively with bmp(), jpeg(), png() and tiff().

    --> Note that the files (or graphics devices) need to be closed with dev.off() after plotting.

    # save as a PDF file (the file names here are illustrative)

    pdf("log_plot.pdf")

    x <- 1:50

    plot(x, log(x))

    dev.off()

    # Save as a postscript file

    postscript("square_plot.ps")

    x <- -20:20

    plot(x, x^2)

    dev.off()

  • Machine Learning, Types of ML with Algorithms  

    Machine Learning:


    --> It is similar to human learning

    --> Machine learning is the subfield of computer science that, according to Arthur Samuel, gives "computers the ability to learn without being explicitly programmed."

    --> Samuel, an American pioneer in the field of computer gaming and artificial intelligence, coined the term "Machine Learning" in 1959 while at IBM.

    --> Machine learning is a field of computer science that uses statistical techniques to give computer systems the ability to "learn" (e.g., progressively improve performance on a specific task) with data, without being explicitly programmed.

    Traditional Programming vs Machine Learning:


    --> In traditional programming, if we give inputs + a program to the computer, the computer gives the output.

    --> In machine learning, if we give inputs + outputs to the computer, the computer gives the program (a predictive model).

    Example 1: Here "a" and "b" are inputs and "c" is output


    a b c

    -- -- --

    1 2 3

    2 3 5

    3 4 7

    4 5 9

    9 10 ?

    What is the output of c?

    Example 2: Here "x" is input and "y" is output


    x y

    -- --

    1 10

    2 20

    3 30

    4 40

    5 ?

    500 ?

    y ~ x :   y=10x

    Example 3: Here "x" is input and "y" is output


    x y

    -- --

    1 14

    2 18

    3 22

    4 26

    5 ?

    750 ?

    here we can observe a linear relation

    y ~ x :   y = mx + c   here m is the slope and c is the constant (intercept)
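    The slope and intercept for Example 3 can be recovered with lm(); since the points lie exactly on a line, the fit gives m = 4 and c = 10:

    ```r
    x <- c(1, 2, 3, 4)
    y <- c(14, 18, 22, 26)
    model <- lm(y ~ x)
    coef(model)                                # intercept c = 10, slope m = 4
    predict(model, data.frame(x = c(5, 750)))  # 30 and 3010
    ```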


    Machine Learning Engineer:


    1. Convert the business data into a statistical model

    2. Make the machine develop (train) the model

    3. Evaluate the performance of the model

    Actual vs Predicted (% accuracy, % error)

    4. Techniques to improve the performance.

    (Classification, Regression, Clustering)

    Types of Machine Learning:


    --> There are 3 types of Machine Learning Algorithms.

    1. Supervised Learning

    2. Unsupervised Learning

    3. Reinforcement Learning

    --> Supervised and unsupervised learning are the ones most used by machine learning engineers

    --> Reinforcement learning is really powerful but complex to apply to problems.

    Supervised Learning:


    --> As we know, machine learning takes data as input and output (training data)

    --> The training data includes both inputs and labels (targets or outputs)

    --> For example, for the addition of two numbers a=5, b=6, result=11: the inputs are 5 and 6 and the target is 11.

    --> We first train the model with lots of training data (inputs & targets); then, with new data and the logic learned before, we predict the output

    --> This process is called Supervised Learning, which is really fast and accurate.

    --> This algorithm consists of a target / outcome variable (or dependent variable) which is to be predicted from a given set of predictors (independent variables).

    --> Using this set of variables, we generate a function that maps inputs to desired outputs.

    --> The training process continues until the model achieves a desired level of accuracy on the training data.

    Types of Supervised Learning:


    1. Classification

    2. Regression



    Classification:

    --> This is a type of problem where we predict a categorical response value, where the data can be separated into specific “classes” (ex: we predict one value from a set of values).

    Some examples are :

    --> this mail is spam or not?

    --> will it rain today or not?

    --> is this picture a cat or not?

    Basically, ‘Yes/No’ type questions are called binary classification.

    Other examples are :

    --> mail is spam or important or promotion?

    --> is this picture a cat or a dog or a tiger?

    This type is called multi-class classification.



    Regression:

    --> This is a type of problem where we need to predict a continuous response value (ex: above, we predict a number which can vary from -infinity to +infinity)

    Some examples are:

    --> what is the price of house in a specific city?

    --> what is the value of the stock?

    --> how many total runs can be on board in a cricket game?

    Supervised Learning Algorithms:


    Decision Trees


    Naive Bayes Classification

    Support vector machines for classification problems

    Random forest for classification and regression problems

    Linear regression for regression problems

    Ordinary Least Squares Regression

    Logistic Regression

    Unsupervised Learning:


    --> The training data does not include targets here, so we don’t tell the system where to go; the system has to understand it from the data we give.

    --> Here the training data is not structured (it contains noisy data, unknown data, etc.)

    --> In this algorithm, we do not have any target or outcome variable to predict / estimate.

    --> It is used for clustering a population into different groups, which is widely used for segmenting customers into different groups for specific interventions.

    --> Unsupervised learning is a bit difficult to implement and is not used as widely as supervised learning.

    Types of Unsupervised Learning:


    1. Clustering

    2. Pattern Detection (Association Rule)



    Clustering:

    --> This is a type of problem where we group similar things together.

    --> It is somewhat similar to multi-class classification, but here we don’t provide the labels; the system understands from the data itself and clusters the data.

    Some examples are :

    --> given news articles, cluster into different types of news

    --> given a set of tweets, cluster based on content of tweet

    --> given a set of images, cluster them into different objects

    Association Rule:


    --> Association rules are if-then statements that help to show the probability of relationships between data items within large data sets in various types of databases.

    --> An association rule has two parts: an antecedent (if) and a consequent (then).

    --> An antecedent is an item found within the data.

    --> A consequent is an item found in combination with the antecedent.

    Unsupervised Learning Algorithms:


    K-means for clustering problems

    Apriori algorithm for association rule learning problems

    Principal Component Analysis

    Singular Value Decomposition

    Independent Component Analysis
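    As a minimal sketch of the first algorithm on this list, k-means (built into base R's stats package) can cluster the iris measurements into 3 groups; the cluster numbers are arbitrary, so only the grouping itself is meaningful:

    ```r
    set.seed(42)
    km <- kmeans(iris[, 1:4], centers = 3)
    table(km$cluster, iris$Species)  # compare found clusters to the true species
    ```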

    Reinforcement Learning:


    --> Using this algorithm, the machine is trained to make specific decisions.

    --> It works this way: the machine is exposed to an environment where it trains itself continually using trial and error.

    --> This machine learns from past experience and tries to capture the best possible knowledge to make accurate business decisions.

    --> Example of Reinforcement Learning: Markov Decision Process


  • How Machine Solve Real Time Problems  

    Machine Learning:


    --> Solving real-time problems:

    Understand the data

    Draw insights from the data (75%)

    Test the performance (25%)

    The insights are applied on new data to get the prediction

    Ex 1:

    Jio --> 3 months free of cost





    10% of people moved from Airtel --> Jio

    40% are waiting to move after the 3 months of free Jio service

    1 Crore -- 10% means 10 Lakh * Rs 200 = 20 Crore loss

    for 1 year, 20*12 = 240 Crores loss

    Airtel did the analysis on that 10% based on their data, like IMEI no., location, internet data, recharge plan, etc., and changed their recharge plans using dynamic pricing (as in airline and bus tickets)

    Ex 2:

    Marketing Team --> 1000 customers' information







    age>30,Salary<50K,credit<4, loans<2,responds=yes

    yes = called, then no response

    no = don't know whether they would respond or not

    Data sets divide into 2 parts --> Train + Test

    Data sets --> Train(75%) + Test(25%) #thumb rule

    Split the data into 4 parts of 25% each and test each combination of 3 parts against the 4th part, in all 4 ways (cross-validation).

    First test on known data, then test on new data.

    Target Attribute --> Categorical (Classification)

    Target Attribute --> Numerical (Regression)

    (Identifying the relation between the attributes)
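    The 75%/25% train-test split described above can be sketched as follows, using iris as a stand-in dataset:

    ```r
    set.seed(1)
    n <- nrow(iris)                           # iris used as a stand-in dataset
    idx <- sample(n, size = round(0.75 * n))  # random 75% of the row indices
    train <- iris[idx, ]                      # 75% for training
    test  <- iris[-idx, ]                     # remaining 25% for testing
    nrow(train)
    nrow(test)
    ```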

  • Nearest Neighbor(KNN) Classification  

    K-Nearest Neighbour (KNN) Classification:


    --> K-Nearest Neighbors is one of the most basic classification algorithms in Machine Learning.

    --> It belongs to the supervised learning domain and finds intense application in pattern recognition, data mining and intrusion detection.

    --> Mostly used for Life Sciences

    Ex: BP, cholesterol, blood sugar, ... heart attack

    It calculates "Distance between the Data Points"

    K is a number; initially its value is 1 (since even values can produce 50-50 ties between distance points, it is better to take odd values, e.g. 11 in place of 10)



    d = sqrt((x1-y1)² + (x2-y2)² + ... + (xn-yn)²)

    i.e. d = sqrt(Σ from i=1 to n of (xi-yi)²)

    Mathematically this distance is called the "EUCLIDEAN DISTANCE".
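    The Euclidean distance between two points can be computed directly in R, either by hand or with the built-in dist():

    ```r
    x <- c(1, 2, 3)
    y <- c(4, 6, 3)
    sqrt(sum((x - y)^2))   # sqrt(9 + 16 + 0) = 5
    # the same via the built-in dist()
    dist(rbind(x, y))      # 5
    ```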

    KNN depends on --> Distance

    NB depends on --> Probability

    DT depends on --> Information gain & entropy

    # Steps to follow in machine learning

    1. Collecting the data

    2. Data Preparation

    3. Train the model

    4. Evaluate the performance (Actual vs Predicted)

    5. Improve the performance


    Train Data, Train Labels

    Test Data, Test Labels

    Machine --> Train Data, Train Labels --> Data Model --> Test Data --> Labels for the test data

    Distance between test data & train data

    predicted vs actual

    # Simple Example for KNN Classification












    # copy this data and load it in R using the clipboard

    marks <- read.delim("clipboard", sep=",", stringsAsFactors=F)

    print(marks) #check the data

    str(marks) #change the result datatype as factor

    marks$result <- factor(marks$result)

    train_data <- marks[1:7,-4]

    test_data <- marks[8:10,-4]



    train_labels <- marks[1:7,4]

    test_labels <- marks[8:10,4]






    # install & attach the "class" package to work on KNN algorithm




    predicted_labels <- class::knn(train_data,test_data,train_labels,k=1)

    predicted_labels #fail fail fail

    test_labels #fail pass fail

    predicted_labels <- class::knn(train_data,test_data,train_labels,k=3)

    predicted_labels #fail pass fail

    test_labels #fail pass fail

    predicted_labels <- class::knn(train_data,test_data,train_labels,k=5)

    predicted_labels #fail pass fail

    test_labels #fail pass fail

    predicted_labels <- class::knn(train_data,test_data,train_labels,k=7)

    predicted_labels #pass pass pass

    test_labels #fail pass fail

    # for k=3 & k=5 it is predicting correctly

  • KNN Classification with Cancer Data set Part 1  

    # We have a data set related to cancer that contains Malignant (harmful, spreads across the body) and Benign (not harmful) cases. -- First understand the data in the data set.

    # Collecting the data

    cancer <- read.csv("C:/Users/Sreenu/Desktop/MLDataSets/cancer.csv",stringsAsFactors=FALSE)

    dim(cancer) #569 32

    # Data Preparation

    names(cancer) #out of 32 columns, the "id" column is not required for analysis

    cancer <- cancer[-1] #drop the 1st column; 31 columns remain

    str(cancer) #only 1 string column is available; convert it into a factor data type

    dim(cancer) #569 31


    cancer$diagnosis <- factor(cancer$diagnosis,levels=c("B","M"),labels=c("Benign","Malignant"))

    str(cancer) # now the 1st column is changed to the factor data type

    table(cancer$diagnosis) # Benign is 357, Malignant is 212


    prop.table(table(cancer$diagnosis))*100 # 62.8 37.2

    summary(cancer[-1]) # here the data ranges are different; make the range common using normalization, i.e., 0 to 1.

    Normalization rescales any data so that it falls within a common range.

    NORMALIZE --> (x-min(x)) / (max(x)-min(x))


    e.g. for values between 5 and 31: (10-5)/(31-5), (20-5)/(31-5), (5-5)/(31-5), ..., (31-5)/(31-5) (here the min maps to 0 and the max maps to 1)

    0 to 1 --> Normalization

    normalize <- function(x) {

    return((x-min(x)) / (max(x)-min(x)))

    }


    cancer_n <-lapply(cancer[-1],normalize)

    class(cancer_n) # it is list of values but we want a data frame

    cancer_n <- data.frame(lapply(cancer[-1],normalize))

    class(cancer_n) # now it is data frame values

    summary(cancer_n) # now min to max all the values are in common range i.e., 0 to 1 only

    # Now split the data set(100%) i.e., 569 observations into Train(75%) i.e., first 427 observations and Test(25%) i.e., last 142 observations.

    train_data <- cancer_n[1:427,]

    test_data <- cancer_n[428:569,]

    train_labels <- cancer[1:427,1]

    test_labels <- cancer[428:569,1]

    dim(train_data) #427 30

    dim(test_data) #142 30

    length(train_labels) #427

    length(test_labels) #142

    # Train the Model

    library(class) # knn() is in the "class" package

    predict_labels <- knn(train_data,test_data,train_labels,k=1)

    predict_labels[1:10] #predicted vs actual is matched


    predict_labels[11:20] #predicted vs actual is matched


    predict_labels[21:30] #predicted vs actual is not matched



  • KNN Classification with Cancer Data set Part 2  

    # Evaluate the performance of the model



    # for comparing all labels we use CrossTable() from the "gmodels" package

    library(gmodels)

    CrossTable(x = test_labels, y = predict_labels, prop.chisq = FALSE)
    # Improve the performance of the model

    # Some of the labels are not matching, let us change the k value from 1 to 3

    predict_labels <- knn(train_data,test_data,train_labels,k=3)


    # Change the K values from 3 to 5 and evaluate

    predict_labels <- knn(train_data,test_data,train_labels,k=5)


    predict_labels <- knn(train_data,test_data,train_labels,k=7)


    # here some benign data is not correct - not a problem

    predict_labels <- knn(train_data,test_data,train_labels,k=9)


    # here some malignant data is not correct - problem

    # now do the prediction with first 25% test data and last 75% train data

    train_data <- cancer_n[143:569,]

    test_data <- cancer_n[1:142,]

    train_labels <- cancer[143:569,1]

    test_labels <- cancer[1:142,1]

    predict_labels <- knn(train_data,test_data,train_labels,k=7) # here the class proportions are not preserved in the split, so it gives more errors

    prop.table(table(cancer$diagnosis)) #here B is 63 & M is 37

    prop.table(table(train_labels)) #here B is 70 & M is 30

    prop.table(table(test_labels)) #here B is 42 & M is 58


  • Naive Bayes Classification  

    Naive Bayes is based on Bayes' theorem


    --> Naive Bayes classifier is a simple classifier that has its foundation on the well known Bayes’s theorem.

    --> The Naive Bayes algorithm, in particular, is a logic-based technique which is simple yet so powerful that it is often known to outperform complex algorithms for very large datasets.

    Probability - chances of occurring

    Probability = no. of times the event occurred / total no. of chances

    10 days --> 6 days rained, 4 days not rained (in independent events)

    p(rain=yes) = 6/10 = 60%

    p(rain=no) = 4/10 = 40%

    10 days --> 6 days rained, 4 days not rained (in dependent events)

    p(rain=yes) = 6/10 = 60% - depends like temp,cloud,...

    p(rain=no) = 4/10 = 40% - depends like temp,cloud,...

    Joint Probability:


    It comes in 2 types:

    1. Independent Events

    P(A and B) = P(A) * P(B)

    2. Dependent Events

               P(B|A) * P(A)
    P(A|B) = -----------------
                   P(B)

                      likelihood * prior prob
    posterior prob = ---------------------------
                     evidence (or marginal prob)

    # once goto simple sms_spam1.csv file and understand the data

    # Calculate this by using independent events

    p(spam) -- 0.4

    p(ham) -- 0.6

    p(spam and sms) = p(spam) * p(sms)

        4/10 * 3/10 = 0.12

    p(ham and sms) = p(ham) * p(sms)

        6/10 * 3/10 = 0.18  # this is not the correct way to calculate it (the events are not independent), so we have to follow Bayes' theorem (dependent events)

    # Calculate this by using dependent events(bayesian theorem)

    p(spam) -- 0.4

    p(ham) -- 0.6

                   p(sms|spam) * p(spam)
    p(spam|sms) = -----------------------
                          p(sms)

       1/4 * 4/10
    = ------------ = 1/3 = 33.33%
          3/10

                  p(sms|ham) * p(ham)
    p(ham|sms) = ---------------------
                        p(sms)

       2/6 * 6/10
    = ------------ = 2/3 = 66.66%
          3/10
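    The same worked example as plain arithmetic in R, with the evidence p(sms) = 3/10:

    ```r
    p_spam <- 4/10
    p_ham  <- 6/10
    p_sms_given_spam <- 1/4
    p_sms_given_ham  <- 2/6
    # evidence: total probability of seeing the word "sms"
    p_sms <- p_sms_given_spam * p_spam + p_sms_given_ham * p_ham   # 3/10
    p_sms_given_spam * p_spam / p_sms   # posterior p(spam|sms) = 1/3
    p_sms_given_ham  * p_ham  / p_sms   # posterior p(ham|sms)  = 2/3
    ```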


  • Naive Bayes Classification with SMS Spam Data set & Text Mining  

    # Once go to the sms_spam.csv file and understand the data

    # Collecting the data

    sms_data <- read.csv("C:/Users/Sreenu/Desktop/MLDataSets/sms_spam.csv", stringsAsFactors=FALSE)


    # Data Preparation

    sms_data$type = factor(sms_data$type)


    # install tm package for text mining on our data



    library(help="tm") # once go through the functions in tm

    # 1. Create the Corpus for collection of document

    # 2. Clean the data by common case letter format, removing numbers, stopwords, punctuation, whitespaces

    sms_corpus <- Corpus(VectorSource(sms_data$text))

    inspect(sms_corpus) #inspects all 5574 documents

    inspect(sms_corpus[1:3]) #inspects the first 3 documents

    # Here the data is in different cases (upper & lower); let's convert it to a single lowercase form using tm_map()

    sms_clean <- tm_map(sms_corpus,tolower)



    # Remove numbers from the data

    sms_clean <- tm_map(sms_clean,removeNumbers)



    # Remove stopwords from data, stopwords() contains 174 words


    sms_clean <- tm_map(sms_clean,removeWords,stopwords())



    # Remove Punctuation from data

    sms_clean <- tm_map(sms_clean,removePunctuation)



    # Remove Whitespace from data

    sms_clean <- tm_map(sms_clean,stripWhitespace)



    # Collect the spam messages and clean the data

    sms_spam <- subset(sms_data,type=="spam")

    spam_corpus <- Corpus(VectorSource(sms_spam$text))

    spam_clean <- tm_map(spam_corpus, tolower)

    spam_clean <- tm_map(spam_clean, removeNumbers)

    spam_clean <- tm_map(spam_clean, removeWords, stopwords())

    spam_clean <- tm_map(spam_clean, removePunctuation)

    spam_clean <- tm_map(spam_clean, stripWhitespace)



    # Collect the ham messages and clean the data

    sms_ham <- subset(sms_data,type=="ham")

    ham_corpus <- Corpus(VectorSource(sms_ham$text))

    ham_clean <- tm_map(ham_corpus, tolower)

    ham_clean <- tm_map(ham_clean, removeNumbers)

    ham_clean <- tm_map(ham_clean, removeWords, stopwords())

    ham_clean <- tm_map(ham_clean, removePunctuation)

    ham_clean <- tm_map(ham_clean, stripWhitespace)




  • WordCloud & Document Term Matrix  
  • Train & Evaluate a Model using Naive Bayes  

    # Collecting the data

    sms_data <- read.csv("C:/Users/Sreenu/Desktop/MLDataSets/sms_spam.csv", stringsAsFactors=FALSE)

    # The type element is currently a character vector.

    # Convert it into a factor.

    sms_data$type <- factor(sms_data$type)

    # Display a description of each variable
    str(sms_data)





    # Data preparation - cleaning and standardizing text data

    # The tm package can be installed via the install.packages("tm") and

    # loaded with the library(tm) command.


    # A corpus is a collection of text documents

    # To create a corpus, VCorpus() from the tm package is used

    # VectorSource() is a reader function that creates a source object from the existing sms_data$text vector

    sms_corpus <- VCorpus(VectorSource(sms_data$text))


    # View a summary of the first and second SMS messages in the corpus
    inspect(sms_corpus[1:2])

    # The as.character() function is used to view the actual message text
    as.character(sms_corpus[[1]])

    # The lapply() function is used to apply a procedure to each element of an R data structure.

    lapply(sms_corpus[1:2], as.character)

    # Text transformation 

    # The tm_map() function provides a method to apply a transformation

    # to a tm corpus.

    # Save the result of each transformation in a new object called sms_cleaned_corpus

    # Convert text into lowercase. Here used following functions:

    # content_transformer(); tm wrapper function

    # tolower(); lowercase transformation function

    sms_cleaned_corpus <- tm_map(sms_corpus, content_transformer(tolower))

    # Check the difference between sms_corpus and sms_cleaned_corpus



    # Remove numbers from SMS messages

    sms_cleaned_corpus <- tm_map(sms_cleaned_corpus, removeNumbers)

    # Remove filler words using stopwords() and removeWords() functions

    sms_cleaned_corpus <- tm_map(sms_cleaned_corpus, removeWords, stopwords())

    # Remove punctuation characters

    sms_cleaned_corpus <- tm_map(sms_cleaned_corpus, removePunctuation)

    # Reducing words to their root form using stemming. The tm package provides

    # stemming functionality via integration with the SnowballC package.

    # The SnowballC package can be installed via the install.packages("SnowballC") and

    # loaded with the library(SnowballC) command.
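    As a quick illustration of what stemming does (a sketch, assuming SnowballC is installed), wordStem() from SnowballC reduces inflected word forms to a common root:

    ```r
    # Stemming collapses word variants onto one root form
    library(SnowballC)
    wordStem(c("learns", "learned", "learning"))   # all become "learn"
    ```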


    # Apply stemming

    sms_cleaned_corpus <- tm_map(sms_cleaned_corpus, stemDocument)

    # Remove additional whitespace

    sms_cleaned_corpus <- tm_map(sms_cleaned_corpus, stripWhitespace)

    # Data preparation - splitting text documents into words(Tokenization)

    # Create a data structure called a Document Term Matrix(DTM)

    sms_dtm <- DocumentTermMatrix(sms_cleaned_corpus)


    # Divide the data into a training set and a test set with ratio 75:25

    # The SMS messages are sorted in a random order.

    sms_dtm_train <- sms_dtm[1:4181, ]

    sms_dtm_test <- sms_dtm[4182:5574, ]

    # Create labels that are not stored in the DTM

    sms_train_lables <- sms_data[1:4181, ]$type

    sms_test_lables <- sms_data[4182:5574, ]$type

    # Compare the proportion of spam in the training and test data
    prop.table(table(sms_train_lables))
    prop.table(table(sms_test_lables))



    # Visualizing text data using word clouds

    # The wordcloud package can be installed via the install.packages("wordcloud") and

    # loaded with the library(wordcloud) command.


    # Create wordcloud from a tm corpus object
    library(wordcloud) # also loads RColorBrewer, which provides brewer.pal()

    pal <- brewer.pal(8, "Dark2")

    wordcloud(sms_cleaned_corpus, min.freq=40, random.order = FALSE, colors=pal)

    # Create wordcloud for spam and ham data subsets

    spam <- subset(sms_data, type == "spam")

    wordcloud(spam$text, max.words = 40, scale = c(4, 0.8), colors=pal)

    ham <- subset(sms_data, type == "ham")

    wordcloud(ham$text, max.words = 40, scale = c(4, 0.8), colors=pal)

    # Data preparation - Creating indicator features for frequent words

    sms_frequent_words <- findFreqTerms(sms_dtm_train, 5)


    sms_dtm_freq_train<- sms_dtm_train[ , sms_frequent_words]

    sms_dtm_freq_test <- sms_dtm_test[ , sms_frequent_words]

    # print the most frequent words in each class.

    sms_corpus_ham <- VCorpus(VectorSource(ham$text))

    sms_corpus_spam <- VCorpus(VectorSource(spam$text))

    sms_dtm_ham <- DocumentTermMatrix(sms_corpus_ham, control = list(tolower = TRUE,removeNumbers = TRUE,stopwords = TRUE,removePunctuation = TRUE,stemming = TRUE))

    sms_dtm_spam <- DocumentTermMatrix(sms_corpus_spam, control = list(tolower = TRUE,removeNumbers = TRUE,stopwords = TRUE,removePunctuation = TRUE,stemming = TRUE))

    sms_dtm_ham_frequent_words <- findFreqTerms(sms_dtm_ham, lowfreq= 0, highfreq = Inf)



    sms_dtm_spam_frequent_words <- findFreqTerms(sms_dtm_spam, lowfreq= 0, highfreq = Inf)



    # The following defines a convert_counts() function to convert counts to

    # Yes / No strings:

    convert_counts <- function(x) {
      x <- ifelse(x > 0, "Yes", "No")
    }


    # Apply above function to train and test data sets.

    sms_train <- apply(sms_dtm_freq_train, MARGIN = 2,convert_counts)

    sms_test <- apply(sms_dtm_freq_test, MARGIN = 2,convert_counts)

    # Training a model using Naive Bayes
    library(e1071) # provides naiveBayes()


    sms_classifier <- naiveBayes(sms_train, sms_train_lables)

    # Evaluating model

    sms_test_pred <- predict(sms_classifier, sms_test)


    library(gmodels) # provides CrossTable()
    CrossTable(sms_test_pred, sms_test_lables, prop.chisq = FALSE, prop.t = FALSE, dnn = c('predicted', 'actual'))

    # Accuracy : measures of performance
    library(caret) # provides confusionMatrix()

    confusionMatrix(sms_test_pred, sms_test_lables, positive = "spam")

    # Improving model performance

    # Adding Laplace estimator

    new_sms_classifier <- naiveBayes(sms_train, sms_train_lables, laplace = 1)

    new_sms_classifier_pred <- predict(new_sms_classifier, sms_test)

    # Compare the predicted classes to the actual classifications using cross table

    CrossTable(new_sms_classifier_pred, sms_test_lables, prop.chisq = FALSE, prop.t = FALSE, prop.r = FALSE, dnn = c('Predicted', 'Actual'))


  • MarkDown using Knitr Package  
  • Decision Trees  



    --> Decision tree is a type of supervised learning algorithm (having a pre-defined target variable) that is mostly used in classification problems.

    --> It works for both categorical and continuous input and output variables.

    --> In this technique, we split the population or sample into two or more homogeneous sets (or sub-populations) based on the most significant splitter/differentiator among the input variables.

    Types of Decision Trees:


    The type of decision tree depends on the type of target variable. There are two types:

    --> Categorical Variable Decision Tree: a decision tree with a categorical target variable.

    --> Continuous Variable Decision Tree: a decision tree with a continuous target variable.

    Ex: deciding whether to join a job

    salary: high, medium, low

    working hrs: high, medium, low

    distance: long, medium, short

    if salary=high --> join=yes

    if salary=medium or low
       working hrs=medium or low
       distance=medium or short --> join=yes

    if salary=low
       distance=short --> join=yes
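    A fitted decision tree is essentially nested if/else logic. The toy rules above can be sketched as follows (join_job is a hypothetical helper written for illustration, not part of any package):

    ```r
    # Hypothetical encoding of the "join the job?" rules as nested if/else
    join_job <- function(salary, working_hrs, distance) {
      if (salary == "high") {
        "yes"                                # high salary alone decides
      } else if (working_hrs %in% c("medium", "low") &&
                 distance %in% c("medium", "short")) {
        "yes"                                # acceptable hours and commute
      } else {
        "no"
      }
    }

    join_job("high", "high", "long")     # "yes"
    join_job("medium", "low", "short")   # "yes"
    join_job("low", "high", "long")      # "no"
    ```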

    --> "ENTROPY" is used in decision trees

    --> Generally, entropy refers to disorder or uncertainty, and the definition of entropy used in information theory is directly analogous to the definition used in statistical thermodynamics: it measures how unpredictable a set of values is, and if all values are equally probable, the entropy (in bits) is maximal.

    --> The entropy formula is H = -sum(p_i * log2(p_i)), summed over the class probabilities p_i

    --> In 1975 Ross Quinlan developed the algorithm ID3 (Iterative Dichotomiser 3), which evolved into C4.5 and later C5.0
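    The entropy formula H = -sum(p_i * log2(p_i)) referenced above can be sketched in R:

    ```r
    # Entropy (in bits) of a discrete distribution: H = -sum(p * log2(p))
    entropy <- function(p) {
      p <- p[p > 0]          # treat 0 * log2(0) as 0
      -sum(p * log2(p))
    }

    entropy(c(0.5, 0.5))   # 1 bit: maximum uncertainty for two classes
    entropy(c(1))          # 0 bits: no uncertainty at all
    ```

    A 50/50 class split is the hardest to predict (1 bit); a pure node has entropy 0, which is why splits that reduce entropy are preferred.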

    # Collecting the data

    credit <- read.csv("C:/Users/Sreenu/Desktop/MLDataSets/credit.csv", stringsAsFactors=TRUE)



    # The credit data set comes from the UCI Machine Learning Repository. Review each column, especially the 17th column, default (1, 2)

    # In the default column, 1 means not a defaulter (paying regularly) and 2 means a defaulter (not paying regularly)






    # Preparing the Data

    default <- subset(credit,default==2)

    nondefault <- subset(credit,default==1)











    # Explore what percentage of defaulters and non-defaulters fall under each level of checking_balance, purpose, employment_length, ...

    # The age column is numerical data, so the default status cannot be read off from it in the same way.

    # Some columns do not help to identify defaulters; identify those columns

  • Decision Trees with Credit Data set Part 1  

    # Here we have to install the package "C50"



    # Collect the credit.csv data set again.

    credit <- read.csv("C:/Users/Sreenu/Desktop/MLDataSets/credit.csv", stringsAsFactors=TRUE)


    credit$default <- factor(credit$default, levels=c(1,2), labels=c("NO","YES"))


    # Now put the data into a random order, seeding the generator so the shuffle is reproducible

    set.seed(2)
    credit_n <- credit[order(runif(1000)),]


    head(credit) # compare the data before and after randomizing


    summary(credit_n$amount) # the data is the same; we only randomized the order


    train_data <- credit_n[1:750,-17]

    test_data <- credit_n[751:1000,-17] # a 75/25% split is not strictly required; we can use 100% of the data for training and observe that later

    train_lables <- credit_n[1:750,17]

    test_labels <- credit_n[751:1000,17]



    # here the class proportions are exactly 70% and 30%; change the value in set.seed(2) and observe again


    credit_n <- credit[order(runif(1000)),]



    # here also the class proportions are approximately 70% and 30%.

    # Let it make it as a 90% training data and observe

    train_data <- credit_n[1:900,-17]

    test_data <- credit_n[901:1000,-17]

    train_lables <- credit_n[1:900,17]

    test_labels <- credit_n[901:1000,17]



    # here also the class proportions are approximately 70% and 30%; training on 100% of the data is also fine

    # Train the model


    credit_classifier <- C50::C5.0(credit_n[,-17],credit_n[,17])

    # Here all columns are given for training; later we will examine the columns one by one


    # Evaluate the performance of the model

    # Now observe the tree
    summary(credit_classifier)

    # The system builds the decision tree classifier as a chain of if/else statements

    # Observe the attribute usage and check the error percentage

  • Decision Trees with Credit Data set Part 2  

    # Improve the performance of the model

    # Now i am taking only 3 columns

    credit_classifier <- C50::C5.0(credit_n[,c(1,2,3)],credit_n[,17])


    # Now size of the tree decreases and check the error %, and add the required columns and check the error % (it will decrease).

    credit_classifier <- C50::C5.0(credit_n[,c(1,2,3,4,5)],credit_n[,17])


    # Search the R help for C5.0 and check the parameter - trials

    # trials - the number of boosting iterations.

    credit_classifier <- C50::C5.0(credit_n[,c(1,2,3,4,5)],credit_n[,17],trials=5)


    credit_classifier <- C50::C5.0(credit_n[,c(1,2,3,4,5)],credit_n[,17],trials=10)


    credit_classifier <- C50::C5.0(credit_n[,-17],credit_n[,17],trials=10)


    # Check the examples of C5.0 in help like plotting.......

    plot(credit_classifier) # this tree is very big; try with fewer columns

    credit_classifier <- C50::C5.0(credit_n[,c(1,2,3)],credit_n[,17],trials=10)


    test_data <- credit_n[1:250,-17]

    predict_labels <- predict(credit_classifier,test_data)




    # Check the number of defaulters and the proportions

    test_data <- credit_n[251:500,-17]


    test_data <- credit_n[501:750,-17]


  • Support Vector Machine, Neural Networks & Random Forest  
  • Regression & Linear Regression  



    --> This is a type of problem where we need to predict a continuous response value (ex: predict a number that can vary from -infinity to +infinity)

    Some examples are:

    --> what is the price of a house in a specific city?

    --> what is the value of a stock?

    --> how many total runs can be on the board in a cricket game?

    Algorithms in Regression:


    1. Linear Regression

    2. Logistic Regression

    Linear Regression:


    --> Regression analysis is a very widely used statistical tool to establish a relationship model between two variables.

    --> One of these variables is called the predictor variable, whose value is gathered through experiments.

    --> The other variable is called the response variable, whose value is derived from the predictor variable.

    --> In Linear Regression these two variables are related through an equation, where exponent(power) of both these variables is 1. Mathematically a linear relationship represents a straight line when plotted as a graph.

    --> A non-linear relationship where the exponent of any variable is not equal to 1 creates a curve.

    --> The general mathematical equation for a linear regression is:

    y = ax + b

    y is the response variable.

    x is the predictor variable.

    a and b are constants which are called the coefficients.



    --> The lm() function creates the relationship model between the predictor and the response variable.

    The basic syntax is:

    lm(formula, data)

    --> formula is a symbol presenting the relation between x and y.

    --> data is the vector on which the formula will be applied.

    Example 1:


    x <- c(1,2,3,4,5)

    y <- c(14,18,22,26,30)

    # Apply the lm() function.

    relation <- lm(y~x)



    # Predict the y value for x=500

    a <- data.frame(x=500)

    result <- predict(relation,a)


    # Predict the y values for x values

    x <- data.frame(x)

    result <- predict(relation,x)


    # Evaluate the performance of the model
    summary(relation)


    # Visualize the Regression Graphically



  • Multiple Regression  

    Example 2:


    x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)

    y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)

    # Apply the lm() function.

    relation <- lm(y~x)



    # Predict the weight of a new person with height 170.

    a <- data.frame(x = 170)

    result <- predict(relation,a)


    # Predict the weights of existing persons

    x <- data.frame(x)

    result <- predict(relation,x)


    # Evaluate the performance of the model
    summary(relation)


    # Visualize the Regression Graphically

    # Give the chart file a name.

    png(file = "linearregression.png")

    # Plot the chart.

    plot(y, x, col = "blue", main = "Height & Weight Regression",
         cex = 1.3, pch = 16, xlab = "Weight in Kg", ylab = "Height in cm")

    abline(lm(x~y))

    # Save the file.
    dev.off()

    Multiple Regression:


    --> Multiple regression is an extension of linear regression into relationship between more than two variables.

    --> In a simple linear relation we have one predictor and one response variable, but in multiple regression we have more than one predictor variable and one response variable.


    --> The general mathematical equation for multiple regression is:

    y = a + b1x1 + b2x2 +...bnxn

    y is the response variable.

    a, b1, b2, ..., bn are the coefficients.

    x1, x2, ...xn are the predictor variables.

    --> We create the regression model using the lm() function in R.

    --> The model determines the value of the coefficients using the input data.

    --> Next we can predict the value of the response variable for a given set of predictor variables using these coefficients.


    lm(y ~ x1+x2+x3...,data)

    Example 1:


    input <- mtcars[,c("mpg","disp","hp","wt")]


    # Create the relationship model.

    model <- lm(mpg~disp+hp+wt, data = input)

    # Show the model.
    print(model)


    # Get the Intercept and coefficients as vector elements.

    cat("# # # # The Coefficient Values # # # ","\n")

    a <- coef(model)[1]


    Xdisp <- coef(model)[2]

    Xhp <- coef(model)[3]

    Xwt <- coef(model)[4]
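    To see that the extracted coefficients really implement y = a + b1x1 + b2x2 + b3x3, a manual prediction can be compared with predict(). This is a sketch; the disp, hp and wt values below are made up for illustration:

    ```r
    # Refit the model from above, then predict mpg for a hypothetical car
    input <- mtcars[, c("mpg", "disp", "hp", "wt")]
    model <- lm(mpg ~ disp + hp + wt, data = input)
    a     <- coef(model)[1]
    Xdisp <- coef(model)[2]
    Xhp   <- coef(model)[3]
    Xwt   <- coef(model)[4]

    # plug the coefficients into y = a + b1*x1 + b2*x2 + b3*x3
    mpg_manual <- a + Xdisp*221 + Xhp*102 + Xwt*2.91

    # predict() computes the same value from the fitted model
    mpg_pred <- predict(model, data.frame(disp = 221, hp = 102, wt = 2.91))
    unname(mpg_manual) - unname(mpg_pred)   # effectively zero
    ```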




    Example 2:


    --> Australian CPI (Consumer Price Index) data, which are quarterly CPIs from 2008 to 2010.

    --> In this example, an x-axis is added manually with function axis(), where las=3 makes text vertical.

    year <- rep(2008:2010, each=4)

    quarter <- rep(1:4, 3)

    cpi <- c(162.2, 164.6, 166.5, 166.0, 166.2, 167.0, 168.6, 169.5, 171.0, 172.1, 173.3, 174.0)

    plot(cpi, xaxt="n", ylab="CPI", xlab="")

    # draw x-axis

    axis(1, labels=paste(year,quarter,sep="Q"), at=1:12, las=3)

    # Check the correlation between CPI and the other variables, year and quarter.
    cor(year, cpi)
    cor(quarter, cpi)



    # Built a linear regression model with lm(), using year and quarter as predictors and CPI as response.

    fit <- lm(cpi ~ year + quarter)


    # With the above linear model, CPI is calculated as
    # cpi = c0 + c1*year + c2*quarter

    # where c0, c1 and c2 are coefficients from the fitted model. Therefore, the CPIs in 2011 can be obtained as follows.

    # An easier way for this is using function predict(), which will be demonstrated at the end of this subsection.

    cpi2011 <- fit$coefficients[[1]] + fit$coefficients[[2]]*2011 + fit$coefficients[[3]]*(1:4)


    # differences between observed values and fitted values
    residuals(fit)



    # Plot the fitted model


    # We can also plot the model in a 3D plot as below, where function scatterplot3d() creates a 3D scatter plot and plane3d() draws the fitted plane.

    # Parameter lab specifies the number of tickmarks on the x- and y-axes.


    library(scatterplot3d)
    s3d <- scatterplot3d(year, quarter, cpi, highlight.3d=T, type="h", lab=c(2,3))


    # With the model, the CPIs in year 2011 can be predicted as follows, and the predicted values are shown as red triangles

    data2011 <- data.frame(year=2011, quarter=1:4)

    cpi2011 <- predict(fit, newdata=data2011)

    style <- c(rep(1,12), rep(2,4))

    plot(c(cpi, cpi2011), xaxt="n", ylab="CPI", xlab="", pch=style, col=style)

    axis(1, at=1:16, las=3, labels=c(paste(year,quarter,sep="Q"), "2011Q1", "2011Q2", "2011Q3", "2011Q4"))

  • Generalized Linear Regression, Non Linear Regression & Logistic Regression  

    Generalized Linear Regression:


    --> The generalized linear model (GLM) generalizes linear regression by allowing the linear model to be related to the response variable via a link function and allowing the magnitude of the variance of each measurement to be a function of its predicted value.

    --> It unifies various other statistical models, including linear regression, logistic regression and Poisson regression.

    --> Function glm() is used to fit generalized linear models, specified by giving a symbolic description of the linear predictor and a description of the error distribution.
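    Before the bodyfat example below, here is a minimal self-contained glm() call; the toy count data is adapted from the ?glm help page:

    ```r
    # Poisson regression on toy count data with glm()
    counts    <- c(18, 17, 15, 20, 10, 20, 25, 13, 12)
    outcome   <- gl(3, 1, 9)   # factor cycling through 3 levels
    treatment <- gl(3, 3)      # factor in blocks of 3
    fit <- glm(counts ~ outcome + treatment, family = poisson())
    coef(fit)   # coefficients are on the log (link) scale
    ```

    Swapping the family argument (gaussian, binomial, poisson, ...) is what makes glm() unify the models listed above.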



    # A generalized linear model is built below with glm() on the bodyfat data

    data("bodyfat", package="")

    myFormula <- DEXfat ~ age + waistcirc + hipcirc + elbowbreadth + kneebreadth

    bodyfat.glm <- glm(myFormula, family = gaussian("log"), data = bodyfat)


    pred <- predict(bodyfat.glm, type="response")

    # type indicates the type of prediction required. The default is on the scale of
    # the linear predictors; the alternative "response" is on the scale of the response variable.

    plot(bodyfat$DEXfat, pred, xlab="Observed Values", ylab="Predicted Values")

    abline(a=0, b=1)

    # if family=gaussian("identity") is used, the built model would be similar
    # to linear regression. One can also make it a logistic regression by setting family to binomial("logit").

    Non-linear Regression:


    --> While linear regression is to find the line that comes closest to data, non-linear regression is to fit a curve through data.

    --> Function nls() provides nonlinear regression. Examples of nls() can be found by running "?nls" in R.
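    A minimal nls() sketch, fitting a curve through noisy toy data (the model y = a*exp(b*x) and the start values are assumptions chosen for illustration):

    ```r
    # Fit the curve y = a * exp(b * x) to toy data with nls()
    set.seed(1)
    x <- 1:10
    y <- 2 * exp(0.3 * x) + rnorm(10, sd = 0.1)   # true a = 2, b = 0.3
    fit <- nls(y ~ a * exp(b * x), start = list(a = 1, b = 0.1))
    coef(fit)   # estimates close to a = 2, b = 0.3
    ```

    Unlike lm(), nls() needs explicit starting values because it searches for the coefficients iteratively.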

    Logistic Regression:


    --> Logistic regression is a classification model in which the response variable is categorical.

    --> It is an algorithm that comes from statistics and is used for supervised classification problems.

    --> The Logistic Regression is a regression model in which the response variable (dependent variable) has categorical values such as True/False or 0/1.

    --> It actually measures the probability of a binary response as the value of response variable based on the mathematical equation relating it with the predictor variables.

    --> The general mathematical equation for logistic regression is:

    y = 1/(1+e^-(a+b1x1+b2x2+b3x3+...))

    y is the response variable.

    x is the predictor variable.

    a and b are the coefficients which are numeric constants.
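    The equation above is just the logistic (sigmoid) function applied to a linear score; a minimal sketch:

    ```r
    # Logistic (sigmoid) function: maps any real score to (0, 1)
    sigmoid <- function(z) 1 / (1 + exp(-z))

    sigmoid(0)          # 0.5: a score of 0 gives even odds
    sigmoid(c(-5, 5))   # close to 0 and close to 1 at extreme scores
    ```

    This is why the model's output can be read as a probability of the positive class.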



    --> The glm() function is used to create the regression model and get its summary.

    The basic syntax is:

    glm(formula, data, family)

    --> formula is the symbol presenting the relationship between the variables.

    --> data is the data set giving the values of these variables.

    --> family is an R object specifying the details of the model. Its value is binomial for logistic regression.

    Example 1:




    # Split dataset in training and testing

    inx = sample(nrow(spam), round(nrow(spam) * 0.8))

    train = spam[inx,]

    test = spam[-inx,]

    # Fit regression model

    fit = glm(spam ~ ., data = train, family = binomial())


    # Make predictions

    preds = predict(fit, test, type = "response")

    preds = ifelse(preds > 0.5, 1, 0)

    tbl = table(target = test$spam, preds)


    sum(diag(tbl)) / sum(tbl)

    Example 2:


    --> The in-built data set "mtcars" describes different models of a car with their various engine specifications.

    --> In "mtcars" data set, the transmission mode (automatic or manual) is described by the column am which is a binary value (0 or 1).

    --> We can create a logistic regression model between the columns "am" and 3 other columns - hp, wt and cyl.

    # Select some columns from mtcars.

    input <- mtcars[,c("am","cyl","hp","wt")]


    # Create the regression model
    model <- glm(formula = am ~ cyl + hp + wt, data = input, family = binomial)

    # Get its summary
    summary(model)




    --> In the summary as the p-value in the last column is more than 0.05 for the variables "cyl" and "hp", we consider them to be insignificant in contributing to the value of the variable "am".

    --> Only weight (wt) impacts the "am" value in this regression model.


  • Clustering  



    --> This is a type of problem where we group similar things together.

    --> It is a bit similar to multi-class classification, but here we don't provide the labels; the system learns from the data itself and clusters it.

    Some examples are :

    --> given news articles, cluster into different types of news

    --> given a set of tweets, cluster based on content of tweet

    --> given a set of images, cluster them into different objects

    K Means Clustering:


    --> K Means Clustering is an unsupervised learning algorithm that tries to cluster data based on their similarity.

    --> Unsupervised learning means that there is no outcome to be predicted, and the algorithm just tries to find patterns in the data.

    --> In k-means clustering, we have to specify the number of clusters we want the data to be grouped into.

    --> The algorithm randomly assigns each observation to a cluster, and finds the centroid of each cluster.

    --> Then, the algorithm iterates through two steps:

    1. Reassign data points to the cluster whose centroid is closest.

    2. Calculate new centroid of each cluster.

    --> These two steps are repeated till the within cluster variation cannot be reduced any further.

    --> The within cluster variation is calculated as the sum of the euclidean distance between the data points and their respective cluster centroids.
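    The two-step loop described above can be sketched by hand on 1-D toy data with k = 2 (a simplified illustration, not the kmeans() implementation):

    ```r
    # Manual k-means on 1-D toy data: assign, then recompute centroids
    set.seed(42)
    x <- c(1, 2, 3, 10, 11, 12)      # two obvious groups
    centers <- sample(x, 2)          # random initial centroids

    for (i in 1:10) {
      # step 1: reassign each point to the cluster with the closest centroid
      cluster_id <- apply(abs(outer(x, centers, "-")), 1, which.min)
      # step 2: recompute each centroid as the mean of its cluster
      centers <- tapply(x, cluster_id, mean)
    }

    sort(as.numeric(centers))        # converges to 2 and 11
    ```

    After a few iterations the assignments stop changing, which is exactly the stopping condition described above.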

    Example 1:


    # k-means clustering of iris data.

    # At first, we remove species from the data to cluster.

    # After that, we apply function kmeans() to iris2, and store the clustering result in kmeans.result.

    # The cluster number is set to 3 in the code below.

    iris2 <- iris

    iris2$Species <- NULL

    kmeans.result <- kmeans(iris2, 3)

    # The clustering result is then compared with the class label (Species) to check whether similar objects are grouped together.

    table(iris$Species, kmeans.result$cluster)

    # The above result shows that cluster "setosa" can be easily separated from the other clusters, and that clusters "versicolor" and "virginica" are to a small degree overlapped with each other.

    # Next, the clusters and their centers are plotted. Note that there are four dimensions in the data and that only the first two dimensions are used to draw the plot below.

    # Some black points close to the green center (asterisk) are actually closer to the black center in the four dimensional space. We also need to be aware that the results of k-means clustering may vary from run to run, due to random selection of initial cluster centers.

    plot(iris2[c("Sepal.Length", "Sepal.Width")], col = kmeans.result$cluster)

    # plot cluster centers

    points(kmeans.result$centers[,c("Sepal.Length", "Sepal.Width")], col = 1:3, pch = 8, cex=2)

    Example 2:


    # Exploring the data:

    # The iris dataset contains data about sepal length, sepal width, petal length, and petal width of flowers of different species. Let us see what it looks like:



    # After a little bit of exploration, I found that Petal.Length and Petal.Width were similar among the same species but varied considerably between different species, as demonstrated below:


    library(ggplot2)
    ggplot(iris, aes(Petal.Length, Petal.Width, color = Species)) + geom_point()

    # Clustering:

    # Okay, now that we have seen the data, let us try to cluster it. Since the initial cluster assignments are random, let us set the seed to ensure reproducibility.

    set.seed(20)
    irisCluster <- kmeans(iris[, 3:4], 3, nstart = 20)


    # Let us compare the clusters with the species.

    table(irisCluster$cluster, iris$Species)


    # As we can see, the data belonging to the setosa species got grouped into cluster 3, versicolor into cluster 2, and virginica into cluster 1. The algorithm wrongly classified two data points belonging to versicolor and six data points belonging to virginica.

    # We can also plot the data to see the clusters:

    irisCluster$cluster <- as.factor(irisCluster$cluster)

    ggplot(iris, aes(Petal.Length, Petal.Width, color = irisCluster$cluster)) + geom_point()

  • K-Means Clustering with SNS Data Analysis  
  • Association Rules (Market Basket Analysis)  

    Association Rules (Market Basket Analysis):


    --> Association rules are rules presenting association or correlation between itemsets.

    --> An association rule is in the form of A => B, where A and B are two disjoint itemsets, referred to respectively as the LHS (left-hand side) and RHS (right-hand side) of the rule.

    --> The three most widely-used measures for selecting interesting rules are support, confidence and lift.

    --> Support is the percentage of cases in the data that contains both A and B.

    --> Confidence is the percentage of cases containing A that also contain B.

    --> Lift is the ratio of confidence to the percentage of cases containing B.

    Let's consider the rule A => B in order to compute these metrics.

                          Number of transactions with both A and B
    Support             = ----------------------------------------  =  P(AnB)
                                Total number of transactions

                          Number of transactions with both A and B
    Confidence          = ----------------------------------------  =  P(AnB) / P(A)
                            Total number of transactions with A

                             Number of transactions with B
    Expected Confidence = --------------------------------  =  P(B)
                             Total number of transactions

                                Confidence
    Lift                = -------------------  =  P(AnB) / (P(A) * P(B))
                          Expected Confidence
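    The three measures can be computed by hand in R over a toy list of transactions (the baskets below are made up for illustration):

    ```r
    # Support, confidence and lift for the rule A => B, computed by hand
    transactions <- list(
      c("bread", "milk"),
      c("bread", "butter"),
      c("milk", "butter"),
      c("bread", "milk", "butter"),
      c("milk")
    )
    n <- length(transactions)
    # fraction-of-transactions helper: which baskets contain all given items?
    has <- function(items) sapply(transactions, function(t) all(items %in% t))

    A <- "bread"; B <- "milk"
    support    <- sum(has(c(A, B))) / n            # P(AnB)
    confidence <- sum(has(c(A, B))) / sum(has(A))  # P(AnB) / P(A)
    lift       <- confidence / (sum(has(B)) / n)   # confidence / P(B)
    c(support = support, confidence = confidence, lift = lift)
    ```

    A lift below 1 (as here) means buying A makes B slightly less likely than its base rate; apriori() automates exactly these counts over large transaction sets.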

    --> A classic algorithm for association rule mining is APRIORI.

    --> It is a level-wise, breadth-first algorithm which counts transactions to find frequent itemsets and then derive association rules from them.

    --> An implementation of it is function apriori() in package arules.

    --> Another algorithm for association rule mining is the ECLAT algorithm, which finds frequent itemsets with equivalence classes, depth-first search and set intersection instead of counting.

    --> It is implemented as function eclat() in the same package.

    --> With the apriori() function, the default settings are:

    1) supp=0.1, which is the minimum support of rules;

    2) conf=0.8, which is the minimum confidence of rules; and

    3) maxlen=10, which is the maximum length of rules.

    Example 1 on Titanic dataset:


    --> The Titanic dataset in the datasets package is a 4-dimensional table with summarized information on the fate of passengers on the Titanic according to social class, sex, age and survival.

    --> To make it suitable for association rule mining, we reconstruct the raw data as titanic.raw, where each row represents a person.


    df <- as.data.frame(Titanic)

    titanic.raw <- NULL

    for(i in 1:4) {
      titanic.raw <- cbind(titanic.raw, rep(as.character(df[,i]), df$Freq))
    }

    titanic.raw <- as.data.frame(titanic.raw)

    names(titanic.raw) <- names(df)[1:4]






    # find association rules with default settings
    library(arules)
    rules.all <- apriori(titanic.raw)

    quality(rules.all) <- round(quality(rules.all), digits=3)





    # rules with rhs containing "Survived" only

    rules <- apriori(titanic.raw, control = list(verbose=F),
                     parameter = list(minlen=2, supp=0.005, conf=0.8),
                     appearance = list(rhs=c("Survived=No", "Survived=Yes"),
                                       default="lhs"))

    quality(rules) <- round(quality(rules), digits=3)

    rules.sorted <- sort(rules, by="lift")
    inspect(rules.sorted)


    Removing Redundancy:


    # find redundant rules

    subset.matrix <- is.subset(rules.sorted, rules.sorted)

    subset.matrix[lower.tri(subset.matrix, diag=T)] <- NA

    redundant <- colSums(subset.matrix, na.rm=T) >= 1


    # remove redundant rules

    rules.pruned <- rules.sorted[!redundant]


    Interpreting Rules:


    rules <- apriori(titanic.raw,
                     parameter = list(minlen=3, supp=0.002, conf=0.2),
                     appearance = list(rhs=c("Survived=Yes"),
                                       lhs=c("Class=1st", "Class=2nd", "Class=3rd", "Age=Child", "Age=Adult"),
                                       default="none"),
                     control = list(verbose=F))

    rules.sorted <- sort(rules, by="confidence")


    Visualizing Association Rules:


    --> Next we show some ways to visualize association rules, including scatter plot, balloon plot, graph and parallel coordinates plot.

    --> More examples on visualizing association rules can be found in the vignettes of package "arulesViz".




    library(arulesViz)
    plot(rules.all) # scatter plot
    plot(rules.all, method="grouped")

    plot(rules.all, method="graph")

    plot(rules.all, method="graph", control=list(type="items"))

    plot(rules.all, method="paracoord", control=list(reorder=TRUE))


  • Market Basket Analysis using Association Rules with Groceries Data set  
  • Python Libraries for Data Science  
Reviews (0)