  • R Programming (5) - Sophisticated data structures
    R Programming 2020. 3. 28. 04:35

     

    Sophisticated data structures

    • R provides sophisticated structures for the storage and manipulation of data.
      • They simplify data representation, manipulation and analysis.
    • dataframe
      • is like a matrix,
      • but extended to allow different modes (data types) in different columns.
    • list
      • is a general data structure that can hold many kinds of R objects.
    • factor
      • is a special variable that represents categorical data.

    Factor

    A factor is a vector that can contain only predefined values, and is used to store categorical data.

    Factors are built on top of integer vectors using two attributes: the class, “factor”, which makes them behave differently from regular integer vectors, and the levels, which defines the set of allowed values.

    • Data type for ordinal and categorical vectors.
    • The possible values of a factor are referred to as its levels.
    • To create a factor, apply the function factor to a vector.
    hair <- c("blonde", "black", "brown", "brown", "black", "gray", "none")
    is.character(hair)
    ## [1] TRUE
    is.factor(hair)
    ## [1] FALSE
    hair <- factor(hair)
    is.factor(hair)
    ## [1] TRUE
    class(hair)
    ## [1] "factor"
    levels(hair)
    ## [1] "black"  "blonde" "brown"  "gray"   "none"

    table() calculates the number of times each level of the factor appears.

    table(hair)
    ## hair
    ##  black blonde  brown   gray   none 
    ##      2      1      2      1      1
    • Specify the levels explicitly with the levels argument; a level may be listed even if it does not occur in the data.
    hair <- factor(hair, levels=c("black", "blonde", "brown", "gray", "white", "none"))
    table(hair)
    ## hair
    ##  black blonde  brown   gray  white   none 
    ##      2      1      2      1      0      1

    You can’t assign a value that is not among the levels; the element becomes NA and R issues a warning.

    hair[2] <- "green"
    ## Warning in `[<-.factor`(`*tmp*`, 2, value = "green"): invalid factor level,
    ## NA generated
    # reset
    hair[2] <- "black"

    Indeed, the type of a factor is integer.

    typeof(hair)
    ## [1] "integer"
    as.numeric(hair)
    ## [1] 2 1 3 3 1 4 6
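The notes above mention ordinal data, but the hair example is purely categorical. The following is a minimal sketch of an ordered factor; the shirt-size data and the names sizes and size_f are hypothetical and chosen only for illustration.

```r
# Hypothetical ordinal data: shirt sizes with a natural order.
sizes <- c("medium", "small", "large", "small", "medium")

# ordered = TRUE records that small < medium < large.
size_f <- factor(sizes, levels = c("small", "medium", "large"), ordered = TRUE)

is.ordered(size_f)     # TRUE
levels(size_f)         # "small" "medium" "large"
size_f < "large"       # TRUE TRUE FALSE TRUE TRUE
max(size_f)            # large, the highest level present in the data
as.integer(size_f)     # underlying integer codes: 2 1 3 1 2
```

Comparisons such as < and functions such as max() are only meaningful for ordered factors; on an unordered factor they give NA with a warning or an error.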

    Dataframe

    A data frame is the most common way of storing data in R.

    • A dataframe is a list of vectors of equal length.
    • Each vector (column) is a variable in the experiment.
    • Each row is a single observation.

    creating data.frame

    You create a data.frame using data.frame(), which takes named vectors or existing data.frames as input:

    data.frame(col1=x1, col2=x2, …, df1, df2, …)
    • col1 and col2 are column names

    • x1 and x2 are vectors of equal lengths

    • df1 and df2 are dataframes with the same number of rows as the length of x1 and x2

    df <- data.frame(x = 2:4, y = c("a", "b", "c"))
    str(df)
    ## 'data.frame':    3 obs. of  2 variables:
    ##  $ x: int  2 3 4
    ##  $ y: Factor w/ 3 levels "a","b","c": 1 2 3

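    In R versions before 4.0.0 (as in the output above), data.frame() turns strings into factors by default. Since R 4.0.0 the default is stringsAsFactors = FALSE. In older versions, set stringsAsFactors = FALSE explicitly to suppress the conversion: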

    df <- data.frame(
      x = 2:4,
      y = c("a", "b", "c"),
      stringsAsFactors = FALSE)
    str(df)
    ## 'data.frame':    3 obs. of  2 variables:
    ##  $ x: int  2 3 4
    ##  $ y: chr  "a" "b" "c"
    typeof(df)
    ## [1] "list"
    class(df)
    ## [1] "data.frame"
    is.data.frame(df)
    ## [1] TRUE
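data.frame() also accepts an existing data frame among its arguments, alongside named vectors. A small sketch of that call pattern follows; the names base_df, z and combined are hypothetical.

```r
# Hypothetical inputs, for illustration only.
base_df <- data.frame(x = 2:4, y = c("a", "b", "c"), stringsAsFactors = FALSE)
z <- c(TRUE, FALSE, TRUE)

# Mix named vectors and an existing data frame in a single call.
# All pieces must have the same number of rows (3 here).
combined <- data.frame(id = 1:3, base_df, flag = z, stringsAsFactors = FALSE)

names(combined)   # "id" "x" "y" "flag"
dim(combined)     # 3 4
```

The columns of base_df keep their own names, while the vectors take the names given in the call.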

    Combining dataframes:

    cbind(df, data.frame(z = 3:1))
    ##   x y z
    ## 1 2 a 3
    ## 2 3 b 2
    ## 3 4 c 1
    rbind(df, data.frame(x = 10, y = "z"))
    ##    x y
    ## 1  2 a
    ## 2  3 b
    ## 3  4 c
    ## 4 10 z

    read.table

    You can use read.table to create a data.frame from a file.

    Plot Tree Species Diameter Height
    2 1 DF 39 20.5
    2 2 WL 48 33.0
    3 2 GF 52 30.0
    • Large dataframes are usually read into R from a file, using read.table.
    read.table(file, header=FALSE, sep="")
    • file
      • the name of the file to be read; the path can be relative or absolute.
      • each row must contain the same number of values.
      • the values may have different modes, but the pattern of modes must be the same in each row.
    • header
      • whether or not the first line of the file contains the variable names.
    • sep
      • the character used to separate values in each row.
      • the default "" has the special interpretation that a variable amount of white space (spaces, tabs, returns) can separate values.
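The arguments described above can be tried on the sample rows shown earlier. This is a minimal sketch that writes those rows to a temporary file first; the file path is generated by tempfile() and the name trees is hypothetical.

```r
# Write the sample rows to a temporary file (the path is generated, not real data).
path <- tempfile(fileext = ".txt")
writeLines(c(
  "Plot Tree Species Diameter Height",
  "2 1 DF 39 20.5",
  "2 2 WL 48 33.0",
  "3 2 GF 52 30.0"
), path)

# header = TRUE: the first line holds the variable names.
# sep = "" (the default): any run of white space separates values.
trees <- read.table(path, header = TRUE, sep = "", stringsAsFactors = FALSE)
str(trees)

unlink(path)  # remove the temporary file
```

Each column gets its own mode: the numeric columns are read as numbers and Species as character strings.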

    more about read.table

    • Two commonly used variants of read.table
      • read.csv(file)
        • for comma-separated data
        • equivalent to read.table(file, header=TRUE, sep=",")
      • read.delim(file)
        • for tab-delimited data
        • equivalent to read.table(file, header=TRUE, sep="\t")
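The equivalence stated above can be checked on a small file. The sketch below assumes a simple file with no quoted fields, comments or missing values; read.csv also changes a few other defaults (such as quote, fill and comment.char), so the two calls can differ on messier input. The file contents and variable names are hypothetical.

```r
# A tiny comma-separated file with a header line (hypothetical contents).
path <- tempfile(fileext = ".csv")
writeLines(c("plot,tree,species", "2,1,DF", "2,2,WL"), path)

via_csv   <- read.csv(path)
via_table <- read.table(path, header = TRUE, sep = ",")

# For this simple file both calls give the same data frame.
all.equal(via_csv, via_table)

unlink(path)  # remove the temporary file
```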

    read csv file

    • ufc.csv is comma separated and there is a header line.
    ufc <- read.csv("the path of ufc.csv")
    • Use head and tail to examine the object.
    head(ufc)
    ##   plot tree species dbh.cm height.m
    ## 1    2    1      DF     39     20.5
    ## 2    2    2      WL     48     33.0
    ## 3    3    2      GF     52     30.0
    ## 4    3    5      WC     36     20.7
    ## 5    3    8      WC     38     22.5
    ## 6    4    1      WC     46     18.0
    tail(ufc)
    ##     plot tree species dbh.cm height.m
    ## 331  143    1      GF   28.0     21.0
    ## 332  143    2      GF   33.0     20.5
    ## 333  143    7      WC   47.8     20.5
    ## 334  144    1      GF   10.2     16.0
    ## 335  144    2      DF   31.5     22.0
    ## 336  144    4      WL   26.5     25.0
    • Each column has a unique name, and we can extract a variable by name using $.
    x <- ufc$height.m
    x[1:5]
    ## [1] 20.5 33.0 30.0 20.7 22.5

    accessing element in data.frame

    • We can use [[ ]] to extract a single column.
      • ufc$height.m, ufc[[5]] and ufc[["height.m"]] are all equivalent.
      • The result is a vector.
    x1 <- ufc[["height.m"]]
    x1[1:5]
    ## [1] 20.5 33.0 30.0 20.7 22.5
    x2 <- ufc[[5]]
    x2[1:5]
    ## [1] 20.5 33.0 30.0 20.7 22.5
    • You can extract the elements of a dataframe directly using matrix indexing.
    # the result is a vector
    ufc[1:5, 5]
    ## [1] 20.5 33.0 30.0 20.7 22.5
    • To select more than one of the variables in a dataframe, we use [ ].
      • ufc[4:5] is equivalent to ufc[c("dbh.cm", "height.m")].
      • [ ] preserves the data.frame type.
    #diam.height and z are data.frame
    diam.height <- ufc[4:5]    
    (z <- diam.height[1:5, ])
    ##   dbh.cm height.m
    ## 1     39     20.5
    ## 2     48     33.0
    ## 3     52     30.0
    ## 4     36     20.7
    ## 5     38     22.5
    is.data.frame(diam.height)
    ## [1] TRUE
    # diam.height1 is a data.frame
    diam.height1 <- ufc[5]
    is.data.frame(diam.height1)
    ## [1] TRUE
    # z1 is not a data.frame, but a vector, since [,] simplifies.
    z1 <- diam.height1[1:5, ]
    is.data.frame(z1)
    ## [1] FALSE
    • Selecting a column using [[ ]] returns the column itself, so the result has the type of the extracted object.

    • In contrast, [ ] keeps the type of the object from which the extraction is made.

    typeof(ufc)
    ## [1] "list"
    typeof(ufc[5])
    ## [1] "list"
    # numeric vector
    typeof(ufc[[5]])
    ## [1] "double"
    • In addition, [,] tends to simplify the data type, if possible.
    typeof(ufc[1:5, 5])
    ## [1] "double"
    # In this case, simplification is impossible, and original data.frame type is preserved.
    typeof(ufc[1:5, 4:5])
    ## [1] "list"
    class(ufc[1:5, 4:5])
    ## [1] "data.frame"

    http://r4ds.had.co.nz/vectors.html

    create new variable

    • create a new variable within a dataframe, by naming it and assigning it a value.
    ufc$volume.m3 <- pi * (ufc$dbh.cm / 200)^2 * ufc$height.m / 2
    mean(ufc$volume.m3)
    ## [1] 1.93294

    name of column

    • names(df) returns the column names of the dataframe df.

    • To change the names of df, assign a vector of strings to names(df).

    (ufc.names <- names(ufc))
    ## [1] "plot"      "tree"      "species"   "dbh.cm"    "height.m"  "volume.m3"
    names(ufc) <- c("P", "T", "S", "D", "H", "V")
    names(ufc)
    ## [1] "P" "T" "S" "D" "H" "V"
    names(ufc) <- ufc.names

    subset

    • subset is a convenient tool for selecting rows (and, via select, columns) of a dataframe.
      • x %in% y returns a logical vector whose i-th element is TRUE if x[i] is in y.
    fir.height <- subset(ufc, subset=species %in% c("DF", "GF"), select = c(plot, tree, height.m))
    head(fir.height)
    ##    plot tree height.m
    ## 1     2    1     20.5
    ## 3     3    2     30.0
    ## 7     4    2     17.0
    ## 8     5    2     29.3
    ## 9     5    4     29.0
    ## 10    6    1     26.0
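The subset() call above can also be written with ordinary bracket indexing. The sketch below uses a small hypothetical stand-in for ufc (the data frame trees, with made-up rows that reuse the same column names) so that it runs without the CSV file.

```r
# Hypothetical stand-in for ufc with the same column names.
trees <- data.frame(
  plot     = c(2, 2, 3, 3),
  tree     = c(1, 2, 2, 5),
  species  = c("DF", "WL", "GF", "WC"),
  height.m = c(20.5, 33.0, 30.0, 20.7),
  stringsAsFactors = FALSE
)

# %in% gives one logical value per row: TRUE FALSE TRUE FALSE
keep <- trees$species %in% c("DF", "GF")

by_subset  <- subset(trees, subset = keep, select = c(plot, tree, height.m))
by_bracket <- trees[keep, c("plot", "tree", "height.m")]

# Both approaches select the same rows and columns.
all.equal(by_subset, by_bracket)
```

subset() is convenient interactively because column names can be written unquoted; bracket indexing is more explicit and is often preferred inside functions.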

    write a dataframe

    write.table(x, file="", append=FALSE, sep=" ", row.names=TRUE, col.names=TRUE)
    • x is the dataframe to be written.
    • file is the path of the file to write to.
    • append indicates whether to append to an existing file instead of overwriting it.
    • sep is the character used to separate the values.
    • row.names indicates whether or not to include the row names as the first column.
    • col.names indicates whether or not to include the column names as the first row.
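As a minimal sketch of these arguments in use, the following writes a small data frame to a temporary file and reads it back. The data frame, the tab separator and the temporary path are assumptions made for illustration.

```r
df <- data.frame(x = 2:4, y = c("a", "b", "c"), stringsAsFactors = FALSE)

# Write tab-separated values with a header row and without row names.
path <- tempfile(fileext = ".txt")
write.table(df, file = path, sep = "\t", row.names = FALSE, col.names = TRUE)

written <- readLines(path)  # one header line plus three data lines
written

# Read the file back with matching header and sep settings.
back <- read.table(path, header = TRUE, sep = "\t", stringsAsFactors = FALSE)
all.equal(df, back)

unlink(path)  # remove the temporary file
```

Setting row.names = FALSE keeps the header aligned with the data columns, which makes the file easier to read back.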

    Another method to read and write tabular data

    • read_csv() and read_tsv() are special cases of the general read_delim() in the readr package.
    • They’re useful for reading the most common types of flat file data, comma separated values and tab separated values, respectively.
    • The result, ufc_tibble, is a tibble: a data frame with a nicer printing method, useful when working with large data sets.
    library(readr)
    ufc_tibble <- read_csv("./data/ufc.csv")
    ## Parsed with column specification:
    ## cols(
    ##   plot = col_integer(),
    ##   tree = col_integer(),
    ##   species = col_character(),
    ##   dbh.cm = col_double(),
    ##   height.m = col_double()
    ## )

    List

    • We have seen that a vector is an indexed set of objects.

    • All the elements of a vector must be of the same type (numeric, character or logical), which is called the mode of the vector.

    • A list is also an indexed set of objects, but its elements can be of different types, including other lists.
      • The mode of a list is list.
    • A list is created using list(...) instead of c(), with comma-separated arguments.

    • Single square brackets [ ] are used to select a sublist.

    • Double square brackets [[ ]] are used to extract a single element.

    Example

    (my.list <- list("one", TRUE, 3, c("f", "o", "u", "r")))
    ## [[1]]
    ## [1] "one"
    ## 
    ## [[2]]
    ## [1] TRUE
    ## 
    ## [[3]]
    ## [1] 3
    ## 
    ## [[4]]
    ## [1] "f" "o" "u" "r"
    str(my.list)
    ## List of 4
    ##  $ : chr "one"
    ##  $ : logi TRUE
    ##  $ : num 3
    ##  $ : chr [1:4] "f" "o" "u" "r"
    my.list[[2]]
    ## [1] TRUE
    mode(my.list[[2]])
    ## [1] "logical"
    my.list[2]
    ## [[1]]
    ## [1] TRUE
    mode(my.list[2])
    ## [1] "list"
    my.list[[4]][1]
    ## [1] "f"
    my.list[4][1]
    ## [[1]]
    ## [1] "f" "o" "u" "r"
    • When displaying a list, R uses double brackets [[1]], [[2]], etc., to indicate list elements, then single brackets [1], [2], etc., to indicate vector elements.

    # a list can contain other lists
    my.list2 <- list("a", c(5,6,7), my.list)
    my.list2[3]
    ## [[1]]
    ## [[1]][[1]]
    ## [1] "one"
    ## 
    ## [[1]][[2]]
    ## [1] TRUE
    ## 
    ## [[1]][[3]]
    ## [1] 3
    ## 
    ## [[1]][[4]]
    ## [1] "f" "o" "u" "r"
    # error (subscript out of bounds): my.list2[3] is a list of length 1
    my.list2[3][[4]]
    my.list2[[3]]
    ## [[1]]
    ## [1] "one"
    ## 
    ## [[2]]
    ## [1] TRUE
    ## 
    ## [[3]]
    ## [1] 3
    ## 
    ## [[4]]
    ## [1] "f" "o" "u" "r"
    my.list2[[3]][4]
    ## [[1]]
    ## [1] "f" "o" "u" "r"
    my.list2[[3]][[4]]
    ## [1] "f" "o" "u" "r"

    name of elements of list

    The elements of a list can be named when the list is created.

    my.list <- list(first = "one", second=TRUE, third=3, fourth = c("f", "o", "u", "r"))
    names(my.list)
    ## [1] "first"  "second" "third"  "fourth"
    my.list$second
    ## [1] TRUE

    The elements of a list can also be named afterwards by assigning to the names attribute.

    names(my.list) <- c("Fi", "Se", "Th", "Fo")
    my.list$Se     
    ## [1] TRUE
    my.list$"Se"
    ## [1] TRUE
    my.list[["Se"]]
    ## [1] TRUE

    Simplifying vs. Preserving

    type            Simplifying              Preserving
    vector          x[[1]]                   x[1]
    array (matrix)  x[1, ], x[, 1]           x[1, , drop = F], x[, 1, drop = F]
    list            x[[1]], x$a, x[["a"]]    x[1], x["a"]
    data.frame      x[, 1]                   x[1, ], x[, 1, drop = F]
    • Note that a data.frame is a list, so the simplifying and preserving rules for lists also apply to data.frames.
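The rows of the table for arrays and data frames can be checked directly. This is a small sketch with a hypothetical matrix m and data frame df; it only illustrates the effect of drop = FALSE.

```r
m <- matrix(1:6, nrow = 2)

dim(m[1, ])                  # NULL: simplified to a plain vector
dim(m[1, , drop = FALSE])    # 1 3: still a matrix

df <- data.frame(a = 1:3, b = c("x", "y", "z"), stringsAsFactors = FALSE)

class(df[, 1])                 # "integer": simplified to a vector
class(df[, 1, drop = FALSE])   # "data.frame": structure preserved
class(df[[1]])                 # "integer": list-style simplifying
class(df[1])                   # "data.frame": list-style preserving
```

Inside functions, drop = FALSE is a common safeguard against a result silently changing from a matrix or data frame to a vector when only one row or column is selected.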

    http://adv-r.had.co.nz/Subsetting.html

    list as output of function

    • Many functions produce a list object as their output.

    • For example, when we fit a least squares regression, the regression object itself is a list and can be manipulated using list operations.

    lm.xy <- lm(y~x, data=data.frame(x=1:5, y=1:5))
    typeof(lm.xy)
    ## [1] "list"
    names(lm.xy)
    ##  [1] "coefficients"  "residuals"     "effects"       "rank"         
    ##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
    ##  [9] "xlevels"       "call"          "terms"         "model"
    • str can be used to summarize a list or dataframe, e.g. str(lm.xy).
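Because the fitted model is a list, the usual $, [[ ]] and [ ] operators apply to it. The sketch below refits the same toy model so that it is self-contained; the variable names coefs and res_df are hypothetical.

```r
lm.xy <- lm(y ~ x, data = data.frame(x = 1:5, y = 1:5))

# $ and [[ ]] extract a single component.
coefs  <- lm.xy$coefficients     # named vector: intercept and slope
res_df <- lm.xy[["df.residual"]] # 5 observations - 2 parameters = 3

names(coefs)                     # "(Intercept)" "x"

# [ ] returns a sublist, so the result is still a list.
is.list(lm.xy["rank"])           # TRUE

# The class attribute is what makes R treat this list as a model.
class(lm.xy)                     # "lm"
```

In practice accessor functions such as coef() and residuals() are usually preferred, but they return the same components that are stored in the list.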

    apply family

    • R has several functions that allow you to easily apply a function to all or selected elements of a list or dataframe.

    • tapply allows us to vectorise the application of a function to subsets of data.

    tapply(X, INDEX, FUN, …)
    • X : the vector to which the function is applied.

    • INDEX : a factor, the same length as X, used to group the elements. (If INDEX is not a factor, it is automatically converted to one.)

    • FUN : the function to be applied. It is applied to each subvector of X corresponding to a single level of INDEX.

    • tapply works similarly to GROUP BY combined with an aggregate function in SQL.

    • INDEX plays the role of GROUP BY and FUN that of the aggregate function.

    https://www.rdocumentation.org/packages/base/versions/3.4.0/topics/tapply
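One way to picture what tapply does is to split X by INDEX and then apply FUN to each piece. The following sketch compares the two routes on a few hypothetical height measurements; it illustrates the idea and is not a description of how tapply is implemented internally.

```r
# Hypothetical heights and species labels.
height  <- c(20.5, 33.0, 30.0, 20.7, 22.5, 18.0)
species <- c("DF", "WL", "GF", "WC", "WC", "WC")

# Grouped mean in one step; species is converted to a factor automatically.
by_tapply <- tapply(height, species, mean)

# The same computation spelled out: split into subvectors, then apply mean.
groups   <- split(height, species)   # a list with one vector per level
by_split <- sapply(groups, mean)

by_tapply
by_split
```

Both results are named by the levels of the grouping variable, in sorted order (DF, GF, WC, WL).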

    tapply example

    Using tapply we obtain the average height grouped by species as follows:

    tapply(ufc$height.m, ufc$species, mean)
    ##       DF       GF       WC       WL 
    ## 25.30000 24.34322 23.48777 25.47273

    To round the results:

    round(tapply(ufc$height.m, ufc$species, mean), digits=1)
    ##   DF   GF   WC   WL 
    ## 25.3 24.3 23.5 25.5

    To find out how many samples we have of each species:

    tapply(ufc$species, ufc$species, length)
    ##  DF  GF  WC  WL 
    ##  57 118 139  22
    # the same result
    tapply(ufc$height.m, ufc$species, length)
    ##  DF  GF  WC  WL 
    ##  57 118 139  22

    Example : tapply

    tapply can be used to count the number of elements in a vector:

    x <- c(1,2,3,4,5,4,6,2,5,6,5,3,4,1,4,5,6,7,2,2,6,7,9,3,5)
    tapply(X = x, INDEX = x, FUN = length)
    ## 1 2 3 4 5 6 7 9 
    ## 2 4 3 4 5 4 2 1

    The above gives almost the same result as table():

    table(x)
    ## x
    ## 1 2 3 4 5 6 7 9 
    ## 2 4 3 4 5 4 2 1
    y <- x %% 2
    tapply(X = x, INDEX = y, FUN = length)
    ##  0  1 
    ## 12 13
    y <- x %% 2
    tapply(X = x, INDEX = y, FUN = max)
    ## 0 1 
    ## 6 9

    lapply and sapply

    • lapply(X, FUN, …) applies the function FUN to each element of X, which can be a list or a vector, and returns a list.

    • sapply(X, FUN, …) applies the function FUN to each element of X, which can be a list or a vector, and by default tries to return the results in a vector or a matrix if this makes sense, otherwise in a list.

    To obtain the mean diameter, height, and volume of trees:

    lapply(ufc[4:6], mean)   # return a list
    ## $dbh.cm
    ## [1] 37.41369
    ## 
    ## $height.m
    ## [1] 24.2256
    ## 
    ## $volume.m3
    ## [1] 1.93294
    sapply(ufc[4:6], mean)   # return a vector
    ##    dbh.cm  height.m volume.m3 
    ##  37.41369  24.22560   1.93294

    Simple example:

    sapply(1:3, function(x) x^2)
    ## [1] 1 4 9
    lapply(1:3, function(x) x^2)
    ## [[1]]
    ## [1] 1
    ## 
    ## [[2]]
    ## [1] 4
    ## 
    ## [[3]]
    ## [1] 9
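The shape of the sapply result depends on what FUN returns. The sketch below shows the matrix and list cases mentioned above; the names as_matrix and as_list are hypothetical.

```r
# FUN always returns a vector of length 2, so sapply builds a 2 x 3 matrix
# with one column per element of the input.
as_matrix <- sapply(1:3, function(i) c(i, i^2))
dim(as_matrix)       # 2 3

# FUN returns vectors of different lengths, so sapply cannot simplify
# and falls back to a list, just like lapply.
as_list <- sapply(1:3, seq_len)
is.list(as_list)     # TRUE
as_list[[3]]         # 1 2 3
```

Because the return type changes with the data, sapply is convenient interactively but can surprise code that expects one fixed shape.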

    lapply example

    df <- data.frame(replicate(6, sample(c(1:10, -99), 6, replace = TRUE)))
    names(df) <- letters[1:6]
    df
    ##     a   b   c  d   e  f
    ## 1 -99   7   1 10   5  3
    ## 2  10 -99   5  6   8  7
    ## 3  10 -99   5  1   9  7
    ## 4   7  10 -99 10   2 10
    ## 5   1   8   7  7 -99  3
    ## 6   5   8   4  3 -99  7

    You want to replace all the -99s with NAs.

    fix_missing <- function(x) {
      x[x == -99] <- NA
      x
    }
    df[] <- lapply(df, fix_missing)  # df[] keeps df a data.frame; plain df <- would make it a list
    df
    ##    a  b  c  d  e  f
    ## 1 NA  7  1 10  5  3
    ## 2 10 NA  5  6  8  7
    ## 3 10 NA  5  1  9  7
    ## 4  7 10 NA 10  2 10
    ## 5  1  8  7  7 NA  3
    ## 6  5  8  4  3 NA  7

    More about apply : http://adv-r.had.co.nz/Functional-programming.html

    three dots ellipsis … in function argument

    The three-dot ellipsis ... in a function's argument list lets the function accept a variable number of arguments.

    • One example is paste, which accepts a variable number of arguments. Check ?paste.

    To handle ..., first convert it to a list inside the function.

    addemup <- function(x, ...){
      args <- list(...)
      for (a in args) x <- x + a
      x
    }
    addemup(1, 1)
    ## [1] 2
    addemup(1, 2, 3, 4, 5)
    ## [1] 15
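    reverse_paste0 <- function(...){
      result <- ""
      args <- list(...)
      
      # rev(seq_along(args)) is empty when no arguments are given,
      # whereas length(args):1 would iterate over 0 and 1
      for (i in rev(seq_along(args))){
        result <- paste0(result, args[[i]])  # [[ ]] extracts the element itself
      }
      result
    }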
    reverse_paste0("a", "b", "c")
    ## [1] "cba"
    reverse_paste0("a", "b", "c", "d", "e")
    ## [1] "edcba"
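Besides converting ... to a list, a function can pass it on unchanged to another function. The sketch below shows that pattern together with named arguments captured by ...; the helpers scaled_mean and arg_names are hypothetical and exist only for illustration.

```r
# Hypothetical helper: extra arguments in ... are forwarded to mean().
scaled_mean <- function(x, scale = 1, ...) {
  mean(x, ...) * scale
}

scaled_mean(c(1, 2, NA))                             # NA
scaled_mean(c(1, 2, NA), na.rm = TRUE)               # 1.5
scaled_mean(c(1, 2, NA), scale = 10, na.rm = TRUE)   # 15

# Named arguments keep their names when ... is converted to a list.
arg_names <- function(...) names(list(...))
arg_names(a = 1, b = 2)                              # "a" "b"
```

Forwarding lets a wrapper expose every option of the wrapped function without listing each one in its own signature.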
