  • R Programming (8) - Continuous random variable
    R Programming 2020. 3. 28. 04:57
    Continuous random variable

    Some continuous random variables in R

    Some of the continuous distributions provided by R, together with the names of their parameter inputs.

    Distribution  R name (dist)  Parameter names
    Uniform       unif           min=0, max=1
    Exponential   exp            rate=1
    Chi-square    chisq          df
    Gamma         gamma          shape, rate=1
    Normal        norm           mean=0, sd=1
    t             t              df
    Weibull       weibull        shape, scale=1
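
    Each name in the table combines with the prefixes d (density), p (distribution function), q (quantile) and r (random generation). A minimal sketch of this naming scheme:

    dexp(1, rate = 1)       # density of exp(1) at x = 1
    pgamma(2, shape = 2)    # P(X <= 2) for a gamma with shape 2 and default rate 1
    qchisq(0.95, df = 3)    # 95% quantile of the chi-square with 3 d.f.
    rweibull(3, shape = 1)  # three Weibull random numbers with default scale 1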

    Uniform distribution

    • If the probability that \(X\) lies in a given subinterval of \([a,b]\) depends only on the length of the subinterval and not on its location,

    • then \(X\) is said to have a uniform distribution on \([a,b]\).

    • Write \(X\) ~ \(U[a,b]\)

    • The pdf, mean and variance are

      • \(f(x) = 1/(b-a)\), for \(a \leq x \leq b\)
      • \(E[X] = (a+b)/2\)
      • \(Var(X) = (b-a)^2/12\)
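
    These formulas can be checked numerically with integrate (a minimal sketch, assuming \(a = 2\) and \(b = 5\)):

    a <- 2; b <- 5
    integrate(function(x) x/(b - a), a, b)                  # E[X] = (a+b)/2 = 3.5
    integrate(function(x) (x - (a + b)/2)^2/(b - a), a, b)  # Var(X) = (b-a)^2/12 = 0.75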

    Uniform distribution in R

    • density function of the uniform on the interval from min to max.
    dunif(x, min=0, max=1, log = FALSE)
    • the uniform distribution function
    punif(q, min=0, max=1, lower.tail = TRUE, log.p = FALSE)
    • quantile function
    qunif(p, min=0, max=1, lower.tail = TRUE, log.p = FALSE)
    • generate uniform random variables
    runif(n, min=0, max=1)
    • Example :
    var(runif(100000))  # approximately 1/12
    ## [1] 0.08382216
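
    The distribution function punif and the quantile function qunif are inverses of each other, which we can check directly (a quick sketch):

    punif(0.3)         # P(X <= 0.3) = 0.3 for U[0,1]
    qunif(0.3)         # the 0.3 quantile of U[0,1] is 0.3 itself
    qunif(punif(0.3))  # the round trip recovers 0.3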

    From uniform random variable to Bernoulli random variable

    We can generate a Bernoulli random variable from the uniform distribution.

    rBernoulli <- function(p = 0.5){
      if(runif(1) < p ) 1
      else 0
    }
    rBernoulli()
    ## [1] 0

    Or, more conveniently,

    rBernoulli <- function(p = 0.5) as.integer(runif(1) < p) # TRUE -> 1, FALSE -> 0
    rBernoulli()
    ## [1] 0

    Also, we can generate multiple samples.

    # generate multiple samples
    rBernoulli <- function(n, p = 0.5) as.integer(runif(n) < p)
    rBernoulli(n = 100, p = 0.9)
    ##   [1] 1 1 0 0 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 1 1 1 1 1 0 1 1 1 1
    ##  [36] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1
    ##  [71] 0 1 1 1 1 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1

    We can further construct a random number generator for the binomial distribution.

    my_rbinom <- function(n, size, prob){
      sapply(1:n, function(x) sum(rBernoulli(size, prob)))  # x is a dummy variable
    }
    my_rbinom(10, 5, 0.6)
    ##  [1] 4 4 4 2 2 2 3 4 1 3
    N <- 1000; n <- 20; p <- 0.4
    
    frequency <- table(factor(my_rbinom(N, n, p), levels = 0:n))
    relative_frequency <- as.numeric(frequency/N)
    
    x <- 0:n; pmf <- dbinom(x, n, p)
    
    par(mfrow=c(1,2))
    plot(x, relative_frequency, 'h')
    plot(x, pmf, 'h')
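
    As a sanity check (a sketch reusing the same N, n, p and x as above), R's built-in generator rbinom should produce a similar relative frequency plot:

    frequency_builtin <- table(factor(rbinom(N, n, p), levels = 0:n))
    plot(x, as.numeric(frequency_builtin/N), 'h')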

    Two dimensional uniform random variable

    Two independent uniform random variables.

    n <- 10000
    X <- runif(n, -1, 1)
    Y <- runif(n, -1, 1)
    plot(X,Y, cex=0.1) # check scatter plot

    Uncorrelated but dependent variables

    Independent variables are uncorrelated, but uncorrelated variables can still be dependent.

    n <- 50000
    X <- runif(n, -1, 1)
    Z <- runif(n, -1, 1)
    Y <- sign(Z)*abs(X)
    cor(X,Y)   # correlation coefficient is almost 0
    ## [1] -0.004253034
    plot(X,Y, cex=0.1) # check scatter plot

    Normal distribution

    Normal distribution in R

    • the normal density function
    dnorm(x, mean = 0, sd = 1, log = FALSE)
    • the normal distribution function
    pnorm(q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
    • Example :
    pnorm(1.96, lower.tail=TRUE)
    ## [1] 0.9750021
    pnorm(1.96, lower.tail=FALSE)
    ## [1] 0.0249979
    • quantile function
    qnorm(p, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
    • Example:
    qnorm(0.975, 0, 1)
    ## [1] 1.959964
    • normal random variables
    rnorm(n, mean = 0, sd = 1)
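    • Example (a quick sketch checking that the generated sample matches its parameters):
    z <- rnorm(100000, mean = 3, sd = 2)
    mean(z)  # approximately 3
    sd(z)    # approximately 2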

    Normal density with different mean

    x <- seq(-3, 5, 0.01)
    plot(x, dnorm(x, 0, 1), type="l", xlab="x", ylab="y",  xlim=c(-3,5))
    lines(x, dnorm(x, 1, 1), col="red")
    lines(x, dnorm(x, 2, 1), col="blue")
    title("Normal distribution: sigma=1, mu=0,1,2")

    Normal density with different sigma

    x <- seq(-10, 10, 0.01)
    plot(x, dnorm(x, 0, 1), type="l", xlab="x", ylab="y",  xlim=c(-10,10))
    lines(x, dnorm(x, 0, 2), col="red")
    lines(x, dnorm(x, 0, 3), col="blue")
    title("Normal distribution: mu=0, sigma=1,2,3")

    Comparison between histogram and pdf

    z<-rnorm(10000)
    par(las=1)
    hist(z, breaks = seq(-5,5,0.2), freq=FALSE)
    phi <- function(x) exp(-x^2/2)/sqrt(2*pi)
    x <- seq(-5,5,0.1)
    lines(x, phi(x))  # or lines(x, dnorm(x))

    Sum of independent normal

    \(X\) ~ \(N(\mu_1, \sigma_1^2)\), \(Y\) ~ \(N(\mu_2, \sigma_2^2)\)

    Then \(X+Y\) ~ \(N(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2)\)

    mu1 <-1; mu2 <-1
    sigma1 <- 1; sigma2 <- 2
    X <- rnorm(50000, mu1, sigma1)
    Y <- rnorm(50000, mu2, sigma2)
    
    hist(X + Y, breaks=seq(-10,14,.2), freq=FALSE)
    
    x <- seq(-10, 14, 0.1)
    lines(x, dnorm(x, mu1 + mu2, sqrt(sigma1^2 + sigma2^2)))

    Correlated normal r.v.s

    Z1 <- rnorm(1000)
    Z2 <- rnorm(1000)
    
    par(mfrow=c(1,3))
    rhos <- c(-0.7,0,0.7)
    for (rho in rhos){
        X <- Z1
        Y <- rho * Z1 + sqrt(1 - rho^2) * Z2
        plot(X,Y, cex=0.2)
        title(paste("rho = ", rho))
    }

    for (rho in rhos){
      X <- Z1
      Y <- rho * Z1 + sqrt(1 - rho^2) * Z2
      cat("When rho is ", rho, " the variance of X+Y is :",  var(X+Y), "\n")
    }
    ## When rho is  -0.7  the variance of X+Y is : 0.6174315 
    ## When rho is  0  the variance of X+Y is : 2.115124 
    ## When rho is  0.7  the variance of X+Y is : 3.617621
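
    These simulated values agree with the identity \(Var(X+Y) = Var(X) + Var(Y) + 2 Cov(X, Y) = 2(1 + \rho)\), which gives 0.6, 2 and 3.4 for \(\rho = -0.7, 0, 0.7\).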

    Sample mean and sample variance

    The sample mean and sample variance from a normal distribution are independent.

    In general, it is hard to check the independence of two random variables directly.

    In this example, we instead compute the correlation between the two statistics.

    sample_size <- 100  # sample size for each statistic
    N <- 10000   # number of simulations
    
    mu <- 3; sigma <- 2
    
    X_bar <- S2 <- numeric(N)  #preallocation
    
    for(i in 1:N){
      rns <- rnorm(sample_size, mu, sigma)    # random normal sample
      X_bar[i] <- mean(rns)
      S2[i] <- var(rns)
    }
    
    cat("The correlation between X_bar and S^2 is", cor(X_bar, S2), "\n")
    ## The correlation between X_bar and S^2 is 0.001101206

    t distribution

    t density

    curve(dt(x, df=1), from=-6, to=6, ylab="density", ylim=c(0,0.4))
    curve(dt(x, df=2), add = TRUE)
    curve(dt(x, df=4), add = TRUE)
    curve(dt(x, df=10), add = TRUE)
    curve(dnorm(x), lwd=2, add= TRUE)
    title("t densities with 1,2,4 and 10 d.f., and normal limit in bold")

    t statistics

    In the code below, I intentionally used sapply to practice vectorized programming.

    You can rewrite this code with a for loop.

    mu <- 5
    sigma <- 3
    n <- 7     # sample size for each simulation
    N <- 10000   # simulation number
    t_statistic <- numeric(N)
    
    
    # t_statistic is a vector of length N
    t_statistic <- sapply(1:N, function(x) {
      rsample <- rnorm(n, mu, sigma)
      (mean(rsample) - mu)/(sd(rsample)/sqrt(n)) 
    })
    # x is dummy variable
    
    hist(t_statistic, breaks = seq(min(t_statistic),max(t_statistic)+0.2,0.2), 
         probability=TRUE)
    
    x <- seq(-6,6,0.01)
    lines(x, dt(x, n-1))

    QQ plot

    qqnorm(t_statistic)
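
    A reference line through the first and third quartiles makes departures from normality easier to see (a small optional addition using qqline):

    qqnorm(t_statistic)
    qqline(t_statistic)  # reference line through the quartiles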

    chi-square distribution

    chi-square density

    curve(dchisq(x, df=1), from=0, to=8, ylab="density", ylim=c(0,0.5))
    curve(dchisq(x, df=2), col="red", add = TRUE)
    curve(dchisq(x, df=3), col="blue", add = TRUE)
    curve(dchisq(x, df=4), col="darkgreen", add = TRUE)
    legend("topright", c("df=1", "df=2", "df=3", "df=4"),  pch="----", col=c("black","red","blue","darkgreen"))

    chi-square simulation

    dfs <- c(1,2,3,4)
    N <- 5000
    
    par(mfrow=c(2,2))
    for (df in dfs){
        chi_rv <- numeric(N)
        for (i in 1:df)
            chi_rv <- chi_rv + rnorm(N)^2
        hist(chi_rv, breaks=seq(0, 30, 0.5), freq=FALSE, main=paste("d.f. = ", df))
        curve(dchisq(x, df), add=TRUE)
    }

    Sample mean and sample variance

    As demonstrated below, if the population distribution is chi-square,

    then the sample mean and sample variance are not independent.

    sample_size <- 100  # sample size for each statistic
    N <- 10000   # number of simulations
    
    dof <- 10
    
    X_bar <- S2 <- numeric(N)  #preallocation
    
    for(i in 1:N){
      chi_rv <- rchisq(sample_size, dof)    # random chi-square sample
      X_bar[i] <- mean(chi_rv)
      S2[i] <- var(chi_rv)
    }
    
    cat("The correlation between X_bar and S^2 is ", cor(X_bar, S2), "\n")
    ## The correlation between X_bar and S^2 is  0.5030476

    Life time model : exponential and Weibull

    • Let \(X \geq 0\) be the time until some event occurs,
      • such as the breakdown of some mechanical component,
      • i.e., the lifetime of the component.
    • Let \(f\) and \(F\) be the pdf and cdf of \(X\), then we define the survivor function as
      • \(G(x) = P(X > x) = 1 - F(x)\)
      • \(G(x)\) : the probability that the component will survive until \(x\).
    • The failure rate is called the hazard function \(\lambda(x)\).
      • \(\lambda(x) dx = P(\)component fails between \(x\) and \(x+dx\) | still working at \(x)\) = \(\frac{f(x)dx}{G(x)}\)
    • Dividing by \(dx\), we have
      • \(\lambda(x) = \frac{f(x)}{G(x)}\)
    • We can find the density \(f\) from \(\lambda\) as follows:
      • \(f(x) = \frac{d F(x)}{dx} = \frac{d}{dx} (1 - G(x)) = - \frac{dG(x)}{dx}\)
      • \(-\lambda(x) = \frac{f(x)}{G(x)} = -\frac{G'(x)}{G(x)} = - \frac{d}{dx} \log G(x)\)
      • \(G(x) = \exp(- \int_0^x \lambda(u) du)\)
      • \(f(x) = \lambda(x) \exp ( - \int_0^x \lambda(u) d u )\)
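
    As a quick check of the relation \(\lambda(x) = f(x)/G(x)\) (a sketch using the exponential distribution, whose hazard should be the constant rate):

    rate <- 2
    x <- c(0.5, 1, 3)
    dexp(x, rate) / pexp(x, rate, lower.tail = FALSE)  # equals the constant rate 2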

    Exponential distribution

    • If \(\lambda(x) = \lambda\), constant rate of failure, then we say \(X\) has an exponential distribution and write \(X\) ~ exp\((\lambda)\).

      • \(f(x) = \lambda \exp(-\lambda x)\)
      • \(F(x) = 1 - \exp(-\lambda x)\)
      • \(\mu = 1/\lambda\)
      • \(\sigma = 1/\lambda\)
    • memoryless property

      • aging has no effect, that is, the component fails at random.
      • \(P(X > s + t | X > s) = P(X > s + t)/P(X > s) = P(X > t)\)
      • Given that you have survived until age \(s\), the probability of surviving an additional time \(t\) is the same as if you had just been born.

    Exponential distribution in R

    • density function with \(\lambda =\) rate
    dexp(x, rate = 1, log = FALSE)
    • distribution function with \(\lambda =\) rate
    pexp(q, rate = 1, lower.tail = TRUE, log.p = FALSE)
    • quantile function with \(\lambda =\) rate
    qexp(p, rate = 1, lower.tail = TRUE, log.p = FALSE)
    • random variables with \(\lambda =\) rate
    rexp(n, rate = 1)

    Exponential density

    x <- seq(0,8,0.01)
    plot(x, dexp(x, rate=2), type="l", ylab="f(x)", col=1)
    lines(x, dexp(x, rate=1) , col=2)
    lines(x, dexp(x, rate=0.5), col=3)
    legend(5.5, 1.9, c("lambda = 2", "lambda = 1", "lambda = 0.5"), col=c(1,2,3), lty = c(1,1,1))

    Exponential distribution

    x <- seq(0,8,0.01)
    plot(x, pexp(x, rate=2), type="l", ylab="F(x)", col=1)
    lines(x, pexp(x, rate=1) , col=2)
    lines(x, pexp(x, rate=0.5), col=3)
    legend(5.5, 0.4, c("lambda = 2", "lambda = 1", "lambda = 0.5"), col=c(1,2,3), lty = c(1,1,1))

    Example : radioactive decay

    • Uranium-238 decays into thorium-234 at rate \(\lambda\) per year.

    • The half-life of uranium-238 is \(4.47 \times 10^9\) years.

      • the time it takes for half of some lump of uranium-238 to decay into thorium-234.
    • Let \(X\) be the time of decay of a single atom, then \(X \sim\) exp\((\lambda)\) and

      • \(P(X > 4.47 \times 10^9) = 0.5\)
    • Since \(P(X > x) = \exp(-\lambda x)\), we have

    (lambda <- log(2)/(4.47*10^9))
    ## [1] 1.550665e-10
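
    With this rate we can verify the half-life and evaluate other decay probabilities (a quick sketch using pexp):

    pexp(4.47*10^9, rate = lambda)  # 0.5, by the definition of half-life
    pexp(1000, rate = lambda)       # approximately lambda * 1000 = 1.55e-07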

    Memoryless property

    The exponential distribution has a memoryless property.

    Let \(X\) follow an exponential distribution with rate 0.1.

    The following simulates P(X>4).

    n <- 1000000
    y <- rexp(n, rate = 0.1)
    larger_4 <- y[y > 4]
    (ratio1 <- length(larger_4)/n)
    ## [1] 0.670187

    The following simulates P( X>12 | X>8 ), which by the memoryless property should be close to the previous result.

    larger_8 <- y[y > 8]
    larger_12 <- y[y > 12]
    
    (ratio2 <- length(larger_12)/length(larger_8))
    ## [1] 0.6706099
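
    Both ratios are close to the exact value \(P(X > 4) = \exp(-0.4)\):

    exp(-0.4)                                # 0.6703200
    pexp(4, rate = 0.1, lower.tail = FALSE)  # the same exact probability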

    Weibull distribution

    • \(X\) has a Weibull distribution with parameters \(\lambda\) and \(m\), if the hazard function is

      • \(\lambda(x) = m \lambda x^{(m-1)}\)
    • \(X\) ~ Weibull\((\lambda, m)\)

    • \(G(x) = \exp(-\lambda x^m)\)

    • \(f(x) = m \lambda x^{m-1} \exp(-\lambda x^m)\)

    • The R parameterization of the Weibull distribution differs from that presented above:

      • use m for the shape argument and \(\lambda^{-1/m}\) for the scale argument.
    x <- seq(0,4,0.001)
    
    curve(2*x, from=0, to=4, ylab="hazard function")
    lines(x, 2.5*x^(1.5))
    lines(x, 0.5*x^(-0.5))
    lines(x, 1.5*x^0.5)
    lines(x, rep(1,length(x)))
    text(2.3, 6.99, "m>2")
    text(3, 5.2, "m=2")
    text(2.5, 2.65, "1<m<2")
    text(3.5, 1.3, "m=1")
    text(3, 0.52, "m<1")

    par(mfrow=c(3,1))
    curve(dweibull(x, shape=0.5, scale=2^(-1/0.5)), from=0, to=4, ylab="f(x)", lwd=3)
    text(2,2,"Weibull(2,0.5) density")
    curve(dweibull(x, shape=1.5, scale=2^(-1/1.5)), from=0, to=4, ylab="f(x)", lwd=3)
    text(2,0.6,"Weibull(2,1.5) density")
    curve(dweibull(x, shape=3, scale=2^(-1/3)), from=0, to=4, ylab="f(x)", lwd=3)
    text(2,0.8,"Weibull(2,3) density")
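
    The parameterization can be checked by comparing the survival function \(G(x) = \exp(-\lambda x^m)\) with pweibull (a minimal sketch, assuming \(\lambda = 2\) and \(m = 1.5\)):

    lambda <- 2; m <- 1.5; x <- 1.3
    exp(-lambda * x^m)                                                 # G(x) from the formula
    pweibull(x, shape = m, scale = lambda^(-1/m), lower.tail = FALSE)  # the same value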

    Example : time to the next disaster

    • Suppose that the chance of a nuclear power station having a major accident in any given year is proportional to its age.

    • Also suppose we keep building nuclear power stations at a rate of one per year, until we have a major accident.

    • Let \(T\) be the time until the first major accident.

    • Let \(\alpha t\) be the chance that a single power station of age \(t\) has an accident in the next year.

    • After \(t\) years there are \(t\) power stations operating, so the total accident rate is \(\alpha t^2\).
      • \(m \lambda t^{m-1} = \alpha t^2\) implies that \(m = 3\) and \(\lambda = \alpha/3\).
    • Thus, \(T\) ~ Weibull\((\alpha/3, 3)\).

    • For example, let \(\alpha\) be 1/1,000,000.

    • The probability that the first major accident occurs within the next 50 years is computed below.

    alpha <- 1/1000000
    m <- 3
    lambda <- alpha/3
    pweibull(50, shape = m, scale = lambda^(-1/m))
    ## [1] 0.04081054

    Poisson process

    • A Poisson process is the continuous-time analogue of a sequence of independent trials.

    • Suppose that we have a sequence of events, occurring at some rate \(\lambda\) per unit time.
      • The expected number of events in \((s, t)\) is \(\lambda(t-s)\).
      • The infinitesimal probability that an event occurs in \((t, t+dt)\) is \(\lambda dt\).
    • Let \(T_k\) be the time between the \((k-1)\)-th and \(k\)-th events.

    • Let \(N(s, t)\) be the number of events that have occurred during \((s, t)\).
      • \(T_k\) are i.i.d. \(\exp(\lambda)\) random variables.
      • \(N(s, t)\) follows pois\((\lambda(t-s))\).
      • If \((a, b)\) and \((s, t)\) are disjoint, then \(N(a, b)\) and \(N(s, t)\) are independent.
    • A sample path of a Poisson process can be simulated as shown below.
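
    Since the inter-event times \(T_k\) are i.i.d. exponential, the event times are cumulative sums of exponential random variables (a minimal sketch, assuming \(\lambda = 2\)):

    lambda <- 2
    T_k <- rexp(20, rate = lambda)  # i.i.d. exponential inter-event times
    event_times <- cumsum(T_k)      # times of the first 20 events
    plot(event_times, 1:20, type = "s", xlab = "time", ylab = "number of events")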

    Merging and Thinning

    • Merging
      • If we merge a Poisson process with rate \(\lambda_1\) with an independent Poisson process with rate \(\lambda_2\), then the result is a Poisson process with rate \(\lambda_1 + \lambda_2\).

    • Thinning
      • We thin a process by tossing a (biased) coin for each event:
        • heads we keep it
        • tails it is discarded
      • If we start with a Poisson process with rate \(\lambda\), and the probability of keeping an event is \(p\),

      • then the result is a Poisson process with rate \(p\lambda\), as checked below.
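
    A simulation sketch of the thinning property (assuming \(\lambda = 3\) and \(p = 0.4\); the thinned process should have rate \(p\lambda = 1.2\)):

    lambda <- 3; p <- 0.4
    event_times <- cumsum(rexp(100000, rate = lambda))  # Poisson process with rate lambda
    kept <- event_times[runif(100000) < p]              # keep each event with probability p
    1/mean(diff(kept))                                  # approximately p * lambda = 1.2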

    Queuing simulation

    • Customer interarrival times and service times are assumed to be exponentially distributed.

    • The probability of a new customer arrival during (t, t + dt) is \(\lambda\)dt.

    • The probability that a service finishes during (t, t + dt) is \(\mu\)dt.

    • We do not allow an arrival and a departure to occur in the same interval (t, t + dt),
      • because that probability is negligibly small.
    lambda <- 1     # arrival rate
    mu <- 1.1       # service rate
    t.end <- 100    # duration of simulation
    t.step <- 0.05  # time step
    queue <- rep(0, t.end/t.step + 1)
    for (i in 2:length(queue)) {
      if (runif(1) < lambda*t.step) { # arrival
        queue[i] <- queue[i-1] + 1
      } else if (runif(1) < mu*t.step) { # departure
        queue[i] <- max(0, queue[i-1] - 1)
      } else { # nothing happens
        queue[i] <- queue[i-1]
      }
    }
    plot(seq(0, t.end, t.step), queue, type='l', xlab='time', ylab='queue size')
    title(paste('Queuing Simulation. Arrival rate:', lambda,'Service rate:', mu))
