R Programming (8) - Continuous random variable (2020. 3. 28.)
Continuous random variable
Some continuous random variables in R
Some of the continuous distributions provided by R, together with the names of their parameter inputs.
Distribution   R name (dist)   Parameter names
Uniform        unif            min=0, max=1
Exponential    exp             rate=1
Chi-square     chisq           df
Gamma          gamma           shape, rate=1
Normal         norm            mean=0, sd=1
t              t               df
Weibull        weibull         shape, scale=1
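As a quick reminder of the d/p/q/r naming convention in the table, the sketch below exercises all four functions for the standard normal (values chosen only for illustration):

```r
# The d/p/q/r prefix convention, shown with the standard normal
dnorm(0)    # density at 0, equal to 1/sqrt(2*pi)
pnorm(0)    # P(X <= 0) = 0.5
qnorm(0.5)  # median, 0
rnorm(3)    # three random draws
```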
Uniform distribution
If the probability that \(X\) lies in a given subinterval of \([a,b]\) depends only on the length of the subinterval and not on its location,
then \(X\) is said to have a uniform distribution on \([a,b]\).
Write \(X\) ~ \(U[a,b]\)
The pdf, mean and variance are
- \(f(x) = 1/(b-a)\), for \(a \leq x \leq b\)
- \(E[X] = (a+b)/2\)
- \(Var(X) = (b-a)^2/12\)
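These formulas are easy to verify numerically; the sketch below (with an arbitrarily chosen interval \([2, 5]\)) compares simulated moments against the closed forms:

```r
a <- 2; b <- 5                 # an arbitrary interval for illustration
set.seed(42)
x <- runif(100000, a, b)
mean(x)                        # close to (a + b)/2 = 3.5
var(x)                         # close to (b - a)^2/12 = 0.75
integrate(function(t) dunif(t, a, b), a, b)$value  # the pdf integrates to 1
```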
Uniform distribution in R
- density function of the uniform on the interval from min to max
dunif(x, min=0, max=1, log = FALSE)
- the uniform distribution function
punif(q, min=0, max=1, lower.tail = TRUE, log.p = FALSE)
- quantile function
qunif(p, min=0, max=1, lower.tail = TRUE, log.p = FALSE)
- generate the uniform random variable
runif(n, min=0, max=1)
- Example :
var(runif(100000)) # approximately 1/12
## [1] 0.08382216
From uniform random variable to Bernoulli random variable
We can generate a Bernoulli random variable from the uniform distribution.
rBernoulli <- function(p = 0.5){
  if (runif(1) < p) 1 else 0
}
rBernoulli()
## [1] 0
Or, more conveniently,
rBernoulli <- function(p = 0.5) as.integer(runif(1) < p)  # TRUE -> 1, FALSE -> 0
rBernoulli()
## [1] 0
Also, we can generate multiple samples.
# generate multiple samples
rBernoulli <- function(n, p = 0.5) as.integer(runif(n) < p)
rBernoulli(n = 100, p = 0.9)
## [1] 1 1 0 0 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 1 1 1 1 1 0 1 1 1 1 ## [36] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 ## [71] 0 1 1 1 1 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1
We can further construct a random number generator for the binomial distribution.
my_rbinom <- function(n, size, prob){
  sapply(1:n, function(x) sum(rBernoulli(size, prob)))  # x is a dummy variable
}
my_rbinom(10, 5, 0.6)
## [1] 4 4 4 2 2 2 3 4 1 3
N <- 1000; n <- 20; p <- 0.4
frequency <- table(factor(my_rbinom(N, n, p), levels = 0:n))
relative_frequency <- as.numeric(frequency/N)
x <- 0:n; pmf <- dbinom(x, n, p)
par(mfrow = c(1, 2))
plot(x, relative_frequency, 'h')
plot(x, pmf, 'h')
Two dimensional uniform random variable
Two independent uniform random variables.
n <- 10000
X <- runif(n, -1, 1)
Y <- runif(n, -1, 1)
plot(X, Y, cex = 0.1)  # check the scatter plot
Normal distribution
Normal distribution in R
- the normal density function
dnorm(x, mean = 0, sd = 1, log = FALSE)
- the normal distribution function
pnorm(q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
- Example :
pnorm(1.96, lower.tail=TRUE)
## [1] 0.9750021
pnorm(1.96, lower.tail=FALSE)
## [1] 0.0249979
- quantile function
qnorm(p, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
- Example:
qnorm(0.975, 0, 1)
## [1] 1.959964
- normal random variables
rnorm(n, mean = 0, sd = 1)
Normal density with different mean
x <- seq(-3, 5, 0.01)
plot(x, dnorm(x, 0, 1), type = "l", xlab = "x", ylab = "y", xlim = c(-3, 5))
lines(x, dnorm(x, 1, 1), col = "red")
lines(x, dnorm(x, 2, 1), col = "blue")
title("Normal distribution: sigma=1, mu=0,1,2")
Normal density with different sigma
x <- seq(-10, 10, 0.01)
plot(x, dnorm(x, 0, 1), type = "l", xlab = "x", ylab = "y", xlim = c(-10, 10))
lines(x, dnorm(x, 0, 2), col = "red")
lines(x, dnorm(x, 0, 3), col = "blue")
title("Normal distribution: mu=0, sigma=1,2,3")
Comparison between histogram and pdf
z <- rnorm(10000)
par(las = 1)
hist(z, breaks = seq(-5, 5, 0.2), freq = FALSE)
phi <- function(x) exp(-x^2/2)/sqrt(2*pi)
x <- seq(-5, 5, 0.1)
lines(x, phi(x))  # or lines(x, dnorm(x))
Sum of independent normals
\(X\) ~ \(N(\mu_1, \sigma_1^2)\), \(Y\) ~ \(N(\mu_2, \sigma_2^2)\)
Then \(X+Y\) ~ \(N(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2)\)
mu1 <- 1; mu2 <- 1
sigma1 <- 1; sigma2 <- 2
X <- rnorm(50000, mu1, sigma1)
Y <- rnorm(50000, mu2, sigma2)
hist(X + Y, breaks = seq(-10, 14, 0.2), freq = FALSE)
x <- seq(-10, 14, 0.1)
lines(x, dnorm(x, mu1 + mu2, sqrt(sigma1^2 + sigma2^2)))
Sample mean and sample variance
The sample mean and sample variance from a normal distribution are independent.
In general, it is hard to determine whether two random variables are independent.
In this example, we instead compute the correlation between the two random variables.
sample_size <- 100  # sample size for each statistic
N <- 10000          # number of simulations
mu <- 3; sigma <- 2
X_bar <- S2 <- numeric(N)  # preallocation
for (i in 1:N){
  rns <- rnorm(sample_size, mu, sigma)  # random normal sample
  X_bar[i] <- mean(rns)
  S2[i] <- var(rns)
}
cat("The correlation between X_bar and S^2 is", cor(X_bar, S2), "\n")
## The correlation between X_bar and S^2 is 0.001101206
t distribution
t density
curve(dt(x, df = 1), from = -6, to = 6, ylab = "density", ylim = c(0, 0.4))
curve(dt(x, df = 2), add = TRUE)
curve(dt(x, df = 4), add = TRUE)
curve(dt(x, df = 10), add = TRUE)
curve(dnorm(x), lwd = 2, add = TRUE)
title("t densities with 1, 2, 4 and 10 d.f., and normal limit in bold")
t statistics
In the code below, I intentionally used sapply for the practical purpose of vectorized programming. You can rewrite this code with a for loop.
mu <- 5
sigma <- 3
n <- 7      # sample size for each simulation
N <- 10000  # number of simulations
t_statistic <- numeric(N)  # t_statistic is a vector of length N
t_statistic <- sapply(1:N, function(x) {
  rsample <- rnorm(n, mu, sigma)
  (mean(rsample) - mu)/(sd(rsample)/sqrt(n))
})  # x is a dummy variable
hist(t_statistic, breaks = seq(min(t_statistic), max(t_statistic) + 0.2, 0.2),
     probability = TRUE)
x <- seq(-6, 6, 0.01)
lines(x, dt(x, n - 1))
QQ plot
qqnorm(t_statistic)
chi-square distribution
chi-square density
curve(dchisq(x, df = 1), from = 0, to = 8, ylab = "density", ylim = c(0, 0.5))
curve(dchisq(x, df = 2), col = "red", add = TRUE)
curve(dchisq(x, df = 3), col = "blue", add = TRUE)
curve(dchisq(x, df = 4), col = "darkgreen", add = TRUE)
legend("topright", c("df=1", "df=2", "df=3", "df=4"),
       pch = "----", col = c("black", "red", "blue", "darkgreen"))
chi-square simulation
dfs <- c(1, 2, 3, 4)
N <- 5000
par(mfrow = c(2, 2))
for (df in dfs){
  chi_rv <- numeric(N)
  for (i in 1:df) chi_rv <- chi_rv + rnorm(N)^2  # sum of df squared normals
  hist(chi_rv, breaks = seq(0, 30, 0.5), freq = FALSE, main = paste("d.f. = ", df))
  curve(dchisq(x, df), add = TRUE)
}
Sample mean and sample variance
As demonstrated below, if the population distribution is chi-square,
then the sample mean and sample variance are not independent.
sample_size <- 100  # sample size for each statistic
N <- 10000          # number of simulations
dof <- 10
X_bar <- S2 <- numeric(N)  # preallocation
for (i in 1:N){
  chi_rv <- rchisq(sample_size, dof)  # random chi-square sample
  X_bar[i] <- mean(chi_rv)
  S2[i] <- var(chi_rv)
}
cat("The correlation between X_bar and S^2 is ", cor(X_bar, S2), "\n")
## The correlation between X_bar and S^2 is 0.5030476
Lifetime model: exponential and Weibull
- Let \(X \geq 0\) be the time until some event occurs, such as the breakdown of some mechanical component; \(X\) is then the lifetime of the component.
- Let \(f\) and \(F\) be the pdf and cdf of \(X\), then we define the survivor function as
- \(G(x) = P(X > x) = 1 - F(x)\)
- \(G(x)\) : the probability that the component will survive until \(x\).
- The failure rate is called the hazard function \(\lambda(x)\).
- \(\lambda(x)\,dx = P(\text{component fails between } x \text{ and } x+dx \mid \text{still working at } x) = \frac{f(x)\,dx}{G(x)}\)
- Cancelling \(dx\), we have
- \(\lambda(x) = \frac{f(x)}{G(x)}\)
- We can find the density \(f\) from \(\lambda\) as follows:
- \(f(x) = \frac{d F(x)}{dx} = \frac{d}{dx} (1 - G(x)) = - \frac{dG(x)}{dx}\)
- \(\lambda(x) = \frac{f(x)}{G(x)} = -\frac{G'(x)}{G(x)} = - \frac{d}{dx} \log G(x)\)
- \(G(x) = \exp(- \int_0^x \lambda(u) du)\)
- \(f(x) = \lambda(x) \exp ( - \int_0^x \lambda(u) d u )\)
Exponential distribution
If \(\lambda(x) = \lambda\), constant rate of failure, then we say \(X\) has an exponential distribution and write \(X\) ~ exp\((\lambda)\).
- \(f(x) = \lambda \exp(-\lambda x)\)
- \(F(x) = 1 - \exp(-\lambda x)\)
- \(\mu = 1/\lambda\)
- \(\sigma = 1/\lambda\)
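A quick consistency check: pexp should agree with the closed form \(F(x) = 1 - e^{-\lambda x}\), and the sample mean of rexp draws should be close to \(1/\lambda\) (the rate 2 below is an arbitrary choice for illustration):

```r
lambda <- 2                        # arbitrary rate for illustration
pexp(1.5, rate = lambda)           # CDF from R
1 - exp(-lambda * 1.5)             # closed form; the two values agree
set.seed(1)
mean(rexp(100000, rate = lambda))  # close to 1/lambda = 0.5
```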
Memoryless property
- aging has no effect, that is, the component fails at random.
- \(P(X > s + t | X > s) = P(X > s + t)/P(X > s) = P(X > t)\)
- Given that you have survived until age \(s\), the probability of surviving an additional time \(t\) is the same as if you had just been born.
Exponential distribution in R
- density function with \(\lambda =\) rate
dexp(x, rate = 1, log = FALSE)
- distribution function with \(\lambda =\) rate
pexp(q, rate = 1, lower.tail = TRUE, log.p = FALSE)
- quantile function with \(\lambda =\) rate
qexp(p, rate = 1, lower.tail = TRUE, log.p = FALSE)
- random variables with \(\lambda =\) rate
rexp(n, rate = 1)
Exponential density
x <- seq(0, 8, 0.01)
plot(x, dexp(x, rate = 2), type = "l", ylab = "f(x)", col = 1)
lines(x, dexp(x, rate = 1), col = 2)
lines(x, dexp(x, rate = 0.5), col = 3)
legend(5.5, 1.9, c("lambda = 2", "lambda = 1", "lambda = 0.5"),
       col = c(1, 2, 3), lty = c(1, 1, 1))
Exponential distribution function
x <- seq(0, 8, 0.01)
plot(x, pexp(x, rate = 2), type = "l", ylab = "F(x)", col = 1)
lines(x, pexp(x, rate = 1), col = 2)
lines(x, pexp(x, rate = 0.5), col = 3)
legend(5.5, 0.4, c("lambda = 2", "lambda = 1", "lambda = 0.5"),
       col = c(1, 2, 3), lty = c(1, 1, 1))
Example : radioactive decay
Uranium-238 decays into thorium-234 at rate \(\lambda\) per year.
The half-life of uranium-238 is \(4.47*10^9\) years.
- the time it takes for half of some lump of uranium-238 to decay into thorium-234.
Let \(X\) be the time of decay of a single atom, then \(X \sim\) exp\((\lambda)\) and
- \(P(X > 4.47*10^9) = 0.5\)
Since \(P(X > x) = \exp(-\lambda * x)\), we have
(lambda <- log(2)/(4.47*10^9))
## [1] 1.550665e-10
Memoryless property
The exponential distribution has a memoryless property.
Let \(X\) follow an exponential distribution.
The following simulates \(P(X > 4)\).
n <- 1000000
y <- rexp(n, rate = 0.1)
larger_4 <- y[y > 4]
(ratio1 <- length(larger_4)/n)
## [1] 0.670187
The following simulates \(P(X > 12 \mid X > 8)\), which is close to the previous result.
larger_8 <- y[y > 8]
larger_12 <- y[y > 12]
(ratio2 <- length(larger_12)/length(larger_8))
## [1] 0.6706099
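The same comparison can be done exactly with pexp, confirming that \(P(X > 12 \mid X > 8) = P(X > 4)\):

```r
rate <- 0.1
p4 <- pexp(4, rate, lower.tail = FALSE)      # P(X > 4) = exp(-0.4)
p12_given_8 <- pexp(12, rate, lower.tail = FALSE) /
  pexp(8, rate, lower.tail = FALSE)          # P(X > 12 | X > 8)
c(p4, p12_given_8)                           # both equal exp(-0.4), about 0.6703
```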
Weibull distribution
\(X\) has a Weibull distribution with parameters \(\lambda\) and \(m\), if the hazard function is
- \(\lambda(x) = m \lambda x^{(m-1)}\)
\(X\) ~ Weibull\((\lambda, m)\)
\(G(x) = \exp(-\lambda x^m)\)
\(f(x) = m \lambda x^{m-1} \exp(-\lambda x^m)\)
R's parameterization of the Weibull distribution differs from the one above:
- shape argument: \(m\)
- scale argument: \(\lambda^{-1/m}\)
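To confirm the mapping, the sketch below (with arbitrarily chosen \(\lambda = 2\), \(m = 1.5\)) checks that pweibull with shape = \(m\) and scale = \(\lambda^{-1/m}\) reproduces \(F(x) = 1 - \exp(-\lambda x^m)\):

```r
lambda <- 2; m <- 1.5; x <- 0.8                # arbitrary values for illustration
pweibull(x, shape = m, scale = lambda^(-1/m))  # R's parameterization
1 - exp(-lambda * x^m)                         # the form used above; identical
```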
x <- seq(0, 4, 0.001)
curve(2*x, from = 0, to = 4, ylab = "hazard function")
lines(x, 2.5*x^(1.5))
lines(x, 0.5*x^(-0.5))
lines(x, 1.5*x^0.5)
lines(x, rep(1, length(x)))
text(2.3, 6.99, "m>2")
text(3, 5.2, "m=2")
text(2.5, 2.65, "1<m<2")
text(3.5, 1.3, "m=1")
text(3, 0.52, "m<1")
par(mfrow = c(3, 1))
curve(dweibull(x, shape = 0.5, scale = 2^(-1/0.5)), from = 0, to = 4, ylab = "f(x)", lwd = 3)
text(2, 2, "Weibull(2,0.5) density")
curve(dweibull(x, shape = 1.5, scale = 2^(-1/1.5)), from = 0, to = 4, ylab = "f(x)", lwd = 3)
text(2, 0.6, "Weibull(2,1.5) density")
curve(dweibull(x, shape = 3, scale = 2^(-1/3)), from = 0, to = 4, ylab = "f(x)", lwd = 3)
text(2, 0.8, "Weibull(2,3) density")
Example : time to the next disaster
Suppose that the chance of a nuclear power station having a major accident in any given year is proportional to its age.
Also suppose we keep building nuclear power stations at a rate of one per year, until we have a major accident.
Let \(T\) be the time until the first major accident.
Let \(\alpha\) \(t\) be the chance that a single power station age \(t\) has an accident in the next year.
- After \(t\) years there are \(t\) power stations operating, so the total accident rate is \(\alpha t^2\).
- \(m \lambda t^{m-1} = \alpha t^2\) implies \(m = 3\) and \(\lambda = \alpha/3\).
Thus, \(T\) ~ Weibull\((\alpha/3, 3)\).
For example, let \(\alpha\) be 1/1,000,000.
The probability that the first major accident occurs within the next 50 years is
alpha <- 1/1000000
m <- 3
lambda <- alpha/3
pweibull(50, shape = m, scale = lambda^(-1/m))
## [1] 0.04081054
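The same value follows directly from the survivor function \(G(t) = \exp(-\lambda t^3)\):

```r
alpha <- 1/1000000
lambda <- alpha/3
1 - exp(-lambda * 50^3)   # P(T <= 50), identical to the pweibull result
```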
Poisson process
A Poisson process is the continuous-time analogue of a sequence of independent trials.
- Suppose that we have a sequence of events, occurring at some rate \(\lambda\) per unit time.
- The expected number of events in \((s, t)\) is \(\lambda(t-s)\).
- The infinitesimal probability that an event occurs in \((t, t+dt)\) is \(\lambda dt\).
Let \(T_k\) be the time between the \((k-1)\)-th and \(k\)-th events, and let \(N(s, t)\) be the number of events that have occurred during \((s, t)\).
- The \(T_k\) are i.i.d. exp\((\lambda)\) random variables.
- \(N(s, t)\) follows pois\((\lambda(t-s))\).
- If \((a, b)\) and \((s, t)\) are disjoint, then \(N(a, b)\) and \(N(s, t)\) are independent.
A Poisson process looks like:
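One way to draw such a sample path (with an assumed rate \(\lambda = 2\)): simulate i.i.d. exponential inter-arrival times \(T_k\) and plot the cumulative counting process.

```r
set.seed(123)
lambda <- 2                               # assumed event rate
inter_arrival <- rexp(40, rate = lambda)  # T_k, i.i.d. exp(lambda)
event_times <- cumsum(inter_arrival)      # times of the 1st, 2nd, ... events
plot(stepfun(event_times, 0:40), do.points = FALSE,
     xlab = "time", ylab = "N(0, t)", main = "A Poisson process sample path")
```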
Merging and Thinning
- Merging
- If we merge a Poisson process with rate \(\lambda_1\) with an independent Poisson process with rate \(\lambda_2\), then the result is a Poisson process with rate \(\lambda_1 + \lambda_2\).
- Thinning
- We thin a process by tossing a (biased) coin for each event: heads we keep it, tails it is discarded.
- If we start with a Poisson process with rate \(\lambda\), and the probability of keeping an event is \(p\), then the result is a Poisson process with rate \(p\lambda\).
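A small simulation of thinning (with assumed \(\lambda = 3\) and keep probability \(p = 1/3\)): counts of the thinned process in unit intervals should behave like Poisson with rate \(p\lambda = 1\), so their mean and variance should both be near 1.

```r
set.seed(1)
lambda <- 3; p <- 1/3; t_end <- 10000   # assumed rates for illustration
# simulate the original process up to time t_end
arrivals <- cumsum(rexp(5 * lambda * t_end, rate = lambda))
arrivals <- arrivals[arrivals < t_end]
# thin: keep each event independently with probability p
kept <- arrivals[runif(length(arrivals)) < p]
counts <- hist(kept, breaks = 0:t_end, plot = FALSE)$counts  # per-unit counts
c(mean(counts), var(counts))   # both close to p * lambda = 1
```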
Queuing simulation
Customer inter-arrival times and service times are assumed to be exponentially distributed.
The probability of a new customer arriving during \((t, t + dt)\) is \(\lambda dt\).
The probability that service finishes during \((t, t + dt)\) is \(\mu dt\).
- We do not allow an arrival and a departure to occur in the same interval \((t, t + dt)\), because that probability is negligibly small.
lambda <- 1     # arrival rate
mu <- 1.1       # service rate
t.end <- 100    # duration of simulation
t.step <- 0.05  # time step
queue <- rep(0, t.end/t.step + 1)
for (i in 2:length(queue)) {
  if (runif(1) < lambda*t.step) {      # arrival
    queue[i] <- queue[i-1] + 1
  } else if (runif(1) < mu*t.step) {   # departure
    queue[i] <- max(0, queue[i-1] - 1)
  } else {                             # nothing happens
    queue[i] <- queue[i-1]
  }
}
plot(seq(0, t.end, t.step), queue, type = 'l', xlab = 'time', ylab = 'queue size')
title(paste('Queuing Simulation. Arrival rate:', lambda, 'Service rate:', mu))