Saturday, February 6, 2016

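I played a bit with sample sizes for estimating a population mean. Even with a population of 1M and a sample of only 100 people, the estimate (SampleMean) is typically within about 1% of TrueMean. This matches the standard error of the mean, TrueSD/sqrt(SampleSize) = 50/10 = 5, which is 1% of TrueMean = 500. For example, one run of the following script outputs: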
"Population 1000000, Sample size 0.01%, Estimation error: 0.3218%"
TrueMean <- 500
TrueSD <- 50
PopulationSize <- 1000000
Population <- rnorm(PopulationSize, mean=TrueMean, sd=TrueSD)
 
# SamplePercent <- 0.01
# SampleSize <- PopulationSize * SamplePercent / 100.0
SampleSize <- 100
MySample <- sample(Population, size=SampleSize)
SampleMean <- mean(MySample)
SampleSD <- sd(MySample)
 
 
# Histogram of the sample on a density scale, so the population density can be overlaid
hist(MySample, breaks=10, prob=TRUE, col="#DDDDDD", ylim=c(0,0.009))
lines(density(Population), col="blue", lwd=1)
 
SDBounds <- c(SampleMean - SampleSD, SampleMean + SampleSD)
vlines <- c(TrueMean, SampleMean, SDBounds)
style <- c("dashed", "dotted", "solid", "solid")
legend_text <- c("TrueMean", "SampleMean", "SampleMean-SD", "SampleMean+SD")
my_colors <- c("red", "blue", "pink", "pink")
 
for(i in seq_along(vlines)){
  abline(v=vlines[i], col=my_colors[i], lty=style[i], lwd=2)
}
# Use the same line types and width as the plotted lines so the legend matches
legend(x="topleft", legend=legend_text, col=my_colors, lty=style, lwd=2)
 
# Absolute error, so the reported percentage is never negative
SampleMeanError <- abs(SampleMean - TrueMean)
 
printf <- function(...) invisible(print(sprintf(...)))
printf("Population %d, Sample size %.2f%%, Estimation error: %4.4f%%", 
       PopulationSize, SampleSize/PopulationSize * 100, 100.0 * SampleMeanError/TrueMean)
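
A single run only shows one draw, so here is a rough sketch of how the "within 1%" observation could be checked across several sample sizes. It is not part of the original script: the seed, the `SampleSizes` vector, the `Repeats` count, and the names `StdErrPercent` and `WithinOnePercent` are my own assumptions for illustration. It compares the theoretical standard error, TrueSD/sqrt(n), with the fraction of repeated samples whose mean lands within 1% of TrueMean.

```r
# Assumed setup, mirroring the script above (the seed is an addition for reproducibility)
set.seed(42)
TrueMean <- 500
TrueSD <- 50
PopulationSize <- 1000000
Population <- rnorm(PopulationSize, mean = TrueMean, sd = TrueSD)

# Hypothetical sample sizes to compare
SampleSizes <- c(10, 100, 1000)

# Theoretical standard error of the mean, expressed as a percentage of TrueMean
StdErrPercent <- 100 * (TrueSD / sqrt(SampleSizes)) / TrueMean

# Empirical check: draw Repeats samples of each size and record how often
# the sample mean is within 1% of TrueMean
Repeats <- 200
WithinOnePercent <- sapply(SampleSizes, function(n) {
  errors <- replicate(Repeats, abs(mean(sample(Population, size = n)) - TrueMean))
  mean(100 * errors / TrueMean <= 1)
})

print(data.frame(SampleSizes, StdErrPercent, WithinOnePercent))
```

For SampleSize = 100 the standard error is exactly 1% of TrueMean, so roughly two thirds of samples should fall within 1%, not all of them. The population size hardly matters here: what drives the precision is the sample size, with the error shrinking like 1/sqrt(n).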