Category Archives: Data Analysis

The Underground Economy

Published by:

Individuals and firms sometimes conceal the buying and selling of goods and services, in which case their production won’t be counted in GDP. They conceal what they buy and sell for three basic reasons:

  • They are dealing with illegal goods and services, such as drugs or prostitution;
  • They want to avoid paying taxes on the income they earn;
  • They want to avoid government regulations.

Estimates of the size of the underground economy in Australia vary widely, but a study by the ABS estimated it to be 1.3% of GDP, or over $17 billion. The underground economy in some poor countries, such as Peru and Zimbabwe, may be more than half of measured GDP. I wonder how big the underground economy is in Thailand.


In the link below, you can find the entire piece of work done by the ABS:










If H0 is true then:

Z = (X̄ − μ0) / (σ / √n)  follows the standard normal distribution, N(0, 1).

This is our test statistic.

We reject H0 if the calculated value of our test statistic is less than -zα/2 or greater than +zα/2 (i.e., if it takes a value sufficiently far out in the tails of the standard normal distribution for us to think H0 is unlikely to be true).



The weights of fish in an aquaculture pond are considered to be normally distributed with a mean of 3.1 kg and a standard deviation of 1.1 kg. A random sample of size 30 is selected from the pond and the sample mean is found to be 2.37 kg. Is there sufficient evidence to indicate that the mean weight of the fish differs from 3.1 kg? Use a 10% level of significance.

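H0: μ = 3.1 versus H1: μ ≠ 3.1. At the 10% level of significance the critical values are ±1.645, and Z = (2.37 − 3.1) / (1.1 / √30) ≈ −3.63, which falls in the rejection region.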





Conclusion: Reject H0. There is sufficient evidence that the mean weight of the fish differs from 3.1 kg (at the 10% level of significance).
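
The fish example can be checked numerically. The sketch below assumes the usual two-tailed z test with known σ, Z = (X̄ − μ0) / (σ / √n); the function name `two_tailed_z_test` is only illustrative and uses Python's standard library.

```python
from math import sqrt
from statistics import NormalDist

def two_tailed_z_test(sample_mean, mu0, sigma, n, alpha):
    """Return (z, critical value, reject H0?) for H0: mu = mu0 vs H1: mu != mu0."""
    z = (sample_mean - mu0) / (sigma / sqrt(n))     # test statistic
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)    # +z_{alpha/2}
    return z, z_crit, abs(z) > z_crit               # reject if z is in either tail

# Fish example: mu0 = 3.1, sigma = 1.1, n = 30, sample mean = 2.37, alpha = 0.10
z, z_crit, reject = two_tailed_z_test(2.37, 3.1, 1.1, 30, 0.10)
print(f"z = {z:.3f}, critical values = ±{z_crit:.3f}, reject H0: {reject}")
```

The test statistic lies well beyond the lower critical value, which agrees with the conclusion stated above.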






Hypothesis Testing



  • A hypothesis is a statement (assumption) about a population parameter
    • population mean (Example: The mean monthly cell phone bill of this city is  μ = $42)
    • population proportion (Example: The proportion of adults in this city with cell phones is  π = 0.68)
  • Null Hypothesis
    • The hypothesis that assumes the status quo – that the old theory, method or standard is still true; the complement of the alternative hypothesis
    • Always contains an ‘=’, ‘≤’ or ‘≥’ sign
    • May or may not be rejected
    • Is always about a population parameter, not about a sample statistic
  • Alternative Hypothesis
    • The hypothesis that complements the null hypothesis.
    • Usually it is the hypothesis that the researcher is interested in proving
  • The Null and Alternative Hypotheses are mutually exclusive
    • i.e. only one of them can be true
  • The Null Hypothesis is assumed to be true
  • The burden of proof falls on the Alternative Hypothesis
  • Example: investigate if the mean monthly cell phone bill is $42
    • H0: μ = 42
    • H1: μ ≠ 42

Level of Significance and rejection region












Steps for the hypothesis test…

  1. State the null hypothesis, H0, and the alternative hypothesis, H1
  2. Choose the level of significance, α, and the sample size, n
  3. Determine the appropriate test statistic and sampling distribution
  4. Determine the critical values that divide the rejection and non-rejection regions
  5. Collect data and compute the value of the test statistic
  6. Make the statistical decision and state the managerial conclusion
    • If the test statistic falls into the non-rejection region, do not reject the null hypothesis H0
    • If the test statistic falls into the rejection region, reject the null hypothesis H0
    • Express the managerial conclusion in the context of the real-world problem


  • p-value: Probability of obtaining a test statistic equal to or more extreme ( ≤ or ≥ ) than the observed sample value, given H0 is true
    • Also called the observed level of significance
    • Smallest value of α for which H0 can be rejected
    • Obtain the p-value from a table or computer
  • If p-value < α, reject H0
  • If p-value ≥ α, do not reject H0
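
As a minimal sketch of the p-value rule for a two-tailed z test, the snippet below uses a hypothetical observed statistic of z = 1.5 (not taken from any example in this post). The helper names `two_tailed_p_value` and `decide` are illustrative only.

```python
from statistics import NormalDist

def two_tailed_p_value(z):
    """P(|Z| >= |z|) when H0 is true and Z is standard normal."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

def decide(p_value, alpha):
    """Apply the rule: reject H0 only if the p-value is below alpha."""
    return "reject H0" if p_value < alpha else "do not reject H0"

p = two_tailed_p_value(1.5)          # hypothetical observed test statistic
print(f"p-value = {p:.4f}")
print("alpha = 0.05:", decide(p, 0.05))
print("alpha = 0.20:", decide(p, 0.20))
```

The same data can lead to different decisions at different levels of significance, which is why α is chosen before the data are collected.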

Two population means tail test












Rules to follow:


Decision Rule:

Test Statistic:



Normal Distributions


Normal Distributions

  • The standardised normal distribution is also known as the Z distribution
    • Mean is 0
    • Standard deviation is 1
  • Characteristics of a Normal Distribution
    • Continuous Random Variable
    • - ∞ < x < + ∞
    • Curve is symmetrical around the mean (μ).
    • Area under curve = 1
    • Mean & standard deviation uniquely determine a normal distribution.

Normal Distribution




Value of Z: Z = (X − μ) / σ




To find  P(a < X < b)  when  X  is distributed normally:

  1. Draw the normal curve for the problem in terms of X
  2. Translate X-values to Z-values and put Z values on your diagram
  3. Use the Standardised Normal Table

Example: Suppose X is normally distributed with mean 8 and std dev 5. Find P(X < 8.6)

Z = (X − μ) / σ = (8.6 − 8.0) / 5.0 = 0.12

From the Standardised Normal Table, P(X < 8.6) = P(Z < 0.12) = 0.5478
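
The example with mean 8 and standard deviation 5 can also be checked without a printed table. This sketch uses `statistics.NormalDist` from Python's standard library in place of the Standardised Normal Table; the variable names are illustrative.

```python
from statistics import NormalDist

mu, sigma = 8.0, 5.0
x = 8.6

z = (x - mu) / sigma                      # translate the X-value to a Z-value
p_from_z = NormalDist().cdf(z)            # area to the left of z under the Z curve
p_direct = NormalDist(mu, sigma).cdf(x)   # same area, computed directly in X units

print(f"z = {z:.2f}, P(X < {x}) = {p_from_z:.4f}")
```

Both routes give the same probability, which is the point of standardising: one table serves every normal distribution.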









Finding the X for a Known Probability:


  1. Draw a normal curve placing all known values on it such as mean of X and Z
  2. Shade in area of interest and find cumulative probability
  3. Find the Z value for the known probability
  4. Convert to X units using the formula: X = μ + Zσ
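
A short sketch of these steps, assuming the usual conversion X = μ + Zσ. The numbers are hypothetical: X is taken to be normal with mean 8 and standard deviation 5, and we look for the value with 20% of the area below it.

```python
from statistics import NormalDist

mu, sigma = 8.0, 5.0          # hypothetical mean and standard deviation of X
probability = 0.20            # known cumulative probability, P(X < x)

z = NormalDist().inv_cdf(probability)   # step 3: Z value for the known probability
x = mu + z * sigma                      # step 4: convert back to X units

print(f"z = {z:.4f}, x = {x:.3f}")
```

Feeding `x` back into `NormalDist(mu, sigma).cdf` returns the original probability, which is a quick way to check the conversion.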


How Large is Large Enough?

  • For most population distributions, n ≥ 30 will give a sampling distribution that is nearly normal
  • For fairly symmetric population distributions, n ≥ 5 is sufficient
  • For normal population distributions, the sampling distribution of the mean is always normally distributed



  • A point estimate is the value of a single sample statistic
  • A confidence interval provides a range of values constructed around the point estimate

Point Estimates







Confidence Level (1 − α)

  • Common confidence levels = 90%, 95% or 99%
    • Also written (1 – α) = 0.90, 0.95 or 0.99
  • A relative frequency interpretation
    • In the long run, 90%, 95% or 99% of all the confidence intervals that can be constructed (in repeated samples) will contain the unknown true parameter
  • For example, if we were to randomly select 100 samples and use the results of each sample to construct 95% confidence intervals, approximately 95 out of 100 would contain the population mean
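
The relative frequency interpretation can be illustrated with a small simulation. This is only a sketch under stated assumptions: the population is normal with a hypothetical mean of 50 and a known standard deviation of 8, each sample has n = 25, and each interval is built as X̄ ± Z · σ / √n. The function name `coverage` and the seed are arbitrary.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

def coverage(mu, sigma, n, confidence, trials, seed=0):
    """Share of simulated confidence intervals that contain the true mean."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # critical value
    half_width = z * sigma / sqrt(n)
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        xbar = fmean(sample)
        if xbar - half_width <= mu <= xbar + half_width:
            hits += 1
    return hits / trials

rate = coverage(mu=50, sigma=8, n=25, confidence=0.95, trials=2000)
print(f"Share of intervals containing the true mean: {rate:.3f}")
```

The share should sit close to 0.95, although any single run will differ slightly from the nominal level.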

Confidence Interval for the Population Mean when the Standard Deviation is Known












Finding the critical value Z











So, what happens if we don’t know the standard deviation of the population?

  • If the population standard deviation σ  is unknown, we can substitute the sample standard deviation, S
  • This introduces extra uncertainty, since S is variable from sample to sample
  • So we use the t distribution instead of the normal distribution

Confidence Interval Estimate:

X̄ ± t · S / √n

where t is the critical value of the t distribution with n − 1 degrees of freedom and an area of α/2 in each tail

t distribution increasing










A random sample of n = 25 has X̄ = 50 and S = 8.

Form a 95% confidence interval for μ:

d.f. = n – 1 = 24,  so

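t = 2.0639 (the critical value with 24 degrees of freedom and an area of 0.025 in each tail), and the interval is

X̄ ± t · S / √n = 50 ± (2.0639)(8 / √25) = 50 ± 3.302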





The confidence interval is  46.698 ≤ μ ≤ 53.302
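
The interval above can be reproduced in a few lines. Python's standard library has no t distribution, so this sketch takes the critical value 2.0639 (24 degrees of freedom, 0.025 in each tail) from a t table and passes it in as an assumption. The function name `t_confidence_interval` is illustrative.

```python
from math import sqrt

def t_confidence_interval(xbar, s, n, t_crit):
    """Interval xbar ± t_crit * s / sqrt(n) for the population mean."""
    half_width = t_crit * s / sqrt(n)
    return xbar - half_width, xbar + half_width

T_CRIT_24_DF_95 = 2.0639   # assumed table value: 24 d.f., area 0.025 in each tail

lower, upper = t_confidence_interval(xbar=50, s=8, n=25, t_crit=T_CRIT_24_DF_95)
print(f"{lower:.3f} <= mu <= {upper:.3f}")
```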


Required Sample Size Example

If σ = 45, what sample size is needed to estimate the mean within ± 5 with 90% confidence?

n = Z²σ² / e² = (1.645)²(45)² / 5² = 219.19



So the required sample size is n = 220 (always round up)
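
A minimal sketch of the same calculation, assuming the usual formula n = Z²σ² / e², where e is the acceptable margin of error. The function name `required_sample_size` is illustrative.

```python
from math import ceil
from statistics import NormalDist

def required_sample_size(sigma, margin, confidence):
    """Smallest n for which the interval half-width is at most `margin`."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # about 1.645 for 90%
    return ceil((z * sigma / margin) ** 2)               # always round up

n = required_sample_size(sigma=45, margin=5, confidence=0.90)
print(n)
```

Rounding up rather than to the nearest integer guarantees that the margin of error is not exceeded.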






Basic Math for Economists – Logs



We covered indices (exponents) in the algebra material. If you haven’t read it yet, we recommend you do so, because we will need those concepts to progress with logs. The idea of logarithms (or simply logs) is based on indices. In fact, as you will find out very soon, the rules for logarithms are very similar to the rules for indices. Therefore, a recap of indices will help us understand how logarithms work.

Download the entire file: LOGARITHMS

Data Analysis for Economists – Part I


Describing Data

Every economist needs to be able to collect, analyse, manipulate, understand and report data. In a daily research environment, we need to deal with randomness and variation in order to apply our knowledge. Therefore we are going to summarize the most important and useful tools for every economist.

Key Definitions:

  • A population consists of all the members of a group about which you want to draw a conclusion. The size of the population depends on what you are interested in. (μ, σ, Ν)
  • A sample is the portion of the population selected for analysis. Collecting information on the population can be difficult and costly, therefore we sample. (x, s, n)
  • A parameter is a numerical measure that describes a characteristic of a population
  • A statistic is a numerical measure that describes a characteristic of a sample

A note on Notation

  • Greek letters (μ, σ, Ν) are used for population data
  • Roman letters (x, s, n) are used for sample data


Scatter diagrams are very common in econometrics and are used to examine possible relationships between two numerical variables.

  • In a scatter diagram one variable is measured on the vertical axis (Y) and the other variable is measured on the horizontal axis (X)
    • X = independent variable
    • Y = dependent variable

Scatter Plot










Figure 1: Plot A: Scatter Plot Relationship between Share of Food (WFOOD) and Total Expenditure (TOTEXP).

So, how do we actually describe our data?

Describing Data


We will need to know the mean, median and mode of the data, but most of the discussion will be about variation and shape: Range, Interquartile Range, Variance, Standard Deviation, Coefficient of Variation and Skewness.

So, let’s start with Central Tendency. What are the mean, median and mode?


Mean

  • Commonly called the average
  • Calculated as the sum of values divided by the number of values
  • Affected by extreme values (outliers)

Population Mean    Sample Mean
μ = ΣXi / N        X̄ = ΣXi / n



Median

  • In an ordered array, the median is the ‘middle’ value: it is defined by its position in the ranked data, not by averaging all the values
  • The location of the median, (n + 1) / 2, is not the value of the median, only its position in the ranked data

Rule 1: If the number of values in the data set is odd, the median is the middle ranked  value

Rule 2: If the number of values in the data set is even, the median is the mean (average) of the two middle ranked values



Mode

  • Value that occurs most often (the most frequent). A data set can have more than one mode.
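
The three measures of central tendency can be compared on a small data set. The values below are hypothetical and chosen only to show how an outlier moves the mean but barely moves the median.

```python
from statistics import mean, median, multimode

data = [2, 3, 3, 5, 7, 10]          # hypothetical observations
with_outlier = data + [100]         # same data plus one extreme value

print("mean:", mean(data))                  # sum of values / number of values
print("median:", median(data))              # even n: average of the two middle values
print("mode(s):", multimode(data))          # list, because there can be several modes

print("mean with outlier:", round(mean(with_outlier), 3))
print("median with outlier:", median(with_outlier))
```

With the outlier added, the mean more than triples while the median only moves from 4 to 5.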



Quartiles

  • Quartiles split the ranked data into 4 segments with an equal number of values per segment
  • The first quartile, Q1, is the value for which 25% of the observations are smaller and 75% are larger

Q1 position = (n+1)/4

  • The second quartile, Q2, is the same as the median (50% are smaller, 50% are larger)

Q2 position = (n+1)/2

  • Only 25% of the observations are greater than the third quartile Q3

Q3 position = 3(n+1)/4

As with the median, these formulas give the position of each quartile in the ranked data, not its value.
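
A sketch of the quartile positions on hypothetical data follows. One assumption goes beyond the text: when a position is fractional (for example 2.5), the value is found by linear interpolation between the two neighbouring ranked values. Other textbooks use slightly different rounding rules. The helper `value_at_position` is illustrative and expects at least 3 observations.

```python
def value_at_position(ranked, position):
    """Value at a 1-based, possibly fractional, position in ranked data."""
    lower = int(position)                 # whole part of the position
    fraction = position - lower
    if fraction == 0:
        return ranked[lower - 1]
    # Assumed rule: interpolate linearly between the two neighbouring values
    return ranked[lower - 1] + fraction * (ranked[lower] - ranked[lower - 1])

data = sorted([21, 11, 16, 13, 22, 12, 17, 16, 18])   # hypothetical observations
n = len(data)

positions = [(n + 1) / 4, (n + 1) / 2, 3 * (n + 1) / 4]
q1, q2, q3 = (value_at_position(data, p) for p in positions)

print("positions:", positions)
print("Q1, Q2, Q3:", q1, q2, q3)
```

With n = 9 the positions are 2.5, 5 and 7.5, so Q1 and Q3 fall halfway between two ranked values while Q2 is the fifth ranked value.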



Measures of variation give information on the spread or variability of the data values.

  • RANGE:

Difference between the largest and the smallest values in a set of data

Range = Xlargest - Xsmallest



  • INTERQUARTILE RANGE (IQR): Like the median, Q1 and Q3, the IQR is a resistant summary measure. Because it uses only the middle 50% of the ranked data, high- and low-valued observations are excluded from the calculation, which eliminates outlier problems:

IQR = 3rd quartile – 1st quartile
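
To see why the IQR is called resistant, the sketch below computes the range and the IQR for a hypothetical data set with and without an extreme value. It relies on `statistics.quantiles`, whose default method places the quartiles at multiples of (n + 1)/4 and interpolates between ranked values; that interpolation rule is an assumption, not something stated in this post.

```python
from statistics import quantiles

def spread(data):
    """Return (range, IQR) for a list of numbers."""
    q1, _, q3 = quantiles(data, n=4)      # three cut points: Q1, Q2, Q3
    return max(data) - min(data), q3 - q1

clean = [11, 12, 13, 16, 16, 17, 18, 21, 22]          # hypothetical observations
with_outlier = [11, 12, 13, 16, 16, 17, 18, 21, 220]  # largest value replaced

print("range, IQR:", spread(clean))
print("range, IQR with outlier:", spread(with_outlier))
```

The range jumps from 11 to 209 when the largest value changes, while the IQR stays at 7.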


  • VARIANCE: Approximately the average of the squared deviations from the mean; it shows variation about the mean. For a sample: S² = Σ(Xi − X̄)² / (n − 1)


  • Each value in the data set is used in the calculation
  • Values far from the mean are given extra weight as deviations from the mean are squared


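  • STANDARD DEVIATION: The square root of the variance, S = √S², expressed in the same units as the data
  • Sensitive to extreme values (outliers)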
  • Measures of absolute variation not relative variation


The denominator (n − 1) corrects for the bias that arises when the population variance is estimated from a sample.




  • COEFFICIENT OF VARIATION: Measures relative variation, i.e. shows variation relative to the mean: CV = (S / X̄) × 100%. It can be used to compare two or more sets of data measured in different units and is always expressed as a percentage (%).
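
The sample variance, standard deviation and coefficient of variation can be computed as follows. The data are hypothetical, and the formulas assumed are the usual sample versions with the (n − 1) denominator: S² = Σ(Xi − X̄)² / (n − 1), S = √S² and CV = (S / X̄) × 100%.

```python
from statistics import mean, stdev, variance

data = [2, 4, 4, 4, 5, 5, 7, 9]      # hypothetical sample

xbar = mean(data)
s2 = variance(data)                  # sample variance, denominator n - 1
s = stdev(data)                      # sample standard deviation, sqrt of s2
cv = s / xbar * 100                  # coefficient of variation, in percent

print(f"mean = {xbar}, variance = {s2:.4f}, std dev = {s:.4f}, CV = {cv:.2f}%")
```

Because the CV is unit-free, it can be compared across data sets measured in different units, which the variance and standard deviation cannot.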


Shape and Skewness

  • Describes how data are distributed
  • Measures of shape – Symmetric or skewed

Shape of a Data



Sample Covariance

  • The sample covariance measures the direction of the linear relationship between two numerical variables (direction of the association)


Sample Coefficient of Correlation r

  • Measures the relative strength of the linear relationship between two variables:

r = cov(X, Y) / (Sx · Sy)

where Sx and Sy are the sample standard deviations of X and Y.
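
A minimal sketch of both measures on hypothetical paired data follows. It assumes the usual sample formulas cov(X, Y) = Σ(Xi − X̄)(Yi − Ȳ) / (n − 1) and r = cov(X, Y) / (Sx · Sy), with Sx and Sy the sample standard deviations. The function names are illustrative.

```python
from statistics import mean, stdev

def sample_covariance(x, y):
    """Sum of cross-deviations from the means, divided by n - 1."""
    xbar, ybar = mean(x), mean(y)
    return sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / (len(x) - 1)

def sample_correlation(x, y):
    """Covariance rescaled by both standard deviations, so -1 <= r <= 1."""
    return sample_covariance(x, y) / (stdev(x) * stdev(y))

x = [1, 2, 3, 4, 5]        # hypothetical independent variable
y = [2, 4, 5, 4, 5]        # hypothetical dependent variable

cov = sample_covariance(x, y)
r = sample_correlation(x, y)
print(f"cov = {cov:.2f}, r = {r:.4f}")
```

The covariance only shows the direction of the relationship (positive here), while r also shows its strength on a fixed scale from −1 to 1.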


Value of r           Interpretation
r = -1               PERFECT negative linear relationship
-1 < r ≤ -0.7        STRONG negative linear relationship
-0.7 < r ≤ -0.3      MODERATE negative linear relationship
-0.3 < r < 0         WEAK negative linear relationship
r = 0                No linear relationship
0 < r < 0.3          WEAK positive linear relationship
0.3 ≤ r < 0.7        MODERATE positive linear relationship
0.7 ≤ r < 1          STRONG positive linear relationship
r = 1                PERFECT positive linear relationship