Every economist needs to be able to collect, analyse, manipulate, understand and report data. In a daily research environment, we have to deal with randomness and variation in order to apply our knowledge. We therefore summarize the most important and useful tools for every economist.
- A population consists of all the members of a group about which you want to draw a conclusion. The size of the population depends on what you are interested in. (μ, σ, Ν)
- A sample is the portion of the population selected for analysis. Collecting information on the population can be difficult and costly, therefore we sample. (x, s, n)
- A parameter is a numerical measure that describes a characteristic of a population
- A statistic is a numerical measure that describes a characteristic of a sample
A note on Notation
- Greek letters (μ, σ, Ν) are used for population data
- Roman letters (x, s, n) are used for sample data
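To make the population/sample distinction concrete, here is a minimal sketch using only Python's standard library. The population is simulated (hypothetical household incomes, not real data), and the names `population`, `sample`, `mu`, `sigma`, `x_bar` and `s` are chosen for illustration only.

```python
import random
import statistics

# Hypothetical population: monthly incomes of N = 10,000 households
# (simulated values, not real data).
rng = random.Random(42)  # fixed seed so the run is reproducible
population = [rng.gauss(2500, 600) for _ in range(10_000)]

# Parameters describe the population (Greek letters, capital N).
N = len(population)
mu = statistics.fmean(population)
sigma = statistics.pstdev(population)  # population formula, divides by N

# A sample is the portion of the population selected for analysis.
sample = rng.sample(population, 50)

# Statistics describe the sample (Roman letters, small n).
n = len(sample)
x_bar = statistics.fmean(sample)
s = statistics.stdev(sample)  # sample formula, divides by n - 1

print(f"population: N={N}, mu={mu:.1f}, sigma={sigma:.1f}")
print(f"sample:     n={n}, x_bar={x_bar:.1f}, s={s:.1f}")
```

The sample statistics will usually be close to the population parameters without matching them exactly, which is the randomness we have to deal with whenever we sample.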
Scatter diagrams are very common in econometrics and are used to examine possible relationships between two numerical variables.
- In a scatter diagram one variable is measured on the vertical axis (Y) and the other variable is measured on the horizontal axis (X)
- X = independent variable
- Y = dependent variable
Figure 1 (Plot A): Scatter plot of the relationship between the share of food (WFOOD) and total expenditure (TOTEXP).
So, how do we actually describe our data?
To describe data we need measures of central tendency (the mean, median and mode), but most of this section deals with variation and shape: the range, interquartile range, variance, standard deviation, coefficient of variation and skewness.
So, let’s start with central tendency. What are the mean, median and mode?
- Mean: commonly called the average
- Calculated as the sum of values divided by the number of values
- Affected by extreme values (outliers)
- Median: in an ordered array, the median is the ‘middle’ number, i.e. the value at the middle position, not the average.
- The location of the median is (n + 1)/2. This is not the value of the median, only its position in the ranked data
Rule 1: If the number of values in the data set is odd, the median is the middle ranked value
Rule 2: If the number of values in the data set is even, the median is the mean (average) of the two middle ranked values
- Mode: the value that occurs most often (the most frequent). There can be more than one mode.
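The three measures of central tendency can be checked quickly with Python's built-in `statistics` module. This is only a sketch on made-up numbers; the lists `values`, `with_outlier` and `bimodal` are hypothetical.

```python
import statistics

values = [3, 5, 5, 7, 8, 9]  # hypothetical data, n = 6 (even)

mean = statistics.mean(values)        # sum of values / number of values
median = statistics.median(values)    # Rule 2: mean of the two middle ranked values
modes = statistics.multimode(values)  # list of the most frequent value(s)

# Add one extreme value: n = 7 (odd), so Rule 1 applies to the median.
with_outlier = values + [100]
mean_outlier = statistics.mean(with_outlier)
median_outlier = statistics.median(with_outlier)

# A data set can have more than one mode.
bimodal = [1, 1, 2, 3, 3]
bimodal_modes = statistics.multimode(bimodal)

print(mean, median, modes)
print(mean_outlier, median_outlier)
print(bimodal_modes)
```

Adding the single outlier pulls the mean up sharply while the median barely moves, which is why the median is preferred when extreme values are present.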
- Quartiles split the ranked data into 4 segments with an equal number of values per segment
- The first quartile, Q1, is the value for which 25% of the observations are smaller and 75% are larger
Position of Q1 = (n+1)/4
- The second quartile, Q2, is the same as the median (50% are smaller, 50% are larger)
Position of Q2 = (n+1)/2
- Only 25% of the observations are greater than the third quartile, Q3
Position of Q3 = 3(n+1)/4
As with the median, these formulas give positions in the ranked data, not the quartile values themselves.
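The (n+1)-based formulas above locate the quartiles in the ranked data. A small sketch of how they could be turned into values follows. One assumption is ours, not the text's: when a position is fractional (e.g. 2.5) we interpolate linearly between the two neighbouring ranked values; some textbooks round the position instead. The helper names `value_at_position` and `quartiles` are hypothetical.

```python
import statistics


def value_at_position(ranked, position):
    """Value at a 1-based, possibly fractional, position in ranked data."""
    lower = int(position)  # whole part of the position
    fraction = position - lower
    if fraction == 0:
        return ranked[lower - 1]
    # Assumption: interpolate linearly between the two neighbouring values.
    return ranked[lower - 1] + fraction * (ranked[lower] - ranked[lower - 1])


def quartiles(data):
    """Q1, Q2, Q3 from the positions (n+1)/4, (n+1)/2 and 3(n+1)/4."""
    ranked = sorted(data)
    n = len(ranked)
    if n < 3:
        raise ValueError("need at least 3 observations")
    positions = [(n + 1) / 4, (n + 1) / 2, 3 * (n + 1) / 4]
    return [value_at_position(ranked, p) for p in positions]


data = [11, 12, 13, 16, 16, 17, 18, 21, 22]  # hypothetical data, n = 9
q1, q2, q3 = quartiles(data)
print(q1, q2, q3)

# The standard library's default ("exclusive") method uses the same positions.
print(statistics.quantiles(data, n=4))
```

With n = 9 the positions are 2.5, 5 and 7.5, so Q1 lies halfway between the 2nd and 3rd ranked values and Q2 is simply the 5th ranked value.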
Measures of variation give information on the spread or variability of the data values.
Range: the difference between the largest and the smallest values in a set of data
Range = Xlargest - Xsmallest
Interquartile range (IQR): like the median, Q1 and Q3, the IQR is a resistant summary measure. Because the highest and lowest quarters of the observations are excluded from the calculation, it is not distorted by outliers:
IQR = 3rd quartile – 1st quartile
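To see why the IQR is called resistant, we can compare it with the range on two made-up data sets that differ only in their largest value. This is a sketch, assuming the quartiles are computed with `statistics.quantiles` (whose default method uses (n+1)-based positions); `data_range` and `iqr` are hypothetical helper names.

```python
import statistics


def data_range(values):
    """Range = largest value - smallest value."""
    return max(values) - min(values)


def iqr(values):
    """IQR = 3rd quartile - 1st quartile."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1


clean = [11, 12, 13, 16, 16, 17, 18, 21, 22]          # hypothetical data
with_outlier = [11, 12, 13, 16, 16, 17, 18, 21, 220]  # last value is extreme

print(data_range(clean), iqr(clean))
print(data_range(with_outlier), iqr(with_outlier))
```

The range jumps from 11 to 209 because of a single observation, while the IQR stays at 7 in both cases.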
- VARIANCE: The average of the squared deviations from the mean; it shows variation about the mean.
- Each value in the data set is used in the calculation
- Values far from the mean are given extra weight as deviations from the mean are squared
- Sensitive to extreme values (outliers)
- A measure of absolute variation, not relative variation
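Sample variance: S² = Σ(Xi - X̄)² / (n - 1). The sample standard deviation S is the square root of the sample variance. The denominator (n - 1), rather than n, corrects the downward bias of the sample variance as an estimator of the population variance.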
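The role of the (n - 1) denominator is easiest to see by computing the variance step by step. The sketch below uses made-up numbers and a hypothetical helper `sample_variance`, and compares the result with the standard library's `statistics.variance` (divides by n - 1) and `statistics.pvariance` (divides by n).

```python
import math
import statistics


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = [(x - mean) ** 2 for x in values]
    return sum(squared_deviations) / (n - 1)


values = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data, mean = 5

s2 = sample_variance(values)  # 32 / 7
s = math.sqrt(s2)             # sample standard deviation

print(s2, s)
print(statistics.variance(values))   # same n - 1 formula
print(statistics.pvariance(values))  # population formula, divides by n
```

Dividing by n - 1 instead of n always gives the slightly larger number, which compensates for the fact that deviations are measured from the sample mean rather than the unknown population mean.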
Coefficient of variation (CV): measures relative variation, i.e. variation relative to the mean. It can be used to compare two or more sets of data measured in different units and is always expressed as a percentage (%): CV = (S / X̄) × 100%.
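The coefficient of variation is the sample standard deviation divided by the mean, times 100. A short sketch on two made-up data sets in different units (the lists `heights_cm` and `weights_kg` and the helper name are hypothetical) shows why a relative measure is useful:

```python
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation as a percentage of the mean.

    Only meaningful when the mean is not zero.
    """
    return statistics.stdev(values) / statistics.mean(values) * 100


heights_cm = [160, 170, 180]  # hypothetical data in centimetres
weights_kg = [60, 70, 80]     # hypothetical data in kilograms

cv_heights = coefficient_of_variation(heights_cm)
cv_weights = coefficient_of_variation(weights_kg)

print(f"heights: s = {statistics.stdev(heights_cm)}, CV = {cv_heights:.2f}%")
print(f"weights: s = {statistics.stdev(weights_kg)}, CV = {cv_weights:.2f}%")
```

Both data sets have the same standard deviation (10), yet relative to their means the weights vary more than twice as much as the heights.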
Shape and Skewness
- Describes how data are distributed
- Measures of shape – Symmetric or skewed
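The text does not give a formula for skewness, so the sketch below is our own assumption: it uses one common definition, the moment-based coefficient (third central moment divided by the second central moment raised to the power 1.5). A value near zero suggests a symmetric distribution, a positive value a right-skewed one and a negative value a left-skewed one. The helper name `skewness` and all data are hypothetical.

```python
import statistics


def skewness(values):
    """Moment-based skewness: m3 / m2**1.5 (undefined for constant data)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5


symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 3, 4, 20]    # long tail of high values
left_skewed = [1, 17, 18, 19, 20]  # long tail of low values

for name, data in [("symmetric", symmetric),
                   ("right-skewed", right_skewed),
                   ("left-skewed", left_skewed)]:
    print(name, round(skewness(data), 3),
          "mean:", statistics.mean(data),
          "median:", statistics.median(data))
```

In these examples the mean is pulled towards the long tail: it sits above the median in the right-skewed data and below it in the left-skewed data. This is a useful rule of thumb, though not a guarantee for every data set.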
- The sample covariance measures the direction of the linear relationship between two numerical variables (direction of the association)
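As a rough illustration, the sample covariance can be computed as the sum of the products of the paired deviations from the means, divided by n - 1. The data and the helper name `sample_covariance` below are hypothetical.

```python
def sample_covariance(xs, ys):
    """Sum of (x - mean_x) * (y - mean_y) over all pairs, divided by n - 1."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    products = [(x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)]
    return sum(products) / (n - 1)


x = [1, 2, 3, 4, 5]         # hypothetical independent variable
y_up = [2, 4, 5, 4, 5]      # tends to rise with x
y_down = [5, 4, 5, 4, 2]    # tends to fall as x rises

cov_up = sample_covariance(x, y_up)
cov_down = sample_covariance(x, y_down)
print(cov_up, cov_down)
```

Only the sign is easy to interpret: a positive covariance means the variables tend to move together, a negative one that they move in opposite directions. The magnitude depends on the units of measurement.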
Sample Coefficient of Correlation r
- Measures the relative strength of the linear relationship between two variables:
r = cov(X, Y) / (Sx · Sy), where cov(X, Y) is the sample covariance and Sx and Sy are the sample standard deviations of X and Y.
| Value of r | Interpretation |
|---|---|
| r = -1 | PERFECT negative linear relationship |
| -1 < r ≤ -0.7 | STRONG negative linear relationship |
| -0.7 < r ≤ -0.3 | MODERATE negative linear relationship |
| -0.3 < r < 0 | WEAK negative linear relationship |
| r = 0 | NO linear relationship |
| 0 < r < 0.3 | WEAK positive linear relationship |
| 0.3 ≤ r < 0.7 | MODERATE positive linear relationship |
| 0.7 ≤ r < 1 | STRONG positive linear relationship |
| r = 1 | PERFECT positive linear relationship |
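Putting the pieces together, the sketch below computes r as the sample covariance divided by the product of the two sample standard deviations, and then labels the result with the cut-offs from the table above. The function names `sample_correlation` and `describe` and the data are hypothetical; the tolerance check for r = ±1 is our own choice to cope with floating-point rounding.

```python
import math


def sample_correlation(xs, ys):
    """r = sample covariance / (S_x * S_y)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    s_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    return cov / (s_x * s_y)


def describe(r):
    """Label r using the cut-offs 0.3 and 0.7 from the table."""
    if math.isclose(abs(r), 1.0):
        strength = "PERFECT"
    elif r == 0:
        return "NO linear relationship"
    elif abs(r) >= 0.7:
        strength = "STRONG"
    elif abs(r) >= 0.3:
        strength = "MODERATE"
    else:
        strength = "WEAK"
    direction = "negative" if r < 0 else "positive"
    return f"{strength} {direction} linear relationship"


x = [1, 2, 3, 4, 5]  # hypothetical data
y = [2, 4, 5, 4, 5]

r = sample_correlation(x, y)
print(round(r, 4), describe(r))

# An exact linear rule (y = 2x) gives a perfect positive relationship.
r_perfect = sample_correlation(x, [2 * value for value in x])
print(round(r_perfect, 4), describe(r_perfect))
```

Unlike the covariance, r has no units and always lies between -1 and 1, so values from different data sets can be compared directly.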