**Describing Data**

Every economist needs to be able to collect, analyse, manipulate, understand and report data. In a daily research environment, we need to deal with randomness and variation in order to apply our knowledge. We therefore summarize the most important and useful tools for every economist.

**Key Definitions:**

- A **population** consists of all the members of a group about which you want to draw a conclusion. The size of the population depends on what you are interested in. (μ, σ, Ν)
- A **sample** is the portion of the population selected for analysis. Collecting information on the whole population can be difficult and costly, therefore we sample. (x, s, n)
- A **parameter** is a numerical measure that describes a characteristic of a population.
- A **statistic** is a numerical measure that describes a characteristic of a sample.

**A note on Notation**

- Greek letters (μ, σ, Ν) are used for population data
- Roman letters (x, s, n) are used for sample data


**Scatter diagrams** are very common in econometrics and are used to examine possible relationships between two numerical variables.

- In a scatter diagram one variable is measured on the vertical axis (Y) and the other variable is measured on the horizontal axis (X)
- X = independent variable
- Y = dependent variable

Figure 1, Plot A: Scatter plot of the relationship between the share of food (WFOOD) and total expenditure (TOTEXP).

So, how do we actually describe our data?

We need to know the mean, median and mode of the data, but most of the discussion covers variation and shape: the range, interquartile range, variance, standard deviation, coefficient of variation and skewness.

Let’s start with central tendency. What are the mean, median and mode?

**Mean**

- Commonly called the average
- Calculated as the sum of values divided by the number of values
- Affected by extreme values (outliers)

| Population Mean | Sample Mean |
|---|---|
| $\mu = \frac{1}{N}\sum_{i=1}^{N} X_i$ | $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$ |
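As a small illustration of the definition above, the sketch below computes the mean as the sum of the values divided by their number, and shows how one extreme value pulls it upwards. The numbers are hypothetical and chosen only for the example.

```python
from statistics import mean

# Hypothetical sample of seven observations (not from the text).
values = [3, 5, 7, 8, 9, 11, 13]

# Same sample with the largest value replaced by an outlier.
with_outlier = values[:-1] + [130]

# Mean = sum of values / number of values.
sample_mean = sum(values) / len(values)
outlier_mean = sum(with_outlier) / len(with_outlier)

print(sample_mean)    # mean of the original sample
print(outlier_mean)   # much larger: the outlier drags the mean up

# The standard library gives the same result.
assert mean(values) == sample_mean
```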

**Median**

- In an ordered array, the median is the ‘middle’ number. It is defined by its physical position in the ranked data, not by averaging all the values.
- The location of the median is (n + 1)/2. This is not the *value* of the median, only the *position* of the median in the ranked data.

Rule 1: If the number of values in the data set is odd, the median is the middle ranked value

Rule 2: If the number of values in the data set is even, the median is the mean (average) of the two middle ranked values
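The two rules can be sketched in a few lines. The function name `median_by_position` is hypothetical; it first finds the position (n + 1)/2 in the ranked data and then applies Rule 1 or Rule 2.

```python
from statistics import median

def median_by_position(data):
    """Median via the (n + 1) / 2 position rule on ranked data."""
    ranked = sorted(data)
    n = len(ranked)
    position = (n + 1) / 2              # 1-based position, not the value
    if n % 2 == 1:
        # Rule 1: odd n, the position is a whole number.
        return ranked[int(position) - 1]
    # Rule 2: even n, the position falls halfway between two ranks,
    # so average the two middle ranked values.
    lower = ranked[n // 2 - 1]
    upper = ranked[n // 2]
    return (lower + upper) / 2

odd_sample = [7, 3, 9, 5, 11]    # ranked: 3 5 7 9 11, position 3
even_sample = [3, 5, 7, 9]       # ranked: 3 5 7 9, position 2.5

print(median_by_position(odd_sample))
print(median_by_position(even_sample))

# Cross-check against the standard library.
assert median_by_position(odd_sample) == median(odd_sample)
assert median_by_position(even_sample) == median(even_sample)
```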

**Mode**

- Value that occurs most often (the most frequent). It can be more than one value.
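Because a data set can have more than one mode, a function that returns a list is convenient. A minimal sketch using Python's standard `statistics.multimode`, on hypothetical data:

```python
from statistics import multimode

one_mode = [2, 3, 3, 5, 8]
two_modes = [2, 3, 3, 5, 5, 8]

# multimode returns every value that shares the highest frequency.
modes_a = multimode(one_mode)
modes_b = multimode(two_modes)

print(modes_a)
print(modes_b)
```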

**Quartiles**

- Quartiles split the ranked data into 4 segments with an equal number of values per segment
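- The first quartile, Q_{1}, is the value for which 25% of the observations are smaller and 75% are larger

**Q**_{1} position = (n+1)/4

- The second quartile, Q_{2}, is the same as the median (50% are smaller, 50% are larger)

**Q**_{2} position = (n+1)/2

- The third quartile, Q_{3}, is the value for which 75% of the observations are smaller and only 25% are larger

**Q**_{3} position = 3(n+1)/4

As with the median, these formulas give the *position* of each quartile in the ranked data, not its value.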
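The position formulas above can be turned into quartile values as follows. This is a sketch under one assumption that the text does not state: when a position is fractional, the value is linearly interpolated between the two neighbouring ranked observations (some textbooks round the position instead). The helper names are hypothetical, and the data need at least three values.

```python
from statistics import quantiles

def value_at_position(ranked, position):
    """Value at a 1-based, possibly fractional, position in ranked data."""
    lower = int(position)               # whole part of the position
    fraction = position - lower
    if fraction == 0:
        return ranked[lower - 1]
    # Assumption: interpolate linearly between neighbouring ranks.
    below, above = ranked[lower - 1], ranked[lower]
    return below + fraction * (above - below)

def quartiles(data):
    """Q1, Q2, Q3 from the positions (n+1)/4, (n+1)/2 and 3(n+1)/4."""
    ranked = sorted(data)
    n = len(ranked)
    return tuple(value_at_position(ranked, k * (n + 1) / 4) for k in (1, 2, 3))

data = [11, 12, 13, 16, 16, 17, 18, 21, 22]   # n = 9: positions 2.5, 5, 7.5
q1, q2, q3 = quartiles(data)
print(q1, q2, q3)

# The standard library's "exclusive" method uses the same (n + 1) positions.
assert list(quartiles(data)) == quantiles(data, n=4, method="exclusive")
```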

**Variation**

Measures of variation give information on the **spread** or **variability** of the data values.

- RANGE: The difference between the largest and the smallest values in a set of data.

Range = X_{largest} - X_{smallest}


- INTERQUARTILE RANGE (IQR): Like the median, Q_{1} and Q_{3}, the IQR is a resistant summary measure. Using the **interquartile range** avoids outlier problems because the highest- and lowest-valued observations are left out of the calculation:

IQR = Q_{3} - Q_{1}
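To see why the interquartile range is called resistant while the range is not, the sketch below computes both for a hypothetical sample, with and without an extreme value. It assumes quartiles are taken at the (n + 1)-based positions, which is what `statistics.quantiles` does with `method="exclusive"`.

```python
from statistics import quantiles

def data_range(data):
    """Largest value minus smallest value."""
    return max(data) - min(data)

def iqr(data):
    """Third quartile minus first quartile."""
    q1, _, q3 = quantiles(data, n=4, method="exclusive")
    return q3 - q1

data = [11, 12, 13, 16, 16, 17, 18, 21, 22]
with_outlier = data[:-1] + [220]      # replace the maximum by an outlier

print(data_range(data), iqr(data))
print(data_range(with_outlier), iqr(with_outlier))
```

In this example the range changes dramatically when the outlier is introduced, while the IQR is unchanged.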

- VARIANCE: The average of the squared deviations from the mean; it measures variation about the mean.

Advantages:

- Each value in the data set is used in the calculation
- Values far from the mean are given extra weight as deviations from the mean are squared

Disadvantages:

- Sensitive to extreme values (outliers)
- Measures absolute variation, not relative variation

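Sample variance and sample standard deviation:

$S^2 = \frac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{n-1}, \qquad S = \sqrt{S^2}$

The denominator (*n-1*), rather than *n*, adjusts for the bias of the sample statistic: dividing by *n* would tend to underestimate the population variance.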
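The effect of the (n-1) denominator is easy to check numerically. The sketch below computes the sum of squared deviations for a hypothetical sample and divides it by n-1 (sample variance) and by n (the population formula) for comparison.

```python
import math
from statistics import pvariance, stdev, variance

data = [2, 4, 4, 4, 5, 5, 7, 9]       # hypothetical sample, mean = 5
n = len(data)
x_bar = sum(data) / n

# Sum of squared deviations from the mean.
ss = sum((x - x_bar) ** 2 for x in data)

sample_var = ss / (n - 1)             # sample variance, denominator n - 1
population_var = ss / n               # population formula, denominator n
sample_sd = math.sqrt(sample_var)     # standard deviation, same units as data

print(sample_var, population_var, sample_sd)

# Cross-check against the standard library.
assert math.isclose(sample_var, variance(data))
assert math.isclose(population_var, pvariance(data))
assert math.isclose(sample_sd, stdev(data))
```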

- COEFFICIENT OF VARIATION (CV): Measures relative variation, i.e. variation relative to the mean. It can be used to compare two or more sets of data measured in different units, and it is always expressed as a percentage (%):

$CV = \frac{S}{\bar{X}} \times 100\%$
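A short sketch of the coefficient of variation, computed as the sample standard deviation divided by the sample mean, times 100. The two data sets (monthly income and weekly hours worked) are hypothetical; the point is that their spreads can be compared even though the units differ, and that rescaling the units leaves the result unchanged.

```python
from statistics import mean, stdev

def coefficient_of_variation(data):
    """Sample standard deviation relative to the mean, in percent."""
    return stdev(data) / mean(data) * 100

income = [1800, 2100, 2500, 3200, 4400]   # hypothetical, currency units
hours = [35, 37, 38, 40, 40]              # hypothetical, hours per week

cv_income = coefficient_of_variation(income)
cv_hours = coefficient_of_variation(hours)
print(round(cv_income, 1), round(cv_hours, 1))

# Changing the unit of measurement does not change the CV.
income_thousands = [x / 1000 for x in income]
cv_rescaled = coefficient_of_variation(income_thousands)
```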

**Shape and Skewness**

- Describes how data are distributed
- Measures of shape – Symmetric or skewed
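One quick way to label a distribution as symmetric or skewed is to compare the mean with the median. This is a common rule of thumb and an assumption here, not something the text above states: a mean above the median suggests right (positive) skew, a mean below it suggests left (negative) skew. The function name and data are hypothetical.

```python
from statistics import mean, median

def describe_shape(data, tolerance=1e-9):
    """Rule-of-thumb shape label from the gap between mean and median."""
    gap = mean(data) - median(data)
    if abs(gap) <= tolerance:
        return "symmetric"            # by this rule of thumb only
    return "right-skewed" if gap > 0 else "left-skewed"

print(describe_shape([1, 2, 3, 4, 5]))       # mean equals median
print(describe_shape([1, 2, 3, 4, 50]))      # long tail to the right
print(describe_shape([1, 47, 48, 49, 50]))   # long tail to the left
```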

**Sample Covariance**

- The sample covariance measures the direction of the linear relationship between **two numerical variables** (the direction of the association):

$\text{cov}(X, Y) = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{n-1}$
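A minimal sketch of the sample covariance, using the usual (n-1) denominator. The function name and data are hypothetical. Only the sign is interpreted: positive when the variables move together, negative when they move in opposite directions.

```python
def sample_covariance(x, y):
    """Sum of cross-products of deviations from the means, over n - 1."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    cross = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    return cross / (n - 1)

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

cov_xy = sample_covariance(x, y)                        # positive direction
cov_flipped = sample_covariance(x, [-v for v in y])     # negative direction
print(cov_xy, cov_flipped)
```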

**Sample Coefficient of Correlation** *r*

- Measures the relative *strength* of the linear relationship between two variables:

$r = \frac{\text{cov}(X, Y)}{S_X S_Y}$

where $S_X$ and $S_Y$ are the sample standard deviations of X and Y.

| **Value of** *r* | **Interpretation** |
|---|---|
| r = -1 | PERFECT negative linear relationship |
| -1 < r ≤ -0.7 | STRONG negative linear relationship |
| -0.7 < r ≤ -0.3 | MODERATE negative linear relationship |
| -0.3 < r < 0 | WEAK negative linear relationship |
| r = 0 | No linear relationship |
| 0 < r < 0.3 | WEAK positive linear relationship |
| 0.3 ≤ r < 0.7 | MODERATE positive linear relationship |
| 0.7 ≤ r < 1 | STRONG positive linear relationship |
| r = 1 | PERFECT positive linear relationship |
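The sketch below computes *r* as the sample covariance divided by the product of the two sample standard deviations, and then maps the result onto the bands in the table above. The function names and data are hypothetical, and the small floating-point tolerances used to detect r = 0 and |r| = 1 are an implementation choice, not part of the table.

```python
import math
from statistics import mean, stdev

def sample_correlation(x, y):
    """r = cov(X, Y) / (S_X * S_Y), with n - 1 denominators throughout."""
    x_bar, y_bar = mean(x), mean(y)
    cross = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    cov = cross / (len(x) - 1)
    return cov / (stdev(x) * stdev(y))

def interpret(r):
    """Label r using the bands from the interpretation table."""
    if math.isclose(r, 0.0, abs_tol=1e-12):
        return "No linear relationship"
    sign = "positive" if r > 0 else "negative"
    size = abs(r)
    if math.isclose(size, 1.0):
        strength = "PERFECT"
    elif size >= 0.7:
        strength = "STRONG"
    elif size >= 0.3:
        strength = "MODERATE"
    else:
        strength = "WEAK"
    return f"{strength} {sign} linear relationship"

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = sample_correlation(x, y)
print(round(r, 4), interpret(r))
```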