Statistics — Edexcel GCSE Mathematics Revision Notes

What you'll learn

Statistics forms a substantial component of both Foundation and Higher tier Edexcel GCSE Mathematics papers, typically accounting for 15-20% of the total marks. This topic examines your ability to collect, organise, analyse and interpret data using measures of central tendency, measures of spread, and various graphical representations. Questions appear across all three papers and often combine multiple statistical techniques in real-world contexts.

Key terms and definitions

Mean — the sum of all values divided by the number of values; sensitive to extreme values and can be calculated from frequency tables and grouped data.

Median — the middle value when data is arranged in ascending order; for even datasets, the mean of the two middle values.

Mode — the most frequently occurring value in a dataset; a dataset can be bimodal (two modes) or have no mode.

Range — the difference between the highest and lowest values; provides a simple measure of spread but is affected by outliers.

Interquartile range (IQR) — the difference between the upper quartile (Q₃) and lower quartile (Q₁); measures the spread of the middle 50% of data.

Outlier — a value that lies outside the expected range, typically defined as below Q₁ - 1.5 × IQR or above Q₃ + 1.5 × IQR.

Cumulative frequency — the running total of frequencies up to a particular value; used to find medians and quartiles for grouped data.

Correlation — the relationship between two variables, described as positive, negative, or zero; does not imply causation.

Core concepts

Measures of central tendency

The three main averages used in Edexcel GCSE Mathematics are mean, median and mode. Each has specific applications and limitations.

Calculating the mean from raw data:

Add all values together
Divide by the total number of values
Formula: x̄ = Σx ÷ n

Calculating the mean from a frequency table:

Multiply each value by its frequency
Sum these products to find Σfx
Divide by the total frequency (Σf)
Formula: x̄ = Σfx ÷ Σf
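
The frequency-table steps above can be sketched in Python. This is a minimal illustration, not an exam method: the function name and the sibling data are made up for the example.

```python
def mean_from_frequency_table(table):
    """Return Σfx ÷ Σf for a list of (value, frequency) pairs."""
    total_fx = sum(x * f for x, f in table)   # Σfx
    total_f = sum(f for _, f in table)        # Σf
    return total_fx / total_f


# Hypothetical data: number of siblings reported by 20 students
siblings = [(0, 4), (1, 8), (2, 5), (3, 3)]
print(mean_from_frequency_table(siblings))  # Σfx = 27, Σf = 20, so 1.35
```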

Estimating the mean from grouped data:

Find the midpoint of each class interval
Multiply each midpoint by its frequency
Calculate Σfx ÷ Σf
This gives an estimate because exact values are unknown
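
As a rough sketch, the grouped-data estimate can be written as follows. The function name and the (lower, upper, frequency) layout are assumptions chosen for illustration; the data is the puzzle-time table used in the worked examples.

```python
def estimate_grouped_mean(classes):
    """Estimate the mean from (lower bound, upper bound, frequency) triples."""
    total_fx = 0
    total_f = 0
    for lower, upper, frequency in classes:
        midpoint = (lower + upper) / 2       # stands in for the unknown exact values
        total_fx += midpoint * frequency
        total_f += frequency
    return total_fx / total_f


# Puzzle times: 20 < t ≤ 30, 30 < t ≤ 40, 40 < t ≤ 50, 50 < t ≤ 60
puzzle_times = [(20, 30, 8), (30, 40, 15), (40, 50, 18), (50, 60, 9)]
print(estimate_grouped_mean(puzzle_times))  # 2030 ÷ 50 = 40.6
```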

Finding the median:

Arrange data in ascending order
For n values, the median position is (n + 1) ÷ 2
If n is odd, the median is the middle value
If n is even, calculate the mean of the two middle values
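
The median steps can be sketched like this. The function name is illustrative, and the index arithmetic is simply the (n + 1) ÷ 2 position rule translated to Python's 0-based lists.

```python
def median(values):
    """Return the middle value, or the mean of the two middle values."""
    if not values:
        raise ValueError("median of an empty dataset is undefined")
    ordered = sorted(values)          # step 1: ascending order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]           # odd n: the single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2   # even n: mean of the middle pair


print(median([12, 15, 18, 20, 22, 23, 25, 28, 30, 32, 35, 38, 42]))  # 7th value: 25
print(median([3, 1, 4, 2]))                                          # (2 + 3) ÷ 2 = 2.5
```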

Finding the mode:

Identify the value with the highest frequency
For grouped data, identify the modal class (the class interval with the highest frequency)
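
A short sketch of finding the mode, including the bimodal and no-mode cases. The function name is hypothetical, and treating "every value equally frequent" as "no mode" is a convention assumed here.

```python
from collections import Counter


def modes(values):
    """Return a sorted list of modes; an empty list means no mode."""
    counts = Counter(values)
    highest = max(counts.values())
    # Assumed convention: if all distinct values tie, report no mode
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []
    return sorted(v for v, c in counts.items() if c == highest)


print(modes([4, 4, 5, 6]))      # one mode: [4]
print(modes([1, 2, 2, 3, 3]))   # bimodal: [2, 3]
print(modes([1, 2, 3]))         # no mode: []
```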

Measures of spread

Measures of spread quantify how dispersed data values are around the average.

Range calculation:

Range = highest value - lowest value
Quick to calculate but heavily influenced by extreme values
Often paired with mean or median in exam questions

Quartiles and interquartile range:

Lower quartile (Q₁): the value 25% of the way through the ordered dataset
Upper quartile (Q₃): the value 75% of the way through the ordered dataset
Q₁ position = (n + 1) ÷ 4
Q₃ position = 3(n + 1) ÷ 4
IQR = Q₃ - Q₁
IQR is more resistant to outliers than range
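
The position rules above can be sketched as follows. The function names are illustrative. Handling a fractional position by linear interpolation is an assumption: a position ending in .5 gives the mean of the two neighbouring values, but other fractions may be treated differently in some textbooks and mark schemes.

```python
import math


def value_at_position(ordered, position):
    """Value at a 1-based position in an ordered list, interpolating if fractional."""
    lower_index = math.floor(position)
    fraction = position - lower_index
    lower_value = ordered[lower_index - 1]
    if fraction == 0:
        return lower_value
    upper_value = ordered[lower_index]
    return lower_value + fraction * (upper_value - lower_value)


def quartiles(values):
    """Return (Q1, Q3, IQR) using the (n + 1) ÷ 4 and 3(n + 1) ÷ 4 positions."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 3:
        raise ValueError("need at least 3 values for these position rules")
    q1 = value_at_position(ordered, (n + 1) / 4)
    q3 = value_at_position(ordered, 3 * (n + 1) / 4)
    return q1, q3, q3 - q1


data = [12, 15, 18, 20, 22, 23, 25, 28, 30, 32, 35, 38, 42]
print(quartiles(data))  # Q1 = 19, Q3 = 33.5, IQR = 14.5
```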

Identifying outliers:

Lower boundary: Q₁ - 1.5 × IQR
Upper boundary: Q₃ + 1.5 × IQR
Any value outside these boundaries is classified as an outlier
Outliers may be removed when calculating the mean to give a more representative average
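
A minimal sketch of the outlier check, assuming Q₁ and Q₃ have already been found. The function names and the second dataset are hypothetical.

```python
def outlier_boundaries(q1, q3):
    """Return (lower boundary, upper boundary) using the 1.5 × IQR rule."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def find_outliers(values, q1, q3):
    """Return the values lying outside the two boundaries."""
    lower, upper = outlier_boundaries(q1, q3)
    return [v for v in values if v < lower or v > upper]


# Quartiles from the box plot worked example: Q1 = 19, Q3 = 33.5
print(outlier_boundaries(19, 33.5))   # (-2.75, 55.25)

# Hypothetical dataset with Q1 = 10 and Q3 = 20, so the boundaries are -5 and 35
print(find_outliers([-6, 2, 12, 35, 36], 10, 20))   # [-6, 36]
```

A value exactly on a boundary (35 here) is not outside it, so it is not flagged.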

Statistical diagrams and graphs

Frequency polygons:

Plot frequency against the midpoint of each class interval
Join points with straight lines
Used to compare distributions when multiple datasets are shown on the same axes

Cumulative frequency diagrams:

Plot cumulative frequency against the upper class boundary
Join points with a smooth curve
Used to estimate median, quartiles and percentiles
Median estimate: read across from ½n on the y-axis
Q₁ estimate: read across from ¼n on the y-axis
Q₃ estimate: read across from ¾n on the y-axis
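
The running totals and the "read across" step can be approximated in code. This sketch joins the plotted points with straight lines (linear interpolation), which is an assumption: a value read from a hand-drawn smooth curve may differ slightly. The function names are illustrative and the data is the puzzle-time table from the worked examples.

```python
from itertools import accumulate


def cumulative_frequencies(frequencies):
    """Running total of the class frequencies."""
    return list(accumulate(frequencies))


def estimate_from_cumulative(lowest_bound, upper_bounds, cum_freqs, target):
    """Estimate the data value at cumulative frequency `target`."""
    if target <= 0 or target > cum_freqs[-1]:
        raise ValueError("target must be between 0 and the total frequency")
    prev_x, prev_cf = lowest_bound, 0
    for x, cf in zip(upper_bounds, cum_freqs):
        if target <= cf:
            # Straight-line interpolation between two plotted points
            return prev_x + (target - prev_cf) / (cf - prev_cf) * (x - prev_x)
        prev_x, prev_cf = x, cf


upper_bounds = [30, 40, 50, 60]                       # upper class boundaries
cum_freqs = cumulative_frequencies([8, 15, 18, 9])    # [8, 23, 41, 50]
n = cum_freqs[-1]

print(cum_freqs)
print(estimate_from_cumulative(20, upper_bounds, cum_freqs, n / 4))      # Q1 estimate
print(estimate_from_cumulative(20, upper_bounds, cum_freqs, n / 2))      # median estimate
print(estimate_from_cumulative(20, upper_bounds, cum_freqs, 3 * n / 4))  # Q3 estimate
```

With these figures the estimates come out at 33 seconds for Q₁, about 41.1 seconds for the median and about 48.1 seconds for Q₃.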

Box plots:

Visual representation showing five key values: minimum, Q₁, median, Q₃, maximum
The box spans from Q₁ to Q₃ (containing the middle 50% of data)
Whiskers extend to the smallest and largest values that are not outliers
Outliers are shown as individual points beyond the whiskers
Useful for comparing distributions between datasets

Histograms:

Bars represent class intervals of potentially different widths
The frequency density determines the height of each bar
Formula: frequency density = frequency ÷ class width
Area of each bar represents the frequency
Essential to label y-axis as "Frequency density" in exams
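
A small sketch of the frequency density calculation for unequal class widths. The function name and the data are hypothetical, chosen only to illustrate the formula.

```python
def frequency_densities(classes):
    """Return frequency ÷ class width for each (lower, upper, frequency) triple."""
    return [frequency / (upper - lower) for lower, upper, frequency in classes]


# Hypothetical classes with unequal widths: 10, 5, 5 and 20
classes = [(0, 10, 5), (10, 15, 10), (15, 20, 12), (20, 40, 8)]
densities = frequency_densities(classes)
print(densities)  # bar heights: 0.5, 2.0, 2.4, 0.4

# Check: bar area (height × width) recovers each frequency
for (lower, upper, frequency), height in zip(classes, densities):
    print(frequency, height * (upper - lower))
```

The widest class (20 to 40) has the second-highest frequency but the shortest bar, which is why plotting raw frequency would mislead.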

Scatter graphs:

Display bivariate data (two variables) as coordinate points
Used to identify correlation between variables
Positive correlation: as one variable increases, the other increases
Negative correlation: as one variable increases, the other decreases
Zero/no correlation: no apparent relationship
Line of best fit: drawn to pass through the mean point with roughly equal numbers of points on each side
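
The mean point that a line of best fit should pass through can be computed as below. The function name and the revision data are made up for illustration.

```python
def mean_point(points):
    """Return (mean of x values, mean of y values) for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    return mean_x, mean_y


# Hypothetical bivariate data: (hours revised, test score)
results = [(1, 42), (2, 50), (3, 55), (4, 61), (5, 72)]
print(mean_point(results))  # (3.0, 56.0)
```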

Time series graphs

Time series graphs display how data changes over time, with time always on the horizontal axis.

Identifying trends:

Overall upward trend: values generally increasing
Overall downward trend: values generally decreasing
Seasonal or cyclical patterns: regular fluctuations

Moving averages:

Used to smooth out fluctuations and identify underlying trends
For a 4-point moving average, calculate the mean of consecutive groups of 4 values
Plot each moving average at the midpoint of the time period it covers
Common in Higher tier questions involving seasonal data
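
A minimal sketch of a 4-point moving average, with the plotting position for each one. The function names and the quarterly sales figures are hypothetical.

```python
def moving_averages(values, window=4):
    """Mean of each consecutive group of `window` values."""
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


def plotting_positions(count, window=4):
    """Midpoint (1-based time position) of the period each average covers."""
    return [i + (window + 1) / 2 for i in range(count)]


# Hypothetical quarterly sales over two years (seasonal pattern, rising trend)
sales = [20, 30, 50, 40, 24, 34, 54, 44]
averages = moving_averages(sales)
print(averages)                            # [35.0, 36.0, 37.0, 38.0, 39.0]
print(plotting_positions(len(averages)))   # [2.5, 3.5, 4.5, 5.5, 6.5]
```

The first average covers quarters 1 to 4, so it is plotted at 2.5, halfway between the 2nd and 3rd quarters. The averages rise steadily even though the raw figures fluctuate.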

Sampling and data collection

Types of data:

Qualitative data: descriptive (categories, opinions)
Quantitative data: numerical values
Discrete data: can only take specific values (usually whole numbers)
Continuous data: can take any value within a range (measurements)

Sampling methods:

Simple random sample: every member has equal chance of selection
Systematic sample: select every nth member from a list
Stratified sample: population divided into groups (strata); sample size from each group proportional to its size in the population
Formula for stratified sampling: (stratum size ÷ population size) × sample size
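
The stratified sampling formula can be sketched as follows, using the year-group figures from the worked examples. The function name is illustrative. Rounding each result to a whole number is an assumption, and with other figures the rounded totals may not add up to the required sample size, so always check the sum.

```python
def stratified_allocation(strata, sample_size):
    """Return the number to select from each stratum, rounded to whole people."""
    population = sum(strata.values())
    return {
        name: round(size * sample_size / population)   # (stratum ÷ population) × sample
        for name, size in strata.items()
    }


year_groups = {"Year 10": 240, "Year 11": 180}
allocation = stratified_allocation(year_groups, 70)
print(allocation)                 # {'Year 10': 40, 'Year 11': 30}
print(sum(allocation.values()))   # check: 70
```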

Bias in sampling:

Samples must be representative of the population
Convenience sampling (selecting easily accessible members) introduces bias
Time and location of sampling can affect representativeness

Worked examples

Example 1: Calculating mean from grouped frequency table

The table shows the time taken by 50 students to complete a puzzle.

Time (t seconds)	Frequency
20 < t ≤ 30	8
30 < t ≤ 40	15
40 < t ≤ 50	18
50 < t ≤ 60	9

Estimate the mean time taken.

Solution:

Find midpoints: 25, 35, 45, 55

Midpoint (x)	Frequency (f)	fx
25	8	200
35	15	525
45	18	810
55	9	495
Totals	Σf = 50	Σfx = 2030

Mean = Σfx ÷ Σf = 2030 ÷ 50 = 40.6 seconds

Example 2: Drawing and interpreting a box plot

For the dataset: 12, 15, 18, 20, 22, 23, 25, 28, 30, 32, 35, 38, 42

(a) Find the median and quartiles
(b) Calculate the interquartile range
(c) Identify any outliers

Solution:

(a) n = 13 values (already ordered)

Median position = (13 + 1) ÷ 2 = 7th value = 25

Q₁ position = (13 + 1) ÷ 4 = 3.5, so Q₁ = mean of 3rd and 4th values = (18 + 20) ÷ 2 = 19

Q₃ position = 3(13 + 1) ÷ 4 = 10.5, so Q₃ = mean of 10th and 11th values = (32 + 35) ÷ 2 = 33.5

(b) IQR = Q₃ - Q₁ = 33.5 - 19 = 14.5

(c) Lower boundary = 19 - 1.5 × 14.5 = 19 - 21.75 = -2.75

Upper boundary = 33.5 + 1.5 × 14.5 = 33.5 + 21.75 = 55.25

All values lie between -2.75 and 55.25, so there are no outliers.

Example 3: Stratified sampling

A school has 240 Year 10 students and 180 Year 11 students. A stratified sample of 70 students is required. How many students should be selected from each year group?

Solution:

Total population = 240 + 180 = 420

Year 10: (240 ÷ 420) × 70 = 0.5714... × 70 = 40 students

Year 11: (180 ÷ 420) × 70 = 0.4286... × 70 = 30 students

Check: 40 + 30 = 70 ✓

Common mistakes and how to avoid them

Confusing mean and median — Students often calculate mean when the question asks for median. Read the question carefully and understand that median requires ordered data and finding the middle position, while mean requires summing all values and dividing.

Using values instead of midpoints for grouped data — When estimating the mean from grouped data, always use the midpoint of each class interval, not the class boundaries. For 20 < t ≤ 30, the midpoint is 25, not 20 or 30.

Calculating range from frequency tables — Range is highest value minus lowest value. For grouped data, use the highest possible value (top of highest class) minus lowest possible value (bottom of lowest class), not the midpoints.

Forgetting frequency density for histograms — A major error is plotting frequency instead of frequency density on the y-axis. Always divide frequency by class width, and remember that unequal class widths make this essential.

Misinterpreting correlation as causation — Just because two variables are correlated does not mean one causes the other. Ice cream sales and drowning incidents may correlate (both increase in summer), but ice cream doesn't cause drowning.

Incorrect box plot construction — The box must extend from Q₁ to Q₃, with a line at the median inside the box. Whiskers extend to minimum and maximum (or to the boundary values if outliers exist). Outliers are plotted separately as crosses or dots beyond the whiskers.

Exam technique for Statistics

Command word recognition — "Calculate" or "work out" requires a numerical answer with working shown. "Estimate" typically indicates grouped data where the exact mean cannot be found. "Describe" for correlation requires both type (positive/negative/none) and strength (strong/weak) plus context from the question.

Show working clearly — Statistics questions often carry method marks. Write out formulas, show substitution, then calculate. For finding quartiles, explicitly state positions before identifying values. Examiners award marks for correct method even if the final answer is wrong.

Label axes and diagrams fully — Histograms must have "Frequency density" on the y-axis (1 mark often allocated). Box plots need a scale. Cumulative frequency curves need both axes labelled with units. Missing labels cost marks even when the diagram is otherwise correct.

Justify statistical choices — Higher tier questions may ask which average is most appropriate. Link your choice to the data characteristics: mean is affected by outliers so median is better for skewed data; mode is suitable for qualitative data; mean uses all data values so is representative when no extreme values exist.

Quick revision summary

Statistics in Edexcel GCSE Mathematics requires mastery of averages (mean, median, mode), measures of spread (range, IQR), and graphical representations (histograms, box plots, cumulative frequency curves, scatter graphs). Calculate mean from frequency tables using Σfx ÷ Σf and estimate from grouped data using midpoints. Find median and quartiles from ordered data or cumulative frequency graphs. Draw histograms using frequency density (frequency ÷ class width). Identify correlation type and strength from scatter graphs. Apply stratified sampling using proportional selection. Show all working, label diagrams correctly, and match statistical measures to data context.
