A time series is a sequence of data points collected over time (e.g., daily stock prices, monthly sales, hourly temperature readings). To build effective forecasting models, we first need to understand the underlying structure of our data.
1. Time Series Decomposition
Decomposition is a statistical task that deconstructs a time series into several components, each representing one of the underlying categories of patterns. It helps us see the "ingredients" of our data. The three main components are:
- Trend: The long-term direction of the data. Is it increasing, decreasing, or staying flat over time?
- Seasonality: A repeating, fixed-period pattern in the data (e.g., sales are always higher in December, web traffic is higher on weekdays).
- Residuals (or Noise): The random, irregular component left over after the trend and seasonality have been removed.
We can perform this decomposition easily using Python's statsmodels library.
```python
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose
import matplotlib.pyplot as plt

# Assume 'df' is a pandas DataFrame with a datetime index and a 'value' column.
# For this example, let's create some sample data with a trend, seasonality, and noise.
data = {'value': [i + (i // 12) * 5 + 10 * (i % 12 < 6) + 5 * (i % 12 > 8) + i / 10 * (i % 3 - 1) for i in range(120)]}
df = pd.DataFrame(data, index=pd.date_range(start='2015-01-01', periods=120, freq='M'))  # use freq='ME' in pandas >= 2.2

# Decompose the time series.
# The model can be 'additive' or 'multiplicative':
#   Additive:       Y(t) = Trend(t) + Seasonality(t) + Residual(t)
#   Multiplicative: Y(t) = Trend(t) * Seasonality(t) * Residual(t)
result = seasonal_decompose(df['value'], model='additive', period=12)  # period=12 for monthly data

# Plot the decomposed components
fig = result.plot()
plt.suptitle('Time Series Decomposition', y=1.02)
plt.show()
```
2. Stationarity
A stationary time series is one whose statistical properties such as mean, variance, and autocorrelation are constant over time. It's a flat-looking series without a trend, and its fluctuations are consistent.
Why is stationarity important? Many popular time series forecasting models, like ARIMA, are based on the assumption that the underlying data is stationary. If the data is not stationary, the model may produce unreliable forecasts.
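Before reaching for a formal test, a quick sanity check is to compare rolling statistics across the series: if the rolling mean or standard deviation drifts noticeably, the series is unlikely to be stationary. A minimal sketch, reusing the sample `df` from the decomposition example:

```python
import pandas as pd

# Same sample data as in the decomposition example above
data = {'value': [i + (i // 12) * 5 + 10 * (i % 12 < 6) + 5 * (i % 12 > 8) + i / 10 * (i % 3 - 1) for i in range(120)]}
df = pd.DataFrame(data, index=pd.date_range(start='2015-01-01', periods=120, freq='M'))

# For a stationary series, the rolling mean and standard deviation
# should stay roughly constant across the whole series.
rolling_mean = df['value'].rolling(window=12).mean()
rolling_std = df['value'].rolling(window=12).std()

print('Rolling mean, first window:', round(rolling_mean.dropna().iloc[0], 2))
print('Rolling mean, last window: ', round(rolling_mean.dropna().iloc[-1], 2))
```

Here the rolling mean climbs steadily over time, a clear sign of a trend and hence non-stationarity.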
3. Testing for Stationarity: The Augmented Dickey-Fuller (ADF) Test
While we can often spot non-stationarity visually (e.g., an obvious upward trend), a statistical test provides a more rigorous confirmation. The most common test is the Augmented Dickey-Fuller (ADF) Test.
The ADF test is a type of statistical hypothesis test. Its null hypothesis is that the time series is non-stationary (it has a unit root).
- Null Hypothesis (H_0): The series is non-stationary.
- Alternative Hypothesis (H_a): The series is stationary.
We interpret the result using the p-value from the test.
- If p-value > 0.05: We fail to reject the null hypothesis. The data is likely non-stationary.
- If p-value <= 0.05: We reject the null hypothesis. The data is likely stationary.
Here's how to run the ADF test in Python:
```python
from statsmodels.tsa.stattools import adfuller

# Using the same df from the decomposition example
series = df['value'].values

# Perform the ADF test
result = adfuller(series)

# Print the results
print('ADF Statistic:', result[0])
print('p-value:', result[1])
print('Critical Values:')
for key, value in result[4].items():
    print(f'\t{key}: {value}')

# Interpret the p-value
if result[1] <= 0.05:
    print("\nResult: Reject the null hypothesis. The data is likely stationary.")
else:
    print("\nResult: Fail to reject the null hypothesis. The data is likely non-stationary.")

# If the data is non-stationary, we can make it stationary using differencing.
# Note: .diff() leaves a NaN in the first row; because pandas aligns on the
# index when assigning, adding .dropna() here would not remove it.
df['differenced_value'] = df['value'].diff()
```
If your data is non-stationary, a common technique to make it stationary is differencing, which involves subtracting the previous observation from the current observation.