
Fuel Price Prediction Using Python and “scikit-learn” | by Leo van der Meulen | Aug, 2022


Fuel prices are going up and down (mostly up). Can we predict what will happen?

Photo by Dawn McDonald on Unsplash

The price of gasoline (E10) is going up and down. The energy crisis is heavily influencing these prices. The gasoline price is claimed to depend strongly on the oil price, and to follow an upward trend in the oil price faster than a downward one. We will see if this claim is true and whether we can predict the fuel price for the coming days.

So, first, we will gather our data (oil and fuel prices). The second step is to analyze whether there is a (visual) correlation between the two. Finally, we will try to build a predictor for the E10 price.

It is easy to find the current prices and graphs for recent weeks or months. It is a bit harder to find the underlying tabular data. For the oil price, I found it on investing.com. It is possible to download historical data (daily, weekly, monthly) when you have a free account on the site.

When you follow the link above, you see a page with information on Brent oil futures. Brent is one of the two major oil price benchmarks in the world (WTI is the other). A future is a contract specifying the value, price, and date of an oil exchange. It is a good indicator of the current oil price.

Screenshot from investing.com

Make sure to click the ‘Historical Data’ tab. On the right side of the Download Data button, you can specify the period you want to download, and on the left side you can specify whether you want daily, weekly, or monthly data. It is possible to go back more than 30 years.

Once you have specified the download period and data frequency, you can download the data and receive the information in a CSV file:

Example oil price data (screenshot by author)

For the current purpose, this way of obtaining the data is fine. When the data needs to be retrieved frequently, it should be automated, e.g. with BeautifulSoup4.

Since I am interested in the E10 prices in the Netherlands, the fuel prices will be in euros. The Brent oil futures are in dollars. The euro-dollar exchange rate will have changed over the period of this analysis, so we need to work in a single currency. In this case, I have chosen the euro and need to convert the dollar prices to euros.

The exchange rate can also be found on investing.com. Downloading this data works the same way as for the oil price.

So now we have two CSV files, one with the oil future prices and one with the exchange rate. We will import these with Pandas, apply the exchange rate, and perform some data cleaning:
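
The snippet below is a minimal sketch of this step. The file names and the ‘Date’ and ‘Price’ column names are assumptions based on investing.com’s CSV export; the line numbers mentioned in the explanation below follow this layout.

```python
import pandas as pd
import locale

oil = pd.read_csv('Brent Oil Futures Historical Data.csv')                  # oil futures export (assumed name)
eurusd = pd.read_csv('EUR_USD Historical Data.csv').add_suffix('_USD')      # exchange rate export (assumed name)
oildata = oil.merge(eurusd, left_on='Date', right_on='Date_USD')            # join on the date columns
oildata['PriceEUR'] = oildata['Price'] / oildata['Price_USD']               # Brent price converted to euros
locale.setlocale(locale.LC_TIME, 'en_US.UTF-8')                             # English month abbreviations
oildata['Date'] = pd.to_datetime(oildata['Date'], format='%b %d, %Y')       # e.g. 'Aug 19, 2022'
oildata = oildata[['Date', 'PriceEUR']].rename(columns={'PriceEUR': 'Oil'}) # keep and rename the needed columns
```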

In lines 4 and 5, both CSVs are imported into a Pandas dataframe, whereby the columns of the euro-dollar exchange dataframe are suffixed with ‘_USD’. Both CSVs have some identical column names, and we need to be able to distinguish them. These dataframes are then merged on the date columns.

In line 7 the future price in euros is calculated, and in line 9 the string representation of the date is converted to a datetime object. In line 8 the correct locale is set to make sure the month abbreviations are interpreted correctly; the date column contains the three-letter abbreviation of the month, and these differ per locale. Finally, line 10 keeps the required columns and gives them applicable names. The oil data looks as follows:

Oil price per date (screenshot by author)

Now that we have the historical oil prices, the next step is obtaining the E10 prices. In the Netherlands, the national bureau for statistics (CBS) keeps track of gasoline prices on a daily basis.

Dutch gasoline prices (screenshot website CBS)

There is a Python library for downloading CBS data (cbsodata), maintained by CBS. Downloading large tables can take some time. Since most tables do not change frequently, I have developed a wrapper around cbsodata that adds caching.

The method get_cbs_odata() returns two dataframes, the first with the data and the second with a description of the table and its columns. The cache parameter determines whether a local cache is used. At the end of the method, the string strip() function is applied to all string columns. This is added because most string columns are padded with spaces, and removing these spaces makes the data easier to use.
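
A minimal sketch of such a wrapper, assuming cbsodata’s get_data() and get_meta() functions and a simple pickle-based cache (the caching scheme shown here is an illustration, not the exact implementation):

```python
import os
import pandas as pd
import cbsodata

def get_cbs_odata(table_id, cache=True, cache_dir='cbs_cache'):
    """Return (data, metadata) dataframes for a CBS table, with an optional local cache."""
    data_file = os.path.join(cache_dir, f'{table_id}_data.pkl')
    meta_file = os.path.join(cache_dir, f'{table_id}_meta.pkl')
    if cache and os.path.exists(data_file) and os.path.exists(meta_file):
        # Reuse the cached copies instead of downloading the (large) table again
        data, meta = pd.read_pickle(data_file), pd.read_pickle(meta_file)
    else:
        data = pd.DataFrame(cbsodata.get_data(table_id))
        meta = pd.DataFrame(cbsodata.get_meta(table_id, 'DataProperties'))
        if cache:
            os.makedirs(cache_dir, exist_ok=True)
            data.to_pickle(data_file)
            meta.to_pickle(meta_file)
    # Remove the padding spaces from all string columns
    for col in data.select_dtypes(include='object').columns:
        data[col] = data[col].str.strip()
    return data, meta
```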

With the help of this method we retrieve the gasoline prices and then clean this data:

To convert the dates to datetime objects we need the Dutch locale, since the date column contains Dutch day and month names (“2022 woensdag 13 juli”). We only keep the E10 and diesel prices:
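
A sketch of this retrieval and cleaning step; the table id and the CBS column names below are assumptions and should be looked up in the CBS catalogue:

```python
import locale
import pandas as pd

prices, meta = get_cbs_odata('80416ned', cache=True)         # daily pump prices (assumed table id)

locale.setlocale(locale.LC_TIME, 'nl_NL.UTF-8')              # Dutch day and month names
prices['Date'] = pd.to_datetime(prices['Perioden'], format='%Y %A %d %B')
gas_data = prices[['Date', 'BenzineEuro95_1', 'Diesel_2']].rename(
    columns={'BenzineEuro95_1': 'E10', 'Diesel_2': 'Diesel'})
```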

Dutch gasoline price per date (screenshot by author)

The next step is combining the oil and gasoline dataframes. In the previous steps, both tables were given a column with the date as a datetime object. This makes it easy to merge the tables:
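
A sketch of these combining steps, continuing with the dataframes from the previous snippets (the line numbers in the explanation below refer to this sketch):

```python
gas_data = gas_data.merge(oildata, on='Date', how='outer')   # keep every date from either table
gas_data = gas_data.set_index('Date').resample('D').max()    # one row per day
gas_data = gas_data.interpolate()                            # fill missing values linearly
gas_data = gas_data.rolling(7).mean()                        # 7-day moving average
gas_data = gas_data[gas_data.index >= '2020-01-01']          # keep data from 2020 onward
```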

Line 1 merges the dataframes on the column ‘Date’. The column to use for merging is specified with the on parameter; the specified name must be present in both tables (if the merge key has different column names in the two datasets, you can use left_on and right_on). By default, the merge function returns only keys (in this case dates) that are present in both tables. By specifying how=’outer’, all keys become part of the result, even those present in only one of the two tables. This ensures that, in case of a missing line in one of the datasets, we still get a final dataframe with a row for each date.

Line 2 resamples the table so that every day between the first and last date is present. Resampling only works on the index of a dataframe. Missing dates are added, and dates that occur multiple times are combined. The max() selects the maximum column value in case a date occurs multiple times. In this dataframe that will most likely not happen, but if we were to resample to weeks, seven days would be grouped into one row and a value selector would be required. There are also value selectors for, e.g., the minimum and the mean value.

Next, the interpolate() function fills missing column values by interpolating linearly between the available values around the missing value (a lot of functions require all columns to be filled). Finally, rolling(7).mean() smooths the data by calculating the average over a window of 7 values, which removes noise from the data. The last line removes all dates before January 1st, 2020.

This article uses the fuel prices in the Netherlands but it is also possible to perform these calculations for other countries. For example, for Germany the historical prices are available on the finanzen.net website.

Time to look at the data:

The oil price lies between 20 and 130 euros, while the E10 price lies between 1.50 and 2.50 euros. Plotting these on the same axis would render the graph useless: since the y-range is determined by the oil price, the E10 price would appear as a nearly flat line. Introducing a secondary y-axis solves this problem.
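
In a notebook, five lines are enough to create such a plot (the axis labels and title below are my wording); the line numbers in the explanation below refer to this sketch:

```python
ax = gas_data['Oil'].plot(figsize=(10, 5))            # oil price on the left y-axis
ax.set_ylabel('Oil price (EUR per barrel)')
ax2 = gas_data['E10'].plot(ax=ax, secondary_y=True)   # E10 price on a secondary y-axis
ax2.set_ylabel('E10 price (EUR per liter)')
ax.set_title('Oil price vs E10 price')
```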

Line 1 plots the oil price and uses the left y-axis, which is named in line 2. Line 3 plots the E10 price, reusing the matplotlib axis from the first plot, and adds a secondary y-axis on the right (parameter secondary_y=True). The second axis is named in line 4, while line 5 adds a title to the plot.

Oil and E10 prices compared (image by author)

Looking at the graph, we see that the E10 price roughly follows the oil price, but with a small delay. It is also a bit smoother than the oil price. Zooming in on 2022 reveals this even better:

Oil and E10 prices compared [2022] (image by author)

The E10 price follows the price surge in March with a slight delay. The July/August period shows that it also follows when the oil price goes down. Looking at these two periods, it seems that a rise in the oil price translates into higher E10 prices more quickly than a fall translates into lower ones. Adding two time-delayed versions of the oil price (delays of 7 and 14 days) confirms this:

Comparing time shifted series (image by author)

The red lines seem to align with the orange dotted line (7 days delayed oil price) when it is ascending and with the green dotted line (14 days delay) when it is descending. This confirms the hypothesis that gasoline prices follow ascending oil prices faster than descending oil prices.

But now for the cherry on the cake: is it possible to predict the E10 price for the coming period? The data can be interpreted as a time series that can be extended with several features (oil prices, delayed prices). But it might also be possible with a simpler model like regression, since there seems to be a clear correlation between the value to predict (the E10 price) and the mentioned features.

But first, it is time to look at the correlations between the different features, using the corr() function of the Pandas dataframe:
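
The time-shifted and delta columns discussed below can be built with shift(); a sketch (which exact columns were engineered is an assumption):

```python
for col in ['Oil', 'E10']:
    gas_data[f'{col}_7d'] = gas_data[col].shift(7)                            # price 7 days ago
    gas_data[f'{col}_14d'] = gas_data[col].shift(14)                          # price 14 days ago
    gas_data[f'{col}_delta'] = gas_data[f'{col}_7d'] - gas_data[f'{col}_14d'] # change between 7 and 14 days ago

correlations = gas_data.corr()
```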

Correlation analysis (image by author)

We are most interested in the top row, which shows the correlation between the E10 price and the other columns. The _7d and _14d columns are the time-shifted columns (7 and 14 days). The _delta columns contain the difference between the price 7 days ago and the price 14 days ago.

The highest correlations are with the oil price, the delayed oil prices, and the delayed E10 prices. The correlation with the _delta values is small. Using a delayed version of the value we want to predict is possible, but it reduces how far ahead we can look. In this case, we need the E10 price of 7 days ago, so we can only predict 7 days into the future (or we would need to use a delayed version of the predicted value).

Before we can apply machine learning, we need to prepare our test data.
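
A sketch of this preparation step (the variable names and the exact feature selection are assumptions; the line numbers in the description below follow this sketch):

```python
ml_data = gas_data.copy().dropna()                     # work on a copy, drop NaNs introduced by shift()

x_columns = ['Oil', 'Oil_7d', 'Oil_14d', 'E10_7d']     # feature columns (assumed selection)

y_column = 'E10'                                       # target column

X = ml_data[x_columns].iloc[:-7]                       # hold out the last 7 days
y = ml_data[y_column].iloc[:-7]
```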

A copy of the gas_data dataframe is used. In line 3 we define the column names that will be used as features. Line 5 identifies the target column, the value we will try to predict. The last 7 values of the dataframe are not added to the X (features) and y (target values) datasets. This prevents us from using the values we want to predict as input for the model.

The statsmodels package contains, among other things, an implementation of the Ordinary Least Squares (OLS) regression technique. OLS fits a linear function to the features by minimizing the sum of the squared errors. Wikipedia has a good description of OLS.

The OLS model is created, trained and used to predict the next 7 values in the following 4 lines of code:
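
A sketch of these four lines, continuing with the names from the preparation step above (adding an intercept with sm.add_constant() is my addition):

```python
import statsmodels.api as sm
model = sm.OLS(y, sm.add_constant(X)).fit()                              # create and train the OLS model

ml_data['EstLR'] = model.predict(sm.add_constant(ml_data[x_columns]))    # predict for all rows, including the last 7
```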

Line 2 creates an OLS model, and the fit() method trains the model and finds the best fit. In line 4 the model is used to predict the target value, also for the 7 entries we did not use for training. The predicted values are stored in a new column named ‘EstLR’.

Actual value and predicted values (screenshot by author)

We can calculate the Root Mean Squared Error (RMSE) to check how good this prediction is. The RMSE is 0.025, which translates to a quite good prediction. The real and predicted values in graphical form:

Comparing real value with predicted value (created by author)

The blue line represents the actual value, and the dotted part marks the last seven days that were not used in the training phase. The green line is the predicted value, and it follows the blue line quite well. The delayed drop in April is due to changes in the excise duty: it was lowered by the government, and this cannot be predicted by the model since it only looks at historical oil and E10 prices. Overall, I am pleasantly surprised by the quality of the prediction. Based on this result, I feel confident delaying filling up my car when the E10 price is predicted to drop over the coming seven days.

A Support Vector Machine (SVM) is a supervised machine learning algorithm for both classification and regression purposes. An SVM works by finding a hyperplane that divides a dataset into two or more classes. It is therefore mostly used for classification, but it works for regression as well.
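
With scikit-learn, a support vector regressor can be tried in a few lines; a minimal sketch (the default hyperparameters and the column name EstSVM are my choices, and scaling the features first may improve the result):

```python
from sklearn.svm import SVR

svm = SVR()                                           # default RBF kernel
svm.fit(X, y)
ml_data['EstSVM'] = svm.predict(ml_data[x_columns])   # predictions for all rows, including the last 7 days
```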

Adding the estimations from the SVM to the comparison graph shows that it performs worse and underestimates the E10 price.

Comparing all three predictions (image by author)

The last algorithm tested is the multilayer perceptron (MLP). An MLP is a neural network consisting of layers of perceptrons. It has at least an input layer, an output layer, and one hidden layer. It uses supervised training and backpropagation to determine all the weights in the model.

Implementing an MLP is a bit more complex compared to the previous two algorithms:
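
A sketch with scikit-learn’s MLPRegressor (the feature scaling, max_iter, random_state, and the column name EstMLP are my additions):

```python
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(X)                      # neural networks are sensitive to input ranges

mlp = MLPRegressor(hidden_layer_sizes=(64, 64, 64), max_iter=5000, random_state=42)
mlp.fit(scaler.transform(X), y)

ml_data['EstMLP'] = mlp.predict(scaler.transform(ml_data[x_columns]))
```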

Finding the right number of layers and perceptrons is trial and error. In this case, three hidden layers of 64 perceptrons seem to give the best result:

Comparing prediction results (image by author)

Overall, the MLP predictions seem to be as good as the regression ones. Looking at the error metrics, we see that the LR is slightly better:
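
Metrics like these can be computed with scikit-learn; a sketch, here evaluated on the held-out last seven days and using the column names from the sketches above:

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error

actual = ml_data['E10'].iloc[-7:]                     # the held-out week
for col in ['EstLR', 'EstSVM', 'EstMLP']:
    rmse = mean_squared_error(actual, ml_data[col].iloc[-7:]) ** 0.5
    mae = mean_absolute_error(actual, ml_data[col].iloc[-7:])
    print(f'{col}: RMSE={rmse:.3f}, MAE={mae:.3f}')
```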

Metric scores for the models (image by author)

So, looking at these model results, it is possible to predict the E10 price relatively well. Using a regression or a neural network results in a prediction that is good enough to use in decision-making.

A notebook with all the code for this article can be found on my GitHub page. It also includes a set of CSV files as used for the calculations and predictions.

I hope you enjoyed this article. For more inspiration, check some of my other articles:

Disclaimer: The views and opinions included in this article belong only to the author.


