Walmart Sales Prediction

Charmila
Jul 19, 2020

This is my first blog post ever. I hope you like it.

Have you ever been disappointed to find that something you wanted to buy was out of stock? Well, what if you got a chance to solve the problem of products being out of stock or overstocked? This blog discusses that problem and ways to overcome it using machine learning. A simple definition of machine learning is using data to answer the right questions. The more data we have, the more accurate our predictions can be. This is a real-world problem from Kaggle. I would like to thank Walmart for providing the data.

Problem Statement

The objective is to predict the sales of weather-sensitive products. We have been provided with sales data for 111 products sold in 45 different Walmart locations (stores). These 45 locations are covered by 20 weather stations, and some of the stores share a weather station. We only need to forecast the sales data, not the weather data, since weather data covering the entire period is already given. The task is to predict sales for the products present in the test.csv file. It is a regression problem, but a rare kind of regression problem: the targets are imbalanced (most of them are zeros).

Data Overview

Source of data: https://www.kaggle.com/c/walmart-recruiting-sales-in-stormy-weather/data

We have been provided with three files: train.csv, key.csv, and weather.csv.

train.csv - sales data for all stores and dates in the training set. It has four columns

  • item_nbr = an id representing one of the 111 products
  • store_nbr = an id representing one of the 45 stores
  • date = the day of sales
  • units = the quantity sold of an item on a given day

key.csv - relational mapping between stores and the weather stations that cover them. It has two columns

  • station_nbr = an id representing one of 20 weather stations
  • store_nbr = an id representing one of the 45 stores

weather.csv - a file containing the weather information for each station and day. It has 20 columns

  • station_nbr = an id representing one of 20 weather stations
  • date = the day of weather
  • tmax = maximum temperature on a particular day measured in Fahrenheit
  • tmin = minimum temperature on a particular day measured in Fahrenheit
  • tavg = average temperature on a particular day measured in Fahrenheit
  • depart = departure from normal temperature on a particular day measured in Fahrenheit
  • dewpoint = average dewpoint temperature on a particular day measured in Fahrenheit
  • wetbulb = average wetbulb temperature on a particular day measured in Fahrenheit
  • heat = heating degree day on a particular day with base as 65 degrees Fahrenheit. A heating degree day (HDD) is a measurement designed to quantify the demand for energy needed to heat a building
  • cool = cooling degree day on a particular day with base as 65 degrees Fahrenheit. A cooling degree day (CDD) is a measurement designed to quantify the demand for energy needed to cool buildings
  • sunrise = The time of sunrise
  • sunset = The time of sunset
  • codesum = codes for different weather types
  • snowfall = the quantity of snow falling on a particular day measured in inches
  • preciptotal = the quantity of precipitation on a particular day measured in inches
  • stnpressure = average station pressure measured in inches of mercury (inHg)
  • sealevel = average sea-level pressure measured in inches of mercury (inHg)
  • resultspeed = resultant wind speed in miles per hour
  • resultdir = resultant wind direction
  • avgspeed = average speed of the wind

Mapping the real world problem to a Machine Learning problem

Type of ML problem

For a given product, we need to predict its sales. It is a regression problem since sales is real-valued.

Performance Metric

The Metric is RMSLE (Root Mean Squared Logarithmic Error)
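
As a quick sketch, RMSLE can be implemented directly with NumPy; log1p handles the many zero targets gracefully (the function name is my own, not from the competition):

```python
import numpy as np

def rmsle(y_true, y_pred):
    # Root Mean Squared Logarithmic Error: log1p dampens the effect of
    # large sales values and is defined even when the target is zero
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

print(rmsle([0, 1, 3], [0, 1, 3]))  # perfect prediction -> 0.0
```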

Business Constraints

  • No low-latency requirement
  • Errors can be costly (if the model’s prediction is wrong, we end up with overstock or out-of-stock situations)
  • Minimise RMSLE value

Exploratory Data Analysis

Exploratory data analysis is an approach to analysing data sets, often with visual methods. In EDA, we analyse the features at hand and try to understand the relation between each feature and the target variable. While performing EDA, we can also engineer new features which might be useful during model building. First, import all the libraries that you need (for example numpy, pandas, etc.).

Load the data

Load all three files (train.csv, key.csv, weather.csv) using pandas.
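
A minimal loading sketch; a tiny inline sample stands in for the real files here, so swap the StringIO objects for the actual file paths (e.g. pd.read_csv("train.csv", parse_dates=["date"])):

```python
import pandas as pd
from io import StringIO

# inline samples mimicking the schemas of train.csv and key.csv
train_csv = """date,store_nbr,item_nbr,units
2012-01-01,1,1,0
2012-01-01,1,9,29
"""
key_csv = """store_nbr,station_nbr
1,1
2,14
"""
train = pd.read_csv(StringIO(train_csv), parse_dates=["date"])
key = pd.read_csv(StringIO(key_csv))
print(train.shape, key.shape)
```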

Imbalanced regression targets

About 97.4 percent of the data points have sales (units) equal to zero, while only about 2.5 percent have units greater than zero. This is a rare kind of problem: a regression problem in which the targets are mostly zeros.
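
The zero share is a one-liner to check; the series below is a hypothetical stand-in for the real units column:

```python
import pandas as pd

# hypothetical sample; on the real train.csv the zero share is ~97.4%
units = pd.Series([0, 0, 0, 5, 0, 12, 0, 0])
zero_share = (units == 0).mean() * 100
print(f"{zero_share:.1f}% of rows have zero sales")
```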

Analysing features

I have mentioned all the available features in the data overview section. Sales differ as the item_nbr feature differs, which means different items are sold in different quantities. The same holds true for the store_nbr and station_nbr features. For example, the item with item_nbr 45 has the highest sales (1005111 units), followed by item 9 (916615 units). The item with the least sales is item 40 (254 units). We can see that sales are very uneven across items.

tmax

tmax means maximum temperature. To plot tmax against sales, group by tmax, store_nbr, and item_nbr and take the sum of sales for each combination. It is not possible to plot all 111 items, so randomly select 5 or 6 of them.
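
A sketch of the grouping step on a hypothetical merged frame (the real frame would come from joining train, key, and weather):

```python
import pandas as pd

# tiny hypothetical merged frame already containing tmax from the weather join
df = pd.DataFrame({
    "tmax": [60, 60, 80, 80],
    "store_nbr": [1, 1, 2, 2],
    "item_nbr": [5, 5, 5, 1],
    "units": [3, 2, 7, 0],
})
# sum of sales for each (tmax, store, item) combination
grouped = df.groupby(["tmax", "store_nbr", "item_nbr"], as_index=False)["units"].sum()
item5 = grouped[grouped["item_nbr"] == 5]  # one item's points for the 3D plot
print(item5)
```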

Above is a 3D plot for item 5. Red points indicate zero sales. We can see that item 5 has been sold in large amounts across most of the stores, with sales occurring at temperatures between roughly 0 and 100 degrees Fahrenheit.

Above is the plot for item 1. Very few units of item 1 have been sold, with sales occurring at temperatures between roughly 20 and 100 degrees Fahrenheit. Sales of item 1 peak when the temperature is around 60 degrees Fahrenheit, while sales of item 75 peak around 20 degrees Fahrenheit. This shows that there is variability in the quantities sold of different items at different temperatures.

heat

To plot heat against sales, group by heat, store_nbr, and item_nbr and take the sum of sales for each combination.

From the above two plots, we can see that items are sold in decent quantities when the heat value is low. Item 5, one of the items with large sales, was sold when the heat value is around 0. Some other items, like item 50, were also sold at heat values greater than 0. This tells us there is variability in the sales of items corresponding to the variability in the heat value.

preciptotal

To plot preciptotal against sales, group by preciptotal, store_nbr, and item_nbr and take the sum of sales for each combination.

We can see that items were sold mostly when the precipitation value is very low (less than about 1 inch). Item 5 is sold across almost all the stores, but items like 108 and 55 are sold only in some stores.

stnpressure

To plot stnpressure against sales, group by stnpressure, store_nbr, and item_nbr and take the sum of sales for each combination.

From the above plots, we can see that there are no sales when the station pressure is between roughly 24 and 28 inHg. There is a certain pattern in the amount of sales: for example, item 5 sales increase and then decrease as stnpressure increases.

weekday

The weekday feature is a number ranging from 0 to 6, where 0 means Monday and 6 means Sunday.

day  :  total_sales 
MON : 669573
TUE : 587327
WED : 560519
THU : 548025
FRI : 617066
SAT : 738786
SUN : 826764

The highest sales were seen on the weekends. There was a good amount of sales even on Monday compared to the other weekdays, while sales were lowest on Thursday. There is variability in sales corresponding to the weekday feature.

Day

The Day feature is a number ranging from 1 to 31.

Day  1  :  189655
Day 2 : 164684
Day 3 : 168782
Day 4 : 153478
Day 5 : 185472
Day 6 : 163719
Day 7 : 160847
Day 8 : 156434
Day 9 : 155074
Day 10 : 182456
Day 11 : 159351
Day 12 : 155561
Day 13 : 150276
Day 14 : 144167
Day 15 : 149491
Day 16 : 140668
Day 17 : 139457
Day 18 : 140593
Day 19 : 141057
Day 20 : 139428
Day 21 : 141600
Day 22 : 141809
Day 23 : 140659
Day 24 : 140374
Day 25 : 131115
Day 26 : 137596
Day 27 : 132753
Day 28 : 132481
Day 29 : 121713
Day 30 : 115684
Day 31 : 71626

We can see that sales kept dropping as the days of the month progress. The highest sales were seen in the initial days, which is expected.

I chose to drop depart, sunrise, sunset, snowfall, and codesum, because these features have a lot of missing values (more than 25 percent). I have shown EDA for only a few features here, but after analysing all the features apart from the ones I chose to drop, there is some variability in sales with respect to each of them. So all of these features might be useful in predicting sales.

Preprocessing and Data Preparation

Removing outliers in train.csv

I used matplotlib to produce the plot above. We can clearly see that there are two outliers, because those two values are very extreme compared to the rest. Hence, remove those outliers. There are no outliers in the features corresponding to the weather data. Also drop depart, sunrise, sunset, snowfall, and codesum, because these features have a lot of missing values.
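
A sketch of the removal, assuming a simple cutoff read off the scatter plot; the threshold and the unit values here are hypothetical:

```python
import pandas as pd

# hypothetical frame; the two extreme values stand in for the outliers in the plot
train = pd.DataFrame({"units": [0, 3, 2, 5568, 0, 1, 3369]})
threshold = 3000  # assumed cutoff chosen by eye from the plot
train_clean = train[train["units"] < threshold].reset_index(drop=True)
print(len(train), "->", len(train_clean))
```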

Impute Missing Values

There are missing values only in the weather data, and we need to impute them. We cannot use all of the data to impute them, since that would leak information from the validation period. So I split the data on the date column into two parts: before 2014-05-06, and from 2014-05-06 onwards. I used only the data before 2014-05-06 to impute the missing values. This date was chosen because the data up to 2014-05-06 corresponds to 85% of the total and the rest to 15%. The same date is used when splitting the data into train and CV, as described in the Train-CV split section. Missing values are imputed with the mean of the values that are present; since values differ for each weather station, the means are computed per station.
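
A sketch of station-wise mean imputation, computing the means only on data before 2014-05-06 (the tiny frame is hypothetical, showing only the tavg column):

```python
import numpy as np
import pandas as pd

weather = pd.DataFrame({
    "station_nbr": [1, 1, 1, 2, 2],
    "date": pd.to_datetime(["2013-01-01", "2013-01-02", "2014-06-01",
                            "2013-01-01", "2014-06-01"]),
    "tavg": [50.0, np.nan, np.nan, 70.0, np.nan],
})
split_date = pd.Timestamp("2014-05-06")
# per-station means computed only on rows before the split date
means = weather[weather["date"] < split_date].groupby("station_nbr")["tavg"].mean()
# fill gaps everywhere using each row's station mean
weather["tavg"] = weather["tavg"].fillna(weather["station_nbr"].map(means))
print(weather["tavg"].tolist())
```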

Additional Features based on date feature

Merge the train data, key data, and weather data to get a clearer picture of the weather on the day a particular item was sold. I also engineered a few other features (day, month, year, number of days, weekday) from the date feature.

A small helper function was used to create these features from the date column.
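
A sketch of such a function; the function name and the 2012-01-01 origin for the running day count (date_numeric) are my assumptions:

```python
import pandas as pd

def add_date_features(df, start=pd.Timestamp("2012-01-01")):
    # derive day / month / year / weekday and a running day count
    df = df.copy()
    df["day"] = df["date"].dt.day
    df["month"] = df["date"].dt.month
    df["year"] = df["date"].dt.year
    df["weekday"] = df["date"].dt.weekday        # 0 = Monday, 6 = Sunday
    df["date_numeric"] = (df["date"] - start).dt.days
    return df

sample = pd.DataFrame({"date": pd.to_datetime(["2012-01-02", "2014-05-06"])})
print(add_date_features(sample))
```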

Train-CV split

As mentioned earlier, the data before 2014-05-06 is used as train data (85% of the total) and the data from 2014-05-06 onwards as CV data (15%).
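
The split is a simple date filter (tiny hypothetical frame):

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["2013-01-01", "2014-05-05", "2014-05-06", "2014-09-01"]),
    "units": [1, 0, 2, 3],
})
split_date = pd.Timestamp("2014-05-06")
train_df = df[df["date"] < split_date]   # ~85% of the real data
cv_df = df[df["date"] >= split_date]     # ~15% of the real data
print(len(train_df), len(cv_df))
```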

Machine Learning Model Building

I built two new features based on avgspeed and weekday. Group by avgspeed, store, and item, and find the sum of units for each group. The new feature avgspeed_mean is then obtained by substituting this sum back into the final data frame for each avgspeed-store-item combination. The same method is followed for the other new feature, weekday_mean. I experimented with four regression models: linear regression, KNN regression, random forest regression, and gradient boosted decision tree regression (GBDT).
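
The group aggregate can be broadcast back to each row with groupby().transform; note the post names the feature avgspeed_mean even though it describes a sum. The small frame is hypothetical:

```python
import pandas as pd

df = pd.DataFrame({
    "avgspeed": [5.0, 5.0, 8.0],
    "store_nbr": [1, 1, 2],
    "item_nbr": [5, 5, 9],
    "units": [3, 4, 2],
})
# group-level aggregate of units, broadcast back to every row of the combination
df["avgspeed_mean"] = df.groupby(["avgspeed", "store_nbr", "item_nbr"])["units"].transform("sum")
print(df["avgspeed_mean"].tolist())
```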

Base Line Model

Predicting 0 for all the data points gives an RMSLE of 0.48 on the CV data. This means that any sensible model should achieve an RMSLE below 0.48.
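
The baseline can be scored with scikit-learn's mean_squared_log_error; the CV target vector below is a hypothetical stand-in, mostly zeros as in the real data:

```python
import numpy as np
from sklearn.metrics import mean_squared_log_error

y_cv = np.array([0, 0, 0, 4, 0, 9])        # hypothetical CV targets
baseline_pred = np.zeros_like(y_cv)        # predict 0 everywhere

# RMSLE of the all-zeros baseline
baseline_rmsle = np.sqrt(mean_squared_log_error(y_cv, baseline_pred))
print(round(baseline_rmsle, 3))
```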

Linear regression

The first model I used was linear regression with all the features, which gave an RMSLE value of 0.16. Below is the code to implement linear regression through scikit-learn.
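
A sketch of that code, with random stand-ins for the real feature matrices; predictions are clipped at zero since sales cannot be negative:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# hypothetical stand-ins for the real train / CV feature matrices
rng = np.random.default_rng(0)
X_train, y_train = rng.random((100, 5)), rng.integers(0, 10, 100)
X_cv = rng.random((20, 5))

model = LinearRegression()
model.fit(X_train, y_train)
preds = np.clip(model.predict(X_cv), 0, None)  # sales cannot be negative
print(preds.shape)
```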

KNN regression

I hyperparameter-tuned KNN regression with the number of neighbours as the hyperparameter and got 11 as the best value. The RMSLE value obtained was 0.44.

Random Forest regression

I hyperparameter-tuned random forest regression with the number of estimators and the depth as hyperparameters and got max_depth=10 and n_estimators=50 as the best values. The RMSLE value obtained was 0.14.
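
A sketch of the tuning using GridSearchCV on random stand-in data; the grid values echo the ones reported above, though the post likely tuned against the fixed CV split rather than k-fold:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# hypothetical stand-in data
rng = np.random.default_rng(1)
X, y = rng.random((60, 4)), rng.random(60)

param_grid = {"n_estimators": [10, 50], "max_depth": [5, 10]}
search = GridSearchCV(RandomForestRegressor(random_state=0), param_grid,
                      scoring="neg_mean_squared_error", cv=3)
search.fit(X, y)
print(search.best_params_)
```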

Gradient boosted decision tree regression

I hyperparameter-tuned GBDT regression with the number of estimators and the depth as hyperparameters and got max_depth=5 and n_estimators=50 as the best values. The RMSLE value obtained was 0.14.

Variation in KNN regression

I tried another approach which also uses KNN regression, but with a variation. The idea comes from Kaggle’s discussion board. Almost 97% of the sales in the data are 0. Keeping this in mind, I built a model as follows. Take all the rows in the train data where sales are greater than 0, group them by store number and item number, and count them. This yields only a few (store, item) combinations where sales are ever greater than 0; save these combinations in a list, say L. Now go through each row in the CV data. For combinations of store and item number that are not present in L, predict the sales to be 0. For combinations that are present in L, take only those rows of the train data whose store and item number match that row, train a KNN regressor on them, and use it to predict the sales. For this method I used only the features derived from the date feature, which seemed to work best. With this method I got an RMSLE value of 0.11, the best value so far.
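
The steps above can be sketched as follows; the helper name and the tiny example frames are hypothetical:

```python
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor

def predict_knn_variation(train, cv, features, n_neighbors=2):
    # (store, item) combinations that ever sold anything in the train data
    sold = train[train["units"] > 0]
    known = set(zip(sold["store_nbr"], sold["item_nbr"]))
    preds = []
    for _, row in cv.iterrows():
        pair = (row["store_nbr"], row["item_nbr"])
        if pair not in known:
            preds.append(0.0)  # combination never sold: predict zero
            continue
        # fit a KNN regressor only on the matching (store, item) rows
        sub = sold[(sold["store_nbr"] == pair[0]) & (sold["item_nbr"] == pair[1])]
        knn = KNeighborsRegressor(n_neighbors=min(n_neighbors, len(sub)))
        knn.fit(sub[features].values, sub["units"].values)
        preds.append(float(knn.predict([[row[f] for f in features]])[0]))
    return preds

# tiny hypothetical example using only a date-derived feature (weekday)
train = pd.DataFrame({"store_nbr": [1, 1, 1, 2], "item_nbr": [5, 5, 5, 9],
                      "weekday": [0, 5, 6, 3], "units": [2, 8, 10, 0]})
cv = pd.DataFrame({"store_nbr": [1, 2], "item_nbr": [5, 9], "weekday": [6, 3]})
print(predict_knn_variation(train, cv, ["weekday"]))
```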

Final Results in a table format

KNN with variation performed better than all the other models.

Kaggle Submission Score

Future Work

In the future, I would like to engineer some other features relevant to this case study and experiment with it. I would also like to experiment with stacking of different models to improve the efficiency of models.

Conclusion

This was the first time I solved a Kaggle problem. I understood the importance of feature engineering in this case study; somehow, the weather features were not useful at all, while features obtained from the date, like weekday, date_numeric, month, day, and year, were very helpful. Simple methods like KNN with variation can also achieve great results, and we can only build such models by spending time analysing the data. My position on the leaderboard was decent, and I think it is a good start for a beginner.

References

  1. Kaggle’s discussion board has been a great source of reference: https://www.kaggle.com/c/walmart-recruiting-sales-in-stormy-weather/discussion/14358
  2. https://www.slideshare.net/MinchaoLin/masters-project-report-minchao-lin
  3. https://towardsdatascience.com/an-easy-introduction-to-3d-plotting-with-matplotlib-801561999725

You can find me on LinkedIn.
