Complete guide to Time Series Analysis using Neural Network

Welcome to the fourth episode of Fastdotai, where we will deal with structured and time series data. Before we start, I would like to thank Jeremy Howard and Rachel Thomas for their efforts to democratize AI.

To make the best of this blog post series, feel free to explore the first part of the series in the following order:

  1. Dog Vs Cat Image Classification
  2. Dog Breed Image Classification
  3. Multi-label Image Classification
  4. Time Series Analysis using Neural Network
  5. NLP- Sentiment Analysis on IMDB Movie Dataset
  6. Basic of Movie Recommendation System
  7. Collaborative Filtering from Scratch
  8. Collaborative Filtering using Neural Network
  9. Writing Philosophy like Nietzsche
  10. Performance of Different Neural Network on Cifar-10 dataset
  11. ML Model to detect the biggest object in an image Part-1
  12. ML Model to detect the biggest object in an image Part-2

The notebook that we will discuss is an implementation of the third-place result in the Rossmann Kaggle competition, as detailed in Guo/Berkhahn’s [Entity Embeddings of Categorical Variables]. The code was executed in a Kaggle kernel, so a few extra lines have been written so that the code runs smoothly on Kaggle. These extra lines may not be necessary if you run the code on any other cloud GPU platform. I will attach the Kaggle kernel link at the end of this blog post.

As the code snippets above show, we have additional datasets that enrich the features and hence help us get better accuracy. The datasets that make up the feature space are listed below.

Feature Space Involves:

  • train: Contains store information on a daily basis, tracks things like sales, customers, whether that day was a holiday, etc.
  • store: List of stores. General information about the store including competition, etc.
  • store_states: Mapping of store to the German state they are in.
  • googletrend: Trend of certain google keywords over time, found by users to correlate well with given data.
  • weather: Weather conditions for each state.
  • test: Same as training table, without sales and customers.

Let's save these data file names in a list as shown below:-

We can use head() to get a quick look at the contents of each table:

Data Cleaning / Feature Engineering:-

In the following code snippets we will be doing Data Cleaning:-

Before moving forward, let's have a look at the googletrend data, as we will be doing feature engineering on it next.

Let's have a look at googletrend after feature engineering. Check out the new Date and State columns and the changes made to the State column.

The following step is really important for feature engineering:-

As we can see in the snapshot below, many more columns have been added to googletrend. They are all derived from the Date column using the above command.

In the above code snippet, joined is our training data and joined_test is our test data. As we know, a picture is worth 1000 words, so let me describe the flowchart of how we end up with the training data (joined) and the test data (joined_test) by merging multiple tables. The following snapshot is the same for joined and joined_test.
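The merging described above can be sketched with pandas as a chain of left joins. This is only a minimal illustration: the helper left_join is hypothetical, and the tables are tiny made-up stand-ins whose column names (Store, State, Sales) follow the table descriptions earlier in the post.

```python
import pandas as pd

def left_join(left, right, on, suffix="_y"):
    """Left-join `right` onto `left`, keeping every row of `left`.

    Hypothetical helper for illustration; clashing columns get `suffix`.
    """
    return left.merge(right, how="left", on=on, suffixes=("", suffix))

# Tiny made-up tables that mimic the shape of the real ones.
train = pd.DataFrame({"Store": [1, 1, 2, 3], "Sales": [5263, 5020, 6064, 8314]})
store_states = pd.DataFrame({"Store": [1, 2], "State": ["HE", "TH"]})

joined = left_join(train, store_states, on="Store")

# A left join never drops training rows; stores with no match get NaN,
# which is worth checking after every merge.
n_unmatched = int(joined["State"].isnull().sum())
print(joined)
print("rows without a state:", n_unmatched)
```

Counting the nulls after each join is a cheap way to confirm that every row found its partner before moving on to the next table.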


As we can see in the code below, we fill in the missing values that are present in the training data (joined) and the test data (joined_test).

The output looks like :-

In the code below we replace the outliers and do some more feature engineering.

Same feature engineering for Promo dates.


You are almost halfway through this article. Great job. Keep going. It's going to be interesting from here.

Now, let’s focus on categorical and continuous variables.

Categorical vs Continuous


In the code above, we have separated the columns that we will treat as continuous from those we will treat as categorical. But how? This needs a detailed explanation.

A column/variable is treated as continuous or categorical based on its CARDINALITY. CARDINALITY is defined as the number of levels in a category, e.g. the CARDINALITY of the days of the week is 7.

  • Variables that are already present in the data in categorical form will be categorical variables in the model.
  • For the remaining numeric variables, we have to check their cardinality. If the cardinality (number of distinct levels) is too high, the variable stays continuous; otherwise it is converted into a categorical variable.
  • A continuous variable should have a continuous, smoothish relationship with the target. Year, for example, is numeric but does not have many distinct levels, hence it is better to make it a categorical variable. In this case, the continuous ones are the columns whose data type is floating point or int.
  • We can bin a continuous variable and then convert it into a categorical variable. Sometimes binning can be very helpful.

Which variables are categorical and which are continuous is one of the modelling decisions we have to make.
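A quick way to look at cardinality is to count the distinct values per column. The sketch below uses a made-up sample and a hypothetical threshold MAX_LEVELS; in practice the cut-off is a judgment call, as noted above.

```python
import pandas as pd

# Made-up sample; the column names mirror ones mentioned in the post.
df = pd.DataFrame({
    "DayOfWeek": [1, 2, 3, 4, 5, 6, 7, 1],
    "Year": [2013, 2013, 2014, 2014, 2015, 2015, 2015, 2013],
    "CompetitionDistance": [1270.0, 570.0, 14130.0, 620.0,
                            29910.0, 310.0, 24000.0, 7520.0],
})

# Cardinality = number of distinct levels in each column.
cardinality = df.nunique()

# Hypothetical threshold: few distinct levels -> treat as categorical.
MAX_LEVELS = 7
cat_vars = [c for c in df.columns if cardinality[c] <= MAX_LEVELS]
contin_vars = [c for c in df.columns if c not in cat_vars]

print(cardinality)
print("categorical:", cat_vars)
print("continuous:", contin_vars)
```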

At this point, our data looks like this. It has some continuous columns, some Boolean, some categorical and so on.

The code above uses the fastai function proc_df (process dataframe).

  • It pulls out the dependent variable, i.e. ‘Sales’, from the dataframe ‘joined_samp’ and stores it in a separate variable y.
  • The rest of the dataframe, excluding the dependent variable ‘Sales’, is saved in df.
  • nas: a dictionary of the columns in which missing values were filled, along with the median used for each.
  • mapper: a DataFrameMapper which stores the mean and standard deviation of the continuous variables, which are then used for scaling at test time.
  • It also handles missing values in categorical variables: a missing value becomes 0 and the other categories become 1, 2, 3, 4 and so on.
  • For continuous variables, it replaces missing values with the median and creates a new Boolean column that says whether the value was missing or not.
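The behaviour listed above can be imitated in plain pandas. The function simple_proc below is a hypothetical, simplified stand-in and not fastai's implementation: it assumes the caller passes the categorical column names explicitly, and it leaves out the scaling step handled by the mapper.

```python
import pandas as pd

def simple_proc(df, y_col, cat_cols):
    """Rough stand-in for the preprocessing described above (not fastai's code)."""
    df = df.copy()
    y = df.pop(y_col).to_numpy()              # dependent variable, split off
    nas = {}
    for col in list(df.columns):
        if col in cat_cols:
            # Category codes start at 0 and missing is -1, so adding 1 maps
            # missing -> 0 and the real categories -> 1, 2, 3, ...
            df[col] = df[col].astype("category").cat.codes + 1
        elif df[col].isnull().any():
            nas[col] = df[col].median()
            df[col + "_na"] = df[col].isnull()   # flag what was missing
            df[col] = df[col].fillna(nas[col])
    return df, y, nas

# Made-up rows with one missing value of each kind.
raw = pd.DataFrame({
    "Assortment": ["a", "c", None, "a"],
    "CompetitionDistance": [100.0, None, 300.0, 500.0],
    "Sales": [10, 20, 30, 40],
})
df, y, nas = simple_proc(raw, "Sales", cat_cols=["Assortment"])
print(df)
print(nas)
```

Keeping the nas dictionary matters because the same medians have to be reused when the test set is processed.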

The output df now holds every variable in numeric form. The categorical columns are represented by equivalent numeric codes. Check out how the Year and Assortment columns changed before and after.

In this way we have all the columns in numeric form.

  • The continuous columns remain the same. They have been converted to float32, as that is the standard numerical data type that PyTorch accepts.
  • The categorical columns get converted into equivalent numeric codes.


The problem statement for the Rossmann data, as per Kaggle, is to predict the next two weeks of sales. Since it is time series data, our validation dataset isn’t random. Instead it is the most recent data, as it would be in a real application.
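Picking the most recent rows for validation can be sketched as follows. The data is made up and the 3-day window is an arbitrary choice for this toy example; for the real problem the window would match the length of the period being predicted.

```python
import pandas as pd

# Made-up daily data, sorted oldest to newest.
dates = pd.date_range("2015-06-01", periods=10, freq="D")
df = pd.DataFrame({"Date": dates, "Sales": range(10)})

# Hold out the most recent rows rather than a random sample.
cutoff = df["Date"].max() - pd.Timedelta(days=3)
val_idx = df.index[df["Date"] > cutoff].tolist()

print(val_idx)
```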


  • As per the Kaggle rules of the Rossmann competition, we will be evaluated on RMSPE (Root Mean Square Percentage Error).
  • RMSPE will be the metric instead of accuracy, so we have formulated that in the code below:-
  • Let's create our model data object:-
  • Earlier it was ImageClassifierData, as we were dealing with images back then. In this case it's ColumnarModelData, as we are dealing with columnar table data.
  • Instead of from_paths we have from_data_frame.
  • PATH: where to store the model file.
  • val_idx: list of indexes of the rows to be put in the validation dataset.
  • df: the data frame.
  • y1: the dependent variable.
  • cat_flds: which columns are to be treated as categorical variables, since by this time all the columns have been converted to numbers.
  • bs: batch size.
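The RMSPE metric mentioned in the list above is the square root of the mean squared percentage error. Here is a minimal sketch. The metric name exp_rmspe that appears in the training output later suggests that the model works on the log of sales, so the second function assumes log-space inputs and undoes the log first; that is an assumption of this sketch, not something stated in the text.

```python
import numpy as np

def rmspe(y_pred, targ):
    """Root mean square percentage error on raw (non-log) values."""
    y_pred = np.asarray(y_pred, dtype=float)
    targ = np.asarray(targ, dtype=float)
    pct_err = (targ - y_pred) / targ
    return float(np.sqrt(np.mean(pct_err ** 2)))

def exp_rmspe(log_pred, log_targ):
    """Same metric when predictions and targets are logs: exponentiate first."""
    return rmspe(np.exp(log_pred), np.exp(log_targ))

# Two made-up predictions, each off by 10% of the true value.
score = rmspe([90.0, 220.0], [100.0, 200.0])
log_score = exp_rmspe(np.log([90.0, 220.0]), np.log([100.0, 200.0]))
print(score, log_score)
```

Because the error is divided by the target, a miss of 10 on a store that sells 100 counts as much as a miss of 20 on a store that sells 200.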


The above code goes through every categorical variable (cat_vars) and prints out the number of distinct levels or categories it has. The +1 added to the number of categories is reserved for missing values. Basically, it prints the cardinality of each variable along with the variable name.

We use the cardinality of each variable to decide how large its embedding matrix should be.

In the code above we go by a rule of thumb which says that the embedding size is the cardinality // 2, but no bigger than 50.
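The rule of thumb can be written as a one-line list comprehension. The cardinalities below are hypothetical example values (already including the +1 slot for missing values), and (c + 1) // 2 rounds the half up; plain c // 2 would be an equally reasonable reading of the rule.

```python
# Hypothetical cardinalities, each already including the +1 for missing values.
cat_sz = [("Store", 1116), ("DayOfWeek", 8), ("Year", 4)]

# About half the cardinality, capped at 50.
emb_szs = [(c, min(50, (c + 1) // 2)) for _, c in cat_sz]

for (name, _), (rows, cols) in zip(cat_sz, emb_szs):
    print(f"{name}: embedding matrix of shape [{rows}, {cols}]")
```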


  • When we create these matrices, they contain random numbers. We put them in a neural network and keep updating their values so as to reduce a loss function. These embedding matrices can be compared to a bunch of weights which get updated in a way that reduces the loss. In this way we go from random values for these weights to updated values that make some sense.
  • To put things into perspective, each categorical variable takes an integer value between 0 and the number of levels in that category. We use that value to index into the embedding matrix and pick out a particular row, append that row to all of our continuous variables, and everything after that is just the same as before.
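The lookup-and-append step described above can be shown with NumPy. This is a sketch with made-up numbers: the [8, 4] shape is the day-of-week example used in this post, and the three continuous values are invented.

```python
import numpy as np

rng = np.random.default_rng(0)

# Embedding matrix for day of week: 8 rows (7 days + 1 for missing), 4 columns.
# It starts out random; training would update it like any other weights.
emb = rng.normal(size=(8, 4))

day_idx = 7                                  # integer code of one data row
continuous = np.array([0.3, -1.2, 0.5])      # made-up continuous features

# Looking up an embedding is just indexing one row of the matrix...
day_vec = emb[day_idx]

# ...which gives the same result as multiplying a one-hot vector by the matrix.
one_hot = np.zeros(8)
one_hot[day_idx] = 1.0
same = bool(np.allclose(one_hot @ emb, day_vec))

# The looked-up row is concatenated with the continuous variables.
x = np.concatenate([day_vec, continuous])
print(same, x.shape)
```

Indexing avoids building the one-hot vector at all, which is why an embedding layer is cheaper than an explicit one-hot input followed by a linear layer.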



The above diagram shows how a neural network deals with continuous variables. Check out how the matrix multiplications work and result in the corresponding output. This is a neural network with 2 hidden layers. Suppose there are 20 columns as inputs and we have designed the network so that the 1st hidden layer has 100 hidden units and the 2nd hidden layer has 50 hidden units. Our input of shape [1,20] is multiplied by a weight matrix of shape [20,100], which results in [1,100]. We then apply ReLU to these [1,100] values, which gives activations of shape [1,100]. These activations are multiplied by weights of shape [100,50], which results in activations of shape [1,50]. These are in turn multiplied by weights of shape [50,1], which results in 1 output. In a classification problem we would apply Softmax on the last layer to get the final output.

  • NOTE :- Never put ReLU in the last layer, as Softmax needs negative values to create lower probabilities and the ReLU function gets rid of the negative values. In the case of the Rossmann sales data we are trying to predict Sales, a continuous value, hence we don’t need Softmax at the end.
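The shapes in the walkthrough above can be checked with a few lines of NumPy. This is a sketch with random weights and no bias terms, and it assumes ReLU after both hidden layers; the last layer is left without an activation because we are predicting a continuous value.

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.normal(size=(1, 20))        # one row with 20 input columns
w1 = rng.normal(size=(20, 100))     # input -> 1st hidden layer
w2 = rng.normal(size=(100, 50))     # 1st -> 2nd hidden layer
w3 = rng.normal(size=(50, 1))       # 2nd hidden layer -> single output

def relu(a):
    """Replace negative values with zero."""
    return np.maximum(a, 0)

a1 = relu(x @ w1)     # [1,20] @ [20,100] -> [1,100]
a2 = relu(a1 @ w2)    # [1,100] @ [100,50] -> [1,50]
out = a2 @ w3         # [1,50] @ [50,1] -> [1,1], no ReLU on the last layer

print(a1.shape, a2.shape, out.shape)
```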

Let’s create our learner. This is how we set up our neural network.

best lr = 1e-3

Parameters used in get_learner():

  • emb_s: use this embedding matrix for every categorical variable.
  • len(df.columns)-len(cat_vars): the number of continuous variables.
  • 0.04: dropout at the very start.
  • 1: output of the last linear layer.
  • [1000,500]: number of activations in the first and second linear layers.
  • [0.001,0.01]: dropouts used in the first and second linear layers.
  • y_range=y_range: has been described earlier.
  • tmp_name=f"{PATH_WRITE}tmp": (optional) to be used only in Kaggle kernels.
  • models_name=f"{PATH_WRITE}models": (optional) to be used only in Kaggle kernels.


!!! Question Time !!!

Qs 1 :- What do the four values in the day-of-week embedding denote?

Initially we start with four random values. They function the same way as weights, so we update these values while minimizing the loss. When the loss has been minimized, we end up with four updated values. At that point we may find that these particular parameters are human interpretable and quite interesting.

Qs 2 :- Is there any way to initialize embedding matrices besides random initialization?

If there is a pre-trained embedding matrix for cheese sales at Rossmann, we can use that embedding matrix for predicting liquor sales. This technique is used at Instacart and Pinterest. They have embedding matrices of products and stores that get shared within the organization so that people don’t have to train new ones.

Qs 3 :- What’s the advantage of using embedding matrices over one-hot encoding?

Take, for example, Sunday’s one-hot encoded vector:-

[1 0 0 0 0 0 0]

The problem with the one-hot encoding of Sunday is that the day is effectively represented by a single floating point number. It can only represent a single thing, which gives more of a linear behavior.

With embeddings, Sunday is a concept in 4-dimensional space. What we get after updating these values are rich semantic concepts. For example, if it turns out that weekends behave differently from weekdays, then we may see that a certain value in this embedding matrix is higher for Saturday and Sunday. By having these higher-dimensional vectors, we give the neural network a chance to learn these underlying rich representations. It is this rich representation that allows it to learn such interesting relationships. This idea of representing a category by an embedding is known as a “distributed representation”. A neural network's high-dimensional representation is sometimes hard to interpret. The numbers in this matrix don’t have to have just one meaning; a value can change its meaning with context because it goes through rich non-linear functions.

Although one-hot encoding has more dimensions in the case of days of the week, it is not a meaningfully high dimension. Using an embedding matrix helps reduce the size as well as represent the variable more appropriately.

Qs 4:- Are embeddings suitable for certain types of variables?

We use embeddings whenever we come across categorical variables. They do not work well, however, for variables with very high cardinality.

Qs 5:- How does add_datepart() affect seasonality?

The fastai library has an add_datepart() function which takes a dataframe and the name of its date column. It optionally removes that date column and adds many columns that carry useful information about that date. For example, it replaces Date with dayOfWeek, MonthOfYear, Year, Month, Week, Day etc. So dayOfWeek can now be represented by an [8,4] embedding matrix. Conceptually, this allows the model to pick up some interesting characteristics and patterns.

  • Suppose there is something with a 7-day cycle that goes up on Monday and down on Thursday, and only in Berlin. The model can extract that pattern, as it has all the information it needs. It's a nice way to deal with time series models. We just need to make sure that the cycle indicator or periodicity of our time series exists as a column.
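The idea behind add_datepart() can be imitated with the pandas .dt accessor. The function add_date_parts below is a hypothetical, simplified stand-in: the real fastai function produces more columns, and the column names chosen here are assumptions of this sketch.

```python
import pandas as pd

def add_date_parts(df, col, drop=True):
    """Expand a date column into several numeric/Boolean columns (illustrative)."""
    df = df.copy()
    dates = pd.to_datetime(df[col])
    df["Year"] = dates.dt.year
    df["Month"] = dates.dt.month
    df["Week"] = dates.dt.isocalendar().week.astype(int)
    df["Day"] = dates.dt.day
    df["Dayofweek"] = dates.dt.dayofweek       # Monday=0 ... Sunday=6
    df["Is_month_end"] = dates.dt.is_month_end
    if drop:
        df = df.drop(columns=[col])
    return df

# Two made-up rows.
sample = pd.DataFrame({"Date": ["2015-07-31", "2015-08-01"], "Sales": [5263, 6064]})
expanded = add_date_parts(sample, "Date")
print(expanded)
```

Each derived column can then be treated as a categorical variable with its own embedding, which is how a weekly or monthly cycle becomes visible to the model.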

Develop the learner and get the best learning rate using the below commands.

As we can see, the best learning rate is somewhere around 10^-3, so we proceed with training our neural network with lr = 10^-3. Each output row below lists the epoch, training loss, validation loss and exp_rmspe.

…, 3, metrics=[exp_rmspe])
[ 0. 0.02479 0.02205 0.19309]
[ 1. 0.02044 0.01751 0.18301]
[ 2. 0.01598 0.01571 0.17248]

…, 5, metrics=[exp_rmspe], cycle_len=1)
[ 0. 0.01258 0.01278 0.16 ]
[ 1. 0.01147 0.01214 0.15758]
[ 2. 0.01157 0.01157 0.15585]
[ 3. 0.00984 0.01124 0.15251]
[ 4. 0.00946 0.01094 0.15197]

…, 2, metrics=[exp_rmspe], cycle_len=4)
[ 0. 0.01179 0.01242 0.15512]
[ 1. 0.00921 0.01098 0.15003]
[ 2. 0.00771 0.01031 0.14431]
[ 3. 0.00632 0.01016 0.14358]
[ 4. 0.01003 0.01305 0.16574]
[ 5. 0.00827 0.01087 0.14937]
[ 6. 0.00628 0.01025 0.14506]
[ 7. 0.0053 0.01 0.14449]

Note:- Switching from gradient boosting to deep learning is good, as it requires less feature engineering and it is a simpler model which requires less maintenance. This is one of the big benefits of using a deep learning approach: we can get SoTA results with a lot less work.


Step 1:- List the categorical variable names and the continuous variable names, and put them in a data frame.

Step 2:- Create a list of row_indexes we want in validation set.

Step 3:- Creation of columnar model data object.

Step 4:- Create a list of how big we want the embedding matrix to be.

Step 5:- Call get_learner and use the exact parameters to get started with.

Step 6:- Call

I’ve used the kernels mentioned in this blogpost by William Horton .

!!! Congratulations on completing another Lesson on . Well done . !!!

If you like it , then ABC (Always be clapping . 👏 👏👏👏👏😃😃😃😃😃😃😃😃😃👏 👏👏👏👏👏)

If you have any questions, feel free to reach out on the forums or on Twitter:@ashiskumarpanda

P.S. This blog post will be updated and improved as I continue with other lessons. For more interesting stuff, feel free to check out my GitHub account.


Edit 1:- TFW Jeremy Howard approves of your post . 💖💖 🙌🙌🙌 💖💖 .
