Our task as data scientists at Transfix is to predict the cost to move a truckload of freight between any two points in the continental US at any time in the next 18 months. Cost prediction is not a unique business problem to solve. What makes it unique for us at Transfix is the data (or lack thereof) available to support the large number of predictions we must make across geographies, seasons, business cycles, and equipment types. There are over a million routes (or lanes, as they’re called in the industry) between any two 3-digit zip code prefixes that require a prediction from us. However, unlike the stock market, there is no single market price that defines the transaction of moving goods between two points. The unit economics, route preferences, and business goals of different-sized trucking companies and owner-operators produce a wide range of viable rates at which one can book a truck among the hundreds of thousands of carriers in this highly fragmented market.
Our internal data represents such a small percentage of the market that we have to supplement it with data from third-party aggregators. These providers collect and aggregate rates up to the metropolitan area, combining reported rates from fleets both large and small into a single time series of average monthly truck cost going back many years. This leaves us with, for example, 60 observations per lane if we have 5 years’ worth of historical data. Unfortunately, receiving aggregates rather than raw data limits our ability to account for biases in the data arising from sample size, the composition of the companies reporting those rates, and the geographic uniformity of rates reported for each lane. With minimal industry access to more granular time series data at such a wide scale, how can we deploy models that generalize well and are accurate enough to impact our business?
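To make the sparsity concrete, the kind of aggregated feed described above reduces each lane to a short monthly series. A minimal sketch with pandas, using synthetic numbers and assumed column names (`origin_zip3`, `dest_zip3`, `avg_rate_usd` are illustrative, not the actual schema):

```python
import numpy as np
import pandas as pd

# One average monthly rate per lane, keyed by 3-digit zip prefixes.
# 5 years of history -> 60 monthly observations for this lane.
months = pd.date_range("2014-01-01", periods=60, freq="MS")
rng = np.random.default_rng(0)
lane = pd.DataFrame({
    "origin_zip3": "606",   # hypothetical Chicago-area prefix
    "dest_zip3": "303",     # hypothetical Atlanta-area prefix
    "month": months,
    # synthetic seasonal signal plus reporting noise
    "avg_rate_usd": 1800
        + 200 * np.sin(2 * np.pi * months.month / 12)
        + rng.normal(0, 75, size=60),
})
print(len(lane))  # 60
```

Sixty points per lane is far too few to fit a rich model independently per lane, which motivates the modeling approaches below.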
The spirit of this talk is less about predicting the cost of a truck than about overcoming data sparsity and an erratic data-generating process with creative approaches to time series modeling. Come learn how we use the Facebook Prophet algorithm to address the noise in our data, seasonal ARIMA models to forecast our time series, and AWS Fargate to deliver these models at scale.
How to Approach Time Series Forecasting Given Noisy and Sparse Data: an Example from the Trucking Industry