Query-based Workload Forecasting for Self-Driving Database Management Systems
December 07, 2020
This is an outline of a research paper, it might contain errors, read with cautious.
Lin Ma & Andrew Pavlo et al, 18 SIGMOD
problem to solve
The first step towards an autonomous database management system
(DBMS) is the ability to model the target application’s workload.
Previous forecasting techques model the resource utilization of
the queries. Such metrics, however, change whenever the physical design of the database and the hardware resources change, thereby rendering previous forecasting models useless
related works
method
predict the expected arrival rate of queries
uses the logical composition of queries in the workload rather than the amount of physical resources used for query execution
clustering-based technique for reducing the total number of forecasting models to maintain
evaluation
compare on 3 real-world database traces
implemented as external controller for PG & MySQL
utilization
automatic index selection
three challenges
arrival rate forecasting model
reduce the complexity of the workload
handle the changes in the workload’s patterns as well as the query mixtures
Method
3 components
Pre-processor
2 step to form templates
extracts constants, replaces with value placeholders, convert into prepared statements
formatting as normalization, use AST to rebuild
count number of queries in a given time interval, store count
aggregate stale records to save space
parameter sample
aggregate templates with equivalent semantic features to further reduce the number of unique templates, use heuristics to approximate the equivalence of templates
clusterer
purpose: reduce the total number of templates
features:
physical: amount of resources and runtime metrics when execute the query
predict the arrival rate patterns of the clusters’ queries
horizon & interval
interval: 1mins
horizon: given by planning module, and QB5000 builds a forecasting model for each required prediction horizon
how to forecast
linear regression
for each cluster, based on the arrival rate of the query over a specified period of time in the past
RNN
LSTM
ensemble
ensemble LR & LSTM
boost
can't predict spikes
add kernel regression model
non-parametric
not as well as ensemble averagely, but can predict spikes
hybrid method
if predicted volume from KR is 150% larger than ENSEMBLE, then use KR
how to utilize the prediction
automatic index selection
use AutoAdmin to generate set of indexes to build given predicted workload of the three largest clusters generated by Clusterer, instead of sampled workload