Query-based Workload Forecasting for Self-Driving Database Management Systems
December 07, 2020

This is an outline of a research paper, it might contain errors, read with cautious.

Lin Ma & Andrew Pavlo et al, 18 SIGMOD

  • problem to solve
    • The first step towards an autonomous database management system (DBMS) is the ability to model the target application’s workload.
    • Previous forecasting techques model the resource utilization of the queries. Such metrics, however, change whenever the physical design of the database and the hardware resources change, thereby rendering previous forecasting models useless
  • related works
  • method
    • predict the expected arrival rate of queries
    • uses the logical composition of queries in the workload rather than the amount of physical resources used for query execution
    • clustering-based technique for reducing the total number of forecasting models to maintain
  • evaluation
    • compare on 3 real-world database traces
    • implemented as external controller for PG & MySQL
  • utilization
    • automatic index selection

three challenges

  • arrival rate forecasting model
  • reduce the complexity of the workload
  • handle the changes in the workload’s patterns as well as the query mixtures

Method

  • 3 components
  • Pre-processor
    • 2 step to form templates
      • extracts constants, replaces with value placeholders, convert into prepared statements
      • formatting as normalization, use AST to rebuild
    • count number of queries in a given time interval, store count
      • aggregate stale records to save space
    • parameter sample
    • aggregate templates with equivalent semantic features to further reduce the number of unique templates, use heuristics to approximate the equivalence of templates
  • clusterer
    • purpose: reduce the total number of templates
    • features:
      • physical: amount of resources and runtime metrics when execute the query
        • pro: fine-grained
        • con: volatile
      • logical: table/column accessed, properties of AST
        • pro: persistent
        • con: limited information
      • arrival rate history
        • argued to be a better feature
        • sample timestamps to form vector
        • cosine similarity as clustering metric
        similarity=cos(θ)=ABAB=i=1nAiBii=1nAi2i=1nBi2,{\displaystyle {\text{similarity}}=\cos(\theta )={\mathbf {A} \cdot \mathbf {B} \over \|\mathbf {A} \|\|\mathbf {B} \|}={\frac {\sum \limits {i=1}^{n}{A{i}B_{i}}}{{\sqrt {\sum _{i=1}^{n}{A_{i}^{2}}}}{\sqrt {\sum_{i=1}^{n}{B_{i}^{2}}}}}},}
    • online-clustering
      • modified version of DBSCAN
      • density-based clustering scheme
      • further prune
  • Forecaster
    • what to forecast
      • predict the arrival rate patterns of the clusters’ queries
      • horizon & interval
        • interval: 1mins
        • horizon: given by planning module, and QB5000 builds a forecasting model for each required prediction horizon
    • how to forecast
      • linear regression
        • for each cluster, based on the arrival rate of the query over a specified period of time in the past
      • RNN
        • LSTM
      • ensemble
        • ensemble LR & LSTM
        • boost
      • can't predict spikes
      • add kernel regression model
        • non-parametric
        • not as well as ensemble averagely, but can predict spikes
      • hybrid method
        • if predicted volume from KR is 150% larger than ENSEMBLE, then use KR
    • how to utilize the prediction
      • automatic index selection
      • use AutoAdmin to generate set of indexes to build given predicted workload of the three largest clusters generated by Clusterer, instead of sampled workload
    • evaluation
      • number of cluster
      • prediction accuracy