My Blog

This is an outline of a research paper, it might contain errors, read with cautious.

Lin Ma & Andrew Pavlo et al, 18 SIGMOD

problem to solve
- The first step towards an autonomous database management system (DBMS) is the ability to model the target application’s workload.
- Previous forecasting techques model the resource utilization of the queries. Such metrics, however, change whenever the physical design of the database and the hardware resources change, thereby rendering previous forecasting models useless
related works
method
- predict the expected arrival rate of queries
- uses the logical composition of queries in the workload rather than the amount of physical resources used for query execution
- clustering-based technique for reducing the total number of forecasting models to maintain
evaluation
- compare on 3 real-world database traces
- implemented as external controller for PG & MySQL
utilization
- automatic index selection

three challenges

arrival rate forecasting model
reduce the complexity of the workload
handle the changes in the workload’s patterns as well as the query mixtures

Method

3 components

Pre-processor
- 2 step to form templates
  - extracts constants, replaces with value placeholders, convert into prepared statements
  - formatting as normalization, use AST to rebuild
- count number of queries in a given time interval, store count
  - aggregate stale records to save space
- parameter sample
- aggregate templates with equivalent semantic features to further reduce the number of unique templates, use heuristics to approximate the equivalence of templates
clusterer
- purpose: reduce the total number of templates
- features:
  - physical: amount of resources and runtime metrics when execute the query
    - pro: fine-grained
    - con: volatile
  - logical: table/column accessed, properties of AST
    - pro: persistent
    - con: limited information
  - arrival rate history
    - argued to be a better feature
    - sample timestamps to form vector
    - cosine similarity as clustering metric
    ${\text{similarity}}=\cos(\theta )={\mathbf {A} \cdot \mathbf {B} \over \|\mathbf {A} \|\|\mathbf {B} \|}={\frac {\sum \limits {i=1}^{n}{A{i}B_{i}}}{{\sqrt {\sum _{i=1}^{n}{A_{i}^{2}}}}{\sqrt {\sum_{i=1}^{n}{B_{i}^{2}}}}}},$
- online-clustering
  - modified version of DBSCAN
  - density-based clustering scheme
  - further prune
Forecaster
- what to forecast
  - predict the arrival rate patterns of the clusters’ queries
  - horizon & interval
    - interval: 1mins
    - horizon: given by planning module, and QB5000 builds a forecasting model for each required prediction horizon
- how to forecast
  - linear regression
    - for each cluster, based on the arrival rate of the query over a specified period of time in the past
  - RNN
    - LSTM
  - ensemble
    - ensemble LR & LSTM
    - boost
  - can't predict spikes
  - add kernel regression model
    - non-parametric
    - not as well as ensemble averagely, but can predict spikes
  - hybrid method
    - if predicted volume from KR is 150% larger than ENSEMBLE, then use KR
- how to utilize the prediction
  - automatic index selection
  - use AutoAdmin to generate set of indexes to build given predicted workload of the three largest clusters generated by Clusterer, instead of sampled workload
- evaluation
  - number of cluster
  - prediction accuracy