An Engine for Fast and Approximate Database Queries
Asiakas / Client:
Prof. Michael MathioudakisYhteyshenkilön email / Contact email:
michael.mathioudakis@helsinki.fiKuvaus / Description:
An Engine for Fast and Approximate Database Queries
In one sentence
In this project, we build an Approximate Query Processing (AQP) engine -- i.e., a software layer on top of a relational database, that allows us to obtain fast, approximate answers to aggregate queries, with the help of Machine Learning models.
The setting
Consider a relational database table T, with numerical attributes A, B, C, etc. We are interested to approximate SQL queries of the following form.
SELECT COUNT(*) as answer FROM T WHERE lb < B < rb AND lc < C < rc
SELECT AVG(A) as answer FROM T WHERE lb < B < rb AND lc < C < rc
To approximate the answer of each query, the engine will build and use probabilistic tree models (e.g., Decision Trees or Random Forests). Unlike traditional relational systems, such as Postgres or MySQL, the engine will not provide an exact answer -- but rather an approximate estimate of the answer, along with a measure of uncertainty for the estimate. Specifically, the AQP engine will return a pair of numerical values Estimate(answer), StandardDeviation(answer), derived based on the built tree models.
Note that we are equally interested in queries with disjunctive WHERE clauses (i.e., using OR instead of AND).
Why this project is important
As the volume of data keeps increasing, so does the need to develop efficient ways to process it.
While traditional database systems have focused on producing exact answers to user queries, they do not address settings where: * users are interested in fast and approximate answers, rather than slow and exact ones; * users are interested in predictive queries -- i.e., they want to inquire about data that might be encountered later, rather than data that already exist in the database; * the volume of data is simply too large to process in its entirety every time a user submits an aggregate query (e.g., to look for the average value of a quantity).
Machine learning find a natural application in such settings, as using models trained on the data can lead to immense reduction of computational costs.
Working on this project will give you hands-on experience with database systems and engineering aspects of machine learning.
Why Tree Models
Because Probabilistic Tree Models are fast to train over large datasets, typically have very good performance, and are easy to interpret and perform computations with.
Toteustusympäristö / Implementation environment:
The proposed implementation environment is Python over Postgres.
The development of the project involves at least the following modules components.
Core
- Machine Learning: Train probabilistic tree models (e.g., Random Forests) on top of the data.
- Query Parser/Interface: Specify query from user-provided input.
- Query Engine: Execute query over pre-trained models.
Evaluation
- Tests: Automatic tests for syntactic and semantic correctness.
- Benchmarking: Evaluate the performance of the engine over extensive benchmarks (e.g., TCP).
The product
The final software product should be properly structured and documented to allow for extension by others.
Erityisvaatimukset / Requirements for participants :
At least one of the following:
- experience with Database Systems as user, or,
- experience with Machine Learning algorithms, or
- experience with Statistics / Probabilistic models.
Lisätietoja / Additional information:
To get an idea about Approximate Query Processing, you might want to skim through the following papers (it is not required to understand all their content):
- Getoor, Lise, Benjamin Taskar, and Daphne Koller. "Selectivity estimation using probabilistic models." ACM SIGMOD Record. Vol. 30. No. 2. ACM, 2001.
- Cormode, Graham, et al. "Synopses for massive data: Samples, histograms, wavelets, sketches." Foundations and Trends in Databases 4.1–3 (2012): 1-294.
- Agarwal, Sameer, et al. "BlinkDB: queries with bounded errors and bounded response times on very large data." Proceedings of the 8th ACM European Conference on Computer Systems. ACM, 2013.
- Park, Yongjoo, et al. "Database learning: Toward a database that becomes smarter every time." Proceedings of the 2017 ACM International Conference on Management of Data. ACM, 2017.
To get a hands-on idea of tree models, play with the Decision-Tree and Random Forest modules of Scikit-Learn.
Further sprcifications
The customer will provide a short but detailed specification for the mathematical expressions to be implemented in the project.
sopivat ajankohdat / possible timeframes
- 3.9.-14.12.