Ohjelmistotuotantoprojekti / Software engineering project

An Engine for Fast and Approximate Database Queries

Asiakas / Client:

Prof. Michael Mathioudakis

Yhteyshenkilön email / Contact email:

michael.mathioudakis@helsinki.fi

Kuvaus / Description:

An Engine for Fast and Approximate Database Queries

In one sentence

In this project, we build an Approximate Query Processing (AQP) engine -- i.e., a software layer on top of a relational database, that allows us to obtain fast, approximate answers to aggregate queries, with the help of Machine Learning models.

The setting

Consider a relational database table T, with numerical attributes A, B, C, etc. We are interested to approximate SQL queries of the following form.

SELECT COUNT(*) as answer FROM T WHERE lb < B < rb AND lc < C < rc

SELECT AVG(A) as answer FROM T WHERE lb < B < rb AND lc < C < rc

To approximate the answer of each query, the engine will build and use probabilistic tree models (e.g., Decision Trees or Random Forests). Unlike traditional relational systems, such as Postgres or MySQL, the engine will not provide an exact answer -- but rather an approximate estimate of the answer, along with a measure of uncertainty for the estimate. Specifically, the AQP engine will return a pair of numerical values Estimate(answer), StandardDeviation(answer), derived based on the built tree models.

Note that we are equally interested in queries with disjunctive WHERE clauses (i.e., using OR instead of AND).

Why this project is important

As the volume of data keeps increasing, so does the need to develop efficient ways to process it.

While traditional database systems have focused on producing exact answers to user queries, they do not address settings where: * users are interested in fast and approximate answers, rather than slow and exact ones; * users are interested in predictive queries -- i.e., they want to inquire about data that might be encountered later, rather than data that already exist in the database; * the volume of data is simply too large to process in its entirety every time a user submits an aggregate query (e.g., to look for the average value of a quantity).

Machine learning find a natural application in such settings, as using models trained on the data can lead to immense reduction of computational costs.

Working on this project will give you hands-on experience with database systems and engineering aspects of machine learning.

Why Tree Models

Because Probabilistic Tree Models are fast to train over large datasets, typically have very good performance, and are easy to interpret and perform computations with.

Toteustusympäristö / Implementation environment:

The proposed implementation environment is Python over Postgres.

The development of the project involves at least the following modules components.

Core

Evaluation

The product

The final software product should be properly structured and documented to allow for extension by others.

Erityisvaatimukset / Requirements for participants :

At least one of the following:

Lisätietoja / Additional information:

To get an idea about Approximate Query Processing, you might want to skim through the following papers (it is not required to understand all their content):

To get a hands-on idea of tree models, play with the Decision-Tree and Random Forest modules of Scikit-Learn.

Further sprcifications

The customer will provide a short but detailed specification for the mathematical expressions to be implemented in the project.

sopivat ajankohdat / possible timeframes