Correlation Statistics in SingleStore

Optimizing query performance involving correlated columns in SingleStore

In the world of database optimization, the key to crafting efficient query plans lies in understanding the size and distribution of table data. The query optimizer, a pivotal component, relies on this information to estimate the number of rows generated by different facets of a query plan to choose the most optimal execution strategy.

The query optimizer needs good estimates of selectivity to pick good plans. In SingleStore, this process is seamlessly orchestrated by Autostats — the automatic statistics-gathering system — along with Sampling. These systems empower the optimizer with essential insights. However, correlated columns can be tricky and cause bad selectivity estimates because Selectivity(A AND B) <> Selectivity(A)*Selectivity(B).

In this blog, we'll explore the impact of correlated columns on query optimization and discuss how SingleStore’s correlation statistics help you get a better selectivity estimate when related columns are involved in filters.

Understanding correlated columns

Correlated columns are not independent of each other — the value of one column depends on the value of another. For instance, the values in a "zip code" column are inherently correlated with a "state" column. The challenge arises when executing queries involving filters on multiple correlated columns. Filtering data by one column yields a certain selectivity; by introducing another column that is related to the original column, the entire equation shifts as the correlation levels influence the selectivity of these filters.

Let's consider an example. Suppose you're running a large car service center, and you have this table schema:

1CREATE TABLE `cars_metadata` ( 2  `VIN` bigint(20) NOT NULL AUTO_INCREMENT, 3  `First_Name` text CHARACTER SET utf8 COLLATE utf8_general_ci,4  `Last_Name` text CHARACTER SET utf8 COLLATE utf8_general_ci, 5  `Country` text CHARACTER SET utf8 COLLATE utf8_general_ci, 6  `Car_Brand` text CHARACTER SET utf8 COLLATE utf8_general_ci, 7  `Car_Model` text CHARACTER SET utf8 COLLATE utf8_general_ci, 8  `Car_Color` text CHARACTER SET utf8 COLLATE utf8_general_ci, 9  `Year_of_Manufacture` bigint(20) DEFAULT NULL, 10  `Credit_Card_Type` text CHARACTER SET utf8 COLLATE utf8_general_ci, 11  KEY `VIN` (`VIN`) USING HASH12); 13
14CREATE TABLE `service_record` ( 15  `VIN` bigint(20) NOT NULL,16  `Service_Cost_USD` bigint(20) DEFAULT NULL, 17  `Service_Date` date DEFAULT NULL, 18  KEY `VIN` (`VIN`) USING HASH19);

For purposes of illustration, we filled the cars_metadata table with a set of 7.68 million rows and the service_record table with a set of 84.48 million records.

Now let’s say you want to understand the trend of Honda cars that had a service cost less than $1,000 by date.

1select Service_Date, 2       count(*) 3from service_record a, 4     cars_metadata b 5where a.VIN = b.VIN and 6            b.Car_Brand = 'Honda' and7            a.Service_Cost_USD < 10008group by Service_Date;

Observing the visual profile of the query given sampling, the optimizer has a pretty good estimate for filters without any correlated columns ( close enough to the actuals), resulting in the optimized plan for this scenario.

Now let’s see how this changes with an additional filter based on Car_Model, which we know is correlated with Car_Brand. Let’s say you want to understand the trend of any Honda Accord that had a service cost less than $1,000 by date.

1select Service_Date,2       count(*) 3from service_record a, 4     cars_metadata b 5where a.VIN = b.VIN and 6           b.Car_Brand = 'Honda' and7           b.Car_Model = 'Accord'8           a.Service_Cost_USD < 10009group by Service_Date;10

Introducing an additional filter based on the car_model has significantly impacted the selectivity, which can be observed by looking at the estimated filter selection of 4.3k records while the actual is 53.5k records. The optimizer gets these estimates through multiplication of selectivity between each individual filter, resulting in a significantly lower overall selectivity and less efficient query plans.

Correlation statistics to the rescue

To address this challenge, SingleStore employs correlation statistics. In queries where the filtered columns have a strong correlation, like Car_Brand and Car_Model in the preceding example, the Car_Brand filter becomes redundant. It is prudent to adopt the more selective filter's selectivity for a more accurate estimation and efficient execution plan, leading to better overall performance.

Optimizing the query plan can be achieved by using statistics that account for correlated column relationships and dependencies within a query predicate. Correlation coefficients can be added between columns to fine-tune how the optimizer combines the selectivity of single-column filters when using histogram estimation.

The correlation coefficient measures the relationship strength between two columns ranging from 0 to 1, where 0 means no correlation between two columns and 1 means strong correlation between two columns. The optimizer defaults the correlation coefficient to 0.5 between any two columns.

One way to measure the relationship between two columns is using Cramer’s V statistic. Cramer's V measures the strength of association between two categorical variables. Output ranges from 0 to 1, where 0 indicates no association and 1 indicates a perfect association.

Now let’s calculate Cramer's V statistic on Car_Model and Car_Brand, to understand their strength of relationship.

1SELECT *,2       sqrt(chi2 / (total * least(k-1,r-1)))3FROM(4SELECT sum(observed_frequency) as total,5       sum(power((observed_frequency - ((row_total * column_total)/total)),2)/((row_total * 6column_total)/total)) as chi2,7       count(distinct car_brand) as k,8       count(distinct car_model) as r9FROM(10SELECT *,11       sum(observed_frequency) OVER(PARTITION BY car_brand) AS row_total,12       sum(observed_frequency) OVER(PARTITION BY car_model) AS column_total,13       sum(observed_frequency) OVER() as total14from(15SELECT 16    car_brand,17    car_model,18    COUNT(*) AS observed_frequency19FROM 20    cars_metadata a21GROUP BY 22    1,23    2)));

Given the output of Cramer’s V from the query is 0.98, it indicates a very strong correlation between Car_Model and Car_Brand. Let’s round off the 0.98 (Cramer’s V Statistic) we have got from running the above query to 1.0 and set the correlation coefficient to 1.0 by running the following command.

1ANALYZE TABLE cars_metadata CORRELATE COLUMN Car_Model WITH COLUMN Car_Brand USING COEFFICIENT 1.0;

Now let’s re-run the same query to see how the optimizer selectivity and query plan has changed.

1select Service_Date, 2       count(*) 3from service_record a, 4     cars_metadata b 5where a.VIN = b.VIN and 6           b.Car_Brand = 'Honda' and7           b.Car_Model = 'Accord'8           a.Service_Cost_USD < 10009group by Service_Date;10

Now the selectivity for the filters have improved significantly and is close to the actuals, notice that repartition operations were added lower in the plan and a bloom filter is applied on the left in the new plan. This enabled faster overall execution for this query — almost 2x faster (94ms with correlation statistic set to 1.0 vs 187ms originally).

Conclusion

Optimizing query performance in the presence of correlated columns is a nuanced task. Recognizing the impact of correlation on selectivity and leveraging correlation statistics can significantly improve the efficiency of query plans. By fine-tuning correlation coefficients, users can tailor the optimizer's behavior to suit the specific relationships between columns, ultimately leading to better overall database performance.

Harnessing the power of its robust query optimizer, SingleStore can now seamlessly navigate the complexities of correlated columns and data distribution.