10/01/2025

Introduction to SQL Server Statistics - Tutorial

Wishing you all, my loving readers, Happy New Year 2025!

This is the first blog for the year 2025. As a new initiative, I'm going to (try to) write some tutorials on some selected features of the SQL server. This is my own take of these features, and written using knowledge I acquire over the years. Hope this will give something for the society. I'm also hoping to create a Youtube video associate to this. When I done, I will update the page to link to that.

Introduction to SQL Server Statistics

What are SQL Server Statistics

SQL Server statistics are mainly used by Query Optimizer, therefore before we talk about statistics, we need to know little about how SQL Server Query Optimizer works.

When running a query, there can many ways to execute it (different ways to get data, different ways to join them, etc.). Therefore SQL server need a plan before executing a query. We call it Query Plan.

Since there are can be many combinations of ways to execute a query, there can be lot of plans that optimizer can come up for one query.

Therefore, need to find the optimized plan, the best plan (or rather good enough plan …). Optimizer only have limited time to come up with a plan. It has to finish and present the plan quickly as possible. In this process optimizer, will try to find good enough plan quickly as possible. Hens, it need all the help it can get. 


In order to create the optimal plan, Query Optimizer analyse the query and then it need to know meta data about data that query is looking for.

For example query can be looking for total number of orders for one particular customer placed during a date range. To create plan for this query, optimizer need to look into orders table and customer table. 

It need to know roughly how many orders are placed during the given time period. Because depend on the number of rows we might need to do different things to aggregate the data. For example, if we find 100 rows, we can aggregate one way and if find 1M rows, we might not be able to use the same approach. 

It also need to allocate memory for query to run. In order to estimate amount of memory to allocate, it need to estimate how many rows it is dealing with.

It cannot count rows, because that will take lot of time. Optimizer need to report back to SQL server quick as possible.

So it need to find out how data is distributed on the order table. We call this cardinality estimate. This is where statistics come to play. Statistics hold how data is distributed in a table/column.

Creation of SQL Server Statistics

Statistics are create on indexes and columns. It can be on single column or combination of columns.
Two types of statistics
• Statistics create for index (can be multi columns)
• Statistics created for columns (only created on single column)

Statistics have 3 components
• Header -> meta information about statistics
• Density -> mathematical construct of the selectivity of column or column
• Histogram -> show how data is distributed

You can see statistics information by running following command

DBCC SHOW_STATISTICS(MyTable, MyIndex);


Let's create "Employee" table, which have Employee Id, Age, Salary and Gender as an example:
CREATE TABLE Employee (
    EmpId INT PRIMARY KEY, -- Primary Key column
    Age INT,    -- Integer column for age
    Salary NUMERIC(18,2),  -- Numeric column for salary with precision 18 and scale 2
    Gender VARCHAR(10),-- Gender of the employee
);

We will populate the table with some random data:
-- Insert 1000 random rows into the Employee table with salary rounded to the nearest 500
SET NOCOUNT ON;

DECLARE @i INT = 1;

WHILE @i <= 1000
BEGIN
    INSERT INTO Employee (EmpId, Age, Salary, Gender)
    VALUES (
        @i,  -- EmpId (using sequential numbers for simplicity)
        FLOOR(RAND(CHECKSUM(NEWID())) * (70 - 18 + 1)) + 18,  -- Age: random number between 18 and 70
        ROUND(CAST((RAND(CHECKSUM(NEWID())) * (100000 - 15000) + 15000) AS NUMERIC(18,2)), -2),  -- Salary: rounded to nearest 500
        CASE 
            WHEN RAND(CHECKSUM(NEWID())) < 0.4 THEN 'MALE'        -- 40% chance for MALE
            WHEN RAND(CHECKSUM(NEWID())) < 0.8 THEN 'FEMALE'     -- 40% chance for FEMALE
            WHEN RAND(CHECKSUM(NEWID())) < 0.9 THEN 'DUAL'       -- 10% chance for DUAL
            ELSE 'UNKNOWN'                                       -- 10% chance for UNKNOWN
        END -- Gender
    );

    SET @i = @i + 1;
END;

Create indexes:
CREATE INDEX IX_Employee_Age ON Employee(Age);

-- Create an index on the 'Salary' column
CREATE INDEX IX_Employee_Salary ON Employee(Salary);

-- Create an index on the 'Gender' column
CREATE INDEX IX_Employee_Gender ON Employee(Gender);

Let's check out stats now:
DBCC SHOW_STATISTICS(Employee, IX_Employee_Age);



Let's deep dive into each section in the statistic data structure:

Header
Header contains name of the statistics, when it was last updated, row count AT THE TIME of the statistic creation
• Name -> name of the index
• Updated -> when this was last updated
• Rows -> Number of rows at the time of the statistic creation/update
• Rows Sample

Density
How much variety (selectivity). Unique index has very high selectivity.
Density is a measure that provide insight into the selectivity of a column or combination of columns in an index

Density = 1 / Number of Distinct Values

Let's have a look at density data for Age index:



There are 53 distinct age values in Employee table (in the example data I have, this might be slightly vary on yours). So Density = 1/53 => 0.0188679245283019
But if you add primary key EmpId to that column, Density is => 1/1000 (because empid is unique in that table and we have 1000 rows)

If we look at density data for Gender index (where only 4 possible values)



So high density means less selectivity, optimizer will use this data to choose the correct indexes.
"Average Length" column in this vector shows average of total length of fields in consideration. E.g. in above Gender column in 5 characters in length.

Histogram

Let's take a look at histogram for the Salary Index: 

DBCC SHOW_STATISTICS(Employee, IX_Employee_Salary);

SELECT * FROM Employee WHERE Salary BETWEEN 80600.00 AND 81500.00
ORDER BY Salary

Let's take a look at data for 81500 bucket from the table



Histogram is the most used information in statistics.
Histogram in SQL server statistics is a data structure which describe the distribution of data in a column or index
• Histograms main structure contains up to 200 steps called, buckets
• If NULL values are allowed there is a special bucket for NULLs so 201 buckets
• If number of rows are below 200, steps will be lower than 200

• RANGE_HI_KEY -> Highest value in the range (81500)
• EQ_ROW -> Rows that equal to highest value (1 for above example for 81500 bucket, but 2 for 80600 bucket)
• RANGE_ROWS -> Number of other values between high value and previous high key (i.e. 80600) -> in above example 7 (80800, 80900, 81100, 81300, 81400)
• DISTINCT_RANGE_ROWS -> distinct values in range rows.
• AVG_RANGE_ROWS -> RANGE_ROWS/DISTINCT_RANGE_ROWS (i.e. 7/5 = 1.4)

Note that values are considered from high-key (bottom to top)

Creation of Statistics

Indexed Columns (Rowstore):
Creation of statistics in indexed column is automatic and cannot be turned off.

Non Indexed Columns
By default SQL server create stats for non-indexed columns if they are used as filtering columns in queries.
But you can turn this off

In order to see auto create stats is turned on or off

SELECT DATABASEPROPERTYEX('<<YOUR DB>>, 'IsAutoCreateStatistics')

OR

SELECT is_auto_create_stats_on 
FROM sys.databases 
WHERE name = 'YourDatabaseName'

In order to enable or disable auto create stats:
ALTER DATABASE <<your database>> SET AUTO_CREATE_STATISTICS ON/OFF

These auto-created statistics are single-column only and are named in the format: _WA_Sys_columnname_hexadecimal

To see created stats:

SELECT * FROM sys.stats 
WHERE object_id = OBJECT_ID('<<your table name>>')

SELECT * FROM sys.stats 
WHERE object_id = OBJECT_ID('dbo.Employee')



In this screenshot "PK__Employee_..." is the stats for primary key index, which will be created automatically and cannot be turned off. Same with all IX stats, those are stats for non-clustered indexes.

_WA_Sys one is the one created automatically.
Why they are prefix with _WA? Well there is no official statement from Microsoft, so some believe, it is stands for Washington, where Microsoft headquarters are in.

Currently there are no user created stats in there.

If you want to create statistics on a column (which is not normally recommended):
CREATE STATISTICS <<statistics name>>
ON <<table name>> (<<column name>>)
WITH [Options];

CREATE STATISTICS _mpa_Employee_Age
ON dbo.Employee (Age)
WITH FULLSCAN;

If we check stats for Employee tables now:



Note that, new manually created stats has "user_created" field set to true.

Update Statistics

SQL server update stats automatically. However, there might be occasions where updating statistics manually will give optimzer better advantage.

You can update stats using following T-SQL:
UPDATE STATISTICS <<table name>> <<stats name>>
[WITH [FULLSCAN | SAMPLE n PERCENT | SAMPLE n ROWS]];

Examples:
This update stats for all statistics in a table
UPDATE STATISTICS dbo.Employee

Alternatively, you can create management task to maintain statistics.



You can update statistics for all tables using following SQL:
EXEC sp_updatestats;

Remove Statistics

Though we don't really want to remove stats, in case it is required:

DROP STATISTICS <<table name>>;

DROP STATISTICS dbo.Employee

No comments:

Post a Comment

Introduction to SQL Server Statistics - Tutorial

Wishing you all, my loving readers, Happy New Year 2025! This is the first blog for the year 2025. As a new initiative, I'm going to (tr...