10/01/2025

Introduction to SQL Server Statistics - Tutorial

Wishing you all, my loving readers, a Happy New Year 2025!

This is the first blog of 2025. As a new initiative, I'm going to (try to) write some tutorials on selected features of SQL Server. This is my own take on these features, written using knowledge I have acquired over the years. I hope this will give something back to the community. I'm also hoping to create a YouTube video to go with this; when it's done, I will update the page with a link to it.

Introduction to SQL Server Statistics

What are SQL Server Statistics

SQL Server statistics are mainly used by the Query Optimizer, so before we talk about statistics, we need to know a little about how the SQL Server Query Optimizer works.

When running a query, there can be many ways to execute it (different ways to get the data, different ways to join it, etc.). Therefore SQL Server needs a plan before executing a query. We call it a Query Plan.

Since there can be many combinations of ways to execute a query, there can be a lot of plans the optimizer could come up with for one query.

Therefore, it needs to find the optimal plan, the best plan (or rather, a good enough plan…). The optimizer only has limited time to come up with a plan; it has to finish and present the plan as quickly as possible. In this process the optimizer tries to find a good enough plan as quickly as possible. Hence, it needs all the help it can get.


In order to create the optimal plan, the Query Optimizer analyses the query, and then it needs metadata about the data the query is looking for.

For example, a query may be looking for the total number of orders one particular customer placed during a date range. To create a plan for this query, the optimizer needs to look into the orders table and the customer table.

It needs to know roughly how many orders were placed during the given time period, because depending on the number of rows it might need to aggregate the data differently. For example, if we expect 100 rows, we can aggregate one way; if we expect 1M rows, we might not be able to use the same approach.

It also needs to allocate memory for the query to run. In order to estimate the amount of memory to allocate, it needs to estimate how many rows it is dealing with.

It cannot simply count the rows, because that would take a lot of time; the optimizer needs to report back to SQL Server as quickly as possible.

So it needs to find out how the data is distributed in the orders table. We call this cardinality estimation. This is where statistics come into play: statistics describe how data is distributed in a table/column.

Creation of SQL Server Statistics

Statistics are created on indexes and columns. They can be on a single column or a combination of columns.
There are two types of statistics:
• Statistics created for an index (can be multi-column)
• Statistics created for a column (only created on a single column)

Statistics have 3 components:
• Header -> meta information about the statistics
• Density -> a mathematical construct of the selectivity of a column or columns
• Histogram -> shows how the data is distributed

You can see statistics information by running the following command:

DBCC SHOW_STATISTICS(MyTable, MyIndex);


Let's create "Employee" table, which have Employee Id, Age, Salary and Gender as an example:
CREATE TABLE Employee (
    EmpId INT PRIMARY KEY, -- Primary Key column
    Age INT,    -- Integer column for age
    Salary NUMERIC(18,2),  -- Numeric column for salary with precision 18 and scale 2
    Gender VARCHAR(10),-- Gender of the employee
);

We will populate the table with some random data:
-- Insert 1000 random rows into the Employee table with salary rounded to the nearest 100
SET NOCOUNT ON;

DECLARE @i INT = 1;
DECLARE @r FLOAT;

WHILE @i <= 1000
BEGIN
    SET @r = RAND(CHECKSUM(NEWID()));  -- one random value per row, so the gender split below is accurate

    INSERT INTO Employee (EmpId, Age, Salary, Gender)
    VALUES (
        @i,  -- EmpId (using sequential numbers for simplicity)
        FLOOR(RAND(CHECKSUM(NEWID())) * (70 - 18 + 1)) + 18,  -- Age: random number between 18 and 70
        ROUND(CAST((RAND(CHECKSUM(NEWID())) * (100000 - 15000) + 15000) AS NUMERIC(18,2)), -2),  -- Salary: rounded to the nearest 100
        CASE
            WHEN @r < 0.4 THEN 'MALE'      -- 40% chance for MALE
            WHEN @r < 0.8 THEN 'FEMALE'    -- 40% chance for FEMALE
            WHEN @r < 0.9 THEN 'DUAL'      -- 10% chance for DUAL
            ELSE 'UNKNOWN'                 -- 10% chance for UNKNOWN
        END -- Gender
    );

    SET @i = @i + 1;
END;

Create indexes:
CREATE INDEX IX_Employee_Age ON Employee(Age);

-- Create an index on the 'Salary' column
CREATE INDEX IX_Employee_Salary ON Employee(Salary);

-- Create an index on the 'Gender' column
CREATE INDEX IX_Employee_Gender ON Employee(Gender);

Let's check out stats now:
DBCC SHOW_STATISTICS(Employee, IX_Employee_Age);



Let's deep dive into each section of the statistics data structure. DBCC SHOW_STATISTICS returns three result sets: the header, the density vector and the histogram.

Header
The header contains meta information about the statistics: the name, when they were last updated, and the row count AT THE TIME the statistics were created/updated.
• Name -> name of the statistics (for index statistics, the name of the index)
• Updated -> when the statistics were last updated
• Rows -> number of rows in the table at the time of the statistics creation/update
• Rows Sampled -> number of rows sampled to build the statistics

Density
Density describes how much variety (selectivity) there is; a unique index has very high selectivity.
Density is a measure that provides insight into the selectivity of a column or combination of columns in an index.

Density = 1 / Number of Distinct Values

Let's have a look at the density data for the Age index:



There are 53 distinct age values in the Employee table (with the example data I have; this may vary slightly on yours). So Density = 1/53 => 0.0188679245283019.
But if you add the primary key EmpId to that column, the density becomes 1/1000 (because EmpId is unique in that table and we have 1000 rows).
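You can verify this yourself against the sample data (a quick check; your count may differ slightly because the data is random):

-- Density = 1 / number of distinct values
SELECT 1.0 / COUNT(DISTINCT Age) AS AgeDensity
FROM dbo.Employee;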

If we look at the density data for the Gender index (where there are only 4 possible values):



So high density means low selectivity; the optimizer will use this data to choose the correct indexes.
The "Average Length" column in this vector shows the average total length of the columns in consideration, e.g. the Gender column above averages about 5 characters ('MALE' is 4 characters, 'FEMALE' is 6, and so on).

Histogram

Let's take a look at the histogram for the Salary index:

DBCC SHOW_STATISTICS(Employee, IX_Employee_Salary);

SELECT * FROM Employee WHERE Salary BETWEEN 80600.00 AND 81500.00
ORDER BY Salary

Let's take a look at the data for the 81500 bucket in the table.



The histogram is the most used piece of information in statistics.
A histogram in SQL Server statistics is a data structure which describes the distribution of data in a column or index.
• The histogram's main structure contains up to 200 steps, called buckets
• If NULL values are allowed, there is a special bucket for NULLs, so up to 201 buckets
• If the number of distinct values is below 200, there will be fewer than 200 steps

• RANGE_HI_KEY -> highest value in the range (81500)
• EQ_ROWS -> number of rows equal to the highest value (1 in the example above for the 81500 bucket, but 2 for the 80600 bucket)
• RANGE_ROWS -> number of rows with values between this high key and the previous high key (i.e. 80600), excluding both -> 7 in the example above, spread over 5 distinct values (80800, 80900, 81100, 81300, 81400)
• DISTINCT_RANGE_ROWS -> number of distinct values among the range rows (5 here)
• AVG_RANGE_ROWS -> RANGE_ROWS / DISTINCT_RANGE_ROWS (i.e. 7/5 = 1.4)

Note that each bucket covers the values above the previous RANGE_HI_KEY up to and including its own RANGE_HI_KEY.
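To see the histogram in action, compare the optimizer's estimate for a value that falls inside a range step. With my example histogram, a salary of 81100 falls between the 80600 and 81500 keys; your buckets will differ because the data is random:

-- Enable "Include Actual Execution Plan" (Ctrl + M in SSMS), then run:
SELECT * FROM dbo.Employee WHERE Salary = 81100.00;
-- The estimated number of rows on the scan/seek operator should match
-- AVG_RANGE_ROWS (about 1.4 here), even if the actual count differs.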

Creation of Statistics

Indexed Columns (Rowstore):
Statistics on indexed columns are created automatically, and this cannot be turned off.

Non-Indexed Columns:
By default, SQL Server creates statistics for non-indexed columns if they are used as filtering columns in queries.
But you can turn this off.

To see whether auto create stats is turned on or off:

SELECT DATABASEPROPERTYEX('<<YOUR DB>>', 'IsAutoCreateStatistics')

OR

SELECT is_auto_create_stats_on 
FROM sys.databases 
WHERE name = 'YourDatabaseName'

To enable or disable auto create stats:
ALTER DATABASE <<your database>> SET AUTO_CREATE_STATISTICS {ON | OFF}

These auto-created statistics are single-column only and are named in the format: _WA_Sys_columnname_hexadecimal

To see created stats:

SELECT * FROM sys.stats 
WHERE object_id = OBJECT_ID('<<your table name>>')

SELECT * FROM sys.stats 
WHERE object_id = OBJECT_ID('dbo.Employee')
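If you also want to see which columns each statistic covers, you can join to sys.stats_columns and sys.columns (a variant I find useful):

SELECT s.name AS stats_name,
       c.name AS column_name,
       s.auto_created,
       s.user_created
FROM sys.stats s
JOIN sys.stats_columns sc
    ON s.object_id = sc.object_id AND s.stats_id = sc.stats_id
JOIN sys.columns c
    ON sc.object_id = c.object_id AND sc.column_id = c.column_id
WHERE s.object_id = OBJECT_ID('dbo.Employee');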



In this screenshot, "PK__Employee_..." is the statistics object for the primary key index, which is created automatically and cannot be turned off. The same goes for all the IX_ stats; those are statistics for the non-clustered indexes.

The _WA_Sys one is a statistic that was created automatically.
Why are they prefixed with _WA? Well, there is no official statement from Microsoft, but some believe it stands for Washington, where Microsoft's headquarters are located.
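If you want to watch one of these being created (do this on a throwaway copy of the table, since it drops an index we created earlier):

-- Drop the Gender index, which also drops its statistics
DROP INDEX IX_Employee_Gender ON dbo.Employee;

-- Filtering on the now non-indexed column triggers auto-creation
SELECT COUNT(*) FROM dbo.Employee WHERE Gender = 'MALE';

-- A _WA_Sys_... statistic on Gender should now appear
SELECT name, auto_created FROM sys.stats
WHERE object_id = OBJECT_ID('dbo.Employee');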

Currently there are no user-created statistics in there.

If you want to create statistics on a column (which is not normally recommended):
CREATE STATISTICS <<statistics name>>
ON <<table name>> (<<column name>>)
WITH [Options];

CREATE STATISTICS _mpa_Employee_Age
ON dbo.Employee (Age)
WITH FULLSCAN;

If we check the stats for the Employee table now:



Note that the new manually created statistic has its "user_created" field set to true.

Update Statistics

SQL Server updates statistics automatically. However, there may be occasions where updating statistics manually gives the optimizer a better advantage.

You can update stats using following T-SQL:
UPDATE STATISTICS <<table name>> <<stats name>>
[WITH [FULLSCAN | SAMPLE n PERCENT | SAMPLE n ROWS]];

Examples:
This updates all the statistics on a table:
UPDATE STATISTICS dbo.Employee;
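You can also target a single statistic and control the sample size, for example with the index statistics created earlier:

-- Rebuild one statistic by reading every row
UPDATE STATISTICS dbo.Employee IX_Employee_Age WITH FULLSCAN;

-- Or rebuild it from a 50 percent sample
UPDATE STATISTICS dbo.Employee IX_Employee_Age WITH SAMPLE 50 PERCENT;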

Alternatively, you can create a maintenance task to keep statistics up to date.



You can update statistics for all tables using the following SQL:
EXEC sp_updatestats;
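To see when each statistic on a table was last updated (useful before deciding whether a manual update is worth it):

SELECT name,
       STATS_DATE(object_id, stats_id) AS last_updated
FROM sys.stats
WHERE object_id = OBJECT_ID('dbo.Employee');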

Remove Statistics

Though we don't usually want to remove statistics, in case it is required (note that the statistics name must be qualified with the table name):

DROP STATISTICS <<table name>>.<<statistics name>>;

DROP STATISTICS dbo.Employee._mpa_Employee_Age;

30/11/2024

Using Codestral in VS Code - Part 1

This is part of a series of articles where I explain how I'm learning AI code assistance in VS Code.

You can read the first article here. It explains how you connect Codestral to VS Code.

OK, now that we have connected Codestral to VS Code, let's see how we can use it to assist with coding. In this part I'm going to ask Codestral to write a whole program for me, and probably make a small edit. For simplicity, I'm going to use a .NET Core console application and ask Codestral to create a snake game in C#.

Get project created

First let's get a project created for our game in VS Code. Open a folder and create a new folder called "SnakeGageCodestral" (or any preferred name).

Then open the terminal in VS Code and create a console project by running the following command:

dotnet new console

That will create the basic structure with a Program.cs file in it.


Open the Program.cs and clear any code in it.


Get AI Assistance

Open "Continue" extension. Select your AI modal (in this case we are using Codestral) and ask the question from AI.

You can use a prompt like this: Can you create a snake game in .Net Core console project in C sharp?



Codestral will be more helpful than you think and will also include the steps to create the project, but we are interested in the code section. Choose that section (step 4 in the example below) and apply it.


This will get applied to the Program.cs file.
Check for any obvious compile-time errors; if there are any, you can get assistance from the AI by simply pasting the error message into the chat box.

Mine was OK, so the next step is to run it. Use the following command in the VS Code terminal.

dotnet run

Mine worked ok.



Editing using AI

Though it worked the first time, the snake was moving too fast and I wanted to slow it down. So I asked the AI assistant to make an edit. It suggested the line that I should edit, as per the screenshot below.


That's it. We will have a look at more features later.

17/11/2024

Installing PowerShell on Windows

PowerShell is a powerful command line interface for Windows PCs and servers, and it is installed by default when you install Windows on a PC or server. However, if you want the most up-to-date version of PowerShell, it is up to you to install it manually.

The following article is based on a Windows 10 PC, so some version numbers might be different on other versions of Windows.

The PowerShell version you get when you install Windows is called the "Desktop Edition". You can see this by running the $PSVersionTable command in PowerShell:

As you can see, the version of this edition of PowerShell is 5.1.xxx.

But there is another edition, written on .NET Core, which is currently at 7.4.xx.


This version is much more sophisticated and has the ability to integrate with more modern development environments. Therefore, if you are a software developer or someone who manages modern tools/software, I recommend using this edition of PowerShell.

So let's see how we can install this edition.

Microsoft has a comprehensive article on how to install PowerShell here. The following is a simplified version of it.

There are a few ways to install PowerShell.

  1. Using Winget
  2. Using MSI package
  3. Using .NET Global Tool
  4. Using Microsoft Store

Using Winget

If you are not already aware, winget is the Windows Package Manager. You can run the following command on the command line (or even in the default edition of PowerShell) to install via winget:

winget install --id Microsoft.PowerShell --source winget


Using MSI package

This is the more traditional way of installing. You will have to download it and install it manually.

The following is the link to the current version of the MSI package -> MSI

This will download the MSI package from GitHub; you will need to save it and run it (note the above is the x64 version; if you need a different version, please search for it).

The MSI package gives more control over the installation; you can specify various options while installing it. For more information, please refer to the original Microsoft article mentioned above.


Using .NET Global Tool

If you have the .NET Core SDK already installed on your computer, you can install PowerShell as a .NET Global Tool (which can then be invoked from any directory with the pwsh command).

Just run the following command on the command line:

dotnet tool install --global PowerShell


Using Microsoft Store

Like any other application on the Microsoft Store, PowerShell can be installed from the Microsoft Store.









17/10/2024

IIS Max Allowed Content Length

Last week, one of our third party integrators who uses our web API asked what the maximum number of IDs they can query is.

In the API they use, they send (POST) an array of IDs, and in return the API provides the status of those records. So they wanted to know the maximum number of IDs they can send in the array.

I didn't have an answer straight away. That got me thinking.

The IDs are in the body, because it is a POST request, so the "maxQueryString" limit does not come into play as far as I can see.

I did a bit of Googling and found out there is a setting in IIS which filters requests, called "maxAllowedContentLength". As I found out, this is a security setting which stops someone sending huge requests. It can be found in the system.webServer/security/requestFiltering section. If you look in the web.config file, you can see it in a section like the following (a representative snippet; the value shown is the default):
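<!-- A representative snippet; the value is in bytes -->
<system.webServer>
  <security>
    <requestFiltering>
      <requestLimits maxAllowedContentLength="30000000" />
    </requestFiltering>
  </security>
</system.webServer>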


This setting is there to prevent DoS attacks.

The value of this setting is in bytes. The default value is 30,000,000 (30 million) bytes, which is approximately 28.6 MB. Most of the time on production web servers, we tend to set a much lower limit.

While 28.6 MB can accommodate a large number of IDs for our integrator, in order to fulfil such a request the back end (data stores) must work very hard. Therefore, I advised them not to send too many IDs in their request, and to split it into several sub-requests (paging) in order to prevent overwhelming the data store servers.

If you exceed "maxAllowedContentLength" limit, you will get HTTP 404.13 - Request Entity Too Large.



01/10/2024

Connecting Codestral with VS Code

These days the AI hype is so high that I thought I should also test some tools I can use to increase my productivity. That's when I came across Codestral by Mistral AI.

Codestral is a coding assistant which can be integrated into your development environment, and best of all, it is free.

So I tested whether I could integrate Codestral with VS Code. Here is what I experienced.

Step 1 - Create a Mistral AI account. Don't worry, it is free (as of now).

Step 2 - Enable Codestral. Codestral is part of Mistral AI (like a module), but it needs to be activated separately.

Step 3 - Open VS Code and install the "Continue" extension. The Continue extension is used to integrate VS Code and Codestral. This extension can be used to integrate VS Code with other AI code assistants as well, but here we are looking at how to integrate with Codestral.


Once installed, it will appear on the left side bar in VS Code.


You can move it to the right side, so it will not cover your file explorer and other important sections. But it's totally up to you.

Step 4 - Generate an API key. Go to the Mistral AI website again, go to the Codestral section, and click on the Generate API button to generate an API key.


Step 5 - Add Codestral to the Continue extension. Click on the models drop-down and click on "Add chat model".


Then select "Mistral" as the provider and "Codestral" as the model. Enter the API key you have generated before in the box provided. Finally click Connect.

That's it, now you are connected to your Codestral account in VS Code.

In the next article we will look at basic usage of Codestral in VS Code.





20/09/2024

SSMS Tips and Tricks - Part 02 - Query Elapsed Time

When we run a query in SSMS, one of the first and most important things we look at is the elapsed time, i.e. how much time the query took to complete.

We can easily see this in the status bar of the Query Result window.


However, this is only accurate to the second. What if you want a more precise figure, especially for queries that are very quick but critical to performance?

That's when you can use the Properties window for the query. To see it, right click on the query window (see below) and select "Properties Window".

In the properties window, you can see a property called "Connection elapsed time" and many other properties.



This shows the elapsed time to the closest millisecond.
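Another handy option, if you prefer T-SQL to the Properties window, is SET STATISTICS TIME, which reports CPU and elapsed times in milliseconds in the Messages tab:

SET STATISTICS TIME ON;

SELECT COUNT(*) FROM sys.objects;  -- your query here

SET STATISTICS TIME OFF;
-- The Messages tab will show something like:
--   SQL Server Execution Times:
--      CPU time = 0 ms,  elapsed time = 1 ms.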

RDP to AzureAD Laptop

My work laptop is AzureAD joined and I use an AzureAD account (from my workplace) to log in to it. For easy access I regularly RDP into it from my home laptop (when working from home in these unusual times). My home laptop just uses my personal Microsoft account and is not joined to any domain.

Then one day (very recently), my home laptop refused to log me into the work laptop. I was pretty sure I was typing the correct password and accessing the correct IP, because I hadn't changed a thing. I was puzzled.

I was sure this was something Microsoft had suddenly changed via an update. I Googled around and, luckily, was able to find the following article:

Remote Desktop to Azure AD Joined Computer

This was a life-saving article written by Bradley Schacht around August 2019. It showed exactly the issue I was having and gave very clear step-by-step instructions to resolve it.

A basic summary of the article is as follows:

  • First you need to add two new entries if they are not already there in your RDP file. Those entries are:
    • enablecredsspsupport:i:0
    • authentication level:i:2
  • Then you need to access your account in following manner:
    • .\AzureAD\yourazureaduser@company.com
I'm not going to spell out the instructions here again; if you are interested, please visit the link above and you will be guided very clearly through the whole process.

Thanks again to Bradley Schacht for the amazing article.


12/09/2024

SQL Server Dynamic Management Views - Part 1

SQL Server Dynamic Management Views (DMVs) are system views which provide insight into the internal workings of SQL Server. When troubleshooting SQL Server or its performance, DMVs play a major part in giving us very useful information.

There are hundreds of DMVs in SQL Server today. This series of blogs is going to look into the interesting ones.

One of the DMVs I came across recently is called -> dm_exec_query_optimizer_info.

This DMV provides us with statistical data about SQL Server's Query Optimizer. It returns a table with the following columns:

• counter -> name of the counter/event which is measured
• occurrence -> number of times the event has occurred
• value -> average value of the property (per occurrence)

Data Collection

Note that these values and occurrences are reset when SQL Server restarts, and the collected values and occurrences are cumulative (they add up for each event). If a query was executed using a cached plan, the stats here do not change (i.e. they are only updated when the optimizer actually does work).

Permission

SQL Server 2019 and earlier -> VIEW SERVER STATE

SQL Server 2022 and later -> VIEW SERVER PERFORMANCE STATE

Execution

SELECT * FROM sys.dm_exec_query_optimizer_info

Data

This view collects data for about 40 events (this might vary from version to version). Some of the most interesting ones are listed below:

Counter       | Occurrence                                        | Value
optimizations | Total number of optimizations                     | N/A
elapsed time  | Number of times elapsed time was measured         | Average time taken
final cost    | Number of times a final cost was calculated       | Average final cost
trivial plan  | Number of times a trivial plan was used           | N/A
no plan       | Number of times the optimizer didn't find a plan  | N/A
timeout       | Number of times the optimizer timed out           | N/A
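For example, to pull out just the counters discussed above:

SELECT counter, occurrence, value
FROM sys.dm_exec_query_optimizer_info
WHERE counter IN ('optimizations', 'elapsed time', 'final cost', 'trivial plan', 'no plan', 'timeout');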

Total List (in 2019)

• optimizations
• elapsed time
• final cost
• trivial plan
• tasks
• no plan
• search 0
• search 0 time
• search 0 tasks
• search 1
• search 1 time
• search 1 tasks
• search 2
• search 2 time
• search 2 tasks
• gain stage 0 to stage 1
• gain stage 1 to stage 2
• timeout
• memory limit exceeded
• insert stmt
• delete stmt
• update stmt
• merge stmt
• contains subquery
• unnest failed
• tables
• hints
• order hint
• join hint
• view reference
• remote query
• maximum DOP
• maximum recursion level
• indexed views loaded
• indexed views matched
• indexed views used
• indexed views updated
• dynamic cursor request
• fast forward cursor request

Note that the optimizer counters are not updated if the optimizer chooses to use a cached plan.


03/09/2024

Two Digit Year Cut-off Setting in SQL Server

I recently came across a SQL Server setting which controls how it interprets a year specified as two digits.

Since a year number contains 4 digits (e.g. 2024), if a user specifies the year as two digits (e.g. 24), SQL Server needs a way to decide how to interpret the 2-digit year as a full 4-digit year. For example, 24 could be the year 1924, the year 2024, or even the year 2124.

The default setting is 2049, which means two-digit years are interpreted as years from 1950 to 2049:

00 => Interpret as => 2000

20 => Interpret as => 2020

49 => Interpret as => 2049

50 => Interpret as => 1950

78 => Interpret as => 1978
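You can see this behaviour with a quick test (assuming the default cutoff of 2049 and a month-day-year date format):

SELECT CAST('01/01/49' AS DATETIME) AS Year49;  -- 2049-01-01
SELECT CAST('01/01/50' AS DATETIME) AS Year50;  -- 1950-01-01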


This setting can be changed in Server Properties.

Right click on the SQL Server node and select Properties, then go to the "Advanced" section.



Or using T-SQL:

USE AdventureWorks2022;
GO
EXEC sp_configure 'show advanced options', 1;
GO
RECONFIGURE;
GO
EXEC sp_configure 'two digit year cutoff', 2030;
GO
RECONFIGURE;
GO


Note that the setting is in the advanced options section, which is why 'show advanced options' is enabled first.

The above code sets the two digit year cutoff to 2030, which means years from 1931 to 2030.

It is always recommended to use 4-digit years in your code to avoid ambiguity.

Reference -> Microsoft Books Online


25/08/2024

SSMS Tips and Tricks - Part 01

I have been using SSMS for so many years now, but here is a trick I learned recently.

Let's say you are working in a query window (not the query designer), probably with an existing query, and you need to add a table with a very large number of columns and select all the columns by name (rather than using SELECT *).

What is the best way to do this? You could type the query by hand and write all those column names one by one, but that would be laborious/tedious.

Here is a better approach:

Type the query in the SELECT * FROM <<table>> format.

Now select the whole line in the query window and press the Ctrl + Shift + Q key combination.

Voila: the Query Designer window appears with your query in it.

Now, if you just want to add all the columns to the query window, just press OK in the Query Designer window.

The query window's query will be updated to include all the column names:
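For example, with a hypothetical Employee table, the designer turns the first query into something like the second (the exact formatting it produces may differ):

-- Before:
SELECT * FROM dbo.Employee

-- After Ctrl + Shift + Q and OK:
SELECT EmpId, Age, Salary, Gender
FROM dbo.Employee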




Or you can further edit the query in the Query Designer window (e.g. add another table), then press the OK button when you finish. Whatever query you designed in the designer window will be written into your query window.

That was very convenient and saved me a lot of time last week.

Also, did you know you can right click on a table in the Query Designer and select all the columns in that table?





28/07/2024

Connecting to PostgreSQL using pgAdmin

If you use SSMS to connect to SQL Server, you are probably looking for a similar tool for PostgreSQL. Well, you are in luck; for that we have the pgAdmin GUI tool.


If you launch the pgAdmin tool on the server where you installed PostgreSQL, it will most probably launch and auto-connect to the local server by default.

If it doesn't, or if you are connecting from another computer, you will have to register the server in the pgAdmin tool. To do that, right click on the "Servers" node under the object explorer window and choose Register -> Server...


This will bring up a dialog box with connection settings. The first tab you see is the "General" tab.

You can give the connection a name and specify a background colour too.
The next tab is the most important one, where you specify the host address, port number and credentials.



In the above screenshot I have shown how I connected from a remote machine. I entered the host IP address and the port number, which is the default port number. Then I left the others as defaults, such as the database name (postgres) and the default user name (postgres). You can choose to save your password or enter it every time you connect.

Other tabs control advanced connection settings, such as SSL parameters, which I will not touch on in this article.

Once you are done, click the "Save" button and your server is registered and connected.


If you want to run a query in pgAdmin, you can use the built-in Query Tool. To open it, expand the "Databases" node, select your database, right click on it and choose "Query Tool".

This will open a new window/tab in the admin tool. Note that a few tabs are already open by default.



Like in SSMS, you can run queries here and see the results in the bottom pane.

There are a few other components in this admin tool, but I will cover them later in a dedicated article.











