What are SQL Server Indexes
Indexes in SQL Server servers similar purpose as they do in a book we read.
In a large book, let say a book with 300 pages, which is on technology, we want to read pages written about "Databases". If we don't have an index, we would have to read the entire book to find the pages on "Databases". But if that book contains an index, we can simply read the index and find the page numbers on "Databases" and directly turn to them and read them.
Similar way, Indexes in SQL server provide a method to retrieve data much faster way and serves your query more quickly. For example consider a scenario where SQL server is looking for all the Employees starting with Letter "D" from the Employee table. If Employee table doesn't have an index like our book had, SQL server will have to read the entire table to find employees with name starting with letter "D". But if it had an index on Employee name, it could just read the index and jump straight to employees it require directly and present the result set quickly.
Index is a data structure which is linked with a table. In a way, you can call it small copy of the table, which has data arranged in a different way than the main table. You can have multiple indexes on a single table. Each of these indexes will arrange copy of the table data in a different way, so they can satisfy different different queries.
Type of Indexes
There are several types of indexes.
You can have tables without indexes. When a table doesn't have an index, it is called a "Heap". No order to the data and they are stored as they come.
Clustered Index
You can create a "Clustered" index to a table. When you create a clustered index for a table, table (or the heap) is re-arranged in the order of the index. So basically table become the clustered index, because data in table physically re-arranged to match the index.
For example think we have an employee data stored in a table without an index (i.e. a heap). Then we create an clustered index on that table, let say on Employee Id column (EmpId).
CREATE CLUSTERED INDEX IX_EMPLOYEE_ID ON Employee(EmpId)
Then we get a table which is physically sorted on employee Id. This is a clustered index. You can only have one clustered index per table, because table data is physically structured on clustered index.
When you specify a Primary Key for a table, SQL server automatically creates an clustered index on that column(s).
Non-Clustered Index
Let say we have lot of queries that based on employee age. So we need to access age very frequently. For example we need employees who are younger than 25 years. But as it stands now, we need to read the entire table to find young employees.
In order to ease the pain SQL server having, we can create an index on Age column, but leave the physical structure of the table as it is. To do this we can use Non-clustered index.
CREATE NONCLUSTERED INDEX IX_EMPLOYEE_AGE ON Employee(Age)
This will create a another structure link to the table. It is kind of small copy of the table, but arrange in a different manner. Though we only specify "Age" column in the index, SQL server always put the primary key of the table on the index, so that it can relate to the row back in the original table.
You can have many non-clustered indexes as you like on a table. However, more indexes you have, more data SQL server will have to add/update/delete when you add/update/delete data from a table. Because SQL server have to update non-clustered indexes as well as the table data (if they are effected by update).
Unique Index
Unique index is a special type of index, it just tell SQL server, column specified cannot have duplicate values. You can have clustered and non-clustered unique indexes. By default primary key constraint create a unique clustered index and unique constraint create unique non-clustered index.
CREATE UNIQUE NONCLUSTERED INDEX IX_EMPLOYEE_EMAIL ON Employee(Email)
This tell query optimizer that data in this email column is unique (i.e no duplicate). This information is used when query optimizer come up with execution plans.
Filtered Index
Filtered index is another special type of non-clustered index, where we only store data frequently required from a column.
For example, if we have "Location" column on Employee table and most queries are based on Location = "UK" employees, we can create a non-clustered index filtered only for employees who are in UK.
CREATE NONCLUSTERED INDEX IX_EMPLOYEE_LOC_FILTERED ON Employee(Location) WHERE Location = 'UK'
Having filtered index is efficient because you don't need to store the entire value collection for that column in the index. This makes index smaller and also easier for optimizer to use.
Other type of Index
There are many other types of indexes in SQL Server. For example:
• Full Text Index
• XML Index
• Spatial Index
• Column Store Index
• Hash index (on in memory OLTP)
We are not going to talk about above in this module.
Index Storage and How they work
As mentioned before, we are mainly focused on row store indexes. Rowstore index are organized into data structure called B+ tree.
As we learned in other modules, data in SQL server are organized into 8k pages.
Let's first look at the organisation of clustered index.
Each page in B+ tree structure is called Node.
Top node of the index is called "Root Node".
Very bottom nodes of the index is called "Leaf Nodes".
There can be multiple levels of nodes between "Root Nodes" and "Leaf Nodes" and they are called intermediate levels hence nodes in these levels are called "Intermediate Nodes".
Each of these nodes are doubly linked, which means we can go back and forward from one node to other in same level.
Clustered Index Example:
Let's consider an example index on following Employee table. Let's also assume EmpId is the primary key of the table. When you define a primary key on a table, SQL server automatically, create a CLUSTERED index on the table.
Or you can create a one like below:
CREATE CLUSTERED INDEX IX_EMPLOYEE_ID ON Employee(EmpId)
Let's assume in one 8k page we can only fit 5 records. So at the leaf level, nodes are arranged in following way.
Note because nodes are doubly linked (indicated by arrows), you can traverse (read) from one node to another both forward and backward.
Then we add intermediate level of nodes:
Note that each record in intermediate level is pointer to leaf node (data page) and also intermediate pages are doubly linked, making traversing more easy.
In a complex data table you will have multiple levels of intermediate nodes.
Finally you add the root level to complete the picture:
Note that root node has pointers to intermediate level.
Index Seek Operation:
Let say you want to seek into record 26 - Ethan Hunt.
SELECT EmpId, Name FROM Employee WHERE EmpId = 26
Following picture shows how the seek path:
As you can see we first come to root, look for 26 and found pointer to first intermediate node, go to first intermediate node, found pointer to second leaf page, go to second leaf page and go to record 26.
Range scan operation:
If we want to get range of records, for example, records from 20 to 40.
SELECT EmpId, Name FROM Employee WHERE EmpId BETWEEN 20 AND 40
We will traverse index in following way:
We start from the root, find 20s in first intermediate node pointer, go to first intermediate node, find 20s in second leaf node pointer, go to second leaf node find record 21 (because no 20), and start reading from there and read on until the end of the page, then using the link to next leaf page, move to next leaf node and read on until record 38.
Non-clustered Index
Non-clustered index use the same B+ tree structure, but leaf nodes are actually in heap or in index. Let's consider a scenario where we want to create a non-clustered index on same Employee table as above. We will create a non-clustered index on "Name" field.
CREATE NONCLUSTERED INDEX IX_EMPLOYEE_NAME ON Employee(Name)
Leaf level of the non-clustered index will be like below:
Note that, now index rows are sorted on name column and have a pointer to the primary key of the clustered index (i.e. EmpId). In addition to that, non-clustered index can have included columns. For example you can include "Age" column in the index. If you include value of the age will be stored in the non-clustered index. Since non-clustered index carry less columns than the table, one page will be able to hold more rows than the clustered index. This is illustrated in above picture by showing 6 rows in single page.
In above diagram it shows, that how first two rows of the non-clustered index is pointing to its related row in the clustered index (table). For clarity we have only showed first two rows, but all rows in the non-clustered index are pointing to related row in clustered index.
Non-clustered index also have intermediate and root nodes.
Index Seek Operation with Non-clustered Index
For example, let say we want to retrieve record for "Oscar White". If we use non-clustered index, we will have to scan the entire table/index. But if we use non-clustered index on "Name" column, we can directly go to Oscar White as shown on the picture below:
To seek "Oscar White", SQL server start from root node and from root node, it figure out records starting with O are in second intermediate node. Therefore it traverse to second intermediate node. From the second intermediate node, SQL server finds out Os are third leaf node, therefore it goes there and fetch the record. If query required only name and Id, non-clustered index itself will be able to satisfy the query and operation completes. However if query require additional fields which are not in non-clustered index, SQL server can use the pointer in the non-clustered index to go to clustered index to fetch additional columns (this is called key lookup operation).
Covering Index
Covering Index is an index which has all columns that required to satisfy a query without referring to base table. Because covering index satisfy a query by itself, IO requirement to serve the query is less. If all fields to satisfy the query is not in non-clustered index, SQL server has to do key lookups and go to clustered index to pick missing fields.
For example let's consider a query like below:
SELECT EmpId, Name, Age
FROM Employee
WHERE Name = 'Hanna Lee'
This query looks for employee name "Hanna Lee" and need to find her age as well.
If we have a non-clustered index like one shown in above section (see Non-Clustered Index section), we can find the record easily by seeking to "Hanna Lee" using the index, but since index doesn't have "Age" field, SQL server need to do a key lookup (using the pointer in non-clustered index to clustered index) and get "Age" field value from clustered index.
What if we include the Age column in the index like below:
CREATE NONCLUSTERED INDEX IX_EMPLOYEE_NAME ON Employee(Name)
INCLUDE (Age)
This is a non-clustered index on Name column, but additionally it has included "Age" column. Note that "Age" column is not part of the index key. Therefore, index will not be sorted on Age column. It is just part of the index leaf/data rows.
When query runs SQL server will use our IX_EMPLOYEE_NAME index. It will do a seek on to Hanna Lee's record as shown above. Query need Age value as well, luckily "Age" value is also in the non-clustered index, so SQL server can return the result, without even touching the clustered index.
This module only teach you very basics of the indexes. We will hope to look into move advanced index related topics in next modules.
No comments:
Post a Comment