What are partitions and why use
them?
The simple answer is:
To improve the scalability and manageability of large tables and
tables that have varying access patterns.
Typically, we create tables to store information about an entity,
such as customers or sales, and each table has attributes that describe only
that entity.
While a single table for each entity is the easiest to design and
understand, these tables are not necessarily optimized for performance,
scalability, and manageability, particularly as the table grows larger.
What constitutes a large table?
While the size of a very large database (VLDB) is measured in
hundreds of gigabytes, or even terabytes, the term does not necessarily
indicate the size of individual tables within the database.
A large database is one that does not perform as desired or one in
which the operational or maintenance costs have gone beyond predefined
maintenance or budgetary requirements.
These requirements also apply to tables; a table can be considered
large if the activities of other users or maintenance operations have a
limiting effect on availability.
For example, the sales table is considered large if performance is
severely degraded or if the table is inaccessible during maintenance for two
hours each day, each week, or even each month.
In some cases, periodic downtime is acceptable, yet it can often
be avoided or minimized by better design and partitioning implementations.
While the term VLDB applies only to a database, for partitioning, it is more
important to look at table size.
In addition to size, a table can be a concern for performance and
availability when different sets of rows within it have different usage
patterns.
Although usage patterns may not always vary (and this is not a
requirement for partitioning), when usage patterns do vary, partitioning can
result in additional gains in management, performance, and availability.
Again, to use the example of a sales table, the current month's
data might be read-write, while the previous month's data (and often the larger
part of the table) is read-only. In a case like this, where data usage varies,
or in cases where the maintenance overhead is overwhelming as data moves in and
out of the table, the table's ability to respond to user requests might be
impacted. This, in turn, limits both the availability and the scalability of
the server.
Additionally, when large sets of data are being used in different
ways, frequent maintenance operations are performed on static data.
This can have costly effects, such as performance problems,
blocking problems, and backup costs (space, time, and operational costs), as
well as a negative impact on the overall scalability of the server.
How can partitioning help?
When tables and indexes become very large, partitioning can help
by dividing the data into smaller, more manageable sections.
This document focuses on horizontal partitioning, in which large
groups of rows will be stored in multiple separate partitions.
The definition of the partitioned set is customized, defined, and
managed according to your needs.
SQL Server allows us to partition our tables based on
specific data usage patterns using defined ranges or lists.
SQL Server also offers numerous options for the long-term
management of partitioned tables and indexes by the addition of features
designed around the new table and index structure.
Furthermore, if a large table exists on a system with multiple
CPUs, partitioning the table can lead to better performance through parallel
operations.
The performance of large-scale operations across extremely large
data sets (for instance, many millions of rows) can benefit from performing
multiple operations against individual subsets in parallel.
An example of performance gains over partitions can be seen in
previous releases with aggregations.
For example, instead of aggregating a single large table, SQL
Server can work on partitions independently, and then aggregate the aggregates.
In SQL Server, queries joining large datasets can benefit
directly from partitioning;
SQL Server can join the data that resides on the same
partitions first and then combine the results.
This allows SQL Server to more effectively use multiple-CPU computers.
Partitioned Tables.
While the improvements in earlier versions of SQL Server significantly
enhanced performance when using partitioned views, they did not simplify the
administration, design, or development of a partitioned data set.
When using partitioned views, all of the base tables (on which the
view is defined) must be created and managed individually.
Application design is easier and users benefit by not needing to
know which base table to directly access, but administration is complex as
there are numerous tables to manage and data integrity constraints must be
managed for each table.
Because of the management issues, partitioned views were often
used to separate tables only when data needed to be archived or loaded.
When data was moved into the read-only table or when data was
deleted from the read-only table, the operations were expensive—taking time,
log space, and often creating blocking.
Additionally, because partitioning strategies in previous versions
required the developer to create individual tables and indexes and then UNION
them through views, the optimizer was required to validate and determine plans
for each partition (as indexes could have varied).
Therefore, query optimization time in SQL Server 2000 often
went up linearly with the number of partitions processed.
In SQL Server 2005, each partition has the same indexes by
definition.
For example, consider a scenario in which the current month of On-line Transaction Processing (OLTP) data needs to be moved into an analysis
table at the end of each month.
The analysis table (to be used for read-only queries) is a single
table with one clustered index and two nonclustered indexes; the bulk load of
1 gigabyte (GB) (into the already indexed and active single table) creates
blocking with current users as the table and/or indexes become fragmented
and/or locked.
Additionally, the loading process will take a significant amount
of time, as the table and indexes must be maintained as each row comes in.
There are ways to speed up the bulk load; however, these can
directly affect all other users and sacrifice concurrency for speed.
If this data were isolated into a newly created, empty, and unindexed
(heap) table, the load could occur first and then the indexes could be created
after loading the data.
Often you will realize gains in performance of ten times or better
by using this scheme.
In fact, by loading into an unindexed table you can take advantage
of multiple CPUs by loading multiple data files in parallel or by loading
multiple chunks from the same file (defined by starting and ending row
position).
Since both operations can benefit from parallelism, this can yield
even further gains in performance.
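As a sketch of this idea (table name, file path, and row counts are
hypothetical), two sessions could load halves of the same file into an
unindexed heap, after which the indexes are built in a single pass:

```sql
-- Session 1: load the first half of the file into the heap.
-- TABLOCK allows multiple concurrent bulk loads into an unindexed heap.
BULK INSERT SalesStaging FROM 'C:\loads\sales.dat'
WITH (FIRSTROW = 1, LASTROW = 500000, TABLOCK);

-- Session 2 (run in parallel): load the remainder of the file.
BULK INSERT SalesStaging FROM 'C:\loads\sales.dat'
WITH (FIRSTROW = 500001, TABLOCK);

-- After both loads complete, build the index once.
CREATE CLUSTERED INDEX cixSalesStaging ON SalesStaging (OrderDate);
```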
In any release of SQL Server, partitioning gives you this more
granular control and does not require that all data reside in one location;
however, there are many objects to create and manage.
A functional partitioning strategy could be achieved in previous
releases by dynamically creating and dropping tables and modifying the UNION
view.
However, in SQL Server 2005 the solution is more elegant: you
can simply switch in the newly filled partition(s) as an extra partition of the
existing partition scheme and switch out any old partition(s).
From end to end, the process takes only a short period of time and
can be made more efficient by using parallel bulk loading and parallel index
creation.
More importantly, the partition is manipulated outside of the
scope of the table so there is no effect on the queried table until the
partition is added.
The result is that, typically, adding a partition takes only a few
seconds.
The performance improvement is also significant when data needs to
be removed.
If one database needs a sliding-window set of data in which new
data is migrated in (for example, the current month), and the oldest data is
removed (perhaps the same month from the previous year), the performance of
this data migration can be improved by orders of magnitude through the use of
partitioning.
While this may seem extreme, consider the difference without
partitioning; when all of the data is in a single table, deleting 1 GB of
old data requires row-by-row manipulation of the table, as well as its
associated indexes.
The process of deleting data creates a significant amount of log
activity, does not allow log truncation for the duration of the delete (note
that the delete is a single auto-commit transaction; however, you can control
the size of the transaction by performing multiple deletes where possible) and
therefore requires a potentially much larger log.
Using partitioning, however, removing that same data requires
removing the specific partition from a partitioned table (which is a metadata
operation) and then dropping or truncating the standalone table.
Moreover, without knowing how best to design partitions, one might
not be aware that using filegroups in conjunction with partitions is
ideal for implementing partitioning.
File groups allow us to place individual tables on different
physical disks.
If a single table spans multiple files (using file groups) then
the physical location of data cannot be predicted.
For systems where parallelism is not expected, SQL Server improves
performance by using all disks more evenly through file groups, making specific
placement of data less critical.
SQL Server
continues to balance allocations among all of the objects within that file group.
While SQL Server runs more effectively when using more disks for a given
operation, spreading data across more disks is less desirable from a management
or maintenance perspective, particularly when usage patterns are very
predictable (and isolated).
Since the data
does not have a specific location on disk, you don't have the ability to
isolate the data for maintenance such as backup operations.
With
partitioned tables in SQL Server, a table can be designed (using a
function and a scheme) such that all rows that have the same partitioning key
are placed directly on (and will always go to) a specific location.
The function
defines the partition boundaries, as well as the partition in which the first
value should be placed.
In the case of
a LEFT partition function, the first value will act as an upper boundary in the
first partition.
In the case of
a RIGHT partition function, the first value will act as a lower boundary in the
second partition (partition functions will be covered in more detail later in
this paper).
Once the
function is defined, a partition scheme can be created to define the physical
mapping of the partitions to their location within the database—based on a
partition function.
When multiple
tables use the same function (but not necessarily the same scheme), rows that
have the same partitioning key will be grouped similarly.
This concept is
called alignment.
By aligning
rows with the same partition key from multiple tables on the same or different
physical disks, SQL Server can, if the optimizer chooses, work with only the
necessary groups of data from each of the tables.
To achieve
alignment, two partitioned tables or indexes must have some correspondence
between their respective partitions.
They must use
equivalent partition functions with respect to the partitioning columns. Two
partition functions can be used to align data when:
- Both partition
functions use the same number of arguments and partitions.
- The partitioning key
used in each function has the same type (including length, precision, and
scale if applicable, and collation if applicable).
- The boundary values
are equivalent (including the LEFT/RIGHT boundary criteria).
Note: Even when two partition functions are designed to
align data, you could end up with an unaligned index if it is not partitioned on
the same column as the partitioned table.
Collocation is
a stronger form of alignment, where two aligned objects are joined with an
equi-join predicate where the equi-join is on the partitioning column.
This becomes
important in the context of a query, sub-query or other similar construct where
equi-join predicates may occur.
Collocation is
valuable because queries that join tables on the partition columns are
generally much faster.
Consider the Orders
and OrderDetails tables.
Instead of
filling the files proportionally, you can create a partition scheme that maps
to three file groups.
When defining
the Orders and OrderDetails tables, you define them to use the
same scheme.
Related data
that has the same value for the partition key will be placed within the same
file, isolating the necessary data for the join.
When related
rows from multiple tables are partitioned in the same manner, SQL Server can
join the partitions without having to search through an entire table or
multiple partitions (if the table were using a different partitioning function)
for matching rows.
In this case,
the objects are not only aligned because they use the same key, but they are
storage-aligned because the same data resides within the same files.
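A minimal sketch of storage alignment might look like the following, assuming
filegroups FG1 through FG3 already exist; all object names are illustrative:

```sql
-- One function and one scheme shared by both tables.
CREATE PARTITION FUNCTION pfOrderDate (datetime)
AS RANGE RIGHT FOR VALUES ('20040701', '20040801');

CREATE PARTITION SCHEME psOrderDate
AS PARTITION pfOrderDate TO ([FG1], [FG2], [FG3]);

-- Both tables use the same scheme and partitioning column,
-- so related rows land in the same filegroup.
CREATE TABLE Orders
(   OrderID   int      NOT NULL,
    OrderDate datetime NOT NULL
) ON psOrderDate (OrderDate);

CREATE TABLE OrderDetails
(   OrderID   int      NOT NULL,
    OrderDate datetime NOT NULL,
    ProductID int      NOT NULL
) ON psOrderDate (OrderDate);
```

Because both tables partition on OrderDate through the same scheme, a join on
the partitioning column can be processed partition by partition.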
Figure: Storage-aligned tables.
SQL Server
allows partitioning based on ranges, and tables and indexes can be designed to
use the same scheme for better alignment.
Good design
significantly improves overall performance, but what if the usage of the data
changes over time?
What if an
additional partition is needed?
Administrative
simplicity in adding, removing, and managing partitions
outside of the partitioned table was a major design goal for SQL
Server 2005.
SQL
Server 2005 introduced simplified partitioning with administration, development,
and usage in mind.
Some of the
performance and manageability benefits are:
- Simplify the design
and implementation of large tables that need to be partitioned for performance
or manageability purposes.
- Load data into a new
partition of an existing partitioned table with minimal disruption in data
access in the remaining partitions.
- Load data into a new
partition of an existing partitioned table with performance equal to
loading the same data into a new, empty table.
- Archive and/or remove
a portion of a partitioned table while minimally impacting access to the
remainder of the table.
- Allow partitions to be
maintained by switching partitions in and out of the partitioned table.
- Allow better scaling
and parallelism for extremely large operations over multiple related
tables.
- Improve performance
over all partitions.
- Improve query
optimization time because each partition does not need to be optimized
separately.
Definitions and Terminology.
To implement partitions in SQL Server 2005 and above, you must be
familiar with a few new concepts, terms, and syntax.
To understand these new concepts, let's first review a table's
structure with regard to creation and placement.
In previous releases, a table was always both a physical and a
logical concept, yet with SQL Server 2005 partitioned tables and indexes
you have multiple choices for how and where you store a table.
In SQL Server 2005, tables and indexes can be created with
the same syntax as previous releases—as a single tabular structure that is
placed on the DEFAULT file group or a user-defined file group.
Additionally, in SQL Server 2005, tables and indexes can be created
on a partitioning scheme.
The partitioning scheme maps the object to one or more filegroups.
To determine which data goes to the appropriate physical
location(s), the partitioning scheme uses a partitioning function.
The partitioning function defines the algorithm to be used to
direct the rows, and the scheme associates the partitions with their appropriate
physical location (that is, a filegroup).
In other words, the table is still a logical concept but its
physical placement on disk can be radically different from earlier releases;
the table can have a scheme.
Range Partitions.
Range partitions are table partitions that are defined by specific and
customizable ranges of data.
The range partition boundaries are chosen by the developer, and
can be changed as data usage patterns change.
Typically, these ranges are date-based or based on ordered
groupings of data.
The primary use of range partitions is for data archiving,
decision support (when often only specific ranges of data are necessary, such
as a given month or quarter), and for combined OLTP and Decision Support
Systems (DSS) where data usage varies over the life cycle of a row.
The biggest benefit to a SQL Server 2005 partitioned table and
index is the ability to manipulate very specific ranges of data, especially
related to archiving and maintenance.
With range partitions, old data can be archived and replaced very
quickly.
Range partitions are best suited when data access is typically for
decision support over large ranges of data.
In this case, it matters where the data is specifically located so
that only the appropriate partitions are accessed, when necessary.
Additionally, as transactional data becomes available you can add
the data easily and quickly.
Range partitions are initially more complex to define, as you will
need to define the boundary conditions for each of the partitions.
Additionally, you will create a scheme to map each partition to
one or more filegroups.
However, they often follow a consistent pattern so once defined,
they will likely be easy to maintain programmatically.
Defining the Partitioning Key.
The first step in partitioning tables and indexes is to define the
data on which the partition is keyed.
The partition key must exist as a single column in the table and
must meet certain criteria.
The partition function defines the data type on which the key
(also known as the logical separation of data) is based.
The function defines this key but not the physical placement of
the data on disk. The placement of data is determined by the partition scheme.
In other words, the scheme maps the data to one or more file groups
that map the data to specific file(s) and therefore disks.
The scheme always uses a function to do this: if the function
defines five partitions, then the scheme must use five file groups.
The file groups do not need to be different; however, you will get
better performance when you have multiple disks and, preferably, multiple CPUs.
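For example (hypothetical names), a function with four boundary values defines
five partitions, so the scheme must supply five filegroup slots; the slots need
not be distinct:

```sql
CREATE PARTITION FUNCTION pfFiveParts (int)
AS RANGE LEFT FOR VALUES (10, 20, 30, 40);

-- Five partitions mapped to four filegroups (FG1 is reused).
CREATE PARTITION SCHEME psFiveParts
AS PARTITION pfFiveParts TO ([FG1], [FG1], [FG2], [FG3], [FG4]);

-- Alternatively, map every partition to a single filegroup with ALL:
CREATE PARTITION SCHEME psOnePlace
AS PARTITION pfFiveParts ALL TO ([PRIMARY]);
```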
When the scheme is used with a table, you will define the column
that is used as an argument for the partition function.
For range partitions, the dataset is divided by a logical and
data-driven boundary.
In fact, the data partitions may not be truly balanced.
Data usage dictates a range partition when the table is used in a
pattern that defines specific boundaries of analysis (also known as ranges).
The partition key for a range function can consist of only one
column, and the partitioning function will include the entire domain, even when
that data may not exist within the table (due to data integrity/constraints).
In other words, boundaries are defined for each partition but the
first partition and the last partition will, potentially, include rows for the
extreme left (values less than the lowest boundary condition) and for the
extreme right (values greater than the highest boundary condition).
Therefore, to restrict the domain of values to a specific dataset,
partitions must be combined with CHECK constraints.
Using check constraints to enforce your business rules and data
integrity constraints allows you to restrict the dataset to a finite range
rather than infinite range.
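For instance, a constraint like the following (table and column names are
illustrative) closes the otherwise open-ended first and last partitions:

```sql
-- Restrict OrderDate to one year of data so the extreme-left and
-- extreme-right partitions can never receive rows.
ALTER TABLE OrdersRange
ADD CONSTRAINT ckOrderDateRange
CHECK (OrderDate >= '20030901' AND OrderDate < '20040901');
```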
Range partitions are ideal when maintenance and administration
requires archiving large ranges of data on a regular basis and when queries
access a large amount of data that is within a subset of the ranges.
Index Partitioning.
In addition to partitioning a table's dataset, you can partition
indexes.
Partitioning both the table and its indexes using the same
function often optimizes performance.
When the indexes and the table use the same partitioning function
and columns in the same order, the table and index are aligned.
If an index is created on an already partitioned table, SQL Server
automatically aligns the new index with the table's partitioning scheme unless
the index is explicitly partitioned differently.
When a table and its indexes are aligned, then SQL Server can move
partitions in and out of partitioned tables more effectively, because all
related data and indexes are divided with the same algorithm.
When the tables and indexes are defined with not only the same
partition function but also the same partition scheme, they are considered to
be storage-aligned.
One benefit to storage alignment is that all data within the same
boundary is located on the same physical disk(s).
In this case, backups can be isolated to a certain time-frame and
your strategies can vary, in terms of frequency and backup type, based on the
volatility of the data.
Additional gains can be seen when tables and indexes on the same
file or file group are joined or aggregated.
SQL Server benefits from parallelizing an operation across
partitions.
In the case of storage alignment and multiple CPUs, each processor
can work directly on a specific file or file group with no conflicts in data
access because all required data is on the same disk.
This allows more processes to run in parallel without
interruption.
Special Conditions for Partitions:
Split, Merge, and Switch.
Several new features and concepts related to partition management
aid in the use of partitioned tables.
Because partitions are used for large tables that scale, the
number of partitions chosen when the partition function was created may need
to change over time.
You can use the ALTER PARTITION FUNCTION statement with the new split option to
add another partition to the table.
When a partition is split, data can be moved to the new partition;
but to preserve performance, rows should not move.
This scenario is described in the case study later in this document.
Conversely, to remove a partition, perform a switch-out for the
data and then merge the boundary point.
In the case of range partitions, a merge request is made by
stating which boundary point should be removed.
Where only a specific period of data is needed, and data archiving
is occurring on a regular basis (for example, monthly), you might want to
archive one partition of data (the earliest month) when the data for the
current month becomes available.
For example, you might choose to have one year of data available,
and at the end each month you switch in the current month and then switch out
the earliest month, differentiating between the current month's read/write OLTP
versus the previous month's read-only data.
There is a specific flow of actions that makes the process most
efficient, as illustrated by the following scenario.
You are keeping a year's worth of read-only data available.
Currently, the table holds data from September 2003 through
August 2004.
The current month of September 2004 is in another database,
optimized for OLTP performance.
In the read-only version of the table, there are 13 partitions:
twelve partitions which contain data (September 2003 through August 2004) and
one final partition that is empty.
This last partition is empty because a range partition always
includes the entire domain—both the extreme left as well as the extreme right.
And if you plan to manage data in a sliding-window scenario,
you'll always want to have an empty partition to split into which new data will
be placed.
In a partition function defined with LEFT boundary points, the
empty partition will logically exist to the far RIGHT.
Leaving a partition empty at the end will allow you to split the
empty partition (for the new data coming in) and not need to move rows from the
last partition (because none exist) to the new file group that's being added
(when the partition is split to include another chunk of data).
This is a fairly complex concept that will be discussed in greater
detail in the case study later in the paper but the idea is that all data
additions or deletions should be metadata-only operations.
To ensure that metadata-only operations occur, you will want to
strategically manage the changing part of the table.
To ensure that this partition is empty, you will use a check
constraint to restrict this data in the base table.
In this case, the OrderDate
should be on or after September 1, 2003 and before September 1, 2004.
If the last defined boundary point is August 31 at 11:59:59.997
(more on why 997 is coming up), then the combination of the partition function
and this constraint will keep the last partition empty.
While these are only concepts, it is important to know that split
and merge are handled through ALTER PARTITION FUNCTION and switch is handled
through ALTER TABLE.
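In outline, the three operations look like this (function, scheme, and table
names are hypothetical):

```sql
-- Split: add a boundary point, creating a new partition.
ALTER PARTITION FUNCTION pfOrderDate () SPLIT RANGE ('20041001');

-- Merge: remove a boundary point, eliminating a partition.
ALTER PARTITION FUNCTION pfOrderDate () MERGE RANGE ('20030901');

-- Switch: swap a staging table in as a partition (a metadata-only
-- operation when table, indexes, and filegroup line up).
ALTER TABLE OrdersStaging SWITCH TO OrdersRange PARTITION 12;
```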
Figure: Range partition boundaries before data load/archive.
When October begins (in the OLTP database), September's data
should move to the partitioned table, which is used for analysis.
The process of switching the tables in and out is a very fast
process, and the preparation work can be performed outside of the partitioned
table.
This scenario is explained in depth in the case study later in this
document, but the idea is that you will use "staging tables" that
eventually become partitions within the partitioned table.
In this process, you will switch out a partition of a table to a
nonpartitioned table within the same filegroup.
Because the nonpartitioned table already exists within the same
filegroup (and this is critical for success), SQL Server can make this switch
as a metadata change.
As a metadata-only change, this can occur within seconds as
opposed to running a delete that might take hours and create blocking in large
tables.
Once this partition is switched out, you will still have
13 partitions; the first (oldest) partition is now empty and the last (most
recent, also empty) partition needs to be split.
Figure: Switching a partition out.
To remove the oldest partition (September 2003), use the new merge
option with ALTER PARTITION FUNCTION.
Merging a boundary point effectively removes a boundary point, and
therefore a partition.
This reduces the number of partitions into which data loads to n-1 (in this case, 12).
Merging a partition should be a very fast operation when no rows
have to be moved (because the boundary point being merged has no data rows).
In this case, because the first partition is empty, none of the
rows need to move from the first to the second partition.
If the first partition is not empty and the boundary point is
merged, then rows have to move from the first to the second partition, which
can be a very expensive operation.
However, this is avoided in the most common sliding-window
scenario where an empty partition is merged with an active partition, and no
rows move.
Figure: Merging a partition.
Finally, the new table has to be switched in to the partitioned
table.
In order for this to be performed as a metadata change, loading
and building indexes must happen in a new table, outside of the bounds of the
partitioned table.
To switch in the partition, first split the last, most recent, and
empty range into two partitions.
Additionally, you need to update the table's constraint to allow
the new range.
Once again, the partitioned table will have 13 partitions.
In the sliding window scenario, the last partition with a LEFT
partition function will always remain empty.
Figure: Splitting a partition.
Finally, the newly loaded data is ready to be switched in to the
twelfth partition, September 2004.
Figure: Range partition boundaries after data load/archive.
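The monthly cycle described above can be sketched end to end as follows. All
names are hypothetical, and month-start boundary values with a RANGE RIGHT
function are shown for readability; the LEFT/23:59:59.997 variant used in this
paper works analogously:

```sql
-- 1. Switch the oldest partition out to a staging table
--    that resides on the same filegroup.
ALTER TABLE OrdersRange SWITCH PARTITION 1 TO OrdersOldStaging;

-- 2. Merge away the boundary point freed by the switch-out;
--    no rows move because the partition is now empty.
ALTER PARTITION FUNCTION pfOrderDate () MERGE RANGE ('20031001');

-- 3. Name the filegroup the next partition will use, then split
--    the empty right-most range for the incoming month.
ALTER PARTITION SCHEME psOrderDate NEXT USED [FG200409];
ALTER PARTITION FUNCTION pfOrderDate () SPLIT RANGE ('20041001');

-- 4. Switch the loaded, indexed staging table in as the new partition.
ALTER TABLE OrdersNewStaging SWITCH TO OrdersRange PARTITION 12;
```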
Because only one partition can be added or removed at a time, tables
that need to have more than one partition added or removed should be recreated.
To change to this new partitioning structure, first create the new
partitioned table and then load the data into the newly created table.
This is a more optimal approach than rebalancing the whole table
for each split.
This process is accomplished by using a new partition function, a
new partition scheme, and then by moving the data to the newly partitioned
table.
To move the data, first copy it using INSERT INTO newtable SELECT columnlist
FROM oldtable, and then drop the original tables.
Prevent user modifications while this process is running to help
ensure against data loss.
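A hedged sketch of that rebuild, with all object names illustrative:

```sql
-- Create the new partitioning objects and the new table.
CREATE PARTITION FUNCTION pfNew (datetime)
AS RANGE RIGHT FOR VALUES ('20040101', '20040201', '20040301');

CREATE PARTITION SCHEME psNew
AS PARTITION pfNew ALL TO ([PRIMARY]);

CREATE TABLE OrdersNew
(   OrderID   int      NOT NULL,
    OrderDate datetime NOT NULL
) ON psNew (OrderDate);

-- Copy the data, then drop the original table.
INSERT INTO OrdersNew (OrderID, OrderDate)
SELECT OrderID, OrderDate FROM OrdersOld;

DROP TABLE OrdersOld;
```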
Steps for Creating Partitioned
Tables.
Now that you have an understanding of the value of partitioned tables,
the next section details the process of implementing a partitioned table, and
the features that contribute to this process.
Determine If Object Should Be
Partitioned.
While partitioning can offer great benefits, it adds
administrative overhead and complexity to the implementation of your objects,
which can be more of a burden than a gain.
Specifically, you might not want to partition a small table, or a
table that currently meets performance and maintenance requirements.
The sales scenario mentioned earlier uses partitioning to relieve the
burden of moving rows and data—you should consider whether your scenario has
this sort of burden when deciding whether to implement partitioning.
Determine Partitioning Key and
Number of Partitions.
If you are trying to improve performance and manageability for
large subsets of data and there are defined access patterns, range partitioning
can alleviate contention and reduce maintenance on read-only data that does
not require it.
To determine the number of partitions, you should evaluate whether
or not logical groupings and patterns exist within your data.
If you often work with only a few of these defined subsets at a
time, then define your ranges so the queries are isolated to work with only the
appropriate data (that is, only the specific partition).
Determine If Multiple Filegroups
Should Be Used.
To help optimize performance and maintenance, you should use file groups
to separate your data.
The number of file groups is partially dictated by hardware
resources: it is generally best to have the same number of file groups as partitions,
and these file groups often reside on different disks.
However, this pertains primarily to systems where analysis
tends to be performed over the entire dataset.
When you have multiple CPUs, SQL Server can process multiple
partitions in parallel and therefore significantly reduce the overall
processing time of large complex reports and analysis.
In this case, you can have the benefit of parallel processing as
well as switching partitions in and out of the partitioned table.
Create Filegroups.
If you want to place a partitioned table on multiple files for
better I/O balancing, you will need to create at least one file group.
File groups can consist of one or more files, and each partition
must map to a file group.
A single file group can be used for multiple partitions but for
better data management, such as for more granular backup control, you should
design your partitioned tables so that only related or logically grouped data resides
on the same file group.
Using ALTER DATABASE, you can add a logical file group name, and
then add files.
To create a file group named 2003Q3 for the AdventureWorks database,
use ALTER DATABASE in the following way:
ALTER DATABASE AdventureWorks ADD FILEGROUP [2003Q3]
Once a filegroup exists, you use ALTER DATABASE to add files to
the filegroup.
ALTER DATABASE AdventureWorks
ADD FILE
(NAME = N'2003Q3',
FILENAME = N'C:\AdventureWorks\2003Q3.ndf',
SIZE = 5MB,
MAXSIZE = 100MB,
FILEGROWTH = 5MB)
TO FILEGROUP [2003Q3]
A table can be created on a filegroup (and thereby on its file or
files) by specifying the filegroup in the ON clause of CREATE TABLE.
However, a table cannot be created on multiple filegroups unless
it is partitioned.
To create a table on a single filegroup, use the ON clause of
CREATE TABLE.
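For instance, to place a nonpartitioned table on the 2003Q3 filegroup created
above (table and column names are illustrative):

```sql
CREATE TABLE OrdersQ3
(   OrderID   int      NOT NULL,
    OrderDate datetime NOT NULL
) ON [2003Q3];
```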
To create a partitioned table, you must first have a functional
mechanism for the partitioning.
The criteria on which you partition are logically separated from
the table in the form of a partition function.
This partition function will exist as a separate definition from
the table and this physical separation helps because multiple objects can use
the partition function.
Therefore,
the first step in partitioning a table is to create the partition function.
Create the Partition Function for
a Range Partition.
Range partitions must be defined with boundary conditions.
Moreover, no values, from either end of the range, are
eliminated, even if the table is restricted through a CHECK constraint.
To allow for periodically switching data into the table, you'll
need a final, empty partition.
In a range partition, first define the boundary points: for five
partitions, define four boundary point values and specify whether each value is
an upper boundary of the first (LEFT) or the lower boundary of the second
(RIGHT) partition.
Based on the left or right designation, you will always have one
partition that is empty, as the partition will not have an explicitly defined
boundary point.
Specifically, if the first value (or boundary condition) of a
partition function is '20001001', then the values within the bordering
partitions will be:
For LEFT:
    first partition is all data <= '20001001'
    second partition is all data > '20001001'
For RIGHT:
    first partition is all data < '20001001'
    second partition is all data >= '20001001'
Since range partitions are likely to be defined on datetime data, you
must be aware of the implications: a datetime value always includes
both a date and a time.
A date with no explicitly defined time implies a "0" time
of 12:00 A.M. (00:00:00.000).
If LEFT is used with this type of data, then the data with a date
of Oct 1, 12:00 A.M. will end up in the first partition and the rest of
October's data in the second partition.
Logically, it is best to use beginning values with RIGHT and ending
values with LEFT.
These three clauses create logically identical partitioning
structures:
RANGE LEFT FOR VALUES ('20000930 23:59:59.997',
'20001231 23:59:59.997',
'20010331 23:59:59.997',
'20010630 23:59:59.997')
OR
RANGE RIGHT FOR VALUES ('20001001 00:00:00.000',
'20010101 00:00:00.000',
'20010401 00:00:00.000',
'20010701 00:00:00.000')
OR
RANGE RIGHT FOR VALUES ('20001001', '20010101', '20010401', '20010701')
Note Using the datetime data type does add a bit of
complexity here, but you need to make sure you set up the correct boundary
cases.
Notice the simplicity with RIGHT because the
default time is 12:00:00.000 A.M.
For LEFT, the added complexity is due to the
precision of the datetime data type.
The reason that 23:59:59.997 MUST be chosen is that
datetime data does not guarantee precision to the millisecond;
instead, datetime data is precise to within
3.33 milliseconds.
The exact time tick 23:59:59.999
is not available, and that value is instead rounded to the nearest time tick, which
is 12:00:00.000 A.M. of the following day. With this rounding, the boundaries
would not be defined properly.
For datetime data, you must use caution with
explicitly supplied millisecond values.
Note Partitioning functions also
allow functions as part of the partition function definition.
You may use DATEADD(ms,-3,'20010101') instead of explicitly defining the time using
'20001231 23:59:59.997'.
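As a sketch of that note, the LEFT boundary list above could be rewritten with DATEADD computing each boundary (the function name OrderDateRangePFN2 is an assumption, used here only to avoid clashing with the function defined below):

```sql
-- Equivalent LEFT boundaries computed with DATEADD:
-- 3 ms before midnight is the last valid datetime tick of the day
CREATE PARTITION FUNCTION OrderDateRangePFN2(datetime)
AS
RANGE LEFT FOR VALUES (DATEADD(ms, -3, '20001001'),
                       DATEADD(ms, -3, '20010101'),
                       DATEADD(ms, -3, '20010401'),
                       DATEADD(ms, -3, '20010701'))
```

This avoids hard-coding the .997 millisecond values while producing the same boundary points.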
To store one-fourth of the Orders
data in the four active partitions, each representing one calendar quarter, and
create a fifth partition for later use (again, as a placeholder for sliding
data in and out of the partitioned table), use a LEFT partition function with
four boundary conditions:
CREATE PARTITION FUNCTION OrderDateRangePFN(datetime)
AS
RANGE LEFT FOR VALUES ('20000930 23:59:59.997',
'20001231 23:59:59.997',
'20010331 23:59:59.997',
'20010630 23:59:59.997')
Remember, four
defined boundary points create five partitions. Review the data sets created
by this partition function as follows:
Boundary point '20000930 23:59:59.997' as LEFT
(sets the pattern):
The leftmost partition will include all values
<= '20000930 23:59:59.997'
Boundary point '20001231 23:59:59.997':
The second partition will include all values >
'20000930 23:59:59.997' but <= '20001231 23:59:59.997'
Boundary point '20010331 23:59:59.997':
The third partition will include all values >
'20001231 23:59:59.997' but <= '20010331 23:59:59.997'
Boundary point '20010630 23:59:59.997':
The fourth partition will include all values >
'20010331 23:59:59.997' but <= '20010630 23:59:59.997'
Finally, a fifth partition will include all values
> '20010630 23:59:59.997'.
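Once the function exists, you can sanity-check these sets with the $PARTITION function, which returns the 1-based partition number that a given value maps to (assuming OrderDateRangePFN has been created as above):

```sql
-- Verify which partition each test value falls into
SELECT $PARTITION.OrderDateRangePFN('20000930 23:59:59.997') AS P1,  -- boundary value: partition 1 (LEFT)
       $PARTITION.OrderDateRangePFN('20001001') AS P2,               -- partition 2
       $PARTITION.OrderDateRangePFN('20010701') AS P5                -- past last boundary: partition 5
```

Because the function is RANGE LEFT, each boundary value itself belongs to the partition on its left.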
Create the Partition Scheme.
Once you have created a partition function, you must associate it
with a partition scheme to direct the partitions to specific filegroups.
When you define a partition scheme, you must make sure to name a
filegroup for every partition, even if multiple partitions will reside on the
same filegroup.
For the range partition created previously (OrderDateRangePFN),
there are five partitions; the last, and empty, partition will be created in
the PRIMARY filegroup.
There is no need for a special location for this partition because
it will never contain data.
CREATE PARTITION SCHEME OrderDatePScheme
AS
PARTITION OrderDateRangePFN
TO ([2000Q3], [2000Q4], [2001Q1], [2001Q2], [PRIMARY])
Note If
all partitions will reside in the same filegroup, then a simpler syntax can be
used as follows:
CREATE PARTITION SCHEME OrderDatePScheme
AS
PARTITION OrderDateRangePFN
ALL TO ([PRIMARY])
Create the Partitioned Table.
With the partition function (the logical structure) and the
partition scheme (the physical structure) defined, the table can be created to
take advantage of them.
The table defines which scheme should be used, and the scheme
defines the function.
To tie all three together, you must specify the column to which
the partitioning function should apply.
Range partitions always map to exactly one column of the table,
and its datatype should match that of the boundary conditions defined within the
partition function.
Additionally, if the table should limit the data set to a specific range
(rather than spanning from negative infinity to positive infinity), then a CHECK constraint
should be added as well.
CREATE TABLE [dbo].[OrdersRange]
(
[PurchaseOrderID] [int] NOT NULL,
[EmployeeID] [int] NULL,
[VendorID] [int] NULL,
[TaxAmt] [money] NULL,
[Freight] [money] NULL,
[SubTotal] [money] NULL,
[Status] [tinyint] NOT NULL ,
[RevisionNumber] [tinyint] NULL ,
[ModifiedDate] [datetime] NULL ,
[ShipMethodID] [tinyint] NULL,
[ShipDate] [datetime] NOT NULL,
[OrderDate] [datetime] NOT NULL
CONSTRAINT OrdersRangeYear
CHECK ([OrderDate] >= '20030701'
AND [OrderDate] <= '20040630 23:59:59.997'),
[TotalDue] [money] NULL
)
ON OrderDatePScheme (OrderDate)
GO
Create Indexes: Partitioned or
Not?
By default, indexes created on a partitioned table will also use
the same partitioning scheme and partitioning column.
When this is true, the index is aligned with the table.
Although not required, aligning a table and its indexes allows for
easier management and administration, particularly with the sliding-window
scenario.
For example, to create unique indexes, the partitioning column
must be one of the key columns; this ensures that the
appropriate partition is verified to guarantee uniqueness.
Therefore, if you need to partition a table on one column, and you
have to create a unique index on a different column, then they cannot be aligned.
In this case, the index might be partitioned on the unique column
(if this is a multi-column unique key, then it could be any of the key columns)
or it might not be partitioned at all.
Be aware that such an index has to be dropped and re-created when
switching data in and out of the partitioned table.
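As a sketch, a nonaligned unique index on the OrdersRange table could be placed on a single filegroup rather than on the partition scheme (the index name and the choice of PRIMARY are assumptions for illustration):

```sql
-- Nonaligned unique index: stored on one filegroup, not on the partition scheme.
-- It must be dropped before switching partitions in or out, then re-created.
CREATE UNIQUE NONCLUSTERED INDEX OrdersRangeUQPurchaseOrderID
ON [dbo].[OrdersRange] ([PurchaseOrderID])
ON [PRIMARY]
```

Because the index does not include the partitioning column, it cannot be aligned with the table, and partition switching requires dropping it first.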
Note If
you plan to load a table with existing data and add indexes to it immediately,
you can often get better performance by loading into a nonpartitioned,
unindexed table and then creating the indexes to partition the data after the
load.
By defining a clustered index on a
partition scheme, you will effectively partition the table after the load.
This is also a great way of partitioning an
existing table.
To create the same table as a
nonpartitioned table, and create the clustered index as a partitioned clustered
index, replace the ON clause in the create table with a single filegroup
destination.
Then, create the clustered index on the
partition scheme after the data is loaded.
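Sketching this load-then-partition approach: create the table on a single filegroup, load the data, and then partition the table by building the clustered index on the partition scheme (the index name is an assumption; this presumes the table was created with ON [PRIMARY] instead of ON OrderDatePScheme):

```sql
-- After loading data into the nonpartitioned table, creating the clustered
-- index on the partition scheme moves the data into the partitions.
CREATE CLUSTERED INDEX OrdersRangeCL
ON [dbo].[OrdersRange] ([OrderDate])
ON OrderDatePScheme ([OrderDate])
```

The same technique partitions an existing populated table: the data is physically redistributed across the scheme's filegroups as the clustered index is built.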