September
2004 Issue: 33
Journal of Conceptual Modeling
www.inconcept.com/jcm
Calculating Deciles with Microsoft SQL Server
without Using Temporary Tables
According to the American Heritage® Dictionary a decile is, "Any one of the numbers or values in a series dividing the distribution of the individuals in the series into ten groups of equal frequency." (Houghton Mifflin Company, 1996). The idea behind deciles, quintiles, and quartiles is to statistically group data into a fixed number of logical categories. The first decile is the value that is greater than a tenth of the data points. (Hand, Mannila, and Smyth, 2001, p. 56).
The usual process of calculating a decile involves creating a temporary table. The first step in creating a decile is to generate a score for each data point, for example, the revenue per customer. The conventional approach is to create a temporary table of the data with an additional column holding the score. This can be achieved with Microsoft SQL server using an Identity column and the "into" clause on a select. Consider a simple table named Customers containing two fields: CustNumber and Dollars. The SQL below generates a temporary table with the rank of each record.
select
CustNumber,
Dollars,
Rank=IDENTITY(int, 1, 1)
into
#RankedCustomers
from Customers
order by Dollars
The use of temporary tables poses a problem with stateless connection methods like SOAP. The order by clause of the SQL does not fully define the score. For example, if two customers have the same Dollars amount the order is dependant on the structure of the database and the SQL implementation.
One approach to segmenting the data is to break the customers into ten groups based on the dollars each customer spends. That is, group the customers into ten approximately equal size categories based on the revenue associated with each customer. This simple idea is often difficult to implement in a SQL database. This approach makes two important assumptions about the values in the dollar field. The first assumption is that the dollar field has varies in value sufficiently to organize the customers into ten groups. The second assumption is that the number of distinct dollar values is significantly smaller than the number of customers. The
select
CustNumber,
Dollars,
Rank=IDENTITY(int, 1, 1)
into
#RankedCustomers
from
Customers
order by
Dollars
SQL query below tests the possible problem conditions.
select
case
when DistinctDollars < 10 then
'Too few distinct dollar values'
when NumCust < 10 then
'Too few customers'
when (DistinctDollars * 10) >= NumCust then
'Too many distinct dollar values'
else
'Deciles appear to be possible'
end as DecilesPossible
from
(select count(CustNumber) as NumCust,
count(distinct Dollars) as DistinctDollars
from
Customers) as q1
The query uses a nested select to count the number of customers and the number of distinct dollar values.
If the number of distinct dollar values is too few, then scoring the data by the dollar value will produce meaningless results. For example, consider the trivial case where all the dollar values are equal. Using a deciles method requires variation in the scoring attribute. The other extreme is where every record has a unique dollar value. This method will work but be inefficient in calculating the deciles. The development of this used a database with approximately 200,000 records that were non-null and non-zero with 11,000 unique dollar values.
The first step is to determine the number of customers for each distinct dollar value. The SQL statement calculates that.
select Dollars, count(CustNumber) as NumCust
from Customers
group by Dollars
It is common to exclude records from reports using deciles. A common example is excluding records with null or zero dollar values. This complicates the query a little:
select Dollars, count(CustNumber) as NumCust
from Customers
where Dollars is not null and Dollars > 0
group by Dollars
The table below shows a portion of the returned dataset from the query above.
Dollars NumCust $58.91
2
$58.94
5
$58.95
5
$58.98
1
$59.00
6,960
$59.01
1
$59.07
3
$59.08
1
$59.09
1
Notice that the distribution of dollars is not symmetric. Some values occur much more often than others do.
The next step is to calculate the cumulative number of people for each dollar value. That is, the dataset must contain the number of customers that spent equal to or less than the dollar value specified. Transact-SQL supports joining two sub-queries together. The cumulated people can be calculated by the following query:
select q1.Dollars, q1.NumCust, sum(q2.NumCust) as CumPeople
from
(select Dollars, count(CustNumber) as NumCust
from Customers
where Dollars is not null and Dollars > 0
group by Dollars ) as q1
join
(select Dollars, count(CustNumber) as NumCust
from Customers
where Dollars is not null and Dollars > 0
group by Dollars ) as q2 on (q2.Dollars <= q1.Dollars)
group by q1.Dollars, q1.NumCust
Although the query looks complicated, it is just two copies of the previous query joined, the aggregated to calculate the cumulated people. There are two key pieces. The first is the join. The second query is joined to the first where the dollar value is less than or equal to the dollar value in the first query. The join allows summing all the calculated customer count fields. This provides a cumulative number. The table below shows part of the returned result set.
Dollars NumCust CumPeople $58.91
2
45,582
$58.94
5
45,587
$58.95
5
45,592
$58.98
1
45,593
$59.00
6,960
52,553
$59.01
1
52,554
$59.07
3
52,557
$59.08
1
52,558
$59.09
1
52,559
The next step is to calculate the rank for each distinct dollar value. Dividing the cumulative people by the total number of people produces a number between 0 and 1 representing the rank. Multiplying the number by ten, and then rounding up generates a rank value from one to ten. The ceiling function rounds values up. The query to do this calculation uses the query above as a sub-query.
select Dollars,
NumCust, CumPeople,
Ceiling((10.0 * CumPeople) /
(select count(*) from Customers where
Dollars is not null and Dollars > 0)) as Rank
from
(select q1.Dollars, q1.NumCust, sum(q2.NumCust) as CumPeople
from
(select Dollars, count(CustNumber) as NumCust
from
Customers
where
Dollars is not null and Dollars > 0
group by Dollars ) as q1
join
(select Dollars, count(CustNumber) as NumCust
from Customers
where Dollars is not null and Dollars > 0
group by Dollars) as q2 on (q2.Dollars <= q1.Dollars)
group by q1.Dollars, q1.NumCust ) as q3
A sample of the result set from the above query is below:
Dollars NumCust CumPeople Rank $58.91
2
45,582
3
$58.94
5
45,587
3
$58.95
5
45,592
3
$58.98
1
45,593
3
$59.00
6,960
52,553
3
$59.01
1
52,554
3
$59.07
3
52,557
3
$59.08
1
52,558
3
$59.09
1
52,559
3
The last query calculated all the intermediate values needed to complete the deciles report. The final query joins back to the original customer table to and uses the rank to group the values into their prospective deciles. The query calculates the number of people in the grouping, the percentage of people in each group, the revenue for the group, the percentage of the total revenue, and the average revenue. The final query look complicated but is built from the previous queries.
select q4.rank, count(cs.CustNumber) as NumCust, ((1.0 *
count(cs.CustNumber)) /
(select count(*) from Customers where Dollars is not null
and Dollars > 0)) as PercentCustomers,
sum(cs.Dollars) as Revenue,
(sum(cs.Dollars) /
(select sum(Dollars) from Customers where Dollars is not null
and Dollars > 0)) as PercentRevenue,
avg(cs.Dollars) as AverageRevenue
from Customers cs
join
(select Dollars, NumCust, CumPeople, Ceiling((10.0 * CumPeople) /
(select count(*) from Customers where Dollars is not null and
Dollars > 0)) as Rank
from
(select q1.Dollars, q1.NumCust, sum(q2.NumCust) as CumPeople
from (select Dollars, count(CustNumber) as NumCust
from Customers
where Dollars is not null and Dollars
> 0
group by Dollars ) as q1
join
(select Dollars, count(CustNumber) as NumCust
from Customers
where Dollars is not null and Dollars
> 0
group by Dollars) as q2 on
(q2.Dollars <= q1.Dollars)
group by q1.Dollars, q1.NumCust ) as q3 ) as q4
on (q4.Dollars = cs.Dollars) group by q4.rank
Below is the result from this query.
| Rank |
Number of Customers |
Percent of Customers |
Revenue |
Percent of Revenue |
Average Revenue |
|
1 |
22,217 |
9.80% |
$543,179.14 |
1.71% |
$24.45 |
|
2 |
22,252 |
9.81% |
$1,111,423.73 |
3.50% |
$49.95 |
|
3 |
23,530 |
10.38% |
$1,558,090.81 |
4.91% |
$66.22 |
|
4 |
16,531 |
7.29% |
$1,303,592.34 |
4.11% |
$78.86 |
|
5 |
17,052 |
7.52% |
$1,507,861.60 |
4.75% |
$88.43 |
|
6 |
34,420 |
15.18% |
$3,853,860.91 |
12.15% |
$111.97 |
|
7 |
22,716 |
10.02% |
$3,387,401.26 |
10.68% |
$149.12 |
|
8 |
22,687 |
10.00% |
$4,190,138.33 |
13.21% |
$184.69 |
|
9 |
22,670 |
10.00% |
$5,561,674.85 |
17.53% |
$245.33 |
|
10 |
22,682 |
10.00% |
$8,699,866.44 |
27.42% |
$383.56 |
Notice that the difference in the number of customers in each category. One of the assumptions in the process is that the dollar field has varies in value sufficiently to organize the customers into ten groups. In the sample data used in the example, the assumption was false. Below is a chart showing the revenue distribution of the data:

There is a huge spike at $149.00 of 18,357 people. This spike makes it impossible to segregate the data into uniform groups using the dollar value alone.
Calculating cumulative percentages is possible in SQL, but the query is moderately complicated as is. It is much easier to calculate the cumulative percentages using Excel. Below is the result modified using Excel to generate the cumulative percentages.
| Rank |
Number of Customers |
Percent of Customers |
Cumulative Customer Percent |
Revenue |
Percent of Revenue |
Cumulative Revenue Percent |
Average Revenue |
|
1 |
22,217 |
9.80% |
9.80% |
$543,179.14 |
1.71% |
1.71% |
$24.45 |
|
2 |
22,252 |
9.81% |
19.61% |
$1,111,423.73 |
3.50% |
5.21% |
$49.95 |
|
3 |
23,530 |
10.38% |
29.99% |
$1,558,090.81 |
4.91% |
10.12% |
$66.22 |
|
4 |
16,531 |
7.29% |
37.28% |
$1,303,592.34 |
4.11% |
14.23% |
$78.86 |
|
5 |
17,052 |
7.52% |
44.80% |
$1,507,861.60 |
4.75% |
18.98% |
$88.43 |
|
6 |
34,420 |
15.18% |
59.98% |
$3,853,860.91 |
12.15% |
31.13% |
$111.97 |
|
7 |
22,716 |
10.02% |
69.99% |
$3,387,401.26 |
10.68% |
41.81% |
$149.12 |
|
8 |
22,687 |
10.00% |
80.00% |
$4,190,138.33 |
13.21% |
55.02% |
$184.69 |
|
9 |
22,670 |
10.00% |
90.00% |
$5,561,674.85 |
17.53% |
72.55% |
$245.33 |
|
10 |
22,682 |
10.00% |
100.00% |
$8,699,866.44 |
27.42% |
99.97% |
$383.56 |
Using this data we can plot the spending by decile.

The spike in the data makes the cumulative customer percent line crooked. One solution to this is to plot the trend line for the Cumulative Customer Percent. This creates the following diagram:

This approach works well for data where number of distinct values for the attribute used for scoring is much smaller than the number of records but the number of distinct values is larger than the number of intervals needed, which for deciles is ten. If the distribution contains spikes or gaps then resulting interval sizes may not be uniform. The performance of the query is primarily dependant on the number of distinct values and not on the size of the data set.
References
Hand, D., Mannila, H., & Smyth, P. (2001). Principles of Data Mining. Cambridge, MA: MIT Press.
Houghton Mifflin Company, (1996). The American
Heritage® Dictionary of the English Language (3rd ed.).
Retrieved September 6, 2004 from Microsoft Bookshelf 98
The
author, Jeffrey McArthur, is President of JSM Software, a company focused on
affordable business intelligence solution using Microsoft SQL Server. You can
reach Jeffrey at jeffmcarthur@jsm-software.com. For more information on JSM
Software, please see http://www.jsm-software.com or contact the sales department
at 1-410-290-6958.
![]()
© Copyright, 1998-2004 InConcept (Information Conceptual Modeling, Inc.) All Rights Reserved. Privacy Statement. ISSN: 1533-3825