We are in the final stages of preparing the book for printing, aiming for it to be available from Amazon in the next 4-6 weeks.
The E-book version will be available sometime later on iBooks as the Kindle option is not a viable proposition for us.
We will provide a free extract of the 1st few chapters, the free Microsoft SQL Server Holistic Data Warehouse Template, demonstration load using the Microsoft Adventure Works data and reporting models will be available at the time of publishing. This blog will have links to all of these.
Become a follower of the blog and recieve automatically the latest news on the book.
Tuesday, November 9, 2010
Friday, June 18, 2010
Using Visual Basic to process a large text file vs using SQL Server
Earlier this year I rose to the challenge on the Microsoft “Transact-SQL” forum where someone was trying to “parse 1 million records in one minute” using SQL server.
The problem was basically to take a large text file with “60 million rows resulting in 200-300 million rows”. Their best attempts using SQL server were taking 9 hours. The data appeared to be generated from a television viewer tracking system and these records needed to be processed every night before being analysed in SQL server.
Every solution that was being offered using SQL server had apparently been tried.
My alternate demonstration using Visual basic (within MS Excel) took 4 minutes per 1 million records and this was using a basic laptop. This was a time saving from 9 hours down to 1 hour and even better if a server was used.
It shows that it is best to use the right tool for the job - even in the world of I.T.
Google ‘social.msdn "parsing data" "transact-sql" ssdl gerry phillips ‘ to find the thread with the “answer”.
http://social.msdn.microsoft.com/Forums/en-US/transactsql/thread/d868cb65-90bd-4857-9fc5-521e35e73ef9
The problem was basically to take a large text file with “60 million rows resulting in 200-300 million rows”. Their best attempts using SQL server were taking 9 hours. The data appeared to be generated from a television viewer tracking system and these records needed to be processed every night before being analysed in SQL server.
Every solution that was being offered using SQL server had apparently been tried.
My alternate demonstration using Visual basic (within MS Excel) took 4 minutes per 1 million records and this was using a basic laptop. This was a time saving from 9 hours down to 1 hour and even better if a server was used.
It shows that it is best to use the right tool for the job - even in the world of I.T.
Google ‘social.msdn "parsing data" "transact-sql" ssdl gerry phillips ‘ to find the thread with the “answer”.
http://social.msdn.microsoft.com/Forums/en-US/transactsql/thread/d868cb65-90bd-4857-9fc5-521e35e73ef9
Sunday, June 13, 2010
Update on the Book
Despite Gerry being on a world trip for work we have been making good progress on the data warehousing book. Here are a few summary points about the book
The book will cover our Data warehousing strategy & philosophy and the Holistic Data warehousing method and will include a free download of the working template.
It will have full documentation of how to use the Holistic data warehouse template
The book is currently just over 300 pages with over 225 colour screenshots and diagrams.
It will be printed in black and white for hard copies and in colour for ebooks
We are optimising the graphics for presentation in the ebook format for display on the Apple iPad - using the (free) Kindle app.
We are expecting the book to be published within 3 months and available on Amazon and Kindle only.
Stay tuned
The book will cover our Data warehousing strategy & philosophy and the Holistic Data warehousing method and will include a free download of the working template.
It will have full documentation of how to use the Holistic data warehouse template
The book is currently just over 300 pages with over 225 colour screenshots and diagrams.
It will be printed in black and white for hard copies and in colour for ebooks
We are optimising the graphics for presentation in the ebook format for display on the Apple iPad - using the (free) Kindle app.
We are expecting the book to be published within 3 months and available on Amazon and Kindle only.
Stay tuned
Tuesday, March 16, 2010
Experimenting with Recursive Common Table Expressions (CTE)
Having read Johnathon Parker's excellent post on how Recursive CTEs work in SQL Server
http://jonathanparker.com.au/archive/2007/04/16/recursive-common-table-expressions-ctes-in-sql-server-2005.aspx
We decided to try it out on something more complicated
The challenge we gave oursleves was to take the "SalesQuota" table which is in quartlery buckets and split it into monthly Buckets.
This can be acheived with 3 queries in Union as follows:
SELECT SalesPersonID, QuotaDate, SalesQuota / 3 AS SalesQuota
FROM Sales.SalesPersonQuotaHistory
UNION ALL
SELECT SalesPersonID, DATEADD(m, 1, QuotaDate) AS QuotaDate, SalesQuota / 3 AS SalesQuota
FROM Sales.SalesPersonQuotaHistory AS SalesPersonQuotaHistory_1
UNION ALL
SELECT SalesPersonID, DATEADD(m, 2, QuotaDate) AS QuotaDate, SalesQuota / 3 AS SalesQuota
FROM Sales.SalesPersonQuotaHistory AS SalesPersonQuotaHistory_2
An alternative Solution using Recursive CTE is as follows:
WITH SPQ AS (SELECT SalesPersonID, QuotaDate, SalesQuota / 3 AS SalesQuota , 1 as c1
FROM Sales.SalesPersonQuotaHistory
UNION ALL
SELECT SalesPersonID, DATEADD(m, 1 , QuotaDate) AS QuotaDate, SalesQuota, c1+1 as c1
FROM SPQ
WHERE c1<3)
SELECT SalesPersonID, QuotaDate, SalesQuota
from SPQ
ORDER BY SalesPersonID, QuotaDate
This repeats the second part twice in a recursive manner - providing the same results as the 1st triple union query.
http://jonathanparker.com.au/archive/2007/04/16/recursive-common-table-expressions-ctes-in-sql-server-2005.aspx
We decided to try it out on something more complicated
The challenge we gave oursleves was to take the "SalesQuota" table which is in quartlery buckets and split it into monthly Buckets.
This can be acheived with 3 queries in Union as follows:
SELECT SalesPersonID, QuotaDate, SalesQuota / 3 AS SalesQuota
FROM Sales.SalesPersonQuotaHistory
UNION ALL
SELECT SalesPersonID, DATEADD(m, 1, QuotaDate) AS QuotaDate, SalesQuota / 3 AS SalesQuota
FROM Sales.SalesPersonQuotaHistory AS SalesPersonQuotaHistory_1
UNION ALL
SELECT SalesPersonID, DATEADD(m, 2, QuotaDate) AS QuotaDate, SalesQuota / 3 AS SalesQuota
FROM Sales.SalesPersonQuotaHistory AS SalesPersonQuotaHistory_2
An alternative Solution using Recursive CTE is as follows:
WITH SPQ AS (SELECT SalesPersonID, QuotaDate, SalesQuota / 3 AS SalesQuota , 1 as c1
FROM Sales.SalesPersonQuotaHistory
UNION ALL
SELECT SalesPersonID, DATEADD(m, 1 , QuotaDate) AS QuotaDate, SalesQuota, c1+1 as c1
FROM SPQ
WHERE c1<3)
SELECT SalesPersonID, QuotaDate, SalesQuota
from SPQ
ORDER BY SalesPersonID, QuotaDate
This repeats the second part twice in a recursive manner - providing the same results as the 1st triple union query.
Monday, November 16, 2009
Microsoft SQL - Update a Table with results from a query.
i.e. update with join to a derived table.
Scenario - You have a table (TableA) where you would like to keep track of, among other statistics and information, the number of records for each Entity from another table (DataTable). This requires a method to Update one table while joined to the results of query on another table (a derived table).
The join is by EntityID.
This is how you could do it in Microsoft SQL using a T-SQL statement.
UPDATE TableA
SET TableA.RecordCount = RecCount
FROM TableA,
(SELECT EntityID, Count(EntityID) as RecCount
From DataTable Group by EntityID) as DT
where TableA.EntityID = DT.EntityID
Scenario - You have a table (TableA) where you would like to keep track of, among other statistics and information, the number of records for each Entity from another table (DataTable). This requires a method to Update one table while joined to the results of query on another table (a derived table).
The join is by EntityID.
This is how you could do it in Microsoft SQL using a T-SQL statement.
UPDATE TableA
SET TableA.RecordCount = RecCount
FROM TableA,
(SELECT EntityID, Count(EntityID) as RecCount
From DataTable Group by EntityID) as DT
where TableA.EntityID = DT.EntityID
Sunday, November 2, 2008
Adventure works 2005- faulty report samples - fixes/solution
Solution for faulty Product line sales:
The Product line sales report has currently for "top employees" the following incorrect query in the dataset TopEmployees:
SELECT TOP 5 C.LastName, C.FirstName, E.EmployeeID, SUM(SOH.SubTotal) AS SaleAmount
FROM Sales.SalesPerson SP INNER JOIN HumanResources.Employee E ON SP.SalesPersonID = E.EmployeeID INNER JOIN Person.Contact C ON E.ContactID = C.ContactID INNER JOIN Sales.SalesOrderHeader SOH ON SP.SalesPersonID = SOH.SalesPersonID INNER JOIN Sales.SalesOrderDetail SOD ON SOH.SalesOrderID = SOD.SalesOrderID INNER JOIN Production.Product P ON SOD.ProductID = P.ProductID INNER JOIN Production.ProductSubcategory PS ON P.ProductSubcategoryID = PS.ProductSubcategoryID INNER JOIN Production.ProductCategory PC ON PS.ProductCategoryID = PC.ProductCategoryID
WHERE (PC.ProductCategoryID = @ProductCategory) AND (PS.ProductSubcategoryID IN (@ProductSubcategory)) AND (SOH.OrderDate > @StartDate) AND (SOH.OrderDate < @EndDate)
GROUP BY C.LastName, C.FirstName, E.EmployeeID, PC.ProductCategoryID, PS.ProductSubcategoryID
ORDER BY SUM(SOH.SubTotal) DESC
The problem is that it is adding up the Sales Order header - for every line on the order - so an order with 10 lines has the Order total multiplied 10 times. The "SUM(SOH.SubTotal)" bits need to be substituted with SUM(SOD.LineTotal).
So the top 5 report should use the following query:
SELECT TOP (100) PERCENT C.LastName, C.FirstName, E.EmployeeID, SUM(SOD.LineTotal) AS SaleAmount
FROM Sales.SalesPerson AS SP INNER JOIN HumanResources.Employee AS E ON SP.SalesPersonID = E.EmployeeID INNER JOIN Person.Contact AS C ON E.ContactID = C.ContactID INNER JOIN Sales.SalesOrderHeader AS SOH ON SP.SalesPersonID = SOH.SalesPersonID INNER JOIN Sales.SalesOrderDetail AS SOD ON SOH.SalesOrderID = SOD.SalesOrderID INNER JOIN Production.Product AS P ON SOD.ProductID = P.ProductID INNER JOIN Production.ProductSubcategory AS PS ON P.ProductSubcategoryID = PS.ProductSubcategoryID INNER JOIN Production.ProductCategory AS PC ON PS.ProductCategoryID = PC.ProductCategoryID
WHERE (PC.ProductCategoryID = @ProductCategory) AND (PS.ProductSubcategoryID IN (@ProductSubcategory)) AND (SOH.OrderDate > @StartDate) AND (SOH.OrderDate < @EndDate)
GROUP BY C.LastName, C.FirstName, E.EmployeeID, PC.ProductCategoryID, PS.ProductSubcategoryID
ORDER BY SUM(SOD.LineTotal) DESC
For the dataset TopCustomers there is an identical mistake
SELECT TOP 5 S.Name AS StoreName, SUM(SOH.SubTotal) AS SaleAmount, PS.ProductSubcategoryID,nPS.ProductCategoryID
FROM Production.Product P INNER JOIN Production.ProductSubcategory PS ON P.ProductSubcategoryID = PS.ProductSubcategoryID INNER JOIN Sales.SalesOrderDetail SOD ON P.ProductID = SOD.ProductID INNER JOIN Sales.Customer CU INNER JOIN Sales.SalesOrderHeader SOH ON CU.CustomerID = SOH.CustomerID INNER JOIN Sales.Store S ON CU.CustomerID = S.CustomerID ON SOD.SalesOrderID = SOH.SalesOrderID
WHERE (SOH.OrderDate > @StartDate) AND (SOH.OrderDate < @EndDate) AND (PS.ProductCategoryID = @ProductCategory) AND (PS.ProductSubcategoryID IN (@ProductSubcategory))
GROUP BY S.Name, PS.ProductSubcategoryID, PS.ProductCategoryID
ORDER BY SUM(SOH.SubTotal) DESC
The correct query is:
SELECT TOP 5 S.Name AS StoreName, SUM(SOD.LineTotal) AS SaleAmount, PS.ProductSubcategoryID,nPS.ProductCategoryID
FROM Production.Product P INNER JOIN Production.ProductSubcategory PS ON P.ProductSubcategoryID = PS.ProductSubcategoryID INNER JOIN Sales.SalesOrderDetail SOD ON P.ProductID = SOD.ProductID INNER JOIN Sales.Customer CU INNER JOIN Sales.SalesOrderHeader SOH ON CU.CustomerID = SOH.CustomerID INNER JOIN Sales.Store S ON CU.CustomerID = S.CustomerID ON SOD.SalesOrderID = SOH.SalesOrderID
WHERE (SOH.OrderDate > @StartDate) AND (SOH.OrderDate < @EndDate) AND (PS.ProductCategoryID = @ProductCategory) AND (PS.ProductSubcategoryID IN (@ProductSubcategory)) GROUP BY S.Name, PS.ProductSubcategoryID, PS.ProductCategoryIDORDER BY SUM(SOD.LineTotal) DESC
I think these work :)
As for the sales territory drilldown the only fix I can see is to do an update query of the Sales Order Header file of the correct line totals from the Sales Order Detail files.
The Product line sales report has currently for "top employees" the following incorrect query in the dataset TopEmployees:
SELECT TOP 5 C.LastName, C.FirstName, E.EmployeeID, SUM(SOH.SubTotal) AS SaleAmount
FROM Sales.SalesPerson SP INNER JOIN HumanResources.Employee E ON SP.SalesPersonID = E.EmployeeID INNER JOIN Person.Contact C ON E.ContactID = C.ContactID INNER JOIN Sales.SalesOrderHeader SOH ON SP.SalesPersonID = SOH.SalesPersonID INNER JOIN Sales.SalesOrderDetail SOD ON SOH.SalesOrderID = SOD.SalesOrderID INNER JOIN Production.Product P ON SOD.ProductID = P.ProductID INNER JOIN Production.ProductSubcategory PS ON P.ProductSubcategoryID = PS.ProductSubcategoryID INNER JOIN Production.ProductCategory PC ON PS.ProductCategoryID = PC.ProductCategoryID
WHERE (PC.ProductCategoryID = @ProductCategory) AND (PS.ProductSubcategoryID IN (@ProductSubcategory)) AND (SOH.OrderDate > @StartDate) AND (SOH.OrderDate < @EndDate)
GROUP BY C.LastName, C.FirstName, E.EmployeeID, PC.ProductCategoryID, PS.ProductSubcategoryID
ORDER BY SUM(SOH.SubTotal) DESC
The problem is that it is adding up the Sales Order header - for every line on the order - so an order with 10 lines has the Order total multiplied 10 times. The "SUM(SOH.SubTotal)" bits need to be substituted with SUM(SOD.LineTotal).
So the top 5 report should use the following query:
SELECT TOP (100) PERCENT C.LastName, C.FirstName, E.EmployeeID, SUM(SOD.LineTotal) AS SaleAmount
FROM Sales.SalesPerson AS SP INNER JOIN HumanResources.Employee AS E ON SP.SalesPersonID = E.EmployeeID INNER JOIN Person.Contact AS C ON E.ContactID = C.ContactID INNER JOIN Sales.SalesOrderHeader AS SOH ON SP.SalesPersonID = SOH.SalesPersonID INNER JOIN Sales.SalesOrderDetail AS SOD ON SOH.SalesOrderID = SOD.SalesOrderID INNER JOIN Production.Product AS P ON SOD.ProductID = P.ProductID INNER JOIN Production.ProductSubcategory AS PS ON P.ProductSubcategoryID = PS.ProductSubcategoryID INNER JOIN Production.ProductCategory AS PC ON PS.ProductCategoryID = PC.ProductCategoryID
WHERE (PC.ProductCategoryID = @ProductCategory) AND (PS.ProductSubcategoryID IN (@ProductSubcategory)) AND (SOH.OrderDate > @StartDate) AND (SOH.OrderDate < @EndDate)
GROUP BY C.LastName, C.FirstName, E.EmployeeID, PC.ProductCategoryID, PS.ProductSubcategoryID
ORDER BY SUM(SOD.LineTotal) DESC
For the dataset TopCustomers there is an identical mistake
SELECT TOP 5 S.Name AS StoreName, SUM(SOH.SubTotal) AS SaleAmount, PS.ProductSubcategoryID,nPS.ProductCategoryID
FROM Production.Product P INNER JOIN Production.ProductSubcategory PS ON P.ProductSubcategoryID = PS.ProductSubcategoryID INNER JOIN Sales.SalesOrderDetail SOD ON P.ProductID = SOD.ProductID INNER JOIN Sales.Customer CU INNER JOIN Sales.SalesOrderHeader SOH ON CU.CustomerID = SOH.CustomerID INNER JOIN Sales.Store S ON CU.CustomerID = S.CustomerID ON SOD.SalesOrderID = SOH.SalesOrderID
WHERE (SOH.OrderDate > @StartDate) AND (SOH.OrderDate < @EndDate) AND (PS.ProductCategoryID = @ProductCategory) AND (PS.ProductSubcategoryID IN (@ProductSubcategory))
GROUP BY S.Name, PS.ProductSubcategoryID, PS.ProductCategoryID
ORDER BY SUM(SOH.SubTotal) DESC
The correct query is:
SELECT TOP 5 S.Name AS StoreName, SUM(SOD.LineTotal) AS SaleAmount, PS.ProductSubcategoryID,nPS.ProductCategoryID
FROM Production.Product P INNER JOIN Production.ProductSubcategory PS ON P.ProductSubcategoryID = PS.ProductSubcategoryID INNER JOIN Sales.SalesOrderDetail SOD ON P.ProductID = SOD.ProductID INNER JOIN Sales.Customer CU INNER JOIN Sales.SalesOrderHeader SOH ON CU.CustomerID = SOH.CustomerID INNER JOIN Sales.Store S ON CU.CustomerID = S.CustomerID ON SOD.SalesOrderID = SOH.SalesOrderID
WHERE (SOH.OrderDate > @StartDate) AND (SOH.OrderDate < @EndDate) AND (PS.ProductCategoryID = @ProductCategory) AND (PS.ProductSubcategoryID IN (@ProductSubcategory)) GROUP BY S.Name, PS.ProductSubcategoryID, PS.ProductCategoryIDORDER BY SUM(SOD.LineTotal) DESC
I think these work :)
As for the sales territory drilldown the only fix I can see is to do an update query of the Sales Order Header file of the correct line totals from the Sales Order Detail files.
Thursday, October 30, 2008
Problems we are having with Adventure works 2005
Product Line Sales:
Top 5 Customers - sales are inflated significantly. Adventure works sales in total (from the Sales order header) are approx $97M
Do a report (Bikes, road bikes, Calendar year 2003) and the Top 5 Sales people sold over $158 millions - does anyone notice this as a problem ? Just a slight reconciliation difference I would think
Or is this report deliberately broken so people can fix it?
Now my next favourite report:
Sales Territory Drill down
The Sales by Order by Header does agree to the sales order total from the 'Drill down' - hmm housten we have a problem
Australia , Lynn Tsoflias Sales order SO51150 $40,274.01 =/= the drilldown $26,328 (after discount of $1,996)
In fact the Adventure works database is broken - Sales order Header subtotal =/= sum(Sales Order Detail . line total) - total discrepancy a cool $17 mill or so. A redememing fact is that the dbo.FactResellerSales table does indeed have the 'correct' sales number.(It reconciles to the Sales order Detail)
ALSO this report is missing sales made by managers - or dont managers count? (they are also missing from the view Sales.vSalesPerson in the
=======================================
So is there something I am completely missing?
Seriously though I am using Adventure works for SQL server 2005 - are these problems that are fixed in the Adventure works 2008? (havent installed yet as I dont have SQL 2008 yet)
Please feel free to contribute to the discussion
Top 5 Customers - sales are inflated significantly. Adventure works sales in total (from the Sales order header) are approx $97M
Do a report (Bikes, road bikes, Calendar year 2003) and the Top 5 Sales people sold over $158 millions - does anyone notice this as a problem ? Just a slight reconciliation difference I would think
Or is this report deliberately broken so people can fix it?
Now my next favourite report:
Sales Territory Drill down
The Sales by Order by Header does agree to the sales order total from the 'Drill down' - hmm housten we have a problem
Australia , Lynn Tsoflias Sales order SO51150 $40,274.01 =/= the drilldown $26,328 (after discount of $1,996)
In fact the Adventure works database is broken - Sales order Header subtotal =/= sum(Sales Order Detail . line total) - total discrepancy a cool $17 mill or so. A redememing fact is that the dbo.FactResellerSales table does indeed have the 'correct' sales number.(It reconciles to the Sales order Detail)
ALSO this report is missing sales made by managers - or dont managers count? (they are also missing from the view Sales.vSalesPerson in the
=======================================
So is there something I am completely missing?
Seriously though I am using Adventure works for SQL server 2005 - are these problems that are fixed in the Adventure works 2008? (havent installed yet as I dont have SQL 2008 yet)
Please feel free to contribute to the discussion
Subscribe to:
Posts (Atom)