No, this article is not about
thinning out the fat rows out of your cursors—which might not be a bad idea. It
is about using the ADO 2.x "SHAPE" syntax to manage hierarchical data.
For some time now, the Fox and ADO teams have been working on creating a
mechanism to handle the construction and management of hierarchies. The result
is a separate provider in ADO designed specifically to handle this fairly
common data structure. This new OLE DB provider seems to make it easier to deal
with a number of common data access problems—but as we'll see, provider has a
specific targeted audience—and that might not be you.
I've asked most of my classes
and conference audiences if they use "hierarchical" data—most won’t
admit to it—at least not in public. I suppose they think "hierarchy"
is one of those fancy terms that only college professors use. While that might
be true, I suspect that you use hierarchical data when you work with your own database.
If you have queries that return customers and their orders, or books and their
authors, or reference similarly related tables, then you use hierarchical data.
Yep, you were right—it's a fancy term for one-to-many relationships.
When I started researching the
SHAPE provider, I discovered that there were many articles written about it,
but none seemed to talk to my issues. That is, most of the examples use unbound
queries against tiny databases—such as SELECT * FROM Authors in the sample Pubs
database. I did find a useful article "Shape Up Your Data Using ADO 2.0
Hierarchical Recordsets in ASP" on the MSDN CD that built parameter
queries, but this article discusses a Web-based application, and I usually
focus on client/server Visual Basic applications.
While investigating the Shape
provider, I kept the client/server and large database developer in mind. As I
got deeper into the Shape functionality, it became clear that the provider was
meant for smaller, table-based databases. Too many of its functional
assumptions made it unsuitable for sizeable databases—or databases that can
generate complex hierarchical result sets on their own. SQL Server, for one,
can easily return a Recordset based on a single or a set of parent rows and all
associated child rows—to any depth necessary—with a single call to the server.
However, not all OLE DB or ODBC providers are capable of this level of
sophistication. Simpler providers have no way to return these hierarchies. But,
the Shape provider is more than just a way to return Parent/child hierarchies.
It can also be used to compute more complex aggregate functions used when you
create reports.
When I worked on Microsoft's
Visual Studio Marketing team, I demonstrated the Shape provider any number of
times. After about a minute of flashy clicking and dragging, I ended up with a
three-table hierarchy that looked something like Figure 1. This is really pretty easy. Just drag the
three tables from the Data View Window over to the Data Environment Designer
window and set the "Relation" property (on the Properties page) to
cross-reference the au_id and title_id fields. I dragged the parent object (the
"authors" command in this case) over to a form and chose the
MSHFlexGrid to display the rows. No, I'm not going to walk you through you this
again. That's because I want to show you an easier way that requires a few
lines of code but gives you a lot better performance and far more control over
your results. If you want to see it done, come by the office or one of my VBits
talks and I'll be happy to show you step-by-step.
Let's get started with the
basics by creating a Visual Basic application that uses the SHAPE provider to
do something simple. We'll use Visual Basic 6.0 Enterprise Edition for this
exercise—I make no guarantees on the suitability of the Professional Edition. I
also have SP3 installed, which means I'm using ADO 2.1 GA whether I want to or
not. I did not, however, set my references to the 2.1 library—that breaks the DataEnvironment
Designer which it turns out I didn't really need. I accessed my local SQL
Server 7.0 (no SP) and the Pubs database to experiment. I'm glad I did because
some of the sample applications would have taken all afternoon to run if they
had a lot of data (more than 20 rows) to query.
Note: Keep in mind that the SHAPE provider's default behavior uses two (or more) SELECT * queries (with no WHERE clauses)—so do most of the examples in MSDN. That is, the Visual Basic tools generate one open-ended query as the "Parent" command and one additional open-ended query for each "child" command. These are submitted to SQL Server as a single batch. This means that if you don't code otherwise, the SHAPE provider will return the entire contents of all 'n' tables to your client (or middle-tier server)—or at least try to.
So, how do you create a simple
SHAPE statement in the first place? Well, you can steal leverage it from
the DataEnvironment Designer or copy it from my code shown below. I also won't
go into the raw syntax of the SHAPE statement—there are a half-dozen MSDN
articles for that. See "Shape Commands in General" in MSDN for
starters. Frankly, there's quite a bit there that you won't need to know at all
to run any of the examples I'm about to show you.
Start by setting up some General
Declaration variables:
Dim cnn As ADODB.Connection
Dim rst As ADODB.Recordset
Dim rstTitleAuthor As ADODB.Recordset
|
Figure 1: The DataEnvironment Designer with a sample hierarchy. |
Next, open a connection to your
favorite SQL Server—one that still has the sample Pubs database loaded. Note
that you need to create an ADO Connection object and set it's Provider property
to "MSDateShare" before you open it. You can use either the default
ODBC to OLE DB data provider or a straight OLE DB provider for SQL
Server. This means your Connection string should look just like it always does.
Mine references a DSN, but you can use a DSN-less connection.
Set cnn = New Connection
cnn.Provider = "MSDataShape"
cnn.Open "Data Provider=MSDASQL;" & _
"DSN=biblio70;uid=sa;pwd=xxxx;database=pubs"
Tip: As I tell my kids, do as I say, not as I do. Don't use SA as your development password. It masks all kinds of issues that will bite you when you try to deploy your application.
Once we're connected, we can
create a new Recordset and set the StayInSync property. We set this to false to
prevent extra queries from being generated. I'll talk about this later.
Set rst = New Recordset
rst.StayInSync = False
In the next step you'll have to
provide the SHAPE provider with a correctly constructed query. We pass this
query to the provider through the Recordset Source property. Consider that the
SHAPE provider does not really have much of a query processor. This does not
mean you can't submit complex queries, as the SHAPE provider simply passes
these on to the backend for processing.
In the first bracketed
expression after the SHAPE statement, you need to provide an SQL SELECT
statement (or a stored procedure) that returns the "parent" result
set. In our example, we'll request all of the rows from the authors table.
Of course, selecting the entire authors table is not that clever. No one in their right mind would request all the columns and all of the rows from a production table—unless the table has as few rows as the Pubs demonstration database. We'll fix this later.
When the parent result set is
created, the SHAPE provider returns it to your application just like any other
Recordset, but with an extra Field tacked on to the end. This means if your
table has eight columns, nine Field objects return for each row. This extra
(last) Field.Value contains another Recordset—containing the rows for the child
rows associated with the parent row.
|
Figure 2: Author "parent" rows and associated TitleAuthor "child" rows. |

Figure 2 shows how each row returned from the
Authors table contains an extra field pointing to the row(s) of the TitleAuthor
table that correspond to this specific author.
The next line in the SQL query
to be passed to the SHAPE provider should be an APPEND statement that tells the
SHAPE provider what rows should be fetched from the child table (titleauthor
in the example) for each parent row in the Authors table. Here's where we
concatenate the "APPEND ({select * from titleauthor} " into our
query. Yes, you can have more than one APPEND statement in your SHAPE query,
but let's stick with one child relationship for now.
The last line in our SQL query
is a RELATE statement that tells the SHAPE provider how to tie the two sets of
rows together. In this case it's the "au_id" (author id) field in the
two tables that have to match up. This means for each row in the authors table,
the SHAPE provider will locate all rows in the TitleAuthor table having the
same au_id and create a child Recordset with those rows. We name this
relationship with the AS clause. This name is assigned to the final (extra)
Field object on the parent's Recordset so we can refer to it by name.
Yes, it's the SHAPE provider that matches these rows—it's not SQL Server, or Jet, or any other backend you happen to be using. This means that the SHAPE provider has to have these rows loaded into memory (your client system's RAM) to get this to work.
Here's the completed code to
build the SHAPE statement. Next comes the code to create the parent Recordset object.
rst.Source = "SHAPE {select * from authors} " _
& " APPEND ({select * from titleauthor}" _
& " RELATE au_id TO au_id) AS chapTitleAuthor"
'
rst.Open , cnn
Once the Recordset returns, you
can step through the rows in the usual manner. Keep in mind that you need to
fetch the rows of the Recordset immediately. This frees up locks and permits
other users to work with the data—especially since we used an open-ended query
to build the parent result set.
For purposes of this example, I
populated a TextBox control array with the values from the parent Recordset by
binding the TextBox controls to an ADO Data Control (ADC), which provides
updatability and navigation through the parent Recordset. The TextBox controls and
ADC were not initialized at design time—we'll set the required properties
entirely in code.
'
' Bind the root Recordset to the text boxes
'
For i = 0 To rst.Fields.Count - 2
txtFields(i).DataField = rst.Fields(i).Name
Next I
We set the ADC's Recordset
property to the parent Recordset returned by the SHAPE provider. This fills the
TextBox controls and permits us to navigate up and down in the "Select *
from authors" rowset.
Set Adodc1.Recordset = rst
As we arrive at each row, we
reference the child rowset by extracting the Recordset from the extra Field
object tacked on to the end of the Fields returned for each parent row. Note
that this Field object is named in the RELATE clause of the SHAPE statement.
Set rstTitleAuthor = rst.Fields("chapTitleAuthor").Value
Note: Referencing a Field object by name as shown above is the slowest possible technique. It can be six-times slower than referencing the Field object by number. In a high-performance application, it would be better to use an ordinal reference such as rst.Fields(rst.Fields.Count-1).Value. This small change, multiplied times each Field object reference can make a big difference in overall performance.
For this example, I added a
Microsoft Hierarchical FlexGrid (MSHFlexGrid) to the form to display the child
rows. This is easy to do—you simply have to assign the child Recordset to the
DataSource property. Once assigned, the grid fills with the current set of
child rows.
Set MSHFlexGrid1.DataSource = rstTitleAuthor
Question: Couldn't we have just used the power of the MSHFlexGrid to display both the parent and child rows? Sure. But for now, let's take this a step at-a-time. I'll illustrate that technique later.
When we use the ADC to position
from row to row, we have to remind the MSHFlexGrid about the Recordset to
display. We do this in the MoveComplete event as shown below.
Private Sub Adodc1_MoveComplete(ByVal adReason As ADODB.EventReasonEnum, _
ByVal pError As ADODB.Error, adStatus As ADODB.EventStatusEnum, _
ByVal pRecordset As ADODB.Recordset)
' Use the faster technique here.
Set rstTitleAuthor = rst(rst.Fields.Count - 1).Value
Set MSHFlexGrid1.DataSource = rstTitleAuthor
End Sub
What we end up with (as shown in
Figure 3) is a form
that shows a single parent row at the top in a number of bound TextBox
controls, with the child rows shown below in the MSHFlexGrid.

Figure 3: Using bond TextBox controls with the MSHFlexGrid
We could have used the
MSHFlexGrid to display the parents and all of the children. I didn't choose
that approach for a variety of reasons:
The grid is a PIA (that's a technical
term) to setup in code and since I didn't use the DataEnvironment Designer to
create the parent-child relationships (I could have). This means I would have
to wade into the grid's properties to set it up correctly. The grid properties
don't seem to permit changing much at runtime anyway. For example, there's no
(apparent) way to change the cell (grid column on a band) width at runtime (or
design-time).
The help topics for the grid assume
you'll use it with the DataEnvironment Designer and don't really help much to
show how to set it up the way you can interactively through the grid's property
page dialogs.
Note: I pressed F1 with the cursor focus on the MSHFlexGrid expecting MSDN to show me help topics about the grid. It didn't. MSDN complained, "The compiled help (.chm) file does not contain context IDs." I guess that's because I'm using the most recent (July 1999) version of MSDN. Sigh. There's plenty of help on the grid—just do a "Search" for it in MSDN. I found 166 entries mentioning MSHFlexGrid.
Unless you tell it otherwise, the
DataEnvironment Designer does not use the right (default) ADO connection or
Recordset properties. This means you'll have to remind it to stop prompting for
the user ID and password while opening.
The DataEnvironment Designer (unless
you code otherwise) launches itself before Form_Load time. This means that if
something goes wrong, your error handler won't be there to catch it.
Deactivating the DataEnvironment Designer isn't easy—if you want the bound
controls to auto-synchronize at design time.
The DataEnvironment Designer does not
wait for parameters to be passed into the query you're running. Since we'll be
running parameter-based queries (almost) exclusively, this simply won't work.
Even if we trap the WillExecute event
and pass in first-time parameters, we will still have to manually rebind (in
code) any simple bound controls (like TextBox controls) to the newly created
DataEnvironment Designer Recordset. While this is not a problem with the
MSHFlexGrid (it's a "complex" bound control), it is an issue for many
of my designs.
|
Figure 4 Shape Example 2—an MSHFlexGrid populated from a DataEnvironment Designer generated query. |
The DataEnvironment Designer
approach seems tempting. It makes a flashy demo and what you get is a codeless
component that shows all parent rows with each child's rows tucked neatly
underneath. For example, Figure
4 illustrates what the MSHFlexGrid looks like after having been
populated from a DataEnvironment Designer-based query. Notice that the
individual rows are already expanded. If you use the CollapseAll method, the
grid will only show the parent rows. Clicking on the "+" signs will
expand the child rows.
Granted, this is cool. And it
did not take long to setup using the drag-drop techniques. But the impact it
has on your system is substantial.
To get the flexibly, performance
and scalability I'm sure you will demand, you will want a code-based approach.
Tip: If you insist on using the DataEnvironment Designer to build your relationships and bind to the MSHFlexGrid, be sure to reset the DataEnvironment Designer's Connection object's properties using its Visual Basic-based property page as in Figure 5. Set the Run (and Design) PromptBehavior properties to "4 – adPromptNever". This gives your code a chance to capture the login ID failures—instead of having ODBC (or OLE DB) expose login ID dialogs. These dialogs give the user an opportunity to re-guess their login ID and password—over and over again—until they get it right—something I prefer to avoid.

Figure 5: The DataEnvironment Designer Connection (Visual Basic) property page.
Ok, so let's not get wrapped
around the axel with the DataEnvironment Designer and the MSHFlexGrid. There
are many other ways to display and manage the data returned from the SHAPE
provider. We'll examine these a little later.
Before we get started, let's see
what impact the default (demo) program has on the server. We'll then compare
that to the damage done by the first example. Next, we'll try to see what can
be done to minimize that damage.
Remember that our initial query
didn't use the DataEnvironment Designer—it used the same open-ended queries to
return the rows—two "SELECT * FROM <table>" queries as
generated by the DataEnvironment Designer. Using SQL Server 7.0's Profiler, we
can see what impact these queries have on SQL Server. The Profiler displays the
raw Transact SQL (TSQL) queries and other operations (such as opening
connections).
Note: For some reason, when I was developing the data source-based samples (using the ADC and the DataEnvironment Designer), I noticed that the server eventually started returning connection errors for no apparent reason. The Profiler told me that ADO, the ADC or Visual Basic had not bothered to close and release the connection(s) when I pressed the Stop button in the Visual Basic IDE or even ended the code normally. I tried to add code to explicitly close the Recordsets and Connection objects to no avail. I had to exit Visual Basic completely to get the Connections closed. Folks, this is the same problem I reported back when I wrote my first book on VBSQL—seven years ago.
Let's take a look at the
Profiler output for (part of) the example runs.
+RPC:Starting
sp_prepare @P1 output, NULL, N'
select * from titleauthor', 1
select @P1 15:36:38.753
+SQL:BatchStarting
select * from authors;select * from titleauthor 15:36:38.763
The Profiler tells us that ADO
told SQL Server to "prepare" a query plan for the titleauthor
query and to then execute the two open-ended SELECT queries against authors
and titleauthor. That's it. The SHAPE provider accepts all of the rows
from both SELECT queries and passes the data up to the application to process.
ADO makes no further trips to the server. This means that the SHAPE provider
considers this data static—it does not attempt to get fresh data when it
positions from row to row.
It turns out that the queries
generated by the DataEnvironment Designer using the drag-and-drop method are
identical to those generated by our hard-coded example1 code shown earlier.
Since the two tables are small,
(Authors has about 22 rows, titleauthor only about 25 rows) only about 50 rows
are fetched—no big deal. So what if there were 16,000 authors and 50,000
titleauthor rows? Obviously, this approach would not work—or at least not very
quickly. In addition, since these are open-ended queries, the negative impact
on scalability would be easily apparent.
Keeping this in mind, I wanted
to find a way to get just the rows I wanted, but still let the SHAPE provider
do it's thing—providing the one-to-many relationship management in memory and
still provide updatability for those (albeit rare) cases where I want to update
through a cursor.
First, it was clear that I
needed to limit the rows fetched by the initial "parent" query. Once
this is done, we can selectively fetch the child rows "on demand"—or
can we? Well, it turns out that you can, but you have to use another feature of
the SHAPE provider that we'll talk about later. So, limiting the initial parent
query seems easy enough—just provide a WHERE clause. Well, we also don't want
to hard-code a WHERE clause (in most cases), so I tried to create an ADO
Command object to generate the SHAPE Recordsets. Forget it, it didn't work. I
did get the old-reliable "concatenate-on-the-fly" technique to work.
By limiting the authors query to return only a single author I forced ADO to
re-fetch each time we want to see another author. Sure, you could setup a query
that returned only a selected group of parent rows—say, just the authors from
California.
However, the child query is
still returning all of the rows from the titleauthor table. I figured
the SHAPE provider must have a way to query the child rows only as
needed.
Note: The fetch-all-the-rows-at-once approach might be required for situations where you can't depend on a persistent connection—as with Web pages or middle-tier components. But this does not mean you must fall back on the open-ended queries. There are still ways to write your queries that only fetch "just enough" data.
Well, it does. If you re-phrase
your SHAPE statement to pass a parameter from the parent query to the child
query, ADO and the SHAPE provider change its strategy to fetch the child rows
on demand. Here's how the parameter-based syntax looks:
rst.Source = "SHAPE {select au_id, au_lname, au_fname, city, state, zip" _
& " From authors where state ='" & txtStateWanted & "'} " _
& " APPEND ({select * from titleauthor where au_id = ?}" _
& " RELATE au_id TO Parameter 0) AS chapTitleAuthor"
I capture the state wanted in a
TextBox control (see Example3) and concatenate it into the query when the user
presses the "Search" button. Note that the code ties the au_id in the
parent query with a "?" in the child query in the RELATE clause.
Can you simply execute an stored
procedure instead of the SELECT statement(s) in the parent and child queries?
Sure. See Example4 for a working example.
rst.Source = "SHAPE {execute GetAuthorsByState '" & txtStateWanted & "'} " _
& " APPEND ({select * from titleauthor where au_id = ?}" _
& " RELATE au_id TO Parameter 0) AS chapTitleAuthor"
Let's take a look again at what the
Profiler reports about the parameter-based queries. I discovered something
interesting here that probably will surprise you as much as it did me.
The TSQL statements being sent
make a lot of sense—or they seem to at first.
First,
ADO asks SQL Server to return (just) metadata for the titleauthor query.
+SQL:BatchStarting
SET FMTONLY ON
select * from titleauthor
SET FMTONLY OFF 17:02:13.917
Next, ADO performs the new
parameter-based query using the concatenated parameter we supplied in code
(from the TextBox on the form). This returns an initial rowset with just the
"California" authors.
+SQL:BatchStarting
select au_id,au_lname, au_fname, phone, Address, city,
state, zip from authors where state ='CA' 17:02:13.927
Next, ADO makes a round-trip to
find out what the metadata looks like for the key-based query against the titleauthor
table. We're up to three round trips so far.
+SQL:BatchStarting
SET FMTONLY ON
select au_id from
titleauthor
SET FMTONLY OFF 17:02:13.977
At this point ADO creates a temporary
stored procedure used to fetch columns from the titleauthor table. This
query fires whenever a specific authors table row is selected. Since we
select a whole series of author rows we can expect this to be executed quite a
few times—once for each row in the parent Recordset. Note that the final SELECT
in this batch is simply returning the parameter @P1 from the sp_prepare
statement.
+RPC:Starting
sp_prepare @P1 output, N'@P1 varchar(11)', N'
select * from titleauthor where au_id = @P1', 1
select @P1 17:02:13.977
The final TSQL code reported by
the profiler is the fetch (finally!) for our first parent row. That is, the
SHAPE provider knows that the au_id for the first row is '172-32-1176' so it queries the
server for any titles matching this author id.
+RPC:Starting
sp_execute 1, '172-32-1176' 17:02:13.987
Well, we're not done yet. At
this point we have only positioned to the first row of the parent Recordset and
fetched the first set of child (titleauthor) rows. As we fetch additional rows we
would expect ADO and the SHAPE provider to re-use temporary stored procedure #1
(as created above) to fetch additional rows from the titleauthor table. Sadly,
this is not the case. As we can see in the Profiler dump, the SHAPE provider
tells SQL Server to drop (un-prepare) the preceding temporary stored procedure
and create another—just like it. Sigh. It then executes the new temporary
stored procedure to return the next set of child rows.
+RPC:Starting
sp_unprepare 1 17:21:01.367
+RPC:Starting
sp_prepare @P1 output, N'@P1
varchar(11)', N'
select * from titleauthor where au_id = @P1', 1
select @P1 17:21:01.367
+RPC:Starting
sp_execute 2, '213-46-8915' 17:21:01.377
Now this was a surprise to me. I
would have expected the SHAPE provider to be able to at least leverage the work
done to create the child-row queries already created. As you can see, this
"just-enough" approach has its problems too. It requires several more
round-trips per row than the "all-at-once" approach—which, frankly,
looks like it has quite a bit of merit at this point. The
"just-enough" approach does fetch current data for each child row,
but depends on a static set of parent rows to drive the secondary query.
You can see how adding
additional children to the hierarchy might be interesting. For example,
customer-order-item queries would be fairly common. However, you can also see
how any additional complexity in the hierarchy would be translated
exponentially in increased overhead.
This brief article has given you
a way to get started with the SHAPE syntax and to be able to create
applications that can both scale and perform reasonably well. There is a lot
more to the SHAPE syntax and I hope you will have less trouble than I did
trying to figure out how to use it.