Getting Your Data into SHAPE

William R. Vaughn
President, Beta V Corporation

No, this article is not about thinning out the fat rows out of your cursors—which might not be a bad idea. It is about using the ADO 2.x "SHAPE" syntax to manage hierarchical data. For some time now, the Fox and ADO teams have been working on creating a mechanism to handle the construction and management of hierarchies. The result is a separate provider in ADO designed specifically to handle this fairly common data structure. This new OLE DB provider seems to make it easier to deal with a number of common data access problems—but as we'll see, provider has a specific targeted audience—and that might not be you.

I've asked most of my classes and conference audiences if they use "hierarchical" data—most won’t admit to it—at least not in public. I suppose they think "hierarchy" is one of those fancy terms that only college professors use. While that might be true, I suspect that you use hierarchical data when you work with your own database. If you have queries that return customers and their orders, or books and their authors, or reference similarly related tables, then you use hierarchical data. Yep, you were right—it's a fancy term for one-to-many relationships. 

When I started researching the SHAPE provider, I discovered that there were many articles written about it, but none seemed to talk to my issues. That is, most of the examples use unbound queries against tiny databases—such as SELECT * FROM Authors in the sample Pubs database. I did find a useful article "Shape Up Your Data Using ADO 2.0 Hierarchical Recordsets in ASP" on the MSDN CD that built parameter queries, but this article discusses a Web-based application, and I usually focus on client/server Visual Basic applications.

While investigating the Shape provider, I kept the client/server and large database developer in mind. As I got deeper into the Shape functionality, it became clear that the provider was meant for smaller, table-based databases. Too many of its functional assumptions made it unsuitable for sizeable databases—or databases that can generate complex hierarchical result sets on their own. SQL Server, for one, can easily return a Recordset based on a single or a set of parent rows and all associated child rows—to any depth necessary—with a single call to the server. However, not all OLE DB or ODBC providers are capable of this level of sophistication. Simpler providers have no way to return these hierarchies. But, the Shape provider is more than just a way to return Parent/child hierarchies. It can also be used to compute more complex aggregate functions used when you create reports.

When I worked on Microsoft's Visual Studio Marketing team, I demonstrated the Shape provider any number of times. After about a minute of flashy clicking and dragging, I ended up with a three-table hierarchy that looked something like Figure 1. This is really pretty easy. Just drag the three tables from the Data View Window over to the Data Environment Designer window and set the "Relation" property (on the Properties page) to cross-reference the au_id and title_id fields. I dragged the parent object (the "authors" command in this case) over to a form and chose the MSHFlexGrid to display the rows. No, I'm not going to walk you through you this again. That's because I want to show you an easier way that requires a few lines of code but gives you a lot better performance and far more control over your results. If you want to see it done, come by the office or one of my VBits talks and I'll be happy to show you step-by-step.


Building a simple SHAPE

Let's get started with the basics by creating a Visual Basic application that uses the SHAPE provider to do something simple. We'll use Visual Basic 6.0 Enterprise Edition for this exercise—I make no guarantees on the suitability of the Professional Edition. I also have SP3 installed, which means I'm using ADO 2.1 GA whether I want to or not. I did not, however, set my references to the 2.1 library—that breaks the DataEnvironment Designer which it turns out I didn't really need. I accessed my local SQL Server 7.0 (no SP) and the Pubs database to experiment. I'm glad I did because some of the sample applications would have taken all afternoon to run if they had a lot of data (more than 20 rows) to query.

Note: Keep in mind that the SHAPE provider's default behavior uses two (or more) SELECT * queries (with no WHERE clauses)—so do most of the examples in MSDN. That is, the Visual Basic tools generate one open-ended query as the "Parent" command and one additional open-ended query for each "child" command. These are submitted to SQL Server as a single batch. This means that if you don't code otherwise, the SHAPE provider will return the entire contents of all 'n' tables to your client (or middle-tier server)—or at least try to. 

So, how do you create a simple SHAPE statement in the first place? Well, you can steal leverage it from the DataEnvironment Designer or copy it from my code shown below. I also won't go into the raw syntax of the SHAPE statement—there are a half-dozen MSDN articles for that. See "Shape Commands in General" in MSDN for starters. Frankly, there's quite a bit there that you won't need to know at all to run any of the examples I'm about to show you.

 

Start by setting up some General Declaration variables:

    Dim cnn As ADODB.Connection

    Dim rst As ADODB.Recordset

    Dim rstTitleAuthor As ADODB.Recordset

 

Figure 1: The DataEnvironment Designer with a sample hierarchy.

Next, open a connection to your favorite SQL Server—one that still has the sample Pubs database loaded. Note that you need to create an ADO Connection object and set it's Provider property to "MSDateShare" before you open it. You can use either the default ODBC to OLE DB data provider or a straight OLE DB provider for SQL Server. This means your Connection string should look just like it always does. Mine references a DSN, but you can use a DSN-less connection.

    Set cnn = New Connection

    cnn.Provider = "MSDataShape"

    cnn.Open "Data Provider=MSDASQL;" & _

               "DSN=biblio70;uid=sa;pwd=xxxx;database=pubs"

 

Tip: As I tell my kids, do as I say, not as I do. Don't use SA as your development password. It masks all kinds of issues that will bite you when you try to deploy your application.

 

Once we're connected, we can create a new Recordset and set the StayInSync property. We set this to false to prevent extra queries from being generated. I'll talk about this later.

 

    Set rst = New Recordset

    rst.StayInSync = False

 

In the next step you'll have to provide the SHAPE provider with a correctly constructed query. We pass this query to the provider through the Recordset Source property. Consider that the SHAPE provider does not really have much of a query processor. This does not mean you can't submit complex queries, as the SHAPE provider simply passes these on to the backend for processing.

In the first bracketed expression after the SHAPE statement, you need to provide an SQL SELECT statement (or a stored procedure) that returns the "parent" result set. In our example, we'll request all of the rows from the authors table.

Of course, selecting the entire authors table is not that clever. No one in their right mind would request all the columns and all of the rows from a production table—unless the table has as few rows as the Pubs demonstration database. We'll fix this later.

When the parent result set is created, the SHAPE provider returns it to your application just like any other Recordset, but with an extra Field tacked on to the end. This means if your table has eight columns, nine Field objects return for each row. This extra (last) Field.Value contains another Recordset—containing the rows for the child rows associated with the parent row.

Figure 2: Author "parent" rows and associated TitleAuthor "child" rows.

 

Figure 2 shows how each row returned from the Authors table contains an extra field pointing to the row(s) of the TitleAuthor table that correspond to this specific author.

The next line in the SQL query to be passed to the SHAPE provider should be an APPEND statement that tells the SHAPE provider what rows should be fetched from the child table (titleauthor in the example) for each parent row in the Authors table. Here's where we concatenate the "APPEND ({select * from titleauthor} " into our query. Yes, you can have more than one APPEND statement in your SHAPE query, but let's stick with one child relationship for now.

The last line in our SQL query is a RELATE statement that tells the SHAPE provider how to tie the two sets of rows together. In this case it's the "au_id" (author id) field in the two tables that have to match up. This means for each row in the authors table, the SHAPE provider will locate all rows in the TitleAuthor table having the same au_id and create a child Recordset with those rows. We name this relationship with the AS clause. This name is assigned to the final (extra) Field object on the parent's Recordset so we can refer to it by name. 

Yes, it's the SHAPE provider that matches these rows—it's not SQL Server, or Jet, or any other backend you happen to be using. This means that the SHAPE provider has to have these rows loaded into memory (your client system's RAM) to get this to work.

Here's the completed code to build the SHAPE statement. Next comes the code to create the parent Recordset object.

 

    rst.Source = "SHAPE  {select * from authors} " _

              & " APPEND ({select * from titleauthor}" _

              & " RELATE au_id TO au_id) AS chapTitleAuthor"

    '

    rst.Open , cnn

 

Once the Recordset returns, you can step through the rows in the usual manner. Keep in mind that you need to fetch the rows of the Recordset immediately. This frees up locks and permits other users to work with the data—especially since we used an open-ended query to build the parent result set.

For purposes of this example, I populated a TextBox control array with the values from the parent Recordset by binding the TextBox controls to an ADO Data Control (ADC), which provides updatability and navigation through the parent Recordset. The TextBox controls and ADC were not initialized at design time—we'll set the required properties entirely in code.

    '

    '   Bind the root Recordset to the text boxes

    '

    For i = 0 To rst.Fields.Count - 2

        txtFields(i).DataField = rst.Fields(i).Name

    Next I

We set the ADC's Recordset property to the parent Recordset returned by the SHAPE provider. This fills the TextBox controls and permits us to navigate up and down in the "Select * from authors" rowset.

   

    Set Adodc1.Recordset = rst

 

As we arrive at each row, we reference the child rowset by extracting the Recordset from the extra Field object tacked on to the end of the Fields returned for each parent row. Note that this Field object is named in the RELATE clause of the SHAPE statement.

 

    Set rstTitleAuthor = rst.Fields("chapTitleAuthor").Value

 

Note: Referencing a Field object by name as shown above is the slowest possible technique. It can be six-times slower than referencing the Field object by number. In a high-performance application, it would be better to use an ordinal reference such as rst.Fields(rst.Fields.Count-1).Value. This small change, multiplied times each Field object reference can make a big difference in overall performance.

For this example, I added a Microsoft Hierarchical FlexGrid (MSHFlexGrid) to the form to display the child rows. This is easy to do—you simply have to assign the child Recordset to the DataSource property. Once assigned, the grid fills with the current set of child rows.

Set MSHFlexGrid1.DataSource = rstTitleAuthor

 

Question: Couldn't we have just used the power of the MSHFlexGrid to display both the parent and child rows? Sure. But for now, let's take this a step at-a-time. I'll illustrate that technique later.

When we use the ADC to position from row to row, we have to remind the MSHFlexGrid about the Recordset to display. We do this in the MoveComplete event as shown below.

 

Private Sub Adodc1_MoveComplete(ByVal adReason As ADODB.EventReasonEnum, _

 ByVal pError As ADODB.Error, adStatus As ADODB.EventStatusEnum,  _

 ByVal pRecordset As ADODB.Recordset)

 

' Use the faster technique here.

        Set rstTitleAuthor = rst(rst.Fields.Count - 1).Value          

        Set MSHFlexGrid1.DataSource = rstTitleAuthor

End Sub

 

What we end up with (as shown in Figure 3) is a form that shows a single parent row at the top in a number of bound TextBox controls, with the child rows shown below in the MSHFlexGrid.

Figure 3: Using bond TextBox controls with the MSHFlexGrid

Why not use the DataEnvironment Designer?

We could have used the MSHFlexGrid to display the parents and all of the children. I didn't choose that approach for a variety of reasons:

*   The grid is a PIA (that's a technical term) to setup in code and since I didn't use the DataEnvironment Designer to create the parent-child relationships (I could have). This means I would have to wade into the grid's properties to set it up correctly. The grid properties don't seem to permit changing much at runtime anyway. For example, there's no (apparent) way to change the cell (grid column on a band) width at runtime (or design-time).

*   The help topics for the grid assume you'll use it with the DataEnvironment Designer and don't really help much to show how to set it up the way you can interactively through the grid's property page dialogs.

Note: I pressed F1 with the cursor focus on the MSHFlexGrid expecting MSDN to show me help topics about the grid. It didn't. MSDN complained, "The compiled help (.chm) file does not contain context IDs."  I guess that's because I'm using the most recent (July 1999) version of MSDN. Sigh. There's plenty of help on the grid—just do a "Search" for it in MSDN. I found 166 entries mentioning MSHFlexGrid.

*   Unless you tell it otherwise, the DataEnvironment Designer does not use the right (default) ADO connection or Recordset properties. This means you'll have to remind it to stop prompting for the user ID and password while opening.

*   The DataEnvironment Designer (unless you code otherwise) launches itself before Form_Load time. This means that if something goes wrong, your error handler won't be there to catch it. Deactivating the DataEnvironment Designer isn't easy—if you want the bound controls to auto-synchronize at design time.

*   The DataEnvironment Designer does not wait for parameters to be passed into the query you're running. Since we'll be running parameter-based queries (almost) exclusively, this simply won't work.

*   Even if we trap the WillExecute event and pass in first-time parameters, we will still have to manually rebind (in code) any simple bound controls (like TextBox controls) to the newly created DataEnvironment Designer Recordset. While this is not a problem with the MSHFlexGrid (it's a "complex" bound control), it is an issue for many of my designs.

 

Figure 4 Shape Example 2—an MSHFlexGrid populated from a DataEnvironment Designer generated query.

The DataEnvironment Designer approach seems tempting. It makes a flashy demo and what you get is a codeless component that shows all parent rows with each child's rows tucked neatly underneath. For example, Figure 4 illustrates what the MSHFlexGrid looks like after having been populated from a DataEnvironment Designer-based query. Notice that the individual rows are already expanded. If you use the CollapseAll method, the grid will only show the parent rows. Clicking on the "+" signs will expand the child rows. 

Granted, this is cool. And it did not take long to setup using the drag-drop techniques. But the impact it has on your system is substantial.

To get the flexibly, performance and scalability I'm sure you will demand, you will want a code-based approach.


 

Tip: If you insist on using the DataEnvironment Designer to build your relationships and bind to the MSHFlexGrid, be sure to reset the DataEnvironment Designer's Connection object's properties using its Visual Basic-based property page as in Figure 5. Set the Run (and Design) PromptBehavior properties to "4 – adPromptNever". This gives your code a chance to capture the login ID failures—instead of having ODBC (or OLE DB) expose login ID dialogs. These dialogs give the user an opportunity to re-guess their login ID and password—over and over again—until they get it right—something I prefer to avoid.

Figure 5: The DataEnvironment Designer Connection (Visual Basic) property page.


Analyzing the impact

Ok, so let's not get wrapped around the axel with the DataEnvironment Designer and the MSHFlexGrid. There are many other ways to display and manage the data returned from the SHAPE provider. We'll examine these a little later.

Before we get started, let's see what impact the default (demo) program has on the server. We'll then compare that to the damage done by the first example. Next, we'll try to see what can be done to minimize that damage.

Remember that our initial query didn't use the DataEnvironment Designer—it used the same open-ended queries to return the rows—two "SELECT * FROM <table>" queries as generated by the DataEnvironment Designer. Using SQL Server 7.0's Profiler, we can see what impact these queries have on SQL Server. The Profiler displays the raw Transact SQL (TSQL) queries and other operations (such as opening connections).

Note: For some reason, when I was developing the data source-based samples (using the ADC and the DataEnvironment Designer), I noticed that the server eventually started returning connection errors for no apparent reason. The Profiler told me that ADO, the ADC or Visual Basic had not bothered to close and release the connection(s) when I pressed the Stop button in the Visual Basic IDE or even ended the code normally. I tried to add code to explicitly close the Recordsets and Connection objects to no avail. I had to exit Visual Basic completely to get the Connections closed. Folks, this is the same problem I reported back when I wrote my first book on VBSQL—seven years ago.

Let's take a look at the Profiler output for (part of) the example runs.

+RPC:Starting

       sp_prepare @P1 output, NULL, N'

select * from titleauthor', 1

select @P1   15:36:38.753

+SQL:BatchStarting 

select * from authors;select * from titleauthor      15:36:38.763

 

The Profiler tells us that ADO told SQL Server to "prepare" a query plan for the titleauthor query and to then execute the two open-ended SELECT queries against authors and titleauthor. That's it. The SHAPE provider accepts all of the rows from both SELECT queries and passes the data up to the application to process. ADO makes no further trips to the server. This means that the SHAPE provider considers this data static—it does not attempt to get fresh data when it positions from row to row.

It turns out that the queries generated by the DataEnvironment Designer using the drag-and-drop method are identical to those generated by our hard-coded example1 code shown earlier.

Since the two tables are small, (Authors has about 22 rows, titleauthor only about 25 rows) only about 50 rows are fetched—no big deal. So what if there were 16,000 authors and 50,000 titleauthor rows? Obviously, this approach would not work—or at least not very quickly. In addition, since these are open-ended queries, the negative impact on scalability would be easily apparent.

Finding a better way

Keeping this in mind, I wanted to find a way to get just the rows I wanted, but still let the SHAPE provider do it's thing—providing the one-to-many relationship management in memory and still provide updatability for those (albeit rare) cases where I want to update through a cursor.

First, it was clear that I needed to limit the rows fetched by the initial "parent" query. Once this is done, we can selectively fetch the child rows "on demand"—or can we? Well, it turns out that you can, but you have to use another feature of the SHAPE provider that we'll talk about later. So, limiting the initial parent query seems easy enough—just provide a WHERE clause. Well, we also don't want to hard-code a WHERE clause (in most cases), so I tried to create an ADO Command object to generate the SHAPE Recordsets. Forget it, it didn't work. I did get the old-reliable "concatenate-on-the-fly" technique to work. By limiting the authors query to return only a single author I forced ADO to re-fetch each time we want to see another author. Sure, you could setup a query that returned only a selected group of parent rows—say, just the authors from California.

However, the child query is still returning all of the rows from the titleauthor table. I figured the SHAPE provider must have a way to query the child rows only as needed.

Note: The fetch-all-the-rows-at-once approach might be required for situations where you can't depend on a persistent connection—as with Web pages or middle-tier components. But this does not mean you must fall back on the open-ended queries. There are still ways to write your queries that only fetch "just enough" data.

Well, it does. If you re-phrase your SHAPE statement to pass a parameter from the parent query to the child query, ADO and the SHAPE provider change its strategy to fetch the child rows on demand. Here's how the parameter-based syntax looks:

    rst.Source = "SHAPE  {select au_id, au_lname, au_fname, city, state, zip" _

                    & " From authors where state ='" & txtStateWanted & "'} " _

                   & " APPEND ({select * from titleauthor where au_id = ?}" _

                   & " RELATE au_id TO Parameter 0) AS chapTitleAuthor"

 

I capture the state wanted in a TextBox control (see Example3) and concatenate it into the query when the user presses the "Search" button. Note that the code ties the au_id in the parent query with a "?" in the child query in the RELATE clause.

Can you simply execute an stored procedure instead of the SELECT statement(s) in the parent and child queries? Sure. See Example4 for a working example.

   rst.Source = "SHAPE {execute GetAuthorsByState '" & txtStateWanted & "'} " _

                   & " APPEND ({select * from titleauthor where au_id = ?}" _

                   & " RELATE au_id TO Parameter 0) AS chapTitleAuthor"

The impact of the "just-enough" approach

Let's take a look again at what the Profiler reports about the parameter-based queries. I discovered something interesting here that probably will surprise you as much as it did me.

The TSQL statements being sent make a lot of sense—or they seem to at first.

First, ADO asks SQL Server to return (just) metadata for the titleauthor query.

+SQL:BatchStarting 

SET FMTONLY ON

select * from titleauthor

SET FMTONLY OFF     17:02:13.917

Next, ADO performs the new parameter-based query using the concatenated parameter we supplied in code (from the TextBox on the form). This returns an initial rowset with just the "California" authors.

+SQL:BatchStarting 

select au_id,au_lname, au_fname, phone, Address, city,

state, zip from authors where state ='CA'      17:02:13.927

 

Next, ADO makes a round-trip to find out what the metadata looks like for the key-based query against the titleauthor table. We're up to three round trips so far.

+SQL:BatchStarting 

SET FMTONLY ON

select  au_id from titleauthor

SET FMTONLY OFF     17:02:13.977

 

At this point ADO creates a temporary stored procedure used to fetch columns from the titleauthor table. This query fires whenever a specific authors table row is selected. Since we select a whole series of author rows we can expect this to be executed quite a few times—once for each row in the parent Recordset. Note that the final SELECT in this batch is simply returning the parameter @P1 from the sp_prepare statement.

+RPC:Starting

sp_prepare @P1 output, N'@P1 varchar(11)', N'

select * from titleauthor where au_id = @P1', 1

select @P1   17:02:13.977

 

The final TSQL code reported by the profiler is the fetch (finally!) for our first parent row. That is, the SHAPE provider knows that the au_id for the first row is '172-32-1176' so it queries the server for any titles matching this author id. 

+RPC:Starting

sp_execute 1, '172-32-1176'      17:02:13.987

 

Well, we're not done yet. At this point we have only positioned to the first row of the parent Recordset and fetched the first set of child (titleauthor) rows. As we fetch additional rows we would expect ADO and the SHAPE provider to re-use temporary stored procedure #1 (as created above) to fetch additional rows from the titleauthor table. Sadly, this is not the case. As we can see in the Profiler dump, the SHAPE provider tells SQL Server to drop (un-prepare) the preceding temporary stored procedure and create another—just like it. Sigh. It then executes the new temporary stored procedure to return the next set of child rows.

+RPC:Starting

       sp_unprepare 1      17:21:01.367

+RPC:Starting

       sp_prepare @P1 output, N'@P1 varchar(11)', N'

select * from titleauthor where au_id = @P1', 1

select @P1   17:21:01.367

+RPC:Starting

       sp_execute 2, '213-46-8915'      17:21:01.377

 

Now this was a surprise to me. I would have expected the SHAPE provider to be able to at least leverage the work done to create the child-row queries already created. As you can see, this "just-enough" approach has its problems too. It requires several more round-trips per row than the "all-at-once" approach—which, frankly, looks like it has quite a bit of merit at this point. The "just-enough" approach does fetch current data for each child row, but depends on a static set of parent rows to drive the secondary query.

You can see how adding additional children to the hierarchy might be interesting. For example, customer-order-item queries would be fairly common. However, you can also see how any additional complexity in the hierarchy would be translated exponentially in increased overhead.

The rest of the SHAPE story

This brief article has given you a way to get started with the SHAPE syntax and to be able to create applications that can both scale and perform reasonably well. There is a lot more to the SHAPE syntax and I hope you will have less trouble than I did trying to figure out how to use it.