Here's an idea I've been kicking around. Let's see what folks think of it. I'm only covering the non-Java-programmer crowd here. Programmatic access via EJB/Component is and should be done through Flow IMHO. For others, especially those with more publishing-oriented and/or simple in-out requirements, this is my proposal.

---------------------------------------------------------------------

Have a mapping like XML-DBMS's, but not via a RelaxNG/XMLSchema wannabe. No offense intended toward the XML-DBMS folks, but the task didn't seem all that inviting to me. I've seen similar approaches via DTD annotation.

This aversion is on multiple levels. First and foremost, the mapping file is yet another file to create and maintain. Sure, it's not Java, but the layer-specific syntax might as well be another programming language. "Just" XML is a misnomer. After all, is XSLT just another document schema or is it a programming language? (Didn't someone find it to be Turing-complete some time back?) I also have concerns over complexity which I will cover later.

Most relevant to me, I am tired of doing the same old thing over and over again. This is what computers are for: to take the burden of repetitive, tedious tasks from us and to do those tasks faster than we can. And in my book that includes writing yet another set of SQL statements. To be perfectly honest, I do not consider knowledge of SQL syntax to be a significant enhancement to my life. Don't get me wrong. It has come in very handy over the years, but it is simply an obscure (albeit relatively simple) syntax for referencing an ephemeral data organization pattern. I might say the same about XML were it not for the fact that XML is just the data structure without much in the way of syntax; XML is a serialization and processing format and serves that function far better than any alternative I can dream up.

This brings me to my point. A relational database is a largely interrelated data structure. XML is a largely interrelated data structure. SQL is simply the convenient middleman. Others have seen this, as evidenced by the number of relational-to-XML tools out there. But where those tools strive desperately to solve every issue that may come up with data structure synchronization, I just wanted my common cases to go away. If more than ninety percent of my cases can be solved with a simple syntax and fewer than ten percent require more difficult roads, I would be satisfied. Let's face it, we've all been confronted with the ninety percent case more times than we care to speak about. It is the corner case that intrigues us so much that we enjoy spending time solving it. The less time the common case takes, the more time we can spend in enjoyable problem solving.

I'm tired of spending time on getting an integer out of a result set and explicitly converting it to a string representation of the hex value. Frankly, I'm tired of having to remind the computer that I'm working with an integer. My database knows what type the field is. I'm the one who told it when setting up the database schema. My application server knows where my database is. I'm the one who specified the datasource. I don't see the need to tell my servers over and over again when they have things called "memory" and "processing power."

So here's my proposal for a syntax that's as simple as possible but no simpler.

------------------------------------------------

Step 1: Let's assume we have a table called "books" in a database with the following structure and information. To make the example as simple as possible, we'll keep the number of items in the table small. It should be trivial to imagine a larger dataset.

Table "books"
 id [serial]
 isbn [integer]
 title [varchar(100)]
 subtitle [varchar(100)]
 author [varchar(50)]
 pubdate [date]

------------------------------
1
764543555
Professional XML Development with Apache Tools
Xerces, Xalan, FOP, Cocoon, Axis, Xindice
Theodore W. Leung
1 Dec 2000
------------------------------
2
735712352
Cocoon
Building XML Applications
Carsten Ziegeler, Matthew Langham
24 July 2002
------------------------------
3
672322579
Cocoon Developer's Handbook
NULL
Lajos Moczar, Jeremy Aston
10 Dec 2002
------------------------------

------------------------------------------------

Step 2: Create the Cocoon (or EJB?) datasource that points to the above. In this case, we'll call it "library."

------------------------------------------------

Step 3: Create a simple XML file that reflects the structure and what we want the output to look like.

<?xml version="1.0" encoding="ISO-8859-15"?>
<BookList>
 <Book>
   <ISBN/>
   <Title/>
   <Subtitle/>
   <Author/>
   <Published/>
 </Book>
</BookList>

------------------------------------------------

Step 4: Add a few "extra" attributes to that XML file. Adding the "table" and "datasource" attributes signals a context point. All results from this query will be relative to this XML element.

<?xml version="1.0" encoding="ISO-8859-15"?>
<BookList>
 <Book xmlns:ds="http://geekspeak.org/datasource/1.0"
       ds:datasource="library"
       ds:table="books">
   <ISBN ds:select="isbn:0000000000"/>
   <Title ds:select="title"/>
   <Subtitle ds:select="subtitle"/>
   <Author ds:select="author"/>
   <Published ds:select="pubdate:yyyy-MM-dd"/>
 </Book>
</BookList>

------------------------------------------------

The output XML after transformation would be the following.

<?xml version="1.0" encoding="ISO-8859-15"?>
<BookList>
 <Book>
   <ISBN>0764543555</ISBN>
   <Title>Professional XML Development with Apache Tools</Title>
   <Subtitle>Xerces, Xalan, FOP, Cocoon, Axis, Xindice</Subtitle>
   <Author>Theodore W. Leung</Author>
   <Published>2000-12-01</Published>
 </Book>
 <Book>
   <ISBN>0735712352</ISBN>
   <Title>Cocoon</Title>
   <Subtitle>Building XML Applications</Subtitle>
   <Author>Carsten Ziegeler, Matthew Langham</Author>
   <Published>2002-07-24</Published>
 </Book>
 <Book>
   <ISBN>0672322579</ISBN>
   <Title>Cocoon Developer's Handbook</Title>
   <Author>Lajos Moczar, Jeremy Aston</Author>
   <Published>2002-12-10</Published>
 </Book>
</BookList>

The datasource namespace is stripped during transformation. Note that for the ISBN, we simply use (Java) standard number formatting syntax. How does it know that it's an integer? Because the database told it so. The same is true for how it knows that the <Published> element is a date datatype: the database has defined the corresponding field to be a date. All we need to do at this point is tell it what format we want the date returned in.
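
To make that concrete, here's roughly what I picture the transformer doing, a sketch only (the class and method names are made up; ResultSetMetaData and the java.text formatters are standard 1.4 fare):

import java.sql.*;
import java.text.DecimalFormat;
import java.text.SimpleDateFormat;

public class ColumnFormatter { // hypothetical helper, not the real code
    // Format one column of the current row using the type the database
    // reports and the pattern given after the ':' in ds:select.
    static String format(ResultSet rs, int col, String pattern) throws SQLException {
        ResultSetMetaData meta = rs.getMetaData();
        switch (meta.getColumnType(col)) {
            case Types.SMALLINT:
            case Types.INTEGER:
            case Types.BIGINT:
                // e.g. "0000000000" pads the ISBN out to ten digits
                return new DecimalFormat(pattern).format(rs.getLong(col));
            case Types.DATE:
            case Types.TIMESTAMP:
                // e.g. "yyyy-MM-dd"
                return new SimpleDateFormat(pattern).format(rs.getDate(col));
            default:
                return rs.getString(col);
        }
    }
}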

The third row doesn't have a <Subtitle> tag because the table column "subtitle" was NULL. How do we enforce that the tag always exists? Your datasource's data model decides that. If it shouldn't be missing/NULL, the table column should be "NOT NULL" or have some other similar constraint. The resultant XML will always be as strict/loose as your underlying data model.

------------------------------------------------

This obviously mucks with the idea of a logicsheet (this isn't XSP generator input). It fits well with Stefano's notion of virtual generators, though: a file generator followed by a datasource transformer could be wrapped up as a virtual generator named "datasource" (http://marc.theaimsgroup.com/?l=xml-cocoon-dev&m=106131747919501&w=2). In fact, virtual sitemap components could obviate the need for many logicsheets, but that is another (probably very controversial) topic to be covered later.

This is actually more of a move toward "compiler vs. assembly": make the syntax as simple as possible for the author and leave the heavy lifting for the computer to handle. In other words, computer processing time is cheap; developer time is expensive. The most important goal for me is ease of authorship, comprehension, and maintainability. Authorship is covered by having the output document be as close to the mapping file as possible. Comprehension is handled (as with logicsheets) through the use of namespaces -- making separate portions as distinct from one another as possible. Maintainability comes from having the underlying database handle all datatype issues and keeping everything in a single mapping file -- fewer possibilities for stale description files.

With a mapping/template file, you could have more complex structures such as the following for a radio show I work on:

<?xml version="1.0" encoding="ISO-8859-1"?>
<!-- main table reference -->
<show xmlns:ds="http://geekspeak.org/datasource/1.0"
      ds:table="shows"
      ds:constraint="livedate = {#date}">
  <showinfo>
    <topic ds:select="topic"/>
    <livedate ds:select="livedate"/>
    <participationgroup>
      <!-- nested table reference -->
      <participant ds:table="showparticipants"
                   ds:constraint="showid = {id}"
                   username="{username}">
        <ds:value-of select="info:xml"/>
        <role ds:select="role"/>
      </participant>
    </participationgroup>
    <abstract ds:select="abstract:xml"/>
  </showinfo>
  <ds:value-of select="content:xml"/>
</show>


Here the model is the same as XSLT's attribute value templates, with '{' and '}' as value delimiters. Also similar to XSLT is the <value-of> element, albeit in a different namespace. In the main table declaration, we have the placeholder "{#date}". This is meant to refer to a parameter passed from the sitemap but could just as easily be removed in favor of a variable replacement further up the sitemap processing chain (in a dedicated generator/transformer or XSP). The nested table reference would take the "{id}" value from the parent query context. Presumably this nested query would be translated into a PreparedStatement by the transformer, with the parameter updated on each iteration, but this is an implementation detail. I'm talking about designing the API and covering use cases here.
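
For the curious, that implementation detail could boil down to something like this sketch (nothing more than plain JDBC; the query text and column names are just lifted from the example above):

import java.sql.*;

public class NestedQuerySketch { // purely illustrative
    // One nested statement per context point, prepared once and re-executed
    // with the parent row's "id" on each iteration.
    static void emitParticipants(Connection conn, ResultSet shows) throws SQLException {
        PreparedStatement nested = conn.prepareStatement(
            "SELECT username, info, role FROM showparticipants WHERE showid = ?");
        while (shows.next()) {
            nested.setInt(1, shows.getInt("id")); // value from the parent query context
            ResultSet participants = nested.executeQuery();
            while (participants.next()) {
                // ... emit one <participant> element per row ...
            }
            participants.close();
        }
        nested.close();
    }
}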

The item:

<abstract ds:select="abstract:xml"/>

is meant to be an extraction of XML from the specified (text-based) column with the root being the "abstract" element. Unlike with XInclude or Cocoon's CInclude, I see it as a document fragment rather than a complete document on its own; in other words, while each piece must be well-formed, it could have multiple root-level elements since it's being added to a larger document with an already-defined root element.
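
One way to picture it, a DOM-based sketch purely for illustration (a real transformer would more likely stream the fragment as SAX events; the class name is made up):

import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class FragmentSketch { // illustrative only
    // Wrap the column text in the target element so that several top-level
    // elements in the stored fragment still parse as one document.
    static Element parseAbstract(String columnText) throws Exception {
        String wrapped = "<abstract>" + columnText + "</abstract>";
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(wrapped)))
                .getDocumentElement();
    }
}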

------------------------------------------------

Caching becomes an interesting aspect. Cache keys would simply be a concatenation of sitemap parameters. Geoff Howard's event-based cache is one option for validity. Another is to check the value of a particular field (which gets updated by a trigger, for example). Even in nested table mapping files, I'd suggest just one cache trigger, or you could end up with validity objects that are in fact aggregate validities of every joined value.
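
The key side of it could be as dumb as this sketch (names made up; just the mapping file plus whatever sitemap parameters it depends on):

import java.util.Map;

public class CacheKeySketch { // made-up name
    // The cache key is the mapping file plus the sitemap parameters it
    // depends on, concatenated in a fixed order.
    static String buildKey(String mappingFile, Map sitemapParams, String[] names) {
        StringBuffer key = new StringBuffer(mappingFile);
        for (int i = 0; i < names.length; i++) {
            key.append(';').append(names[i]).append('=').append(sitemapParams.get(names[i]));
        }
        return key.toString();
    }
}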

Constructs such as the following would allow for flexible datasource caching strategies.

<?xml version="1.0" encoding="ISO-8859-1"?>
<show xmlns="http://geekspeak.org/2.0"
      xmlns:ds="http://geekspeak.org/datasource/1.0"
      ds:table="shows"
      ds:constraint="livedate = {#date}">
  <showinfo>
    <topic ds:select="topic"/>
    <livedate ds:select="livedate"/>

    <!-- ### Include a piece from elsewhere with its own caching (or lack thereof) ### -->
    <include href="show{id}_participants.xml" xmlns="http://www.w3.org/2001/XInclude"/>

    <abstract ds:select="abstract:xml"/>
  </showinfo>
  <ds:value-of select="content:xml"/>
</show>



<?xml version="1.0" encoding="ISO-8859-1"?>
<!-- new main table reference -->
<participationgroup xmlns:ds="http://geekspeak.org/datasource/1.0">
  <participant ds:table="showparticipants"
               ds:constraint="showid = {#showid}"
               username="{username}">
    <ds:value-of select="info:xml"/>
    <role ds:select="role"/>
  </participant>
</participationgroup>




------------------------------------------------

What about insert/update/delete? This is simultaneously easier and harder. Easier because the exact same mapping file can be used as an input map as well.

<map:transform type="datasourceinsert" src="mappingfile.xml"/>

Is this less flexible than mapping syntaxes based on or similar to DTD, RelaxNG, or XMLSchema? Yes. But it is far easier to read and understand. Also, that flexibility only really shows up with document-oriented sections of the input. Data-oriented portions are usually tightly locked to the XML schema. In this case, document-oriented portions are simply treated as database text items and thrown in wholesale. Validation can be performed on the source document to check correctness, but I haven't yet come across a concrete reason to address the issue in the datasource interchange layer. Perhaps the relatively strict constraints on the input could be more boon than bane?

Extracting the data from the source via rules from the map should be pretty straightforward. Most items will simply be "ds:select" lookups and the attribute placeholder syntax lends itself well to regexes.

<MyElement myattr="a {foo} deserves a {bar}"/>

"myattr" transforms easily by converting the "{...}" patterns to "(.*)" as in the following regex pattern for input processing

/^a (.*) deserves a (.*)$/
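
In Java regex terms, the conversion might look like this sketch (java.util.regex, so again 1.4; the class and method names are made up):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PlaceholderSketch { // made-up name
    // Turn "a {foo} deserves a {bar}" into /^a (.*) deserves a (.*)$/ and pull
    // the values back out of an incoming attribute. (A real version would also
    // quote any regex metacharacters in the literal text.)
    static String[] extract(String template, String attributeValue) {
        String regex = "^" + template.replaceAll("\\{[^}]*\\}", "(.*)") + "$";
        Matcher m = Pattern.compile(regex).matcher(attributeValue);
        if (!m.matches()) return null;
        String[] values = new String[m.groupCount()];
        for (int i = 0; i < values.length; i++) {
            values[i] = m.group(i + 1);
        }
        return values;
    }
}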

So when is it harder? When it isn't clear exactly what you want. Should it be a database update? Should it be an insert? Should it be a sync (checking for an existing entry, inserting if missing, and updating if present)? With this mapping syntax, only one choice is possible at a time; there is no explicit sequence of updates followed by some inserts like with the SQLTransformer. It can do multiple operations (in the sync case), but that means more processing time and may require table locks. In most web environments, data is viewed far more often than updated, so perhaps this isn't a real performance issue. The 80-20 rule may dictate that inserts/updates are not in the 20% of code that takes up 80% of the time/resources. Your mileage may vary. Personally, I usually do one type of operation at a time, but I am not everyone. Anyone have ideas or thoughts about this one?
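
For what it's worth, the sync case could translate into something like the following sketch (table and column names borrowed from the books example; not necessarily how it would actually be implemented):

import java.sql.*;

public class SyncSketch { // illustrative only
    // "Sync" as update-then-insert: update the row if it exists, insert it
    // otherwise. A real implementation would wrap this in a transaction to
    // avoid racing a concurrent insert, and would build the SQL from the map.
    static void syncBook(Connection conn, int isbn, String title) throws SQLException {
        PreparedStatement update = conn.prepareStatement(
            "UPDATE books SET title = ? WHERE isbn = ?");
        update.setString(1, title);
        update.setInt(2, isbn);
        int changed = update.executeUpdate();
        update.close();
        if (changed == 0) {
            PreparedStatement insert = conn.prepareStatement(
                "INSERT INTO books (isbn, title) VALUES (?, ?)");
            insert.setInt(1, isbn);
            insert.setString(2, title);
            insert.executeUpdate();
            insert.close();
        }
    }
}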

------------------------------------------------

In short, there is strong reliance on the database for processing. How does it work? I don't know. (Well, I *do* know how the code I've written works, but this is the hypothetical user "I".) I shouldn't know. In certain predictable cases, a join could be employed behind the scenes. SAX pass-throughs are possible until you hit a context point; even then, only the context need be saved in memory, not the whole document. Something akin to XSLTC could be written as well. But I'm getting way too far ahead of myself. The point is that making the syntax as clear and functional as possible matters more. Just as XSLT processors are much faster today than the initial attempts, this could be optimized as time and need present themselves.

Some caveats though.

While I have very clear ideas on how it could be implemented and have written about 1500 lines of Java for the RDBMS->XML transformer portion, it doesn't currently exist (unlike the XSP logicsheet and SQLTransformer). I'm getting slammed in school and paid work takes precedence as well, so I can't guarantee when I can get it working/debugged. I have every intention of donating the code if/when finished, so that's not the issue. I also wouldn't mind giving what I have to folks who request it so that they can tinker with it (or rewrite it completely after badmouthing my coding skills... ;-)).

And finally, it requires Java 1.4. This is because of the PreparedStatements and parameter passing: ParameterMetaData is needed to see what each parameter's datatype is, and that class is only available with JDBC 3.0, hence the dependency on 1.4.
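
Concretely, this is the JDBC 3.0 call I need when binding a placeholder value without being told its type (a sketch with a hypothetical helper class around it):

import java.sql.*;

public class ParameterBinder { // hypothetical helper
    // ParameterMetaData (JDBC 3.0 / JDK 1.4) tells us what type the database
    // expects for each '?', so the mapping file never has to say "this is an int".
    static void bind(PreparedStatement stmt, int index, String rawValue) throws SQLException {
        switch (stmt.getParameterMetaData().getParameterType(index)) {
            case Types.SMALLINT:
            case Types.INTEGER:
            case Types.BIGINT:
                stmt.setLong(index, Long.parseLong(rawValue));
                break;
            case Types.DATE:
                stmt.setDate(index, java.sql.Date.valueOf(rawValue)); // expects "yyyy-mm-dd"
                break;
            default:
                stmt.setString(index, rawValue);
        }
    }
}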

Bottom line: I'm going to use it when finished unless someone in the Cocoon community comes up with something that blows my socks off (it's happened before). The real question is whether others want it too.

Do you?

- Miles



Reinhard Poetz wrote:

After using Cocoon for years I still think we are not very clear in
telling our users what is the right tool for their needs. Of course this
depends on their use cases and their skills.

I see three different "Cocoon & databases" scenarios:

1. Use Cocoon in enterprise scenarios which are highly interactive (--> many forms) + data which has to be saved in databases

   --> use CocoonForms, controlled by the Flow, a business facade written in Java, O/R-mapping as persistence layer

   --> or use a tool like XML-DBMS (see mail from Daniel) as the CocoonForms - database binding tool.

2. Use Cocoon as 'pure' publishing engine

   --> use ESQL or the SQL Transformer (or sooner or later USQL ...)

3. Use Cocoon 'mainly' for publishing, where there are only a few interactive pages that have to be persisted in a database and the user is not familiar with Java (this is for many people a reason to use Cocoon, because they can use XML everywhere and don't need to learn a programming language)

As you can see, scenarios 1 and 2 have clear solutions; scenario 3 doesn't. Some use ESQL with the drawback of mixing concerns, some use the SQLTransformer which doesn't make it easy to build create/update/delete applications, and some use DatabaseActions which are IMO the simplest way but are not supported by flow and are hidden in the depths of Cocoon (IMHO).


IMO we should make the DatabaseActions more general components (... shouldn't be too difficult) which can be used from within flow or from actions, and we should make clear statements about what we as developers consider 'best practices' and what the pros and cons of each technology are.

WDYT?

--
Reinhard




