Serious Enquiries with XQuery

by Kavindra Devi Palaraja

Those of you who follow the progress of Qt development will have noticed the recent introduction of the QtXmlPatterns module. This module enables the use of XQuery in Qt applications, which can be used in various ways to select, extract and aggregate data in XML or similarly-structured files.

In order to give you a taste of what QtXmlPatterns has to offer, we will implement a simple Qt-based Web robot which can be used to perform some basic XQuery queries such as listing elements and checking links in HTML documents. We will make a user-friendly interface to provide a clear view of the input website, the query executed, and the output generated. However, before we start building our robot, let's look at XQuery in a bit more detail.

About XQuery

At the simplest level, XQuery is a language for querying XML documents for the purposes of finding and manipulating the information stored within them. Just as SQL is used to interact with relational databases, XQuery is the upcoming language of choice for handling XML from online sources.

XQuery is a functional, statically-typed language that is part of a family of Web technologies concerned with processing XML data. There are more than 100 built-in functions provided with XQuery and various ways to use them. The in-depth usage of XQuery is beyond the scope of this article; instead we will outline the basics of using the QtXmlPatterns module. We will look at two kinds of queries:

  • Path expressions select parts of a document. Each one consists of a sequence of steps separated by path separators (/).
  • FLWORs are used to iterate through a document and manipulate the data within. The acronym, pronounced "flowers", stands for: for, let, where, order by, return.

The Web robot will execute four queries; each query will be explained in further detail in the coming sections.

Designing the Web Robot

Typically, Web robots analyze websites, carrying out routine tasks such as checking links and collecting statistics. This is done based on the permissions specified in the website's robots.txt file. My colleague, Frans Englich, suggested another type of Web robot to illustrate the use of XQuery, which simply selects one page in a website and analyzes its content with a few queries.

The User Interface

Qt Designer is used to design the Web robot's user interface, which consists of three sections represented by QGroupBox widgets:

  • The input section: We use a read-only QLineEdit to display the website's address and a QWebView to display the website itself. Between these two widgets, we place a vertical spacer. The widgets and the spacer are then laid out in a QGridLayout.
  • The query section: A QTextEdit is used to display the query being executed. This query is selected using one of the four QPushButton objects below the text edit. The push buttons are laid out horizontally in a QHBoxLayout. We then lay out the text edit and the buttons' layout, vertically, in a QVBoxLayout.
  • The output section: We use another QTextEdit to display the output. A layout object is not required here as the position of the text edit will be managed by the group box.

The user interface in Qt Designer

The Webrobot GUI

Implementing the Web Robot

Now that we have our user interface designed and saved in a Qt Designer .ui file, we can move on to implementing the Web robot.

The Resource File

We require a Qt resource file (.qrc) in which we can embed the .ui file and the XQuery queries (.xq). The queries are kept in separate files and are loaded at run-time. The contents of the resource file are shown below.

    <!DOCTYPE RCC><RCC version="1.0">
    <qresource>
        <file>forms/robot.ui</file>
        <file>queries/query1.xq</file>
        <file>queries/query2.xq</file>
        <file>queries/query3.xq</file>
        <file>queries/query4.xq</file>
    </qresource>
    </RCC>

The MainWindow Class

Next, we create a MainWindow class to hold our widgets. This class is a subclass of QMainWindow, containing the definitions necessary to create a front end for our Web robot.

    class MainWindow : public QMainWindow
    {
        Q_OBJECT
 
    public:
        MainWindow();
 
    public slots:
        void evaluate(const QString &str);

The class has a constructor and a slot, evaluate(), which evaluates a query read from a file.

There are various ways to embed a .ui file into a program, but we will use the UiTools module. So, the private section of our MainWindow class consists of pointers to all the widgets used in the user interface, a central widget for the MainWindow, robotWidget, and a QSignalMapper object, signalMapper. We also have a private function, loadUiFile(), to load the .ui file mentioned earlier.

    private:
        QLineEdit* ui_websiteLineEdit;
        QPushButton* ui_queryButton1;
        ...
        QWidget* robotWidget;
        QSignalMapper *signalMapper;
 
        QWidget* loadUiFile();
    };

Let's look at the implementation of the constructor:

    MainWindow::MainWindow()
    {
        robotWidget = loadUiFile();
        ui_websiteLineEdit = qFindChild<QLineEdit*>(this,
            "websiteLineEdit");
        ui_websiteViewer = qFindChild<QWebView*>(this,
            "websiteViewer");
        ui_queryTextEdit = qFindChild<QTextEdit*>(this,
            "queryTextEdit");
        ...

The user interface is loaded into robotWidget. Then the qFindChild() function, the non-member counterpart of QObject::findChild(), is used to access its widgets by object name.

The QSignalMapper class is used to collect a group of signals and re-emit them with parameters that correspond to the object that sent the signal. This class is useful when you would like to connect many signals to one slot but still identify which object emitted the signal and process it accordingly.

In our case, we want to map four push buttons to one evaluate function; hence, we connect ui_queryButton1's clicked() signal to signalMapper's map() slot. This is done for all four push buttons. You can picture signalMapper as the middleman who ensures that the right query is executed when a button is pressed.

        signalMapper = new QSignalMapper(this);
        connect(ui_queryButton1, SIGNAL(clicked()),
            signalMapper, SLOT(map()));
        ...
        signalMapper->setMapping(ui_queryButton1,
            QString(":queries/") + "query1.xq");
        ...
        connect(signalMapper, SIGNAL(mapped(const QString &)),
            this, SLOT(evaluate(const QString &)));

Then, we invoke the setMapping() function to provide each push button with its own string parameter. This parameter determines which query file will be loaded for a particular push button. Again, this is done for all four push buttons. Lastly, we connect signalMapper's mapped() signal to our evaluate() function.
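
To make the middleman's role concrete, here is a minimal sketch of the same idea in standard C++, without Qt. It is not how QSignalMapper is implemented; StringMapper, connectMapped() and the integer sender ids are hypothetical names chosen for illustration, standing in for the mapper, the connection to mapped(), and the button pointers.

```cpp
#include <functional>
#include <map>
#include <string>
#include <utility>

// Hypothetical stand-in for QSignalMapper, written without Qt.
// Senders are identified by an integer id instead of a QObject pointer.
class StringMapper
{
public:
    using Slot = std::function<void(const std::string &)>;

    // Counterpart of setMapping(): associate a sender with its string.
    void setMapping(int senderId, std::string value)
    {
        mappings[senderId] = std::move(value);
    }

    // Counterpart of connecting the mapped() signal to a slot.
    void connectMapped(Slot slot)
    {
        mapped = std::move(slot);
    }

    // Counterpart of the map() slot: look up the sender's string and
    // forward it to the connected slot. Unknown senders are ignored.
    void map(int senderId) const
    {
        const auto it = mappings.find(senderId);
        if (it != mappings.end() && mapped)
            mapped(it->second);
    }

private:
    std::map<int, std::string> mappings;
    Slot mapped;
};
```

Each button only ever calls map(); which query file reaches the slot is decided entirely by the table filled in with setMapping().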

        connect(ui_websiteViewer,
            SIGNAL(urlChanged(const QUrl &)),
            this, SLOT(updateLocation(const QUrl &)));
 
        ui_websiteViewer->setUrl(
            QUrl("http://doc.trolltech.com/qq/"));
 
        setCentralWidget(robotWidget);
        setWindowTitle(tr("XQuery Web Robot"));
 
        evaluate(":queries/query1.xq");
    }

When the Web robot is run, we intend to display the page on which we want to run our queries. So, we set the ui_websiteViewer's URL to the address for the Qt Quarterly website. Finally, we execute our first query with the help of the evaluate() function, displaying the output in the output viewer.

If you use Qt's UiTools module to load a .ui file, you need to instantiate the QUiLoader class and invoke its load() function. The function below illustrates this.

    QWidget* MainWindow::loadUiFile()
    {
        QUiLoader loader;
        QFile file(":/forms/robot.ui");
        file.open(QFile::ReadOnly);
        QWidget *formWidget = loader.load(&file, this);
        file.close();
        return formWidget;
    }

The next function in our MainWindow class is the evaluate() function which takes a QString parameter. It is within this function that we process our queries.

    void MainWindow::evaluate(const QString &fileName)
    {
        QFile queryFile(fileName);
        queryFile.open(QIODevice::ReadOnly);
 
        QString queryString =
            QTextStream(&queryFile).readAll();
 
        ui_queryTextEdit->setPlainText(queryString);

We begin by reading the file named by fileName, which contains a query. Then we display the query text in our query viewer.

To evaluate a query on the contents of the document, we create a QXmlQuery object, bind the inputDocument variable to the document's URL, and call setQuery() to set our query to the text we read from the query file.

        QXmlQuery query;
 
        query.bindVariable("inputDocument",
                           QVariant(ui_websiteViewer->url()));
        query.setQuery(queryString, ui_websiteViewer->url());

Once we have bound the variable and set the query, we check whether the query is valid. If it is not, we display a QMessageBox with an appropriate message.

        if (!query.isValid()) {
            QMessageBox::information(this,
                tr("Invalid Query"), tr("The query you are "
                "trying to execute is invalid."));
            return;
        }

The next step is to evaluate the query. We start by declaring the variables needed to hold our output.

        QByteArray outArray;
        QBuffer buffer(&outArray);
        buffer.open(QIODevice::ReadWrite);

QXmlFormatter is the class responsible for formatting XML output, making it more readable. We construct a QXmlFormatter with the QXmlQuery and QIODevice as parameters.

        QXmlFormatter formatter(query, &buffer);
 
        if (!query.evaluateTo(&formatter)) {
            QMessageBox::information(this,
                tr("Cannot Execute Query"), tr("An error "
                "occurred while executing the query."));
            return;
        }

We attempt to evaluate our query using the evaluateTo() function. If the evaluation succeeds, we display the output. Otherwise, we display a message box to inform the user.

Finally, we close our buffer and output outArray's contents to our output viewer text edit. Since the output is supplied as UTF-8 encoded text, we have to decode it before passing it to the text edit.

        buffer.close();
        ui_outputTextEdit->setPlainText(
            QString::fromUtf8(outArray.constData()));
    }
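
The reason for the decoding step is that UTF-8 stores some characters in more than one byte, so the byte array cannot be read as one character per byte. The following Qt-free sketch illustrates the difference; countCodePoints() is a hypothetical helper, and it assumes its input is valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// Counts the characters (code points) in a UTF-8 encoded byte string by
// skipping continuation bytes, which always have the bit pattern 10xxxxxx.
// Assumes the input is valid UTF-8; no validation is performed.
std::size_t countCodePoints(const std::string &utf8)
{
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80)
            ++count;
    }
    return count;
}
```

For example, the word "café" encoded as "caf\xC3\xA9" occupies five bytes but contains four characters.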

We add one more slot to keep the line edit containing the website's address in sync with the browser.

    void MainWindow::updateLocation(const QUrl &url)
    {
        ui_websiteLineEdit->setText(url.toString());
    }

The front end of our Web robot is now complete. Next, we take a look at the implementation of the queries themselves.

Writing the Queries

Before we look at how to write queries, it is important to note that XQuery queries only work on well-formed XML documents. As a result, we won't be able to use them on sites that use HTML instead of XHTML.
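
To illustrate the kind of markup that causes trouble, here is a deliberately simplified sketch in standard C++ that checks only whether start and end tags are balanced. It is not a real XML parser and has nothing to do with how QtXmlPatterns parses documents; tagsAreBalanced() is a hypothetical helper, and it assumes that no attribute value contains a '>' character.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Returns true if every start tag has a matching end tag in the right
// order. Comments, processing instructions and declarations are skipped.
// Attributes, entities and character data are not checked at all.
bool tagsAreBalanced(const std::string &text)
{
    const char *whitespace = " \t\r\n";
    std::vector<std::string> openElements;
    std::size_t pos = 0;

    while ((pos = text.find('<', pos)) != std::string::npos) {
        if (text.compare(pos, 4, "<!--") == 0) {
            const std::size_t end = text.find("-->", pos + 4);
            if (end == std::string::npos)
                return false;               // unterminated comment
            pos = end + 3;
            continue;
        }

        const std::size_t close = text.find('>', pos);
        if (close == std::string::npos)
            return false;                   // unterminated tag

        const std::string tag = text.substr(pos + 1, close - pos - 1);
        pos = close + 1;

        if (tag.empty())
            return false;
        if (tag[0] == '?' || tag[0] == '!')
            continue;                       // processing instruction or declaration

        if (tag[0] == '/') {
            // End tag: it must close the most recently opened element.
            const std::string name =
                tag.substr(1, tag.find_first_of(whitespace, 1) - 1);
            if (openElements.empty() || openElements.back() != name)
                return false;
            openElements.pop_back();
        } else if (tag.back() != '/') {
            // Start tag: the element name ends at the first whitespace.
            openElements.push_back(tag.substr(0, tag.find_first_of(whitespace)));
        }
        // Otherwise it is an empty-element tag such as <br/>: nothing to do.
    }
    return openElements.empty();
}
```

With this check, "<p>One<br/>Two</p>" passes, while the HTML-style "<p>One<br>Two</p>" fails because the br element is never closed.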

The queries we will use are simple in the sense that they only filter the information found in these online documents, producing output that is easy to understand when displayed as plain text.

Loading the Website

The first query we write is used to obtain the document node of the Qt Quarterly website. We invoke the doc() function on the contents of the inputDocument variable:

    doc($inputDocument)

Recall that inputDocument is bound to the value which we set up in our evaluate() function, so the Web browser will display the Qt Quarterly website while the output viewer will display the corresponding XHTML source text for the page.

Listing Valid RSS Links

Earlier, we mentioned FLWORs, which are used in more complex queries to join data, construct elements, sort data, and so on. For our second query, we use this type of expression to extract RSS links which we validate with the doc-available() function. The query's output will only display valid links.

    for $alternate in doc($inputDocument)//*:link[@rel=
        "alternate" and @type="application/rss+xml"]
    return
        if (doc-available(resolve-uri($alternate/@href)))
        then $alternate
        else ()

Although our query has no let, where or order by clause, it is still valid: a FLWOR needs only at least one for or let clause, followed by a return clause.

The double slash ("//") in combination with the link string selects link elements anywhere in the document, and the asterisk (*) indicates that the elements are from any namespace. We also only select elements with suitable rel and type attributes.
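
For readers more at home in C++, the shape of this query can be sketched as an ordinary loop over a list. This is only an analogy under stated assumptions: the Link struct is a hypothetical, flattened view of a link element's attributes, and the isAvailable callback stands in for doc-available(), which in the real query is applied to the resolved URI.

```cpp
#include <functional>
#include <string>
#include <vector>

// Hypothetical, flattened view of a <link> element's attributes.
struct Link
{
    std::string rel;
    std::string type;
    std::string href;
};

// Mirrors the query: the predicate in square brackets selects the
// candidates, and the return clause contributes either the item itself
// or the empty sequence.
std::vector<Link> validRssLinks(
    const std::vector<Link> &links,
    const std::function<bool(const std::string &)> &isAvailable)
{
    std::vector<Link> result;
    for (const Link &alternate : links) {           // for $alternate in ...
        if (alternate.rel != "alternate"
            || alternate.type != "application/rss+xml")
            continue;                               // fails the [...] predicate
        if (isAvailable(alternate.href))            // if (doc-available(...))
            result.push_back(alternate);            // then $alternate
        // else (): the empty sequence adds nothing to the result
    }
    return result;
}
```

The results of all iterations are concatenated into one sequence, which is why returning the empty sequence simply drops a link from the output.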

We invoke the resolve-uri() function to resolve the relative URI against the base URI, forming the absolute URI. The result of the query for RSS links looks like this:

The output from a query
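
In the query, resolve-uri() is called with a single argument, so the base URI comes from the query itself; evaluate() supplies it as the second argument to setQuery(). The following sketch shows roughly what resolving a reference against a base involves. It is a simplification written for this article, not the algorithm used by QtXmlPatterns: resolveUri() is a hypothetical helper that handles only absolute references, absolute paths and plain relative paths, and it ignores "." and ".." segments, queries and fragments.

```cpp
#include <cstddef>
#include <string>

// Simplified resolution of a reference against an absolute base URI.
// The argument order follows resolve-uri($relative, $base).
std::string resolveUri(const std::string &reference, const std::string &base)
{
    if (reference.empty())
        return base;

    // A reference with its own scheme is already absolute.
    if (reference.find("://") != std::string::npos)
        return reference;

    const std::size_t schemeEnd = base.find("://");
    if (schemeEnd == std::string::npos)
        return reference;                   // base is not absolute: give up

    // The path starts at the first '/' after the host name, if any.
    const std::size_t pathStart = base.find('/', schemeEnd + 3);

    // Absolute path: keep only the scheme and host of the base.
    if (reference[0] == '/')
        return base.substr(0, pathStart) + reference;

    // Base without a path, such as "http://example.com".
    if (pathStart == std::string::npos)
        return base + "/" + reference;

    // Relative path: replace everything after the last '/' of the base.
    return base.substr(0, base.rfind('/') + 1) + reference;
}
```

For instance, resolving the hypothetical reference "feed.xml" against "http://doc.trolltech.com/qq/" gives "http://doc.trolltech.com/qq/feed.xml".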

XQuery Built-in Functions We Require

From the various built-in functions provided with XQuery, we select five common functions to use with our Web robot:

  • doc() takes a URI and retrieves its document node (the entire XML document).
  • doc-available() takes a URI and calls the doc() function on it. Returns true if the doc() function results in a document node; returns false otherwise.
  • resolve-uri() resolves a relative URI against an absolute base URI.
  • count() returns the number of items in the sequence passed as its argument.
  • starts-with() returns true or false, indicating whether one string starts with the characters of another string.

Listing All Image Elements

Our third query attempts to list all image elements anywhere in the Qt Quarterly website, within any namespace.

    doc($inputDocument)//*:img

The output is shown below.

The output from a query

Counting Qt Quarterly Issues

Suppose we'd like to count the number of Qt Quarterly issues we have so far. We know that each image is embedded within a table. So, we select all table elements, all td elements, and all img elements within them. Then we invoke the starts-with() function, because we only need image elements whose alt attributes begin with "Issue".

    count(doc($inputDocument)//*:table//*:td//*:img[
        starts-with(@alt, "Issue")])

The output for this query is "24" at the time of writing.

The output from a query
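
The filtering and counting that this query performs can be sketched in standard C++ as follows. This is an analogy only: startsWith() and countIssues() are hypothetical helpers, the list of alt texts is assumed to have been extracted already, and the comparison is a plain case-sensitive byte comparison.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Counterpart of starts-with() for plain byte strings.
bool startsWith(const std::string &text, const std::string &prefix)
{
    return text.compare(0, prefix.size(), prefix) == 0;
}

// Counterpart of count(...[starts-with(@alt, "Issue")]): the predicate
// filters the sequence, and count() reports how many items remain.
std::size_t countIssues(const std::vector<std::string> &altTexts)
{
    return static_cast<std::size_t>(
        std::count_if(altTexts.begin(), altTexts.end(),
                      [](const std::string &alt) {
                          return startsWith(alt, "Issue");
                      }));
}
```

Given the made-up alt texts "Issue 1", "Issue 2" and "Qt logo", countIssues() returns 2, because only the first two begin with "Issue".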

Final Thoughts

The Web robot is only a small preview of what is possible with the QtXmlPatterns module. You can modify the queries to perform more advanced operations, such as checking for broken links, searching for data encoded in the document using microformats, and so on.

Also, if you would like to analyze more than one website, you have to subclass QAbstractXmlNodeModel, which can represent any kind of data source in a form that can be queried with QXmlQuery. Remember that not all websites are made with well-formed XML; addressing this real-world problem is a challenge we decided not to face when writing this article.

Another aspect of XQuery that we haven't explored here is the use of queries as templates, where the expressions themselves are written as part of an XML document. This makes it possible to write applications that take content, process it, and feed it back into a Web browser. Content manipulation along these lines could even form the basis of a report generator or database front end.

Qt comes with a set of examples and demos for the QtXmlPatterns module that show how to use XQuery in your applications. We recommend starting with the Recipes example before moving on to more advanced features.
