Reflecting upon Web Abstraction and URL

The URL is the key user-facing view of a web application

S V Ramu (2003-02-16)

Prelude

The Internet is the culmination of years of passion and commitment to collaborate, in spite of differences. The three key conceptual pillars of this Internet revolution are HTTP, HTML, and the URL. HTTP, for the common Internet user, stays well behind the scenes; you can in fact build and run a site without knowing even a single command of HTTP (Hyper Text Transfer Protocol - RFC 1945, RFC 2068). HTTP's job is simply to make two different computers talk to each other. Simply put, a URL (Uniform Resource Locator - RFC 1738) is a standard naming convention for addressing a website and its pages. And HTML is just a convenient page presentation format. In itself each of these concepts/standards is fairly simple, but together they have created this amazing world of the Internet, of which we are all fortunate beneficiaries. By the way, the RFC (Request For Comments) scheme is itself a remarkable example of decentralized interests yielding very coherent standards. Every minute detail of Internet usage is documented and standardized.

HTTP and Distributed Computing

...The HTTP protocol is based on a request/response paradigm. A client establishes a connection with a server and sends a request to the server in the form of a request method, URI, and protocol version, followed by a MIME-like message containing request modifiers, client information, and possible body content. The server responds with a status line, including the message's protocol version and a success or error code, followed by a MIME-like message containing server information, entity meta information, and possible body content...

RFC 1945

From even the early stages of computing, application designers have dreamed of distributed programming. Though much has been achieved, in many respects it is still a dream. Though websites are not, in general, examples of distributed computing, the upcoming Web Services model and its related efforts show that the humble old website is the forerunner of these current developments. The basic underpinning of this web model is HTTP, which is still just as relevant in the modern web services world.

HTTP is a typical wire protocol, which means it standardizes only the plain text messages exchanged between the client and the server. The power of this model is that the protocol does not depend on the CPU, the operating system, or anything else about the computers at either end of the conversation. As long as the applications at both ends can create these HTTP messages and know how to handle them, things will work. The beauty is that it does not depend on the underlying network protocol either.

...On the Internet, HTTP communication generally takes place over TCP/IP connections. The default port is TCP 80, but other ports can be used. This does not preclude HTTP from being implemented on top of any other protocol on the Internet, or on other networks. HTTP only presumes a reliable transport; any protocol that provides such guarantees can be used, and the mapping of the HTTP/1.0 request and response structures onto the transport data units of the protocol in question is outside the scope of this specification...

RFC 1945

Whenever you browse to a page, remember that the machine serving the page to your cool Windows XP machine could be an old machine powered by some non-Intel CPU (say a Sun SPARC), perhaps running Linux or Solaris, with the pages generated by a handful of Perl scripts, or ASP if you like. The combination can be anything. As long as there is a web server on one side and a browser on the other, the communication can proceed happily. This abstractive power of the Internet to bridge any two machines is what has inspired modern technologists to use this very same protocol, HTTP, for their new offerings in the form of web services.
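To make this wire-protocol nature concrete, here is a minimal sketch of an HTTP conversation driven by hand from Java. It simply writes the ASCII request lines over a TCP socket and prints whatever comes back; the host name www.example.com is only a placeholder, and any HTTP server would do.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.net.Socket;

// A minimal sketch: an HTTP/1.0 request is nothing but plain ASCII text
// written over a reliable connection. The host below is only an example.
public class RawHttpGet {
    public static void main(String[] args) throws Exception {
        try (Socket socket = new Socket("www.example.com", 80)) {
            Writer out = new OutputStreamWriter(socket.getOutputStream(), "US-ASCII");
            out.write("GET / HTTP/1.0\r\n");        // request line: method, URI, version
            out.write("Host: www.example.com\r\n"); // a request header
            out.write("\r\n");                      // blank line ends the headers
            out.flush();

            BufferedReader in = new BufferedReader(
                    new InputStreamReader(socket.getInputStream(), "US-ASCII"));
            for (String line; (line = in.readLine()) != null; ) {
                System.out.println(line);           // status line, headers, then the HTML body
            }
        }
    }
}

Nothing in this exchange cares whether either end is Windows, Linux, or Solaris; only the text format matters.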

HTML and XML - The genesis of universal Markup Languages

HTML is a meta language. For a long time, ASCII was the only predominant standard that was widely accepted and unquestioningly used. Two machines that can send and receive bit streams are no good at communicating unless something is agreed about the format of the bit sequence; ASCII was that format. But human speech is highly expressive. Plain ASCII conveys only the content of what needs to be told. The nuances and gestures we use while speaking, to emphasize or elide something, need more expressiveness than ASCII can provide. Showing a specific piece of text in bold or italics is the usual way of coding these nuances into plain ASCII text. HTML is just one such simple tagging meta language for encoding the nuances of speech into plain ASCII, using ASCII only.

Is HTML's tagging format the optimal one? Mostly yes. One superfluity I find is in the closing tag. If, say, a font tag starts with <font color="red">, is there a need to close it with </font>? Won't a simple </> do? In fact, had we used it this way, we could have saved a rule in the XML spec, the one which says that tags must NOT overlap. We need not worry about this causing problems down the line, because programming languages have long used (), {}, and begin-end style delimiters without confusion, even when they nest very deeply. Just check the end of most Java files and you will see many }}}} like structures, which programmers do not feel uncomfortable with (if you are still in doubt, check out a LISP file!). But all that said, the old style of closing a tag with its own name does make sense for clarity's sake. If you have looked at code from experienced programmers, they often close their while and if blocks with //while and //if style comments, to mark them off clearly. So the HTML style of markup (tagging) language really is close to optimal, both with respect to minimality of coding and readability.

If you have noticed, the whole HTML meta encoding, and hence the XML encoding, uses just five meta characters, namely <, >, &, /, and ;. What this means is that when scanning a stream of ASCII (or Unicode) characters, you need to treat only these five meta characters as special alert points. In fact, taking just the two opening characters, < and &, as meta characters should do too, because > is special only after a < has been encountered, and ; only after an &. This type of non-puristic tolerance is what has made MS IE and other browsers popular. The point is that the syntax of HTML, and of XML (its later generalization), is about the simplest and most readable encoding around. And the silver lining is that it is globally accepted and widely used as the all-purpose tagging format.
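As a rough sketch (not a conforming parser), the following shows how little a scanner needs to watch for: only < and & open the special regions, and > and ; merely close them.

// A rough illustration of scanning with only '<' and '&' as alert points.
// This is not a conforming HTML/XML parser, just a sketch of the idea.
public class TagScanner {
    public static void main(String[] args) {
        String input = "Hello <b>bold &amp; proud</b> world";
        StringBuilder text = new StringBuilder();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (c == '<') {                        // '>' matters only once we are inside a tag
                int end = input.indexOf('>', i);
                System.out.println("tag:    " + input.substring(i + 1, end));
                i = end + 1;
            } else if (c == '&') {                 // ';' matters only once we are inside an entity
                int end = input.indexOf(';', i);
                System.out.println("entity: " + input.substring(i + 1, end));
                i = end + 1;
            } else {
                text.append(c);                    // everything else is plain character data
                i++;
            }
        }
        System.out.println("text:   " + text);
    }
}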

If you have seen efforts like NanoXML or TinyXML, which want to simplify XML further, you will notice that they envision the new format mainly for encoding data structures. The main target of these simplification efforts is the 'attributes'. But the fame and usefulness of XML-like tagging came to the fore with HTML, which is not a data structure per se but a document. When we want to say that a given word should be 'red' in color, it is messy to require a tag like <red>; we feel more comfortable with something like <font color="red">. This way the subordination of color to font is very clear, and red is just one possible value. So we cannot avoid three distinct concepts while tagging, namely the tagged text, the tag, and the attributes of that tag. Maybe the double quotes are unnecessary, since a space already delimits a value from the next name, but given the unanimous acceptance of XML today this is only a very slight inconvenience. The XML format is the nicest compromise for tagging any given ASCII (or Unicode) text with minimal meta characters and high human readability.

URL - The old fame and new meaning

In the early days, or even now in the simple case, when a website is just a collection of HTML files, standardizing a convention for addressing these files directly through the URL was important, since the whole process can be automated by a compliant web server that picks the files from local directories and serves them across the net. For example, http://www.mydomain.com/myfolder/myfile.html could very well mean that the file myfile.html is inside the folder myfolder at the site http://www.mydomain.com. This is the URL the client knows, and that is how the website sees it too. But with the dynamic page generation capabilities of CGI, ASP, JSP and the like, the URL is now only symbolic.
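For what it is worth, that traditional mapping is little more than gluing the URL path onto a document root, along the lines of this sketch (the document root and paths here are hypothetical, and a real server would also guard against things like .. traversal):

import java.io.File;

// A sketch of the classic URL-path-to-file mapping of a static web server.
// The document root is a hypothetical, Unix-style example.
public class UrlToFile {
    static final File DOC_ROOT = new File("/var/www/mydomain");

    static File resolve(String urlPath) {
        // /myfolder/myfile.html -> /var/www/mydomain/myfolder/myfile.html
        String relative = urlPath.startsWith("/") ? urlPath.substring(1) : urlPath;
        return new File(DOC_ROOT, relative);
    }

    public static void main(String[] args) {
        System.out.println(resolve("/myfolder/myfile.html"));
    }
}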

In many modern websites, even a request for myfile.pdf may be served by content dynamically generated from some XML snippet, or even from a database. Of course, even now the folder-like separation lends itself to separating contexts, and the dot extension of a file to indicating the presentation format. But the original intention of mimicking files and folders is fast going out of fashion. This leads us to the point that, while developing dynamic content, the real URLs sent out in responses should be carefully abstracted. A URL is like a phone number given to a friend: whether we are still at that number or not, the friend expects us to be there. So, even if we change our number, we should arrange to redirect calls from the old number to the new one, instead of trying to update every one of our friends.
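In servlet terms, honouring the old number can be as small as this sketch: any request for a retired URL is answered with a permanent redirect to its new home (the new path below is a hypothetical example).

import java.io.IOException;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// A sketch: redirect requests for an old, retired URL to its new location.
// The target path is a hypothetical example.
public class OldUrlRedirectServlet extends HttpServlet {
    protected void doGet(HttpServletRequest req, HttpServletResponse res)
            throws ServletException, IOException {
        // 301 tells clients (and search engines) that the move is permanent.
        res.setStatus(HttpServletResponse.SC_MOVED_PERMANENTLY);
        res.setHeader("Location", "/reports/myfile.pdf");
    }
}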

This redirecting of URLs has been practiced for a long time now. But thinking of our web application itself as a layer, separate from the user's URL model, is becoming more and more important, especially for web applications that can be packaged onto a CD and given to different clients. While developing a web application, we do need to fill the response pages with links and real URLs. But at that time we do not know what our customers would like as their URL model. For all we know, they might want to use our application in tandem with their existing dynamic pages. If so, a configurable URL scheme would be an elegant extensibility factor for our web application.

While HTTP has abstracted the client and the server completely from each other, what remains as the client's interface to the server is only the URL. The URL thus assumes an almost API-like importance; it is our client's only interface to our website or service. For many ISPs it might be mandatory to abstract this URL model completely from both the client and the application. The URLs given to clients might be strategically important, and might need to remain fairly unchanging, irrespective of the web application's status or upgrades. Maybe we can start to think of a website as a collection of URLs that need to be serviced in a certain way, and then work towards satisfying those URL requests. Seen this way, testing a site might amount to testing a series of URLs.

Implementing the URL abstraction

Appreciate the similarity between a URL,

http://www.mydomain.com/package/method?param=value

and a function call,

package.method(param=value);

The key thing that differentiates the two is that the URL comes with the possibility of distribution built in. Thus the URL by itself is a reasonably well-endowed RPC (Remote Procedure Call) format.
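To see how literally the analogy holds, here is a small sketch that treats the path as package.method and the query string as the arguments, dispatching with reflection. The class and method names (DemoService, greet) are purely hypothetical, invented for the illustration.

import java.lang.reflect.Method;
import java.util.LinkedHashMap;
import java.util.Map;

// A sketch of the URL-as-RPC idea: the path names the class and method,
// the query string carries the parameters. DemoService and greet are
// hypothetical names, used purely for illustration.
public class UrlRpc {

    public static class DemoService {
        public String greet(Map<String, String> params) {
            return "Hello, " + params.get("name");
        }
    }

    static Object dispatch(String pathAndQuery) throws Exception {
        String[] parts = pathAndQuery.split("\\?", 2);
        String[] path = parts[0].split("/");            // "", "DemoService", "greet"
        Map<String, String> params = new LinkedHashMap<String, String>();
        if (parts.length > 1) {
            for (String pair : parts[1].split("&")) {
                String[] kv = pair.split("=", 2);
                params.put(kv[0], kv.length > 1 ? kv[1] : "");
            }
        }
        Class<?> cls = Class.forName(UrlRpc.class.getName() + "$" + path[1]);
        Method method = cls.getMethod(path[2], Map.class);
        return method.invoke(cls.newInstance(), params);
    }

    public static void main(String[] args) throws Exception {
        // /DemoService/greet?name=Ramu behaves like DemoService.greet(name=Ramu)
        System.out.println(dispatch("/DemoService/greet?name=Ramu"));
    }
}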

So, to accomplish this complete separation of URLs from our application, we must allow the URL arriving at the web server to be first transformed and routed to the appropriate service processor. The response generated by our application should be built around internal, logical URLs, which are then transformed into client-facing URLs by the same mapping module that directed the client request to the web service in the first place. Practically, I am considering Apache-like URL rewriting capabilities that can be configured dynamically with regex-like tools. Once the incoming URLs are suitably transformed by the server's powerful regex capabilities, the request comes to my universal front controlling servlet/JSP, which then dispatches it to the appropriate request-processing Java interfaces.
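A bare-bones version of such a front controller might look like the following sketch: the (already rewritten) incoming path is matched against regular expressions, and the first matching request processor services it. The patterns, interface, and handlers here are hypothetical placeholders, not a finished framework.

import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// A sketch of a universal front controller: every request comes here, and a
// regex-keyed table decides which request processor handles it. The patterns
// and processors below are hypothetical placeholders.
public class FrontControllerServlet extends HttpServlet {

    interface RequestProcessor {
        void process(HttpServletRequest req, HttpServletResponse res) throws IOException;
    }

    private final Map<Pattern, RequestProcessor> mappings =
            new LinkedHashMap<Pattern, RequestProcessor>();

    public void init() {
        // Internal, logical URLs; the outward-facing URLs are the rewriting layer's business.
        mappings.put(Pattern.compile("^/catalog(/.*)?$"), new RequestProcessor() {
            public void process(HttpServletRequest req, HttpServletResponse res) throws IOException {
                res.getWriter().println("catalog page");
            }
        });
        mappings.put(Pattern.compile("^/orders/\\d+$"), new RequestProcessor() {
            public void process(HttpServletRequest req, HttpServletResponse res) throws IOException {
                res.getWriter().println("order page");
            }
        });
    }

    protected void doGet(HttpServletRequest req, HttpServletResponse res)
            throws ServletException, IOException {
        String path = (req.getPathInfo() == null) ? "/" : req.getPathInfo();
        for (Map.Entry<Pattern, RequestProcessor> entry : mappings.entrySet()) {
            if (entry.getKey().matcher(path).matches()) {
                entry.getValue().process(req, res);  // dispatch to the matching processor
                return;
            }
        }
        res.sendError(HttpServletResponse.SC_NOT_FOUND);
    }
}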

Epilogue

I must say that these thoughts started from my first attempt at creating a redistributable web application. For all I know, these ideas may already be common practice among veterans in this field. All the same, I have tried to explain the details of the route I took. Moreover, I wanted to share the exciting realization that the web model is trying to give us all the API (Application Programming Interface) like capabilities through the URL.