Sunday, May 5, 2013

The Horror of the Web


One of the most important advances in computer use was the development of what-you-see-is-what-you-get (or WYSIWYG) word processors. They made computers enormously more useful for the average user. Before these tools, there was no good way for ordinary users to prepare good-looking documents on a computer. They either had to produce documents with minimal formatting or they had to learn complex markup languages like troff or TeX.


Consider what you need to do to successfully use a markup language. First, you have to learn a complex and unintuitive text notation for the non-text features of a document. Second, you need to develop the skill of mapping your concept of a page into that arcane syntax. Third, you need to develop debugging skills. It is very rare for a marked-up document to come out right the first time; there are always syntax errors and unexpected interactions between pieces of markup.


Finding and fixing these sorts of problems is difficult, and for many people unpleasant as well. Debugging means going around the test-debug-fix cycle again and again, which is a needless waste of time. With WYSIWYG word processors, far fewer errors are possible, and the errors that are possible show up the moment they are made. There is no need for the time-consuming test-debug-fix cycle.

The World Wide Web was in some ways a leap backwards in computer technology. Before the Web, most documents (outside of some science and engineering departments) were written using WYSIWYG word processors. Then suddenly it seemed that every document had to be web-accessible, and all of that changed. WYSIWYG went out the window and everyone had to resort to raw HTML.

An entire industry sprang up in which anyone with any technical skill at all could make money writing web pages for people who did not have technical skills, and once again enormous amounts of (rather expensive) time were wasted in the test-debug-fix cycle. Then still more highly talented people had to spend their valuable time writing WYSIWYG word processors that read and write HTML.

This sudden move to the Web and HTML was quite literally an economic disaster. No one noticed the disaster because people don't see such problems and inconveniences in economic terms. You need to start putting all of your user manuals in HTML? OK, the usual docs team can't do that because they don't know HTML, so let's hire a Web team. The Web team is happy because they get hired. The docs team is relieved because they aren't going to have to learn that complicated format with all the less-than and greater-than symbols. The company management is happy because they get more reports, which makes them feel more important.

The people paying the price are the customers who see prices go up, the investors who see profits go down, and every other technical business that can't find enough talented people to do real product work because everyone is writing web pages. All of that wasted talent set technological progress back by some amount. I don't know how much, but given the number of people involved, it had to be substantial.
I’m no economist, so take this as the amateur speculation that it is, but I wonder if a contributing factor to the bursting of the tech bubble was the horrendous amount of talent being wasted trying to match up angle brackets.

Now, the Web was a good thing; I am by no means arguing against that. And there was a lot of value in moving documents onto the Web so that they would be more readily available. My complaint is that it was done wrong, and obviously wrong.

The problem started with the fact that HTML is a markup language rather than a binary document format. This not only burdened users with the horrors of a markup language; it also, I believe, delayed the eventual appearance of WYSIWYG word processors for the Web, because people tend to get used to what they have and not demand something better.

If the Web had begun with a binary document format in the first place, introduced not only with a simple browser but also with a simple WYSIWYG word processor, the technological history of the 1990s would have been dramatically different and better. Netscape would have put out a highly advanced WYSIWYG word processor along with the browser. Microsoft would have had to copy the plan, perhaps adding Web output to Word several years earlier than they did. Throughout the explosive growth period of the internet, everyone would have been able to write Web documents as easily as they could write Word documents, and all of those people who were hired to write HTML could have been doing something more useful.


There is another way in which the Web was a leap backward. In the beginning, graphical user interfaces (GUIs) were created with an API that would place, draw, and control the graphical elements. Like text markup, this is a tedious, time-consuming process of test-debug-fix. Then along came interactive GUI tools such as those in Delphi and Visual Basic that made the process almost trivial compared to the previous methods.

I’ll never forget the first interactive form I made in Delphi. It took no more than a few minutes to create, then I ran it and it was perfect the first time. It was like a miracle.

Then suddenly Windows applications were no longer good enough and our customers started demanding Web interfaces. That meant writing the GUI in HTML with a clunky and complex back end like JavaServer Pages, to produce a user interface that was less attractive, less standardized, slower, buggier, and much harder to maintain. And producing this monstrosity was enormously more difficult and expensive.

For me, it was like I had lived my whole life in the sewers until one day, I came crawling out, blinking and amazed, into the bright light of day. I saw the blue of the sky and smelled the fragrance of the flowers. I lay in the sun to dry my shriveled prune-like skin. I thought that the days of darkness and foulness were behind me. Then WHAM! Back in the sewer for you! Back into the fetid pit where there shall be weeping and wailing and gnashing of teeth.

Do I sound bitter? Then so be it; I am bitter. Well, I was bitter. Since then I’ve moved on from user applications to servers so the horror of HTML has not had much effect on me for many years and my temper has mellowed with age. Yet scars do remain. Yes. Scars do remain.

Where Delphi and Visual Basic had already solved most of the problems of building graphical user interfaces, doing anything web-enabled meant coming up with new solutions all over again. Unfortunately, many people in the Unix community that created the Web were unaware of these tools (as I was ignorant of such things until I was dragged kicking and screaming to Windows), so they didn't even know what great tools they were throwing away. Even in 2005, there were no development tools for designing Web interfaces that were nearly as good as the Delphi and Visual Basic of 1995.

In 1995, before the explosion of the Web, developers were starting to settle on a reduced set of tools and methods for writing applications. It wasn't down to one set, but that arguably isn't a good idea anyway, and at least things were moving in a good direction.

All of that changed in the Web explosion, where everyone had to find their own way to do things and then wrote tools to do things that way. Today, in order to be confident that you can go into any organization and work on their web pages, you need to know HTML, XML, VML, et al., JavaScript, style sheets, PHP, Perl, Tcl, Python, Java, SQL (in various flavors), and a bunch of server and development platforms that all work completely differently. For developers it has become the World Wide Nightmare.

I blame this on the text format also. Because HTML is text, it is hard to write good tools for it. Because it is hard to write good tools for it, there was a long time when there were no good tools. But also because it is text, lots and lots of developers knew the format in excruciating detail, so they were able to write ad hoc tools to solve particular problems. And because there were no good tools, the ad hoc tools were forced to continue to grow and expand to handle problem after problem, until the ad hoc tools became platforms, most of which were never designed but only grew to match some particular set of applications.

This was destined to happen because for a lot of things that developers want to do, there really was no good way to do it in HTML. And that is because HTML was designed as a language to write simple documents with hyperlinks.
When people started using HTML for video, interactive forms, animations, sound, a GUI for remote applications, and all of those other things, there was never any real effort to rethink and redesign based on the new use cases. Instead, anyone who wanted to do something new would add some random feature that fit their own application. Then others would hack up extensions to the extensions to handle their slightly different applications. Then the feature would get added in different forms in different browsers until a committee randomly picked a hacked-up compromise standard that suited the applications of the major players.

This is a horrible way to design anything, much less the format for most of the documents on earth, and I think, again, it is partly due to the text nature of HTML. The people adding a new HTML feature only had to change the reader. They didn't have to make it work in a WYSIWYG editor. They thought in terms of markup language rather than in terms of "how do I store this information effectively for the computer?" When you added an extension, you not only had to add the data, you had to find a textual way to represent it, and it had to be in a form that humans could directly read and write.

Data formats should not have to take human limitations into account. Data formats should be designed for machines. If people need to view the data, they should do so through an appropriate viewing tool. This is simply a matter of designing to good requirements. If your requirements call for a file format, you should ask, "Who is going to read and produce this format?" The answer should be "users" or "programs". If the answer is "users and programs", then you have a bug in your requirements. Fix the requirements bug and you will know how to design your file format.

Taking this a step further, you should realize that the answer should always be "a program". Computer programs access files, not users. Some computer programs have the job of presenting the data to a user as text, graphics, sound, or some other medium, but it is the program that actually accesses the data.

So write really good libraries to access the data, and use a data format that is convenient for the program. There is no need to handle user errors or to give the user feedback about them. Not only does this reduce the difficulty of producing the software, but the final representation of the data is much more compact.
Design effective user interfaces that let the user express concepts interactively and give immediate feedback, but don't worry about letting the user open the data file with a text editor. This policy pushes the complexity from the tool user to the tool designer, where it belongs. Far fewer hours are spent writing tools than using them.
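
As a minimal sketch of that approach (the record layout, class, and file names here are made up for illustration, not taken from any particular product), a small Java library can own a compact binary layout end to end, so no user ever needs to see or edit the bytes:

```java
import java.io.*;

// Hypothetical binary record: [int id][UTF-8 title][long timestamp].
// Only this library knows the layout; users work through a GUI or a
// higher-level API, never with the raw file.
public final class RecordStore {
    public static void write(File file, int id, String title, long timestamp)
            throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(file)))) {
            out.writeInt(id);        // fixed-width, no quoting or escaping rules
            out.writeUTF(title);     // length-prefixed string, no parsing ambiguity
            out.writeLong(timestamp);
        }
    }

    public static String read(File file) throws IOException {
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(file)))) {
            int id = in.readInt();
            String title = in.readUTF();
            long timestamp = in.readLong();
            return id + ": " + title + " @ " + timestamp;
        }
    }
}
```

Because only the library touches the bytes, there is no syntax for a user to get wrong and no error-reporting machinery to build around hand-edited files.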

This advice goes directly against the unfortunate trend of converting all data formats to XML, another text-based language. Back when XML was new, I had hoped that once it got past the stage of being the darling of the technology groupies and reached the front line of real developers, it would be ignored. But that didn't happen. In fact, serious developers ended up adopting XML at an alarming rate as a general-purpose data representation language.

The reasons for this are clear: XML offers the promise of a standardized data format for any type of data that can be read in any language. All you have to do is provide a specification file and any program in any language can potentially read data from any source. The advantages of such a scheme are tremendous, but XML is a poor implementation of it.

There are three reasons why XML is a poor implementation of a universal data standard. First, it is a text-based standard, giving the false impression that creating data files by hand is a reasonable thing to do. By contrast, in a better-designed format all data files would be created by GUIs or APIs, and no one except the authors of the APIs would ever have to know what the internals of the data file look like.

Second, XML has only a couple of types that it can represent directly, namely strings and lists of node-labeled trees. It doesn't know anything about integers or tables or timestamps or anything else. The programmer has to parse and type-check all of these things by hand.
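
To make that concrete, here is a small sketch of what "by hand" looks like with the standard Java DOM API (the element names are hypothetical): every number and timestamp arrives as a string and has to be individually parsed and checked by the program:

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import java.io.File;
import java.time.Instant;

public final class OrderReader {
    public static void main(String[] args) throws Exception {
        // Hypothetical input:
        //   <order><quantity>3</quantity>
        //          <created>2013-05-05T10:15:30Z</created></order>
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new File(args[0]));

        // XML hands everything back as strings; the types are the program's problem.
        String qtyText     = doc.getElementsByTagName("quantity").item(0).getTextContent();
        String createdText = doc.getElementsByTagName("created").item(0).getTextContent();

        int quantity    = Integer.parseInt(qtyText.trim());    // may throw NumberFormatException
        Instant created = Instant.parse(createdText.trim());   // may throw DateTimeParseException

        System.out.println(quantity + " items, created " + created);
    }
}
```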

Third, XML uses node-labeled trees. This is because XML elements are the nodes, and elements have labels, but the child elements are all just stuffed into a parent element. There is no concept of a specially labeled edge to a child element, such as the "next" child or the "previous" child (for implementing a doubly-linked list, for example). To simulate such things, you have to insert another node between the parent and the child and label that node something like "next" or "previous". This complicates the structure and makes it take more space.

By contrast, practically every programming language uses edge-labeled graphs, such as the ones you get from objects in C++ or Java. The objects are the nodes. The names of the members that point to other objects are the edge labels. The graph is just a connected set of objects that you can traverse through the members.
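
A quick sketch of the difference, using a made-up doubly-linked list of tasks: in Java the member names are the edge labels, while a node-labeled encoding like XML has to spend an extra wrapper element on every edge just to say which edge it is:

```java
// Edge-labeled: the member names "next" and "prev" label the edges directly.
class Task {
    String name;
    Task next;   // edge labeled "next"
    Task prev;   // edge labeled "prev"
}

// A node-labeled encoding (XML-style) cannot label the edge itself, so each
// relationship needs an intermediate element standing in for the edge label
// (setting aside the separate problem that a pure tree cannot share nodes):
//
//   <task>
//     <name>ship release</name>
//     <next>               <!-- wrapper node that only exists to name the edge -->
//       <task>...</task>
//     </next>
//     <prev>
//       <task>...</task>
//     </prev>
//   </task>
```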

The difference between node-labeled and edge-labeled structures has led to confusion among people who don't notice the difference and try to design data structures in XML the same way that they would design them in an object-oriented language.

I know of no general representational format other than XML that uses node-labeled graphs as the basic unit, so why does XML use them? Of course, it’s because XML is really a markup language (hence, eXtensible Markup Language) and not a binary data format language. If only people had kept that in mind.

I should back up for a minute here and admit that the problem isn't really the design of XML so much as the use of XML for purposes for which it was not designed. As an eXtensible Markup Language, a representation for fancy text with small amounts of embedded data, XML is fine, other than being text-based. For general data representation it is just not very good. Using XML as a general-purpose data representation format is like using JavaScript to write a compiler. It can be done, but it is a bad idea. And you can't blame the designers of JavaScript for that; you have to blame the people who are misusing it.

Yet throughout the first decade of the twenty-first century I kept seeing more and more programs with perfectly good data formats being converted to XML, and new programs using XML as their data format even though it was clearly not a good technical decision.

Thankfully, there is some hope on the data format front. Several binary formats such as Apache Thrift, Apache Avro, and Protocol Buffers are becoming popular. These are general-purpose, binary data formats with APIs in multiple languages that offer the same advantages that got people so excited about XML, namely that they are standard formats that can be used to represent any sort of data and can be read by any language. But these formats are much better than XML in nearly all respects for the purpose of representing general data and it’s a good thing that developers are noticing them.
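
As a rough sketch of why these formats deliver on the same promise (the message definition below is hypothetical, and the Person class stands for the kind of code the protoc compiler generates from such a definition), the schema carries the types, the wire format is binary, and nobody ever edits the bytes by hand:

```java
// Hypothetical schema, e.g. in a person.proto file:
//
//   message Person {
//     string name = 1;
//     int32  id   = 2;
//   }
//
// protoc generates a Person class with a builder, typed accessors, and
// binary (de)serialization; the sketch below assumes that generated class.
import java.nio.file.Files;
import java.nio.file.Paths;

public final class PersonDemo {
    public static void main(String[] args) throws Exception {
        Person original = Person.newBuilder()
                .setName("Ada")
                .setId(42)          // typed setter: no string parsing anywhere
                .build();

        // Compact binary bytes on disk; no human ever edits them directly.
        Files.write(Paths.get("person.bin"), original.toByteArray());

        Person copy = Person.parseFrom(Files.readAllBytes(Paths.get("person.bin")));
        System.out.println(copy.getName() + " / " + copy.getId());
    }
}
```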
