Commit 7ee4ea29 authored by Lars Knoll

moved the new html implementation over to the main branch.

BEWARE: It might not compile. I'll fix this in the evening, after I
come home from Erlangen. It'll break apps using khtml. As I will do
some rewrite of the external API in the next one or two days, it's
probably not worth fixing that...

But we have JAVA now!

svn path=/trunk/kdelibs/; revision=30525
parent d3e05843
......@@ -14,7 +14,7 @@ Stage 3.: The HTML-Layout
The Tokenizer
The tokenizer is located in khtmltoken.cpp. The tokenizer uses the contents
The tokenizer is located in khtmltokenizer.cpp. The tokenizer uses the contents
of a HTML-file as input and breaks this contents up in a linked list of
tokens. The tokenizer recognizes HTML-entities and HTML-tags. Text between
begin- and end-tags is handled distinctly for several tags. The distinctions
......@@ -40,220 +40,93 @@ The HTML-Parser
The HTML-parser interprets the stream of tokens provided by the tokenizer
and constructs a structure of renderable elements. Two types of renderable
elements can be distinguished: HTML-objects and HTML-clues.
and constructs a tree of elements representing the document according
to the Document Object Model (DOM, see For HTML,
one can distinguish between 3 kinds of basic objects the document is
built up from:
* HTML-objects
* Text
A HTML-object is a object which can be drawn on the screen. Examples of it
are text, links, images and lines.
Text is a basic class holding some text of the page.
* HTML-clues
* HTMLBlockElement
A HTML-clue is a container which can contain HTML-objects and/or other
HTML-clues. A HTML-clue determines how the elements which it contains are
positioned with respect to each other.
Elements representing a block in the document (like <hr>, <table>,
<blockquote>, <li>, ...). These elements can contain inline elements
(the ones forming paragraphs) and other block elements. Block elements
have the ability to render themselves and the inline elements, which
they contain.
Example I:
The HTMLClueFlow positions its elements in a 'flow' like the text in a book:
It starts from the left and moves to the right. When it hits the
right-margin it moves down and continues from the left-margin.
* HTMLInlineElement
Inline elements are all elements, which are rendered as part of a
paragraph (e.g. <b>, <img>, <tt>, ...). Inline elements do not render
themselves, but are rendered by the surrounding block element. Inline
elements can't contain any block elements.
The root of all elements is the HTMLDocument.
The HTML-Layout
When the complete structure of HTML-clues and HTML-objects is build. The
HTML-layout starts: each HTML-object is positioned. The positioning depends
When the complete structure of Elements and Text is built, the
HTML-layout starts: each HTMLElement is positioned. The positioning depends
on the available screen-width.
### This might change still
The positioning starts with the calculation of the minimum screen-width
required to display the complete HTML page. The calcMinSize method in
HTML-clues and HTML-objects is used for this. The minimum size is calculated
recursively through all HTML-clues.
When the minimum size is known it compared against the actuaal available
When the minimum size is known it compared against the actual available
screen-size. If the minimum size is less than the available
screen-size the available screen size will be used as the maximum screen
size. If the minimum size is greater than the available size the minimum
size is used as the maximum screen size. In taht case, if configured, a
size is used as the maximum screen size. In that case, if configured, a
horizontal scrollbar will be added to be able to scroll.
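The width rule described above can be sketched as follows. This is only an
illustration: chooseLayoutWidth and LayoutWidth are hypothetical names that do
not exist in khtml, and minWidth stands for whatever calcMinSize would report.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical result type for this sketch; not part of khtml.
struct LayoutWidth {
    int width;             // width the page is laid out at
    bool needsHScrollbar;  // true when the page is wider than the view
};

// minWidth:          minimum width required to display the page
// availableWidth:    width of the visible screen area
// scrollbarsEnabled: the "if configured" condition from the text
LayoutWidth chooseLayoutWidth(int minWidth, int availableWidth,
                              bool scrollbarsEnabled)
{
    assert(minWidth >= 0 && availableWidth >= 0);

    LayoutWidth result;
    // Use the available width unless the page needs more than that.
    result.width = std::max(minWidth, availableWidth);
    // A scrollbar only makes sense when the page overflows the view.
    result.needsHScrollbar = scrollbarsEnabled && minWidth > availableWidth;
    return result;
}
```

With a minimum width of 1000 and an available width of 800, the sketch lays
the page out at 1000 and asks for a horizontal scrollbar if scrollbars are
enabled.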
Advanced Topics
The HTMLText objects
There are several text-related objects:
HTMLHSpace: A horizontal space
HTMLVSpace: A vertical space, e.g. linefeed
HTMLText: A non-breakable text object
HTMLTextMaster: A breakable text-object
HTMLLinkText: A non-breakable hyperlinked text object
HTMLLinkTextMaster: A breakable hyperlinked text object
HTMLHSpace is equivalent to HTMLText(" ", ...) but slightly smaller
in memory usage
HTMLVSpace is used for a forced line-break (e.g. linefeed)
HTMLText is used for text which shouldn't be broken.
HTMLTextMaster is used for text which may be broken on spaces,
it should only be used inside HTMLClueFlow.
For text without spaces HTMLTextMaster is equivalent
to HTMLText. In such cases HTMLText is more efficient.
HTMLLinkText is like HTMLText but can be hyperlinked.
HTMLLinkTextMaster is like HTMLTextMaster but can be hyperlinked.
Basically all functionality is provided by HTMLVSpace and HTMLText.
The additional functionality of HTMLLLinkText is not put in HTMLText
to keep the memory usage of the frequently used HTMLText object low.
Since often single spaces are used in HTML, they got their own, even
smaller object.
Another often encountered pattern is a paragraph of text. The
HTMLTextMaster is designed for this purpose. It splits the paragraph
in lines during layout and allocates a HTMLTextSlave object for each
line. The actual text itself is maintained by the HTMLTextMaster
object making efficient memory usage possible.
The HTMLTextMaster/HTMLTextSlave objects
Text sequences rendered with the same font-settings are kept in one single
string as much as possible. If such a string contains normal (breaking)
spaces, this string is converted into a HTMLTextMaster object. If such a
string contains non-breaking spaces it is converted into a HTMLText object.
The non-breaking spaces are passed as normal spaces to the HTMLText object,
since the HTMLText object does never break up any text, the spaces act as
non-breaking spaces.
If a string contains both breaking and non-breaking spaces the string is
split up across a HTMLText, HTMLSpace and HTMLTextMaster object. The reason
for this is that we can't pass non-breaking spaces to the HTMLTextMaster
object due to problems with fonts. Not all fonts print the character
0xA0. A workaround in Qt is announced. Until that workaround is mainstream,
we can't use non-breaking spaces in our HTMLTextMaster object.
The actual line-breaking is done in the HTMLTextSlave::fitLine() method. This
method is called from HTMLClueFlow::calcSize().
The parent of the HTMLTextSlave should be the HTMLClueFlow object.
In HTMLClue is also a call to fitLine(). This call has its arguments setup to
instruct the HTMLTextSlave not to do any breaking at all. HTMLClue is used
for the contents of <pre>..</pre> tags. It is rather strange if HTMLClue
contains any HTMLTextSlave objects though. Text within <pre>..</pre> tags
should all end up as HTMLText objects.
The line breaking is pretty tricky. This is how it is done without any
HTMLTextMaster objects (e.g. without fitLine functionality)
The HTMLTextMaster object contains the text string. During lay-out the
HTMLTextMaster spawns off a HTMLTextSlave which prints (a part of) the
text string. All layout issues are further handled by HTMLTextSlave.
Basic Text lay-out
khtml does now use DOM Level1 (see for details) for holding
documents. Although the dom implementation isn't finished yet, it's already
quite useable. The DOM is implemented as classes with automatic memory
management. We have internal classes (the *Impl classes) holding the data
of the DOM, but the programmer uses "pointer" classes to these internal ones.
The implementations hold a reference count of how many "pointer"/API instances
are pointing to them. Once the reference count drops to 0, the implementation
gets deleted.
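A minimal sketch of this reference counting scheme, under stated assumptions:
Node and NodeImpl are simplified stand-ins invented for this example, not the
real khtml classes, and liveCount exists only so the behaviour can be observed.

```cpp
#include <cassert>
#include <string>

// Hypothetical implementation class, standing in for a khtml *Impl class.
class NodeImpl {
public:
    explicit NodeImpl(const std::string &name) : m_name(name), m_refs(0) { ++liveCount; }
    ~NodeImpl() { --liveCount; }

    void ref() { ++m_refs; }
    void deref()
    {
        assert(m_refs > 0);
        if (--m_refs == 0)
            delete this;  // the last "pointer" object went away
    }
    const std::string &name() const { return m_name; }

    static int liveCount;  // number of live NodeImpl objects (demo only)

private:
    std::string m_name;
    int m_refs;
};
int NodeImpl::liveCount = 0;

// Hypothetical "pointer" class handed to the programmer.
class Node {
public:
    Node() : m_impl(nullptr) {}
    explicit Node(NodeImpl *impl) : m_impl(impl) { if (m_impl) m_impl->ref(); }
    Node(const Node &other) : m_impl(other.m_impl) { if (m_impl) m_impl->ref(); }
    Node &operator=(const Node &other)
    {
        if (other.m_impl) other.m_impl->ref();  // ref first: safe on self-assignment
        if (m_impl) m_impl->deref();
        m_impl = other.m_impl;
        return *this;
    }
    ~Node() { if (m_impl) m_impl->deref(); }

    bool isNull() const { return m_impl == nullptr; }
    std::string name() const { return m_impl ? m_impl->name() : std::string(); }

private:
    NodeImpl *m_impl;
};
```

Copying a Node only bumps the counter in the shared NodeImpl; the NodeImpl is
deleted when the last Node referring to it is destroyed or reassigned.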
The dom_* files implement the core DOM, the html_* files the html DOM.
As I focused on html, all classes in the core DOM used only for XML are not
Every BlockElement goes through its children during layout. Once it
encounters an inline element, it starts a paragraph. All inline elements
(and their children) are scanned until it encounters the next BlockElement.
Text and inline elements are put together, and a line breaking algorithm
decides when to start a new line.
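The text does not say which line breaking algorithm is used. As an
illustration only, a simple greedy strategy could look like the sketch below;
breakLines and its parameters are hypothetical and not taken from khtml.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Greedy line breaking over the items of one paragraph.
// itemWidths: width of each unbreakable item (a word, an inline image, ...)
// spaceWidth: width of the space placed between two items on a line
// lineWidth:  width available for one line
// Returns the index of the first item on every line.
std::vector<std::size_t> breakLines(const std::vector<int> &itemWidths,
                                    int spaceWidth, int lineWidth)
{
    assert(spaceWidth >= 0 && lineWidth > 0);

    std::vector<std::size_t> lineStarts;
    int used = 0;  // width consumed on the current line

    for (std::size_t i = 0; i < itemWidths.size(); ++i) {
        if (i == 0) {
            // The first item always opens the first line.
            lineStarts.push_back(0);
            used = itemWidths[0];
            continue;
        }
        int needed = used + spaceWidth + itemWidths[i];
        if (needed <= lineWidth) {
            used = needed;            // item fits on the current line
        } else {
            lineStarts.push_back(i);  // start a new line with this item
            used = itemWidths[i];     // may overflow if the item is too wide
        }
    }
    return lineStarts;
}
```

An item that is wider than the line is placed alone on its own line and
overflows it; a real implementation would also have to deal with floating
elements and a width that changes from line to line.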
Each frame in a HTMLDocument has a body, represented by the
HTMLBodyElement. This element holds a pointer to a HTMLScrollview, into
which the contents of the body is rendered.
Basically the line-breaking is done by
HTMLClueFlow::calcSize(). It lays out the text line by line. It collects
"run"'s of objects. A "run" is a sequence of objects with no white-space
inbetween them. If a run is complete it is checked whether the width of the
run fits the current line. If it fits another run is made until no more run
can be added: the line is full. The last run which didn't fit anymore on
the line is rejected.
If we have a line consisting of a set of runs which fit the current width
of the HTMLClueFlow we have to check whether the total available space at
the point where we are expecting the text is indeed enough for our line.
This seems redundant but although the width of the Clue is big enough to
hold the line it can be that a floating image is right beneath us and that
we therefore do not have enough space for the height of the line. If this is the
case we ask for a new position (further down in the Clue) which provides
enough height for our line. Since this position may have a different width, we
throw away our line and start making a new line given the new width.
This isn't very efficient but it shouldn't occur too often.
Text layout using fitLine().
The text-layout algorithm mentioned above assumes that all objects (making
up the runs) have a fixed width which can't be changed. The HTMLTextSlave
objects, however, can break themselves up into multiple HTMLTextSlave objects.
The fitLine() call is a hint to the HTMLTextSlave object to break itself up
if necessary. After it has done so (or if it chooses not to do so) the
HTMLTextSlave is further treated by the text-layout algorithm as a normal
fixed size object.
For efficiency the fitLine() function returns some information to the
text-layout algorithm. If HTMLTextSlave has broken itself up it makes no
sense to try to add more objects to that line. The next object will be the
remaining part of the HTMLTextSlave; if it would have fitted, the original
HTMLTextSlave would not have split in the first place. So if HTMLPartialFit
is returned, this is a hint to the layout-algorithm that with this HTMLTextSlave
the line is full and that it doesn't need to try to add more objects to that
Another possibility is that the HTMLTextSlave sees no way to break itself so
that it fits the available space. In that case it does no further attempt
and returns HTMLNoFit to the layout algorithm. This is an indication that the
run the HTMLTextSlave is part of, will not fit the available space and that it
is useless to try any further.
An exception to this rule is the case where this run is the first run of the
line. In that case the HTMLTextSlave should make itself as small as possible.
The resulting run will then not fit within the width of the current line.
However, another position for the text is searched where the line does fit.
If such a position is not found the widest available position is chosen.
In this case the text will overflow the Clue borders. This should only be
possible to happen if the width of the Clue didn't take (for some reason
or the other) the minimumWidth of its contents into account.
Memory Management
Basically all strings used in a page are allocated through the tokenizer.
Strings passed to HTMLObjects are used as provided, no copy is
made. The caller should ensure that the passed strings remain allocated during
the lifetime of the HTMLObject. The simplest way to do that is to allocate the
string through the tokenizer. All strings allocated this way are deleted when
a new page is being loaded.
By deleting the HTMLObjects before deleting the tokenizer, we ensure that the
strings will not be used after they have been deleted.
Clue nesting
The following parent-child relations can exist:
Parent | Child
,, | HTMLClueFlow
HTMLClueFlow | HTMLClueAligned
,, | (objects)
HTMLClueAligned| (objects)
All but HTMLClueH's last children should have a fixed width,
it's last child may have any kind of width.
This is the base class for all renderable HTML ELements. These are basically
all block level elements. Block level elements take care of rendering their
inline contents.
......@@ -17,40 +17,70 @@
# the Free Software Foundation, Inc., 59 Temple Place - Suite 330,
# Boston, MA 02111-1307, USA.
# for profiling.
# for profiling.
# command line for line by line profiling (into separate files):
# gprof -l -y
# AM_CXXFLAGS += -pg
libkhtml_la_SOURCES = kbrowser.cpp khtmlcache.cpp khtmlparser.cpp \
khtmljscript.cpp khtml.cpp khtmlobj.cpp khtmlfont.cpp \
khtmlform.cpp khtmltoken.cpp khtmlstring.cpp \
khtmlframe.cpp khtmlview.cpp khtmlclue.cpp debug.cpp \
khtmldata.cpp khtmltable.cpp khtmliter.cpp \
khtmlsavedpage.cpp khtmlstyle.cpp khtmlembed.cpp khtmldecoder.cpp
SUBDIRS = java kjs
libkhtml_la_METASOURCES = khtml.moc khtmlcache.moc khtmlembed.moc \
khtmlform.moc khtmlframe.moc khtmlview.moc kbrowser.moc
include_HEADERS = khtml.h khtmlview.h khtmlsavedpage.h khtmlembed.h khtmlcache.h \
noinst_HEADERS = khtmlchain.h khtmlparser.h khtmljscript.h \
khtmlobj.h khtmlfont.h khtmlform.h khtmltoken.h khtmlframe.h \
khtmlclue.h khtmldata.h khtmltable.h khtmliter.h khtmlstyle.h \
khtmldecoder.cpp khtmlstring.h
libkhtml_la_SOURCES = dtd.cpp\
khtmltoken.cpp khtmltokenizer.cpp khtmlparser.cpp\
khtmlsavedpage.cpp khtmldecoder.cpp khtml.cpp khtmlio.cpp\
khtmldata.cpp khtmltext.cpp \
khtmlstyle.cpp khtmlfont.cpp khtmlembed.cpp\
dom_doc.cpp dom_docimpl.cpp dom_element.cpp dom_elementimpl.cpp \
dom_misc.cpp dom_node.cpp dom_nodeimpl.cpp \
dom_string.cpp dom_stringimpl.cpp dom_text.cpp dom_textimpl.cpp \
dom_xml.cpp \
html_base.cpp html_baseimpl.cpp html_block.cpp html_blockimpl.cpp \
html_document.cpp html_documentimpl.cpp html_element.cpp \
html_elementimpl.cpp html_form.cpp html_formimpl.cpp \
html_head.cpp html_headimpl.cpp html_image.cpp \
html_imageimpl.cpp html_inline.cpp html_inlineimpl.cpp html_list.cpp \
html_listimpl.cpp html_misc.cpp html_miscimpl.cpp html_table.cpp \
html_tableimpl.cpp html_object.cpp html_objectimpl.cpp\
libkhtml_la_METASOURCES = khtml.moc khtmlio.moc khtmlembed.moc \
kbrowser.moc html_formimpl.moc
include_HEADERS = khtml.h khtmlsavedpage.h khtmlembed.h khtmlio.h \
kbrowser.h \
dom_core.h dom_doc.h dom_element.h dom_exception.h \
dom_misc.h dom_node.h dom_string.h dom_text.h \
dom_xml.h \
html_base.h html_block.h html_document.h html_element.h \
html_form.h html_head.h html_image.h \
html_inline.h html_list.h html_misc.h html_table.h html_object.h
noinst_HEADERS = khtmlparser.h khtmlfont.h khtmltoken.h \
khtmldata.h khtmlstyle.h khtmldecoder.cpp khtmlstring.h \
dom_docimpl.h dom_elementimpl.h dom_nodeimpl.h dom_stringimpl.h \
dom_textimpl.h html_baseimpl.h \
html_blockimpl.h html_documentimpl.h html_elementimpl.h \
html_formimpl.h html_headimpl.h html_imageimpl.h \
html_inlineimpl.h html_listimpl.h html_miscimpl.h html_tableimpl.h \
libkhtml_la_LDFLAGS = -version-info 3:0
libkhtml_la_LIBADD = ../jscript/ ../kimgio/ ../kio/
libkhtml_la_LIBADD = ./kjs/ ../kimgio/ ../kio/
data_DATA = khtml_dnd.xpm
data_DATA = khtml_dnd.png
datadir = $(kde_datadir)/khtml/pics
INCLUDES = -I$(top_srcdir)/jscript -I$(top_srcdir)/kimgio -I$(top_srcdir)/kio \
INCLUDES = -I$(top_srcdir)/khtml/kjs -I$(top_srcdir)/kimgio -I$(top_srcdir)/kio \
-I$(top_srcdir)/khtml/java $(all_includes)
......@@ -68,7 +98,12 @@ EXTRA_DIST = $(khtmltags_DIST) makeprop maketags test $(data_DATA) \
cd $(srcdir) && \
perl makeattrs && \
bash maketags && \
bash makeprop
perl scripts/makeattrs && \
bash scripts/maketags && \
bash scripts/makeprop
## test program
bin_PROGRAMS = testkhtml
testkhtml_SOURCES = testkhtml.cpp $(libkhtml_la_SOURCES)
testkhtml_LDADD = $(LIB_KDEUI) -L../kimgio -lkimgio -L../kio -lkio \
-Ljava -lkjava -Lkjs -lkjs $(LIB_DMALLOC)
......@@ -13,9 +13,10 @@ It was extended by
Josip A. Gracin <>,
Martin Jones <>,
Waldo Bastian <>
Lars Knoll <>
It is currently primarily maintained and developed by Martin Jones and
Waldo Bastian.
It is currently primarily maintained and developed by Martin Jones,
Waldo Bastian and Lars Knoll.
Revision History
Here's what's still missing (roughly in order of importance):
* Add a select(...) function to HTMLTextSlave so that "select all"
* check the remaining ascii() calls
* move the attribute parsing into the objects
* simplify the KHTMLWidget/KHTMLView api, perhaps add a widget, which
is almost a browser (move some stuff from konqueror here...)
* use KURL instead of QString url everywhere for URL's. This
would avoid quite some QString<-->KURL conversions
WABA: But for the time being, a QString is much cheaper to
store than a KURL, especially since we use HTMLStrings
for this. The only extra time spend is in the KURL.url()
call which isn't too much of an overhead. Most of the
time we don't convert a QString to a KURL a second time,
because most URLs on a page aren't used.
(You usually only click on one link/URL)
* many functions using QString now could be changed to use references,
to avoid copying of the string
* move the parsing of table tags into the main parser
* get rid of the blockingTokens in the tokenizer
* incremental table layout
* make the object tree reflect the real document structure.
Needed for dynamic html
* perhaps split up htmlobj.* into several files
* charset support (--> kcharsets)
* some html-4 elements are still not supported
* add (use) more attributes
* BiDi support
* jscript
* java
\ No newline at end of file
Here's what's still missing (without order):
* BiDi
* clean up the calcParagraph code
* handle adding of elements better
* provide a cache for sheets
* make them really CSS2 compliant
* lots of functions in the Impl classes
* mouse handling
--> partly done (clicking works), selection/DND still missing
* selection
* make all destructors in the DOM virtual
* frameset should be a QSplitter
* completely
* completely
* some attributes (%coreattrs, %events, and a few others)
should be implemented in HTML(Inline/Block)Element and not in
the classes themselves.
--> partly done
* rendering is quite slow at the moment, because of missing
checks on the bounding rectangle --> mostly done...
* remove the currentStyle from the elements
Memory usage:
* high at the moment. IMO the biggest reason is the immense
usage by the style sheet objects
Still thinking about a better solution. Perhaps one
stylesheet object would do, and the element get their
information from there (something like document->styleSheet()->XXX()).
most are still missing. Here comes a list.
a mostly done
address done
b done
big done
blockquote done (attrs missing)
body partly
br done
caption mostly
center done
cite done
code done
dir see ul
div should work
em done
form partly done
frame mostly done
frameset mostly done
h1 done
h2 done
h3 done
h4 done
h5 done
h6 done
head done
hr done
html done
i done
img mostly (maps missing)
input partly done
kbd done
li done
menu see ul
ol see ul
optgroup partly
option partly
p done
pre done
s done
samp done
select partly
small done
strike done
strong done
table mostly
tbody partly
td done
textarea mostly done
tfoot partly
th done
thead partly
tr done
tt done
u done
ul done
var done
Status of test pages (only layout at the moment, the other stuff is
still missing...):
URL1.html ok
URL2.html ok
align.html ok (crashes because some non-alignment related
list stuff, though:)
align2.html ok
anchor1.html ok
anchor2.html ok
badpages.html ok
listing.html mostly ok