eText

[1/2] from: al:bri:xtra at: 14-Nov-2000 23:56

Here's some preliminary code for my eText script (it's nearly right; there are just a few little problems I haven't fixed yet):

[
Rebol [
    Name: 'eText
    Title: "eText"
    File: %"eText.r"
    Author: "Andrew Martin"
    eMail: [Al--Bri--xtra--co--nz]
    Date: 14/November/2000
    Home: http://members.nbci.com/AndrewMartin/Rebol/eText/eText.r
    Version: 1.0.0
    Purpose: {Takes a string of text and transforms it into HTML dialect.}
    Usage: [
        write %Document.html eText read %Document.txt
    ]
]

do %../units/ascii.r
do %../units/flow.r

eText!: make object! [
    ; Escape HTML special characters and substitute typographic entities.
    Replacements: func [Document [string!]] [
        foreach [Target Replacement] [
            "&" "&amp;"
            "<" "&lt;"
            ">" "&gt;"
            "---" "&mdash;"     ; Em dash
            "--" "&ndash;"      ; En dash
            "-TM" "&trade;"     ; Trademark
            "(C)" "&copy;"      ; Copyright
            "(R)" "&reg;"       ; Registered trademark
            "..." "&hellip;"    ; Ellipsis
            "1/4" "&frac14;"
            "1/2" "&frac12;"
            "3/4" "&frac34;"
            " x " " &times; "   ; Multiply sign
            " / " " &divide; "  ; Division sign
        ][
            replace/all Document Target Replacement
        ]
        Document
    ]
    Printable: ASCII/Printable
    Dialect: make block! 0
    Text: none
    InitialParagraph: true
    Fragment: false
    ; The first line of the document is the title.
    Title: [
        copy Text some Printable "^/" (
            append Dialect either Fragment [
                reduce [
                    'h1 Text
                ]
            ][
                reduce [
                    'title Text
                    'body
                    'h1 Text
                ]
            ]
        )
    ]
    ; A heading is a line with two newlines before it and two after it.
    Heading: [
        2 "^/" copy Text some Printable 2 "^/" (
            InitialParagraph: true
            repend Dialect [
                'h2 Text
            ]
        )
    ]
    Paragraph: [
        "^/" copy Text some Printable opt "^/" (
            repend Dialect [
                either InitialParagraph [
                    InitialParagraph: false
                    'ip
                ][
                    'p
                ]
                trim/lines Text
            ]
        )
    ]
    ; Code lines are indented by at least two tabs.
    CodeLines: make block! 0
    CodeLine: [
        2 "^-" any "^-" copy Text some Printable "^/" (
            append CodeLines join Text "^/"
        )
    ]
    Code: [
        "^/" some CodeLine "^/" (
            repend Dialect [
                'BlockQuote 'Pre CodeLines
            ]
            CodeLines: make block! 0
            InitialParagraph: true
        )
    ]
    ; A list item is a key character, one or more tabs, then the item text.
    ListItems: make block! 0
    ListItem: [
        some "^-" copy Text some Printable "^/" (
            append ListItems Text
        )
    ]
    ListType: none
    BulletListItem: ["*" ListItem (ListType: none)]
    CapitalLetterListItem: ["A" ListItem (ListType: "A")]
    LowercaseLetterListItem: ["a" ListItem (ListType: "a")]
    CapitalRomanListItem: ["I" ListItem (ListType: "I")]
    LowercaseRomanListItem: ["i" ListItem (ListType: "i")]
    ArabicListItem: [["1" | "0"] ListItem (ListType: "1")]
    List: [
        "^/" some [
            BulletListItem
            | CapitalLetterListItem
            | LowercaseLetterListItem
            | CapitalRomanListItem
            | LowercaseRomanListItem
            | ArabicListItem
        ] "^/" (
            append Dialect either none? ListType [
                reduce ['list ListItems]
            ][
                reduce ['list/type ListItems ListType]
            ]
            ListItems: make block! 0
            InitialParagraph: true
        )
    ]
    ; Table rows start with "|"; cells in the first row become TH, later cells TD.
    InitialTableRow: true
    TableText: difference ASCII/Printable charset "|^/"
    TableCells: make block! 0
    TableCell: [
        copy Text some TableText ["|" | "^/"] (
            repend TableCells [
                either InitialTableRow ['TH]['TD] trim Text
            ]
        )
    ]
    TableRows: make block! 0
    TableRow: [
        "|" some TableCell (
            repend TableRows [
                'TR TableCells
            ]
            InitialTableRow: false
            TableCells: make block! 0
        )
    ]
    Table: [
        "^/" some TableRow "^/" (
            repend Dialect [
                'Table TableRows
            ]
            TableRows: make block! 0
            InitialParagraph: true
            InitialTableRow: true
        )
    ]
    HorizontalRule: [
        any "^/" 3 "-" any "-" opt "^/" (append Dialect 'hr)
    ]
    set 'eText func [
        {Takes a string of text and transforms it into HTML dialect.}
        Document [string!]
        /Fragment {This text is only a fragment of a larger HTML page.}
    ][
        self/Fragment: Fragment
        parse/case/all Replacements flow Document [
            any [
                Title
                any [
                    Heading
                    | HorizontalRule
                    | Table
                    | Code
                    | List
                    | Paragraph
                ]
            ]
            ; Print the start of anything the rules could not match.
            copy Text to end (print copy/part Text 100)
        ]
        Dialect
    ]
]
]

What does it do?
It converts plain ASCII text with "natural" formatting, like this (Emma from the Gutenberg project):

Emma, by Jane Austen


VOLUME I


CHAPTER I

Emma Woodhouse, handsome, clever, and rich, with a comfortable home and happy disposition, seemed to unite some of the best blessings of existence; and had lived nearly twenty-one years in the world with very little to distress or vex her.

She was the youngest of the two daughters of a most affectionate, indulgent father; and had, in consequence of her sister's marriage, been mistress of his house from a very early period. Her mother had died too long ago for her to have more than an indistinct remembrance of her caresses; and her place had been supplied by an excellent woman as governess, who had fallen little short of a mother in affection.

To nice HTML code like this:

<html>
<head>
<title>Emma, by Jane Austen</title>
</head>
<body>
<h1>Emma, by Jane Austen</h1>
<h2>VOLUME I</h2>
<h2>CHAPTER I</h2>
Emma Woodhouse, handsome, clever, and rich, with a comfortable home and happy disposition, seemed to unite some of the best blessings of existence; and had lived nearly twenty-one years in the world with very little to distress or vex her.
<br>&nbsp &nbsp &nbsp She was the youngest of the two daughters of a most affectionate, indulgent father; and had, in consequence of her sister's marriage, been mistress of his house from a very early period. Her mother had died too long ago for her to have more than an indistinct remembrance of her caresses; and her place had been supplied by an excellent woman as governess, who had fallen little short of a mother in affection.

There's also facility for tables, lists and putting in code examples, and so on.

Andrew Martin
ICQ: 26227169
http://members.nbci.com/AndrewMartin/

[2/2] from: al:bri:xtra at: 18-Nov-2000 8:58

Earlier I wrote:

> ...eText to XML/XHTML/HTML and not bother to inflict markup on people.

Instead, I'll use white space intelligently, along with Rebol embedded in the script (in a very nice way), to generate web pages.

Here is the eText specification that Garold and I have been working on. Naturally, the document itself is in the eText format.

We're discussing eText on this list: [Web_Dialect--egroups--com]. Subscribe to the list: [Web_Dialect-subscribe--egroups--com].

Andrew Martin
ICQ: 26227169
http://members.nbci.com/AndrewMartin/
-><-

-- Attached file included as plaintext by Listar --
-- File: eText_am_02_glj.txt

eText

Note: The above line is the title of the HTML document and the text in the H1 tag for the HTML.
Author: Andrew Martin
eMail: [Al--Bri--xtra--co--nz]
Date: 16/November/2000
Site: http://members.nbci.com/AndrewMartin/
Comment: Did you know that writing the contents of a Rebol header is now quite natural (at least to me)? First word followed by a colon is the trigger for META data in the eText document. The data for the item continues on until the end of the line. The pattern could be continued...

GLJ -- I have used the word / phrase ':' construct routinely since long before Rebol. I think we should retain it, but I am not certain yet just how best to do that.


This is a header


Why is the above line a header (H2 in HTML)? Because it's separated from the text above and below by 2 blank lines. It's also short, less than 40 odd characters, so it must be a header. It also _doesn't_ have a terminating period or full stop at the end or some other sentence terminator.

GLJ -- I see that we are working from slightly different perspectives. I was not working on guessing at the intended or incidental structure of totally free format text. Rather, I was looking at text that was intended to have a structure but was prepared in ASCII without access to further formatting. That is why I don't object to marking such things as indent levels. I also tend to think in outlines naturally, and to compose that way.

This is a subsequent paragraph.
This paragraph will have the first line indented in the nice-looking HTML version. The above paragraph is an initial paragraph, a paragraph that shouldn't have the first line indented. Note that I'm letting my text editor wrap lines appropriately, as I can't be bothered doing it for my tools; I'd much rather let my tools wrap my text for me. Wouldn't you?

I can tell that the above blocks of text are paragraphs, because they have a full stop at the end, are long, and have only one blank line between them. They also have multiple sentences in them, with a period (or other similar terminator, like "-", ":", ";", "?" or "!") at the end of each sentence.

GLJ -- I work in a programmer's editor which word wraps automatically, rather than something like Notepad or WordPad where the wrapping is only visual. I definitely *do* want the text re-wrapped as needed. HTML will wrap it anyway. This implies to me that blank lines separate paragraphs. This runs me into problems with list items, which I tend not to separate.


Double Quotes


Surrounding text in double quotes should leave the text unchanged. This effect stops at a newline or end of line, just in case they're not balanced? We might need to think more about this.

GLJ -- No-Tags treats any item that would normally expect a balancing item as being a single character rather than markup if there is no balancing character within the paragraph. That is, effects are limited to paragraphs. This needs some work.

Short single lines of text between paragraphs, with one blank line before and after and no sentence terminator, should be a H3 heading or sub-heading. Perhaps at most 40 characters in length.

GLJ -- How do you tell H2 from H3 from ...? I find a need for a title and at least 3 levels of heading. Since HTML supports 6 levels I saw no reason not to do so as well.
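
The heading rules discussed so far (a short line, no sentence terminator, set off by blank lines) can be sketched as a small classifier. This is only an illustration in Python, not the eText parser itself: the function name `heading_level`, its blank-line-count parameters, and the exact cut-off of 40 characters are assumptions made for the example.

```python
from typing import Optional

# Sentence terminators named in the text; a line ending in one is not a heading.
TERMINATORS = ".-:;?!"
# Assumed cut-off, taken from "at most 40 characters in length".
MAX_HEADING_LENGTH = 40


def heading_level(line: str, blank_before: int, blank_after: int) -> Optional[str]:
    """Return "h2", "h3", or None for one line of text.

    blank_before and blank_after count the blank lines around the line.
    """
    text = line.strip()
    if not text or len(text) > MAX_HEADING_LENGTH:
        return None          # empty or too long to be a heading
    if text[-1] in TERMINATORS:
        return None          # ends like a sentence, so treat it as a paragraph
    if blank_before >= 2 and blank_after >= 2:
        return "h2"          # two blank lines on each side
    if blank_before >= 1 and blank_after >= 1:
        return "h3"          # one blank line on each side
    return None


if __name__ == "__main__":
    print(heading_level("This is a header", 2, 2))                   # h2
    print(heading_level("If I try to trick the interpreter", 1, 1))  # h3
    print(heading_level("This is a subsequent paragraph.", 1, 1))    # None
```

A guess of this kind cannot tell a short unterminated sentence fragment from a real sub-heading, so a way to override the result would still be needed.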

If I insist on writing one sentence long paragraphs, going on and on, droning about nothing at all, until you are tired reading this text, it will come out all inside a H2 paragraph tag, and will be very obvious to all, and so should be very embarrassing at immediate glance, that the full stop or period is missing at the end of this sentence

If I try to trick the interpreter

The above line could be considered a sub-header. Or it might be a sentence fragment. It's short enough to be considered a sub-header, so even if the interpreter is wrong and makes it a sub-header using the H3 HTML tag, it still makes the error obvious to human eyes.

----


The above should be one section of text


The above line should be a header, H2 in HTML.

I totally agree with your statement: "The Basic Idea of eText is to allow documents to be created in relatively plain text that is still human readable." I'd also like to add that eText should be easy to create and modify for unsophisticated users, who only know how to type, and for computer professionals, who may be a bit tired and want something that's obvious. eText also *shouldn't* require manual line wrapping. I'm using a Windows Notepad replacement to generate this text; the standard Windows Notepad should be able to cope as well. Just turn "Word Wrap" on.

GLJ -- I agree that eText should be suitable for unsophisticated users. I tend to think that I want a smooth transition to fairly complete control. I would like my plain text to allow me to go quite a ways before I need to move to better tools. I would even consider marking the document for level of formality, from no markup to intermediate markup to full markup. I have no quarrel with the translator guessing at this. For example, if I use '#' to mark headers it should feel free to assume that all headers are marked, etc. I find that programs that get too smart can get very hard to outsmart. I at least want a way to override the program's guesses.
That would allow me to feed plain text in and then go "fix up" the places where it didn't guess right. I will discuss the purposes of eText later.

I've improved on my ideas for recognising headers by closely examining your text in markup.mtx and noticing how your level 3 headings had a blank line before and after. Going backwards to two blank lines to separate level 2 headings is consistent with Project Gutenberg eText, and seems fairly obvious.


Character-wise markup


For character-wise markup, I'd like to use the following (basically as you suggest):

*bold* - Asterisk text asterisk is bold or strong emphasis.
_underline_ - Underline text underline is underline or mild emphasis.
~italic~ - Tilde text tilde is italic.
=fixed= - Equals text equals is fixed width.
-- - Two dash inside text is en-dash.
--- - Three dash inside text is em-dash.
---- - Four dash or more in the left column is a horizontal rule.
==== - Four equals or more in the left column is a bigger break, perhaps using the DIV tag in HTML. I'm not sure what it should be yet.
^text - superscripts the following text until the first white space.
^^text - subscripts the following text?

The above text should end up as a HTML table of definitions. The pattern should be:

		Text [TAB | SPACE] "-" [TAB | SPACE] text

The above line (effectively code or script) should be blockquoted, preformatted and in typewriter font. The pattern is two tabs in from the left margin. The text should be unchanged, except for HTML tags, which should be translated so as to be visible in a browser. In other words, "<" should be "&lt;", ">" should be "&gt;" and "&" should be "&amp;".

GLJ -- While I went back and changed this to a table, I wouldn't have considered doing it automatically. I often do such things as this with responses -- or other material -- set off by dashes. If all text were prepared with paragraphs recognized by newline, then perhaps the fact that there were multiple lines could signal a table?


Embedding HTML


Do we really need to?
A graphically intensive page would be better designed using GIF, JPG, Flash and Style Sheets, and should have a computer graphics artist working on the project. There would be very little human-readable text in the page, I feel. Still, if we're careful, it could be included?

GLJ -- The only reasons for considering embedded HTML are: 1) Twiki allowed it and was one of the systems I was considering, 2) It is currently the only approach I have to font size and color, and 3) The notation doesn't rule it out. I don't care for embedded HTML, but more and more people are allowing it and using it. Also formatting within table cells is easier with some of the tags. I object to HTML as being the only way to do things which some systems insist on.


Embedding Rebol script


Rebol Server Pages (RSP)

I think the ASP "standard", that I adapted for Rebol, should serve:

<%! - Directives or meta-information.
<% - Rebol code that is not intended to return a value.
<%: - Rebol code that returns a value.
%> - Terminate any of the above opening tags.

The ASP "standard" uses "@" for directives and "=" for expressions that return a value.

GLJ -- I don't have a quarrel with any procedure that works here. I think that something that looks familiar to anyone who has embedded other languages in HTML is acceptable. I don't think there is any reason for introducing anything really foreign looking. The suggested use of braces came from MTX and Latte which found that they provided minimal interference.

Embedded Rebol

For simple embedding of Rebol values in the text, I'd suggest using ":" or colon to mark the start of rebol word to "get". For example, the date now is :now. The ":now" will be replaced with the rebol value for now, 17/Oct/2010 or whatever the time/date is. If the value is a file!, then the value gets substituted appropriately into the resulting HTML file.

GLJ -- Ok -- provisionally. I think that trying to guess too cleverly will result in surprises.
I prefer that all programs adhere to the WYGINS (What You Get Is No Surprise) principle.


Links


Here's a link to my site: http://members.nbci.com/AndrewMartin/. My email address: [Al--Bri--xtra--co--nz]. Here's some boiler plate text to click on: file:MyBoilerPlate.txt.

Note that Rebol and Oscar can recognise url! and email! datatypes. I think a sentence that starts with an initial capital letter and is terminated with a ":", followed by a url! or email! (or "text.txt" for file!) and then a period, should be sufficient grounds to be considered a link. So the above text should look like this in HTML:

		<a href="http://members.nbci.com/AndrewMartin/">Here's a link to my site</a>.

and:

		<a href="mailto:[Al--Bri--xtra--co--nz]">My email address</a>.

and:

		<a href="MyBoilerPlate.txt">Here's some boiler plate text to click on</a>.

Note that the ":" has been removed, and the "." is after the link. I've found that having the period outside the link is more pleasing, to my eye at least.

GLJ -- The URL and the email address are fine. Most browsers and modern mail clients will do this. I am wary of the file notation as there isn't any standard practice that I can tie it to.

GLJ -- "Here's some boiler plate text to click on: file:MyBoilerPlate.txt." I think that the "file:" sneaked in there. "Here's some boiler plate text to click on: MyBoilerPlate.txt." matches the pattern you describe [ Initial cap words: <something linkable>. ] and works for me. It is not so strange that I couldn't adjust.

A picture could be recognised by the above pattern, then checked for a .gif or .jpg extension, substituting an "<img>" tag for the "<a>" tag. The text before the colon substitutes for the ALT text.

GLJ -- Once we recognize that we have a link, trying to interpret the extension is perfectly reasonable.
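
A minimal sketch of the link pattern described above (initial capital, some text, a colon, something linkable, a period), written in Python purely for illustration. The function name `link_sentence_to_html`, the regular expression, the "@" test used to spot an email address, the trailing period after an image, and the sample targets are all assumptions made for the example; they are not part of the eText script.

```python
import re
from html import escape

# Initial capital, label text without a colon, ": ", a target without spaces, final period.
LINK_SENTENCE = re.compile(r"^([A-Z][^:]*): (\S+)\.$")
# Extensions treated as pictures, as suggested in the text.
IMAGE_EXTENSIONS = (".gif", ".jpg")


def link_sentence_to_html(sentence: str) -> str:
    """Turn 'Label text: target.' into an HTML link or image; otherwise return it unchanged."""
    match = LINK_SENTENCE.match(sentence.strip())
    if match is None:
        return sentence
    label, target = match.groups()
    # Assumption: a target containing "@" and no "://" is an email address.
    is_email = "@" in target and "://" not in target
    href = escape("mailto:" + target if is_email else target)
    if target.lower().endswith(IMAGE_EXTENSIONS):
        # The text before the colon becomes the ALT text.
        return f'<img src="{href}" alt="{escape(label)}">.'
    # The colon is dropped and the period stays outside the link.
    return f'<a href="{href}">{escape(label, quote=False)}</a>.'


if __name__ == "__main__":
    print(link_sentence_to_html("Here's a link to my site: http://members.nbci.com/AndrewMartin/."))
    print(link_sentence_to_html("My email address: someone@example.com."))
    print(link_sentence_to_html("A picture of my cat: cat.gif."))
```

Sentences that do not fit the pattern come back unchanged, which keeps the guess conservative: nothing becomes a link unless it has both the colon and the closing period.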

When creating the page interactively, like in a Wiki or Sparrow, when a link to a local file is created, the system will create a "blank" page for the link to go to (if there's not already a page of that name). The system should also be free to modify the case of the file name in the link to agree with pre-existing local files. It should also substitute "%20" for spaces in filenames, much like MS Internet Explorer does, and Rebol does with URLs.

GLJ -- I think I like the Wiki idea of a '?' that is a link to create a non-existent local link. It is easy to get used to and immediately points out spelling mistakes when presented.


Wiki Word Links


I dislike the WikiWordLinks. For example a link to Wiki requires the word to be written as "WIki", which is ~unnatural~ to me.

GLJ -- It depends on what you are doing. If you are working in a Wiki where most of the point is linking to other material, I think WikiWords make lots of sense. The fact that there is a surprise in other text is unfortunate, but the alternative is a requirement to format every embedded link specially, which violates our premise of natural ease of use. Wiki suffers from lack of any easy way to mark non-WikiWord links. The WikiWord tends to be natural to those who came from programming environments such as Pascal. Also, Wiki is not intended to translate text that wasn't created for it.


Including files


:"My Boiler Plate Text File.txt"

The above line is a command to include the contents of the %"My Boiler Plate Text File.txt" straight into this text when converted to HTML. The colon can be read as "get (or evaluate) the thing to the right". The complete contents of the file are substituted for the line (including the newline at the end).

GLJ -- This is starting to get dicey. This construct doesn't suggest to me any special treatment. It is available for use from our syntax rules, but I don't care for it.
I suggest that at the level of file inclusion and variables we consider that we have entered the realm of embedded Rebol. I would prefer to see this as

		<%: %"My Boiler Plate Text File.txt" %>

This is embedded Rebol returning a value which is the content of a file. I don't even object to using specific directives for this.

I think that "%*.txt" files should be executed in this page as if their contents had been written in manually. I'm not sure what should happen with other file extensions. Perhaps "%*.html" and "%*.htm" files should be inserted *after* the page gets translated into HTML?

GLJ -- I agree that "%*.txt" files should get included at the time of construction. If we allow embedded HTML, there is no problem with including them at the time we expand the file. Rebol Server Pages presumably have a ".rsp" extension and will get included when seen also.


Lists


I like just simply using an asterisk in the left margin for unordered lists:

*	My first list item.
*	Another item.

I think that's reasonably sensible. An ordered list just substitutes "I" for Roman (followed by a tab), "0" or zero for Arabic, "A" or "a" for capital or lower case letters. For nested lists, we could try simply using one or more tabs (or several spaces) before the list item. Like:

*	Unordered list item.
	0	First item
	0	Second item
	0	Third
*	Option
	I	Roman list item
		A	Arabic list item
		A	Another
	I	More romans

Naturally, the system numbers list items sequentially, and doesn't care when we shuffle the order of the above, and will always number the items correctly. An optional few characters after the list item "key", like ")./" should be easy to add to the parse rules for list items. An improvement on the above would be to allow 0 - 9 for Arabic, correctly written roman numerals for Roman, A - Z and a - z for Letters.

GLJ -- It seems to me that bulleted lists are unordered and ordered lists have values. I wouldn't format an ordered list without item "numbers".
Output should create the numbers in sequence, as they will automatically if we use lists in HTML. Most of the systems require whitespace to indent a list item and measure whitespace to determine sub lists, but there is no real reason that the top-level list needs to be indented. I think determining sublists from indentation makes sense. I suggest something like: 1) The first list element determines the type of the (sub)list -- 'A' - Upper Alpha, 'a' - Lower Alpha, 'I' - Upper Roman, 'i' - Lower Roman, '0-9' - Numbered, '*' | '-' | 'o' (others?) - bulleted (unordered) list. I don't know about the support for numbering styles in all browsers. The intent is that if it looks like lists it should create lists. I think that implies allow