Lexical syntax and datum syntax

The syntax of Scheme code is organized in three levels:

  1. the lexical syntax that describes how a program text is split into a sequence of lexemes,

  2. the datum syntax, formulated in terms of the lexical syntax, that structures the lexeme sequence as a sequence of syntactic data, where a syntactic datum is a recursively structured entity,

  3. the program syntax formulated in terms of the read syntax, imposing further structure and assigning meaning to syntactic data.

Syntactic data (also called external representations) double as a notation for objects, and Scheme’s (rnrs io ports (6)) library (library section on “Port I/O”) provides the get-datum and put-datum procedures for reading and writing syntactic data, converting between their textual representation and the corresponding objects. Each syntactic datum represents a corresponding datum value. A syntactic datum can be used in a program to obtain the corresponding datum value using quote (see section 11.4.1).

Scheme source code consists of syntactic data and (non-significant) comments. Syntactic data in Scheme source code are called forms. (A form nested inside another form is called a subform.) Consequently, Scheme’s syntax has the property that any sequence of characters that is a form is also a syntactic datum representing some object. This can lead to confusion, since it may not be obvious out of context whether a given sequence of characters is intended to be a representation of objects or the text of a program. It is also a source of power, since it facilitates writing programs such as interpreters or compilers that treat programs as objects (or vice versa).

A datum value may have several different external representations. For example, both “#e28.000” and “#x1c” are syntactic data representing the exact integer object 28, and the syntactic data “(8 13)”, “( 08 13 )”, “(8 . (13 . ()))” all represent a list containing the exact integer objects 8 and 13. Syntactic data that represent equal objects (in the sense of equal?; see section 11.5) are always equivalent as forms of a program.

Because of the close correspondence between syntactic data and datum values, this report sometimes uses the term datum for either a syntactic datum or a datum value when the exact meaning is apparent from the context.

An implementation must not extend the lexical or datum syntax in any way, with one exception: it need not treat the syntax #!<identifier>, for any <identifier> (see section 4.2.4) that is not r6rs, as a syntax violation, and it may use specific #!-prefixed identifiers as flags indicating that subsequent input contains extensions to the standard lexical or datum syntax. The syntax #!r6rs may be used to signify that the input afterward is written with the lexical syntax and datum syntax described by this report. #!r6rs is otherwise treated as a comment; see section 4.2.3.

4.1  Notation

The formal syntax for Scheme is written in an extended BNF. Non-terminals are written using angle brackets. Case is insignificant for non-terminal names.

All spaces in the grammar are for legibility. <Empty> stands for the empty string.

The following extensions to BNF are used to make the description more concise: <thing>* means zero or more occurrences of <thing>, and <thing>+ means at least one <thing>.

Some non-terminal names refer to the Unicode scalar values of the same name: <character tabulation> (U+0009), <linefeed> (U+000A), <carriage return> (U+000D), <line tabulation> (U+000B), <form feed> (U+000C), <carriage return> (U+000D), <space> (U+0020), <next line> (U+0085), <line separator> (U+2028), and <paragraph separator> (U+2029).

4.2  Lexical syntax

The lexical syntax determines how a character sequence is split into a sequence of lexemes, omitting non-significant portions such as comments and whitespace. The character sequence is assumed to be text according to the Unicode standard [27]. Some of the lexemes, such as identifiers, representations of number objects, strings etc., of the lexical syntax are syntactic data in the datum syntax, and thus represent objects. Besides the formal account of the syntax, this section also describes what datum values are represented by these syntactic data.

The lexical syntax, in the description of comments, contains a forward reference to <datum>, which is described as part of the datum syntax. Being comments, however, these <datum>s do not play a significant role in the syntax.

Case is significant except in representations of booleans, number objects, and in hexadecimal numbers specifying Unicode scalar values. For example, #x1A and #X1a are equivalent. The identifier Foo is, however, distinct from the identifier FOO.

4.2.1  Formal account

<Interlexeme space> may occur on either side of any lexeme, but not within a lexeme.

<Identifier>s, ., <number>s, <character>s, and <boolean>s, must be terminated by a <delimiter> or by the end of the input.

The following two characters are reserved for future extensions to the language: { }

<lexeme> → <identifier> | <boolean> | <number>
         | <character> | <string>
         | ( | ) | [ | ] | #( | #vu8( | | | , | ,@ | .
         | # | # | #, | #,@
<delimiter> → ( | ) | [ | ] | " | ; | #
         | <whitespace>
<whitespace> → <character tabulation>
         | <linefeed> | <line tabulation> | <form feed>
         | <carriage return> | <next line>
         | <any character whose category is Zs, Zl, or Zp>
<line ending> → <linefeed> | <carriage return>
         | <carriage return> <linefeed> | <next line>
         | <carriage return> <next line> | <line separator>
<comment> → ; ⟨all subsequent characters up to a
         <line ending> or <paragraph separator>⟩
         | <nested comment>
         | #; <interlexeme space> <datum>
         | #!r6rs
<nested comment> → #| <comment text>
         <comment cont>* |#
<comment text> → ⟨character sequence not containing
         #| or |#
<comment cont> → <nested comment> <comment text>
<atmosphere> → <whitespace> | <comment>
<interlexeme space> → <atmosphere>*

<identifier> → <initial> <subsequent>*
         | <peculiar identifier>
<initial> → <constituent> | <special initial>
         | <inline hex escape>
<letter> → a | b | c | ... | z
         | A | B | C | ... | Z
<constituent> → <letter>
         | ⟨any character whose Unicode scalar value is greater than
             127, and whose category is Lu, Ll, Lt, Lm, Lo, Mn,
             Nl, No, Pd, Pc, Po, Sc, Sm, Sk, So, or Co⟩
<special initial> → ! | $ | % | & | * | / | : | < | =
         | > | ? | ^ | _ | ~
<subsequent> → <initial> | <digit>
         | <any character whose category is Nd, Mc, or Me>
         | <special subsequent>
<digit> → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
<hex digit> → <digit>
         | a | A | b | B | c | C | d | D | e | E | f | F
<special subsequent> → + | - | . | @
<inline hex escape> → \x<hex scalar value>;
<hex scalar value> → <hex digit>+
<peculiar identifier> → + | - | ... | -> <subsequent>*
<boolean> → #t | #T | #f | #F
<character> → #\<any character>
         | #\<character name>
         | #\x<hex scalar value>
<character name> → nul | alarm | backspace | tab
         | linefeed | newline | vtab | page | return
         | esc | space | delete
<string> → " <string element>* "
<string element> → <any character other than " or \>
         | \a | \b | \t | \n | \v | \f | \r
         | \" | \\
         | \<intraline whitespace><line ending>
            <intraline whitespace>
         | <inline hex escape>
<intraline whitespace> → <character tabulation>
         | <any character whose category is Zs>

A <hex scalar value> represents a Unicode scalar value between 0 and #x10FFFF, excluding the range [#xD800, #xDFFF].

The rules for <num R>, <complex R>, <real R>, <ureal R>, <uinteger R>, and <prefix R> below should be replicated for R = 2, 8, 10, and 16. There are no rules for <decimal 2>, <decimal 8>, and <decimal 16>, which means that number representations containing decimal points or exponents must be in decimal radix.

<number> → <num 2> | <num 8>
         | <num 10> | <num 16>
<num R> → <prefix R> <complex R>
<complex R> → <real R> | <real R> @ <real R>
         | <real R> + <ureal R> i | <real R> - <ureal R> i
         | <real R> + <naninf> i | <real R> - <naninf> i
         | <real R> + i | <real R> - i
         | + <ureal R> i | - <ureal R> i
         | + <naninf> i | - <naninf> i
         | + i | - i
<real R> → <sign> <ureal R>
         | + <naninf> | - <naninf>
<naninf> → nan.0 | inf.0
<ureal R> → <uinteger R>
         | <uinteger R> / <uinteger R>
         | <decimal R> <mantissa width>
<decimal 10> → <uinteger 10> <suffix>
         | . <digit 10>+ <suffix>
         | <digit 10>+ . <digit 10>* <suffix>
         | <digit 10>+ . <suffix>
<uinteger R> → <digit R>+
<prefix R> → <radix R> <exactness>
         | <exactness> <radix R>

<suffix> → <empty>
         | <exponent marker> <sign> <digit 10>+
<exponent marker> → e | E | s | S | f | F
         | d | D | l | L
<mantissa width> → <empty>
         | | <digit 10>+
<sign> → <empty> | + | -
<exactness> → <empty>
         | #i| #I | #e| #E
<radix 2> → #b| #B
<radix 8> → #o| #O
<radix 10> → <empty> | #d | #D
<radix 16> → #x| #X
<digit 2> → 0 | 1
<digit 8> → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7
<digit 10> → <digit>
<digit 16> → <hex digit>

4.2.2  Line endings

Line endings are significant in Scheme in single-line comments (see section 4.2.3) and within string literals. In Scheme source code, any of the line endings in <line ending> marks the end of a line. Moreover, the two-character line endings <carriage return> <linefeed> and <carriage return> <next line> each count as a single line ending.

In a string literal, a <line ending> not preceded by a \ stands for a linefeed character, which is the standard line-ending character of Scheme.

4.2.3  Whitespace and comments

Whitespace characters are spaces, linefeeds, carriage returns, character tabulations, form feeds, line tabulations, and any other character whose category is Zs, Zl, or Zp. Whitespace is used for improved readability and as necessary to separate lexemes from each other. Whitespace may occur between any two lexemes, but not within a lexeme. Whitespace may also occur inside a string, where it is significant.

The lexical syntax includes several comment forms. In all cases, comments are invisible to Scheme, except that they act as delimiters, so, for example, a comment cannot appear in the middle of an identifier or representation of a number object.

A semicolon (;) indicates the start of a line comment.The comment continues to the end of the line on which the semicolon appears.

Another way to indicate a comment is to prefix a <datum> (cf. section 4.3.1) with #;, possibly with <interlexeme space> before the <datum>. The comment consists of the comment prefix #; and the <datum> together. This notation is useful for “commenting out” sections of code.

Block comments may be indicated with properly nested #|and |# pairs.

#|
   The FACT procedure computes the factorial
   of a non-negative integer.
|#
(define fact
  (lambda (n)
    ;; base case
    (if (= n 0)
        #;(= n 1)
        1       ; identity of *
        (* n (fact (- n 1))))))

The lexeme #!r6rs, which signifies that the program text that follows is written with the lexical and datum syntax described in this report, is also otherwise treated as a comment.

4.2.4  Identifiers

Most identifiersallowed by other programming languages are also acceptable to Scheme. In general, a sequence of letters, digits, and “extended alphabetic characters” is an identifier when it begins with a character that cannot begin a representation of a number object. In addition, +, -, and ... are identifiers, as is a sequence of letters, digits, and extended alphabetic characters that begins with the two-character sequence ->. Here are some examples of identifiers:

lambda         q                soup
list->vector   +                V17a
<=             a34kTMNs         ->-
the-word-recursion-has-many-meanings

Extended alphabetic characters may be used within identifiers as if they were letters. The following are extended alphabetic characters:

! $ % & * + - . / : < = > ? @ ^ _ ~ 

Moreover, all characters whose Unicode scalar values are greater than 127 and whose Unicode category is Lu, Ll, Lt, Lm, Lo, Mn, Mc, Me, Nd, Nl, No, Pd, Pc, Po, Sc, Sm, Sk, So, or Co can be used within identifiers. In addition, any character can be used within an identifier when specified via an <inline hex escape>. For example, the identifier H\x65;llo is the same as the identifier Hello, and the identifier \x3BB; is the same as the identifier λ.

Any identifier may be used as a variableor as a syntactic keyword(see sections 5.2 and 9.2) in a Scheme program. Any identifier may also be used as a syntactic datum, in which case it represents a symbol(see section 11.10).

4.2.5  Booleans

The standard boolean objects for true and false have external representations #t and #f.

4.2.6  Characters

Characters are represented using the notation #\<character>or #\<character name> or #\x<hex scalar value>.

For example:

#\a lower case letter a
#\A upper case letter A
#\( left parenthesis
#\ space character
#\nul U+0000
#\alarm U+0007
#\backspace U+0008
#\tab U+0009
#\linefeed U+000A
#\newline U+000A
#\vtab U+000B
#\page U+000C
#\return U+000D
#\esc U+001B
#\space U+0020
preferred way to write a space
#\delete U+007F
#\xFF U+00FF
#\x03BB U+03BB
#\x00006587 U+6587
#\λ U+03BB
#\x0001z &lexical exception
#\λx &lexical exception
#\alarmx &lexical exception
#\alarm x U+0007
followed by x
#\Alarm &lexical exception
#\alert &lexical exception
#\xA U+000A
#\xFF U+00FF
#\xff U+00FF
#\x ff U+0078
followed by another datum, ff
#\x(ff) U+0078
followed by another datum,
a parenthesized ff
#\(x) &lexical exception
#\(x &lexical exception
#\((x) U+0028
followed by another datum,
parenthesized x
#\x00110000 &lexical exception
out of range
#\x000000001 U+0001
#\xD800 &lexical exception
in excluded range

(The notation &lexical exception means that the line in question is a lexical syntax violation.)

Case is significant in #\<character>, and in #\⟨character name⟩, but not in #\x<hex scalar value>. A <character> must be followed by a <delimiter> or by the end of the input. This rule resolves various ambiguous cases involving named characters, requiring, for example, the sequence of characters “#\space” to be interpreted as the space character rather than as the character “#\s” followed by the identifier “pace”.

Note:   The #\newline notation is retained for backward compatibility. Its use is deprecated; #\linefeed should be used instead.

4.2.7  Strings

String are represented by sequences of characters enclosed within doublequotes ("). Within a string literal, various escape sequencesrepresent characters other than themselves. Escape sequences always start with a backslash (\):

These escape sequences are case-sensitive, except that the alphabetic digits of a <hex scalar value> can be uppercase or lowercase.

Any other character in a string after a backslash is a syntax violation. Except for a line ending, any character outside of an escape sequence and not a doublequote stands for itself in the string literal. For example the single-character string literal "λ" (doublequote, a lower case lambda, doublequote) represents the same string as "\x03bb;". A line ending that does not follow a backslash stands for a linefeed character.

Examples:

"abc" U+0061, U+0062, U+0063
"\x41;bc" "Abc" ; U+0041, U+0062, U+0063
"\x41; bc" "A bc"
U+0041, U+0020, U+0062, U+0063
"\x41bc;" U+41BC
"\x41" &lexical exception
"\x;" &lexical exception
"\x41bx;" &lexical exception
"\x00000041;" "A" ; U+0041
"\x0010FFFF;" U+10FFFF
"\x00110000;" &lexical exception
out of range
"\x000000001;" U+0001
"\xD800;" &lexical exception
in excluded range
"A
bc" U+0041, U+000A, U+0062, U+0063
if no space occurs after the A

4.2.8  Numbers

The syntax of external representations for number objects is described formally by the <number> rule in the formal grammar. Case is not significant in external representations of number objects.

A representation of a number object may be written in binary, octal, decimal, or hexadecimal by the use of a radix prefix. The radix prefixes are #b(binary), #o(octal), #d(decimal), and #x(hexadecimal). With no radix prefix, a representation of a number object is assumed to be expressed in decimal.

A representation of a number object may be specified to be either exact or inexact by a prefix. The prefixes are #efor exact, and #ifor inexact. An exactness prefix may appear before or after any radix prefix that is used. If the representation of a number object has no exactness prefix, the constant is inexact if it contains a decimal point, an exponent, or a nonempty mantissa width; otherwise it is exact.

In systems with inexact number objects of varying precisions, it may be useful to specify the precision of a constant. For this purpose, representations of number objects may be written with an exponent marker that indicates the desired precision of the inexact representation. The letters s, f, d, and l specify the use of short, single, double, and long precision, respectively. (When fewer than four internal inexact representations exist, the four size specifications are mapped onto those available. For example, an implementation with two internal representations may map short and single together and long and double together.) In addition, the exponent marker e specifies the default precision for the implementation. The default precision has at least as much precision as double, but implementations may wish to allow this default to be set by the user.

3.1415926535898F0 
       Round to single, perhaps 3.141593
0.6L0
       Extend to long, perhaps .600000000000000

A representation of a number object with nonempty mantissa width, x|p, represents the best binary floating-point approximation of x using a p-bit significand. For example, 1.1|53 is a representation of the best approximation of 1.1 in IEEE double precision. If x is an external representation of an inexact real number object that contains no vertical bar, then its numerical value should be computed as though it had a mantissa width of 53 or more.

Implementations that use binary floating-point representations of real number objects should represent x|p using a p-bit significand if practical, or by a greater precision if a p-bit significand is not practical, or by the largest available precision if p or more bits of significand are not practical within the implementation.

Note:   The precision of a significand should not be confused with the number of bits used to represent the significand. In the IEEE floating-point standards, for example, the significand’s most significant bit is implicit in single and double precision but is explicit in extended precision. Whether that bit is implicit or explicit does not affect the mathematical precision. In implementations that use binary floating point, the default precision can be calculated by calling the following procedure:

(define (precision)
  (do ((n 0 (+ n 1))
       (x 1.0 (/ x 2.0)))
    ((= 1.0 (+ 1.0 x)) n)))

Note:   When the underlying floating-point representation is IEEE double precision, the |p suffix should not always be omitted: Denormalized floating-point numbers have diminished precision, and therefore their external representations should carry a |p suffix with the actual width of the significand.

The literals +inf.0 and -inf.0 represent positive and negative infinity, respectively. The +nan.0 literal represents the NaN that is the result of (/ 0.0 0.0), and may represent other NaNs as well.

If x is an external representation of an inexact real number object and contains no vertical bar and no exponent marker other than e, the inexact real number object it represents is a flonum (see library section on “Flonums”). Some or all of the other external representations of inexact real number objects may also represent flonums, but that is not required by this report.

4.3  Datum syntax

The datum syntax describes the syntax of syntactic datain terms of a sequence of <lexeme>s, as defined in the lexical syntax.

Syntactic data include the lexeme data described in the previous section as well as the following constructs for forming compound data:

4.3.1  Formal account

The following grammar describes the syntax of syntactic data in terms of various kinds of lexemes defined in the grammar in section 4.2:

<datum> → <lexeme datum>
         | <compound datum>
<lexeme datum> → <boolean> | <number>
         | <character> | <string> | <symbol>
<symbol> → <identifier>
<compound datum> → <list> | <vector> | <bytevector>
<list> → (<datum>*) | [<datum>*]
         | (<datum>+ . <datum>) | [<datum>+ . <datum>]
         | <abbreviation>
<abbreviation> → <abbrev prefix> <datum>
<abbrev prefix> → ’ | ‘ | , | ,@
         | #’ | #‘ | #, | #,@
<vector> → #(<datum>*)
<bytevector> → #vu8(<u8>*)
<u8> → ⟨any <number> representing an exact
                   integer in {0, ..., 255}⟩

4.3.2  Pairs and lists

List and pair data, representing pairs and lists of values (see section 11.9) are represented using parentheses or brackets. Matching pairs of brackets that occur in the rules of <list> are equivalent to matching pairs of parentheses.

The most general notation for Scheme pairs as syntactic data is the “dotted” notation (<datum1> . <datum2>) where <datum1> is the representation of the value of the car field and <datum2> is the representation of the value of the cdr field. For example (4 . 5) is a pair whose car is 4 and whose cdr is 5.

A more streamlined notation can be used for lists: the elements of the list are simply enclosed in parentheses and separated by spaces. The empty listis represented by () . For example,

(a b c d e)

and

(a . (b . (c . (d . (e . ())))))

are equivalent notations for a list of symbols.

The general rule is that, if a dot is followed by an open parenthesis, the dot, open parenthesis, and matching closing parenthesis can be omitted in the external representation.

The sequence of characters “(4 . 5)” is the external representation of a pair, not an expression that evaluates to a pair. Similarly, the sequence of characters “(+ 2 6)” is not an external representation of the integer 8, even though it is an expression (in the language of the (rnrs base (6)) library) evaluating to the integer 8; rather, it is a syntactic datum representing a three-element list, the elements of which are the symbol + and the integers 2 and 6.

4.3.3  Vectors

Vector data, representing vectors of objects (see section 11.13), are represented using the notation #(<datum> ...). For example, a vector of length 3 containing the number object for zero in element 0, the list (2 2 2 2) in element 1, and the string "Anna" in element 2 can be represented as follows:

#(0 (2 2 2 2) "Anna")

This is the external representation of a vector, not an expression that evaluates to a vector.

4.3.4  Bytevectors

Bytevector data, representing bytevectors (see library chapter on “Bytevectors”), are represented using the notation #vu8(<u8> ...), where the <u8>s represent the octets of the bytevector. For example, a bytevector of length 3 containing the octets 2, 24, and 123 can be represented as follows:

#vu8(2 24 123)

This is the external representation of a bytevector, and also an expression that evaluates to a bytevector.

4.3.5  Abbreviations

<datum>     
<datum>     
,<datum>     
,@<datum>     
#’<datum>     
#<datum>     
#,<datum>     
#,@<datum>     

Each of these is an abbreviation:
    <datum> for (quote <datum>),
    <datum> for (quasiquote <datum>),
    ,<datum> for (unquote <datum>),
    ,@<datum> for (unquote-splicing <datum>),
    #’<datum> for (syntax <datum>),
    #‘<datum> for (quasisyntax <datum>),
    #,<datum> for (unsyntax <datum>), and
    #,@<datum> for (unsyntax-splicing <datum>).