5.2 The string type and its double quoted syntax
As of SWI-Prolog version 7, text enclosed in double quotes (e.g., "Hello world") is read as an object of the type string.
A string is a compact representation of a character sequence that lives
on the global (term) stack. Strings represent sequences of Unicode
characters including the character code 0 (zero). The length of strings is
limited by the available space on the global (term) stack (see
set_prolog_stack/2).
Strings are distinct from lists, which makes it possible to detect them
at runtime and print them using the string syntax, as illustrated below:
?- write("Hello world!").
Hello world!

?- writeq("Hello world!").
"Hello world!"
Back quoted text (as in `text`) is mapped to a
list of character codes in version 7. The settings for the flags
that control how double and back quoted text is read are summarised in
table 8.
Programs that aim for compatibility should realise that the ISO standard
defines back quoted text, but does not define the back_quotes
Prolog flag and does not define the term that is produced by back quoted
text.
Mode | double_quotes | back_quotes |
Version 7 default | string | codes |
--traditional | codes | symbol_char |
Section 5.2.4 motivates the introduction of strings and mapping double quoted text to this type.
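Because strings are a type of their own, a program can tell at runtime which representation it was handed. The following is a minimal sketch, assuming the version 7 defaults from the table above (double_quotes set to string, back_quotes set to codes). The helper text_kind/2 is hypothetical and not part of the system.

```prolog
% text_kind(+Term, -Kind): hypothetical helper that classifies Term.
% Assumes the version 7 default flags for reading quoted text.
text_kind(Term, Kind) :-
    (   string(Term)
    ->  Kind = string
    ;   is_list(Term), maplist(integer, Term)
    ->  Kind = code_list
    ;   Kind = other
    ).
```

Under these assumptions, text_kind("abc", K) gives K = string, text_kind(`abc`, K) gives K = code_list and text_kind(abc, K) gives K = other.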
5.2.1 Predicates that operate on strings
Strings may be manipulated by a set of predicates that is similar to the manipulation of atoms. In addition to the list below, string/1 performs the type check for this type and is described in section 4.6.
SWI-Prolog's string primitives are being synchronized with ECLiPSe (http://eclipseclp.org/wiki/Prolog/Strings). We expect the set of predicates documented in this section to be stable, although it might be expanded. In general, SWI-Prolog's text manipulation predicates accept any form of text as input argument and produce the type indicated by the predicate name as output. This policy simplifies migration and writing programs that can run unmodified or with minor modifications on systems that do not support strings. Code should avoid relying on this feature as much as possible, both for clarity and to facilitate a stricter mode and/or type checking in future releases.
- atom_string(?Atom, ?String)
- Bi-directional conversion between an atom and a string. At least one of the two arguments must be instantiated. Atom can also be an integer or floating point number.
- number_string(?Number, ?String)
- Bi-directional conversion between a number and a string. At least one of
the two arguments must be instantiated. Besides the type used to
represent the text, this predicate differs in several ways from its ISO
cousin (note that SWI-Prolog's syntax for numbers is not ISO compatible
either):
- If String does not represent a number, the predicate fails rather than throwing a syntax error exception.
- Leading white space and Prolog comments are not allowed.
- Numbers may start with '+' or '-'.
- It is not allowed to have white space between a leading '+' or '-' and the number.
- Floating point numbers in exponential notation do not require a dot
before the exponent, i.e., "1e10" is a valid number.
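Since number_string/2 fails instead of raising a syntax error, it can be used directly as a test on untrusted input. The sketch below illustrates this; number_or_default/3 is a hypothetical helper introduced only for this example.

```prolog
% number_or_default(+Text, +Default, -Number): hypothetical helper.
% Relies on number_string/2 failing, rather than raising a syntax
% error, when Text does not denote a number.
number_or_default(Text, Default, Number) :-
    (   number_string(N, Text)
    ->  Number = N
    ;   Number = Default
    ).
```

For example, number_or_default("1e10", 0, N) binds N to a float equal to 1.0e10, number_or_default("-7", 0, N) binds N to -7, and number_or_default("abc", 0, N) falls back to 0.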
- term_string(?Term, ?String)
- Bi-directional conversion between a term and a string. If String
is instantiated, it is parsed and the result is unified with Term.
Otherwise Term is `written' using the option quoted(true)
and the result is converted to String.
- term_string(?Term, ?String, +Options)
- As term_string/2,
passing Options to either read_term/2
or write_term/2.
For example:
?- term_string(Term, 'a(A)', [variable_names(VNames)]).
Term = a(_G1466),
VNames = ['A'=_G1466].
- string_chars(?String, ?Chars)
- Bi-directional conversion between a string and a list of characters (one-character atoms). At least one of the two arguments must be instantiated.
- string_codes(?String, ?Codes)
- Bi-directional conversion between a string and a list of character codes. At least one of the two arguments must be instantiated.
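The conversion predicates above can be combined to view one piece of text in several representations. This is a small sketch; text_views/4 is a hypothetical name used only for illustration.

```prolog
% text_views(+String, -Atom, -Chars, -Codes): hypothetical helper that
% converts one string into the three other text representations.
text_views(String, Atom, Chars, Codes) :-
    atom_string(Atom, String),
    string_chars(String, Chars),
    string_codes(String, Codes).
```

Calling text_views("hi", A, Chars, Codes) yields A = hi, Chars = [h,i] and Codes = [104,105].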
- [det]text_to_string(+Text, -String)
- Converts Text to a string. Text is an atom, string
or list of characters (codes or chars). When running in
--traditional mode, '[]' is ambiguous and interpreted as an empty string.
- string_length(+String, -Length)
- Unify Length with the number of characters in String. This predicate is functionally equivalent to atom_length/2 and also accepts atoms, integers and floats as its first argument.
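As a sketch of how text_to_string/2 and string_length/2 fit together, the hypothetical helper below first normalises any of the accepted text representations to a string and then measures it.

```prolog
% text_length(+Text, -Length): hypothetical helper. Text may be an
% atom, a string, or a list of codes or chars.
text_length(Text, Length) :-
    text_to_string(Text, String),
    string_length(String, Length).
```

For instance, text_length(abc, L), text_length("abc", L) and text_length([a,b,c], L) all give L = 3.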
- string_code(?Index, +String, ?Code)
- True when Code represents the character at the 1-based Index
position in String. If Index is unbound the string
is scanned from index 1. Raises a domain error if Index is
negative. Fails silently if Index is zero or greater than the
length of String. The mode string_code(-,+,+)
is deterministic if the searched-for Code appears only once in String. See also sub_string/5.
- get_string_code(+Index, +String, -Code)
- Semi-deterministic version of string_code/3.
In addition, this version provides strict range checking, throwing a
domain error if Index is less than 1 or greater than the
length of String. ECLiPSe provides this to support String[Index] notation.
- string_concat(?String1, ?String2, ?String3)
- Similar to atom_concat/3, but the unbound argument will be unified with a string object rather than an atom. Also, if both String1 and String2 are unbound and String3 is bound to text, it breaks String3, unifying the start with String1 and the end with String2 as append does with lists. Note that this is not particularly fast on long strings, as for each redo the system has to create two entirely new strings, while the list equivalent only creates a single new list-cell and moves some pointers around.
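The non-deterministic mode of string_concat/3 described above can be observed by collecting all solutions. A minimal sketch, where string_splits/2 is a hypothetical helper:

```prolog
% string_splits(+String, -Splits): hypothetical helper that collects
% every Prefix-Suffix pair whose concatenation is String.
string_splits(String, Splits) :-
    findall(Prefix-Suffix,
            string_concat(Prefix, Suffix, String),
            Splits).
```

For a two-character string, string_splits("ab", S) gives S = [""-"ab", "a"-"b", "ab"-""]. Each solution creates two new strings, which is the cost mentioned above.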
- [det]split_string(+String, +SepChars, +PadChars, -SubStrings)
- Break String into SubStrings. The SepChars
argument provides the characters that act as separators and thus the
length of
SubStrings is one more than the number of separators found if
SepChars and PadChars do not have common
characters. If
SepChars and PadChars are equal, sequences of
adjacent separators act as a single separator. Leading and trailing
characters for each substring that appear in PadChars are
removed from the substring. The input arguments can be either atoms,
strings or char/code lists. Compatible with ECLiPSe. Below are some
examples:
% a simple split
?- split_string("a.b.c.d", ".", "", L).
L = ["a", "b", "c", "d"].

% Consider sequences of separators as a single one
?- split_string("/home//jan///nice/path", "/", "/", L).
L = ["home", "jan", "nice", "path"].

% split and remove white space
?- split_string("SWI-Prolog, 7.0", ",", " ", L).
L = ["SWI-Prolog", "7.0"].

% only remove leading and trailing white space
?- split_string(" SWI-Prolog ", "", "\s\t\n", L).
L = ["SWI-Prolog"].
In the typical use cases, SepChars either does not overlap PadChars or is identical to it, in which case multiple adjacent separators (often white space) are handled as a single separator. The behaviour with partially overlapping sets of padding and separators should be considered undefined. See also read_string/5.
- sub_string(+String, ?Before, ?Length, ?After, ?SubString)
- SubString is a substring of String. There are Before
characters in String before SubString, SubString
contains Length characters and is followed by After
characters in String. If not enough information is provided
to compute the start of the match, String is scanned
left-to-right. This predicate is functionally equivalent to sub_atom/5,
but operates on strings. The following example splits a string of the
form
<name>=<value> into the name part (an
atom) and the value (a string).
name_value(String, Name, Value) :-
    sub_string(String, Before, _, After, "="),
    !,
    sub_string(String, 0, Before, _, NameString),
    atom_string(Name, NameString),
    sub_string(String, _, After, 0, Value).
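When only Length is given, sub_string/5 scans the string from left to right and enumerates every match on backtracking. The sketch below collects these matches; substrings_of_length/3 is a hypothetical helper.

```prolog
% substrings_of_length(+String, +Length, -Subs): hypothetical helper
% that collects all substrings of the given length, left to right.
substrings_of_length(String, Length, Subs) :-
    findall(Sub,
            sub_string(String, _, Length, _, Sub),
            Subs).
```

For example, substrings_of_length("abcd", 2, Subs) gives Subs = ["ab", "bc", "cd"].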
- atomics_to_string(+List, -String)
- List is a list of strings, atoms, integers or floating point
numbers. Succeeds if String can be unified with the
concatenated elements of List. Equivalent to atomics_to_string(List, '', String).
- atomics_to_string(+List, +Separator, -String)
- Creates a string just like atomics_to_string/2,
but inserts
Separator between each pair of inputs. For example:
?- atomics_to_string([gnu, "gnat", 1], ', ', A).
A = "gnu, gnat, 1"
- string_upper(+String, -UpperCase)
- Convert String to upper case and unify the result with UpperCase.
- string_lower(+String, -LowerCase)
- Convert String to lower case and unify the result with LowerCase.
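A common use of the case conversion predicates is a case-insensitive comparison. A minimal sketch, with same_ignoring_case/2 as a hypothetical helper:

```prolog
% same_ignoring_case(+A, +B): hypothetical helper that succeeds if the
% two strings are equal after converting both to lower case.
same_ignoring_case(A, B) :-
    string_lower(A, Lower),
    string_lower(B, Lower).
```

With this definition, same_ignoring_case("Hello", "HELLO") succeeds and same_ignoring_case("Hello", "Help") fails.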
- read_string(+Stream, ?Length, -String)
- Read at most Length characters from Stream and return them in the string String. If Length is unbound, Stream is read to the end and Length is unified with the number of characters read.
- read_string(+Stream, +SepChars, +PadChars, -Sep, -String)
- Read a string from Stream, providing functionality similar to
split_string/4.
The predicate performs the following steps:
- Skip all characters that match PadChars
- Read up to a character that matches SepChars or end of file
- Discard trailing characters that match PadChars from the collected input
- Unify String with a string created from the input and Sep with the separator character read. If input was terminated by the end of the input, Sep is unified with -1.
The predicate read_string/5 called repeatedly on an input until Sep is -1 (end of file) is equivalent to reading the entire file into a string and calling split_string/4, provided that SepChars and PadChars are not partially overlapping (behaviour that is fully compatible would require unlimited look-ahead). Below are some examples:
% Read a line
read_string(Input, "\n", "\r", End, String)

% Read a line, stripping leading and trailing white space
read_string(Input, "\n", "\r\t ", End, String)

% Read up to , or ), unifying End with 0', or 0')
read_string(Input, ",)", "\t ", End, String)
- open_string(+String, -Stream)
- True when Stream is an input stream that accesses the content of String. String can be any text representation, i.e., string, atom, list of codes or list of characters.
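The relation between read_string/5 and split_string/4 stated above can be sketched by reading fields from a stream opened with open_string/2 until the separator is -1. The predicates read_fields/4 and fields_from_text/4 are hypothetical helpers written for this illustration, and the sketch assumes that SepChars and PadChars do not partially overlap.

```prolog
% read_fields(+Stream, +SepChars, +PadChars, -Fields): hypothetical
% helper that calls read_string/5 until end of input is reached.
read_fields(Stream, SepChars, PadChars, Fields) :-
    read_string(Stream, SepChars, PadChars, End, Field),
    (   End == -1
    ->  Fields = [Field]
    ;   Fields = [Field|Rest],
        read_fields(Stream, SepChars, PadChars, Rest)
    ).

% fields_from_text(+Text, +SepChars, +PadChars, -Fields): read the
% fields from Text through a temporary input stream.
fields_from_text(Text, SepChars, PadChars, Fields) :-
    setup_call_cleanup(
        open_string(Text, Stream),
        read_fields(Stream, SepChars, PadChars, Fields),
        close(Stream)).
```

For example, fields_from_text("a, b ,c", ",", " ", Fields) gives Fields = ["a", "b", "c"], the same list that split_string("a, b ,c", ",", " ", Fields) produces.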
5.2.2 Representing text: strings, atoms and code lists
With the introduction of strings as a Prolog data type, there are three main ways to represent text: using strings, atoms or code lists. This section explains what to choose for what purpose. Both strings and atoms are atomic objects: you can only look inside them using dedicated predicates. Lists of character codes are compound data structures.
- Lists of character codes
- is what you need if you want to parse text using Prolog grammar rules (DCGs, see phrase/3). Most of the text reading predicates (e.g., read_line_to_codes/2) return a list of character codes because most applications need to parse these lines before the data can be processed.
- Atoms
- are identifiers. They are typically used in cases where
identity comparison is the main operation and that are typically neither
composed nor taken apart. Examples are RDF resources (URIs that identify
something), system identifiers (e.g., 'Boeing 747'), but also individual words in a natural language processing system. They are also used where other languages would use enumerated types, such as the names of days in the week. Unlike enumerated types, Prolog atoms do not form a fixed set and the same atom can represent different things in different contexts.
- Strings
- typically represent text that is processed as a unit most of the time, but which is not an identifier for something. Format specifications for format/3 are a good example. Another example is a descriptive text provided in an application. Strings may be composed and decomposed using e.g., string_concat/3 and sub_string/5 or converted for parsing using string_codes/2 or created from codes generated by a generative grammar rule, also using string_codes/2.
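The division of labour described in this section, where text is kept as a string and converted to a code list only for parsing, can be sketched as follows. The grammar digits//1 and the predicate string_natural/2 are hypothetical names used only for illustration.

```prolog
% digits(-Codes)//0: one or more decimal digit codes (hypothetical).
digits([D|Ds]) --> digit(D), digits(Ds).
digits([D])    --> digit(D).

digit(D) --> [D], { code_type(D, digit) }.

% string_natural(+String, -N): parse a string of digits as an integer
% by converting it to a code list and running the grammar over it.
string_natural(String, N) :-
    string_codes(String, Codes),
    once(phrase(digits(Ds), Codes)),
    number_codes(N, Ds).
```

Here string_natural("2024", N) gives N = 2024, while string_natural("20x4", N) fails because the grammar does not cover the whole input.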
5.2.3 Adapting code for double quoted strings
The predicates in this section can help adapting your program to the new convention for handling double quoted strings. We have adapted a huge code base with which we were not familiar in about half a day.
- list_strings
- This predicate may be used to assess compatibility issues due to the
representation of double quoted text as string objects. See
section 5.2 and section
5.2.4. To use it, load your program into Prolog and run list_strings/0.
The predicate lists source locations of string objects encountered in
the program that are not considered safe. Such strings need to be
examined manually, after which one of the actions below may be
appropriate:
- Rewrite the code. For example, change [X] = "a" into X = 0'a.
. - If a particular module relies heavily on representing strings as
lists of character code, consider adding the following directive to the
module. Note that this flag only applies to the module in which it
appears.
:- set_prolog_flag(double_quotes, codes).
- Use a back quoted string (e.g., `text`). Note that this will not make your code run regardless of the --traditional command line option and code exploiting this mapping is also not portable to ISO compliant systems.
- If the strings appear in facts and usage is safe, add a clause to the multifile predicate check:string_predicate/1 to silence list_strings/0 on all clauses of that predicate.
- If the strings appear as an argument to a predicate that can handle string objects, add a clause to the multifile predicate check:valid_string_goal/1 to silence list_strings/0.
- check:string_predicate(:PredicateIndicator)
- Declare that PredicateIndicator has clauses that contain
strings, but that this is safe. For example, if there is a predicate
help_info/2, where the second argument contains a double quoted string
that is handled properly by the predicates of the application's help
system, add the following declaration to stop
list_strings/0
from complaining:
:- multifile check:string_predicate/1.

check:string_predicate(user:help_info/2).
- check:valid_string_goal(:Goal)
- Declare that calls to Goal are safe. The module qualification
is the actual module in which Goal is defined. For example, a
call to format/3
is resolved by the predicate system:format/3, and the code below
specifies that the second argument may be a string (system predicates
that accept strings are defined in the library).
:- multifile check:valid_string_goal/1.

check:valid_string_goal(system:format(_,S,_)) :- string(S).
5.2.4 Why has the representation of double quoted text changed?
Prolog defines two forms of quoted text. Traditionally, single quoted text is mapped to atoms while double quoted text is mapped to a list of character codes (integers) or characters represented as 1-character atoms. Representing text using atoms is often considered inadequate for several reasons:
- It hides the conceptual difference between text and program symbols.
Where content of text often matters because it is used in I/O, program
symbols are merely identifiers that match with the same symbol
elsewhere. Program symbols can often be consistently replaced, for
example to obfuscate or compact a program.
- Atoms are globally unique identifiers. They are stored in a shared
table. Volatile strings represented as atoms come at a significant price
due to the required cooperation between threads for creating atoms.
Reclaiming temporary atoms using Atom garbage collection is a
costly process that requires significant synchronisation.
- Many Prolog systems (not SWI-Prolog) put severe restrictions on the length of atoms or the maximum number of atoms.
Representing text as a list of character codes or 1-character atoms also comes at a price:
- It is not possible to distinguish (at runtime) a list of integers or
atoms from a string. Sometimes this information can be derived from
(implicit) typing. In other cases the list must be embedded in a
compound term to distinguish the two types. For example, s("hello world")
could be used to indicate that we are dealing with a string.
Lacking runtime information, debuggers and the toplevel can only use heuristics to decide whether to print a list of integers as such or as a string (see portray_text/1).
While experienced Prolog programmers have learned to cope with this, we still consider this an unfortunate situation.
- Lists are expensive structures, taking 2 cells per character (3 for SWI-Prolog in its current form). This stresses memory consumption on the stacks, while pushing them onto the stack and dealing with them during garbage collection is unnecessarily expensive.
We observe that in many programs, most strings are only handled as a single unit during their lifetime. Examining real code tells us that double quoted strings typically appear in one of the following roles:
- A DCG literal
- Although a list of codes is the correct representation for handling in DCGs, the DCG translator can recognise the literal and convert it to the proper representation. Such code need not be modified.
- A format string
- This is a typical example of text that is conceptually not a program identifier. Format is designed to deal with alternative representations of the format string. Such code need not be modified.
- Getting a character code
- The construct [X] = "a" is a commonly used template for getting the character code of the letter 'a'. ISO Prolog defines the syntax 0'a for this purpose. Code using this must be modified. The modified code will run on any ISO compliant processor.
- As argument to list predicates to operate on strings
- Here, we see code such as append("name:", Rest, Codes). Such code needs to be modified. In this particular example, the following is a good portable alternative: phrase("name:", Codes, Rest)
- Checks for a character to be in a set
- Such tests are often performed with code such as this: memberchk(C, "~!@#$"). This is a rather inefficient check in a traditional Prolog system because it pushes a list of character codes cell-by-cell onto the Prolog stack and then traverses this list cell-by-cell to see whether one of the cells unifies with C. If the test is successful, the string will eventually be subject to garbage collection. The best code for this is to write a predicate as below, which pushes nothing on the stack and performs an indexed lookup to see whether the character code is in `my_class'.
my_class(0'~).
my_class(0'!).
...
An alternative to reach the same effect is to use term expansion to create the clauses:
term_expansion(my_class(_), Clauses) :-
    findall(my_class(C),
            string_code(_, "~!@#$", C),
            Clauses).

my_class(_).
Finally, the predicate string_code/3 can be exploited directly as a replacement for the memberchk/2 on a list of codes. Although the string is still pushed onto the stack, it is more compact and only a single entity.
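As a self-contained sketch of the fact-based character class recommended above, the example below spells out one fact per character code and uses it to filter a code list. The names punct_code/1 and count_punct/2 are hypothetical, and the set of characters is chosen only for illustration.

```prolog
% punct_code(?Code): hypothetical character class, one fact per code.
% First-argument indexing makes the lookup cheap when Code is bound.
punct_code(0'~).
punct_code(0'!).
punct_code(0'@).
punct_code(0'#).
punct_code(0'$).

% count_punct(+Codes, -Count): number of class members in a code list.
count_punct(Codes, Count) :-
    include(punct_code, Codes, Hits),
    length(Hits, Count).
```

For example, after string_codes("a!b#", Codes), the call count_punct(Codes, N) gives N = 2.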
We offer the predicate list_strings/0 to help porting your program.