Many C++ programmers, even experienced ones, think that iterators are used and useful mainly for traversing data containers. (Many such programmers have heard of other types of iterators, such as stream iterators, but they have never considered them very useful.) However, iterators are much more powerful than that.
Let's start by looking at a simple problem that can be handily solved using iterators, and then look at its applications.
Converting between different character encoding formats can be a non-trivial problem from a design point of view. The conversion routines would need to know what kind of data containers hold the character data in each encoding. Should such routines provide their own "string" data container, or should they support a variety of containers (such as the standard std::string and std::vector containers)? How about static arrays? What if a user wants to use a non-standard custom container?
All these conundrums are moot if we use the same design principle as most standard library algorithms do: Make the routines iterator-based.
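To illustrate what "iterator-based" means in practice, here is a minimal sketch built on std::count from the standard library. The helper names countSpaces and countSpacesEverywhere are made up for this illustration; the point is only that the same call accepts iterators from several unrelated containers as well as plain pointers into an array.

```cpp
#include <algorithm>
#include <cstddef>
#include <list>
#include <string>
#include <vector>

// Counts the spaces in any range of characters. Like std::count itself,
// this helper (a hypothetical name) sees only a pair of iterators and
// knows nothing about the container behind them.
template<typename InputIterator>
std::ptrdiff_t countSpaces(InputIterator begin, InputIterator end)
{
    return std::count(begin, end, ' ');
}

// The same call applied to four different sources of characters.
inline std::ptrdiff_t countSpacesEverywhere()
{
    const std::string text = "a b c";
    const std::vector<char> vec(text.begin(), text.end());
    const std::list<char> lst(text.begin(), text.end());
    const char arr[] = { 'a', ' ', 'b', ' ', 'c' };

    return countSpaces(text.begin(), text.end())
         + countSpaces(vec.begin(), vec.end())
         + countSpaces(lst.begin(), lst.end())
         + countSpaces(arr, arr + 5); // plain pointers are iterators too
}
```

Each of the four ranges contains two spaces, so countSpacesEverywhere() returns 8.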
As a simple example, let's make a UCS-2 to ISO-Latin-1 conversion routine. (UCS-2 is basically the same thing as UTF-16, except that it's limited to the Unicode code points in the range from 0 to 65535; in other words, UCS-2 characters always take 2 bytes.) Since the point of this article is not the conversion itself, this particular conversion was chosen because it's unusually trivial to implement: UCS-2/UTF-16 is identical to ISO-Latin-1 in the code point range from 0 to 255.
    template<typename InputIterator, typename OutputIterator>
    void ucs2ToISOLatin1(InputIterator inputBegin, InputIterator inputEnd,
                         OutputIterator output)
    {
        while(inputBegin != inputEnd)
        {
            if(*inputBegin < 256) *output = *inputBegin;
            else *output = '?'; // Not in ISO-Latin-1 range
            ++inputBegin;
            ++output;
        }
    }
A simple example usage of the previous function would be:
    std::vector<unsigned short> ucs2String;
    initializeWithSomething(ucs2String);

    std::string latin1String(ucs2String.size(), ' ');
    ucs2ToISOLatin1(ucs2String.begin(), ucs2String.end(),
                    latin1String.begin());
The advantage of having the ucs2ToISOLatin1() function take iterators, instead of taking data containers directly, may not be immediately apparent, but it becomes much clearer when one realizes that the function is agnostic to what kind of data container and iterators are used.
To exemplify this, consider that this will also work:
    unsigned short ucs2String[MAX_STR_LENGTH];
    initializeWithSomething(ucs2String);

    char latin1String[MAX_STR_LENGTH];
    ucs2ToISOLatin1(ucs2String, ucs2String + MAX_STR_LENGTH,
                    latin1String);
Note how one and the same ucs2ToISOLatin1() function handles both cases. (Also note that any combination, with one of the strings being a static array and the other a dynamic data container, will work equally well.)
Also, since the function uses iterators, we can convert only part of the input string rather than the entirety of it, which can sometimes be very useful.
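The mixed case and the partial conversion could look something like the sketch below. The conversion routine is repeated from the article so that the snippet compiles on its own; convertPart and convertArray are hypothetical helper names introduced only for this illustration.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// The article's routine, repeated so that this sketch compiles on its own.
template<typename InputIterator, typename OutputIterator>
void ucs2ToISOLatin1(InputIterator inputBegin, InputIterator inputEnd,
                     OutputIterator output)
{
    while(inputBegin != inputEnd)
    {
        if(*inputBegin < 256) *output = *inputBegin;
        else *output = '?'; // Not in ISO-Latin-1 range
        ++inputBegin;
        ++output;
    }
}

// Converts only `count` characters of `source`, starting at index `first`.
// (Hypothetical helper: the caller must make sure that first + count
// does not exceed source.size().)
inline std::string convertPart(const std::vector<unsigned short>& source,
                               std::size_t first, std::size_t count)
{
    std::string result(count, ' ');
    ucs2ToISOLatin1(source.begin() + first,
                    source.begin() + first + count,
                    result.begin());
    return result;
}

// Mixed case: the input is a static array, the output a std::string.
inline std::string convertArray()
{
    const unsigned short ucs2[] = { 'H', 'e', 'l', 'l', 'o', 0x20AC };
    const std::size_t length = sizeof(ucs2) / sizeof(ucs2[0]);

    std::string result(length, ' ');
    ucs2ToISOLatin1(ucs2, ucs2 + length, result.begin());
    return result; // U+20AC is outside ISO-Latin-1 and becomes '?'
}
```

In both helpers the only thing that changes is which iterators are passed in; the conversion function itself is untouched.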
There's a slight problem in the usage examples above: They assume that the result will have as many characters as the input. This is true when converting a UCS-2 encoded string into an ISO-Latin-1 one. However, this is most certainly not true with other encodings. For example, converting a UTF-16 string into a UTF-8 one will often produce more output elements than there were input elements (because every code point above 127 has to be encoded as at least two bytes in UTF-8). It's even theoretically possible for our ucs2ToISOLatin1() function to write more characters to the output than it read from the input (unless its documentation specifically states otherwise), so it's potentially dangerous to use it like above.
However, this isn't a problem at all, thanks to more specialized iterators. The problem described above can be easily solved by doing it like this:
    std::vector<unsigned short> ucs2String;
    initializeWithSomething(ucs2String);

    std::string latin1String;
    ucs2ToISOLatin1(ucs2String.begin(), ucs2String.end(),
                    std::back_inserter(latin1String));
Now the function can safely generate more output than there is input and there will be no out-of-bounds accesses: Instead, latin1String will grow as needed.
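To see a case where the output really is longer than the input, here is a rough sketch of a UCS-2 to UTF-8 routine written in the same style. The names ucs2ToUTF8 and toUTF8 are hypothetical, the sketch assumes that every input value fits in 16 bits, and it does not treat surrogate values (0xD800 to 0xDFFF) specially, so it should not be taken as a complete converter.

```cpp
#include <iterator>
#include <string>
#include <vector>

// Hypothetical routine in the same style as ucs2ToISOLatin1(): writes each
// UCS-2 code unit as one, two or three UTF-8 bytes. Assumes input values
// of at most 0xFFFF; surrogates are not handled specially in this sketch.
template<typename InputIterator, typename OutputIterator>
void ucs2ToUTF8(InputIterator inputBegin, InputIterator inputEnd,
                OutputIterator output)
{
    for(; inputBegin != inputEnd; ++inputBegin)
    {
        const unsigned c = *inputBegin;
        if(c < 0x80) // 1 byte: 0xxxxxxx
        {
            *output = static_cast<char>(c); ++output;
        }
        else if(c < 0x800) // 2 bytes: 110xxxxx 10xxxxxx
        {
            *output = static_cast<char>(0xC0 | (c >> 6)); ++output;
            *output = static_cast<char>(0x80 | (c & 0x3F)); ++output;
        }
        else // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        {
            *output = static_cast<char>(0xE0 | ((c >> 12) & 0x0F)); ++output;
            *output = static_cast<char>(0x80 | ((c >> 6) & 0x3F)); ++output;
            *output = static_cast<char>(0x80 | (c & 0x3F)); ++output;
        }
    }
}

// The output string starts empty; std::back_inserter grows it as needed.
inline std::string toUTF8(const std::vector<unsigned short>& ucs2String)
{
    std::string utf8String;
    ucs2ToUTF8(ucs2String.begin(), ucs2String.end(),
               std::back_inserter(utf8String));
    return utf8String;
}
```

With a pre-sized output buffer of ucs2String.size() characters this routine would write past the end for any non-ASCII input. With std::back_inserter the caller does not need to know the output length in advance.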
As for another trick, consider this:
    std::vector<unsigned short> ucs2String;
    initializeWithSomething(ucs2String);

    ucs2ToISOLatin1(ucs2String.begin(), ucs2String.end(),
                    std::ostream_iterator<char>(std::cout));
Now we are printing the resulting string directly to std::cout. (This can be especially handy when debugging an application that uses e.g. UCS-2 or UTF-16 strings as its native string format. Printing such strings usually needs a conversion to UTF-8 or whatever the terminal is using. It would be possible to first convert the string to that format and then print the result, but why go through that trouble when we can print it directly like this? A function call like the one above, using a stream iterator, is a handy way of implementing an operator<<(std::ostream&, ...) for a custom string format that uses e.g. UCS-2 as its character encoding.)
In the example above we had five different usage situations:
All cases were handled by one single function (and this function didn't need to take the different situations into account in any way), thanks to the iterator idiom.