unicode.h

Engine/source/core/strings/unicode.h


Public Functions

bool
chompUTF8BOM(const char * inString, char ** outStringPtr)

Functions to read and validate UTF BOMs (Byte Order Marks). For reference: http://en.wikipedia.org/wiki/Byte_Order_Mark

U32
convertUTF16toUTF8(const UTF16 * unistring, UTF8(&outbuffer)[N])

Safe conversion function for statically sized buffers.

U32
convertUTF16toUTF8N(const UTF16 * unistring, UTF8 * outbuffer, U32 len)

U32
convertUTF8toUTF16(const UTF8 * unistring, UTF16(&outbuffer)[N])

Safe conversion function for statically sized buffers.

U32
convertUTF8toUTF16N(const UTF8 * unistring, UTF16 * outbuffer, U32 len)

Functions that convert buffers of Unicode code points into a provided buffer.

UTF16 *
createUTF16string(const UTF8 * unistring)

Unicode conversion utility functions.

UTF8 *
createUTF8string(const UTF16 * unistring)

const UTF16 *
dStrchr(const UTF16 * unistring, U32 c)

UTF16 *
dStrchr(UTF16 * unistring, U32 c)

U32
dStrlen(const UTF16 * unistring)

Functions that calculate the length of Unicode strings.

U32
dStrlen(const UTF32 * unistring)

const UTF16 *
dStrrchr(const UTF16 * unistring, U32 c)

UTF16 *
dStrrchr(UTF16 * unistring, U32 c)

Scanning for characters in Unicode strings.

const UTF8 *
getNthCodepoint(const UTF8 * unistring, const U32 n)

Functions that scan for characters in a UTF-8 string.

bool
isValidUTF8BOM(U8 bom[4])

UTF32
oneUTF16toUTF32(const UTF16 * codepoint, U32 * unitsWalked)

UTF16
oneUTF32toUTF16(const UTF32 codepoint)

U32
oneUTF32toUTF8(const UTF32 codepoint, UTF8 * threeByteCodeunitBuf)

UTF32
oneUTF8toUTF32(const UTF8 * codepoint, U32 * unitsWalked)

Functions that convert one Unicode code point at a time.

Detailed Description

Public Functions

chompUTF8BOM(const char * inString, char ** outStringPtr)

Functions to read and validate UTF BOMs (Byte Order Marks). For reference: http://en.wikipedia.org/wiki/Byte_Order_Mark
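As an illustration of what skipping a BOM involves, here is a minimal sketch that detects the three-byte UTF-8 BOM (EF BB BF) and advances past it. The name skipUtf8Bom and its return convention (true only when a BOM was found) are assumptions made for this example; they are not the engine's chompUTF8BOM() and may differ from how it reports its result.

```cpp
// Hypothetical helper, not the engine's chompUTF8BOM().
// Returns true and points *out just past the BOM when `in` starts with the
// UTF-8 byte order mark (EF BB BF); otherwise returns false and *out == in.
bool skipUtf8Bom(const char* in, const char** out)
{
    const unsigned char* p = reinterpret_cast<const unsigned char*>(in);

    // Short-circuit evaluation stops at the null terminator, so strings
    // shorter than three bytes are never read out of bounds.
    const bool hasBom = p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF;

    *out = hasBom ? in + 3 : in;
    return hasBom;
}
```

The caller keeps using the original pointer for ownership and the returned pointer for parsing.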

convertUTF16toUTF8(const UTF16 * unistring, UTF8(&outbuffer)[N])

Safe conversion function for statically sized buffers.

convertUTF16toUTF8N(const UTF16 * unistring, UTF8 * outbuffer, U32 len)

convertUTF8toUTF16(const UTF8 * unistring, UTF16(&outbuffer)[N])

Safe conversion function for statically sized buffers.
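The "safe" overloads take the output array by reference so the compiler can deduce its size and pass it on as the length. The sketch below shows that pattern with a hypothetical ASCII-only stand-in converter (copyAsciiToUtf16N); the type aliases and the return value (units written, excluding the terminator) are assumptions for illustration, not the engine's definitions.

```cpp
#include <cstddef>
#include <cstdint>

// Assumed stand-ins for the engine typedefs in platform/types.h.
using UTF16 = std::uint16_t;
using U32   = std::uint32_t;

// Hypothetical length-checked converter: copies ASCII bytes into 16-bit
// units, truncates to len - 1 units, and always null terminates.
U32 copyAsciiToUtf16N(const char* in, UTF16* out, U32 len)
{
    if (len == 0)
        return 0;

    U32 i = 0;
    while (in[i] != '\0' && i + 1 < len) {
        out[i] = static_cast<UTF16>(static_cast<unsigned char>(in[i]));
        ++i;
    }
    out[i] = 0;
    return i;
}

// Array-reference wrapper: N is deduced from the buffer's declared size,
// so the caller cannot pass a mismatched length by accident.
template <std::size_t N>
U32 copyAsciiToUtf16(const char* in, UTF16 (&out)[N])
{
    return copyAsciiToUtf16N(in, out, static_cast<U32>(N));
}
```

Passing a pointer instead of an array does not compile against the wrapper, which is the point of the pattern: heap buffers must go through the explicit-length function.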

convertUTF8toUTF16N(const UTF8 * unistring, UTF16 * outbuffer, U32 len)

Functions that convert buffers of unicode code points, into a provided buffer.

  • These functions are useful for working on existing buffers.

  • These cannot convert a buffer in place. If unistring is the same memory as outbuffer, the behavior is undefined.

  • The converter clamps output to the BMP (Basic Multilingual Plane).

  • Conversion to UTF-8 requires a buffer of 3 bytes (U8's) per character, + 1.

  • Conversion to UTF-16 requires a buffer of 1 U16 (2 bytes) per character, + 1.

  • Conversion to UTF-32 requires a buffer of 1 U32 (4 bytes) per character, + 1.

  • UTF-8 only requires 3 bytes per character in the worst case.

  • Output is null terminated. Be sure to provide 1 extra byte, U16 or U32 for the null terminator, or you will see truncated output.

  • If the provided buffer is too small, the output will be truncated.
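The sizing rules above can be written down as small helpers. This is a sketch only; the helper names and the U32 alias are assumptions and do not appear in the header.

```cpp
#include <cstdint>

// Assumed stand-in for the engine's U32 typedef.
using U32 = std::uint32_t;

// Worst-case UTF-8 buffer size in bytes for `codePoints` BMP code points:
// 3 bytes per code point, plus 1 for the null terminator.
constexpr U32 utf8BufferSize(U32 codePoints)
{
    return codePoints * 3 + 1;
}

// UTF-16 buffer size in 16-bit units for `codePoints` BMP code points:
// 1 unit per code point, plus 1 for the null terminator.
constexpr U32 utf16BufferSize(U32 codePoints)
{
    return codePoints + 1;
}
```

For example, converting a 5 code point string to UTF-8 needs at most 16 bytes, and converting it to UTF-16 needs 6 units.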

createUTF16string(const UTF8 * unistring)

Unicode conversion utility functions.

Some definitions first:

  • Code Point: a single character of Unicode text. Used to disambiguate from the C char type.

  • UTF-32: a Unicode encoding format where one code point is always 32 bits wide. This format can in theory contain any Unicode code point that will ever be needed, now or in the future. 4 billion+ code points should be enough, right?

  • UTF-16: a variable length Unicode encoding format where one code point can be either one or two 16-bit code units long.

  • UTF-8: a variable length Unicode encoding format where one code point can be up to four 8-bit code units long. The first bit of a single byte UTF-8 code point is 0. The first few bits of a multi-byte code point determine the length of the code point.

    see:

    http://en.wikipedia.org/wiki/UTF-8

  • Surrogate Pair: a pair of special UTF-16 code units, that encode a code point that is too large to fit into 16 bits. The surrogate values sit in a special reserved range of Unicode.

  • Code Unit: a single unit of a variable length Unicode encoded code point. UTF-8 has 8 bit wide code units. UTF-16 has 16 bit wide code units.

  • BMP: "Basic Multilingual Plane". Unicode values U+0000 - U+FFFF. This range of Unicode contains all the characters for all the languages of the world that one would usually be interested in. All code points in the BMP are 16 bits wide or less.

The current implementation of these conversion functions deals only with the BMP. Any code points above 0xFFFF, the top of the BMP, are replaced with the standard Unicode replacement character: 0xFFFD. Any UTF-16 surrogates are read correctly, but replaced. UTF-8 code points up to 6 code units wide will be read, but 5+ is illegal, and 4+ is above the BMP, and will be replaced. This means that UTF-8 output is clamped to 3 code units (bytes) per code point.

Functions that convert buffers of Unicode code points, allocating a buffer.

  • These functions allocate their own return buffers. You are responsible for calling delete[] on these buffers.

  • Because they allocate memory, do not use these functions in a tight loop.

  • These are useful when you need a new long term copy of a string.
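Since the caller owns the returned buffer and must release it with delete[], one option is to hand the raw pointer to a std::unique_ptr of array type straight away. The helper below is a generic sketch; adoptArray is a hypothetical name and is not part of the engine.

```cpp
#include <memory>

// Hypothetical helper: takes ownership of a buffer that was allocated with
// new[] so that delete[] runs automatically when the result goes out of scope.
template <typename T>
std::unique_ptr<T[]> adoptArray(T* raw)
{
    return std::unique_ptr<T[]>(raw);
}
```

With the engine header available, a call would presumably look like `auto wide = adoptArray(createUTF16string(text));`, after which `wide.get()` yields the UTF16 pointer for as long as `wide` is alive.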

createUTF8string(const UTF16 * unistring)

dStrchr(const UTF16 * unistring, U32 c)

dStrchr(UTF16 * unistring, U32 c)

dStrlen(const UTF16 * unistring)

Functions that calculate the length of unicode strings.

  • Since calculating the length of a UTF8 string is nearly as expensive as converting it to another format, a dStrlen for UTF8 is not provided here.

  • If unistring does not point to a null terminated string of the correct type, the behavior is undefined.
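To make the cost difference concrete, the sketch below contrasts counting 16-bit code units with counting code points in UTF-8, where every byte has to be inspected to tell lead bytes from continuation bytes. Both functions are hypothetical illustrations; whether the engine's dStrlen counts code units exactly this way is an assumption.

```cpp
#include <cstdint>

// Counts the 16-bit code units that precede the null terminator.
std::uint32_t utf16UnitCount(const std::uint16_t* s)
{
    std::uint32_t n = 0;
    while (s[n] != 0)
        ++n;
    return n;
}

// Counts code points in a null terminated UTF-8 string by counting every
// byte that is not a continuation byte (10xxxxxx). Assumes well-formed input.
std::uint32_t utf8CodePointCount(const char* s)
{
    std::uint32_t n = 0;
    for (; *s != '\0'; ++s) {
        if ((static_cast<unsigned char>(*s) & 0xC0) != 0x80)
            ++n;
    }
    return n;
}
```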

dStrlen(const UTF32 * unistring)

dStrrchr(const UTF16 * unistring, U32 c)

dStrrchr(UTF16 * unistring, U32 c)

Scanning for characters in unicode strings.

getNthCodepoint(const UTF8 * unistring, const U32 n)

Functions that scan for characters in a utf8 string.

  • This is useful for getting a character-wise offset into a UTF-8 string, as opposed to a byte-wise offset such as foo[i].
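A character-wise lookup can be sketched as a walk over lead bytes. The function below is a hypothetical illustration: it assumes a 0-based index and returns a pointer to the terminator when n is out of range, neither of which is documented for the engine's getNthCodepoint().

```cpp
#include <cstdint>

// Hypothetical sketch: returns a pointer to the start of the n-th (0-based)
// code point of a null terminated, well-formed UTF-8 string, or a pointer to
// the terminator when the string holds n or fewer code points.
const char* nthCodePoint(const char* s, std::uint32_t n)
{
    while (*s != '\0' && n > 0) {
        ++s; // step past the lead byte
        // Skip continuation bytes (10xxxxxx); the terminator ends this loop.
        while ((static_cast<unsigned char>(*s) & 0xC0) == 0x80)
            ++s;
        --n;
    }
    return s;
}
```

In the string "a\xC3\xA9z" (a, é, z), index 2 lands on 'z' at byte offset 3, whereas a byte-wise foo[2] would land in the middle of the two-byte é.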

isValidUTF8BOM(U8 bom[4])

oneUTF16toUTF32(const UTF16 * codepoint, U32 * unitsWalked)

oneUTF32toUTF16(const UTF32 codepoint)

oneUTF32toUTF8(const UTF32 codepoint, UTF8 * threeByteCodeunitBuf)

oneUTF8toUTF32(const UTF8 * codepoint, U32 * unitsWalked)

Functions that convert one Unicode code point at a time.

  • Since these functions are designed to be used in tight loops, they do not allocate buffers.

  • oneUTF8toUTF32() and oneUTF16toUTF32() return the Unicode code point decoded from the code units at codepoint, and set *unitsWalked to the number of code units that code point occupied. The next Unicode code point starts at codepoint + *unitsWalked.

  • oneUTF32toUTF8() requires a 3 byte buffer, and returns the # of bytes used.
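The one-at-a-time interface can be sketched as a decoder that reports how many code units it consumed and an encoder that writes into a caller-supplied 3-byte buffer. The functions below are hypothetical stand-ins written for illustration, with BMP clamping to U+FFFD as described earlier; they do not reject overlong forms or encoded surrogates and are not the engine's implementation.

```cpp
#include <cstdint>

const std::uint32_t kReplacement = 0xFFFD; // Unicode replacement character

// Decodes one code point from the UTF-8 bytes at `p` and stores the number
// of bytes consumed in *unitsWalked when it is non-null. Invalid lead bytes,
// truncated sequences, and code points above the BMP yield U+FFFD.
std::uint32_t decodeOneUtf8(const char* p, std::uint32_t* unitsWalked = nullptr)
{
    const unsigned char* b = reinterpret_cast<const unsigned char*>(p);
    std::uint32_t cp = kReplacement;
    std::uint32_t len = 1;

    if (b[0] < 0x80)                { cp = b[0]; }
    else if ((b[0] & 0xE0) == 0xC0) { len = 2; cp = b[0] & 0x1F; }
    else if ((b[0] & 0xF0) == 0xE0) { len = 3; cp = b[0] & 0x0F; }
    else if ((b[0] & 0xF8) == 0xF0) { len = 4; cp = b[0] & 0x07; }

    for (std::uint32_t i = 1; i < len; ++i) {
        if ((b[i] & 0xC0) != 0x80) { // not a continuation byte: stop here
            cp = kReplacement;
            len = i;
            break;
        }
        cp = (cp << 6) | (b[i] & 0x3F);
    }

    if (cp > 0xFFFF)
        cp = kReplacement; // clamp to the BMP
    if (unitsWalked)
        *unitsWalked = len;
    return cp;
}

// Encodes one code point into `out`, which must hold at least 3 bytes, and
// returns the number of bytes written. Values above 0xFFFF become U+FFFD.
std::uint32_t encodeOneUtf8(std::uint32_t cp, char* out)
{
    if (cp > 0xFFFF)
        cp = kReplacement;
    if (cp < 0x80) {
        out[0] = static_cast<char>(cp);
        return 1;
    }
    if (cp < 0x800) {
        out[0] = static_cast<char>(0xC0 | (cp >> 6));
        out[1] = static_cast<char>(0x80 | (cp & 0x3F));
        return 2;
    }
    out[0] = static_cast<char>(0xE0 | (cp >> 12));
    out[1] = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out[2] = static_cast<char>(0x80 | (cp & 0x3F));
    return 3;
}
```

A typical loop advances its read pointer by the reported unit count after each call, which is why these functions avoid allocation.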

//-----------------------------------------------------------------------------
// Copyright (c) 2012 GarageGames, LLC
//
// Permission is hereby granted, free of charge, to any person obtaining a copy
// of this software and associated documentation files (the "Software"), to
// deal in the Software without restriction, including without limitation the
// rights to use, copy, modify, merge, publish, distribute, sublicense, and/or
// sell copies of the Software, and to permit persons to whom the Software is
// furnished to do so, subject to the following conditions:
//
// The above copyright notice and this permission notice shall be included in
// all copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
// FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
// IN THE SOFTWARE.
//-----------------------------------------------------------------------------

#ifndef _UNICODE_H_
#define _UNICODE_H_

#ifndef _TORQUE_TYPES_H_
#include "platform/types.h"
#endif


/// Unicode conversion utility functions
///
/// Some definitions first:
/// - <b>Code Point</b>: a single character of Unicode text. Used to disambiguate from the C char type.
/// - <b>UTF-32</b>: a Unicode encoding format where one code point is always 32 bits wide.
///   This format can in theory contain any Unicode code point that will ever be needed, now or in the future. 4 billion+ code points should be enough, right?
/// - <b>UTF-16</b>: a variable length Unicode encoding format where one code point can be
///   either one or two 16-bit code units long.
/// - <b>UTF-8</b>: a variable length Unicode encoding format where one code point can be
///   up to four 8-bit code units long. The first bit of a single byte UTF-8 code point is 0.
///   The first few bits of a multi-byte code point determine the length of the code point.
///   @see http://en.wikipedia.org/wiki/UTF-8
/// - <b>Surrogate Pair</b>: a pair of special UTF-16 code units, that encode a code point
///   that is too large to fit into 16 bits. The surrogate values sit in a special reserved range of Unicode.
/// - <b>Code Unit</b>: a single unit of a variable length Unicode encoded code point.
///   UTF-8 has 8 bit wide code units. UTF-16 has 16 bit wide code units.
/// - <b>BMP</b>: "Basic Multilingual Plane". Unicode values U+0000 - U+FFFF. This range
///   of Unicode contains all the characters for all the languages of the world that one would
///   usually be interested in. All code points in the BMP are 16 bits wide or less.

/// The current implementation of these conversion functions deals only with the BMP.
/// Any code points above 0xFFFF, the top of the BMP, are replaced with the
///  standard Unicode replacement character: 0xFFFD.
/// Any UTF-16 surrogates are read correctly, but replaced.
/// UTF-8 code points up to 6 code units wide will be read, but 5+ is illegal,
///  and 4+ is above the BMP, and will be replaced.
///  This means that UTF-8 output is clamped to 3 code units (bytes) per code point.

//-----------------------------------------------------------------------------
/// Functions that convert buffers of Unicode code points, allocating a buffer.
/// - These functions allocate their own return buffers. You are responsible for
///   calling delete[] on these buffers.
/// - Because they allocate memory, do not use these functions in a tight loop.
/// - These are useful when you need a new long term copy of a string.
UTF16* createUTF16string( const UTF8 *unistring);

UTF8*  createUTF8string( const UTF16 *unistring);

//-----------------------------------------------------------------------------
/// Functions that convert buffers of Unicode code points into a provided buffer.
/// - These functions are useful for working on existing buffers.
/// - These cannot convert a buffer in place. If unistring is the same memory as
///   outbuffer, the behavior is undefined.
/// - The converter clamps output to the BMP (Basic Multilingual Plane).
/// - Conversion to UTF-8 requires a buffer of 3 bytes (U8's) per character, + 1.
/// - Conversion to UTF-16 requires a buffer of 1 U16 (2 bytes) per character, + 1.
/// - Conversion to UTF-32 requires a buffer of 1 U32 (4 bytes) per character, + 1.
/// - UTF-8 only requires 3 bytes per character in the worst case.
/// - Output is null terminated. Be sure to provide 1 extra byte, U16 or U32 for
///   the null terminator, or you will see truncated output.
/// - If the provided buffer is too small, the output will be truncated.
U32 convertUTF8toUTF16N(const UTF8 *unistring, UTF16 *outbuffer, U32 len);

U32 convertUTF16toUTF8N( const UTF16 *unistring, UTF8  *outbuffer, U32 len);

/// Safe conversion function for statically sized buffers.
template <size_t N>
inline U32 convertUTF8toUTF16(const UTF8 *unistring, UTF16 (&outbuffer)[N])
{
   return convertUTF8toUTF16N(unistring, outbuffer, (U32) N);
}

/// Safe conversion function for statically sized buffers.
template <size_t N>
inline U32 convertUTF16toUTF8(const UTF16 *unistring, UTF8 (&outbuffer)[N])
{
   return convertUTF16toUTF8N(unistring, outbuffer, (U32) N);
}

//-----------------------------------------------------------------------------
/// Functions that convert one Unicode code point at a time
/// - Since these functions are designed to be used in tight loops, they do not
///   allocate buffers.
/// - oneUTF8toUTF32() and oneUTF16toUTF32() return the Unicode code point decoded
///   from the code units at codepoint, and set *unitsWalked to the \# of code units
///   that code point occupied.
///   The next Unicode code point starts at codepoint + *unitsWalked.
/// - oneUTF32toUTF8()  requires a 3 byte buffer, and returns the \# of bytes used.
UTF32  oneUTF8toUTF32( const UTF8 *codepoint,  U32 *unitsWalked = NULL);
UTF32  oneUTF16toUTF32(const UTF16 *codepoint, U32 *unitsWalked = NULL);
UTF16  oneUTF32toUTF16(const UTF32 codepoint);
U32    oneUTF32toUTF8( const UTF32 codepoint, UTF8 *threeByteCodeunitBuf);

//-----------------------------------------------------------------------------
/// Functions that calculate the length of Unicode strings.
/// - Since calculating the length of a UTF8 string is nearly as expensive as
///   converting it to another format, a dStrlen for UTF8 is not provided here.
/// - If unistring does not point to a null terminated string of the correct type,
///   the behavior is undefined.
U32 dStrlen(const UTF16 *unistring);
U32 dStrlen(const UTF32 *unistring);

//-----------------------------------------------------------------------------
/// Scanning for characters in Unicode strings
UTF16* dStrrchr(UTF16* unistring, U32 c);
const UTF16* dStrrchr(const UTF16* unistring, U32 c);

UTF16* dStrchr(UTF16* unistring, U32 c);
const UTF16* dStrchr(const UTF16* unistring, U32 c);
//-----------------------------------------------------------------------------
/// Functions that scan for characters in a UTF-8 string.
/// - This is useful for getting a character-wise offset into a UTF8 string,
///   as opposed to a byte-wise offset into a UTF8 string: foo[i]
const UTF8* getNthCodepoint(const UTF8 *unistring, const U32 n);

//------------------------------------------------------------------------------
/// Functions to read and validate UTF BOMs (Byte Order Marks)
/// For reference: http://en.wikipedia.org/wiki/Byte_Order_Mark
bool chompUTF8BOM( const char *inString, char **outStringPtr );
bool isValidUTF8BOM( U8 bom[4] );

#endif // _UNICODE_H_