unicode.cpp

Classes:

class

UTF16Cache

Cache data for UTF16 strings.

Public Defines

define

kReplacementChar 0xFFFD

Replacement character. The standard value is 0xFFFD.

define

TORQUE_ENABLE_UTF16_CACHE

Public Typedefs

HashTable< U32, UTF16Cache >

UTF16CacheTable

Cache for UTF16 strings.

Public Variables

const U8

sgByteMask8LUT []

Look up table.

const U16

sgByteMaskLow10

Mask for the data bits of a UTF-16 surrogate.

const U8

sgFirstByteLUT [128]

Look up table.

const U8

sgSurrogateLUT [64]

Look up table.

UTF16CacheTable

sgUTF16Cache

Public Functions

bool

chompUTF8BOM(const char * inString, char ** outStringPtr)

Functions to read and validate UTF BOMs (Byte Order Marks). For reference, see http://en.wikipedia.org/wiki/Byte_Order_Mark.

U32

convertUTF16toUTF8DoubleNULL(const UTF16 * unistring, UTF8 * outbuffer, U32 len)

U32

convertUTF16toUTF8N(const UTF16 * unistring, UTF8 * outbuffer, U32 len)

U32

convertUTF8toUTF16N(const UTF8 * unistring, UTF16 * outbuffer, U32 len)

Functions that convert buffers of Unicode code points into a provided buffer.

UTF16 *

createUTF16string(const UTF8 * unistring)

Unicode conversion utility functions.

UTF8 *

createUTF8string(const UTF16 * unistring)

const UTF16 *

dStrchr(const UTF16 * unistring, U32 c)

UTF16 *

dStrchr(UTF16 * unistring, U32 c)

U32

dStrlen(const UTF16 * unistring)

Functions that calculate the length of unicode strings.

U32

dStrlen(const UTF32 * unistring)

const UTF16 *

dStrrchr(const UTF16 * unistring, U32 c)

UTF16 *

dStrrchr(UTF16 * unistring, U32 c)

Scanning for characters in unicode strings.

const UTF8 *

getNthCodepoint(const UTF8 * unistring, const U32 n)

Functions that scan for characters in a utf8 string.

bool

isAboveBMP(U32 codepoint)

bool

isSurrogateRange(U32 codepoint)

bool

isValidUTF8BOM(U8 bom[4])

UTF32

oneUTF16toUTF32(const UTF16 * codepoint, U32 * unitsWalked)

UTF16

oneUTF32toUTF16(const UTF32 codepoint)

U32

oneUTF32toUTF8(const UTF32 codepoint, UTF8 * threeByteCodeunitBuf)

UTF32

oneUTF8toUTF32(const UTF8 * codepoint, U32 * unitsWalked)

Functions that convert one Unicode code point at a time.

Detailed Description

Public Defines

kReplacementChar 0xFFFD

Replacement character. The standard value is 0xFFFD.

TORQUE_ENABLE_UTF16_CACHE

Public Typedefs

typedef HashTable< U32, UTF16Cache > UTF16CacheTable

Cache for UTF16 strings.

Public Variables

const U8 sgByteMask8LUT []

Look up table.

Feed in a value from sgFirstByteLUT to get the mask for the data bits of that UTF-8 code unit. Index 0 gives the mask for a trailing byte.

const U16 sgByteMaskLow10

Mask for the data bits of a UTF-16 surrogate.

const U8 sgFirstByteLUT [128]

Look up table.

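Shift a byte >> 1, then look up the total length in bytes of the UTF-8 sequence it starts. The table is unsigned, so bytes that cannot start a sequence (trailing bytes, 0xFE and 0xFF) are stored as 0 rather than -1.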
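The lookup described above can be sketched without the table, as a chain of range checks on the lead byte. This is an illustrative stand-in, not the code in unicode.cpp; the names utf8SequenceLength and checkLeadByteExamples are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper, not part of unicode.cpp: returns the total length in
// bytes of the UTF-8 sequence introduced by `lead`, or 0 if `lead` cannot
// start a sequence (a trailing byte, or 0xFE / 0xFF).
inline std::uint32_t utf8SequenceLength(std::uint8_t lead)
{
   if (lead < 0x80) return 1;   // 0xxxxxxx: single byte ASCII
   if (lead < 0xC0) return 0;   // 10xxxxxx: trailing byte
   if (lead < 0xE0) return 2;   // 110xxxxx: first of 2
   if (lead < 0xF0) return 3;   // 1110xxxx: first of 3
   if (lead < 0xF8) return 4;   // 11110xxx: first of 4
   if (lead < 0xFC) return 5;   // 111110xx: legacy 5-byte form
   if (lead < 0xFE) return 6;   // 1111110x: legacy 6-byte form
   return 0;                    // 0xFE and 0xFF never appear in UTF-8
}

// A few spot checks against the ranges listed in the table comments.
inline void checkLeadByteExamples()
{
   assert(utf8SequenceLength(0x41) == 1);
   assert(utf8SequenceLength(0x80) == 0);
   assert(utf8SequenceLength(0xC3) == 2);
   assert(utf8SequenceLength(0xE2) == 3);
}
```

Indexing a 128-entry table with lead >> 1 gives the same answers with a single memory read, which is presumably why the file uses a table.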

const U8 sgSurrogateLUT [64]

Look up table.

Shift a 16-bit word >> 10, then look up whether it is a surrogate, and which part. 0 means non-surrogate, 1 means 1st in pair, 2 means 2nd in pair.
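As a rough illustration of the same classification, the sketch below switches on the top six bits of a code unit instead of reading a table. The enum and function names are hypothetical; unicode.cpp stores the three outcomes as the plain values 0, 1 and 2.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical names for illustration only.
enum class SurrogateKind : std::uint8_t { None = 0, Lead = 1, Trail = 2 };

// Classify one UTF-16 code unit by its top six bits (unit >> 10).
inline SurrogateKind classifyUTF16Unit(std::uint16_t unit)
{
   switch (unit >> 10)
   {
      case 0x36: return SurrogateKind::Lead;   // 0xD800..0xDBFF, 1st in pair
      case 0x37: return SurrogateKind::Trail;  // 0xDC00..0xDFFF, 2nd in pair
      default:   return SurrogateKind::None;   // ordinary BMP code unit
   }
}

// A few spot checks at the edges of the surrogate range.
inline void checkSurrogateExamples()
{
   assert(classifyUTF16Unit(0x0041) == SurrogateKind::None);
   assert(classifyUTF16Unit(0xD800) == SurrogateKind::Lead);
   assert(classifyUTF16Unit(0xDC00) == SurrogateKind::Trail);
   assert(classifyUTF16Unit(0xE000) == SurrogateKind::None);
}
```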

UTF16CacheTable sgUTF16Cache

Public Functions

chompUTF8BOM(const char * inString, char ** outStringPtr)

Functions to read and validate UTF BOMs (Byte Order Marks). For reference, see http://en.wikipedia.org/wiki/Byte_Order_Mark.
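A minimal sketch of the BOM-skipping idea follows, assuming only that a UTF-8 BOM is the three bytes EF BB BF. The function skipUTF8BOM is a hypothetical stand-in; unlike chompUTF8BOM it returns the adjusted pointer directly and prints no warnings for other encodings.

```cpp
#include <cassert>

// Hypothetical stand-in for the BOM check: returns a pointer just past a
// leading UTF-8 BOM (EF BB BF), or `text` unchanged if there is none.
inline const char* skipUTF8BOM(const char* text)
{
   assert(text != nullptr);
   const unsigned char* p = reinterpret_cast<const unsigned char*>(text);

   // Short-circuit evaluation stops at the first mismatch, so a string
   // shorter than three bytes is never read past its null terminator.
   if (p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF)
      return text + 3;

   return text;
}
```

A caller would typically run this once on a freshly loaded script buffer and parse from the returned pointer.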

convertUTF16toUTF8DoubleNULL(const UTF16 * unistring, UTF8 * outbuffer, U32 len)

convertUTF16toUTF8N(const UTF16 * unistring, UTF8 * outbuffer, U32 len)

convertUTF8toUTF16N(const UTF8 * unistring, UTF16 * outbuffer, U32 len)

Functions that convert buffers of Unicode code points into a provided buffer.

These functions are useful for working on existing buffers.
These cannot convert a buffer in place. If unistring is the same memory as outbuffer, the behavior is undefined.
The converter clamps output to the BMP (Basic Multilingual Plane).
Conversion to UTF-8 requires a buffer of 3 bytes (U8's) per character, + 1.
Conversion to UTF-16 requires a buffer of 1 U16 (2 bytes) per character, + 1.
Conversion to UTF-32 requires a buffer of 1 U32 (4 bytes) per character, + 1.
UTF-8 only requires 3 bytes per character in the worst case.
Output is null terminated. Be sure to provide 1 extra byte, U16 or U32 for the null terminator, or you will see truncated output.
If the provided buffer is too small, the output will be truncated.
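The sizing rules above reduce to simple arithmetic. The helpers below are hypothetical names, shown only to make the worst-case sizes concrete; they assume BMP-clamped output as described.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: bytes needed to hold the UTF-8 form of a UTF-16 string
// of `utf16Units` code units, at 3 bytes per character plus the terminator.
inline std::size_t utf8CapacityFor(std::size_t utf16Units)
{
   return utf16Units * 3 + 1;
}

// Hypothetical helper: UTF-16 code units needed for a UTF-8 string of
// `utf8Bytes` bytes. Each code point uses at least one byte, so one U16 per
// input byte plus the terminator is always enough.
inline std::size_t utf16CapacityFor(std::size_t utf8Bytes)
{
   return utf8Bytes + 1;
}

// Worked numbers: a 4-unit UTF-16 string and a 5-byte UTF-8 string.
inline void checkCapacityExamples()
{
   assert(utf8CapacityFor(4) == 13);
   assert(utf16CapacityFor(5) == 6);
}
```

These are the same bounds the allocating helpers in this file use when they size their temporary buffers.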

createUTF16string(const UTF8 * unistring)

Unicode conversion utility functions.

Some definitions first:

Code Point: a single character of Unicode text. Used to disambiguate from the C char type.
UTF-32: a Unicode encoding format where one code point is always 32 bits wide. This format can in theory contain any Unicode code point that will ever be needed, now or in the future. 4 billion+ code points should be enough, right?
UTF-16: a variable length Unicode encoding format where one code point can be either one or two 16-bit code units long.
UTF-8: a variable length Unicode encoding format where one code point can be up to four 8-bit code units long. The first bit of a single byte UTF-8 code point is 0. The first few bits of a multi-byte code point determine the length of the code point. See: http://en.wikipedia.org/wiki/UTF-8
Surrogate Pair: a pair of special UTF-16 code units that encode a code point that is too large to fit into 16 bits. The surrogate values sit in a special reserved range of Unicode.
Code Unit: a single unit of a variable length Unicode encoded code point. UTF-8 has 8 bit wide code units. UTF-16 has 16 bit wide code units.
BMP: "Basic Multilingual Plane". Unicode values U+0000 - U+FFFF. This range of Unicode contains all the characters for all the languages of the world that one would usually be interested in. All code points in the BMP are 16 bits wide or less.

The current implementation of these conversion functions deals only with the BMP. Any code points above 0xFFFF, the top of the BMP, are replaced with the standard Unicode replacement character: 0xFFFD. Any UTF16 surrogates are read correctly, but replaced. UTF-8 code points up to 6 code units wide will be read, but 5+ is illegal, and 4+ is above the BMP, and will be replaced. This means that UTF-8 output is clamped to 3 code units (bytes) per code point.

Functions that convert buffers of Unicode code points, allocating a buffer.

These functions allocate their own return buffers. You are responsible for calling delete[] on these buffers.
Because they allocate memory, do not use these functions in a tight loop.
These are useful when you need a new long term copy of a string.
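One way to honor the delete[] obligation is to hand the returned pointer to an owning wrapper straight away. The sketch below uses a hypothetical createCopy() as a stand-in for an allocating converter such as createUTF8string(), since the engine headers are not available here; only the ownership pattern is the point.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <memory>

// Hypothetical stand-in for an allocating converter: returns a new[]-allocated,
// null-terminated copy that the caller must release with delete[].
inline char* createCopy(const char* text)
{
   assert(text != nullptr);
   const std::size_t len = std::strlen(text) + 1;  // include the terminator
   char* ret = new char[len];
   std::memcpy(ret, text, len);
   return ret;
}

// Wrap the raw buffer so that delete[] runs automatically when the owner
// goes out of scope. std::unique_ptr<char[]> uses delete[], not delete.
inline std::unique_ptr<char[]> createOwnedCopy(const char* text)
{
   return std::unique_ptr<char[]>(createCopy(text));
}
```

With this pattern an early return or exception between allocation and cleanup does not leak the buffer.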

createUTF8string(const UTF16 * unistring)

dStrchr(const UTF16 * unistring, U32 c)

dStrchr(UTF16 * unistring, U32 c)

dStrlen(const UTF16 * unistring)

Functions that calculate the length of unicode strings.

Since calculating the length of a UTF8 string is nearly as expensive as converting it to another format, a dStrlen for UTF8 is not provided here.
If *unistring does not point to a null terminated string of the correct type, the behavior is undefined.

dStrlen(const UTF32 * unistring)

dStrrchr(const UTF16 * unistring, U32 c)

dStrrchr(UTF16 * unistring, U32 c)

Scanning for characters in unicode strings.

getNthCodepoint(const UTF8 * unistring, const U32 n)

Functions that scan for characters in a utf8 string.

This is useful for getting a character-wise offset into a UTF8 string, as opposed to a byte-wise offset such as foo[i].
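The difference between the two kinds of offset can be sketched as follows. nthCodepoint is a hypothetical stand-in for getNthCodepoint(); it assumes well-formed UTF-8 and treats every byte of the form 10xxxxxx as a continuation of the previous character.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in: returns a pointer to the start of the n-th code
// point (0-based) in a null-terminated UTF-8 string, or to the terminator
// if the string holds fewer than n + 1 code points.
inline const char* nthCodepoint(const char* s, std::uint32_t n)
{
   assert(s != nullptr);
   while (*s != '\0' && n > 0)
   {
      ++s;  // step off the current lead byte
      // Skip trailing bytes (10xxxxxx); they belong to the same character.
      while ((static_cast<unsigned char>(*s) & 0xC0) == 0x80)
         ++s;
      --n;
   }
   return s;
}
```

For the string "a", U+00E9, "z" (bytes 61 C3 A9 7A), character index 2 is the "z" at byte offset 3, whereas byte index 2 lands in the middle of U+00E9.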

isAboveBMP(U32 codepoint)

isSurrogateRange(U32 codepoint)

isValidUTF8BOM(U8 bom[4])

oneUTF16toUTF32(const UTF16 * codepoint, U32 * unitsWalked)

oneUTF32toUTF16(const UTF32 codepoint)

oneUTF32toUTF8(const UTF32 codepoint, UTF8 * threeByteCodeunitBuf)

oneUTF8toUTF32(const UTF8 * codepoint, U32 * unitsWalked)

Functions that convert one Unicode code point at a time.

Since these functions are designed to be used in tight loops, they do not allocate buffers.
oneUTF8toUTF32() and oneUTF16toUTF32() return the converted Unicode code point, and set *unitsWalked to the number of code units that the sequence starting at codepoint took up. The next Unicode code point starts at codepoint + *unitsWalked.
oneUTF32toUTF8() requires a 3 byte buffer, and returns the # of bytes used.
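The calling convention above (decode one code point, advance by the number of units walked) can be sketched with a small standalone decoder. decodeOneUTF8 and countCodepoints are hypothetical names; the decoder only approximates oneUTF8toUTF32(), accepting 2 to 4 byte sequences and replacing anything malformed, in the surrogate range, or above the BMP with 0xFFFD.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint32_t kReplacement = 0xFFFD;  // same value as kReplacementChar

// Hypothetical BMP-clamped decoder: returns one code point and stores the
// number of bytes consumed in *unitsWalked.
inline std::uint32_t decodeOneUTF8(const unsigned char* p, std::uint32_t* unitsWalked)
{
   assert(p != nullptr && unitsWalked != nullptr);
   const unsigned char lead = p[0];
   std::uint32_t len = 0;
   std::uint32_t cp = 0;

   if (lead < 0x80) { *unitsWalked = 1; return lead; }         // ASCII early out
   else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }
   else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
   else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }
   else { *unitsWalked = 1; return kReplacement; }             // stray or illegal byte

   for (std::uint32_t i = 1; i < len; ++i)
   {
      // A missing trailing byte (including the terminator) ends the sequence;
      // consume only the lead byte so the next call resynchronizes.
      if ((p[i] & 0xC0) != 0x80) { *unitsWalked = 1; return kReplacement; }
      cp = (cp << 6) | (p[i] & 0x3F);
   }

   *unitsWalked = len;
   if (cp > 0xFFFF || (cp >= 0xD800 && cp <= 0xDFFF))
      cp = kReplacement;  // clamp to the BMP, reject surrogates
   return cp;
}

// Walk a null-terminated UTF-8 string one code point at a time.
inline std::uint32_t countCodepoints(const char* s)
{
   assert(s != nullptr);
   const unsigned char* p = reinterpret_cast<const unsigned char*>(s);
   std::uint32_t count = 0;
   while (*p != 0)
   {
      std::uint32_t walked = 1;
      decodeOneUTF8(p, &walked);
      p += walked;   // the next code point starts here
      ++count;
   }
   return count;
}
```

Because nothing is allocated, the loop in countCodepoints can run over large buffers without touching the heap.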

//-----------------------------------------------------------------------------
// Copyright (c) 2012 GarageGames, LLC
//
// Permission is hereby granted, free of charge, to any person obtaining a copy
// of this software and associated documentation files (the "Software"), to
// deal in the Software without restriction, including without limitation the
// rights to use, copy, modify, merge, publish, distribute, sublicense, and/or
// sell copies of the Software, and to permit persons to whom the Software is
// furnished to do so, subject to the following conditions:
//
// The above copyright notice and this permission notice shall be included in
// all copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
// FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
// IN THE SOFTWARE.
//-----------------------------------------------------------------------------

#include <stdio.h>

#include "core/frameAllocator.h"
#include "core/strings/unicode.h"
#include "core/strings/stringFunctions.h"

#include "platform/profiler.h"
#include "console/console.h"

#define TORQUE_ENABLE_UTF16_CACHE

#ifdef TORQUE_ENABLE_UTF16_CACHE
#include "core/util/tDictionary.h"
#include "core/util/hashFunction.h"
#endif

//-----------------------------------------------------------------------------
/// replacement character. Standard correct value is 0xFFFD.
#define kReplacementChar 0xFFFD

/// Look up table. Shift a byte >> 1, then look up the total length in bytes of
/// the sequence it starts. Contains 0 for bytes that cannot start a sequence.
static const U8 sgFirstByteLUT[128] =
{
   1, 1, 1, 1,  1, 1, 1, 1,  1, 1, 1, 1,  1, 1, 1, 1, // 0x0F // single byte ascii
   1, 1, 1, 1,  1, 1, 1, 1,  1, 1, 1, 1,  1, 1, 1, 1, // 0x1F // single byte ascii
   1, 1, 1, 1,  1, 1, 1, 1,  1, 1, 1, 1,  1, 1, 1, 1, // 0x2F // single byte ascii
   1, 1, 1, 1,  1, 1, 1, 1,  1, 1, 1, 1,  1, 1, 1, 1, // 0x3F // single byte ascii

   0, 0, 0, 0,  0, 0, 0, 0,  0, 0, 0, 0,  0, 0, 0, 0, // 0x4F // trailing utf8
   0, 0, 0, 0,  0, 0, 0, 0,  0, 0, 0, 0,  0, 0, 0, 0, // 0x5F // trailing utf8
   2, 2, 2, 2,  2, 2, 2, 2,  2, 2, 2, 2,  2, 2, 2, 2, // 0x6F // first of 2
   3, 3, 3, 3,  3, 3, 3, 3,  4, 4, 4, 4,  5, 5, 6, 0, // 0x7F // first of 3,4,5,illegal in utf-8
};

/// Look up table. Shift a 16-bit word >> 10, then look up whether it is a surrogate,
///  and which part. 0 means non-surrogate, 1 means 1st in pair, 2 means 2nd in pair.
static const U8 sgSurrogateLUT[64] =
{
   0, 0, 0, 0,  0, 0, 0, 0,  0, 0, 0, 0,  0, 0, 0, 0, // 0x0F
   0, 0, 0, 0,  0, 0, 0, 0,  0, 0, 0, 0,  0, 0, 0, 0, // 0x1F
   0, 0, 0, 0,  0, 0, 0, 0,  0, 0, 0, 0,  0, 0, 0, 0, // 0x2F
   0, 0, 0, 0,  0, 0, 1, 2,  0, 0, 0, 0,  0, 0, 0, 0, // 0x3F
};

/// Look up table. Feed value from firstByteLUT in, gives you
/// the mask for the data bits of that UTF-8 code unit.
static const U8  sgByteMask8LUT[]  = { 0x3f, 0x7f, 0x1f, 0x0f, 0x07, 0x03, 0x01 }; // last 0=6, 1=7, 2=5, 4, 3, 2, 1 bits

/// Mask for the data bits of a UTF-16 surrogate.
static const U16 sgByteMaskLow10 = 0x03ff;

//-----------------------------------------------------------------------------

#ifdef TORQUE_ENABLE_UTF16_CACHE

/// Cache data for UTF16 strings. This is wrapped in a class so that data is
/// automatically freed when the hash table is deleted.
struct UTF16Cache
{
   UTF16 *mString;
   U32 mLength;

   UTF16Cache()
   {
      mString = NULL;
      mLength = 0;
   }

   UTF16Cache(UTF16 *str, U32 len)
   {
      mLength = len;
      mString = new UTF16[mLength];
      dMemcpy(mString, str, mLength * sizeof(UTF16));
   }

   UTF16Cache(const UTF16Cache &other)
   {
      mLength = other.mLength;
      mString = new UTF16[mLength];
      dMemcpy(mString, other.mString, mLength * sizeof(UTF16));
   }

   UTF16Cache & operator=(const UTF16Cache &other)
   {
      if (&other != this)
      {
         delete [] mString;

         mLength = other.mLength;
         mString = new UTF16[mLength];
         dMemcpy(mString, other.mString, mLength * sizeof(UTF16));
      }
      return *this;
   }

   ~UTF16Cache()
   {
      delete [] mString;
   }

   void copyToBuffer(UTF16 *outBuffer, U32 lenToCopy, bool nullTerminate = true) const
   {
      U32 copy = getMin(mLength, lenToCopy);
      if(mString && copy > 0)
         dMemcpy(outBuffer, mString, copy * sizeof(UTF16));

      if(nullTerminate)
         outBuffer[copy] = 0;
   }
};

/// Cache for UTF16 strings
typedef HashTable<U32, UTF16Cache> UTF16CacheTable;
static UTF16CacheTable sgUTF16Cache;

#endif // TORQUE_ENABLE_UTF16_CACHE

//-----------------------------------------------------------------------------
inline bool isSurrogateRange(U32 codepoint)
{
   // The surrogate range is U+D800 through U+DFFF, inclusive at both ends.
   return ( 0xd800 <= codepoint && codepoint <= 0xdfff );
}

inline bool isAboveBMP(U32 codepoint)
{
   return ( codepoint > 0xFFFF );
}

//-----------------------------------------------------------------------------
U32 convertUTF8toUTF16N(const UTF8 *unistring, UTF16 *outbuffer, U32 len)
{
   AssertFatal(len >= 1, "Buffer for unicode conversion must be large enough to hold at least the null terminator.");
   PROFILE_SCOPE(convertUTF8toUTF16);

#ifdef TORQUE_ENABLE_UTF16_CACHE
   // If we have cached this conversion already, don't do it again
   U32 hashKey = Torque::hash((const U8 *)unistring, dStrlen(unistring), 0);
   UTF16CacheTable::Iterator cacheItr = sgUTF16Cache.find(hashKey);
   if(cacheItr != sgUTF16Cache.end())
   {
      const UTF16Cache &cache = (*cacheItr).value;
      cache.copyToBuffer(outbuffer, len);
      return getMin(cache.mLength,len - 1);
   }
#endif

   U32 walked, nCodepoints;
   UTF32 middleman;

   nCodepoints=0;
   while(*unistring != '\0' && nCodepoints < len)
   {
      walked = 1;
      middleman = oneUTF8toUTF32(unistring,&walked);
      outbuffer[nCodepoints] = oneUTF32toUTF16(middleman);
      unistring+=walked;
      nCodepoints++;
   }

   nCodepoints = getMin(nCodepoints,len - 1);
   outbuffer[nCodepoints] = '\0';

#ifdef TORQUE_ENABLE_UTF16_CACHE
   // Cache the results.
   // FIXME As written, this will result in some unnecessary memory copying due to copy constructor calls.
   UTF16Cache cache(outbuffer, nCodepoints);
   sgUTF16Cache.insertUnique(hashKey, cache);
#endif

   return nCodepoints;
}

//-----------------------------------------------------------------------------
U32 convertUTF16toUTF8N( const UTF16 *unistring, UTF8  *outbuffer, U32 len)
{
   AssertFatal(len >= 1, "Buffer for unicode conversion must be large enough to hold at least the null terminator.");
   PROFILE_START(convertUTF16toUTF8);
   U32 walked, nCodeunits, codeunitLen;
   UTF32 middleman;

   nCodeunits=0;
   while( *unistring != '\0' && nCodeunits + 3 < len )
   {
      walked = 1;
      middleman  = oneUTF16toUTF32(unistring,&walked);
      codeunitLen = oneUTF32toUTF8(middleman, &outbuffer[nCodeunits]);
      unistring += walked;
      nCodeunits += codeunitLen;
   }

   nCodeunits = getMin(nCodeunits,len - 1);
   outbuffer[nCodeunits] = '\0';

   PROFILE_END();
   return nCodeunits;
}

U32 convertUTF16toUTF8DoubleNULL( const UTF16 *unistring, UTF8  *outbuffer, U32 len)
{
   AssertFatal(len >= 1, "Buffer for unicode conversion must be large enough to hold at least the null terminator.");
   PROFILE_START(convertUTF16toUTF8DoubleNULL);
   U32 walked, nCodeunits, codeunitLen;
   UTF32 middleman;

   nCodeunits=0;
   while( ! (*unistring == '\0' && *(unistring + 1) == '\0') && nCodeunits + 3 < len )
   {
      walked = 1;
      middleman  = oneUTF16toUTF32(unistring,&walked);
      codeunitLen = oneUTF32toUTF8(middleman, &outbuffer[nCodeunits]);
      unistring += walked;
      nCodeunits += codeunitLen;
   }

   nCodeunits = getMin(nCodeunits,len - 1);
   outbuffer[nCodeunits] = NULL;
   outbuffer[nCodeunits+1] = NULL;

   PROFILE_END();
   return nCodeunits;
}

//-----------------------------------------------------------------------------
// Functions that convert buffers of unicode code points
//-----------------------------------------------------------------------------
UTF16* createUTF16string( const UTF8* unistring)
{
   PROFILE_SCOPE(createUTF16string);

   // allocate plenty of memory.
   U32 nCodepoints, len = dStrlen(unistring) + 1;
   FrameTemp<UTF16> buf(len);

   // perform conversion
   nCodepoints = convertUTF8toUTF16N( unistring, buf, len);

   // add 1 for the NULL terminator the converter promises it included.
   nCodepoints++;

   // allocate the return buffer, copy over, and return it.
   UTF16 *ret = new UTF16[nCodepoints];
   dMemcpy(ret, buf, nCodepoints * sizeof(UTF16));

   return ret;
}

//-----------------------------------------------------------------------------
UTF8*  createUTF8string( const UTF16* unistring)
{
   PROFILE_SCOPE(createUTF8string);

   // allocate plenty of memory.
   U32 nCodeunits, len = dStrlen(unistring) * 3 + 1;
   FrameTemp<UTF8> buf(len);

   // perform conversion
   nCodeunits = convertUTF16toUTF8N( unistring, buf, len);

   // add 1 for the NULL terminator the converter promises it included.
   nCodeunits++;

   // allocate the return buffer, copy over, and return it.
   UTF8 *ret = new UTF8[nCodeunits];
   dMemcpy(ret, buf, nCodeunits * sizeof(UTF8));

   return ret;
}

//-----------------------------------------------------------------------------

//-----------------------------------------------------------------------------
// Functions that converts one unicode codepoint at a time
//-----------------------------------------------------------------------------
UTF32 oneUTF8toUTF32( const UTF8* codepoint, U32 *unitsWalked)
{
   PROFILE_SCOPE(oneUTF8toUTF32);

   // codepoints 6 codeunits long are read, but do not convert correctly,
   // and are filtered out anyway.

   // early out for ascii
   if(!(*codepoint & 0x0080))
   {
      if (unitsWalked != NULL)
         *unitsWalked = 1;
      return (UTF32)*codepoint;
   }

   U32 expectedByteCount;
   UTF32  ret = 0;
   U8 codeunit;

   // check the first byte ( a.k.a. codeunit ) .
   U8 c = codepoint[0];
   c = c >> 1;
   expectedByteCount = sgFirstByteLUT[c];
   if(expectedByteCount > 0) // 0 or negative is illegal to start with
   {
      // process 1st codeunit
      ret |= sgByteMask8LUT[expectedByteCount] & codepoint[0]; // bug?

      // process trailing codeunits
      for(U32 i=1;i<expectedByteCount; i++)
      {
         codeunit = codepoint[i];
         if( sgFirstByteLUT[codeunit>>1] == 0 )
         {
            ret <<= 6;                 // shift up 6
            ret |= (codeunit & 0x3f);  // mask in the low 6 bits of this codeunit byte.
         }
         else
         {
            // found a bad codepoint - did not get a medial where we wanted one.
            // Dump the replacement, and claim to have parsed only 1 char,
            // so that we'll dump a slew of replacements, instead of eating the next char.
            ret = kReplacementChar;
            expectedByteCount = 1;
            break;
         }
      }
   }
   else
   {
      // found a bad codepoint - got a medial or an illegal codeunit.
      // Dump the replacement, and claim to have parsed only 1 char,
      // so that we'll dump a slew of replacements, instead of eating the next char.
      ret = kReplacementChar;
      expectedByteCount = 1;
   }

   if(unitsWalked != NULL)
      *unitsWalked = expectedByteCount;

   // codepoints in the surrogate range are illegal, and should be replaced.
   if(isSurrogateRange(ret))
      ret = kReplacementChar;

   // codepoints outside the Basic Multilingual Plane add complexity to our UTF16 string classes,
   // we've read them correctly so they won't foul the byte stream,
   // but we kill them here to make sure they wont foul anything else
   if(isAboveBMP(ret))
      ret = kReplacementChar;

   return ret;
}

//-----------------------------------------------------------------------------
UTF32  oneUTF16toUTF32(const UTF16* codepoint, U32 *unitsWalked)
{
   PROFILE_START(oneUTF16toUTF32);
   U8    expectedType;
   U32   unitCount;
   UTF32 ret = 0;
   UTF16 codeunit1,codeunit2;

   codeunit1 = codepoint[0];
   expectedType = sgSurrogateLUT[codeunit1 >> 10];
   switch(expectedType)
   {
      case 0: // simple
         ret = codeunit1;
         unitCount = 1;
         break;
      case 1: // 2 surrogates
         codeunit2 = codepoint[1];
         if( sgSurrogateLUT[codeunit2 >> 10] == 2)
         {
            // A surrogate pair encodes (code point - 0x10000) in 20 bits,
            // so the offset must be added back after combining the halves.
            ret = 0x10000 + ( ((codeunit1 & sgByteMaskLow10 ) << 10) | (codeunit2 & sgByteMaskLow10) );
            unitCount = 2;
            break;
         }
         // else, did not find a trailing surrogate where we expected one,
         // so fall through to the error
      case 2: // error
         // found a trailing surrogate where we expected a codepoint or leading surrogate.
         // Dump the replacement.
         ret = kReplacementChar;
         unitCount = 1;
         break;
      default:
         // unexpected return
         AssertFatal(false, "oneUTF16toUTF32: unexpected type");
         ret = kReplacementChar;
         unitCount = 1;
         break;
   }

   if(unitsWalked != NULL)
      *unitsWalked = unitCount;

   // codepoints in the surrogate range are illegal, and should be replaced.
   if(isSurrogateRange(ret))
      ret = kReplacementChar;

   // codepoints outside the Basic Multilingual Plane add complexity to our UTF16 string classes,
   // we've read them correctly so they wont foul the byte stream,
   // but we kill them here to make sure they wont foul anything else
   // NOTE: these are perfectly legal codepoints, we just dont want to deal with them.
   if(isAboveBMP(ret))
      ret = kReplacementChar;

   PROFILE_END();
   return ret;
}

//-----------------------------------------------------------------------------
UTF16 oneUTF32toUTF16(const UTF32 codepoint)
{
   // found a codepoint outside the encodable UTF-16 range!
   // or, found an illegal codepoint!
   // (U+10FFFF itself is the last legal code point, so the test is strict.)
   if(codepoint > 0x10FFFF || isSurrogateRange(codepoint))
      return kReplacementChar;

   // these are legal, we just don't want to deal with them.
   if(isAboveBMP(codepoint))
      return kReplacementChar;

   return (UTF16)codepoint;
}

//-----------------------------------------------------------------------------
U32 oneUTF32toUTF8(const UTF32 codepoint, UTF8 *threeByteCodeunitBuf)
{
   PROFILE_START(oneUTF32toUTF8);
   U32 bytecount = 0;
   UTF8 *buf;
   U32 working = codepoint;
   buf = threeByteCodeunitBuf;

   //-----------------
   if(isSurrogateRange(working))  // found an illegal codepoint!
      working = kReplacementChar;

   if(isAboveBMP(working))        // these are legal, we just dont want to deal with them.
      working = kReplacementChar;

   //-----------------
   if( working < (1 << 7))        // codeable in 7 bits
      bytecount = 1;
   else if( working < (1 << 11))  // codeable in 11 bits
      bytecount = 2;
   else if( working < (1 << 16))  // codeable in 16 bits
      bytecount = 3;

   AssertISV( bytecount > 0, "Error converting to UTF-8 in oneUTF32toUTF8(). isAboveBMP() should have caught this!");

   //-----------------
   U8  mask = sgByteMask8LUT[0];            // 0011 1111
   U8  marker = ( ~static_cast<U32>(mask) << 1u);            // 1000 0000

   // Process the low order bytes, shifting the codepoint down 6 each pass.
   for( S32 i = bytecount-1; i > 0; i--)
   {
      threeByteCodeunitBuf[i] = marker | (working & mask);
      working >>= 6;
   }

   // Process the 1st byte. filter based on the # of expected bytes.
   mask = sgByteMask8LUT[bytecount];
   marker = ( ~mask << 1 );
   threeByteCodeunitBuf[0] = marker | (working & mask);

   PROFILE_END();
   return bytecount;
}

//-----------------------------------------------------------------------------
U32 dStrlen(const UTF16 *unistring)
{
   if(!unistring)
      return 0;

   U32 i = 0;
   while(unistring[i] != '\0')
      i++;

//   AssertFatal( wcslen(unistring) == i, "Incorrect length" );

   return i;
}

//-----------------------------------------------------------------------------
U32 dStrlen(const UTF32 *unistring)
{
   U32 i = 0;
   while(unistring[i] != '\0')
      i++;

   return i;
}

//-----------------------------------------------------------------------------

const UTF16* dStrrchr(const UTF16* unistring, U32 c)
{
   if(!unistring) return NULL;

   const UTF16* tmp = unistring + dStrlen(unistring);
   while( tmp >= unistring)
   {
      if(*tmp == c)
         return tmp;
      tmp--;
   }
   return NULL;
}

UTF16* dStrrchr(UTF16* unistring, U32 c)
{
   const UTF16* str = unistring;
   return const_cast<UTF16*>(dStrrchr(str, c));
}

const UTF16* dStrchr(const UTF16* unistring, U32 c)
{
   if(!unistring) return NULL;
   const UTF16* tmp = unistring;

   while ( *tmp  && *tmp != c)
      tmp++;

   return  (*tmp == c) ? tmp : NULL;
}

UTF16* dStrchr(UTF16* unistring, U32 c)
{
   const UTF16* str = unistring;
   return const_cast<UTF16*>(dStrchr(str, c));
}

//-----------------------------------------------------------------------------
const UTF8* getNthCodepoint(const UTF8 *unistring, const U32 n)
{
   const UTF8* ret = unistring;
   U32 charsseen = 0;
   while( *ret && charsseen < n)
   {
      ret++;
      if((*ret & 0xC0) != 0x80)
         charsseen++;
   }

   return ret;
}

/* alternate utf-8 decode impl for speed, no error checking,
   left here for your amusement:

   U32 codeunit = codepoint + expectedByteCount - 1;
   U32 i = 0;
   switch(expectedByteCount)
   {
      case 6: ret |= ( *(codeunit--) & 0x3f ); i++;
      case 5: ret |= ( *(codeunit--) & 0x3f ) << (6 * i++);
      case 4: ret |= ( *(codeunit--) & 0x3f ) << (6 * i++);
      case 3: ret |= ( *(codeunit--) & 0x3f ) << (6 * i++);
      case 2: ret |= ( *(codeunit--) & 0x3f ) << (6 * i++);
      case 1: ret |= *(codeunit) & byteMask8LUT[expectedByteCount] << (6 * i);
   }
*/

//------------------------------------------------------------------------------
// Byte Order Mark functions

bool chompUTF8BOM( const char *inString, char **outStringPtr )
{
   *outStringPtr = const_cast<char *>( inString );

   bool valid = false;
   if (inString[0] && inString[1] && inString[2])
   {
      U8 bom[4];
      dMemcpy(bom, inString, 4);
      valid = isValidUTF8BOM(bom);
   }

   // This is hackey, but I am not sure the best way to do it at the present.
   // The only valid BOM is a UTF8 BOM, which is 3 bytes, even though we read
   // 4 bytes because it could possibly be a UTF32 BOM, and we want to provide
   // an accurate error message. Perhaps this could be re-worked when more UTF
   // formats are supported to have isValidBOM return the size of the BOM, in
   // bytes.
   if( valid )
      (*outStringPtr) += 3; // SEE ABOVE!! -pw

   return valid;
}

bool isValidUTF8BOM( U8 bom[4] )
{
   // Is it a BOM?
   if( bom[0] == 0 )
   {
      // Could be UTF32BE
      if( bom[1] == 0 && bom[2] == 0xFE && bom[3] == 0xFF )
      {
         Con::warnf( "Encountered a UTF32 BE BOM in this file; Torque does NOT support this file encoding. Use UTF8!" );
         return false;
      }

      return false;
   }
   else if( bom[0] == 0xFF )
   {
      // It's little endian, either UTF16 or UTF32
      if( bom[1] == 0xFE )
      {
         if( bom[2] == 0 && bom[3] == 0 )
            Con::warnf( "Encountered a UTF32 LE BOM in this file; Torque does NOT support this file encoding. Use UTF8!" );
         else
            Con::warnf( "Encountered a UTF16 LE BOM in this file; Torque does NOT support this file encoding. Use UTF8!" );
      }

      return false;
   }
   else if( bom[0] == 0xFE && bom[1] == 0xFF )
   {
      Con::warnf( "Encountered a UTF16 BE BOM in this file; Torque does NOT support this file encoding. Use UTF8!" );
      return false;
   }
   else if( bom[0] == 0xEF && bom[1] == 0xBB && bom[2] == 0xBF )
   {
      // Can enable this if you want -pw
      //Con::printf("Encountered a UTF8 BOM. Torque supports this.");
      return true;
   }

   // Don't print out an error message here, because it will try this with
   // every script. -pw
   return false;
}