unicode.cpp
Engine/source/core/strings/unicode.cpp
Classes:
UTF16Cache
Cache data for UTF16 strings.
Public Defines
kReplacementChar() 0xFFFD
Replacement character. The standard value is 0xFFFD.
Public Typedefs
HashTable< U32, UTF16Cache >
UTF16CacheTable
Cache for UTF16 strings.
Public Variables
sgByteMask8LUT []
Look up table.
sgByteMaskLow10
Mask for the data bits of a UTF-16 surrogate.
sgFirstByteLUT [128]
Look up table.
sgSurrogateLUT [64]
Look up table.
Public Functions
bool
chompUTF8BOM(const char * inString, char ** outStringPtr)
Functions to read and validate UTF BOMs (Byte Order Marks).
convertUTF8toUTF16N(const UTF8 * unistring, UTF16 * outbuffer, U32 len)
Functions that convert buffers of Unicode code points into a provided buffer.
UTF16 *
createUTF16string(const UTF8 * unistring)
Unicode conversion utility functions.
UTF8 *
createUTF8string(const UTF16 * unistring)
getNthCodepoint(const UTF8 * unistring, const U32 n)
Functions that scan for characters in a utf8 string.
bool
isAboveBMP(U32 codepoint)
bool
isSurrogateRange(U32 codepoint)
bool
isValidUTF8BOM(U8 bom)
oneUTF16toUTF32(const UTF16 * codepoint, U32 * unitsWalked)
oneUTF32toUTF16(const UTF32 codepoint)
oneUTF32toUTF8(const UTF32 codepoint, UTF8 * threeByteCodeunitBuf)
oneUTF8toUTF32(const UTF8 * codepoint, U32 * unitsWalked)
Functions that convert one Unicode code point at a time.
Detailed Description
Public Defines
kReplacementChar() 0xFFFD
Replacement character. The standard value is 0xFFFD.
TORQUE_ENABLE_UTF16_CACHE()
Public Typedefs
typedef HashTable< U32, UTF16Cache > UTF16CacheTable
Cache for UTF16 strings.
Public Variables
const U8 sgByteMask8LUT []
Look up table.
Index it with the value read from sgFirstByteLUT to get the mask for the data bits of that UTF-8 code unit. Index 0 gives the mask for a trailing (continuation) code unit.
const U16 sgByteMaskLow10
Mask for the data bits of a UTF-16 surrogate.
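As a hedged illustration of how a 10-bit surrogate mask is normally used, the sketch below combines a lead and trail surrogate with the standard UTF-16 formula. The function name combineSurrogates is hypothetical and not part of unicode.cpp, and the listing in this file may differ in detail from this formula.

```cpp
#include <cstdint>

// Hypothetical helper, not part of unicode.cpp.
// Combines a lead surrogate (0xD800..0xDBFF) and a trail surrogate
// (0xDC00..0xDFFF) into one code point using the standard UTF-16 rule:
// each surrogate carries 10 data bits, and the result is offset by 0x10000.
inline std::uint32_t combineSurrogates(std::uint16_t lead, std::uint16_t trail)
{
    const std::uint32_t low10 = 0x03FF; // same value as sgByteMaskLow10
    return 0x10000u + (((lead & low10) << 10) | (trail & low10));
}
```

Every value this produces lies above the BMP, so a converter that clamps to the BMP would then substitute the replacement character.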
const U8 sgFirstByteLUT [128]
Look up table.
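Shift a byte >> 1, then look up the total length in bytes of the UTF-8 sequence that byte starts (1 for single-byte ASCII). The table is unsigned, so illegal values are stored as 0 rather than -1: trailing (continuation) bytes and the bytes 0xFE and 0xFF both map to 0.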
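The mapping this table encodes can be sketched without the table itself. The helper below is an assumption made for illustration (the name utf8SequenceLength does not exist in unicode.cpp); it returns the same values as the table in the source listing, including the obsolete 5- and 6-byte forms that the converter reads and then replaces.

```cpp
#include <cstdint>

// Hypothetical helper, not part of unicode.cpp.
// Returns the total length in bytes of the UTF-8 sequence introduced by
// `lead`, or 0 if `lead` cannot start a sequence.
inline unsigned utf8SequenceLength(std::uint8_t lead)
{
    if (lead < 0x80) return 1; // 0xxxxxxx: single-byte ASCII
    if (lead < 0xC0) return 0; // 10xxxxxx: trailing byte, illegal as a lead
    if (lead < 0xE0) return 2; // 110xxxxx: first of 2
    if (lead < 0xF0) return 3; // 1110xxxx: first of 3
    if (lead < 0xF8) return 4; // 11110xxx: first of 4 (above the BMP)
    if (lead < 0xFC) return 5; // 111110xx: first of 5 (illegal in UTF-8)
    if (lead < 0xFE) return 6; // 1111110x: first of 6 (illegal in UTF-8)
    return 0;                  // 0xFE and 0xFF never appear in UTF-8
}
```

The table form trades these comparisons for a single indexed load, which is why the low bit of the byte is shifted away first: it never affects the answer.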
const U8 sgSurrogateLUT [64]
Look up table.
Shift a 16-bit word >> 10, then look up whether it is a surrogate, and which part. 0 means non-surrogate, 1 means 1st in pair, 2 means 2nd in pair.
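A minimal sketch of the same classification, written with a switch instead of a table. The enum and function names are hypothetical; only the 0/1/2 meaning comes from the description above.

```cpp
#include <cstdint>

// Hypothetical names, not part of unicode.cpp. The numeric values follow
// the table's convention: 0 = non-surrogate, 1 = 1st in pair, 2 = 2nd in pair.
enum class SurrogateKind { None = 0, Lead = 1, Trail = 2 };

inline SurrogateKind classifyUtf16Unit(std::uint16_t unit)
{
    // The top 6 bits identify the 1024-value block the code unit falls in.
    switch (unit >> 10)
    {
        case 0x36: return SurrogateKind::Lead;  // 0xD800..0xDBFF
        case 0x37: return SurrogateKind::Trail; // 0xDC00..0xDFFF
        default:   return SurrogateKind::None;
    }
}
```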
UTF16CacheTable sgUTF16Cache
Public Functions
chompUTF8BOM(const char * inString, char ** outStringPtr)
Functions to read and validate UTF BOMs (Byte Order Marks).
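For orientation, the UTF-8 BOM is the three-byte sequence 0xEF 0xBB 0xBF. The sketch below shows one way to skip it; skipUtf8Bom is a hypothetical stand-in written for this page, not the signature or behavior of chompUTF8BOM, and it assumes a null-terminated input.

```cpp
// Hypothetical helper, not part of unicode.cpp.
// Returns a pointer just past a leading UTF-8 BOM, or `s` unchanged if the
// string does not start with one. `s` must be null terminated.
inline const char* skipUtf8Bom(const char* s)
{
    const unsigned char* u = reinterpret_cast<const unsigned char*>(s);

    // Short-circuit evaluation keeps the reads inside the string: a later
    // byte is only examined if the previous one matched and was non-zero.
    if (u[0] == 0xEF && u[1] == 0xBB && u[2] == 0xBF)
        return s + 3;

    return s;
}
```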
convertUTF16toUTF8DoubleNULL(const UTF16 * unistring, UTF8 * outbuffer, U32 len)
convertUTF16toUTF8N(const UTF16 * unistring, UTF8 * outbuffer, U32 len)
convertUTF8toUTF16N(const UTF8 * unistring, UTF16 * outbuffer, U32 len)
Functions that convert buffers of Unicode code points into a provided buffer.
These functions are useful for working on existing buffers.
These cannot convert a buffer in place. If unistring is the same memory as outbuffer, the behavior is undefined.
The converter clamps output to the BMP (Basic Multilingual Plane) .
Conversion to UTF-8 requires a buffer of 3 bytes (U8's) per character, + 1.
Conversion to UTF-16 requires a buffer of 1 U16 (2 bytes) per character, + 1.
Conversion to UTF-32 requires a buffer of 1 U32 (4 bytes) per character, + 1.
Because output is clamped to the BMP, UTF-8 requires at most 3 bytes per character.
Output is null terminated. Be sure to provide 1 extra byte, U16 or U32 for the null terminator, or you will see truncated output.
If the provided buffer is too small, the output will be truncated.
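The sizing rules above reduce to simple arithmetic. The helpers below are hypothetical names introduced only to make that arithmetic concrete; they return element counts for the output buffer, including the null terminator.

```cpp
#include <cstddef>

// Hypothetical helpers, not part of unicode.cpp.

// Worst-case UTF-8 output: 3 bytes per character, plus 1 for the terminator.
constexpr std::size_t utf8BufferBytes(std::size_t characters)
{
    return characters * 3 + 1;
}

// UTF-16 output clamped to the BMP: 1 U16 per character, plus 1 terminator.
constexpr std::size_t utf16BufferUnits(std::size_t characters)
{
    return characters + 1;
}

// UTF-32 output: 1 U32 per character, plus 1 terminator.
constexpr std::size_t utf32BufferUnits(std::size_t characters)
{
    return characters + 1;
}
```

For a 5-character string this gives 16 bytes for UTF-8 and 6 elements for UTF-16 or UTF-32. Passing a smaller length to a converter would truncate the output rather than overflow the buffer, per the notes above.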
createUTF16string(const UTF8 * unistring)
Unicode conversion utility functions.
Some definitions first:
Code Point: a single character of Unicode text. Used to disambiguate from the C char type.
UTF-32: a Unicode encoding format where one code point is always 32 bits wide. This format can in theory contain any Unicode code point that will ever be needed, now or in the future. 4 billion+ code points should be enough, right?
UTF-16: a variable length Unicode encoding format where one code point can be either one or two 16-bit code units long.
UTF-8: a variable length Unicode encoding format where one code point can be up to four 8-bit code units long. The first bit of a single byte UTF-8 code point is 0. The first few bits of a multi-byte code point determine the length of the code point.
see: http://en.wikipedia.org/wiki/UTF-8
Surrogate Pair: a pair of special UTF-16 code units that encode a code point too large to fit into 16 bits. The surrogate values sit in a special reserved range of Unicode.
Code Unit: a single unit of a variable length Unicode encoded code point. UTF-8 has 8 bit wide code units. UTF-16 has 16 bit wide code units.
BMP: "Basic Multilingual Plane". Unicode values U+0000 - U+FFFF. This range of Unicode contains all the characters, for all the languages of the world, that one would usually be interested in. All code points in the BMP are 16 bits wide or less.
The current implementation of these conversion functions deals only with the BMP. Any code points above 0xFFFF, the top of the BMP, are replaced with the standard Unicode replacement character: 0xFFFD. Any UTF-16 surrogates are read correctly, but replaced. UTF-8 code points up to 6 code units wide will be read, but 5+ is illegal, and 4+ is above the BMP and will be replaced. This means that UTF-8 output is clamped to 3 code units (bytes) per code point.
Functions that convert buffers of Unicode code points, allocating a buffer.
These functions allocate their own return buffers. You are responsible for calling delete[] on these buffers.
Because they allocate memory, do not use these functions in a tight loop.
These are useful when you need a new long term copy of a string.
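Since the caller owns the returned buffer and must release it with delete[], one option is to hand it straight to a std::unique_ptr with an array type. The sketch below uses a hypothetical stand-in allocator (createCopy) rather than the engine functions, so it compiles without the Torque headers; the ownership pattern is the point, not the conversion.

```cpp
#include <cstddef>
#include <cstring>
#include <memory>

// Hypothetical stand-in for a create*string() style function: it returns a
// buffer allocated with new[], which the caller must release with delete[].
inline char* createCopy(const char* text)
{
    const std::size_t n = std::strlen(text) + 1; // include the terminator
    char* out = new char[n];
    std::memcpy(out, text, n);
    return out;
}

// Wrapping the result in unique_ptr<char[]> makes delete[] automatic, even
// on early return.
inline std::size_t ownedLength(const char* text)
{
    std::unique_ptr<char[]> owned(createCopy(text));
    return std::strlen(owned.get());
}
```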
createUTF8string(const UTF16 * unistring)
dStrchr(const UTF16 * unistring, U32 c)
dStrchr(UTF16 * unistring, U32 c)
dStrlen(const UTF16 * unistring)
Functions that calculate the length of unicode strings.
Since calculating the length of a UTF8 string is nearly as expensive as converting it to another format, a dStrlen for UTF8 is not provided here.
If unistring does not point to a null terminated string of the correct type, the behavior is undefined.
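A minimal sketch of a code-unit length count for UTF-16, using the standard char16_t type in place of the engine's UTF16 typedef (an assumption made to keep the example self-contained). Note that it counts code units, so a surrogate pair counts as 2.

```cpp
#include <cstddef>

// Hypothetical helper, not part of unicode.cpp.
// Counts 16-bit code units up to, but not including, the null terminator.
inline std::size_t utf16Length(const char16_t* s)
{
    if (!s)
        return 0;

    std::size_t n = 0;
    while (s[n] != 0)
        ++n;
    return n;
}
```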
dStrlen(const UTF32 * unistring)
dStrrchr(const UTF16 * unistring, U32 c)
dStrrchr(UTF16 * unistring, U32 c)
Scanning for characters in unicode strings.
getNthCodepoint(const UTF8 * unistring, const U32 n)
Functions that scan for characters in a utf8 string.
This is useful for getting a character-wise offset into a UTF-8 string, as opposed to the byte-wise offset that foo[i] gives.
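The idea can be sketched by skipping trailing bytes, which always have the bit pattern 10xxxxxx. The function below is a hypothetical illustration with its own name and slightly different loop structure from the listing in this file; it assumes well-formed, null-terminated input.

```cpp
// Hypothetical helper, not part of unicode.cpp.
// Returns a pointer to the start of the n-th code point (0-based) in a
// null-terminated UTF-8 string, or to the terminator if the string is shorter.
inline const char* nthCodePoint(const char* s, unsigned n)
{
    while (*s && n > 0)
    {
        ++s; // step off the current lead byte

        // Skip trailing bytes (10xxxxxx) that belong to the same code point.
        while ((static_cast<unsigned char>(*s) & 0xC0) == 0x80)
            ++s;

        --n;
    }
    return s;
}
```

For the string "a\xC3\xA9z" (a, é, z), index 2 lands on byte offset 3, whereas plain indexing with [2] would land in the middle of é.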
isAboveBMP(U32 codepoint)
isSurrogateRange(U32 codepoint)
isValidUTF8BOM(U8 bom)
oneUTF16toUTF32(const UTF16 * codepoint, U32 * unitsWalked)
oneUTF32toUTF16(const UTF32 codepoint)
oneUTF32toUTF8(const UTF32 codepoint, UTF8 * threeByteCodeunitBuf)
oneUTF8toUTF32(const UTF8 * codepoint, U32 * unitsWalked)
Functions that convert one Unicode code point at a time.
Since these functions are designed to be used in tight loops, they do not allocate buffers.
oneUTF8toUTF32() and oneUTF16toUTF32() take a pointer to the encoded code point in codepoint, return the converted Unicode code point, and set *unitsWalked to the number of code units the encoded code point took up. The next Unicode code point starts at codepoint + *unitsWalked.
oneUTF32toUTF8() requires a 3 byte buffer, and returns the # of bytes used.
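The calling pattern described above (decode one code point, then advance by the number of units walked) can be sketched as follows. decodeOneUtf8 is a simplified, hypothetical decoder written for this example: it handles 1- to 3-byte sequences only, returns 0xFFFD for anything else, and does not reject overlong forms or surrogates. It is not the engine's oneUTF8toUTF32().

```cpp
#include <cstdint>

// Hypothetical decoder, not part of unicode.cpp. Decodes one BMP code point
// from a null-terminated UTF-8 string and reports how many bytes it consumed.
inline std::uint32_t decodeOneUtf8(const unsigned char* p, unsigned* unitsWalked)
{
    std::uint32_t cp = 0xFFFD; // replacement character for anything unhandled
    unsigned len = 1;          // on error, consume a single byte

    if (p[0] < 0x80)
    {
        cp = p[0];
    }
    else if ((p[0] & 0xE0) == 0xC0 && (p[1] & 0xC0) == 0x80)
    {
        cp = ((p[0] & 0x1Fu) << 6) | (p[1] & 0x3Fu);
        len = 2;
    }
    else if ((p[0] & 0xF0) == 0xE0 && (p[1] & 0xC0) == 0x80 && (p[2] & 0xC0) == 0x80)
    {
        cp = ((p[0] & 0x0Fu) << 12) | ((p[1] & 0x3Fu) << 6) | (p[2] & 0x3Fu);
        len = 3;
    }

    if (unitsWalked)
        *unitsWalked = len;
    return cp;
}

// Walks the whole string one code point at a time, with no allocation.
inline unsigned countCodePoints(const char* s)
{
    const unsigned char* p = reinterpret_cast<const unsigned char*>(s);
    unsigned count = 0;
    while (*p)
    {
        unsigned walked = 1;
        decodeOneUtf8(p, &walked);
        p += walked; // the next code point starts here
        ++count;
    }
    return count;
}
```

Consuming a single byte on error means a malformed sequence yields a run of replacement characters instead of swallowing the next valid character, which matches the strategy described in the comments of the source listing.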

//-----------------------------------------------------------------------------
// Copyright (c) 2012 GarageGames, LLC
//
// Permission is hereby granted, free of charge, to any person obtaining a copy
// of this software and associated documentation files (the "Software"), to
// deal in the Software without restriction, including without limitation the
// rights to use, copy, modify, merge, publish, distribute, sublicense, and/or
// sell copies of the Software, and to permit persons to whom the Software is
// furnished to do so, subject to the following conditions:
//
// The above copyright notice and this permission notice shall be included in
// all copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
// FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
// IN THE SOFTWARE.
//-----------------------------------------------------------------------------

#include <stdio.h>

#include "core/frameAllocator.h"
#include "core/strings/unicode.h"
#include "core/strings/stringFunctions.h"

#include "platform/profiler.h"
#include "console/console.h"

#define TORQUE_ENABLE_UTF16_CACHE

#ifdef TORQUE_ENABLE_UTF16_CACHE
#include "core/util/tDictionary.h"
#include "core/util/hashFunction.h"
#endif

//-----------------------------------------------------------------------------
/// replacement character. Standard correct value is 0xFFFD.
#define kReplacementChar 0xFFFD

/// Look up table. Shift a byte >> 1, then look up how many bytes to expect after it.
/// Contains -1's for illegal values.
static const U8 sgFirstByteLUT[128] =
{
   1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 0x0F // single byte ascii
   1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 0x1F // single byte ascii
   1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 0x2F // single byte ascii
   1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 0x3F // single byte ascii

   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // 0x4F // trailing utf8
   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // 0x5F // trailing utf8
   2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, // 0x6F // first of 2
   3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 6, 0, // 0x7F // first of 3,4,5,illegal in utf-8
};

/// Look up table. Shift a 16-bit word >> 10, then look up whether it is a surrogate,
/// and which part. 0 means non-surrogate, 1 means 1st in pair, 2 means 2nd in pair.
static const U8 sgSurrogateLUT[64] =
{
   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // 0x0F
   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // 0x1F
   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // 0x2F
   0, 0, 0, 0, 0, 0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0, // 0x3F
};

/// Look up table. Feed value from firstByteLUT in, gives you
/// the mask for the data bits of that UTF-8 code unit.
static const U8 sgByteMask8LUT[] = { 0x3f, 0x7f, 0x1f, 0x0f, 0x07, 0x03, 0x01 }; // last 0=6, 1=7, 2=5, 4, 3, 2, 1 bits

/// Mask for the data bits of a UTF-16 surrogate.
static const U16 sgByteMaskLow10 = 0x03ff;

//-----------------------------------------------------------------------------

#ifdef TORQUE_ENABLE_UTF16_CACHE

/// Cache data for UTF16 strings. This is wrapped in a class so that data is
/// automatically freed when the hash table is deleted.
struct UTF16Cache
{
   UTF16 *mString;
   U32 mLength;

   UTF16Cache()
   {
      mString = NULL;
      mLength = 0;
   }

   UTF16Cache(UTF16 *str, U32 len)
   {
      mLength = len;
      mString = new UTF16[mLength];
      dMemcpy(mString, str, mLength * sizeof(UTF16));
   }

   UTF16Cache(const UTF16Cache &other)
   {
      mLength = other.mLength;
      mString = new UTF16[mLength];
      dMemcpy(mString, other.mString, mLength * sizeof(UTF16));
   }

   UTF16Cache & operator=(const UTF16Cache &other)
   {
      if (&other != this)
      {
         delete [] mString;

         mLength = other.mLength;
         mString = new UTF16[mLength];
         dMemcpy(mString, other.mString, mLength * sizeof(UTF16));
      }
      return *this;
   }

   ~UTF16Cache()
   {
      delete [] mString;
   }

   void copyToBuffer(UTF16 *outBuffer, U32 lenToCopy, bool nullTerminate = true) const
   {
      U32 copy = getMin(mLength, lenToCopy);
      if(mString && copy > 0)
         dMemcpy(outBuffer, mString, copy * sizeof(UTF16));

      if(nullTerminate)
         outBuffer[copy] = 0;
   }
};

/// Cache for UTF16 strings
typedef HashTable<U32, UTF16Cache> UTF16CacheTable;
static UTF16CacheTable sgUTF16Cache;

#endif // TORQUE_ENABLE_UTF16_CACHE

//-----------------------------------------------------------------------------
inline bool isSurrogateRange(U32 codepoint)
{
   return ( 0xd800 < codepoint && codepoint < 0xdfff );
}

inline bool isAboveBMP(U32 codepoint)
{
   return ( codepoint > 0xFFFF );
}

//-----------------------------------------------------------------------------
U32 convertUTF8toUTF16N(const UTF8 *unistring, UTF16 *outbuffer, U32 len)
{
   AssertFatal(len >= 1, "Buffer for unicode conversion must be large enough to hold at least the null terminator.");
   PROFILE_SCOPE(convertUTF8toUTF16);

#ifdef TORQUE_ENABLE_UTF16_CACHE
   // If we have cached this conversion already, don't do it again
   U32 hashKey = Torque::hash((const U8 *)unistring, dStrlen(unistring), 0);
   UTF16CacheTable::Iterator cacheItr = sgUTF16Cache.find(hashKey);
   if(cacheItr != sgUTF16Cache.end())
   {
      const UTF16Cache &cache = (*cacheItr).value;
      cache.copyToBuffer(outbuffer, len);
      return getMin(cache.mLength,len - 1);
   }
#endif

   U32 walked, nCodepoints;
   UTF32 middleman;

   nCodepoints=0;
   while(*unistring != '\0' && nCodepoints < len)
   {
      walked = 1;
      middleman = oneUTF8toUTF32(unistring,&walked);
      outbuffer[nCodepoints] = oneUTF32toUTF16(middleman);
      unistring+=walked;
      nCodepoints++;
   }

   nCodepoints = getMin(nCodepoints,len - 1);
   outbuffer[nCodepoints] = '\0';

#ifdef TORQUE_ENABLE_UTF16_CACHE
   // Cache the results.
   // FIXME As written, this will result in some unnecessary memory copying due to copy constructor calls.
   UTF16Cache cache(outbuffer, nCodepoints);
   sgUTF16Cache.insertUnique(hashKey, cache);
#endif

   return nCodepoints;
}

//-----------------------------------------------------------------------------
U32 convertUTF16toUTF8N( const UTF16 *unistring, UTF8 *outbuffer, U32 len)
{
   AssertFatal(len >= 1, "Buffer for unicode conversion must be large enough to hold at least the null terminator.");
   PROFILE_START(convertUTF16toUTF8);
   U32 walked, nCodeunits, codeunitLen;
   UTF32 middleman;

   nCodeunits=0;
   while( *unistring != '\0' && nCodeunits + 3 < len )
   {
      walked = 1;
      middleman = oneUTF16toUTF32(unistring,&walked);
      codeunitLen = oneUTF32toUTF8(middleman, &outbuffer[nCodeunits]);
      unistring += walked;
      nCodeunits += codeunitLen;
   }

   nCodeunits = getMin(nCodeunits,len - 1);
   outbuffer[nCodeunits] = '\0';

   PROFILE_END();
   return nCodeunits;
}

U32 convertUTF16toUTF8DoubleNULL( const UTF16 *unistring, UTF8 *outbuffer, U32 len)
{
   AssertFatal(len >= 1, "Buffer for unicode conversion must be large enough to hold at least the null terminator.");
   PROFILE_START(convertUTF16toUTF8DoubleNULL);
   U32 walked, nCodeunits, codeunitLen;
   UTF32 middleman;

   nCodeunits=0;
   while( ! (*unistring == '\0' && *(unistring + 1) == '\0') && nCodeunits + 3 < len )
   {
      walked = 1;
      middleman = oneUTF16toUTF32(unistring,&walked);
      codeunitLen = oneUTF32toUTF8(middleman, &outbuffer[nCodeunits]);
      unistring += walked;
      nCodeunits += codeunitLen;
   }

   nCodeunits = getMin(nCodeunits,len - 1);
   outbuffer[nCodeunits] = NULL;
   outbuffer[nCodeunits+1] = NULL;

   PROFILE_END();
   return nCodeunits;
}

//-----------------------------------------------------------------------------
// Functions that convert buffers of unicode code points
//-----------------------------------------------------------------------------
UTF16* createUTF16string( const UTF8* unistring)
{
   PROFILE_SCOPE(createUTF16string);

   // allocate plenty of memory.
   U32 nCodepoints, len = dStrlen(unistring) + 1;
   FrameTemp<UTF16> buf(len);

   // perform conversion
   nCodepoints = convertUTF8toUTF16N( unistring, buf, len);

   // add 1 for the NULL terminator the converter promises it included.
   nCodepoints++;

   // allocate the return buffer, copy over, and return it.
   UTF16 *ret = new UTF16[nCodepoints];
   dMemcpy(ret, buf, nCodepoints * sizeof(UTF16));

   return ret;
}

//-----------------------------------------------------------------------------
UTF8* createUTF8string( const UTF16* unistring)
{
   PROFILE_SCOPE(createUTF8string);

   // allocate plenty of memory.
   U32 nCodeunits, len = dStrlen(unistring) * 3 + 1;
   FrameTemp<UTF8> buf(len);

   // perform conversion
   nCodeunits = convertUTF16toUTF8N( unistring, buf, len);

   // add 1 for the NULL terminator the converter promises it included.
   nCodeunits++;

   // allocate the return buffer, copy over, and return it.
   UTF8 *ret = new UTF8[nCodeunits];
   dMemcpy(ret, buf, nCodeunits * sizeof(UTF8));

   return ret;
}

//-----------------------------------------------------------------------------

//-----------------------------------------------------------------------------
// Functions that converts one unicode codepoint at a time
//-----------------------------------------------------------------------------
UTF32 oneUTF8toUTF32( const UTF8* codepoint, U32 *unitsWalked)
{
   PROFILE_SCOPE(oneUTF8toUTF32);

   // codepoints 6 codeunits long are read, but do not convert correctly,
   // and are filtered out anyway.

   // early out for ascii
   if(!(*codepoint & 0x0080))
   {
      if (unitsWalked != NULL)
         *unitsWalked = 1;
      return (UTF32)*codepoint;
   }

   U32 expectedByteCount;
   UTF32 ret = 0;
   U8 codeunit;

   // check the first byte ( a.k.a. codeunit ) .
   U8 c = codepoint[0];
   c = c >> 1;
   expectedByteCount = sgFirstByteLUT[c];
   if(expectedByteCount > 0) // 0 or negative is illegal to start with
   {
      // process 1st codeunit
      ret |= sgByteMask8LUT[expectedByteCount] & codepoint[0]; // bug?

      // process trailing codeunits
      for(U32 i=1;i<expectedByteCount; i++)
      {
         codeunit = codepoint[i];
         if( sgFirstByteLUT[codeunit>>1] == 0 )
         {
            ret <<= 6; // shift up 6
            ret |= (codeunit & 0x3f); // mask in the low 6 bits of this codeunit byte.
         }
         else
         {
            // found a bad codepoint - did not get a medial where we wanted one.
            // Dump the replacement, and claim to have parsed only 1 char,
            // so that we'll dump a slew of replacements, instead of eating the next char.
            ret = kReplacementChar;
            expectedByteCount = 1;
            break;
         }
      }
   }
   else
   {
      // found a bad codepoint - got a medial or an illegal codeunit.
      // Dump the replacement, and claim to have parsed only 1 char,
      // so that we'll dump a slew of replacements, instead of eating the next char.
      ret = kReplacementChar;
      expectedByteCount = 1;
   }

   if(unitsWalked != NULL)
      *unitsWalked = expectedByteCount;

   // codepoints in the surrogate range are illegal, and should be replaced.
   if(isSurrogateRange(ret))
      ret = kReplacementChar;

   // codepoints outside the Basic Multilingual Plane add complexity to our UTF16 string classes,
   // we've read them correctly so they won't foul the byte stream,
   // but we kill them here to make sure they wont foul anything else
   if(isAboveBMP(ret))
      ret = kReplacementChar;

   return ret;
}

//-----------------------------------------------------------------------------
UTF32 oneUTF16toUTF32(const UTF16* codepoint, U32 *unitsWalked)
{
   PROFILE_START(oneUTF16toUTF32);
   U8 expectedType;
   U32 unitCount;
   UTF32 ret = 0;
   UTF16 codeunit1,codeunit2;

   codeunit1 = codepoint[0];
   expectedType = sgSurrogateLUT[codeunit1 >> 10];
   switch(expectedType)
   {
      case 0: // simple
         ret = codeunit1;
         unitCount = 1;
         break;
      case 1: // 2 surrogates
         codeunit2 = codepoint[1];
         if( sgSurrogateLUT[codeunit2 >> 10] == 2)
         {
            ret = ((codeunit1 & sgByteMaskLow10 ) << 10) | (codeunit2 & sgByteMaskLow10);
            unitCount = 2;
            break;
         }
         // else, did not find a trailing surrogate where we expected one,
         // so fall through to the error
      case 2: // error
         // found a trailing surrogate where we expected a codepoint or leading surrogate.
         // Dump the replacement.
         ret = kReplacementChar;
         unitCount = 1;
         break;
      default:
         // unexpected return
         AssertFatal(false, "oneUTF16toUTF323: unexpected type");
         ret = kReplacementChar;
         unitCount = 1;
         break;
   }

   if(unitsWalked != NULL)
      *unitsWalked = unitCount;

   // codepoints in the surrogate range are illegal, and should be replaced.
   if(isSurrogateRange(ret))
      ret = kReplacementChar;

   // codepoints outside the Basic Multilingual Plane add complexity to our UTF16 string classes,
   // we've read them correctly so they wont foul the byte stream,
   // but we kill them here to make sure they wont foul anything else
   // NOTE: these are perfectly legal codepoints, we just dont want to deal with them.
   if(isAboveBMP(ret))
      ret = kReplacementChar;

   PROFILE_END();
   return ret;
}

//-----------------------------------------------------------------------------
UTF16 oneUTF32toUTF16(const UTF32 codepoint)
{
   // found a codepoint outside the encodable UTF-16 range!
   // or, found an illegal codepoint!
   if(codepoint >= 0x10FFFF || isSurrogateRange(codepoint))
      return kReplacementChar;

   // these are legal, we just don't want to deal with them.
   if(isAboveBMP(codepoint))
      return kReplacementChar;

   return (UTF16)codepoint;
}

//-----------------------------------------------------------------------------
U32 oneUTF32toUTF8(const UTF32 codepoint, UTF8 *threeByteCodeunitBuf)
{
   PROFILE_START(oneUTF32toUTF8);
   U32 bytecount = 0;
   UTF8 *buf;
   U32 working = codepoint;
   buf = threeByteCodeunitBuf;

   //-----------------
   if(isSurrogateRange(working)) // found an illegal codepoint!
      working = kReplacementChar;

   if(isAboveBMP(working)) // these are legal, we just dont want to deal with them.
      working = kReplacementChar;

   //-----------------
   if( working < (1 << 7)) // codeable in 7 bits
      bytecount = 1;
   else if( working < (1 << 11)) // codeable in 11 bits
      bytecount = 2;
   else if( working < (1 << 16)) // codeable in 16 bits
      bytecount = 3;

   AssertISV( bytecount > 0, "Error converting to UTF-8 in oneUTF32toUTF8(). isAboveBMP() should have caught this!");

   //-----------------
   U8 mask = sgByteMask8LUT[0]; // 0011 1111
   U8 marker = ( ~static_cast<U32>(mask) << 1u); // 1000 0000

   // Process the low order bytes, shifting the codepoint down 6 each pass.
   for( S32 i = bytecount-1; i > 0; i--)
   {
      threeByteCodeunitBuf[i] = marker | (working & mask);
      working >>= 6;
   }

   // Process the 1st byte. filter based on the # of expected bytes.
   mask = sgByteMask8LUT[bytecount];
   marker = ( ~mask << 1 );
   threeByteCodeunitBuf[0] = marker | (working & mask);

   PROFILE_END();
   return bytecount;
}

//-----------------------------------------------------------------------------
U32 dStrlen(const UTF16 *unistring)
{
   if(!unistring)
      return 0;

   U32 i = 0;
   while(unistring[i] != '\0')
      i++;

// AssertFatal( wcslen(unistring) == i, "Incorrect length" );

   return i;
}

//-----------------------------------------------------------------------------
U32 dStrlen(const UTF32 *unistring)
{
   U32 i = 0;
   while(unistring[i] != '\0')
      i++;

   return i;
}

//-----------------------------------------------------------------------------

const UTF16* dStrrchr(const UTF16* unistring, U32 c)
{
   if(!unistring) return NULL;

   const UTF16* tmp = unistring + dStrlen(unistring);
   while( tmp >= unistring)
   {
      if(*tmp == c)
         return tmp;
      tmp--;
   }
   return NULL;
}

UTF16* dStrrchr(UTF16* unistring, U32 c)
{
   const UTF16* str = unistring;
   return const_cast<UTF16*>(dStrrchr(str, c));
}

const UTF16* dStrchr(const UTF16* unistring, U32 c)
{
   if(!unistring) return NULL;
   const UTF16* tmp = unistring;

   while ( *tmp && *tmp != c)
      tmp++;

   return (*tmp == c) ? tmp : NULL;
}

UTF16* dStrchr(UTF16* unistring, U32 c)
{
   const UTF16* str = unistring;
   return const_cast<UTF16*>(dStrchr(str, c));
}

//-----------------------------------------------------------------------------
const UTF8* getNthCodepoint(const UTF8 *unistring, const U32 n)
{
   const UTF8* ret = unistring;
   U32 charsseen = 0;
   while( *ret && charsseen < n)
   {
      ret++;
      if((*ret & 0xC0) != 0x80)
         charsseen++;
   }

   return ret;
}

/* alternate utf-8 decode impl for speed, no error checking,
   left here for your amusement:

   U32 codeunit = codepoint + expectedByteCount - 1;
   U32 i = 0;
   switch(expectedByteCount)
   {
      case 6: ret |= ( *(codeunit--) & 0x3f ); i++;
      case 5: ret |= ( *(codeunit--) & 0x3f ) << (6 * i++);
      case 4: ret |= ( *(codeunit--) & 0x3f ) << (6 * i++);
      case 3: ret |= ( *(codeunit--) & 0x3f ) << (6 * i++);
      case 2: ret |= ( *(codeunit--) & 0x3f ) << (6 * i++);
      case 1: ret |= *(codeunit) & byteMask8LUT[expectedByteCount] << (6 * i);
   }
*/

//------------------------------------------------------------------------------
// Byte Order Mark functions

bool chompUTF8BOM( const char *inString, char **outStringPtr )
{
   *outStringPtr = const_cast<char *>( inString );

   bool valid = false;
   if (inString[0] && inString[1] && inString[2])
   {
      U8 bom[4];
      dMemcpy(bom, inString, 4);
      valid = isValidUTF8BOM(bom);
   }

   // This is hackey, but I am not sure the best way to do it at the present.
   // The only valid BOM is a UTF8 BOM, which is 3 bytes, even though we read
   // 4 bytes because it could possibly be a UTF32 BOM, and we want to provide
   // an accurate error message. Perhaps this could be re-worked when more UTF
   // formats are supported to have isValidBOM return the size of the BOM, in
   // bytes.
   if( valid )
      (*outStringPtr) += 3; // SEE ABOVE!! -pw

   return valid;
}

bool isValidUTF8BOM( U8 bom[4] )
{
   // Is it a BOM?
   if( bom[0] == 0 )
   {
      // Could be UTF32BE
      if( bom[1] == 0 && bom[2] == 0xFE && bom[3] == 0xFF )
      {
         Con::warnf( "Encountered a UTF32 BE BOM in this file; Torque does NOT support this file encoding. Use UTF8!" );
         return false;
      }

      return false;
   }
   else if( bom[0] == 0xFF )
   {
      // It's little endian, either UTF16 or UTF32
      if( bom[1] == 0xFE )
      {
         if( bom[2] == 0 && bom[3] == 0 )
            Con::warnf( "Encountered a UTF32 LE BOM in this file; Torque does NOT support this file encoding. Use UTF8!" );
         else
            Con::warnf( "Encountered a UTF16 LE BOM in this file; Torque does NOT support this file encoding. Use UTF8!" );
      }

      return false;
   }
   else if( bom[0] == 0xFE && bom[1] == 0xFF )
   {
      Con::warnf( "Encountered a UTF16 BE BOM in this file; Torque does NOT support this file encoding. Use UTF8!" );
      return false;
   }
   else if( bom[0] == 0xEF && bom[1] == 0xBB && bom[2] == 0xBF )
   {
      // Can enable this if you want -pw
      //Con::printf("Encountered a UTF8 BOM. Torque supports this.");
      return true;
   }

   // Don't print out an error message here, because it will try this with
   // every script. -pw
   return false;
}