Editeurs Européens de Logiciels Libres

by **JPL** » 16 March 2012, 10:49

PHP code:
http://pastebin.com/18iNY6dU

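Code:

    function is_utf8($str)
    {
        $len = strlen($str);
        for ($i = 0; $i < $len; $i++) {
            $c = ord($str[$i]);
            // Non-ASCII byte: it must be a lead byte.
            // (The test was "> 128", which let the stray byte 0x80 through.)
            if ($c >= 128) {
                if ($c >= 254) return false;      // 0xFE and 0xFF never occur in UTF-8
                elseif ($c >= 252) $bits = 6;     // $bits holds the sequence length in bytes
                elseif ($c >= 248) $bits = 5;
                elseif ($c >= 240) $bits = 4;
                elseif ($c >= 224) $bits = 3;
                elseif ($c >= 192) $bits = 2;
                else return false;                // 0x80..0xBF: continuation byte without a lead byte
                if (($i + $bits) > $len) return false;   // sequence truncated by the end of the string
                while ($bits > 1) {
                    $i++;
                    $b = ord($str[$i]);
                    if ($b < 128 || $b > 191) return false;   // continuation bytes are 0x80..0xBF
                    $bits--;
                }
            }
        }
        return true;
    }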
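The chain of thresholds in the PHP function is in effect a lead-byte classifier. Here is a minimal C sketch of that single step, using the same thresholds as the PHP code (so it still accepts the old 5- and 6-byte forms and the lead bytes 0xC0 and 0xC1). The name `utf8_lead_len` is a hypothetical helper, not something from the post.

```c
/* Hypothetical helper, not part of the original post.
 * Returns the number of bytes the PHP is_utf8() expects for a sequence
 * that starts with byte c:
 *   1    for plain ASCII,
 *   2..6 for a lead byte,
 *   0    if c cannot start a sequence. */
int utf8_lead_len(unsigned char c)
{
    if (c < 0x80)  return 1;   /* 0x00..0x7F: ASCII */
    if (c >= 0xFE) return 0;   /* 254, 255: never valid */
    if (c >= 0xFC) return 6;   /* 252 */
    if (c >= 0xF8) return 5;   /* 248 */
    if (c >= 0xF0) return 4;   /* 240 */
    if (c >= 0xE0) return 3;   /* 224 */
    if (c >= 0xC0) return 2;   /* 192 */
    return 0;                  /* 0x80..0xBF: continuation byte */
}
```

For the "è" of "caractère", encoded as C3 A8, `utf8_lead_len(0xC3)` gives 2, so exactly one continuation byte has to follow.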

C / C++ code:

Code:

    #include <cstdio>    // puts, printf
    #include <cstring>   // strlen

    // ****************************************************************************
    //O is_utf8 ()
    // ****************************************************************************
    int Mkdcppw::is_utf8(char string[])
    {
        // Bytes are handled as unsigned char so that they compare directly
        // against 0x80..0xFF. With plain (signed) char the values are
        // sign-extended, and %x prints ffffffc3 instead of c3.
        unsigned char ch1 = 0;
        unsigned char ch2 = 0;
        int noctets = 0;
        long llen = (long)strlen(string);
        long li;

        for (li = 0; li < llen; li++) {
            ch1 = (unsigned char)string[li];
            if (ch1 == 0xC3) puts("OK");
            printf("%x=%c, ", ch1, ch1);                                //T test point
            if (ch1 >= 0x80) {
                puts("char is > 128 - ISO or Unicode-utf8 ?");          //T test point
                if (ch1 >= 0xFE) {
                    puts("char >= 0xFE no match for utf8 !");
                    return 0;                       // 0xFE, 0xFF
                }
                else if (ch1 >= 0xFC) noctets = 6;  // 0xFC ?
                else if (ch1 >= 0xF8) noctets = 5;  // 0xF8 ?
                else if (ch1 >= 0xF0) noctets = 4;  // 0xF0: UTF-8 from F0 to F4, then 80 to BF
                                                    // if F0, second byte between 90 and BF
                                                    // if F4, second byte between 80 and 8F
                else if (ch1 >= 0xE0) noctets = 3;  // 0xE0: UTF-8 from E0 to EF, then 80 to BF
                else if (ch1 >= 0xC2) noctets = 2;  // 0xC2: UTF-8 from C2 to DF, then 80 to BF
                else {
                    puts("char is probably ISO");
                    return 0;                       // not UTF-8 if ch1 < 0xC2
                }
                if ((li + noctets) > llen) return 0;                    // truncated sequence
                printf("\nchar utf8 with %d octets\n", noctets);        //T test point
                bool second = true;                 // the F0/F4 limits apply to the second byte only
                while (noctets > 1) {
                    li++;
                    puts("ultime test : realy utf 8 ?");                //T test point
                    ch2 = (unsigned char)string[li];
                    if (ch2 < 0x80 || ch2 > 0xBF) {                     // Global
                        puts("char no match for utf8 !");               //T test point
                        return 0;                   // must be between 0x80 and 0xBF
                    }
                    if (second && ch1 == 0xF0 && ch2 < 0x90) {          // case ch1=0xF0 and ch2<0x90
                        puts("char no match for utf8 !");               //T test point
                        return 0;                   // must be between 0x90 and 0xBF
                    }
                    if (second && ch1 == 0xF4 && ch2 > 0x8F) {          // case ch1=0xF4 and ch2>0x8F
                        puts("char no match for utf8 !");               //T test point
                        return 0;                   // must be between 0x80 and 0x8F
                    }
                    second = false;
                    noctets--;
                }
            }
        }
        puts("char is utf8");
        return 1;
    }
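The comments in the C++ function already note that F0 and F4 restrict the byte that follows them. A stricter check can apply the same idea to every lead byte. The sketch below assumes the RFC 3629 definition of UTF-8: at most 4 bytes, no overlong forms, no UTF-16 surrogates, nothing above U+10FFFF. The name `is_utf8_strict` and the E0/ED rules are additions for illustration and are not part of the original post.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stricter variant (assumes RFC 3629), without test points.
 * Returns 1 if the NUL-terminated string s is well-formed UTF-8, else 0. */
int is_utf8_strict(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t len = strlen(s);
    size_t i = 0;

    while (i < len) {
        unsigned char c = p[i];
        size_t n;                              /* total bytes in this sequence */
        unsigned char lo = 0x80, hi = 0xBF;    /* allowed range for the 2nd byte */

        if (c < 0x80) { i++; continue; }       /* ASCII */
        else if (c >= 0xC2 && c <= 0xDF) n = 2;
        else if (c >= 0xE0 && c <= 0xEF) {
            n = 3;
            if (c == 0xE0) lo = 0xA0;          /* rejects overlong 3-byte forms */
            if (c == 0xED) hi = 0x9F;          /* rejects surrogates D800..DFFF */
        }
        else if (c >= 0xF0 && c <= 0xF4) {
            n = 4;
            if (c == 0xF0) lo = 0x90;          /* rejects overlong 4-byte forms */
            if (c == 0xF4) hi = 0x8F;          /* rejects values above U+10FFFF */
        }
        else return 0;                         /* 0x80..0xC1 or 0xF5..0xFF */

        if (len - i < n) return 0;             /* truncated sequence */
        if (p[i + 1] < lo || p[i + 1] > hi) return 0;
        for (size_t k = 2; k < n; k++)
            if (p[i + k] < 0x80 || p[i + k] > 0xBF) return 0;
        i += n;
    }
    return 1;
}
```

With this variant, "caractère" encoded as UTF-8 (`caract\xC3\xA8re`) is accepted, while the same word in ISO-8859-1 (`caract\xE8re`) is rejected because 0xE8 announces a 3-byte sequence and the next byte is the letter "r".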

Extract from the utf-8(7) man page:

Code:

    The UTF-8 encoding has the following nice properties:

    * UCS characters 0x00000000 to 0x0000007f (the classic US-ASCII
      characters) are encoded simply as bytes 0x00 to 0x7f (ASCII
      compatibility). This means that files and strings which contain
      only 7-bit ASCII characters have the same encoding under both
      ASCII and UTF-8.
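In practice, the property quoted from the man page means that a buffer with no byte above 0x7F needs no further inspection: it is ASCII and valid UTF-8 at the same time. Here is a minimal C sketch of that shortcut. The name `is_ascii7` is a hypothetical helper, not something from the post.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of the original post.
 * Returns 1 when every byte of s is in 0x00..0x7F, else 0. */
int is_ascii7(const char *s)
{
    size_t len = strlen(s);
    for (size_t i = 0; i < len; i++) {
        if ((unsigned char)s[i] > 0x7F)
            return 0;          /* at least one 8-bit byte: not pure ASCII */
    }
    return 1;                  /* pure ASCII, hence also valid UTF-8 */
}
```

A return value of 0 does not say which 8-bit encoding is in use. It only means the multi-byte checks of `is_utf8` are needed.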

See the Unix utf-8(7) man page.

See also: http://fr.wikipedia.org/wiki/Unicode#D% ... techniques and http://en.wikipedia.org/wiki/Unicode

by **JPL** » 16 March 2012, 11:04

Source:

Code:

    //T test with xml chars encoded UTF-8 Result : ISO-8859-1 in text file
    //T ready for printing on utf-8 format with accents
    /*D
    encoded UTF-8 :
    caractère &
    caractère <
    caractère >
    caractère "
    caractère '
    */

mkdcppw, result in buffer:

Code:

    result encoded UTF-8 in buffer:
    caractère &#x26;
    caractère &lt;
    caractère &gt;
    caractère &quot;
    caractère &apos;

mkdcppw, decoding with the is_utf8 function and its test points:

Code:

    result in function is_utf8 :
    20= , 20= , 20= , a= ,
    65=e, 6e=n, 63=c, 6f=o, 64=d, 65=e, 64=d, 20= , 55=U, 54=T, 46=F, 2d=-, 38=8, 20= , 3a=:, a= ,
    63=c, 61=a, 72=r, 61=a, 63=c, 74=t, OK
    ffffffc3=�, char is > 128 - ISO or Unicode-utf8 ?
    char utf8 with 2 octets
    ultime test : realy utf 8 ?
    72=r, 65=e, 20= , 26=&, 23=#, 78=x, 32=2, 36=6, 3b=;, a= ,
    63=c, 61=a, 72=r, 61=a, 63=c, 74=t, OK
    ffffffc3=�, char is > 128 - ISO or Unicode-utf8 ?
    char utf8 with 2 octets
    ultime test : realy utf 8 ?
    72=r, 65=e, 20= , 26=&, 6c=l, 74=t, 3b=;, a= ,
    63=c, 61=a, 72=r, 61=a, 63=c, 74=t, OK
    ffffffc3=�, char is > 128 - ISO or Unicode-utf8 ?
    char utf8 with 2 octets
    ultime test : realy utf 8 ?
    72=r, 65=e, 20= , 26=&, 67=g, 74=t, 3b=;, a= ,
    63=c, 61=a, 72=r, 61=a, 63=c, 74=t, OK
    ffffffc3=�, char is > 128 - ISO or Unicode-utf8 ?
    char utf8 with 2 octets
    ultime test : realy utf 8 ?
    72=r, 65=e, 20= , 26=&, 71=q, 75=u, 6f=o, 74=t, 3b=;, a= ,
    63=c, 61=a, 72=r, 61=a, 63=c, 74=t, OK
    ffffffc3=�, char is > 128 - ISO or Unicode-utf8 ?
    char utf8 with 2 octets
    ultime test : realy utf 8 ?
    72=r, 65=e, 20= , 26=&, 61=a, 70=p, 6f=o, 73=s, 3b=;, a= ,
    char is utf8

The function returns 1: the buffer is UTF-8.
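The trace prints the lead byte of "è" as ffffffc3 rather than c3. That is what `%x` shows when the byte is held in a signed `char` and widened to `int` before printing. The C sketch below reproduces both renderings. It assumes a platform with 32-bit `int` and two's-complement conversion to `signed char` (the usual case on Linux with gcc or clang). `byte_to_hex` is a hypothetical helper, not something from the post.

```c
#include <stdio.h>

/* Hypothetical helper, not part of the original post.
 * Writes the hex form of one byte into out.
 * as_signed != 0 mimics a plain char on a platform where char is signed. */
void byte_to_hex(char *out, size_t outsz, unsigned char byte, int as_signed)
{
    if (as_signed) {
        /* Assumption: converting 0xC3 to signed char yields -61, as on
         * common two's-complement platforms; widening to int then
         * sign-extends the value. */
        int widened = (signed char)byte;
        snprintf(out, outsz, "%x", (unsigned int)widened);
    } else {
        /* Unsigned bytes are zero-extended, so 0xC3 stays c3. */
        snprintf(out, outsz, "%x", (unsigned int)byte);
    }
}
```

Under those assumptions, the byte 0xC3 is rendered as "c3" when unsigned and as "ffffffc3" when signed. ASCII bytes such as 0x65 print the same either way, which is why only the accented character stands out in the trace.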

mkdcppw, print preview:


function is_utf8() source code
