Some clarification about charset settings for the various DBMS (MySQL, PostgreSQL, Firebird/Interbase, Oracle, Sqlite) and a call for help for charset setting support in PEAR::MDB2.
Foreword: we're working on PEAR::MDB2 charset and collation support. That means that we're in the process of figuring out what's the best way to set the charset in each DBMS, and how to make it portable and consistent through all of them.
I'll try to explain how the various DBMS implement charset settings first, then describe the MDB2 status-quo regarding charset support. Finally, we'd like to get some feedback to implement a better charset support in MDB2.
In this wiki page there's a nice summary of the possible charset settings:
While the above schema is more or less clear, when it comes to implementing a generic solution for every DBMS, then some doubts start creeping in.
In an ideal word, all the DBMS would implement the SQL standard in a standard way ;-), or at least provide some conceptually equivalent set of parameters. In the real word, things are a bit different...
|mysql||SET character_set_client = charset||SET character_set_connection = charset||SET character_set_results = charset|
|pgsql||SET NAMES 'charset'||-||SET NAMES 'charset'|
|ibase||SET NAMES charset (isql)||-||SET NAMES charset (isql)|
|oracle||NLS_LANG environmental variable||-||-|
SET NAMES charset" clause: in MySQL, it is equivalent to these three commands:
SET character_set_client = charset; SET character_set_results = charset; SET character_set_connection = charset;
In MySQL, there's also the "
SET CHARACTER SET charset" clause, which is equivalent to these three commands:
SET character_set_client = charset; SET character_set_results = charset; SET collation_connection = @@collation_database;
In other words, it is similar to
SET NAMES, but sets the connection character set and collation to be those of the default database.
With PostgreSQL, we have three ways to set the client encoding:
1. pg_setclientencoding($cid, "BIG5") 2. pg_exec("SET NAMES 'BIG5'") 3. pg_exec("SET CLIENT_ENCODING TO 'BIG5'")
2 and 3 are identical, 1 is doing the same thing as 2 or 3 AND set the internal encoding of libpq, which is linked to PHP, to BIG5. This is necessary if the second byte of the BIG5 character is "\" or one of other control characters.
|mysql||CREATE DATABASE database DEFAULT CHARACTER SET charset COLLATE collation||CREATE TABLE table1 (...) CHARACTER SET charset COLLATE collation||CREATE TABLE table1 (column1 VARCHAR(5) CHARACTER SET charset COLLATE collation)|
|pgsql||CREATE DATABASE database WITH ENCODING 'charset'||-||-|
|ibase||CREATE DATABASE database DEFAULT CHARACTER SET 'charset'||-||CREATE TABLE Table1 (column1 VARCHAR(5) CHARACTER SET charset COLLATE collation)|
|sqlite||-||PRAGMA encoding="charset"||CREATE TABLE Table1 (column1 VARCHAR(5) COLLATE collation)|
|sql server||CREATE DATABASE database COLLATE collation||-||CREATE TABLE Table1 (column1 VARCHAR(5) COLLATE collation)|
I admit my experience with charsets is limited to MySQL, PostgreSQL and Firebird/Interbase with ISO-8859-1 and UTF-8,
and always in a uniform setup (i.e. using the same charset for the whole db, and for the data I/O). For instance,
I don't have a (successful) first-hand experience of mixed setups, i.e. where the server charset is -say-
latin1 and you have a table in
latin2 and another in
AFAICS, others have tried with mixed results.
Since our goal is to provide a good charset/collation setting implementation for PEAR::MDB2, we humbly ask for counselling to those of you who know better. Please contact me or Lukas with your corrections, suggestions and guidelines. We're especially interested in hearing experiences with dealing with multiple charsets at the same time. I will also update this page so it can serve as a future reference to others too.
For instance, ATM we only have a generic
setCharset() method, which in the ibase, mysql and pgsql drivers triggers a
"SET NAMES" query. Initially, in the MySQL driver it translated to a
"SET character_set_client" query, but I changed it to
"SET NAMES" for it was not effective for most users. Is it the right choice? Should we implement a method for each charset setting type? If so, how? Remember that we must provide a portable solution.
Thank you for your attention.