DBMS and charsets, a summary and a call for help

Foreword: we're working on PEAR::MDB2 charset and collation support. That means that we're in the process of figuring out what's the best way to set the charset in each DBMS, and how to make it portable and consistent through all of them.
I'll try to explain how the various DBMS implement charset settings first, then describe the MDB2 status-quo regarding charset support. Finally, we'd like to get some feedback to implement a better charset support in MDB2.

Types of charset settings

In this wiki page there's a nice summary of the possible charset settings:

  • client: charset of a query, i.e. the charset as it leaves client and is received by the server.
  • connection (server): this is the input charset used by the server. If the client charset is different, it is translated into the server charset (lossy conversion?)
  • database/table/field: this is the default charset for the db/table/field, i.e. the charset of the stored data.
  • results (message): this is the output charset used by the server. Everything coming from the server (messages and resultsets) is sent in this charset



While the above schema is more or less clear, when it comes to implementing a generic solution for every DBMS, then some doubts start creeping in.

Summary of the charset settings supported by each DBMS

In an ideal word, all the DBMS would implement the SQL standard in a standard way ;-), or at least provide some conceptually equivalent set of parameters. In the real word, things are a bit different...

DBMS charset comparison chart #1 (I/O)
DBMS client connection result
mysql SET character_set_client = charset SET character_set_connection = charset SET character_set_results = charset
pgsql SET NAMES 'charset' - SET NAMES 'charset'
ibase SET NAMES charset (isql) - SET NAMES charset (isql)
oracle NLS_LANG environmental variable - -
sqlite - - -


A note about the SQL standard "SET NAMES charset" clause: in MySQL, it is equivalent to these three commands:

SET character_set_client = charset;
SET character_set_results = charset;
SET character_set_connection = charset;

In MySQL, there's also the "SET CHARACTER SET charset" clause, which is equivalent to these three commands:

SET character_set_client = charset;
SET character_set_results = charset;
SET collation_connection = @@collation_database;

In other words, it is similar to SET NAMES, but sets the connection character set and collation to be those of the default database.

With PostgreSQL, we have three ways to set the client encoding:

1. pg_setclientencoding($cid, "BIG5")
2. pg_exec("SET NAMES 'BIG5'")
3. pg_exec("SET CLIENT_ENCODING TO 'BIG5'")

2 and 3 are identical, 1 is doing the same thing as 2 or 3 AND set the internal encoding of libpq, which is linked to PHP, to BIG5. This is necessary if the second byte of the BIG5 character is "\" or one of other control characters.

Setting the charset for the data storage

DBMS charset comparison chart #2 (data storage)
DBMS database table field
mysql CREATE DATABASE database DEFAULT CHARACTER SET charset COLLATE collation CREATE TABLE table1 (...) CHARACTER SET charset COLLATE collation CREATE TABLE table1 (column1 VARCHAR(5) CHARACTER SET charset COLLATE collation)
pgsql CREATE DATABASE database WITH ENCODING 'charset' - -
ibase CREATE DATABASE database DEFAULT CHARACTER SET 'charset' - CREATE TABLE Table1 (column1 VARCHAR(5) CHARACTER SET charset COLLATE collation)
oracle - - -
sqlite - PRAGMA encoding="charset" CREATE TABLE Table1 (column1 VARCHAR(5) COLLATE collation)
sql server CREATE DATABASE database COLLATE collation - CREATE TABLE Table1 (column1 VARCHAR(5) COLLATE collation)

Where not specified, the charset is the default charset of the system, of the default database or of the current table.

When in doubt, ask the experts!

I admit my experience with charsets is limited to MySQL, PostgreSQL and Firebird/Interbase with ISO-8859-1 and UTF-8, and always in a uniform setup (i.e. using the same charset for the whole db, and for the data I/O). For instance, I don't have a (successful) first-hand experience of mixed setups, i.e. where the server charset is -say- latin1 and you have a table in latin2 and another in cp1251_koi8.
AFAICS, others have tried with mixed results.

Since our goal is to provide a good charset/collation setting implementation for PEAR::MDB2, we humbly ask for counselling to those of you who know better. Please contact me or Lukas with your corrections, suggestions and guidelines. We're especially interested in hearing experiences with dealing with multiple charsets at the same time. I will also update this page so it can serve as a future reference to others too.

For instance, ATM we only have a generic setCharset() method, which in the ibase, mysql and pgsql drivers triggers a "SET NAMES" query. Initially, in the MySQL driver it translated to a "SET character_set_client" query, but I changed it to "SET NAMES" for it was not effective for most users. Is it the right choice? Should we implement a method for each charset setting type? If so, how? Remember that we must provide a portable solution.

Thank you for your attention.




1 response to "DBMS and charsets, a summary and a call for help"

invaluable information and advise the saved me the day... in a not that trivial topic indeed. many thanks Lorenzo!

Lorenzo Alberton

Lorenzo Alberton Lorenzo PHP5 ZCE - Zend Certified Engineer has been working with large enterprise UK companies for the past years and is now Chief Tech Architect at DataSift. He's an international conference speaker and a long-time contributor to many open source projects. Lorenzo Alberton's profile on LinkedIN View Lorenzo Alberton's Twitter stream

Lorenzo Alberton - Sun Certified MySQL 5 Developer

Tags

AJAX, Apache, Book Review, Charset, Cheat Sheet, Data structures, Database, Firebird SQL, Hadoop, Imagick, INFORMATION_SCHEMA, JavaScript, Kafka, Linux, Message Queues, mod_rewrite, Monitoring, MySQL, NoSQL, Oracle, PDO, PEAR, Performance, PHP, PostgreSQL, Profiling, Scalability, Security, SPL, SQL Server, SQLite, Testing, Tutorial, TYPO3, Windows, Zend Framework

Buy me a book - The Best Software Writing 1