Every now and again, we as DBAs need to get away from the daily grind of the hum drum tasks. We need to find a way to have a little fun (the kind that would be non-harmful and lacking in mal-intent).
Sometimes, this fun can be had through learning opportunities. Sometimes, we can have a little fun through diving deep into SQL Server in some way. At least I know that proves to be enlightening and enjoyable for myself. Sometimes, it is just fun for the sake of a good laugh. What if we could find an opportunity to have a little fun that incorporates learning and a laugh or two? BINGO!
Enter the opportunity to play with emojis, collations, and database (or object) names.
Fun with Characters
Let’s figure we have a requirement to create a database with sensitive data. Due to the sensitivity of the data, it is classified confidential (for your eyes only, don’t talk about it and plug your ears if somebody starts talking about it). This is so sensitive that an apt name for the database could be anything like 🙈 or 🙉 or 🙊. Being smart, you know there are two more databases coming down the line so you only want to pick one of those for the name and not all three (though all three could make sense for a single database name).
Being on top of your game, you prep for all three databases in advance. Here is the script to create the three new databases.
CREATE DATABASE [🙈];
CREATE DATABASE [🙉];
CREATE DATABASE [🙊];
In executing the script, you receive the following results:
Msg 1801, Level 16, State 3, Line 15
Database ‘🙈’ already exists. Choose a different database name.
Msg 1801, Level 16, State 3, Line 17
Database ‘🙉’ already exists. Choose a different database name.
Msg 1801, Level 16, State 3, Line 19
Database ‘🙊’ already exists. Choose a different database name.
You look up and down through your SSMS window and even query sys.databases to ensure those databases are factually not present. You are able to confirm that none or currently present on the server yet the script continues to fail. How could this be? Clearly, the images are each very different.
From the database review, you know there is only one other database with an emoji style name – 👺. What could it be? Well, as it turns out, it is the collation that was chosen for the SQL Server install. The chosen collation doesn’t support a large enough range of code points so most emojis end up getting mapped to the same thing when stored in SQL Server (topic for much larger article and covered very completely by Solomon Rutzky on his blog – – here). The short of the issue is the use of supplementary characters and splitting the code point into 2 surrogate codes. For most emojis, this surrogate system results in the “high” surrogate key being the same. Let’s take a look at it.
SELECT UNICODE(N'☯' COLLATE Latin1_General_100_CI_AS_SC) AS '☯',
UNICODE(N'🙉' COLLATE Latin1_General_100_CI_AS_SC) AS '🙉',
UNICODE(N'🙊' COLLATE Latin1_General_100_CI_AS_SC) AS '🙊',
UNICODE(N'🙈' COLLATE Latin1_General_100_CI_AS_SC) AS '🙈',
UNICODE(N'👺' COLLATE Latin1_General_100_CI_AS_SC) AS '👺',
'Latin1_General_100_CI_AS_SC' AS Collation
SELECT UNICODE(N'☯' COLLATE DATABASE_DEFAULT),
UNICODE(N'🙉' COLLATE DATABASE_DEFAULT),
UNICODE(N'🙊' COLLATE DATABASE_DEFAULT),
UNICODE(N'🙈' COLLATE DATABASE_DEFAULT),
UNICODE(N'👺' COLLATE DATABASE_DEFAULT),
Notice how each of the monkeys plus the devil mask all translate to the same code point when using the SQL_Latin1_General_CP1_CI_AS (default collation)? This is the root of the problem. If we change to a different server with a better collation (e.g. Latin1_General_100_CI_AS_SC) then we will see vastly different results.
Then let’s go ahead and create each of those three high security databases (seenoevil, speaknoevil, and hearnoevil). Et voila! We now have each of those new databases created and they are indeed unique as we desired (as you can see in the preceding image).
If you want to learn all of the hard core nitty gritty behind this, I recommend reading the blog by Solomon Rutzky – here. Solomon knows his 💩 about the collations and can answer or figure out how to answer just about anything you could throw at him about the topic. His blog is a WEALTH of in-depth knowledge with all sorts of tests and proofs. Check it out!
Put a bow on it
Working as a DBA and creating databases doesn’t have to be the same old hum drum day in and day out. This article showed how to have a little fun while creating databases and database objects. At the same time, I showed how the collation of your server can impact some of your naming schemes.
Interested in learning about some deep technical information instead? Check these out!
This is the ninth article in the 2019 “12 Days of Christmas” series. For the full list of articles, please visit this page.
*Post Mortem from Solomon who is the collation genius
And, that brings us to the next fun fact: being able to use more than 1 emoji for database names has nothing to do with whether or not SQL Server sees a surrogate pair as being a single Supplementary Character or merely two surrogate code points. Using SCA collations only affects the behavior of the built-in functions, even though the official documentation says that they help in sorting and comparison (the doc is wrong and I haven’t had time to submit a correction). Being able to name more than 1 database one or more emoji characters is entirely based on how the characters compare to each other, and that is a function of sort weights, which are managed separately. Sort weights do map to code points, but they can change in value depending on the locale/culture being used, as well as the sensitivities selected (i.e. case, accent, etc). The problem you run into with at least most of the version 80 collations (i.e. all Windows collations without a version number in their names, and all SQL Server collations) is simply that they never defined any sort weights for either supplementary code points, or even the surrogate code points used to create surrogate pairs. And with no sort weights defined, their sort value is always 0, and since they are all “0”, they all equate to not only each other, but to anything else with a sort weight of “0”, even an empty string.Starting with the version 90 collations (even without the “_SC” flag in the name), sort weights were added to the surrogate code points (not to supplementary characters) so that they could at least be distinguished from each other in sorting and comparison operations. And this is why:
you were not able to create the 3 monkey DBs using SQL_Latin1_General_CP1_CI_AS
you were able to create the 3 monkey DBs using Latin1_General_100_CI_AS_SC
you can create those 3 monkey DBs using SQL_Latin1_General_
CP850_BIN2, even with that being a SQL Server collation, because binary collations don’t use sort weights, but instead just go by each byte or code unit (depending on BIN or BIN2, respectively)