I had an itch, so I scratched it.
I had been pondering which xml:lang value to use for UEBC, a proposed Universal English braille code: language en, but without a locale. Eventually I joined the W3C I18N interest list and asked. Silly thing to do! As is often the case, I was out of date. rfc4646.txt was pointed to as the up-to-date version. Since I'd need to test against it, it was even nicer to find a pointer to this, which proposes test cases. At this link, I found a proposed regex from Mark Davis.
I looked at that regex for long enough before trying to break it down into its component pieces. I started in Python, then moved over to netbeans (the new 5.5 version - very nice) and started testing. After most of the night I finally arrived at testing for a full language specification. If you take a look at the test cases, you'll see it can get rather messy! It certainly stretched my regex knowledge.
Having been sold on TDD for a little while, I naturally started to build tests. A little tedious, but it does keep the confidence high each time you move on a step. I now have 82 part tests, which exercise the sub-elements of the full beast. I've made a start on the fuller test list, though it's likely to be flaky. Any suggested improvements, please let me know. It's zipped up, ant build file and test code, here. Enjoy.
Further debate puts common sense onto the formal words. The key part is that the 'grandfathered' (lovely term!) stuff is being replaced by 'irregular' tags. New references are: IANA, which provides the semantics. The update to rfc4646 is currently at inter-locale.com, but note that the suffix of 01 will change over time.
Some notes on language specification, based on comments on the I18N <email@example.com> list, mainly from John Cowan.
Grandfathered. I failed to understand this. John explained:
"Grandfathered" is a semantic concept (the meaning of the tag cannot be deduced from its parts); "irregular" a syntactic one (the tag cannot be parsed into parts using the regular parsing algorithms). All irregular tags are grandfathered, but not all grandfathered tags are irregular. Unfortunately this distinction was not clarified until after 4646 was published.
Okay, let me unpack that a bit.
Most 4646 language tags follow the general pattern of language-script-region-variant, with all but the first part optional. The ABNF makes it possible to (a) recognize a well-formed tag and (b) take it apart into the four components. Then you can look in the Language Subtag Registry at IANA to find out what the various subtags mean.
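As a rough illustration of "recognize and take apart", here is a minimal Python sketch of the language-script-region-variant pattern. It is a deliberate simplification and not Mark Davis's regex: the subtag lengths follow my reading of the RFC 4646 ABNF, while extlang, extension and private-use subtags are left out. The names LANGTAG and split_tag are made up for this example.

```python
import re

# Simplified sketch of language[-script][-region][-variant...].
# Assumption: extlang, extension and private-use subtags are omitted,
# so this accepts only a subset of the full RFC 4646 grammar.
LANGTAG = re.compile(
    r"""
    (?P<language>[a-z]{2,8})                    # 2 to 8 letters
    (?:-(?P<script>[a-z]{4}))?                  # 4 letters
    (?:-(?P<region>[a-z]{2}|[0-9]{3}))?         # 2 letters or 3 digits
    (?P<variants>
        (?:-(?:[a-z0-9]{5,8}|[0-9][a-z0-9]{3}))*   # 5 to 8 alphanumerics, or digit + 3
    )
    """,
    re.IGNORECASE | re.VERBOSE,
)


def split_tag(tag):
    """Return the components of a regular tag, or None if it does not match."""
    m = LANGTAG.fullmatch(tag)
    if m is None:
        return None
    return {
        "language": m.group("language"),
        "script": m.group("script"),
        "region": m.group("region"),
        "variants": [v for v in m.group("variants").split("-") if v],
    }


print(split_tag("zh-Hant-TW"))
print(split_tag("de-1996"))
print(split_tag("i-hak"))
```

A tag such as "sgn-US" splits cleanly here, which is exactly the trap described next: the parse succeeds, but the registry is still needed to learn what the tag means.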
There are some exceptions, however, based on tags that were registered before we adopted these rules. For example, "sgn" means "sign languages" and "US" means "in the United States", but "sgn-us" does not mean "any sign language used in the United States", it means the specific sign language called "American Sign Language". A tag like this has the regular form, but its meaning is grandfathered. You can recognize a tag like this using the ABNF, but if you try to understand its meaning piece by piece, you get the wrong answer. All such tags are listed in the Language Subtag Registry.
Furthermore, some of the grandfathered tags are also irregular: they don't match the language-script-region-variant pattern at all, and you cannot take them apart. "i-hak" is an example of this: it means "Hakka Chinese". These tags are also listed in the Language Subtag Registry.
You should ignore the "grandfathered" production in the ABNF altogether. It will be replaced in the next RFC (temporarily called "RFC 4646bis") with the following production:
irregular = "en-GB-oed" / "i-ami" / "i-bnn" / "i-default"
/ "i-enochian" / "i-hak" / "i-klingon" / "i-lux"
/ "i-mingo" / "i-navajo" / "i-pwn" / "i-tao"
/ "i-tay" / "i-tsu" / "sgn-BE-fr" / "sgn-BE-nl"
/ "sgn-CH-de"
Your code should simply check if a tag is case-insensitively equal to any of those 17 strings, and if so, declare it well-formed without further investigation. This list is permanently fixed, so it is safe to freeze it into code.
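John's advice translates into very little code. The following is a minimal Python sketch, with names (IRREGULAR, is_irregular) invented for illustration. The set holds the 17 tags of the irregular production as it was later published in RFC 5646, which includes "sgn-CH-de".

```python
# The closed list of irregular tags, stored lower-cased for comparison.
IRREGULAR = frozenset(
    tag.lower()
    for tag in (
        "en-GB-oed", "i-ami", "i-bnn", "i-default",
        "i-enochian", "i-hak", "i-klingon", "i-lux",
        "i-mingo", "i-navajo", "i-pwn", "i-tao",
        "i-tay", "i-tsu", "sgn-BE-fr", "sgn-BE-nl",
        "sgn-CH-de",
    )
)


def is_irregular(tag):
    """True if tag equals one of the irregular tags, ignoring case."""
    return tag.lower() in IRREGULAR


print(is_irregular("I-Klingon"))
print(is_irregular("en-GB"))
```

A well-formedness checker would presumably try this lookup first and only fall back to the regular parsing rules when it fails.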
Addison Phillips responded with:
Grandfathered tags are those registered under RFC 1766 or RFC 3066 whose subtags are not all in the subtag registry. There are two classes of grandfathered tags:
1. Tags which are "well-formed" but for which some subtags are not registered.
2. Tags which are not well-formed and which are only valid as grandfathered tags (which are called "irregular").
If your processor does validation, you need the complete list of grandfathered tags. If your processor only does well-formedness checking, you only need the list of irregular tags (as above).
This latter list, please note, is closed, as in it will never change.
Please note that there is test data in some of the LTRU WG archives which you would do well to try.
For the curious, the IETF manages a part of this standardisation work. Go see ietf, which is the Working Group home page.