Finding interesting Japanese words using C#

My awesome wife studies Japanese. Recently she showed me a challenge a friend of ours came up with. But first some basic (and over-simplistic) concepts:

Background

(skip if you know anything about Japanese)

  • Japanese has two kinds of symbols:
    kana – which are more like letters (we don’t really care about these in this post). For example: ね, こ, だ.
    kanji – A group of over 12,000 symbols representing words. For example: 猫.
  • Kanjis are made of smaller parts called radicals. For example, the kanji 災 (disaster) is made of two radicals: 巛 (river) and 火 (fire).
  • Words are made from several kanjis. For example, the word 火山 means volcano and is made from two kanjis: 火 (fire) and 山 (mountain). Kana also play a part in words, but again, they are not the subject here.
  • When it comes to Unicode, each kanji has its own character. A word is a string made of several characters.

Anyway,

The challenge:

Can you think of Japanese words like 妊婦, 爆煙, or 価値 where the left radical is identical in both kanjis?

For example, with the radicals highlighted:

I hardly know any Japanese, but I wondered if I could write a program that can find such words…

Finding the data

I took the data from a Firefox add-on called Rikaichan.

Rikaichan contains information on words, so we’ll start with that:

I downloaded the Japanese-English dictionary, which is a separate add-on, and found that Rikaichan keeps all its words in a SQLite database. Using System.Data.SQLite we can quickly connect to it:

var builder = new SQLiteConnectionStringBuilder();
builder.DataSource = _dictionaryPath;
using (var connection = new SQLiteConnection(builder.ToString()))
{
    connection.Open();
    // ... run queries against the open connection here
}

We can get the schema using select * from sqlite_master;, which reveals just a single table (and two indices):

CREATE TABLE dict (kanji TEXT, kana TEXT, entry TEXT)

We’ve got a list of 236,329 words, their kanji and kana representations, and their English meaning. This is nice, but there is still no data about the radicals…

Kanji structure and radicals

Rikaichan does have that data: when hovering over a kanji it shows a lot more information:

We have three things that are interesting here:

  • One radical on top – Some sort of “main” radical. I’m not sure how Rikaichan crowns that radical.
  • List of all radicals.
  • SKIP pattern – we’ll get to that soon.

Poking a little deeper, we can see this data is in the main add-on, not the English dictionary (which is unexpected, because each kanji has its meaning in English).

There is a file called kanji.dat, with this structure:

災|B86 G5 S7 F976 N1448 V3400 H2206 DK1400 L167 IN1335 E680 P2-3-4 I4d3.3 Yzai1|サイ わざわ.い|||disaster, calamity, woe, curse, evil
灾|B86 S7 P2-3-4 Yzai1|サイ わざわ.い|||calamity, disaster, catastrophe
炁|B86 S8 P2-4-4 Yqi4|キ ケ いき|||breath, air, steam, gas, weather, used in Taoist charms
炅|B86 S8 P2-4-4 Yjiong3 Ygui4|ケイ キョウ エイ ヨウ あらわ.れる|||brilliance

So this is promising. For 災, B86 correlates with “radical 86” from the screenshot, and there’s the English meaning (but we don’t need that). There isn’t any information on the other radicals, and nothing about their positions…
One thing that did stick out is P2-3-4. Looking at the image we can see this is called the SKIP pattern – and it looks less opaque than the other kanji indices. A quick search demystifies it. The first digit can have only four options:

  • 1 – The kanji structure is left-and-right, like 炆, 炒, or 炬.
  • 2 – The kanji structure is top-and-bottom, like 災, 灾, or 粂.
  • 3 – The kanji structure is enclosing-and-contained, like 仄, 兦, or 冃.
  • 4 – The kanji structure is “other”, like 冉, 戍, or 火.
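The four options above translate into a tiny lookup. This is only a sketch: DescribeSkipStructure is a made-up helper name, and it assumes the code is passed without the leading P (so "2-3-4" rather than "P2-3-4"). It uses C# top-level statements to stay short.

```csharp
using System;

// Hypothetical helper: maps the first digit of a SKIP code such as "2-3-4"
// to the structure names listed above. The other two numbers are ignored.
static string DescribeSkipStructure(string skipCode)
{
    string[] parts = skipCode.Split('-');
    if (parts.Length != 3)
        throw new FormatException("Expected three dash-separated numbers: " + skipCode);

    switch (parts[0])
    {
        case "1": return "left-and-right";
        case "2": return "top-and-bottom";
        case "3": return "enclosing-and-contained";
        case "4": return "other";
        default: throw new FormatException("Unknown SKIP structure digit: " + parts[0]);
    }
}

// "2-3-4" is the code kanji.dat lists for 災.
Console.WriteLine(DescribeSkipStructure("2-3-4")); // prints "top-and-bottom"
```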

Great! Now it looks like we have enough data to find our words: We have a list of words, list of kanjis, “main” radical in each kanji, and the general structure of the kanji. It looks like the “main” radical is (mostly) the one on the left, and that’s good enough.

We can parse this file using a small regular expression:

^(?<Kanji>\w|[^|]{2})\|       # \w doesn't match 𠀋 - surrogate pair
B(?<RadicalIndex>\d+)\b
(?=[^|]*\bS(?<StrokeCount>\d+\b))?
(?=[^|]*\bP
    (?<SkipPattern>(?<SkipPatternType>[1-4])-\d+-\d+\b)
)?
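Here is a rough sketch of applying that expression to one of the kanji.dat lines quoted earlier. The pattern is copied as-is; RegexOptions.IgnorePatternWhitespace is my assumption, since the pattern spans several lines and carries a # comment. The variable names are only for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// The pattern from the post, unchanged. IgnorePatternWhitespace (assumed)
// lets it keep its line breaks, indentation and # comment.
var kanjiLine = new Regex(@"
^(?<Kanji>\w|[^|]{2})\|       # \w doesn't match 𠀋 - surrogate pair
B(?<RadicalIndex>\d+)\b
(?=[^|]*\bS(?<StrokeCount>\d+\b))?
(?=[^|]*\bP
    (?<SkipPattern>(?<SkipPatternType>[1-4])-\d+-\d+\b)
)?
", RegexOptions.IgnorePatternWhitespace);

// One of the kanji.dat lines quoted earlier in the post.
string line = "災|B86 G5 S7 F976 N1448 V3400 H2206 DK1400 L167 IN1335 E680 P2-3-4 I4d3.3 Yzai1|サイ わざわ.い|||disaster, calamity, woe, curse, evil";

Match m = kanjiLine.Match(line);
Console.WriteLine("Radical index: " + m.Groups["RadicalIndex"].Value);
Console.WriteLine("Stroke count:  " + m.Groups["StrokeCount"].Value);
Console.WriteLine("SKIP pattern:  " + m.Groups["SkipPattern"].Value);
Console.WriteLine("SKIP type:     " + m.Groups["SkipPatternType"].Value);
```

The two optional lookaheads are what let S and P appear in any order after the B field: each one peeks ahead within the same |-delimited column, captures its value, and consumes nothing.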

Here’s a pie chart with the distribution of the different kanji structures. The majority of them are of the left-and-right kind, so we will probably find a lot of words:

pie chart

With 64% of kanjis in the 1 category, you have to wonder if SKIP pattern is a good way to organize kanjis.

What are the Radicals

We’ve extracted the “main” radical for each kanji, but we are still missing something. Some kanjis are also radicals, and according to our data, they are their own radicals. For example:

Here is the source data from kanji.dat:

巛|B47 S3 V1527 H9 P1-1-2 Ychuan1|セン かわ||まがりがわ|curving river radical (no.47)
川|B47 G1 S3 F181 N1447 V1526 H6 DK1 L127 IN33 E48 P1-1-2 I0a3.2 Ychuan1|セン かわ|か こ さわ|さんぼんがわ|stream, river, river or 3-stroke river radical (no. 47)
[...]
順|B47 G4 S12 F779 N1450 V6619 H18 DK9 L129 IN769 E506 P1-1-11 I9a3.2 Yshun4|ジュン|あや あり おき おさむ しげ したがう とし なお のぶ のり まさ むね もと ゆき よし より||obey, order, turn, right, docility, occasion

This is bad news, because it would introduce false positives – what if there was a word like 順巛? (not a real word, by the way).

Luckily, the Rikaichan developers are really good at naming files, and we can find that data in radicals.dat:

巛	川	まがりがわ	crooked river	侃釧訓慌荒災拶州洲酬巡順疏馴流琉硫剄勁卅廱徑惱旒梳毓獵瑙痙癰碯經緇脛腦臘莖蔬輕輜逕醯錙鑞頸駲鯔
工		たくみ	craft	恐空功巧控攻江紅腔貢項鴻佐嵯左差瑳試式拭尋惰楕築筑虹杢倥儔剄勁啌嗟噐噬墮壽嵳巫弑徑惘扛搓擣杠檮椌槓槎橢汞濤潯熕畭疇痙矼磋穩筮箜籌經縒缸肛脛隋膸莖蕁蛩覡訌誣跫蹉躊軾輕逕隨鑄隱靈鞏頸髓鵐
己	已巳	おのれ	snake	改鞄巻忌紀記起倦圏捲巷港撰選遷巽巴配妃包庖抱泡砲胞飽僊囘匏咆垉惓杞枹炮煕熈爬疱皰祀綣苞萢蚫蜷袍鉋雹靤韆饌髱鮑麭熙
巾		はば	cloth	柿希帰稀錦策刷刺姉市師獅常飾帥制製席匝掃帯滞凧帖帳吊帝締諦蹄逓肺幡帆婦布怖幅幣弊蔽瞥帽幌幕棉綿佩冪唏啻啼嫦帋帚帙帑帛帶帷幄幃幀幎幗幔幟幢幤幇掣敝斃旆晞暼柬棘棗楴楝欷歸沛滯珮箍箒篩緜羃菷蒂蓆蔕乕蟐衞閙霈鬧鯑鰤
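A minimal sketch of turning those lines into a set of radicals. The column layout is my guess from the sample rather than anything documented: tab-separated, with the radical first, alternate forms second (possibly empty), then a reading, a meaning, and the kanji that contain the radical. ReadRadicals is a hypothetical name, and the sample lines have their kanji lists shortened.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Collects every radical and alternate form from radicals.dat-style lines.
// Assumed columns: radical, alternate forms, reading, meaning, kanji list.
static HashSet<string> ReadRadicals(IEnumerable<string> lines)
{
    var result = new HashSet<string>();
    foreach (string line in lines)
    {
        string[] fields = line.Split('\t');
        if (fields.Length < 2) continue; // skip blank or malformed lines

        result.Add(fields[0]);

        // The second column may hold several forms back to back (e.g. 已巳).
        // Walk it by text element so a surrogate pair would stay in one piece.
        TextElementEnumerator forms = StringInfo.GetTextElementEnumerator(fields[1]);
        while (forms.MoveNext())
            result.Add(forms.GetTextElement());
    }
    return result;
}

// Two of the lines shown above, with the kanji lists shortened.
var sample = new[]
{
    "巛\t川\tまがりがわ\tcrooked river\t侃釧訓慌荒災",
    "工\t\tたくみ\tcraft\t恐空功巧控攻",
};

HashSet<string> radicals = ReadRadicals(sample);
Console.WriteLine(radicals.Contains("川")); // True: listed as a form of 巛
Console.WriteLine(radicals.Contains("順")); // False: only appears in a kanji list
```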

Surrogate Pairs

As seen above in the regular expression, it is not trivial to split a string into Unicode characters. You cannot assume one Char is one Unicode character. For example, "𠀋" is a surrogate pair, made of two Chars.

It is simple enough to split a string into text elements, which keeps each surrogate pair together:

public static IEnumerable<string> SplitBySurrogatePairs(this string str)
{
    var enumerator = StringInfo.GetTextElementEnumerator(str);
    while (enumerator.MoveNext())
    {
        yield return enumerator.Current as string;
    }
}
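To see the difference in numbers, here is a small standalone check. It is a sketch that re-implements the same idea as a plain function (SplitIntoTextElements is just an illustrative name), and it writes 𠀋 as the escape \U0002000B so the sample does not depend on how the source file is encoded.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Same approach as the extension method above, returning a list for easy counting.
static List<string> SplitIntoTextElements(string str)
{
    var result = new List<string>();
    TextElementEnumerator enumerator = StringInfo.GetTextElementEnumerator(str);
    while (enumerator.MoveNext())
        result.Add(enumerator.GetTextElement());
    return result;
}

// 𠀋 (U+2000B) followed by 火山.
string word = "\U0002000B" + "火山";

Console.WriteLine(word.Length);                       // 4: Length counts UTF-16 Chars
Console.WriteLine(char.IsSurrogatePair(word, 0));     // True
Console.WriteLine(SplitIntoTextElements(word).Count); // 3: one element per kanji
```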

Final code

We can finally write the query to find our words:

var w = from word in _words.Words
        let chars = word.Characters
        where chars.Count >= 2
        where chars.All(_kanji.IsKanji)
        let kanjis = chars.Select(_kanji.GetKanji)
        where kanjis.All(k => k.SkipPatternType == SkipPatternType.LeftRight)
        where kanjis.All(k => !_radicals.IsRadical(k.Kanji))
        let firstKanjiRadicalIndex = kanjis.First().MainRadicalIndex
        where kanjis.All(k => k.MainRadicalIndex == firstKanjiRadicalIndex)
        orderby kanjis.Count() descending
        select word;
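To make the filters easier to follow, here is a cut-down version of the same query over a few hand-entered records. It is an illustration only, not the code from the repository: the dictionary and set stand in for _kanji and _radicals, structure 1 means left-and-right, and the radical indices for 妊 and 婦 are values I filled in myself (the other values come from the kanji.dat lines and SKIP examples quoted in this post).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Stand-in for kanji.dat: main radical index and first SKIP digit per kanji.
var kanji = new Dictionary<string, (int Radical, int Structure)>
{
    ["妊"] = (38, 1), // assumed values, not quoted in the post
    ["婦"] = (38, 1), // assumed values, not quoted in the post
    ["順"] = (47, 1),
    ["巛"] = (47, 1),
    ["火"] = (86, 4),
    ["災"] = (86, 2),
};

// Stand-in for radicals.dat.
var radicals = new HashSet<string> { "巛", "川" };

// 妊婦 is one of the challenge words, 順巛 is the made-up false positive,
// 火災 shares a main radical but is not left-and-right.
var words = new[] { "妊婦", "順巛", "火災" };

var found = (from word in words
             // Splitting by Char is fine here: the sample has no surrogate pairs.
             let chars = word.Select(c => c.ToString()).ToList()
             where chars.Count >= 2
             where chars.All(kanji.ContainsKey)
             let infos = chars.Select(c => kanji[c]).ToList()
             where infos.All(k => k.Structure == 1)
             where chars.All(c => !radicals.Contains(c))
             let firstRadical = infos[0].Radical
             where infos.All(k => k.Radical == firstRadical)
             select word).ToList();

Console.WriteLine(found.Count); // 1: only 妊婦 passes every filter
```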

Result

We’ve found 1364 such words. Naturally, some of them are quite nice:

  • 喋喋喃喃 – holding an intimate, long-winded conversation in whispers
  • 津津浦浦 – all over the country; every nook and cranny of the land; throughout the land.
  • 流汗淋漓 – profuse perspiration, dripping with sweat.
  • 経緯線網 – graticule
  • 縷縷綿綿 – going on and on in tedious detail
  • 蚯蚓蜥蜴 – worm lizard, amphisbaenian

Source Code

You can see the whole code on GitHub: https://github.com/kobi/JapaneseWords

.Net Regular Expressions – Finding Decimal Numbers that are Divisible by Three

It’s very easy to check whether a decimal number is divisible by three using a simple DFA with 3 states.
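For reference, that DFA fits in a few lines of code. This is a sketch with a made-up function name; the state is simply the value of the digits read so far, modulo 3.

```csharp
using System;

// Three states: 0, 1 and 2. State 0 is both the start state and the accepting state.
static bool IsDivisibleByThree(string digits)
{
    if (digits.Length == 0) return false;

    int state = 0;
    foreach (char c in digits)
    {
        if (c < '0' || c > '9') return false; // not a decimal number

        // 10 is 1 (mod 3), so appending digit d moves state s to (s + d) mod 3.
        state = (state + (c - '0')) % 3;
    }
    return state == 0;
}

Console.WriteLine(IsDivisibleByThree("111222")); // True
Console.WriteLine(IsDivisibleByThree("1000"));   // False
```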

A regex, therefore, is possible, but not too pretty (source):

(?:[0369]|
[147](?:[0369]*[147][0369]*[258])*(?:[0369]*[258]|[0369]*[147][0369]*[147])|
[258](?:[0369]*[258][0369]*[147])*(?:[0369]*[147]|[0369]*[258][0369]*[258])
)*

Example: http://www.rubular.com/r/ZcRDblHg8M

Here’s another approach, using .Net’s stacks as a simple counter:

\b
(?>             # No regrets - don't backtrack on if/else decisions.
    [0369]      # = 0 (mod 3)
    |
    [147]       # = 1 (mod 3)
    (?:         # if possible pop 2, else push 1
        (?<-Sum>){2}|(?<Sum>)
    )
    |
    [258]       # = 2 (mod 3)
    (?:         # if possible pop 1, else push 2
        (?<-Sum>)|(?<Sum>){2}
    )
)+
\b
(?(Sum)(?!)) # Assert nothing's left in the stack

Why? Well, I was bored while I shaved. Luckily, this regex is simple enough for Mono. Working example: http://ideone.com/Yp6Ti (ok, maybe not, Mono fails to match 111222)
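If you want to try the stack-based pattern on Microsoft's .NET, a small harness along these lines should do. It is a sketch: the pattern is copied verbatim, RegexOptions.IgnorePatternWhitespace is assumed because of the layout and comments, and the sample numbers are arbitrary. The regex result is printed next to plain arithmetic so the two can be compared.

```csharp
using System;
using System.Text.RegularExpressions;

var divisibleByThree = new Regex(@"
\b
(?>             # No regrets - don't backtrack on if/else decisions.
    [0369]      # = 0 (mod 3)
    |
    [147]       # = 1 (mod 3)
    (?:         # if possible pop 2, else push 1
        (?<-Sum>){2}|(?<Sum>)
    )
    |
    [258]       # = 2 (mod 3)
    (?:         # if possible pop 1, else push 2
        (?<-Sum>)|(?<Sum>){2}
    )
)+
\b
(?(Sum)(?!)) # Assert nothing's left in the stack
", RegexOptions.IgnorePatternWhitespace);

foreach (string n in new[] { "12", "13", "111222", "1000" })
{
    bool byRegex = divisibleByThree.IsMatch(n);
    bool byArithmetic = int.Parse(n) % 3 == 0;
    Console.WriteLine(n + ": regex " + byRegex + ", arithmetic " + byArithmetic);
}
```

The depth of the Sum stack plays the role of the DFA state: it is always 0, 1 or 2, and the final conditional only lets the match through when it is back at 0.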

.Net Regular Expressions – Using the Stack State to Understand Numeric Values

It is common knowledge that regular expressions should handle text and not values. A recent Stack Overflow question got me thinking though – is it possible to use .Net regular expressions to understand numbers or other values while matching the pattern?
Regular expressions can be used to perform numerical tasks, but that is usually when working in unary base.
It turns out this is possible – .Net keeps a stack for every capture of every matched group while matching the pattern, and that state is available for use while matching. The idea is simple: we can represent numbers as depth of the stack, so the number 0 is an empty stack, 6 is a stack with 6 captures, and so forth.

(?>
    (?=[0-9])   # optimization - don't multiply when we don't have a digit.
    # multiply the content of the stack by 10
    # for each item on Stack, push 10 items to a Temp stack.
    (?(Decimal)
        (?<-Decimal>
            (?<Temp>){10}
        )
    ){100000}
    (?(Decimal)(?!))
    # Push all items from Temp back to Stack
    (?(Temp)
        (?<-Temp>
            (?<Decimal>)
        )
    ){100000}
    (?(Temp)(?!))
    # match a digit, and push its value to the stack
    (?:
        0                 |
        1 (?<Decimal>)    |
        2 (?<Decimal>){2} |
        3 (?<Decimal>){3} |
        4 (?<Decimal>){4} |
        5 (?<Decimal>){5} |
        6 (?<Decimal>){6} |
        7 (?<Decimal>){7} |
        8 (?<Decimal>){8} |
        9 (?<Decimal>){9}
    )
)+

The idea is very simple: when we see a new digit, we multiply the depth of the stack by 10, and add the number represented by the new digit. The value of the number can be verified using:

match.Groups["Decimal"].Captures.Count
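Put together, a check could look roughly like this. It is a sketch: the pattern is the one above, copied verbatim and compiled with RegexOptions.IgnorePatternWhitespace (assumed, because of the comments and layout), and StackValue is a made-up helper name. Keep the inputs small, since every digit runs both 100000-iteration loops.

```csharp
using System;
using System.Text.RegularExpressions;

var decimalValue = new Regex(@"
(?>
    (?=[0-9])   # optimization - don't multiply when we don't have a digit.
    # multiply the content of the stack by 10
    # for each item on Stack, push 10 items to a Temp stack.
    (?(Decimal)
        (?<-Decimal>
            (?<Temp>){10}
        )
    ){100000}
    (?(Decimal)(?!))
    # Push all items from Temp back to Stack
    (?(Temp)
        (?<-Temp>
            (?<Decimal>)
        )
    ){100000}
    (?(Temp)(?!))
    # match a digit, and push its value to the stack
    (?:
        0                 |
        1 (?<Decimal>)    |
        2 (?<Decimal>){2} |
        3 (?<Decimal>){3} |
        4 (?<Decimal>){4} |
        5 (?<Decimal>){5} |
        6 (?<Decimal>){6} |
        7 (?<Decimal>){7} |
        8 (?<Decimal>){8} |
        9 (?<Decimal>){9}
    )
)+
", RegexOptions.IgnorePatternWhitespace);

// Hypothetical helper: the depth of the Decimal stack after matching, or -1 if no match.
int StackValue(string digits)
{
    Match match = decimalValue.Match(digits);
    return match.Success ? match.Groups["Decimal"].Captures.Count : -1;
}

Console.WriteLine(StackValue("42")); // the stack depth should equal the number itself
```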

A curious bit here is the use of the loop to copy stacks:

(?(Temp)
    (?<-Temp>
        (?<Decimal>)
    )
){100000}

I’d expect this to be enough:

(?<-Temp> (?<Decimal>) )*
(?(Temp)(?!))

It turns out the above loop is only executed once, and the condition always fails. It is probably a documented optimization, I’ll look more into that later. As a proof of concept, the workaround should do.

It is even possible to perform basic arithmetic operations on these stacks such as adding, subtracting, multiplying and such from within the regex engine, but that may be a few extra steps too many.
It should go without saying, of course, that regex isn’t a good option here – this is for recreational use. The run time and complexity are far from ideal.


.Net Regular Expressions – Finding Acronyms, and Reversing the Stack

A recent Stack Overflow question asked if you could (not should) use regular expressions to find acronyms, specifically of the form “Original Poster (OP)” – words followed by the acronym in parentheses.

Well, my first try was this:

\b((?<Acronym>\w)\w*\W+)+
\((?<-Acronym>\k<Acronym>)+\)
(?(Acronym)(?!))

Seems simple – the first line captures the words and pushes each first letter to the stack. The second line pops and matches it, and the last line makes sure there aren’t any extra letters. Seems nice, but wrong. The first letter on the stack in this case comes from the last word, so it matches a reversed acronym – “Oops, Wrong (WO)”.

What I had to do is to reverse the stack. I came up with this regex:

\b((?<Acronym>\w)\w*\W+)+
(?<=(?<-Acronym>.(?=.*?(?<Reverse>\k<Acronym>)))+)(?(Acronym)(?!))
\((?<-Reverse>\k<Reverse>)+\)
(?(Reverse)(?!))

Now, I’m not sure that’s the best way, but it works nicely. The second line is the only thing interesting – I won’t explain it too much, because nobody is reading it. Basically, I match every letter on the stack, and push it to a second stack. I match a dot for each letter because the engine has trouble matching a zero-width expression multiple times (though it works with {5}, for example, but not + or {1,5} – it only tries one). I can match backwards because I know I had at least that many letters, and can look forward because I’m optimistic – I expect to match these letters later, so if they aren’t there, I might as well fail now.
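For anyone who wants to poke at it, here is a minimal harness. It is a sketch: the pattern is copied from above, I am assuming it is compiled with RegexOptions.IgnorePatternWhitespace because it is written across several lines, and the two inputs are the examples already mentioned in this post.

```csharp
using System;
using System.Text.RegularExpressions;

// The stack-reversing pattern from above, unchanged.
var acronym = new Regex(@"
\b((?<Acronym>\w)\w*\W+)+
(?<=(?<-Acronym>.(?=.*?(?<Reverse>\k<Acronym>)))+)(?(Acronym)(?!))
\((?<-Reverse>\k<Reverse>)+\)
(?(Reverse)(?!))
", RegexOptions.IgnorePatternWhitespace);

foreach (string text in new[] { "Original Poster (OP)", "Oops, Wrong (WO)" })
{
    Match match = acronym.Match(text);
    Console.WriteLine(text + " -> " + (match.Success ? "matched: " + match.Value : "no match"));
}
```

The second input is the reversed acronym that fooled the first attempt, so it makes a handy negative test.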

Source Code

Source code and test cases can be found on GitHub: https://github.com/kobi/RecreationalRegex

SharePoint 2010 – Using Document Id to Link to a Specific Version

SharePoint 2010 introduces Document Id, which is an easy way to create document permalinks across a SharePoint Site (SPSite), without worrying about changing names and folders.
This is all nice and well, but what if you’ve enabled versioning, and want to link to a specific version?
It is a shame DocIdRedir.aspx does not accept a version as an optional parameter, but a small shame, as it is easy enough to implement your own handler, using the API method DocumentId.FindUrlById. Alternatively, if you don’t approve of the way FindUrlById does that for you, you could do it yourself: use DocumentId.FindUrlsById, find the appropriate SPListItem, and find the version URL.
My solution is to create another page, much like DocIdRedir.aspx, and have it accept a version and act accordingly (in this case I’ve created an ASHX handler, but it should be just the same):

public void ProcessRequest(HttpContext context)
{
	string docId = context.Request.QueryString["id"];
	string versionLabel = context.Request.QueryString["v"];
	SPSite currentSite = SPContext.Current.Site;
	string url = null;
	try
	{
		url = GetUrlFromID(currentSite, docId, versionLabel);
	}
	catch (Exception ex)
	{
		string message = String.Format("Error finding document by id {0} and version [{1}], {2}"
			, docId, versionLabel, ex.Message);
		SPUtility.TransferToErrorPage(message);
	}

	if (url == null)
	{
		string message = String.Format("Could not find document with id {0} and version [{1}]"
			, docId, versionLabel);
		SPUtility.TransferToErrorPage(message);
	}
	else
	{
		context.Response.Write(url);
	}
}

private string GetUrlFromID(SPSite site, string docId, string version)
{
	//FindUrlById throws an exception if version is empty or null,
	// so I check it here to make sure it works.
	if (String.IsNullOrEmpty(version))
	{
		string[] urls = DocumentId.FindUrlsById(site, docId);
		return urls.Single(); // think about what you do here
	}
	else
	{
		return DocumentId.FindUrlById(site, docId, version);
	}
}

After creating and deploying your package, which is effortless in Visual Studio 2010, instead of using the default permalinks:

/_layouts/DocIdRedir.aspx?ID=GROOVY-10-2

you can use your own handler (you may want a shorter solution name and handler):

/_layouts/mySolution/VersionedDocumentId.ashx?id=GROOVY-10-2&v=2.5
