Wouldn’t it be fun to analyse some of the data that this project is generating? Today’s new thing is completely different from the preceding 33 days of new things: implementing some fun text analysis code in F#.
Some interesting post data
So far, the total number of unique words used across the 33 posts preceding this one is 2492. The average number of unique words per post is 168.2, and the average post length is 281 words. Taking those two averages at face value, each distinct word shows up about 1.7 times per post (281 / 168.2), so call it twice; the exact figure depends on which definition of stop words I settle on. Which is pretty exciting.
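For the curious, stats like these boil down to a few lines of F#. This is a minimal sketch, not the code behind the numbers above: the two sample posts are made up, the `words` helper is a hypothetical name, and splitting on spaces with no stop-word filtering is an assumption.

```fsharp
open System

// Hypothetical sample data standing in for the real posts.
let posts = [ "the cat sat on the mat"; "the dog sat" ]

// Assumed tokeniser: lower-case, then split on spaces and drop empty entries.
let words (text: string) : string[] =
    text.ToLowerInvariant().Split([| ' ' |], StringSplitOptions.RemoveEmptyEntries)

// Unique words across all posts.
let totalUnique =
    posts |> Seq.collect words |> Seq.distinct |> Seq.length

// Average number of unique words per post.
let avgUnique =
    posts |> List.averageBy (fun p -> words p |> Array.distinct |> Array.length |> float)

// Average post length in words.
let avgLength =
    posts |> List.averageBy (fun p -> words p |> Array.length |> float)

// How often each distinct word is repeated within a post, on average.
let repeatsPerWord = avgLength / avgUnique

printfn "unique=%d avgUnique=%.1f avgLength=%.1f repeats=%.3f"
    totalUnique avgUnique avgLength repeatsPerWord
// prints: unique=6 avgUnique=4.0 avgLength=4.5 repeats=1.125
```

Swapping in a stop-word filter would only change `words`, which is why the "read every word twice" figure moves around depending on the definition.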
Something more interesting about the post data than that
The really new part today was diving back into F# to implement some text similarity analysis. There’s nothing particularly interesting to report as yet, partly because I don’t have enough data points to extract much meaning. And, as you can judge by the simple stats above, I needed to get my F# skills sharpened again. I also burned through some time trying to turn the WordPress RSS feed into something usable for analysis purposes.
Today’s main stat compares the posts to each other for similarity, as measured by the Jaccard distance between them. It’s fairly naive, but it’s the first time I’ve implemented calculations in an F# 4.4 assembly that’s called from a C# app, which in turn pulls the data down from WordPress.
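If you haven’t met it before, the Jaccard distance between two sets is one minus the size of their intersection divided by the size of their union: 0 means identical word sets, 1 means no words in common. Here is a minimal sketch of the idea in F#. It is an illustration only, not the code behind this post: the names `tokenize` and `jaccardDistance`, the sample sentences, and the choice to lower-case the text and treat each run of letters as a word are all assumptions.

```fsharp
open System.Text.RegularExpressions

// Assumed tokeniser: lower-case the text and keep each run of letters as a word.
// Note that this splits "F#" into "f" and "don't" into "don" and "t".
let tokenize (text: string) : Set<string> =
    Regex.Matches(text.ToLowerInvariant(), @"\p{L}+")
    |> Seq.cast<Match>
    |> Seq.map (fun m -> m.Value)
    |> Set.ofSeq

// Jaccard distance = 1 - |A ∩ B| / |A ∪ B|.
// Two empty sets are treated as identical (distance 0) to avoid dividing by zero.
let jaccardDistance (a: Set<string>) (b: Set<string>) : float =
    let unionSize = Set.union a b |> Set.count
    if unionSize = 0 then
        0.0
    else
        let intersectionSize = Set.intersect a b |> Set.count
        1.0 - float intersectionSize / float unionSize

// Hypothetical sample posts.
let postA = tokenize "Today I wrote some F# code"
let postB = tokenize "Today I wrote a blog post"

printfn "%.3f" (jaccardDistance postA postB)
// prints: 0.667 (3 shared words out of 9 distinct words)
```

Because it only looks at which words appear, not how often or in what order, the measure is as naive as advertised, but it is cheap to compute for every pair of posts.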
The stand-out post
I was surprised to find that one of the posts pops out as being ‘odd’. Reading back through it, that’s not so surprising. Interestingly, it’s also one of the most-read posts on the blog so far. The post it is most similar to is an unlikely candidate, so it’s all probably just noise.
I’m not sharing my code just yet – I need to clean it up first, as I only spent slightly more than an hour on the whole thing and it’s a mess. It’ll be on GitHub when I get around to it.