Day 34: Use F# to analyze the post data

Wouldn’t it be fun to analyse some of the data that this project is generating? Today’s new thing is completely different from those of the preceding 33 days: implementing some fun text analysis code in F#.

Some interesting post data

So far, the 33 posts preceding this one use a total of 2492 unique words. The average number of unique words per post is 168.2, and the average post length is 281 words. If I settle on a definition of stop words that I like, it turns out you get to read every word only twice per post on average. Which is pretty exciting.
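For the curious, stats like these could be computed along the following lines. This is only a minimal sketch, not the code behind the numbers above: the names `tokenize`, `totalUniqueWords`, `averageUniqueWords` and `averageLength` are made up for illustration, posts are assumed to arrive as plain strings, and no stop words are filtered out.

```fsharp
open System.Text.RegularExpressions

// Hypothetical helper: lower-case the text and keep runs of letters and digits as words.
let tokenize (text: string) : string list =
    Regex.Matches(text.ToLowerInvariant(), "[a-z0-9]+")
    |> Seq.cast<Match>
    |> Seq.map (fun m -> m.Value)
    |> Seq.toList

// Number of distinct words across all posts taken together.
let totalUniqueWords (posts: string list) : int =
    posts |> List.collect tokenize |> List.distinct |> List.length

// Mean number of distinct words per post (List.averageBy throws on an empty list).
let averageUniqueWords (posts: string list) : float =
    posts |> List.averageBy (fun p -> tokenize p |> List.distinct |> List.length |> float)

// Mean post length in words, counting repeats.
let averageLength (posts: string list) : float =
    posts |> List.averageBy (fun p -> tokenize p |> List.length |> float)

// Toy data, not real posts.
let samplePosts = [ "the cat sat on the mat"; "the dog sat" ]
printfn "%d" (totalUniqueWords samplePosts)     // 6
printfn "%.1f" (averageUniqueWords samplePosts) // 4.0
printfn "%.1f" (averageLength samplePosts)      // 4.5
```

Dividing the average length by the average number of unique words gives a rough figure for how often each word is repeated within a post.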

Something more interesting about the post data than that

The really new part today was diving back into F# to implement some text similarity analysis. There’s nothing particularly interesting yet, partly because I don’t have enough data points to extract much meaning. And, as you can judge from the simple stats in the first paragraph, I needed to get my F# skills sharpened again. I also burned through some time trying to turn the WordPress RSS feed into something usable for analysis purposes.

Today’s main stat compares the posts to each other for similarity, as measured by the Jaccard distance between them. It’s fairly naive, but for the first time I’ve implemented calculations in an F# 4.4 assembly that’s called from a C# app that pulls the data down from WordPress.
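As a rough illustration of the idea (not the actual code, which isn’t shared yet): the Jaccard distance between two sets of words is one minus the size of their intersection divided by the size of their union. The sketch below assumes each post is reduced to its set of distinct lower-cased words; `wordSet` and `jaccardDistance` are hypothetical names, and treating two empty posts as distance 0 is a choice made here, not something from the original.

```fsharp
open System

// Hypothetical helper: split a post on whitespace and common punctuation
// and keep the distinct lower-cased words.
let wordSet (text: string) : Set<string> =
    text.ToLowerInvariant().Split(
        [| ' '; '\t'; '\r'; '\n'; '.'; ','; ';'; ':'; '!'; '?' |],
        StringSplitOptions.RemoveEmptyEntries)
    |> Set.ofArray

// Jaccard distance: 0.0 for identical word sets, 1.0 for sets with no words in common.
let jaccardDistance (a: Set<string>) (b: Set<string>) : float =
    let unionSize = Set.union a b |> Set.count
    if unionSize = 0 then
        0.0 // assumption: two empty posts count as identical
    else
        let intersectionSize = Set.intersect a b |> Set.count
        1.0 - float intersectionSize / float unionSize

// Toy example: 2 shared words (the, sat) out of 4 distinct words overall.
printfn "%.2f" (jaccardDistance (wordSet "the cat sat") (wordSet "the dog sat")) // 0.50
```

Because only the presence of a word counts and not how often it appears, this is about as naive as a similarity measure gets, which is part of why the results need a pinch of salt.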

The stand-out post

I was surprised to find that one of the posts pops out as being ‘odd’. Reading back through it, that’s not so surprising. Interestingly, it’s also one of the most-read posts on the blog so far. The post it is most similar to is an unlikely candidate, so it’s all probably just noise.

I’m not sharing any code just yet – I need to clean it up, as I only spent slightly more than an hour on the whole thing and it’s a mess. It’ll be on GitHub when I get around to it.

Day 6: A double header: health tonic and a parser that parses first time

Purdey's Edge

Today’s first new thing is a bit lazy. It’s winter and everyone is a bit under the weather after the festive days. What better cure than a health tonic? Or as the bottle says: Elixir vitae.
Yes, it’s Purdey’s Edge multivitamin health drink. I was fully expecting it to be disgusting. It turns out to be quite tasty. I might try it again.

Perhaps suitably reinvigorated by the tonic, I also remembered a second new thing for the day. As you may or may not know, I work as a software engineer at Microsoft. This being a fairly quiet week, I decided to pick up one of those side projects that always seem to remain unfinished. In the process, I had to write a parser that needs to deal with a fairly funky input format. Usually debugging takes a while, but today the tests all passed first time. I don’t think that has ever happened to me before. And yes, there’s more than one test. So either I’ve been writing too many parsers or, more likely, quiet, uninterrupted coding time leads to fewer errors!