Perfect Data Poppycock – Part 4 of the Powerful Productivity Series

I wasn’t planning on this being part 4 of this blog series.  No, the plan was to do an end-to-end scenario using real data and a real business use case.  Instead, you’re getting a rant that I’m dressing up as a blog post.  The rant of a man who spent his Saturday afternoon doing what’s the hardest part of any BI project – actually getting the data and making sure it is clean and correct.

There’s been a number of new announcements recently around new analytics solutions from companies like Amazon, SAP and Tableau.  The announcements all had a similar theme – we’ve got a snazzy new product that will help people to easily explore, visualize and analyze their data.  And to show you how ground-breaking it really is, they give you a chance to try it using one of their pre-loaded sample datasets.  With just a few quick clicks, you too can make a bar chart that shows you how much ticket sales were for Jurassic World.  Or better yet, how many third class passengers died on the Titanic!

Yes, the best friend of BI pre-sales personnel everywhere, these perfect public datasets do a fantastic job helping tell the story you want customers to hear – just take your perfect dataset (preferably a single table on a spreadsheet), load it up and watch the magic happen.  And everyone uses these datasets – yes, including here at Contoso Microsoft.  It’s just the reality of doing demos – privacy concerns make using real company data very tricky, you don’t want the demo to go wrong, etc.  So like the latest Macklemore video, these demos look great at first glance, but trying to find any actual meaning around the information being presented is probably wasted energy.

That was why I wanted to finish off my perfect productivity series with a grand finale – use real data and a real business scenario that showcased the value of Power BI and Office 365.  And I had an excellent one – my wife runs an online business I help her out with from time to time.  She’s had some moderate success with it, but she’s never really used the data in any meaningful way, other than the normal tax stuff one has to do.  She suggested I look at the data to see if there were some things she could be doing to make more money at it.  Jackpot!

So yesterday, I sat down to dig into the data – surely, this was going to be a piece of cake.  All the data was in a central location, I knew the business, etc.  That’s when I ran into a data access problem.  The site she sells the items on only lets you get the records for the current month and the past three calendar months.  Since she had sales going back to October 2013, that wasn’t going to help.  I then tried pulling the data from a site called Terapeak, which I love to play with because it has historic sales data, but they too limit the number of records you can download.  Gee, what a surprise, central IT was preventing me from getting the data I needed to do my job . . .

After I snapped out of my shadow IT flashback, I did what I used to do in these situations.  I figured out how to work around the data access limitations.  I discovered the data was also available from PayPal, and those you could run for a custom time period.  Woohoo!  So I ran the report, downloaded the file, and opened it up in Excel to review the data before I got started.  That’s when the “fun” began.

The first problem was PayPal had two records for each item someone bought, one that reflected the item when it was in the shopping cart, the other when it was it actually paid for.  It looked like this –


Then I noticed there was stuff in the download that had nothing to do with the sales themselves, but were potentially for office supplies or the like.  Plus it had a bunch of extra columns, date formatted in a dumb way, etc.

No big deal for me – Power BI desktop and Power Query in Excel eat these types of problems for breakfast.  And unless I missed something in my BI 101 class, no snazzy visual or finger gesture on my tablet is going to make up for the fact my dataset from the company’s “single source of truth” is giving me complete garbage.  Oh, but it gets better – this particular report included all the listing information, including the listing title, but didn’t include the listing category, which I needed.  So I’d have to recreate that myself for the purposes of this report, based off of the item title.  Also, the publication year for the item being listed, which was very helpful for certain listings, was buried in with the main title text.  I needed to pull that out and make it its own column.  It was becoming clear that my grand plans to play Witcher 3 all afternoon were evaporating fast.

We’re now teetering dangerously close to Ken Puls/Bill Jelen/Rob Collie territory in terms of ability required to make this happen, and last time I checked, there weren’t folks like this littered across every department of companies I’ve spoken to (see my previous post for some context on that).  Unless of course, I’m simply muscling through the spreadsheet and typing values into new columns manually as I eyeball each line.  Which is EXACTLY WHAT 99% OF PEOPLE WHO THESE NEW TOOLS ARE MARKETED FOR WILL DO.  It’s going to be faster for the vast majority of people in the short term.  Or maybe you’re using Google Docs, or haven’t heard of Power Query, whatever.  But not ME.  No, I’m going to make the folks at PASS proud by writing this great M script and –

If you’re thinking that ends with the phrase, weep like a baby when you notice yet another data issue, you’ve obviously been on this goat rodeo before at a customer site.  As I went to start my data import, I looked more closely at the title field – in the report, if someone bought multiple items in the same transaction, it only lists it as a single transaction line item, with a summarized sales amount.  BUT, it has all the items bought as separate values in the item title field, separated by commas.  So it looked like this –

Sadly, Bo didn’t know profits.

I knew what each item’s sales price was for, because the prices were always the same for the same types of item – for example, an old book would be listed for $9.95, and old Xbox game would be $14.95, etc.  But I still had to break these records into individual line items and then adjust the value on each line to reflect what the price was for that item only, since the overall item total would be copied to each line.


I’m sure it’s possible to get an M script that would automate this, and I’ve done stuff like that before in SQL Server, but I wasn’t about to fiddle with it for the 175 or so rows I needed to clean up.  I sucked it up and did that part of it manually in Excel.  The rest of the data clean-up I outlined I did do in Power BI desktop, but it took some custom M script, two Ken Puls articles (this one is especially useful) and about 200 (!) steps.  Oh and about nine hours of my time.  And Geralt’s time.

But the pain was all worth it – my properly cleaned dataset contained some remarkable insights that I was able to light up in Power BI with some of the new custom visuals that were introduced last week.  And it also reminded me of what it’s like in the real world, because this scenario happens every day, in every company in almost every department.  It certainly happened in my old group more often than my boss coming to me with a perfect data export of Titanic deaths he was looking to throw a pie chart on.

Jen Underwood touched on this theme in her post asking if you’d bought a self-service BI fantasy.  I’d take that one step further – you might need to ask yourself if you’ve bought a corporate BI fantasy as well. Because for all the flaws people feel the Microsoft solutions might have, it’s the only one I’ve found was designed and consistently used as part of every solution here in the real-world, where data is messy or unavailable or both.  And we’ve got pretty great visuals as well – ever seen a Datazen dashboard or that cool fishbowl in Power BI?

Have a great week at PASS!

One thought on “Perfect Data Poppycock – Part 4 of the Powerful Productivity Series

Comments are closed.