Digitally Speaking

Anis Makeriya
August 21, 2017

It’s always the same scenario: someone hands me a set of data files that I just want to dive straight into and start exploring ways to visualise, but I can’t.

I fire up a reporting tool only to step straight back out, realising that before data can take visual shape, it needs to be in shape first! One pattern has held consistently over the years: the less time spent on ETL/ELT (Extract, Transform and Load, in varying sequences), the sooner I find myself leaving the reporting layer and heading back to data prep.

Data preparation for the win

Sayings like ‘80% of time goes into data prep’ and ‘garbage in, garbage out (GIGO)’ have been around for a long time, but they don’t really hit you until you face them in practice and they suddenly translate into backward progress. Data quality issues range from inconsistent date formats and multiple spellings of the same value to values missing altogether in the form of nulls. So how can they all be dealt with? A data prep layer is the answer.
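As a rough illustration of the kind of clean-up involved, here is a minimal sketch in Python using pandas, with made-up column names and values covering the three issues above:

import pandas as pd

# Hypothetical extract showing the three issues: mixed date formats,
# several spellings of the same value, and nulls.
df = pd.DataFrame({
    "signup_date": ["2017-08-21", "21/08/2017", None],
    "country":     ["UK", "United Kingdom", "U.K."],
    "revenue":     [1200.0, None, 950.0],
})

# Parse each date individually so mixed formats don't trip up the load;
# anything unparseable becomes NaT rather than a silent error.
df["signup_date"] = df["signup_date"].apply(
    lambda s: pd.to_datetime(s, dayfirst=True, errors="coerce"))

# Collapse the variant spellings onto one canonical value.
df["country"] = df["country"].replace({"UK": "United Kingdom",
                                       "U.K.": "United Kingdom"})

# Deal with nulls explicitly rather than letting them skew the visuals.
df["revenue"] = df["revenue"].fillna(0.0)

Only once the data looks like this is the reporting layer worth opening.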

With complex transformations or large datasets, analysts often find themselves turning to IT to perform the ETL process. Thankfully, over the years, vendors have recognised the need to build commonly used transformations into the reporting tools themselves. Tools such as Tableau and Power BI have passed this power on to analysts, shrinking time to analysis dramatically. Features such as pivoting, editing aliases, and joining and unioning tables are available within a few clicks.
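For a sense of what those clicks are doing under the hood, here is a small pandas sketch (illustrative tables and values, not any particular tool’s implementation) covering a union, a join and a pivot:

import pandas as pd

# Two hypothetical monthly extracts with identical columns.
jan = pd.DataFrame({"region": ["North", "South"], "month": "Jan", "sales": [100, 80]})
feb = pd.DataFrame({"region": ["North", "South"], "month": "Feb", "sales": [120, 90]})

# Union: stack the extracts into one table.
sales = pd.concat([jan, feb], ignore_index=True)

# Join: enrich with a small lookup of region managers.
managers = pd.DataFrame({"region": ["North", "South"], "manager": ["Ada", "Bo"]})
sales = sales.merge(managers, on="region", how="left")

# Pivot: months across the columns, one row per region.
wide = sales.pivot_table(index="region", columns="month", values="sales")
print(wide)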

There may also be times when multiple data sources need joining on messy keys, such as matching company names. Whilst Excel and SQL fuzzy look-ups have existed for some time, dedicated ETL tools such as Paxata have embedded further intelligence, going a step further and recognising that the solution lies beyond just similar spellings between the names.
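As a back-of-the-envelope version of that idea, Python’s standard library can already do similarity-based matching; the company names below are invented, and a dedicated tool would of course go well beyond spelling similarity alone:

import difflib

# Two hypothetical lists of the same companies, spelled differently.
source_a = ["Acme Ltd", "Globex Corporation", "Initech"]
source_b = ["ACME Limited", "Globex Corp", "Initech Inc."]

# Pick the closest candidate by string similarity alone.
for name in source_a:
    match = difflib.get_close_matches(name.lower(),
                                      [b.lower() for b in source_b],
                                      n=1, cutoff=0.6)
    print(name, "->", match[0] if match else "no match")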

All the tasks mentioned above belong to the ‘T’ (Transform) of ETL, which is only the second or third step in the ETL/ELT process! If data can’t be extracted as part of the ‘E’ in the first place, there is nothing to transform. When information lies in disparate silos, it often cannot be merged unless the data is migrated or replicated across stores. Following the data explosion of the past decade, Cisco Data Virtualisation has gained traction for its core capability of creating a ‘merged virtual’ layer over multiple data sources, enabling quick time to access along with the added benefits of data quality monitoring and a single version of the truth.

These capabilities become even more useful with the rise of data services like Bloomberg and forex feeds, and APIs that can return weather information; and if we also want to know how people feel about the weather, the Twitter API works too.
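Pulling such a feed into the prep layer is usually a small step. The sketch below uses Python’s requests library against a purely hypothetical weather endpoint; the URL and field names are illustrative, not any real service’s API:

import requests

# Hypothetical endpoint and parameters, for illustration only.
resp = requests.get("https://api.example-weather.com/v1/current",
                    params={"city": "London", "units": "metric"},
                    timeout=10)
resp.raise_for_status()
weather = resp.json()

# A feed like this can then be joined to sales or footfall data by date and city.
print(weather.get("temperature"), weather.get("conditions"))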

Is that it…?

Finally, after the extraction and transformation of the data, the load process is all that remains… but even that comes with its own challenges: load frequencies; load types (incremental vs. full loads) depending on data volumes; change data capture and slowly changing dimensions to give an accurate picture of events over time; and storage and query speeds from the source, to name a few.
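To make the incremental-versus-full distinction concrete, here is a minimal sketch (made-up table and watermark) of an incremental load driven by a high-water-mark timestamp:

import pandas as pd

# Timestamp of the last successful load, kept from the previous run.
last_loaded_at = pd.Timestamp("2017-08-20 00:00:00")

# Hypothetical source table with a last-updated column.
source = pd.DataFrame({
    "order_id":   [1, 2, 3],
    "amount":     [50, 75, 60],
    "updated_at": pd.to_datetime(["2017-08-19 10:00",
                                  "2017-08-20 09:30",
                                  "2017-08-21 08:15"]),
})

# A full load would copy everything; an incremental load takes only the delta.
delta = source[source["updated_at"] > last_loaded_at]

# Append (or upsert) the delta into the target, then advance the watermark.
print(delta)
new_watermark = source["updated_at"].max()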

Whilst for quick analysis a capable analyst with best-practice knowledge will suffice, scalable, complex solutions need the right team from both the IT and non-IT sides, as well as the tools and hardware to support them smoothly going forward. Contact us today to help you build a solid Data Virtualisation process customised to your particular needs.