VLDB DBSee on GitHub

It’s far from easy to police a busy Teradata system – even those that think they’re on top of best practice can let standards slip over the years.


Teradata Best Practice

VLDB have been developing Teradata best practice guidelines since 1989. To support this, we have developed a set of SQL scripts that run against the Teradata data dictionary; each script informs you of an aspect of the current state of your Teradata system.

DBSee is now available on GitHub for all to view and use at will – and it doesn’t cost a penny.

DBSee Benefits

The key DBSee deliverable is a set of results displayed within your native database client tool. This can also be output as a .csv file and imported into Tableau for easy-to-understand visualisations.

We have yet to find a Teradata system that doesn’t fail nearly all of the DBSee best practice checks.

The great news is that all best practice violations can be fixed relatively easily. The first step is to uncover them with DBSee.
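To give a flavour of the sort of check involved (this is our illustrative sketch, not an actual DBSee script), the following wraps a simple data dictionary query in BTEQ to list tables with no comment recorded:

```shell
# Illustrative only - not an actual DBSee script. Writes a BTEQ script that
# queries the DBC data dictionary for tables with no comment recorded.
cat > /tmp/no_comments.bteq <<'EOF'
.LOGON SERVER_IP/dbc,PASSWORD
SELECT DatabaseName, TableName
FROM   DBC.TablesV
WHERE  TableKind = 'T'
AND    CommentString IS NULL
ORDER  BY 1, 2;
.LOGOFF
.QUIT
EOF

# bteq < /tmp/no_comments.bteq    # run against your own system like this
grep -c 'DBC.TablesV' /tmp/no_comments.bteq
```

The real DBSee scripts cover far more ground, but they all follow this pattern: read-only queries against the DBC dictionary views.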

Head to our GitHub page to retrieve all of our SQL scripts and for more information on how to run DBSee: https://github.com/VLDB-Solutions/DBSEE

Posted in SQL, Teradata, VLDB | Tagged , , , | Comments Off on VLDB DBSee on GitHub

PostgresConf 2018 Musings

Yours truly attended the Postgres Conf in Jersey City last week with my esteemed colleague Richard Jackson. It was quite a blast, to say the least…apart from not getting home until late Saturday afternoon. Ho hum.

PostgresConf 2018, Jersey City, USA

Pivotal Postgres

PostgreSQL Re-Cap

PostgreSQL, or Postgres for short, is an open source database first developed as a follow-on from Ingres. The Postgres prototype was shown at the ACM SIGMOD conference in 1988. Yes, that’s 30 years ago.

During the early 2000’s, Postgres was frequently used as the basis to develop ‘scale-out’ clustered analytic platforms. Without Postgres there would be no Netezza, Paraccel, Redshift, Aster Data or Greenplum. So, even the new-fangled Postgres-based MPP platforms out there have a 15 year heritage.

Parallel Postgres

The senior team at VLDB (you know who you are!) were early adopters of all things ‘parallel Postgres’. We delivered our first Netezza project in 2003 at Caudwell Communications, the fixed-line telecoms division of Phones-4-U.

As early as 2003-04 we were also engaged with the early incarnations of Greenplum. We were there during the Metapa & Bizgres days. Scott hasn’t aged a single day. Go figure!!!

Greenplum was MPP, which we liked. It was a ‘parallel Postgres MPP’, which we really liked. Most of all, unlike the other MPP offerings we’d used, Greenplum was – and still is – a ‘software only’ play. Now that’s *really* interesting to a consulting outfit like VLDB that likes to ‘roll your own’.

Greenplum Re-cap

After being founded in 2003, Greenplum the company was acquired by EMC in 2010.

Pivotal Software was formed in 2013 as a spin-out from EMC and VMWare. Greenplum was duly added to Pivotal’s software portfolio, hence ‘Pivotal Greenplum’.

Pivotal decided to open source Greenplum in 2015. Greenplum thus became the ‘World’s First Open Source MPP Data Warehouse’, which we like, a lot. Did we mention that?

Back to PostgresConf 2018

Day 2 of the PostgresConf 2018 was the ‘Greenplum summit‘.

The day kicked off with ‘Greenplum: Building a Postgres Fabric for Large Scale Analytical Computation‘ presented by our good friend Jacque Istok, and our new friend Elisabeth Hendrickson.

Immediately following Jacque and Elisabeth was a presentation by Howard Goldberg from Morgan Stanley entitled ‘Greenplum: A Pivotal Moment on Wall Street‘. Jaw-dropping tales of 20PB (10x compressed) Greenplum databases aside, Howard is perhaps the funniest and most engaging speaker we’ve ever had the pleasure to encounter at a conference, and we’ve been to *lots* of conferences.

‘Fessing up to having Pivotal’s Greenplum product manager Ivan Novick on speed-dial ahead of Mrs Goldberg had the audience in stitches. Fine work Howard!!!

How do you follow Howard? Well folks, Shaun Litt from Conversant Media did a mighty fine job with a session entitled ‘Greenplum for Internet Scale Analytics and Mining‘. We’re not easily impressed when it comes to all things ‘big’ data, but when Shaun casually dropped ‘quintillions of rows’ into the session we all sat up and took notice…and then Googled just how big a quintillion actually is.

After a spot of lunch we were treated to ‘Machine Learning, Graph, Text and Geospatial on Postgres and Greenplum‘ by Frank McQuillan and Bharath Sitaraman from Pivotal. A core component of the Greenplum offering is scalable, in-database machine learning, graph, analytics & statistics enabled by Apache MADlib. Pivotal has also put a lot of engineering effort into text search via the Apache Solr based GPText.

The final Greenplum summit session we attended was ‘Pivotal Greenplum in Action on AWS, Azure, and GCP‘ by Pivotal’s Jon Roberts. This was of particular interest given that team VLDB have been deploying Greenplum on public cloud platforms such as Profitbricks, AWS, Azure and Google for years. Welcome to the public cloud party chaps!

As an aside, Jon maintains a most excellent blog at http://www.PivotalGuru.com. Highly recommended.

Pivotal IPO

After a very entertaining evening meal and a ‘few drinks’ with the folks from Pivotal and Blue Talon, on the Friday we headed over to the Pivotal office in NYC. This was no ordinary Friday for our friends at Pivotal (‘Pivots’), no sir. Today was the day Pivotal went public via an IPO on the NASDAQ.

A big ‘well done’ to Greenplum founders Bill Cook and Scott Yara for their roles in bringing Pivotal to IPO. Oh, and ‘Happy Birthday’ to Bill for this week.

The wine, champagne and strange blue drinks were handed out a-plenty at the Pivotal NYC office. A unique way to end a tech conference, for sure.

If you look carefully, you’ll probably spot a couple of VLDB interlopers in the NYC team photo. Sorry for blocking the folks stuck behind us, it was quite a crush!

Conference Takeaways

Although this was a Postgres conference, the main focus for VLDB was the Greenplum summit. We’ve been fans and users of Greenplum database (GPDB) for a *long* time.

The IRS, Morgan Stanley and Conversant can’t all be wrong, surely?

Pivotal has gone ‘all in’ on Greenplum as the open source MPP platform to support analytics at any scale and complexity. It is perhaps informative to note that there is no mention of Hadoop on the Pivotal product page. Not that we called this out early last year, of course…

And Finally…

In addition to the usual pillaging of the Macy’s mens department, we managed to bring some Pivotal ‘swag’ home to the UK. Our two labradors, Molly and Ruby, enjoyed the foam rockets perhaps a little too much. A big thanks from both of them!

Posted in cloud, consulting, data warehouse, database, Pivotal, SQL, VLDB | Tagged , , , , , , , , , , , , , | Comments Off on PostgresConf 2018 Musings

Google, SQL & noSQL Ramblings

A couple of articles on Google and SQL versus noSQL caught my eye recently and inspired me to go on a bit of a ramble, so here goes…

Google’s Head Fake

Firstly, there’s the splendidly titled “Did Google Send the Big Data Industry on a 10 Year Head Fake?” Apparently a ‘head fake’ is to trick your opponent as to your intentions. Probably akin to ‘dropping the shoulder’ in football.

So, the question is, did the Big G send us all the wrong way for a decade? Short answer: no. Longer answer: if you did march the wrong way for several years, have a word with yourself.

The article quite rightly points out that neither distributed file systems nor distributed computing can be attributed to Google. As Curt Monash succinctly stated on DBMS2 in 2010, ‘popularization != invention‘.

However, the thinking goes that Google’s famous research papers gave rise to Doug Cutting’s efforts at Yahoo, which led to Hadoop, whilst all the while the Big G was working on plain old SQL-based Spanner.

The Big G may have been working on other SQL-based stuff whilst the Hadoop crowd were beavering away, but does this amount to a head fake or a shoulder drop? No, not in the slightest.

Many considered the old-fangled SQL-based data warehouse to be under immediate threat due to the rise of the ‘new’ MapReduce-based SQL-free computing paradigm. Shiny & new always wins, right?

Sadly, that can be the case. I’ve witnessed ‘big data’ POCs at first hand where the main motivation was to be seen to be ‘doing big data’. I kid you not. Forget the impact on the users and existing toolsets, stop worrying about ROI, this stuff is free and cool, the vendor is saying all the right things, so it’s definitely the way to go. Woo-hoo!!!

Let’s never forget, there are armies of folks looking to monetise anything that moves in IT. The VCs, start-ups, analysts and conference organisers make their living out of getting mere mortals to believe they have the silver bullet that’s been missing until now.

What worked for Google, Yahoo, Facebook and LinkedIn doesn’t necessarily translate to the mainstream – this doesn’t make it bad tech. Hadoop is a great example. Out in the real world, SQL is the answer for most analytic applications for most folks most of the time, plain and simple.

If you believed the hype and ‘Hadoop’ didn’t turn out to be your silver bullet, don’t blame the Big G for sharing their thoughts or others for building/promoting ‘shiny & new’. No-one told you to put your finger in the fire, did they?

Why SQL Is Beating noSQL

The article entitled ‘Why SQL is beating NoSQL, and what this means for the future of data‘ really got my attention. The notion that SQL was ‘left for dead‘ and is ‘making a comeback‘ is somewhat at odds with reality, me thinks.

There may have been a belief in certain quarters that the ‘answer’ was no longer SQL, but this has never been reflected out in the real world where regular folks run queries, and lots of them.

SQL is very deeply embedded in every single organisation I’ve encountered since the late 1980’s. I don’t see this changing any time soon. There is simply too much sunk cost and too little benefit in trying to ditch SQL.

Software developers may have ‘cast aside SQL as a relic’ but that certainly isn’t what happened throughout corporate IT organisations or analyst communities.

It’s clearly untrue that SQL ‘couldn’t scale with these growing data volumes‘. We’ve had scale-out Teradata MPP systems chomping on SQL since the 1980’s, and newer MPP & SQL players like Netezza & Greenplum for over 15 years. A low industry profile doesn’t mean something doesn’t exist.

As the article states, the developers of SQL recognised that ‘much of the success of the computer industry depends on developing a class of users other than trained computer specialists.‘ which is partly why it has become the de facto standard for interacting with databases.

Apparently ‘…SQL was actually fine until another engineer showed up and invented the World Wide Web, in 1989.‘ I’ll help out here. His name is Sir Tim Berners-Lee OM KBE FRS FREng FRSA FBCS. He’s quite well known.

There is no doubt that mainstream general purpose databases (SQL Server, Oracle, mySQL etc) struggled to cope with data volumes thrown out by all things digital in the new-fangled age of ‘the web’ and ‘the net’.

Some of us were already scaling out (admittedly expensive) MPP systems running SQL databases as early as the 1980’s. Big name retailers, banks and telcos have been running SQL on scale-out database systems for decades.

My first multi-billion row Teradata table is over 20 years old and still alive and well at a retail bank. A 100 billion row telco CDR table is over a decade old and runs on a parallel PostgreSQL system. How much scale do you want?

The rationale behind attempts to ditch legacy SQL databases is neatly summarised: ‘NoSQL was new and shiny; it promised scale and power; it seemed like the fast path to engineering success. But then the problems started appearing.’

The likes of DeWitt & Stonebraker know a thing or two about this stuff and were early nay-sayers. Feel free to disagree, obviously, but dismiss their observations at your peril.

Most of the post-Teradata attempts to develop scale-out SQL-compliant databases have leveraged PostgreSQL. This approach dates back to Netezza over 15 years ago, and includes the mega-popular AWS Redshift.

The light-bulb moment and conversion to SQL is something Teradata went through over 30 years ago: ‘One day we realized that building our own query language made no sense. That the key was to embrace SQL. And that was one of the best design decisions we have made. Immediately a whole new world opened up.’

Robb Klopp covers Chuck McDevitt’s ‘SQL is the answer’ light-bulb moment whilst at Teradata in 1985 in Chuck’s obituary. RIP Chuck, without doubt owner of the biggest brain I ever met.

The King Is Dead. Long Live The King.

SQL didn’t die. It didn’t recede. SQL is far from perfect, but it isn’t likely to go away any time soon.

Kool-Aid drinkers took themselves on a detour into the land of file processing via developer-only languages all of their own accord, whilst all the time mainstream IT organisations and user communities carried on quite happily with boring old legacy SQL. The same SQL that runs the world.

The first time I ran an SQL query on Teradata in the late 1980’s I realised I was immediately free from the pain and suffering of developing Cobol for analytics. To this day I can clearly remember the relief (sorry Grace).

It seems a certain section of the IT community is going through a similar realisation: if you want to play with data, SQL was, is, and should be the default starting place.

The King is Dead. Long live the King.

Posted in Big Data, data warehouse, Google, noSQL, SQL | Tagged , , | Comments Off on Google, SQL & noSQL Ramblings

Azure IO Testing


One of the first things we do here at VLDB when we have a yearning to set up a database on a cloud platform is to develop an understanding of disk input-output (IO) performance.

For databases in particular, we are specifically interested in random read performance. A high random read throughput figure is necessary for an efficient and performant MPP database.

Microsoft’s Azure is the test platform in question. The approach taken is to commission a single CentOS7 VM, attach various disks and measure IOPS & throughput.

Azure VM Setup

A single CentOS 7 VM is configured. Disks are added in the required size, type & number in order to maximise IO throughput at the VM level:

Azure OS Console

The starting point on Azure is SSD disks, as the IO figures for SSD are generally superior to HDD.

Disk size determines both throughput and available IOPS, so this will also need to be considered in the overall build:

Azure Disk Console


Azure VM & Disk Choices

For the 1TB SSD selected above, IOPS are limited to 5,000/sec and throughput is limited to 200 MB/s.

As an aside, cloud providers usually fail to distinguish between read & write, sequential & random or specify IO size when they bandy about throughput figures. The devil is in the detail, dear reader(s).
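With that caveat in mind, it is worth working out which of the two stated limits actually binds at your IO size. A quick sketch, using the P30 figures above and the 32KB blocksize from our tests:

```shell
# Which limit binds for a P30 disk (5,000 IOPS, 200 MB/s) at 32KB blocks?
iops_limit=5000
mbps_limit=200
bs_kb=32

# Throughput achievable if the IOPS cap binds: IOPS x blocksize
iops_bound_mbps=$(( iops_limit * bs_kb / 1024 ))

if [ "$iops_bound_mbps" -lt "$mbps_limit" ]; then
    echo "IOPS-bound: ~${iops_bound_mbps} MB/s max at ${bs_kb}KB blocks"
else
    echo "Throughput-bound: ${mbps_limit} MB/s max"
fi
```

At 32KB the disk is IOPS-bound at roughly 156MB/s; the 200MB/s throughput cap only comes into play at block sizes above 40KB or so.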

In order to use premium storage it is necessary to specify the account as ‘Premium LRS’.

The range hosted in the UK is limited, so if you have a requirement for a UK-based system, make sure that what is on offer from Azure will satisfy the requirement:

Azure UK Console

Note that only certain VM types support SSD disks.

For the basic D14v2 with 16 CPU cores, 112GB RAM and 800GB disk, it is not possible to specify SSD disks. However, the DS14v2 VM type will allow these ‘Premium’ disks.

Also note that the 800GB is for the operating system and swap space only – the storage is non-persistent on this device specifically.

The D14 also allows up to 8 NICs and has ‘extremely high’ network bandwidth. This is useful if the requirement is to build an MPP cluster that needs high speed node-to-node data transfer.

The DS14v2 VM is limited to 32 disks:

Azure ds14v2

The Standard_DS14_v2 VM is limited to 64K IOPS or 512MB/s in cached mode (host cache = read or read/write) or 51K IOPS (768MB/s) in uncached mode (host cache = none).

The P30 1TB SSD disks used are rated at 5K IOPS (200MB/s) max per disk:

Azure P30

Azure IOPS Testing

Booting CentOS7 and upgrading the kernel takes a matter of minutes.

FIO is then downloaded using git and installed to perform the IOPS tests:

git clone --branch fio-2.1.8 http://git.kernel.dk/fio.git /fio

cd /fio && ./configure && make

To run a FIO throughput test, simply enter following from the fio directory:

./fio --rw=randread --name=<directory>/fio --size=<size>g --bs=32k --direct=1 --iodepth=16 --ioengine=libaio --time_based --runtime=<elapse> --group_reporting

  • <directory> – the directory (disk) where the tests are executed.
  • <size> – size of the test file created, in GB.
  • <elapse> – elapsed time for the test, in seconds.
  • bs=32k – blocksize (default 4k). 32KB is a good size to test for database suitability.
  • direct=1 – non-buffered IO.
  • iodepth=16 – concurrent IO threads.
  • ioengine=libaio – native Linux asynchronous IO.
  • time_based – run for the specified elapsed time.
  • group_reporting – report on group totals rather than individual jobs.

For example:

./fio --rw=randread --name=/tests/disk1/fio --size=1g --bs=32k --direct=1 --iodepth=16 --ioengine=libaio --time_based --runtime=30 --group_reporting
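With several disks attached, the same test can be looped over each mount point. This sketch assumes the disks are mounted under /tests/disk1, /tests/disk2 and so on (our own convention), and prints each command rather than running it:

```shell
# Run the same 30s random read test against each attached disk in turn.
# Mount points are an assumption from our own setup - adjust to taste.
for dir in /tests/disk1 /tests/disk2 /tests/disk3; do
    cmd="./fio --rw=randread --name=${dir}/fio --size=1g --bs=32k \
--direct=1 --iodepth=16 --ioengine=libaio --time_based --runtime=30 \
--group_reporting"
    echo "$cmd"    # replace echo with eval to actually run the test
done
```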

Azure Disk Read Results

Multiple random & sequential read tests were executed against a Standard_DS14_v2 VM with between 1 and 6 P30 1TB SSD disks attached. Blocksize is 32KB in all cases.

As stated earlier, we are primarily, but not exclusively, interested in read and not write performance when assessing a cloud platform. This is due to the fact that 80-90% of all data warehouse IO activity is read, and only 10-20% is write. If the read performance isn’t good enough we don’t really care about write performance as the platform is probably unsuitable.

Sequential and random read results using fio on Azure are as follows:

Azure IOPS Results

Azure Read Performance Observations

Random and sequential read give similar performance. This is both unusual and most welcome for those of us that rely on random read performance for data warehouse systems!

At the VM level, total disk throughput does not increase as more disks are added to the VM. Disk throughput is limited by the VM type and/or user account type. Adding more disks to the VM is not a way to deliver more IOPS.

Approximately 540-570MB/s random read throughput with a 32KB blocksize was achieved fairly consistently with between 1 and 6 P30 1TB SSDs attached to a Standard_DS14_v2 VM.

This is slightly higher than the stated 512 MB/s limit for a Standard_DS14_v2 VM, and represents a very good random read throughput figure for a single disk.

However, it is disappointing that this figure can’t be increased nearer to 1,000MB/s (which is attainable on AWS) through the use of more than 1 disk per VM.

Posted in Azure, cloud, IOPS | Tagged , , | Comments Off on Azure IO Testing

MPP & Redshift Musings

What On Earth is MPP?

In computing, massively parallel refers to the use of a large number of processors (or separate computers) to perform a set of coordinated computations in parallel (simultaneously). Source: Wikipedia.

Teradata & MPP Beginnings

Once upon a time, in a land far, far away…well, OK, California in the late 1970’s/early 1980’s to be precise…the MPP database world started to stir in earnest.

Following on from research at Caltech and discussions with Citibank’s technology group, Teradata was incorporated in a garage in Brentwood, CA in 1979. Teradata’s eponymous flagship product was, and still is, a massively parallel processing (MPP) database system.

Teradata had to write their own parallel DBMS & operating system. They also used weedy 32bit x86 chips to compete with IBM’s ‘Big Iron’ mainframes to perform database processing. Quite an achievement, to say the least.

The first Teradata beta system was shipped to Wells Fargo in 1983, with an IPO following in 1987.

It is interesting to note, for those of us in the UK, that there were Teradata systems up and running over here as early as 1986/87, partly down to the efforts of folks such as our very own ‘Jim the phone’.

Early Teradata adopters in the UK included BT (who later moved off Teradata when NCR/Teradata was acquired by AT&T), Royal Insurance, Great Universal Stores (GUS) and Littlewoods. GUS and Littlewoods combined to become ShopDirect, who still run Teradata 30 years later.

Yours truly ran his first Teradata query at Royal Insurance in Liverpool from ITEQ on an IBM 3090 back in ’89. It killed the split screen in MVS, but ho-hum, better than writing Cobol against IMS DB/DC to answer basic questions. We were taught SQL by none other than Brian Marshall who went on to write the first Teradata SQL & performance books. I still have the original dog-eared Teradata reference cards, or ‘cheat sheets’ as we called them:

Teradata ITEQ/BTEQ & Utilities Reference Cards.

Teradata ITEQ/BTEQ & Utilities Reference Cards, 1988 vintage.

From the 1980’s through to the early 2000’s Teradata had a pretty clear run at the high-end analytic DBMS market. There was no serious competition, no matter how hard the others tried. All those big name banks, telecoms companies and retailers couldn’t be wrong, surely?

MPP Upstarts – Netezza, Datallegro & Greenplum

Teradata’s first real competition in the commercial MPP database space came in the form of Netezza in the early 2000’s.

Like Teradata, Netezza consisted of dedicated hardware & software all designed to work in harmony as an engineered MPP database ‘appliance’. Unlike Teradata, Netezza was able to take advantage of open source DBMS software in the form of PostgreSQL, and open source OS software in the form of Linux.

We discovered Netezza by accident in 2002/03 after landing on a PDF on their web site following a Google search. “Netezza is Teradata V1” was our initial response. Apart from the FPGAs, we were pretty close.

A few phone calls, a trip to Boston and a training session later, and we’re up and running as Netezza partners.

Following a successful Netezza project at the fixed line telecoms division of John Caudwell’s Phones4U empire, yours truly was a guest speaker at the inaugural Netezza user conference in 2005.

Following an IPO in 2007, Netezza was bought by IBM in 2010 where it remains to this day, somewhere in IBM’s kitbag. Poor old Netezza.

Also in the early 2000’s, a bunch of mainly Brits at Datallegro were trying to build an MPP appliance out of Ingres. They were bought by Microsoft in 2008. Why Microsoft needed to buy a startup running Ingres on Linux is anyone’s guess.

The last of the new MPP players we had dealings with all those years ago are none other than Greenplum.

Through the Netezza partner ecosystem we actually knew the Greenplum team when they were still called ‘Metapa’, probably around 2002/03. Yours truly was at a dinner with Luke & Scott in Seattle as they wooed none other than Teradata luminary Charles ‘Chuck’ McDevitt (RIP) to head up the Greenplum architecture team. With Chuck on board they were virtually guaranteed to succeed.

Greenplum was always of particular interest to us because, unlike Teradata & Netezza, it was and still is a software only offering. You get to choose your own platform, which can be your favourite servers/storage or even that new fangled ‘cloud’ thing.

Not only have we got Greenplum to work happily on every single platform we ever tried, Greenplum is also the only open source MPP database.

Other Parallel PostgreSQL Players

In addition to early parallel PostgreSQL players such as Netezza and Greenplum (Datallegro used Ingres, not PostgreSQL), a whole raft of ‘me too’ players cropped up in 2005:

  • Vertica – acquired by HP in 2011
  • Aster – acquired by Teradata in 2011
  • Paraccel – acquired by Actian in 2013

It is interesting to note that, unlike Netezza, not one of the new parallel PostgreSQL players put even the smallest dent in Teradata’s core MPP data warehouse market. At least Netezza shook Teradata from their slumber which, in turn, gave the Teradata appliance to the world.

That said, perhaps Paraccel will have the biggest impact on the MPP database market which, for so long, has been dominated by Teradata, the main player in the space for 30 years. How so? Read on…


Amazon Redshift

Amazon launched the Redshift ‘Data Warehouse Solution’ on AWS in early 2013.

Yours truly even wrote some initial thoughts at the time, as you do.

As has been well documented, Redshift is the AWS implementation of Paraccel, which was one of the crop of ‘me too’ parallel PostgreSQL players that appeared in 2005.

Redshift adoption on AWS has been rapid, and reported to be the fastest growing service in AWS.

So, despite Teradata having been around for 30 years, the awareness of MPP and adoption of MPP as a scalable database architecture has been ignited by Redshift, which started out as Paraccel, which is built out of PostgreSQL.

Redshift Observations

Our previous post on Redshift in 2013 made mention of the single column only distribution key, and the dependence on the leader node for final aggregation processing.

There is a workaround for the single column only distribution key restriction, for sure. MPP folks will no doubt be mystified as to why such a restriction exists. However, it neatly removes the possibility of a 14 column distribution key (primary index) that we encountered at a Teradata gig a few years ago (no names). Cloud, meet silver lining.
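The usual workaround, sketched below with hypothetical table and column names, is to combine the would-be multi-column key into a single derived column and declare that as the DISTKEY:

```shell
# Sketch of the single-column distribution key workaround (hypothetical DDL).
# The ETL populates dist_key from the columns you would otherwise have used,
# e.g. md5(region_id || '|' || store_id), giving multi-column-like spread.
ddl=$(cat <<'EOF'
CREATE TABLE sales (
    region_id  INT,
    store_id   INT,
    sale_date  DATE,
    amount     DECIMAL(12,2),
    dist_key   CHAR(32)
)
DISTKEY (dist_key);
EOF
)
echo "$ddl"
```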

The dependence on a leader node for final aggregation processing is much more of an issue. Aggregation queries that return a high number of groups will simply choke on the final aggregation step. Just ask Exadata users. There’s also not much you can do to remediate this issue if you can’t over-provision CPU/RAM at the leader node, which is the case with Redshift.

More recent Redshift observations from our own experiences, and from trusted contacts (you know who you are) include the following:

  • lack of node OS access – security auditing, hardening & OS customisation not possible.
  • non-persistent storage – if you stop or lose the EC2 cluster you lose the database data & will need to re-load from S3 or a database snapshot.
  • poor concurrency – default is 5 concurrent operations. Although this can be increased, performance drops off quickly as concurrency increases.
  • poor workload management – no short query ‘fast path’. Redshift is essentially a ‘free for all’.
  • 1MB block size & columnar only storage – very large table space overhead: at least 1MB per column per segment for every table.
  • limited tuning options – no partitions or indexes. If the sort keys don’t help it’s a full table scan every time.
  • automatic database software updates – this might appeal to the ‘zero admin is good’ crowd, but enterprise customers will baulk at the notion of a zero-choice upgrade that could break existing applications.
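The block-size overhead in the list above adds up faster than you might expect. A quick worked example with hypothetical numbers:

```shell
# Minimum on-disk footprint of a single (even near-empty) table:
# roughly columns x slices x 1MB.
columns=100
slices=32     # e.g. a hypothetical cluster with 32 slices in total
min_mb=$(( columns * slices ))
echo "Minimum footprint: ${min_mb} MB for a table with ${columns} columns"
```

So a single 100-column table on a 32-slice cluster occupies around 3GB before a single row is loaded; a few hundred such staging tables and the overhead runs into the hundreds of gigabytes.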

Lack of OS access and non-persistent storage will no doubt be show-stoppers for some enterprise folks.

The SQL Server crowd will probably not care and just marvel at the performance offered by SQL queries that run against all CPU cores in the cluster. Unless, of course, they didn’t read the data distribution notes.

Meet The New Boss, Same as The Old Boss

No matter that Redshift could be improved, what we can applaud is that Amazon has opened many folks’ eyes to the benefits of a scalable ‘old-skool’ relational database management system (RDBMS), and one with an MPP architecture to boot. For this we can only be thankful.

The rate of Redshift uptake speaks to the usefulness of a scalable RDBMS. The architecture is being socialised by Amazon (high volume/low margin) in a way that never was by Teradata (low volume/high margin).

All of the MPP database vendors, Amazon included, owe a debt of gratitude to Teradata who proved the architecture 30 years ago. That Teradata got so much so right so long ago never ceases to amaze yours truly.

To Redshift the old adage remains true, there really is ‘nowt new under the sun’.

Posted in MPP, Redshift, Teradata | Tagged , , | Comments Off on MPP & Redshift Musings

Teradata AWS Access How To Guide


We previously explained how to set up a Teradata MPP (multi-node) system on Amazon Web Services (AWS). Well, now we’re back with a simple step-by-step guide on how to set up basic SSH connections, install Teradata’s Tools & Utilities (TTU) and set up a desktop SQL client for a Teradata instance running on AWS.


SSH & Key Pairs Identity

When you first set up your server on AWS, you are required to create a key pairs identity. With this, you can start accessing your server straight away. There are multiple ways to connect, though one of the more common is via the SSH tool.

– Mac

If you are connecting via a terminal, then the command will be similar to the usual connection string:

ssh -i KEY_PAIRS_FILE ec2-user@SERVER_IP

However, you may need to first alter the permissions of the key pairs file. If access is open to any user, the connection will reject the file input. This can be done as follows:

chmod 400 KEY_PAIRS_FILE

Note that the ec2-user is the default user for most Linux systems. Once logged on as ec2-user, you can gain access to root with the command sudo su -.

If you find yourself logging on and off a lot, it may be convenient to set your key pairs file path as an environment variable for ease of use.
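For example (the variable name, key path and wrapper function are our own convention, not an AWS one):

```shell
# Save the key pairs file path once, then reuse it via a small wrapper.
export AWS_KEY="$HOME/keys/my-key-pair.pem"    # your .pem path here

aws_ssh() {
    # connect as the default ec2-user using the saved key pairs file
    ssh -i "$AWS_KEY" "ec2-user@$1"
}

# usage: aws_ssh SERVER_IP
type aws_ssh >/dev/null && echo "aws_ssh ready"
```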

– Windows

If you are connecting via PuTTY on Windows, then you are required to first modify the key pairs file using PuTTY-Gen. Typically this piece of software is downloaded alongside PuTTY, but you may need to download it separately.

Once you’ve loaded up the PuTTY-Gen software, click Load on the right and then select your key pairs file. In the bottom right you may need to change the file type that is visible (default is .ppk and the key pairs file is a .pem).

Once loaded, it should pop up with a message saying Successfully Imported… Once you have this, click Save Private Key in the bottom right and save the file. This will now be a usable key pairs file for PuTTY’s use.

Now, open up PuTTY.

On the left side in the list, expand Connection, and then SSH. Now click on Auth and you should see a view on the right titled Options controlling SSH authentication.

From here, click on Browse in the bottom right and then select the key pairs file you just created.

Now go back to the main page by clicking on Session at the top of the list on the left. Here, put in your server IP address as normal and hit Open.

Note that it may be of convenience to save the session details before opening the connection for ease of use at a later time.

– Password Authenticated Login

If you require password authentication to be enabled, then it will be a simple edit of a system config file and a restart of the SSH service on the AWS server.

First, edit the file /etc/ssh/sshd_config, and then find and change the following line:

PasswordAuthentication no

to

PasswordAuthentication yes

Note that you should also remove the hash (#) symbol from in front of the parameter if there is one. Once the file is edited as above, you will then need to restart the SSH service as follows to commit the change.

service sshd restart

Once the service is restarted, password authentication should be updated and enabled on your system.
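The edit can also be scripted. The sketch below applies the change to a scratch copy of the config file so you can see the effect; point it at /etc/ssh/sshd_config, as root, for real use:

```shell
# Flip PasswordAuthentication to yes, uncommenting the line if needed.
# Demonstrated on a scratch copy rather than the live config file.
cfg=$(mktemp)
echo "#PasswordAuthentication no" > "$cfg"

sed -i 's/^#\{0,1\}PasswordAuthentication .*/PasswordAuthentication yes/' "$cfg"

grep PasswordAuthentication "$cfg"
# service sshd restart    # then restart sshd to pick up the change
```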

Teradata’s Tools and Utilities (TTU)

It is very useful to understand that all Teradata servers on AWS come with the equivalent TTU version packaged for Linux, Mac and Windows. With your key pairs file to hand, you can use it, similarly to SSH access, to download the files using SCP.

But first you will need to make the file(s) available to the required user for download (ec2-user in our example), as by default only root can access the files.

First, log on to the server over SSH as ec2-user. Change to the root user and then run [something similar to] the following commands:

cp /var/opt/teradata/TTU_pkgs/Teradata*MACOSX* /home/ec2-user/

chown ec2-user /home/ec2-user/Teradata*MACOSX*

Note that you will need to replace the name MACOSX with Linux or Windows if you require those instead.

Mac –

Once done, use the following command to retrieve the file for your Mac via terminal:

scp -i KEY_PAIRS_FILE ec2-user@SERVER_IP:/home/ec2-user/Teradata*MACOSX* .

Windows –

The simplest way to download the package on Windows is to use the WinSCP software. For this you will again require the key pairs (.ppk) file that you created using PuTTYgen.

First, open up WinSCP.

From here, click on Advanced on the right which will pop up a new window. In the list on the left, click on Authentication. Similarly to PuTTY, click on the […] icon under Private key file and select the key pairs .ppk file, and then click OK.

Now you can input your IP address and user name as per usual, and then Login. Once logged in, you can freely move around your local and server directories to download the files as required.

Note that it may be convenient, as with PuTTY, to save the session details before opening the connection, for ease of use later.

Installation –

Once you have downloaded the required packages, it is a simple matter of unpackaging the files and running the setup script contained within to install the TTU software.

For example, on a Mac terminal locate the TTU package and then run the following commands:

gunzip Teradata*MACOSX*

tar -xvf Teradata*MACOSX*

./setup.bat a

Note that using tar as above will result in all files being placed in the directory you are located in. Additionally, using the a parameter for the setup script will install all available TTU software.

Once installation is complete, you should be able to use BTEQ from your terminal to connect to your Teradata instance (however you may need to re-load your terminal first).
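As a quick smoke test — SERVER_IP and the password below are placeholders for your own instance details — a minimal BTEQ session looks something like this:

```
bteq
.logon SERVER_IP/dbc
Password: ********
select * from dbc.dbcinfo;
.logoff
.quit
```

The dbc user is the administrative account whose password you set when creating the instance; any other Teradata user works just as well.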

Teradata Studio Express

Whilst there are a wide variety of desktop SQL clients available that are compatible with Teradata, and you may very well have your preference out of them all, this will be a quick setup and overview of Teradata Studio Express.

Teradata Studio Express is openly available for download on Windows, Mac and Linux platforms from the following website address:


Note that for this software to install successfully, the Java Development Kit (JDK) must be installed.

Once downloaded, simply unzip the package and run the installer. This will take you through a basic on-screen installation process that typically takes less than a minute. Once finished, the software is ready to use.

When first launching Teradata Studio Express you will be greeted with a Quick Tour guide for the studio. Quickly run through the guide to help you understand the basics (if required) and then continue to the main view.

From the main view, simply right-click Database Connections at the top of the far left panel and click New…. In this window, click Teradata, name the connection as you wish at the bottom under Name: and then hit Next.

In this next view you will see various parameters to fill in, and some pre-defined as well. Unless required, we would recommend leaving the pre-set parameters as default, only altering the following:

  • Database Server Name – The IP address or server name to connect to.
  • User Name – The name of the user that you wish to connect to Teradata with.
  • Password – The password for the respective user chosen above.
  • Database – Defaults to the user’s default database if not specified. All database hierarchy existing under the specified (or default) database will be available for viewing once the connection is made.

Once the details required have been completed, you may click Next to get an overview of the details you’ve specified, or you may directly Finish the connection setup. At this point, you should have a successful connection in the far left panel for you to view. Any changes required to an existing connection can be made by simply right-clicking the connection name and clicking Properties.

With the connection now open, you may easily run queries against the database via the SQL Editor in the top panel on the right. Our basic example shows the following query being run:



select *
from dbc.dbcinfo
;


The query may be executed by simply clicking on the green triangle in the top right (or by using the shortcut CTRL + ALT + X on a Windows environment, or CMD + ALT + X on a Mac).

Results may then be viewed in the middle panel titled Teradata Result Set Viewer, with history available for all previously executed queries in the bottom panel, suitably titled Teradata SQL History.

Whilst this should cover the basics required to use Teradata Studio Express, there are of course many other useful features that can be utilised. One simple example is the Reexecute Statement command: right-click any row in the Teradata SQL History and choose Reexecute Statement, which, as the name suggests, re-executes the selected query statement against the database and returns results as per usual in the Teradata Result Set Viewer.


And there you have it! A simple step-by-step run through on the basic setup of connections, utilities and client tools that may be required by yourselves for any Teradata instance running on AWS.






Posted in AWS, SQL, Teradata | Tagged , , , , | Comments Off on Teradata AWS Access How To Guide

Is Hadoop Failing?

‘Hadoop Is Failing’ Article

A recent LinkedIn post linking to an Innovation Enterprise article entitled ‘Hadoop Is Failing’ certainly got our attention, as you might expect.

Apart from disagreeing with the assertion that ‘Hadoop…is very much the foundation on which data today is built’ the main thrust of the article is well founded.

Consider the following snippets:

  • ‘adoption appears to have more or less stagnated’
  • ‘trying to figure out how to use it for value’
  • ‘one man’s big data is another man’s small data’
  • ‘hundreds of enterprises were hugely disappointed by their useless 2 to 10TB Hadoop clusters’

And the grand finale:

Most companies do not currently have enough data to warrant a Hadoop rollout, but did so anyway because they felt they needed to keep up with the Joneses…they soon realize that their data works better in other technologies.

This can be summarised as: drank the Kool-Aid, convinced each other Hadoop was a silver bullet, rocked up a POC stack with an all-too-willing vendor, spent a lot of time/effort getting nowhere, quietly went back to existing data management technologies (almost certainly the RDBMS based data warehouse).

Mention is also made of the Hortonworks share price performance, which is down by about 66% since the December 2014 IPO. The NASDAQ is up 19% over the same period.

VLDB’s ‘Big Data’ Experiences

We’ve been exposed to the recent ‘Big Data’ thang from both a client and vendor perspective.

From the client side the main motivators appear to be twofold: reduce data warehouse costs and ‘fear of missing out‘ (FOMO) or keeping up with the Joneses. In both cases the amount of Kool-Aid supplied by VCs, vendors, analysts, tech journos and geek fanbois willingly gulped down by those that should know better is a tad disappointing.

The architect who glibly told us his company would ‘obviously’ have to re-train thousands of analysts away from SQL and into the realms of Pig and Hive springs to mind. This despite having 25 years of skills invested in the existing EDW.

From the vendor perspective there has been a lot of ‘over promising’ that has failed to materialise. ‘Nothing new there’, the cynics might suggest.

The challenge during one POC was to run an existing batch suite of SQL against a relatively simple schema lifted from the EDW as part of an EDW-offloading test.

The POC was to be attempted using a Hadoop stack running Hive, and was categorised by the vendor as ‘just a SQL race’. Given the stature of the vendor we reasonably assumed their confidence was well placed. How wrong we were – the SQL failed a few steps in. Very many tickets later, still no joy. EDW offloading? Err, no. EDW replacement? LMAO!

The point is, our real-world experiences bear out the main thrust of the article: the rush to get involved with Hadoop has *largely* been done without any real understanding of it and the failure to get it working, for whatever reason, has led to a lot of disillusionment and/or stagnation.

VLDB’s Guidance

First of all, be honest with yourself: you’re probably not Google, Facebook or Yahoo.

These companies conceived, designed & built the Hadoop ecosystem out of necessity. They have <understatement>slightly</understatement> different challenges to yours.

They also have a vastly different set of in-house skills. You’re probably a bank, a retailer, a manufacturer, a telecoms company or the like. You simply don’t possess the engineering capability that the ‘Hadoop crowd’ take for granted. Your salary scales also don’t allow the required talent to be hired. That’s before we take your location, stock options & ‘coolness’ into account!

When considering *any* new technology, it should always be apparent what extra capability is on offer, and what specific business benefit that capability delivers. Implicitly, this should be something we can’t already achieve with the tools at our disposal.

On a personal note, the adage that “there’s nowt new under the sun” is my default starting position. Greater minds than mine think there is a lack of novelty with MapReduce. We’ve had parallel data processing for decades. Bah humbug, etc.

Before any technology POC (Hadoop included) gets off the ground, the business case in the form of potential ROI should be established, as we all know. FOMO and ‘keeping up with the Joneses’ does not a business case make, no matter how much Kool-Aid is on offer.

During the ROI assessment, it’s all too easy to tot up the potential project benefits without being honest about the costs. Hadoop stacks aren’t free, no matter how much IT bang on about open source software, commodity servers/storage & vendor consulting support. Look at the amount of in-house resource required and the costs add up very quickly.

As several of our clients have rightly concluded, in most instances, Hadoop is a classic case of a solution looking for a problem – a problem that simply doesn’t exist for most organisations.

US tech companies assume the rest of the world operates at the same scale as the US. We simply don’t. Most organisations in the UK and Europe that we’ve encountered over the years operate at the 1-20TB scale. Over 100TB is rare. The case for a data processing stack running hundreds or even thousands of nodes simply doesn’t exist.

To underpin wider adoption, the recent focus has been to polish ‘SQL on HDFS’. This leads neatly back to where we started: the SQL compliant parallel data warehouse i.e. a database!

Love Your Data Warehouse

Prove us wrong by all means, but it is our belief that 99% of organisations can deliver 99% of their data analytics requirements without resort to a ‘Hadoop stack’. Go back and give the data warehouse you take for granted some love 🙂

We’ve yet to hear of a *single* case where an existing RDBMS data warehouse has been ditched for a Hadoop stack. Not one.

So, is Hadoop failing? No, not when used by the few organisations that have the right problem & skills.

Are Hadoop projects failing for lots of other organisations? Yes, mostly.

Posted in Big Data, data warehouse, Hadoop | Tagged , | Comments Off on Is Hadoop Failing?

Xmas Wish List

All We Want For Christmas

It’s that time of year…yes Christmas (Xmas for short), most definitely not <yuck>’holidays'</yuck>.

There’s far too much chocolate in the VLDB office as you might expect. Geeks & chocolate are a winning combo, so it won’t last long.

Moving on…office Xmas decorations – check. Office Xmas party – check. Silly jumpers – check. Secret Santa – check. Loud Xmas music – check.

As we draw ever nearer to *finally* going home for Xmas, our thoughts turn to what VLDB’s Xmas list to Santa might look like…so, here goes…

Dear Santa, can you please make sure clients understand at least some of the following:

  1. Data warehouse systems aren’t a side-project you can pay for with left over funding from another project. Real funding, sponsorship, requirements & commitment are required. Subject Matter Experts (SMEs) or Business Analysts (BAs) will need to provide guidance to get an analytic solution delivered. Technology & designers/developers on their own won’t get very far.
  2. High praise from the likes of Gartner doesn’t mean a particular technology is a good fit for your organisation. Figure out your needs/wants/desires afore ye go looking to buy shiny new tech. Thou shalt not believe all thou hears at conferences. It’s the job of tech companies, VCs, analysts & conference organisers to whip up excitement (see kool-aid). They’re not on the hook for delivery.
  3. Accurate estimates for design/build/test are only possible if analysis is carried out. Either you do it, pay us to do it, or accept estimates with wide tolerances.
  4. Quality Assurance (QA) is not the same as unit testing. It’s a real thing that folks do. Lots of them. No really!
  5. CSV files with headers and trailers are a perfectly acceptable way to build data interfaces. Lots of very large organisations are guilty of clinging on to this ‘unsexy’ approach. It ‘just works’.
  6. You’ll probably need a scheduler to run stuff. cron is not a scheduler. Nor is the DBA.
  7. If you could have ‘built that ourselves in SQL Server in 5 days’ you would have already done so.
  8. Don’t focus on our rate card. Focus on the project ROI. Oh, wait, you haven’t even thought about ROI. Doh!
  9. Yes, we can get deltas out of your upstream applications without crashing the system. It’s what we do. We’ll even prove it.
  10. If you want us to work on site we’ll need desks to sit at, preferably next to each other. We’re picky like that 😉

Have a great Xmas & New Year Santa,

Love from all at VLDB

Have a great Xmas & New Year, and here’s to 2017.

Posted in consulting, data warehouse | Comments Off on Xmas Wish List

Teradata Setup on AWS

Teradata AWS Setup

As previously mentioned, Teradata MPP (multi-node) setup on AWS is a tad more involved than the previous SMP (single node) setup. Well, as we’re kind-hearted folks here at VLDB, we decided to show you how it’s done. Well, how we did it anyway.

Let’s see if, armed with an Amazon account and a credit card, you can have Teradata up and running on AWS in under an hour, as claimed by Stephen Brobst at the recent Teradata Partners conference.

Tick-tock, here goes…

Teradata AWS Marketplace

First of all, log in to the AWS Management Console and select AWS Marketplace (under ‘Additional Resources’ on the right), then search for ‘Teradata Database Enterprise Edition’ in the AWS Marketplace to go to the Teradata Database Enterprise Edition (1-32 nodes) page. From here you will have to subscribe to Teradata if you haven’t already done so.



Select the required ‘Region’ – in this case ‘EU(Ireland)’ – then click the ‘Continue’ button on the ‘AWS Marketplace Teradata Database Enterprise Edition (1-32 nodes)’ page to go to the ‘Launch on EC2’ page.

Launch Teradata on EC2


Set the following as required:

  • Version – currently only ‘’ available.
  • Region – ‘EU (Ireland)’ is the closest region for the UK.
  • Deployment – local or EBS storage and VPC options.

Click ‘Launch with CloudFormation Console’ to go to the ‘Create Stack/Select Template’ page.

AWS Cloud Formation Stack Template


Ensure you are in the correct AWS region (top right corner next to ‘Support’ drop-down) then click on ‘Next’ to go to the ‘Create Stack/Specify Details’ page.

AWS Cloud Formation Stack Details

Set the following & leave other values as default:

  • Stack name – the name of the stack that will be created in CloudFormation.
  • System Name – the name of the Teradata instance.
  • DBC Password – the password for Teradata DBC user.
  • Number of Nodes – the number of EC2 nodes required (1-32).
  • Instance & Storage Type – select the EC2 instance type and storage type/size required.
  • Availability Zone – choose availability zone from list.
  • Remote Access From – specify the CIDR block (IP range) from which SSH/DBMS access is allowed. Use 0.0.0.0/0 if required (any IP allowed) to test the setup process.
  • AWS Key Pair – a pre-existing key pair must be specified.

See http://www.vldbsolutions.com/blog/teradata-on-aws-configurations-pricing/ for help on choosing EC2 instance & storage types. Apart from the number of nodes, this is the biggest driver of the cost of your Teradata on AWS stack.

Click on ‘Next’ to go to the ‘Create Stack/Options’ page.

AWS Cloud Formation Stack Options


Nothing needs to be set/changed on this page…unless you think otherwise.

Click ‘Create’ to proceed with the stack/instance creation which is monitored via CloudFormation.

AWS Cloud Formation Stack Review

It generally takes between 20 and 30 minutes to provision a single m4.4xlarge EC2 instance with 5TB EBS storage. The process is the same irrespective of the number of nodes.

There are lots of steps to go through as part of the provisioning. Once complete, the status will change to ‘CREATE_COMPLETE’:


Teradata on AWS Up and Running

Once the stack is up and running the next stage is to connect via SSH, Teradata Tools & Utilities (TTU) and a desktop SQL client. This is quite a big topic in itself and will be covered in a separate blog post.

So, to get back to Mr Brobst, we think it is possible to be up and running with Teradata on AWS in under an hour, but only if the node count is low, and only if you follow a guide that somebody has prepared…such as this one.


Posted in AWS, cloud, Teradata | Tagged , , | Comments Off on Teradata Setup on AWS

Teradata MPP on AWS

Teradata MPP on AWSMulti-node Teradata on AWS

Teradata’s eponymous database became available on AWS in Q116 as a single node (SMP) only offering, as detailed in a previous VLDB blog post:

Teradata on AWS – Configurations & Pricing

Of far more interest to most folks was the availability of multi-node (MPP) Teradata systems on AWS. The ‘full fat’ version of Teradata was slated for Q316 and was duly made available during Teradata Partners 2016 via the AWS Marketplace.

The Teradata MPP version is now available on AWS to support Teradata clusters of between 1 and 32 SMP nodes.

Teradata Enterprise Edition on AWS Features

The previous SMP-only Base and Base+ editions have been consolidated into a single ‘Teradata Enterprise Edition’ with the following features:

  • Teradata DBMS v15.10 on SLES 11 including columnar, temporal & row level security
  • Teradata Active System Management (TASM)
  • Teradata Data Stream Utility
  • Teradata Parallel Transporter (TPT)
  • Teradata QueryGrid
  • Teradata Studio
  • Teradata Tools and Utilities (TTU)
  • Teradata REST Services
  • Teradata Server Management
  • Teradata Viewpoint (excluding Teradata Data Lab)

Additional cost items include:

  • Teradata Data Lab
  • Teradata Data Mover
  • Teradata Ecosystem Manager

Support is included in the Teradata software subscription and consists of both 24×7 Premier Cloud Support and access to Teradata At Your Service (TAYS).

A feature of running Teradata on AWS likely to be of interest to a lot of folks is the ability to backup Teradata data to S3.

Teradata MPP on AWS Pricing

At the time of launch, pricing for the m4 EC2 instances with persistent EBS storage in the Ireland region is as follows:

m4.4xlarge (16 vCPU, 64 GB RAM, 20 AMPs)

EC2 = $1.056/hr or $770.88/month
Teradata = $5.973/hr or $4,360.29/month
Total = $7.029/hr or $5,131.17/month

m4.10xlarge (40 vCPU, 160 GB RAM, 20 AMPs)

EC2 = $2.641/hr or $1,927.93/month
Teradata = $16.417/hr or $11,984.41/month
Total = $19.058/hr or $13,912.34/month

Note that these prices are per node.
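As a sanity check on those figures: AWS monthly estimates assume a 730-hour month, so each monthly price is simply the hourly rate multiplied by 730. A quick sketch, using the m4.4xlarge per-node rates above:

```shell
# Monthly cost = hourly rate x 730 hours (the month length AWS uses for estimates).
# Rates: m4.4xlarge EC2, Teradata software, and the combined total, per node.
for rate in 1.056 5.973 7.029; do
    awk -v r="$rate" 'BEGIN { printf "$%.3f/hr -> $%.2f/month\n", r, r * 730 }'
done
```

This reproduces the Teradata and total figures quoted above, and shows the EC2 rate works out at $770.88 per node per month.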

Teradata software discounts of between 23% and 89% are offered to those willing to pay annually in advance.

Teradata on AWS Setup

The ‘Teradata on AWS Getting Started’ guide is available as HTML or as a 98-page PDF.

Compared to the previous SMP-only offering, the setup of Teradata MPP systems on AWS is *way* more complex.

Understanding topics such as Identity & Access Management (IAM), Virtual Private Cloud (VPC) and CloudFormation is now of prime importance.

That’s all for today…watch this space for VLDB’s experience of setting up a multi-node Teradata MPP system on AWS, along with some benchmark figures.

Posted in AWS, cloud, Teradata | Tagged , , | Comments Off on Teradata MPP on AWS