Google, SQL & noSQL Ramblings

A couple of articles on Google and SQL versus noSQL caught my eye recently and inspired me to go on a bit of a ramble, so here goes…

Google’s Head Fake

Firstly, there’s the splendidly titled “Did Google Send the Big Data Industry on a 10 Year Head Fake?” Apparently a ‘head fake’ is to trick your opponent as to your intentions. Probably akin to ‘dropping the shoulder’ in football.

So, the question is, did the Big G send us all the wrong way for a decade? Short answer: no. Longer answer: if you did march the wrong way for several years, have a word with yourself.

The article quite rightly points out that neither distributed file systems nor distributed computing can be attributed to Google. As Curt Monash succinctly stated on DBMS2 in 2010, ‘popularization != invention‘.

However, the thinking goes that Google’s famous research papers gave rise to Doug Cutting’s efforts at Yahoo, which led to Hadoop, whilst all the while the Big G was working on plain old SQL-based Spanner.

The Big G may have been working on other SQL-based stuff whilst the Hadoop crowd were beavering away, but does this amount to a head fake or a shoulder drop? No, not in the slightest.

Many considered the old-fangled SQL-based data warehouse to be under immediate threat due to the rise of the ‘new’ MapReduce-based SQL-free computing paradigm. Shiny & new always wins, right?

Sadly, that can be the case. I’ve witnessed ‘big data’ POCs at first hand where the main motivation was to be seen to be ‘doing big data’. I kid you not. Forget the impact on the users and existing toolsets, stop worrying about ROI, this stuff is free  and cool, the vendor is saying all the right things, so it’s definitely the way to go. Woo-hoo!!!

Let’s never forget, there are armies of folks looking to monetise anything that moves in IT. The VCs, start-ups, analysts and conference organisers make their living out of getting mere mortals to believe they have the silver bullet that’s been missing until now.

What worked for Google, Yahoo, Facebook and LinkedIn doesn’t necessarily translate to the mainstream – this doesn’t make it bad tech. Hadoop is a great example. Out in the real world, SQL is the answer for most analytic applications for most folks most of the time, plain and simple.

If you believed the hype and ‘Hadoop’ didn’t turn out to be your silver bullet, don’t blame the Big G for sharing their thoughts or others for building/promoting ‘shiny & new’. No-one told you to put your finger in the fire, did they?

Why SQL Is Beating noSQL

The article entitled ‘Why SQL is beating NoSQL, and what this means for the future of data‘ really got my attention. The notion that SQL was ‘left for dead‘ and is ‘making a comeback‘ is somewhat at odds with reality, me thinks.

There may have been a belief in certain quarters that the ‘answer’ was no longer SQL, but this has never been reflected out in the real world where regular folks run queries, and lots of them.

SQL is very deeply embedded in every single organisation I’ve encountered since the late 1980’s. I don’t see this changing any time soon. There is simply too much sunk cost and too little benefit in trying to ditch SQL.

Software developers may have ‘cast aside SQL as a relic’ but that certainly isn’t what happened throughout corporate IT organisations or analyst communities.

It’s clearly untrue that SQL ‘couldn’t scale with these growing data volumes‘. We’ve had scale-out Teradata MPP systems chomping on SQL since the 1980’s, and newer MPP & SQL players like Netezza & Greenplum for over 15 years. A low industry profile doesn’t mean something doesn’t exist.

As the article states, the developers of SQL recognised that ‘much of the success of the computer industry depends on developing a class of users other than trained computer specialists.‘ which is partly why it has become the de facto standard for interacting with databases.

Apparently ‘…SQL was actually fine until another engineer showed up and invented the World Wide Web, in 1989.‘ I’ll help out here. His name is Sir Tim Berners-Lee OM KBE FRS FREng FRSA FBCS. He’s quite well known.

There is no doubt that mainstream general purpose databases (SQL Server, Oracle, mySQL etc) struggled to cope with data volumes thrown out by all things digital in the new-fangled age of ‘the web’ and ‘the net’.

Some of us were already scaling out (admittedly expensive) MPP systems running SQL databases as early as the 1980’s. Big name retailers, banks and telcos have been running SQL on scale-out database systems for decades.

My first multi-billion row Teradata table is over 20 years old and still alive and well at a retail bank. A 100 billion row telco CDR table is over a decade old and runs on a parallel PostgreSQL system. How much scale do you want?

The rationale behind attempts to ditch legacy SQL databases is neatly summarised: ‘NoSQL was new and shiny; it promised scale and power; it seemed like the fast path to engineering success. But then the problems started appearing.’

The likes of DeWitt & Stonebraker know a thing or two about this stuff and were early nay-sayers. Feel free to disagree, obviously, but dismiss their observations at your peril.

Most of the post-Teradata attempts to develop scale-out SQL-compliant databases have leveraged PostgreSQL. This approach dates back to Netezza over 15 years ago, and includes the mega-popular AWS Redshift.

The light-bulb moment and conversion to SQL is something Teradata went through over 30 years ago: ‘One day we realized that building our own query language made no sense. That the key was to embrace SQL. And that was one of the best design decisions we have made. Immediately a whole new world opened up.’

Robb Klopp covers Chuck McDevitt’s ‘SQL is the answer’ light-bulb moment whilst at Teradata in 1985 in Chuck’s obituary. RIP Chuck, without doubt owner of the biggest brain I ever met.

The King Is Dead. Long Live The King.

SQL didn’t die. It didn’t recede. SQL is far from perfect, but it isn’t likely to go away any time soon.

Kool-Aid drinkers took themselves on a detour into the land of file processing via developer-only languages all of their own accord, whilst all the time mainstream IT organisations and user communities carried on quite happily with boring old legacy SQL. The same SQL that runs the world.

The first time I ran an SQL query on Teradata in the late 1980’s I realised I was immediately free from the pain and suffering of developing Cobol for analytics. To this day I can clearly remember the relief (sorry Grace).

It seems a certain section of the IT community is going through a similar realisation: if you want to play with data, SQL was, is, and should be the default starting place.

The King is Dead. Long live the King.

Posted in Big Data, data warehouse, Google, noSQL, SQL

Azure IO Testing


One of the first things we do here at VLDB when we have a yearning to set up a database on a cloud platform is to develop an understanding of disk input-output (IO) performance.

For databases in particular, we are specifically interested in random read performance. A high random read throughput figure is necessary for an efficient and performant MPP database.

Microsoft’s Azure is the test platform in question. The approach taken is to commission a single CentOS7 VM, attach various disks and measure IOPS & throughput.

Azure VM Setup

A single CentOS 7 VM is configured. Disks are added in the required size, type & number in order to maximise IO throughput at the VM level:

Azure OS Console

The starting point on Azure is SSD disks, as the IO figures for SSD are generally superior to HDD.

Disk size determines both throughput and available IOPS, so this will also need to be considered in the overall build:

Azure Disk Console
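As an aside, the disk attachment can also be scripted rather than clicked through. The sketch below uses the Azure CLI; the resource group, VM and disk names are placeholders, and the flag names are from memory, so verify them against az vm disk attach --help for your CLI version:

# Attach a new 1TB-class Premium SSD data disk to an existing CentOS 7 VM.
# Resource group, VM name and disk name are made-up placeholders.
az vm disk attach \
  --resource-group vldb-rg \
  --vm-name centos7-iotest \
  --name datadisk1 \
  --new \
  --size-gb 1023 \
  --sku Premium_LRS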

 

Azure VM & Disk Choices

For the 1TB SSD selected above, IOPS are limited to 5,000/sec and throughput is limited to 200 MB/s.

As an aside, cloud providers usually fail to distinguish between read & write, sequential & random or specify IO size when they bandy about throughput figures. The devil is in the detail, dear reader(s).

In order to use premium storage it is necessary to specify the account as ‘Premium LRS’.

The range of VM types hosted in the UK is limited, so if you have a requirement for a UK-based system, make sure that what is on offer from Azure will satisfy the requirement:

Azure UK Console

Note that only certain VM types support SSD disks.

For the basic D14v2 with 16 CPU cores, 112GB RAM and 800GB of local disk, it is not possible to specify SSD disks. However, the Ds14V2 VM type will allow these ‘Premium’ disks.

Also note that the 800GB is for the operating system and swap space only – the storage is non-persistent on this device specifically.

The D14 also allows up to 8 NICs and has an ‘extremely high’ network bandwidth. This is useful if the requirement is to build an MPP cluster that needs high speed node:node data transfer.

The Ds14V2 VM is limited to 32 disks:

Azure ds14v2

The Standard_DS14_v2 VM is limited to 64K IOPS or 512MB/s in cached mode (host cache = read or read/write) or 51K IOPS (768MB/s) in uncached mode (host cache = none).

The P30 1TB SSD disks used are rated at 5K IOPS (200MB/s) max per disk:

Azure P30

Azure IOPS Testing

Booting CentOS7 and upgrading the kernel takes a matter of minutes.

FIO is then downloaded using git and installed to perform the IOPS tests:

git clone --branch fio-2.1.8 http://git.kernel.dk/fio.git /fio

To run a FIO throughput test, simply enter the following from the fio directory:

./fio --rw=randread --name=<directory>/fio --size=<size>g --bs=32k --direct=1 --iodepth=16 --ioengine=libaio --time_based --runtime=<elapse> --group_reporting

  • <directory> – the directory (disk) where the tests are executed.
  • <size> – size of the test file created in GB.
  • <elapse> – elapse time for the test.
  • bs=32k – blocksize (default 4k). 32Kb is a good size to test for database suitability.
  • direct=1 – non-buffered IO.
  • iodepth=16 – concurrent IO threads.
  • ioengine=libaio – native Linux asynchronous IO.
  • time_based – test elapse time.
  • group_reporting – report on group total not individual jobs.

For example:

./fio --rw=randread --name=/tests/disk1/fio --size=1g --bs=32k --direct=1 --iodepth=16 --ioengine=libaio --time_based --runtime=30 --group_reporting
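To test several attached disks in one go, a simple shell loop does the job. The mount points below (/tests/disk1 to /tests/disk6, one per attached P30 disk) are assumptions; each run writes its output to a separate log:

# 30 second, 32KB random read test against each mounted disk in turn.
for d in /tests/disk1 /tests/disk2 /tests/disk3 /tests/disk4 /tests/disk5 /tests/disk6
do
  ./fio --rw=randread --name=${d}/fio --size=1g --bs=32k --direct=1 \
        --iodepth=16 --ioengine=libaio --time_based --runtime=30 \
        --group_reporting > fio_$(basename ${d}).log
done

Running the same tests against all disks concurrently (append & to the fio command and add a wait after the loop) is what exposes the VM-level throughput cap discussed below.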

Azure Disk Read Results

Multiple random & sequential read tests were executed against a Standard_DS14_v2 VM with between one and six P30 1TB SSD disks attached. Blocksize is 32KB in all cases.

As stated earlier, we are primarily, but not exclusively, interested in read and not write performance when assessing a cloud platform. This is due to the fact that 80-90% of all data warehouse IO activity is read, and only 10-20% is write. If the read performance isn’t good enough we don’t really care about write performance as the platform is probably unsuitable.

Sequential and random read results using fio on Azure are as follows:

Azure IOPS Results

Azure Read Performance Observations

Random and sequential read give similar performance. This is both unusual and most welcome for those of us that rely on random read performance for data warehouse systems!

At the VM level, total disk throughput does not increase as more disks are added to the VM. Disk throughput is limited by the VM type and/or user account type. Adding more disks to the VM is not a way to deliver more IOPS.

Approximately 540-570MB/s random read throughput with a 32KB blocksize was achieved fairly consistently for between 1-6 P30 1TB SSDs attached to a Standard_DS14_v2 VM.

This is slightly higher than the stated 512 MB/s limit for a Standard_DS14_v2 VM, and represents a very good random read throughput figure for a single disk.

However, it is disappointing that this figure can’t be increased nearer to 1,000MB/s (which is attainable on AWS) through the use of more than 1 disk per VM.

Posted in Azure, cloud, IOPS

MPP & Redshift Musings

What On Earth is MPP?

In computing, massively parallel refers to the use of a large number of processors (or separate computers) to perform a set of coordinated computations in parallel (simultaneously). Source: Wikipedia.

Teradata & MPP Beginnings

Once upon a time, in a land far, far away…well, OK, California in the late 1970’s/early 1980’s to be precise…the MPP database world started to stir in earnest.

Following on from research at Caltech and discussions with Citibank’s technology group, Teradata was incorporated in a garage in Brentwood, CA in 1979. Teradata’s eponymous flagship product was, and still is, a massively parallel processing (MPP) database system.

Teradata had to write their own parallel DBMS & operating system. They also used weedy 32bit x86 chips to compete with IBM’s ‘Big Iron’ mainframes to perform database processing. Quite an achievement, to say the least.

The first Teradata beta system was shipped to Wells Fargo in 1983, with an IPO following in 1987.

It is interesting to note, for those of us in the UK, that there were Teradata systems up and running over here as early as 1986/87, partly down to the efforts of folks such as our very own ‘Jim the phone’.

Early Teradata adopters in the UK included BT (who later moved off Teradata when NCR/Teradata was acquired by AT&T), Royal Insurance, Great Universal Stores (GUS) and Littlewoods. GUS and Littlewoods combined to become ShopDirect, who still run Teradata 30 years later.

Yours truly ran his first Teradata query at Royal Insurance in Liverpool from ITEQ on an IBM 3090 back in ’89. It killed the split screen in MVS, but ho-hum, better than writing Cobol against IMS DB/DC to answer basic questions. We were taught SQL by none other than Brian Marshall who went on to write the first Teradata SQL & performance books. I still have the original dog-eared Teradata reference cards, or ‘cheat sheets’ as we called them:


Teradata ITEQ/BTEQ & Utilities Reference Cards, 1988 vintage.

From the 1980’s through to the early 2000’s Teradata had a pretty clear run at the high-end analytic DBMS market. There was no serious competition, no matter how hard the others tried. All those big name banks, telecoms companies and retailers couldn’t be wrong, surely?

MPP Upstarts – Netezza, Datallegro & Greenplum

Teradata’s first real competition in the commercial MPP database space came in the form of Netezza in the early 2000’s.

Like Teradata, Netezza consisted of dedicated hardware & software all designed to work in harmony as an engineered MPP database ‘appliance’. Unlike Teradata, Netezza was able to take advantage of open source DBMS software in the form of PostgreSQL, and open source OS software in the form of Linux.

We discovered Netezza by accident in 2002/03 after landing on a PDF on their web site following a Google search. “Netezza is Teradata V1” was our initial response. Apart from the FPGAs, we were pretty close.

A few phone calls, a trip to Boston and a training session later, and we’re up and running as Netezza partners.

Following a successful Netezza project at the fixed line telecoms division of John Caudwell’s Phones4U empire, yours truly was a guest speaker at the inaugural Netezza user conference in 2005.

Following an IPO in 2007, Netezza was bought by IBM in 2010 where it remains to this day, somewhere in IBM’s kitbag. Poor old Netezza.

Also in the early 2000’s, a bunch of mainly Brits at Datallegro were trying to build an MPP appliance out of Ingres. They were bought by Microsoft in 2008. Why Microsoft needed to buy a startup running Ingres on Linux is anyone’s guess.

The last of the new MPP players we had dealings with all those years ago are none other than Greenplum.

Through the Netezza partner ecosystem we actually knew the Greenplum team when they were still called ‘Metapa’, probably around 2002/03. Yours truly was at a dinner with Luke & Scott in Seattle as they wooed none other than Teradata luminary Charles ‘Chuck’ McDevitt (RIP) to head up the Greenplum architecture team. With Chuck on board they were virtually guaranteed to succeed.

Greenplum was always of particular interest to us because, unlike Teradata & Netezza, it was and still is a software only offering. You get to choose your own platform, which can be your favourite servers/storage or even that new fangled ‘cloud’ thing.

Not only have we got Greenplum working happily on every single platform we ever tried, it is also the only open source MPP database.

Other Parallel PostgreSQL Players

In addition to early parallel PostgreSQL players such as Netezza and Greenplum (Datallegro used Ingres, not PostgreSQL), a whole raft of ‘me too’ players cropped up in 2005:

  • Vertica – acquired by HP in 2011
  • Aster – acquired by Teradata in 2011
  • Paraccel – acquired by Actian in 2013

It is interesting to note that, unlike Netezza, not one of the new parallel PostgreSQL players put even the smallest dent in Teradata’s core MPP data warehouse market. At least Netezza shook Teradata from their slumber which, in turn, gave the Teradata appliance to the world.

That said, perhaps Paraccel will have the biggest impact on the MPP database market which, for so long, has been dominated by Teradata, the main player in the space for 30 years. How so? Read on…

Redshift

Amazon launched the Redshift ‘Data Warehouse Solution’ on AWS in early 2013.

Yours truly even wrote some initial thoughts at the time, as you do.

As has been well documented, Redshift is the AWS implementation of Paraccel, which was one of the crop of ‘me too’ parallel PostgreSQL players that appeared in 2005.

Redshift adoption on AWS has been rapid, and reported to be the fastest growing service in AWS.

So, despite Teradata having been around for 30 years, the awareness of MPP and adoption of MPP as a scalable database architecture has been ignited by Redshift, which started out as Paraccel, which is built out of PostgreSQL.

Redshift Observations

Our previous post on Redshift in 2013 made mention of the single column only distribution key, and the dependence on the leader node for final aggregation processing.

There is a workaround for the single column only distribution key restriction, for sure. MPP folks will no doubt be mystified as to why such a restriction exists. However, it neatly removes the possibility of a 14 column distribution key (primary index) that we encountered at a Teradata gig a few years ago (no names). Cloud, meet silver lining.
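For completeness, the workaround is simply to derive a single surrogate distribution column from the natural multi-column key. The table and column names below are made up for illustration; this is a sketch of the idea, not a recommendation:

-- Distribute on a hash of a two-part natural key by materialising it as one column.
create table txn (
  account_id bigint       not null,
  txn_date   date         not null,
  amount     decimal(18,2),
  dist_id    varchar(32)  not null   -- md5 of the concatenated natural key
)
distkey (dist_id)
sortkey (txn_date);

insert into txn (account_id, txn_date, amount, dist_id)
select account_id,
       txn_date,
       amount,
       md5(cast(account_id as varchar) || '|' || cast(txn_date as varchar))
from   staging_txn;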

The dependence on a leader node for final aggregation processing is much more of an issue. Aggregation queries that return a high number of groups will simply choke on the final aggregation step. Just ask Exadata users. There’s also not much you can do to remediate this issue if you can’t over-provision CPU/RAM at the leader node, which is the case with Redshift.

More recent Redshift observations from our own experiences, and from trusted contacts (you know who you are) include the following:

  • lack of node OS access – security auditing, hardening & OS customisation not possible.
  • non-persistent storage – if you stop or lose the EC2 cluster you lose the database data & will need to re-load from S3 or a database snapshot (a COPY sketch follows this list).
  • poor concurrency – default is 5 concurrent operations. Although this can be increased, performance drops off quickly as concurrency increases.
  • poor workload management – no short query ‘fast path’. Redshift is essentially a ‘free for all’.
  • 1MB block size & columnar only storage – a very large table space overhead of at least 1MB per column, per segment, for every table.
  • limited tuning options – no partitions or indexes. If the sort keys don’t help it’s a full table scan every time.
  • automatic database software updates – this might appeal to the ‘zero admin is good’ crowd, but enterprise customers will baulk at the notion of a zero-choice upgrade that could break existing applications.
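On the non-persistent storage point above, at least the re-load from S3 is straightforward. A hedged sketch follows; the table, bucket path and IAM role ARN are placeholders:

-- Reload a table from pipe-delimited, gzipped files previously unloaded to S3.
copy sales
from 's3://my-bucket/unload/sales/'
iam_role 'arn:aws:iam::123456789012:role/RedshiftLoadRole'
delimiter '|'
gzip;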

Lack of OS access and non-persistent storage will no doubt be show-stoppers for some enterprise folks.

The SQL Server crowd will probably not care and just marvel at the performance offered by SQL queries that run against all CPU cores in the cluster. Unless, of course, they didn’t read the data distribution notes.

Meet The New Boss, Same as The Old Boss

No matter that Redshift could be improved, what we can applaud is that Amazon has opened many folks’ eyes to the benefits of a scalable ‘old-skool’ relational database management system (RDBMS), and one with an MPP architecture to boot. For this we can only be thankful.

The rate of Redshift uptake speaks to the usefulness of a scalable RDBMS. The architecture is being socialised by Amazon (high volume/low margin) in a way that never was by Teradata (low volume/high margin).

All of the MPP database vendors, Amazon included, owe a debt of gratitude to Teradata who proved the architecture 30 years ago. That Teradata got so much so right so long ago never ceases to amaze yours truly.

For Redshift, the old adage remains true: there really is ‘nowt new under the sun’.

Posted in MPP, Redshift, Teradata

Teradata AWS Access How To Guide


We previously explained here at VLDB how to set up a Teradata MPP (multi-node) system on Amazon Web Services (AWS). Well, now we’re back to offer a simple step-by-step guide on how to set up basic SSH connections, install Teradata’s Tools & Utilities (TTU) and set up a desktop SQL client for a Teradata instance running on AWS.

 

SSH & Key Pairs Identity

When you first set up your server on AWS, you are required to create a key pair identity. With this, you can start accessing your server straight away. There are multiple ways to connect, though one of the more common is via the SSH tool.

– Mac

If you are connecting via a terminal, then the command will be similar to the usual connection string:

ssh -i KEY_PAIRS_FILE ec2-user@AWS_SERVER_IP

However, you may first need to alter the permissions of the key pairs file – if it is readable by any user, the connection will reject it. This can be done as follows:

chmod 400 KEY_PAIRS_FILE

Note that the ec2-user is the default user for most Linux systems. Once logged on as ec2-user, you can gain access to root with the command sudo su -.

If you find yourself logging on and off a lot, it may be convenient to set your key pairs file path as an environment variable for ease of use.
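For example, something along these lines in your shell profile (the key path and IP address are placeholders) saves a fair amount of typing:

# Set once in ~/.bash_profile (or similar), then reuse for every connection.
export AWS_KEY=~/keys/my-key-pair.pem
ssh -i "$AWS_KEY" ec2-user@203.0.113.10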

– Windows

If you are connecting via PuTTY on Windows, then you are required to first modify the key pairs file using PuTTY-Gen. Typically this piece of software is downloaded alongside PuTTY, but you may need to download it separately.

Once you’ve loaded up the PuTTY-Gen software, click Load on the right and then select your key pairs file. In the bottom right you may need to change the file type that is visible (default is .ppk and the key pairs file is a .pem).

Once loaded, it should pop up with a message saying Successfully Imported… Once you have this, click Save Private Key in the bottom right and save the file. This will now be a usable key pairs file for PuTTY’s use.

Now, open up PuTTY.

On the left side in the list, expand Connection, and then SSH. Now click on Auth and you should see a view on the right titled Options controlling SSH authentication.

From here, click on Browse in the bottom right and then select the key pairs file you just created.

Now go back to the main page by clicking on Session at the top of the list on the left. Here, put in your server IP address as normal and hit Open.

Note that it may be convenient to save the session details before opening the connection, for ease of use at a later time.

Password Authenticated Login –

If you require password authentication to be enabled, then it will be a simple edit of a system config file and a restart of the SSH service on the AWS server.

First, edit the file /etc/ssh/sshd_config, and then find and change the following line:

PasswordAuthentication no

to

PasswordAuthentication yes

Note that you should also remove the hash (#) symbol from in front of the parameter if there is one. Once the file is edited as above, you will then need to restart the SSH service as follows to commit the change.

service sshd restart

Once the service is restarted, password authentication should be updated and enabled on your system.
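If you prefer to script the change, something like the following should do the same job – back up the config first, and run as root:

# Flip PasswordAuthentication to 'yes' (commented out or not) and restart sshd.
cp /etc/ssh/sshd_config /etc/ssh/sshd_config.bak
sed -i 's/^#\?PasswordAuthentication .*/PasswordAuthentication yes/' /etc/ssh/sshd_config
service sshd restart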

Teradata’s Tools and Utilities (TTU)

It is very useful to understand that all Teradata servers on AWS come with the equivalent TTU version packaged for Linux, Mac and Windows. With your key pairs file to hand, you can download the packages using SCP, much as you did for SSH access.

But first you will need to make the file(s) available to the required user for download (ec2-user in our example), as by default only root can access the files.

First, log on to the server as ec2-user using your SSH access. Change to the root user and then run [something similar to] the following commands:

cp /var/opt/teradata/TTU_pkgs/Teradata*MACOSX* /home/ec2-user/

chown ec2-user /home/ec2-user/Teradata*MACOSX*

Note that you will need to replace the name MACOSX with Linux or Windows if you require those instead.

 –  Mac

Once done, use the following command to retrieve the file for your Mac via terminal:

scp -i KEY_PAIRS_FILE ec2-user@SERVER_IP:/home/ec2-user/Teradata*MACOSX* .

Windows –

The simplest way to download the package on Windows is to use the WinSCP software. For this you will again require the key pairs (.ppk) file that you created using PuTTY-Gen.

First, open up WinSCP.

From here, click on Advanced on the right which will pop up a new window. In the list on the left, click on Authentication. Similarly to PuTTY, click on the […] icon under Private key file and select the key pairs .ppk file, and then click OK.

Now you can input your IP address and user name as per usual, and then Login. Once logged in, you can freely move around your local and server directories to download the files as required.

Note that it may be convenient, similarly to PuTTY, to save the session details before opening the connection for ease of use at a later time.

Installation –

Once you have downloaded the required packages, it is a simple matter of unpackaging the files and running the setup script contained within to install the TTU software.

For example, on a Mac terminal locate the TTU package and then run the following commands:

gunzip Teradata*MACOSX*

tar -xvf Teradata*MACOSX*

./setup.bat a

Note that using tar as above will result in all files being placed in the directory you are located in. Additionally, using the a parameter for the setup script will install all available TTU software.

Once installation is complete, you should be able to use BTEQ from your terminal to connect to your Teradata instance (however you may need to re-load your terminal first).
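As a quick smoke test once TTU is installed, a BTEQ session similar to the following should return the database version – the IP address and dbc password are placeholders:

bteq <<EOF
.LOGON 203.0.113.10/dbc,MyDbcPassword
select * from dbc.dbcinfo;
.LOGOFF
.EXIT
EOF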

Teradata Studio Express

Whilst there is a wide variety of desktop SQL clients available that are compatible with Teradata, and you may very well have your preference out of them all, this is a quick setup guide and overview of Teradata Studio Express.

Teradata Studio Express is openly available for download on Windows, Mac and Linux platforms from the following website address:

http://downloads.teradata.com/download/tools/teradata-studio-express

Note that for this software to install successfully, the Java Development Kit (JDK) must be installed.

Once downloaded, simply unzip the package and run the installer. This will take you through a basic on-screen installation process that typically takes less than a minute. Once finished, the software is ready to use.

When first launching Teradata Studio Express you will be greeted with a Quick Tour guide for the studio. Quickly run through the guide to help you understand the basics (if required) and then continue to the main view.

From the main view, simply right-click Database Connections at the top of the far left panel and click New…. In this window, click Teradata, name the connection as you wish at the bottom under Name: and then hit Next.

In this next view you will see various parameters to fill in, and some pre-defined as well. Unless required, we would recommend leaving the pre-set parameters as default, only altering the following:

  • Database Server Name – The IP or server name to connect to.
  • User Name – The name of the user that you wish to connect to Teradata with.
  • Password – The password for the respective user chosen above.
  • Database – Defaults to user’s default if not specified. All database hierarchy existing under the specified (or default) database will be available for viewing once the connection is made.

Once the details required have been completed, you may click Next to get an overview of the details you’ve specified, or you may directly Finish the connection setup. At this point, you should have a successful connection in the far left panel for you to view. Any changes required to an existing connection can be made by simply right-clicking the connection name and clicking Properties.

With the connection now open, you may easily run queries against the database via the SQL Editor in the top panel on the right. Our basic example shows the following query being run:

select * from dbc.dbcinfo;

The query may be executed by simply clicking on the green triangle in the top right (or by using the shortcut CTRL + ALT + X on a Windows environment, or CMD + ALT + X on a Mac).

Results may then be viewed in the middle panel titled Teradata Result Set Viewer, with history available for all previously executed queries in the bottom panel, suitably titled Teradata SQL History.

Whilst this should cover the basics required to use Teradata Studio Express, there are of course many useful features that can be utilised. One such simple feature is the Reexecute Statement command: by right-clicking on any single row in the Teradata SQL History, you may select Reexecute Statement, which, as the term suggests, re-executes the selected query statement against the database and returns results as per usual in the Teradata Result Set Viewer.

 

And there you have it! A simple step-by-step run through on the basic setup of connections, utilities and client tools that may be required by yourselves for any Teradata instance running on AWS.

 

 

 

 

 

Posted in AWS, SQL, Teradata

Is Hadoop Failing?

‘Hadoop Is Failing’ Article

A recent LinkedIn post linking to an Innovation Enterprise article entitled ‘Hadoop Is Failing’ certainly got our attention, as you might expect.

Apart from disagreeing with the assertion that ‘Hadoop…is very much the foundation on which data today is built’, the main thrust of the article is well founded.

Consider the following snippets:

  • ‘adoption appears to have more or less stagnated’
  • ‘trying to figure out how to use it for value’
  • ‘one man’s big data is another man’s small data’
  • ‘hundreds of enterprises were hugely disappointed by their useless 2 to 10TB Hadoop clusters’

And the grand finale:

Most companies do not currently have enough data to warrant a Hadoop rollout, but did so anyway because they felt they needed to keep up with the Joneses..they soon realize that their data works better in other technologies.

This can be summarised as: drank the Kool-Aid, convinced each other Hadoop was a silver bullet, rocked up a POC stack with an all-too-willing vendor, spent a lot of time/effort getting nowhere, quietly went back to existing data management technologies (almost certainly the RDBMS based data warehouse).

Mention is also made of the Hortonworks share price performance, which is down by about 66% since the December 2014 IPO. The NASDAQ is up 19% over the same period.

VLDB’s ‘Big Data’ Experiences

We’ve been exposed to the recent ‘Big Data’ thang from both a client and vendor perspective.

From the client side the main motivators appear to be twofold: reduce data warehouse costs and ‘fear of missing out‘ (FOMO), or keeping up with the Joneses. In both cases, the amount of Kool-Aid supplied by VCs, vendors, analysts, tech journos and geek fanbois, and willingly gulped down by those who should know better, is a tad disappointing.

The architect who glibly told us his company would ‘obviously’ have to re-train thousands of analysts away from SQL and into the realms of Pig and Hive springs to mind. This despite having 25 years of skills invested in the existing EDW.

From the vendor perspective there has been a lot of ‘over promising’ that has failed to materialise. ‘Nothing new there’, the cynics might suggest.

The challenge during one POC was to run an existing batch suite of SQL against a relatively simple schema lifted from the EDW as part of an EDW-offloading test.

The POC was to be attempted using a Hadoop stack running Hive, and was categorised by the vendor as ‘just a SQL race’. Given the stature of the vendor we reasonably assumed their confidence was well placed. How wrong we were – the SQL failed a few steps in. Very many tickets later, still no joy. EDW offloading? Err, no. EDW replacement? LMAO!

The point is, our real-world experiences bear out the main thrust of the article: the rush to get involved with Hadoop has *largely* been done without any real understanding of it and the failure to get it working, for whatever reason, has led to a lot of disillusionment and/or stagnation.

VLDB’s Guidance

First of all, be honest with yourself: you’re probably not Google, Facebook or Yahoo.

These companies conceived, designed & built the Hadoop ecosystem out of necessity. They have <understatement>slightly</understatement> different challenges to yours.

They also have a vastly different set of in-house skills. You’re probably a bank, a retailer, a manufacturer, a telecoms company or the like. You simply don’t possess the engineering capability that the ‘Hadoop crowd’ take for granted. Your salary scales also don’t allow the required talent to be hired. That’s before we take your location, stock options & ‘coolness’ into account!

When considering *any* new technology, it should always be apparent what extra capability is on offer, and what specific business benefit that capability delivers. Implicitly, this should be something we can’t already achieve with the tools at our disposal.

On a personal note, the adage that “there’s nowt new under the sun” is my default starting position. Greater minds than mine think there is a lack of novelty with MapReduce. We’ve had parallel data processing for decades. Bah humbug, etc.

Before any technology POC (Hadoop included) gets off the ground, the business case in the form of potential ROI should be established, as we all know. FOMO and ‘keeping up with the Joneses’ does not a business case make, no matter how much Kool-Aid is on offer.

During the ROI assessment, it’s all too easy to tot up the potential project benefits without being honest about the costs. Hadoop stacks aren’t free, no matter how much IT bang on about open source software, commodity servers/storage & vendor consulting support. Look at the amount of in-house resource required and the costs add up very quickly.

As several of our clients have rightly concluded, in most instances, Hadoop is a classic case of a solution looking for a problem – a problem that simply doesn’t exist for most organisations.

US tech companies assume the rest of the world operates at the same scale as the US. We simply don’t. Most organisations in the UK and Europe that we’ve encountered over the years operate at the 1-20TB scale. Over 100TB is rare. The case for a data processing stack running hundreds or even thousands of nodes simply doesn’t exist.

To underpin wider adoption, the recent focus has been to polish ‘SQL on HDFS’. This leads neatly back to where we started: the SQL compliant parallel data warehouse i.e. a database!

Love Your Data Warehouse

Prove us wrong by all means, but it is our belief that 99% of organisations can deliver 99% of their data analytics requirements without resort to a ‘Hadoop stack’. Go back and give the data warehouse you take for granted some love 🙂

We’ve yet to hear of a *single* case where an existing RDBMS data warehouse has been ditched for a Hadoop stack. Not one.

So, is Hadoop failing? No, not when used by the few organisations that have the right problem & skills.

Are Hadoop projects failing for lots of other organisations? Yes, mostly.

Posted in Big Data, data warehouse, Hadoop

Xmas Wish List

All We Want For Christmas

It’s that time of year…yes Christmas (Xmas for short), most definitely not <yuck>’holidays'</yuck>.

There’s far too much chocolate in the VLDB office as you might expect. Geeks & chocolate are a winning combo, so it won’t last long.

Moving on…office Xmas decorations – check. Office Xmas party – check. Silly jumpers – check. Secret Santa – check. Loud Xmas music – check.

As we draw ever nearer to *finally* going home for Xmas, our thoughts turn to what VLDB’s Xmas list to Santa might look like…so, here goes…

Dear Santa, can you please make sure clients understand at least some of the following:

  1. Data warehouse systems aren’t a side-project you can pay for with left over funding from another project. Real funding, sponsorship, requirements & commitment are required. Subject Matter Experts (SMEs) or Business Analysts (BAs) will need to provide guidance to get an analytic solution delivered. Technology & designers/developers on their own won’t get very far.
  2. High praise from the likes of Gartner doesn’t mean a particular technology is a good fit for your organisation. Figure out your needs/wants/desires afore ye go looking to buy shiny new tech. Thou shalt not believe all thou hears at conferences. It’s the job of tech companies, VCs, analysts & conference organisers to whip up excitement (see kool-aid). They’re not on the hook for delivery.
  3. Accurate estimates for design/build/test are only possible if analysis is carried out. Either you do it, pay us to do it, or accept estimates with wide tolerances.
  4. Quality Assurance (QA) is not the same as unit testing. It’s a real thing that folks do. Lots of them. No really!
  5. CSV files with headers and trailers are a perfectly acceptable way to build data interfaces. Lots of very large organisations are guilty of clinging on to this ‘unsexy’ approach. It ‘just works’.
  6. You’ll probably need a scheduler to run stuff. cron is not a scheduler. Nor is the DBA.
  7. If you could have ‘built that ourselves in SQL Server in 5 days’ you would have already done so.
  8. Don’t focus on our rate card. Focus on the project ROI. Oh, wait, you haven’t even thought about ROI. Doh!
  9. Yes, we can get deltas out of your upstream applications without crashing the system. It’s what we do. We’ll even prove it.
  10. If you want us to work on site we’ll need desks to sit at, preferably next to each other. We’re picky like that 😉

Have a great Xmas & New Year Santa,

Love from all at VLDB
xxx

Have a great Xmas & New Year, and here’s to 2017.

Posted in consulting, data warehouse

Teradata Setup on AWS

Teradata AWS Setup

As previously mentioned, Teradata MPP (multi-node) setup on AWS is a tad more involved than the previous SMP (single node) setup. Well, as we’re kind-hearted folks here at VLDB, we decided to show you how it’s done. Well, how we did it anyway.

Let’s see if, armed with an Amazon account and a credit card, you can have Teradata up and running on AWS in under an hour, as claimed by Stephen Brobst at the recent Teradata Partners conference.

Tick-tock, here goes…

Teradata AWS Marketplace

First of all, log in to the AWS Management Console and select AWS Marketplace (under ‘Additional Resources’ on the right), then search for ‘Teradata Database Enterprise Edition’ to go to the Teradata Database Enterprise Edition (1-32 nodes) page. From here you will have to subscribe to Teradata if you have not already done so.

teradata-aws-marketplace

 

Select the required ‘Region’ – in this case ‘EU(Ireland)’ – then click the ‘Continue’ button on the ‘AWS Marketplace Teradata Database Enterprise Edition (1-32 nodes)’ page to go to the ‘Launch on EC2’ page.

Launch Teradata on EC2

teradata-launch-ec2

Set the following as required:

  • Version – currently only ‘15.10.02.09-1’ available.
  • Region – ‘EU (Ireland)’ is the closest region for the UK.
  • Deployment – local or EBS storage and VPC options.

Click ‘Launch with CloudFormation Console’ to go to the ‘Create Stack/Select Template’ page.

AWS Cloud Formation Stack Template

teradata-aws-template

Ensure you are in the correct AWS region (top right corner next to ‘Support’ drop-down) then click on ‘Next’ to go to the ‘Create Stack/Specify Details’ page.

AWS Cloud Formation Stack Details

Set the following & leave other values as default:

  • Stack name – the name of the stack that will be created in CloudFormation.
  • System Name – the name of the Teradata instance.
  • DBC Password – the password for Teradata DBC user.
  • Number of Nodes – the number of EC2 nodes required (1-32).
  • Instance & Storage Type – select the EC2 instance type and storage type/size required.
  • Availability Zone – choose availability zone from list.
  • Remote Access From – specify CIDR block (IP range) from which SSH/DBMS access is allowed. Use 0.0.0.0/0 if required (any IP allowed) to test the setup process.
  • AWS Key Pair – a pre-existing key pair must be specified.

See http://www.vldbsolutions.com/blog/teradata-on-aws-configurations-pricing/ for help on choosing EC2 instance & storage types. Apart from the number of nodes, this is the biggest driver of the cost of your Teradata on AWS stack.

Click on ‘Next’ to go to the ‘Create Stack/Options’ page.

AWS Cloud Formation Stack Options

teradata-aws-options

Nothing needs to be set/changed in this page…unless you think otherwise.

Click ‘Create’ to proceed with the stack/instance creation which is monitored via CloudFormation.

AWS Cloud Formation Stack Review

teradata-aws-review

It generally takes between 20 and 30 minutes to provision a single m4.4xlarge EC2 instance with 5TB EBS storage. The process is the same irrespective of the number of nodes.

There are lots of steps to go through as part of the provisioning. Once complete, the status will change to ‘CREATE_COMPLETE’:

teradata-aws-create-complete

Teradata on AWS Up and Running

Once the stack is up and running the next stage is to connect via SSH, Teradata Tools & Utilities (TTU) and a desktop SQL client. This is quite a big topic in itself and will be covered in a separate blog post.

So, to get back to Mr Brobst, we think it is possible to be up and running with Teradata on AWS in under an hour, but only if the node count is low, and only if you follow a guide that somebody has prepared…such as this one.

Enjoy!

Posted in AWS, cloud, Teradata

Teradata MPP on AWS

Multi-node Teradata on AWS

Teradata’s eponymous database became available on AWS in Q116 as a single node (SMP) only offering, as detailed in a previous VLDB blog post:

Teradata on AWS – Configurations & Pricing

Of far more interest to most folks was the availability of multi-node (MPP) Teradata systems on AWS. The ‘full fat’ version of Teradata was slated for Q316 and was duly made available during Teradata Partners 2016 via the AWS Marketplace.

The Teradata MPP version is now available on AWS to support Teradata clusters of between 1 and 32 SMP nodes.

Teradata Enterprise Edition on AWS Features

The previous SMP-only Base and Base+ editions have been consolidated into a single ‘Teradata Enterprise Edition’ with the following features:

  • Teradata DBMS v15.10 on SLES 11 including columnar, temporal & row level security
  • Teradata Active System Management (TASM)
  • Teradata Data Stream Utility
  • Teradata Parallel Transporter (TPT)
  • Teradata QueryGrid
  • Teradata Studio
  • Teradata Tools and Utilities (TTU)
  • Teradata REST Services
  • Teradata Server Management
  • Teradata Viewpoint (excluding Teradata Data Lab)

Additional cost items include:

  • Teradata Data Lab
  • Teradata Data Mover
  • Teradata Ecosystem Manager

Support is included in the Teradata software subscription and consists of both 24×7 Premier Cloud Support and access to Teradata At Your Service (TAYS).

A feature of running Teradata on AWS likely to be of interest to a lot of folks is the ability to backup Teradata data to S3.

Teradata MPP on AWS Pricing

At the time of launch, pricing for the m4 EC2 instances with persistent EBS storage in the Ireland region is as follows:

m4.4xlarge (16 vCPU, 64 GB RAM, 20 AMPs)

EC2 = $1.056/hr or $778.88/month
Teradata = $5.973/hr or $4,360.29/month
Total = $7.029/hr or $5,131.17/month

m4.10xlarge (40 vCPU, 160 GB RAM, 20 AMPs)

EC2 = $2.641/hr or $1,927.93/month
Teradata = $16.417/hr or $11,984.41/month
Total = $19.058/hr or $13,912.34/month

Note that these prices are per node.

Teradata software discounts of between 23% and 89% are offered for those willing to pay annually in advance.

Teradata on AWS Setup

The ‘Teradata on AWS Getting Started’ guide is available as HTML  or a 98 page PDF.

<warning>
Compared to the previous SMP-only offering, the setup of Teradata MPP systems on AWS is *way* more complex.

Understanding topics such as Identity & Access Management (IAM), Virtual Private Cloud (VPC) and CloudFormation is now of prime importance.
</warning>

That’s all for today…watch this space for VLDB’s experience of setting up a multi-node Teradata MPP system on AWS, along with some benchmark figures.

Posted in AWS, cloud, Teradata

Using Joins Across Multiple Tables

Joins

When dealing with large data sets, it’s important to ensure that the data can be accessed correctly. Failure to address this issue early on in database development can lead to problems when later attempting to extract information from the data itself. This was highlighted recently during a count for London’s Mayoral election on 5 May, when staff at the Electoral Commission had to ‘manually query a bug-stricken database’, which delayed the result by ‘several hours’.

These kinds of problems aren’t helpful in terms of furthering the computerisation of such tasks, and in particular for data companies. At present, electronic voting machines are not used by UK voters; however, counting software was used for the London Mayoral and Assembly elections. In the article, electoral law expert Prof. Bob Watt of Buckingham University expressed concerns about such tasks being undertaken digitally: ‘The bigger issue is of course just how stable these machines are and that’s something that I have a great deal of worry about and have had for some time’.

As you can see, it’s important that companies dealing with data get it right. The BBC article on this story doesn’t go into any specific details on why the data was not accessible, so it’s difficult to offer any kind of assessment – however, when dealing with such large sets of data involving tens of thousands of rows, it’s important that the right people have access to the information when they need it.

 

When storing data in a relational database management system (RDBMS), data is contained in multiple tables which are related to each other through one or more common factors. Using the example above, information relating to the voter may be held in a table named ‘voter’, with columns such as id, name and email_address. This may then be linked to another table containing contact information for the voter, such as their address, telephone and other related information. We may then have a table relating to the candidates of the election, and information relating to them which is necessary for The Electoral Commission to efficiently track the vote. In order to access the information across two tables, we would use a join.

 

A join is used to examine information not just across two tables, but across multiple tables (depending on system capabilities), and is an integral part of querying data. Each table in a database should contain a primary key, which ensures that each row is unique and prevents duplication of redundant data. In addition, a table can contain a foreign key which references another table. For example, by connecting a table containing voter information using voter_id and candidate_id, we could find out which individuals voted for a specific candidate. In this instance, candidate_id would be the foreign key in our voter table, as it references a different table containing information on the candidate. To perform a join in SQL, we write a statement which specifies the columns we require, the tables we want to join, and how we want to join them together to provide the results we need.

 

To fully understand how joins work, we need to create two tables and some data within them. But first, we will briefly go over the different types of join. The default join is an inner join; this statement is used to return results where the data in each table matches. The next type of join we will look at is a left join, which is used to return all the data from the left table and only the matching data in the right table. The next type is a right join; this is the opposite of a left join, returning all of the data from the right table and only the matching data in the left table. The final join is a full outer join, which returns all the data in both tables. Let’s have a look at each join in more detail. Firstly, we will examine the data contained in our tables.

Table examples:

voter

Join Table VLDB Solutions

candidate

Join Table VLDB
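The voter and candidate tables above are shown as images, so here is a minimal sketch of the schema plus a few illustrative rows (the names and values are made up, not the original demo data) to make the join results below easier to follow:

create table candidate (
  candidate_id integer primary key,
  forename     varchar(50)
);

create table voter (
  voter_id     integer primary key,
  forename     varchar(50),
  candidate_id integer references candidate (candidate_id)  -- foreign key; null if no vote cast
);

-- Multi-row VALUES syntax; split into individual INSERTs if your DBMS requires it.
insert into candidate values (1, 'Alice'), (2, 'Bob'), (3, 'Carol'), (4, 'Dave');
insert into voter values
  (1, 'Amy', 1), (2, 'Ben', 2), (3, 'Cath', 1),
  (4, 'Dan', 3), (5, 'Eve', 2), (6, 'Liam', null);

With data like this, Liam (no candidate chosen) only appears in the left and full outer joins, and Dave (no votes) only appears in the right and full outer joins.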

 

Inner Join

The first join we spoke of was the inner join. The inner join is used to return matching data in each table. Using the statement below on the tables we have created, we can display all of the people that voted for a candidate.

 

SELECT voter.voter_id
      ,voter.forename
      ,candidate.candidate_id
      ,candidate.forename
FROM voter
INNER JOIN candidate
  ON voter.candidate_id = candidate.candidate_id
ORDER BY voter.voter_id
;

This table shows the results from the above statement:

Join Table VLDB

The results from the inner join show us that only five rows from the voter table have matches in the candidate table.

Venn VLDB

The above venn diagram shows us how the tables are connected when we join two tables on an inner join. In this instance we only get the data which matches in both tables.

 

Left Join

The next join statement is the left join. This will provide us with all the data from the left table and only the connecting data in the right table.

 

SELECT voter.voter_id
      ,voter.forename
      ,candidate.candidate_id
      ,candidate.forename
FROM voter
LEFT JOIN candidate
  ON voter.candidate_id = candidate.candidate_id
ORDER BY voter.voter_id
;

 

Results table from the above statement:

Join Table VLDB

The left join displays all of the data in the left table, which in this statement is voter, and displays data of the connected table. If there is no data available for a row, then null is added to that row; this is what has happened in row 6 as Liam didn’t choose a candidate.

Venn VLDB

This venn diagram shows us how a left join is used. All of the data contained in voter is returned as is the matching data in candidate.

 

Right Join

 

As you may have guessed, the next statement we will look at is the right join. This statement will result in the right table providing us with all the data it holds and then only displaying the data that is connected to it from the left table.

 

SELECT voter.voter_id
      ,voter.forename
      ,candidate.candidate_id
      ,candidate.forename
FROM voter
RIGHT JOIN candidate
  ON voter.candidate_id = candidate.candidate_id
ORDER BY candidate.candidate_id
;

 

Results from the above statement:

Join Table VLDB

The SQL statement has been tweaked slightly: the order by has been changed to candidate.candidate_id to make the results more readable. The right join statement has displayed all of the data available in the right table, which was candidate, and now only shows the connecting data in the left table (voter). Again, if there isn’t any data in the left table that connects to the right table then null is added to that row.

Venn VLDB

 

This right join venn diagram shows us the opposite to a left join, and displays all of the data in candidate and the connecting data in the voter table.

 

 

Full Outer Join

The last statement that will be explained now is a full outer join. This statement brings in all the data from both tables and connects the results together where it can.

SELECT voter.voter_id
      ,voter.forename
      ,candidate.candidate_id
      ,candidate.forename
FROM voter
FULL OUTER JOIN candidate
  ON voter.candidate_id = candidate.candidate_id
ORDER BY voter.voter_id
;

 

Results from the above statement:

Join Table VLDB

With a full outer join, the results are from both tables, and results will be matched together. Also, if there is no matching data, the field will again contain a null value.

Venn VLDB

And this is the last venn diagram. This shows us a full outer join and displays all the data in both tables; it attempts to match the data together where it can.

 

As you can see from the examples above, we have managed to join the data from two individual tables and link them together in a variety of different ways to provide the results we required. The ability to join tables is a fundamental aspect of SQL and a common feature of the role of a database developer.

 

Sources: http://www.bbc.co.uk/news/technology-36558446

 

 

Posted in data warehouse, database, Joins, SQL

Database Benchmarking

TPC Database Benchmarks

As most readers of data warehouse related blogs will no doubt know, the Transaction Processing Performance Council (TPC) define database benchmarks to allow comparisons to be made between different technologies.

Amongst said benchmarks is the TPC-H ad hoc decision support benchmark, which has been around since 1999, with the latest version being v2.17.1. This is the most relevant TPC benchmark for those of us focussed on data warehouse platforms.

Benchmark Scale

TPC-H benchmarks are executed at a given scale point ranging from 1GB up to 100TB:

  • TPCH1 = 1GB
  • TPCH10 = 10GB
  • TPCH30 = 30GB
  • TPCH100 = 100GB
  • TPCH300 = 300GB
  • TPCH1000 = 1,000GB/1TB
  • TPCH3000 = 3,000GB/3TB
  • TPCH10000 = 10,000GB/10TB
  • TPCH30000 = 30,000GB/30TB
  • TPCH100000 = 100,000GB/100TB

Benchmark Data

Data used to populate the TPC-H schema is generated using the DBGEN utility which is provided as part of the TPC-H download.
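From memory, a typical DBGEN invocation looks like the following once the utility has been compiled – check the README in the TPC-H download for the definitive flag list:

# Generate all tables at scale factor 100 (roughly 100GB of raw data).
./dbgen -s 100

# Generate just the lineitem table in 8 chunks (chunk 1 shown) for parallel loading.
./dbgen -s 100 -T L -C 8 -S 1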

Database Schema

The schema used in the TPC-H benchmark is retail based and contains the following tables:

  • customer
  • orders
  • lineitem
  • part
  • partsupp
  • supplier
  • nation
  • region

Each table contains either a fixed number of rows or a number related to the scale factor in use.

Unsurprisingly, the ‘lineitem’ table contains the most rows, which is 6,000,000 x scale factor e.g. 600,000,000 rows for scale factor 100 (TPCH100).

Column names and data types are provided for all tables.

For those using MPP databases such as Teradata, Greenplum or Netezza you’ll have to decide your own data distribution strategy.
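As a sketch of what that decision looks like, the Greenplum-style DDL below distributes the two largest tables on the order key so that lineitem/orders joins are co-located; Teradata would use a PRIMARY INDEX and Netezza a DISTRIBUTE ON clause to the same effect. Only a handful of columns are shown:

-- Abridged TPC-H tables, Greenplum syntax.
create table orders (
  o_orderkey   bigint not null,
  o_custkey    bigint not null,
  o_orderdate  date
  -- ...remaining TPC-H columns...
)
distributed by (o_orderkey);

create table lineitem (
  l_orderkey   bigint not null,
  l_partkey    bigint not null,
  l_suppkey    bigint not null,
  l_linenumber integer not null,
  l_shipdate   date
  -- ...remaining TPC-H columns...
)
distributed by (l_orderkey);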

Benchmark Queries

The queries that are executed against the populated schema are generated using the QGEN utility which is provided as part of the TPC-H download. Minor query modifications are allowed to take into account differences between DBMS products.

There are a total of 22 ‘SELECT’ queries (‘query stream’) that must be executed against the populated retail schema at a given scale factor. Each query corresponds to a business question. There are also 2 refresh functions (‘refresh stream’) that add new sales and remove old sales from the database.
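To give a flavour of the workload, query 1 – the ‘pricing summary report’ – looks roughly like this, quoted from memory with the default 90-day substitution parameter; the QGEN output is the definitive version:

select l_returnflag,
       l_linestatus,
       sum(l_quantity)                                       as sum_qty,
       sum(l_extendedprice)                                  as sum_base_price,
       sum(l_extendedprice * (1 - l_discount))               as sum_disc_price,
       sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
       avg(l_quantity)                                       as avg_qty,
       avg(l_extendedprice)                                  as avg_price,
       avg(l_discount)                                       as avg_disc,
       count(*)                                              as count_order
from   lineitem
where  l_shipdate <= date '1998-12-01' - interval '90' day
group by l_returnflag, l_linestatus
order by l_returnflag, l_linestatus;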

Database Load Time

The elapsed time required to generate & load the data to the database must be recorded. The time to execute other supporting tasks such as table creation, index creation & statistics collection must also be recorded.

Performance Tests

Once the data is loaded the real fun and games can begin. The performance test consists of both single-user power and multi-user throughput tests.

The power test consists of the first refresh function followed by the 22 query set and lastly the second refresh function.

The throughput test consists of a minimum number of concurrent runs of the 22 queries (‘streams’), as determined by the scale factor:

  • SF1 = 2
  • SF10 = 3
  • SF30 = 4
  • SF100 = 5
  • SF300 = 6
  • SF1000 = 7
  • SF3000 = 8
  • SF10000 = 9
  • SF30000 = 10
  • SF100000 = 11

The throughput test is run in parallel with a single refresh stream.

The set of 22 queries is run in a sequence specified in Appendix A of the TPC-H guide and is dependent on the number of streams.

The timing of all queries is measured in seconds.

Benchmark Metrics

The metrics captured by the power and throughput tests are as follows:

  • composite query-per-hour (QphH@Size)
  • price/performance ($/QphH/@Size)

Detailed explanations as to how these metrics are computed are available in section 5 of the TPC-H guide. There are some lovely equations in there for maths geeks to enjoy!
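Roughly speaking – and do check section 5 of the guide rather than relying on our memory – the metrics boil down to the following, where QI and RI are the measured timing intervals for the 22 queries and 2 refresh functions in the power run, S is the number of query streams, and Ts is the elapsed time of the throughput test in seconds:

\[
\text{Power@Size} = \frac{3600 \times SF}{\left( \prod_{i=1}^{22} QI(i,0) \times \prod_{j=1}^{2} RI(j,0) \right)^{1/24}}
\]

\[
\text{Throughput@Size} = \frac{S \times 22 \times 3600}{T_s} \times SF
\qquad
\text{QphH@Size} = \sqrt{\text{Power@Size} \times \text{Throughput@Size}}
\]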

Benchmark Results

Vendors that submit results are required to provide an executive summary in addition to a full disclosure report. All of the shell scripts, SQL, log files etc are also provided as supporting files. There are no secrets here!

Benchmark performance results have been submitted at the following scales:

  • SF100 (100GB)
  • SF300 (300GB)
  • SF1000 (1TB)
  • SF3000 (3TB)
  • SF10000 (10TB)
  • SF30000 (30TB)
  • SF100000 (100TB)

Perhaps interestingly, the fastest system at all scale factors is currently the Exasol database running on Dell PowerEdge servers. We’ll let you peruse the TPC web site to see how they did it.

Benchmark Use

The TPC-H benchmark is primarily of interest to technology vendors to show us what they’ve got in their locker.

Here at VLDB we’re data consultants, not tech vendors, so why the TPC-H interest?

Well, it is possible to use a cut-down version of the TPC-H benchmark to assess the real-world capability of various database platforms.

Why believe the hype when you can test database performance with a consistent set of queries against a real-world(ish) schema and usable data?

Well, that’s exactly what we’ve been doing for several years now.

Specifically, we use a set of benchmark tests to assess cloud database performance. Also, as cloud platforms are said to suffer the ‘noisy neighbour‘ problem, we also run benchmark tests over extended periods to test the variance of database performance over time for cloud platforms.
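The ‘variance over time’ harness is nothing fancy – a sketch is shown below. The client (psql here), connection details and query file are placeholders; all we really capture is a timestamp and the elapsed seconds per run:

# Run the 22-query stream once an hour and log start time plus elapsed seconds.
while true
do
  start=$(date +%s)
  psql -h my-db-host -d tpch -f query_stream.sql > /dev/null 2>&1
  end=$(date +%s)
  echo "$(date -u +%FT%TZ) elapsed=$((end - start))s" >> stream_timings.log
  sleep 3600
done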

Some of the results are interesting…very interesting 🙂

 

Posted in data warehouse, database, TPC-H