Azure IO Testing


One of the first things we do here at VLDB when we have a yearning to set up a database on a cloud platform is to develop an understanding of disk input-output (IO) performance.

For databases in particular, we are specifically interested in random read performance. A high random read throughput figure is necessary for an efficient and performant MPP database.

Microsoft’s Azure is the test platform in question. The approach taken is to commission a single CentOS7 VM, attach various disks and measure IOPS & throughput.

Azure VM Setup

A single CentOS 7 VM is configured. Disks are added in the required size, type & number in order to maximise IO throughput at the VM level:

Azure OS Console

The starting point on Azure is SSD disks, as the IO figures for SSD are generally superior to HDD.

Disk size determines both throughput and available IOPS, so this will also need to be considered in the overall build:

Azure Disk Console

 

Azure VM & Disk Choices

For the 1TB SSD selected above, IOPS are limited to 5,000/sec and throughput is limited to 200 MB/s.

As an aside, cloud providers usually fail to distinguish between read & write, sequential & random or specify IO size when they bandy about throughput figures. The devil is in the detail, dear reader(s).

In order to use premium storage it is necessary to specify the account as ‘Premium LRS’.

The range available that is hosted in the UK is limited, so if you have a requirement for a UK-based system, make sure that what is on offer from Azure will satisfy the requirement:

Azure UK Console

Note that only certain VM types support SSD disks.

For the basic D14v2 with 16 CPU cores, 112GB RAM and an 800GB disk, it is not possible to specify SSD disks. However, the DS14v2 VM type will allow these ‘Premium’ disks.

Also note that the 800GB is for the operating system and swap space only – the storage is non-persistent on this device specifically.

The D14 also allows up to 8 NICs and has an ‘extremely high’ network bandwidth. This is useful if the requirement is to build an MPP cluster to facilitate high speed node:node data transfer.

The Ds14V2 VM is limited to 32 disks:

Azure ds14v2

The Standard_DS14_v2 VM is limited to 64K IOPS or 512MB/s in cached mode (host cache = read or read/write) or 51K IOPS (768MB/s) in uncached mode (host cache = none).

The P30 1TB SSD disks used are rated at 5K IOPS (200MB/s) max per disk:

Azure P30

Azure IOPS Testing

Booting CentOS7 and upgrading the kernel takes a matter of minutes.

FIO is then downloaded using git and installed to perform the IOPS tests:

git clone --branch fio-2.1.8 http://git.kernel.dk/fio.git /fio
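FIO then builds from source in the usual way. A minimal sketch, assuming the build prerequisites are installed via yum (package names may vary slightly by distro, and make install may need to run as root):

yum install -y git gcc make libaio-devel
cd /fio
./configure
make
make install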

To run a FIO throughput test, simply enter the following from the fio directory:

./fio --rw=randread --name=<directory>/fio --size=<size>g --bs=32k --direct=1 --iodepth=16 --ioengine=libaio --time_based --runtime=<elapse> --group_reporting

  • <directory> – the directory (disk) where the tests are executed.
  • <size> – size of the test file created, in GB.
  • <elapse> – elapsed time for the test, in seconds.
  • bs=32k – block size (default 4KB). 32KB is a good size to test for database suitability.
  • direct=1 – non-buffered IO.
  • iodepth=16 – concurrent IO requests (queue depth).
  • ioengine=libaio – native Linux asynchronous IO.
  • time_based – run for the specified elapsed time, even if the test file has been read in full.
  • group_reporting – report on the group total rather than individual jobs.

For example:

./fio --rw=randread --name=/tests/disk1/fio --size=1g --bs=32k --direct=1 --iodepth=16 --ioengine=libaio --time_based --runtime=30 --group_reporting
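To drive several disks at once and gauge throughput at the VM level, a hedged sketch along the following lines can be used. It assumes the premium disks are mounted as /tests/disk1 … /tests/disk6 (our layout, not a requirement), runs the same random read job against each in parallel, and then pulls the bandwidth lines out of the logs to be summed by hand:

for i in 1 2 3 4 5 6; do
  ./fio --rw=randread --name=/tests/disk${i}/fio --size=1g --bs=32k \
        --direct=1 --iodepth=16 --ioengine=libaio --time_based \
        --runtime=30 --group_reporting > /tmp/fio_disk${i}.log &
done
wait
grep bw= /tmp/fio_disk*.log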

Azure Disk Read Results

Multiple random & sequential read tests were executed against a Standard_DS14_v2 VM with between 1 and 6 P30 1TB SSD disks attached. Blocksize is 32KB in all cases.

As stated earlier, we are primarily, but not exclusively, interested in read and not write performance when assessing a cloud platform. This is due to the fact that 80-90% of all data warehouse IO activity is read, and only 10-20% is write. If the read performance isn’t good enough we don’t really care about write performance as the platform is probably unsuitable.

Sequential and random read results using fio on Azure are as follows:

Azure IOPS Results

Azure Read Performance Observations

Random and sequential read give similar performance. This is both unusual and most welcome for those of us that rely on random read performance for data warehouse systems!

At the VM level, total disk throughput does not increase as more disks are added to the VM. Disk throughput is limited by the VM type and/or user account type. Adding more disks to the VM is not a way to deliver more IOPS.

Approximately 540-570MB/s random read throughput with a 32KB blocksize was achieved fairly consistently for between 1-6 P30 1TB SSDs attached to a Standard_DS14_v2 VM.

This is slightly higher than the stated 512 MB/s limit for a Standard_DS14_v2 VM, and represents a very good random read throughput figure for a single disk.

However, it is disappointing that this figure can’t be increased nearer to 1,000MB/s (which is attainable on AWS) through the use of more than 1 disk per VM.

Posted in Azure, cloud, IOPS

MPP & Redshift Musings

What On Earth is MPP?

In computing, massively parallel refers to the use of a large number of processors (or separate computers) to perform a set of coordinated computations in parallel (simultaneously). Source: Wikipedia.

Teradata & MPP Beginnings

Once upon a time, in a land far, far away…well, OK, California in the late 1970’s/early 1980’s to be precise…the MPP database world started to stir in earnest.

Following on from research at Caltech and discussions with Citibank’s technology group, Teradata was incorporated in a garage in Brentwood, CA in 1979. Teradata’s eponymous flagship product was, and still is, a massively parallel processing (MPP) database system.

Teradata had to write their own parallel DBMS & operating system. They also used weedy 32bit x86 chips to compete with IBM’s ‘Big Iron’ mainframes to perform database processing. Quite an achievement, to say the least.

The first Teradata beta system was shipped to Wells Fargo in 1983, with an IPO following in 1987.

It is interesting to note, for those of us in the UK, that there were Teradata systems up and running over here as early as 1986/87, partly down to the efforts of folks such as our very own ‘Jim the phone’.

Early Teradata adopters in the UK included BT (who later moved off Teradata when NCR/Teradata was acquired by AT&T), Royal Insurance, Great Universal Stores (GUS) and Littlewoods. GUS and Littlewoods combined to become ShopDirect, who still run Teradata 30 years later.

Yours truly ran his first Teradata query at Royal Insurance in Liverpool from ITEQ on an IBM 3090 back in ’89. It killed the split screen in MVS, but ho-hum, better than writing Cobol against IMS DB/DC to answer basic questions. We were taught SQL by none other than Brian Marshall who went on to write the first Teradata SQL & performance books. I still have the original dog-eared Teradata reference cards, or ‘cheat sheets’ as we called them:

Teradata ITEQ/BTEQ & Utilities Reference Cards, 1988 vintage.

From the 1980’s through to the early 2000’s Teradata had a pretty clear run at the high-end analytic DBMS market. There was no serious competition, no matter how hard the others tried. All those big name banks, telecoms companies and retailers couldn’t be wrong, surely?

MPP Upstarts – Netezza, Datallegro & Greenplum

Teradata’s first real competition in the commercial MPP database space came in the form of Netezza in the early 2000’s.

Like Teradata, Netezza consisted of dedicated hardware & software all designed to work in harmony as an engineered MPP database ‘appliance’. Unlike Teradata, Netezza was able to take advantage of open source DBMS software in the form of PostgreSQL, and open source OS software in the form of Linux.

We discovered Netezza by accident in 2002/03 after landing on a PDF on their web site following a Google search. “Netezza is Teradata V1” was our initial response. Apart from the FPGAs, we were pretty close.

A few phone calls, a trip to Boston and a training session later, and we’re up and running as Netezza partners.

Following a successful Netezza project at the fixed line telecoms division of John Caudwell’s Phones4U empire, yours truly was a guest speaker at the inaugural Netezza user conference in 2005.

Following an IPO in 2007, Netezza was bought by IBM in 2010 where it remains to this day, somewhere in IBM’s kitbag. Poor old Netezza.

Also in the early 2000’s, a bunch of mainly Brits at Datallegro were trying to build an MPP appliance out of Ingres. They were bought by Microsoft in 2008. Why Microsoft needed to buy a startup running Ingres on Linux is anyone’s guess.

The last of the new MPP players we had dealings with all those years ago are none other than Greenplum.

Through the Netezza partner ecosystem we actually knew the Greenplum team when they were still called ‘Metapa’, probably around 2002/03. Yours truly was at a dinner with Luke & Scott in Seattle as they wooed none other than Teradata luminary Charles ‘Chuck’ McDevitt (RIP) to head up the Greenplum architecture team. With Chuck on board they were virtually guaranteed to succeed.

Greenplum was always of particular interest to us because, unlike Teradata & Netezza, it was and still is a software only offering. You get to choose your own platform, which can be your favourite servers/storage or even that new fangled ‘cloud’ thing.

Not only have we got Greenplum to work happily on every single platform we ever tried, Greenplum is also the only open source MPP database.

Other Parallel PostgreSQL Players

In addition to early parallel PostgreSQL players such as Netezza and Greenplum (Datallegro used Ingres, not PostgreSQL), a whole raft of ‘me too’ players cropped up in 2005:

  • Vertica – acquired by HP in 2011
  • Aster – acquired by Teradata in 2011
  • Paraccel – acquired by Actian in 2013

It is interesting to note that, unlike Netezza, not one of the new parallel PostgreSQL players put even the smallest dent in Teradata’s core MPP data warehouse market. At least Netezza shook Teradata from their slumber which, in turn, gave the Teradata appliance to the world.

That said, perhaps Paraccel will have the biggest impact on the MPP database market which, for so long, has been dominated by Teradata, the main player in the space for 30 years. How so? Read on…

Redshift

Amazon launched the Redshift ‘Data Warehouse Solution’ on AWS in early 2013.

Yours truly even wrote some initial thoughts at the time, as you do.

As has been well documented, Redshift is the AWS implementation of Paraccel, which was one of the crop of ‘me too’ parallel PostgreSQL players that appeared in 2005.

Redshift adoption on AWS has been rapid, and reported to be the fastest growing service in AWS.

So, despite Teradata having been around for 30 years, the awareness of MPP and adoption of MPP as a scalable database architecture has been ignited by Redshift, which started out as Paraccel, which is built out of PostgreSQL.

Redshift Observations

Our previous post on Redshift in 2013 made mention of the single column only distribution key, and the dependence on the leader node for final aggregation processing.

There is a workaround for the single column only distribution key restriction, for sure. MPP folks will no doubt be mystified as to why such a restriction exists. However, it neatly removes the possibility of a 14 column distribution key (primary index) that we encountered at a Teradata gig a few years ago (no names). Cloud, meet silver lining.
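One common workaround (which may or may not be the one alluded to above) is to derive a single synthetic distribution column by hashing the natural multi-column key at load time. A hedged SQL sketch follows; the table and column names are invented for illustration, and MD5 is simply a convenient way to combine the columns into one value:

CREATE TABLE sales_fact
( sale_region VARCHAR(10)
, sale_date   DATE
, store_id    INTEGER
, amount      DECIMAL(12,2)
, dist_key    CHAR(32)  -- MD5 of the multi-column business key, populated at load time
)
DISTSTYLE KEY
DISTKEY (dist_key)
;

INSERT INTO sales_fact
SELECT sale_region
     , sale_date
     , store_id
     , amount
     , MD5(sale_region || '|' || CAST(sale_date AS VARCHAR) || '|' || CAST(store_id AS VARCHAR))
FROM   staging_sales;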

The dependence on a leader node for final aggregation processing is much more of an issue. Aggregation queries that return a high number of groups will simply choke on the final aggregation step. Just ask Exadata users. There’s also not much you can do to remediate this issue if you can’t over-provision CPU/RAM at the leader node, which is the case with Redshift.

More recent Redshift observations from our own experiences, and from trusted contacts (you know who you are) include the following:

  • lack of node OS access – security auditing, hardening & OS customisation not possible.
  • non-persistent storage – if you stop or lose the EC2 cluster you lose the database data & will need to re-load from S3 or a database snapshot.
  • poor concurrency – default is 5 concurrent operations. Although this can be increased, performance drops off quickly as concurrency increases.
  • poor workload management – no short query ‘fast path’. Redshift is essentially a ‘free for all’.
  • 1MB block size & columnar only storage – very large table space overhead of 1 MB per column for every table x number of segments.
  • limited tuning options – no partitions or indexes. If the sort keys don’t help it’s a full table scan every time.
  • automatic database software updates – this might appeal to the ‘zero admin is good’ crowd, but enterprise customers will baulk at the notion of a zero-choice upgrade that could break existing applications.

Lack of OS access and non-persistent storage will no doubt be show-stoppers for some enterprise folks.

The SQL Server crowd will probably not care and just marvel at the performance offered by SQL queries that run against all CPU cores in the cluster. Unless, of course, they didn’t read the data distribution notes.

Meet The New Boss, Same as The Old Boss

No matter that Redshift could be improved, what we can applaud is that Amazon has opened many folks’ eyes to the benefits of a scalable ‘old-skool’ relational database management system (RDBMS), and one with an MPP architecture to boot. For this we can only be thankful.

The rate of Redshift uptake speaks to the usefulness of a scalable RDBMS. The architecture is being socialised by Amazon (high volume/low margin) in a way that never was by Teradata (low volume/high margin).

All of the MPP database vendors, Amazon included, owe a debt of gratitude to Teradata who proved the architecture 30 years ago. That Teradata got so much so right so long ago never ceases to amaze yours truly.

To Redshift the old adage remains true, there really is ‘nowt new under the sun’.

Posted in MPP, Redshift, Teradata

Teradata AWS Access How To Guide

Teradata AWS Access

It was previously explained by ourselves here at VLDB how to set up a Teradata MPP (multi-node) system on Amazon Web Services (AWS). Well, now we’re back to offer a simple step-by-step guide on how to set up basic SSH connections, install Teradata’s Tools & Utilities (TTU) and set up a desktop SQL client for a Teradata instance running on AWS.

 

SSH & Key Pairs Identity

When you first set up your server on AWS, you are required to have created a key pairs identity. With this, you can start accessing your server straight away. There are multiple ways to connect, though one of the more common is via the SSH tool.

– Mac

If you are connecting via a terminal, then the command will be similar to the usual connection string:

ssh -i KEY_PAIRS_FILE ec2-user@AWS_SERVER_IP

However, you may first need to alter the permissions of the key pairs file. If access to the file is open to any user, the connection will reject it. This can be done as follows:

chmod 400 KEY_PAIRS_FILE

Note that the ec2-user is the default user for most Linux systems. Once logged on as ec2-user, you can gain access to root with the command sudo su -.

If you find yourself logging on and off a lot, it may be convenient to set your key pairs file path as an environment variable for ease of use.
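For example, a minimal sketch using an environment variable, or alternatively a host alias in ~/.ssh/config; the file path, alias and IP address are placeholders:

# in ~/.bashrc or ~/.zshrc
export AWS_KEY=~/keys/my-key-pair.pem

# then connect with
ssh -i "$AWS_KEY" ec2-user@AWS_SERVER_IP

# or add a host alias to ~/.ssh/config:
# Host teradata-aws
#     HostName AWS_SERVER_IP
#     User ec2-user
#     IdentityFile ~/keys/my-key-pair.pem
# ...after which 'ssh teradata-aws' is enough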

– Windows

If you are connecting via PuTTY on Windows, then you are required to first modify the key pairs file using PuTTY-Gen. Typically this piece of software is downloaded alongside PuTTY, but you may need to download it separately.

Once you’ve loaded up the PuTTY-Gen software, click Load on the right and then select your key pairs file. In the bottom right you may need to change the file type that is visible (default is .ppk and the key pairs file is a .pem).

Once loaded, it should pop up with a message saying Successfully Imported… Once you have this, click Save Private Key in the bottom right and save the file. This will now be a usable key pairs file for PuTTY’s use.

Now, open up PuTTY.

On the left side in the list, expand Connection, and then SSH. Now click on Auth and you should see a view on the right titled Options controlling SSH authentication.

From here, click on Browse in the bottom right and then select the key pairs file you just created.

Now go back to the main page by clicking on Session at the top of the list on the left. Here, put in your server IP address as normal and hit Open.

Note that it may be of convenience to save the session details before opening the connection for ease of use at a later time.

– Password Authenticated Login

If you require password authentication to be enabled, then it will be a simple edit of a system config file and a restart of the SSH service on the AWS server.

First, edit the file /etc/ssh/sshd_config, and then find and change the following line:

PasswordAuthentication no

to

PasswordAuthentication yes

Note that you should also remove the hash (#) symbol from in front of the parameter if there is one. Once the file is edited as above, you will then need to restart the SSH service as follows to commit the change.

service sshd restart

Once the service is restarted, password authentication should be updated and enabled on your system.
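If you prefer to script the change rather than edit the file by hand, something similar to the following (run as root) should achieve the same result; treat it as a sketch and check the file afterwards:

# uncomment (if needed) and switch PasswordAuthentication to yes
sed -i 's/^#\?PasswordAuthentication .*/PasswordAuthentication yes/' /etc/ssh/sshd_config

# restart SSH to pick up the change
service sshd restart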

Teradata’s Tools and Utilities (TTU)

 It is very useful to understand that all Teradata servers on AWS come with the equivalent TTU version packaged for Linux, Mac and Windows. With your key pairs file to hand, you can use it, similarly to SSH access, to download the file using SCP.

But first you will need to make the file(s) available to the required user for download (ec2-user in our example), as by default only root can access the files.

First, log on to the server using your SSH access as ec2-user. Change to the root user and then run [something similar to] the following commands:

cp /var/opt/teradata/TTU_pkgs/Teradata*MACOSX* /home/ec2-user/

chown ec2-user /home/ec2-user/Teradata*MACOSX*

Note that you will need to replace the name MACOSX with Linux or windows if you require those instead.

– Mac

Once done, use the following command to retrieve the file for your Mac via terminal:

scp -i KEY_PAIRS_FILE ec2-user@SERVER_IP:/home/ec2-user/Teradata*MACOSX* .

– Windows

The simplest way to download the package on Windows is to use the WinSCP software. For this you will again require the key pairs (.ppk) file that you created using PuTTY-Gen.

First, open up WinSCP.

From here, click on Advanced on the right which will pop up a new window. In the list on the left, click on Authentication. Similarly to PuTTY, click on the […] icon under Private key file and select the key pairs .ppk file, and then click OK.

Now you can input your IP address and user name as per usual, and then Login. Once logged in, you can freely move around your local and server directories to download the files as required.

Note that it may be of convenience, similarly to PuTTY, to save the session details before opening the connection for ease of use at a later time.

– Installation

Once you have downloaded the required packages, it is a simple matter of unpackaging the files and running the setup script contained within to install the TTU software.

For example, on a Mac terminal locate the TTU package and then run the following commands:

gunzip Teradata*MACOSX*

tar -xvf Teradata*MACOSX*

./setup.bat a

Note that using tar as above will result in all files being placed in the directory you are located in. Additionally, using the a parameter for the setup script will install all available TTU software.

Once installation is complete, you should be able to use BTEQ from your terminal to connect to your Teradata instance (however you may need to re-load your terminal first).
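For example, a minimal BTEQ session might look like the following; replace the IP address placeholder with your own instance and supply the DBC password set at launch:

bteq
.logon <AWS_SERVER_IP>/dbc

Enter the DBC password when prompted, then:

select * from dbc.dbcinfo;
.logoff
.exit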

Teradata Studio Express

Whilst there are a wide variety of desktop SQL clients available that are compatible with Teradata, and you may very well have your preference out of them all, this will be a quick setup and overview of Teradata Studio Express.

Teradata Studio Express is openly available for download on Windows, Mac and Linux platforms from the following website address:

http://downloads.teradata.com/download/tools/teradata-studio-express

Note that for this software to install successfully, the Java Development Kit (JDK) must be installed.

Once downloaded, simply unzip the package and run the installer. This will take you through a basic on-screen installation process that typically takes less than a minute. Once finished, the software is ready to use.

When first launching Teradata Studio Express you will be greeted with a Quick Tour guide for the studio. Quickly run through the guide to help you understand the basics (if required) and then continue to the main view.

From the main view, simply right-click Database Connections at the top of the far left panel and click New…. In this window, click Teradata, name the connection as you wish at the bottom under Name: and then hit Next.

In this next view you will see various parameters to fill in, and some pre-defined as well. Unless required, we would recommend leaving the pre-set parameters as default, only altering the following:

  • Database Server Name – The IP or server name to connect to.
  • User Name – The name of the user that you wish to connect to Teradata with.
  • Password – The password for the respective user chosen above.
  • Database – Defaults to user’s default if not specified. All database hierarchy existing under the specified (or default) database will be available for viewing once the connection is made.

Once the details required have been completed, you may click Next to get an overview of the details you’ve specified, or you may directly Finish the connection setup. At this point, you should have a successful connection in the far left panel for you to view. Any changes required to an existing connection can be made by simply right-clicking the connection name and clicking Properties.

With the connection now open, you may easily run queries against the database via the SQL Editor in the top panel on the right. Our basic example shows the following query being run:

select *
from dbc.dbcinfo
;

The query may be executed by simply clicking on the green triangle in the top right (or by using the shortcut CTRL + ALT + X on a Windows environment, or CMD + ALT + X on a Mac).

Results may then be viewed in the middle panel titled Teradata Result Set Viewer, with history available for all previously executed queries in the bottom panel, suitably titled Teradata SQL History.

Whilst this should cover the basics required to use Teradata Studio Express, there are of course many useful features that can be utilised. One such simple feature is the Reexecute Statement command. By right-clicking on any single row in the Teradata SQL History, you may select Reexecute Statement, which, as the term suggests, re-executes the selected query statement against the database, and will return results as per usual in the Teradata Result Set Viewer.

 

And there you have it! A simple step-by-step run through on the basic setup of connections, utilities and client tools that may be required by yourselves for any Teradata instance running on AWS.

 

 

 

 

 

Posted in AWS, SQL, Teradata

Is Hadoop Failing?

‘Hadoop Is Failing’ Article

A recent LinkedIn post linking to an Innovation Enterprise article entitled ‘Hadoop Is Failing’ certainly got our attention, as you might expect.

Apart from disagreeing with the assertion that ‘Hadoop…is very much the foundation on which data today is built’ the main thrust of the article is well founded.

Consider the following snippets:

  • ‘adoption appears to have more or less stagnated’
  • ‘trying to figure out how to use it for value’
  • ‘one man’s big data is another man’s small data’
  • ‘hundreds of enterprises were hugely disappointed by their useless 2 to 10TB Hadoop clusters’

And the grand finale:

Most companies do not currently have enough data to warrant a Hadoop rollout, but did so anyway because they felt they needed to keep up with the Joneses…they soon realize that their data works better in other technologies.

This can be summarised as: drank the Kool-Aid, convinced each other Hadoop was a silver bullet, rocked up a POC stack with an all-too-willing vendor, spent a lot of time/effort getting nowhere, quietly went back to existing data management technologies (almost certainly the RDBMS based data warehouse).

Mention is also made of the Hortonworks share price performance, which is down by about 66% since the December 2014 IPO. The NASDAQ is up 19% over the same period.

VLDB’s ‘Big Data’ Experiences

We’ve been exposed to the recent ‘Big Data’ thang from both a client and vendor perspective.

From the client side the main motivators appear to be twofold: reduce data warehouse costs and ‘fear of missing out‘ (FOMO) or keeping up with the Joneses. In both cases the amount of Kool-Aid supplied by VCs, vendors, analysts, tech journos and geek fanbois willingly gulped down by those that should know better is a tad disappointing.

The architect that glibly told us his company would ‘obviously’ have to re-train thousands of analysts away from SQL and into the realms of Pig and Hive springs to mind. This despite having 25 years of skills invested in the existing EDW.

From the vendor perspective there has been a lot of ‘over promising’ that has failed to materialise. ‘Nothing new there’, the cynics might suggest.

The challenge during one POC was to run an existing batch suite of SQL against a relatively simple schema lifted from the EDW as part of an EDW-offloading test.

The POC was to be attempted using a Hadoop stack running Hive, and was categorised by the vendor as ‘just a SQL race’. Given the stature of the vendor we reasonably assumed their confidence was well placed. How wrong we were – the SQL failed a few steps in. Very many tickets later, still no joy. EDW offloading? Err, no. EDW replacement? LMAO!

The point is, our real-world experiences bear out the main thrust of the article: the rush to get involved with Hadoop has *largely* been done without any real understanding of it and the failure to get it working, for whatever reason, has led to a lot of disillusionment and/or stagnation.

VLDB’s Guidance

First of all, be honest with yourself: you’re probably not Google, Facebook or Yahoo.

These companies conceived, designed & built the Hadoop ecosystem out of necessity. They have <understatement>slightly</understatement> different challenges to yours.

They also have a vastly different set of in-house skills. You’re probably a bank, a retailer, a manufacturer, a telecoms company or the like. You simply don’t possess the engineering capability that the ‘Hadoop crowd’ take for granted. Your salary scales also don’t allow the required talent to be hired. That’s before we take your location, stock options & ‘coolness’ into account!

When considering *any* new technology, it should always be apparent what extra capability is on offer, and what specific business benefit that capability delivers. Implicitly, this should be something we can’t already achieve with the tools at our disposal.

On a personal note, the adage that “there’s nowt new under the sun” is my default starting position. Greater minds than mine think there is a lack of novelty with MapReduce. We’ve had parallel data processing for decades. Bah humbug, etc.

Before any technology POC (Hadoop included) gets off the ground, the business case in the form of potential ROI should be established, as we all know. FOMO and ‘keeping up with the Joneses’ does not a business case make, no matter how much Kool-Aid is on offer.

During the ROI assessment, it’s all too easy to tot up the potential project benefits without being honest about the costs. Hadoop stacks aren’t free, no matter how much IT bang on about open source software, commodity servers/storage & vendor consulting support. Look at the amount of in-house resource required and the costs add up very quickly.

As several of our clients have rightly concluded, in most instances, Hadoop is a classic case of a solution looking for a problem – a problem that simply doesn’t exist for most organisations.

US tech companies assume the rest of the world operates at the same scale as the US. We simply don’t. Most organisations in the UK and Europe that we’ve encountered over the years operate at the 1-20TB scale. Over 100TB is rare. The case for a data processing stack running hundreds or even thousands of nodes simply doesn’t exist.

To underpin wider adoption, the recent focus has been to polish ‘SQL on HDFS’. This leads neatly back to where we started: the SQL compliant parallel data warehouse i.e. a database!

Love Your Data Warehouse

Prove us wrong by all means, but it is our belief that 99% of organisations can deliver 99% of their data analytics requirements without resort to a ‘Hadoop stack’. Go back and give the data warehouse you take for granted some love 🙂

We’ve yet to hear of a *single* case where an existing RDBMS data warehouse has been ditched for a Hadoop stack. Not one.

So, is Hadoop failing? No, not when used by the few organisations that have the right problem & skills.

Are Hadoop projects failing for lots of other organisations? Yes, mostly.

Posted in Big Data, data warehouse, Hadoop

Xmas Wish List

All We Want For Christmas

It’s that time of year…yes Christmas (Xmas for short), most definitely not <yuck>’holidays'</yuck>.

There’s far too much chocolate in the VLDB office as you might expect. Geeks & chocolate are a winning combo, so it won’t last long.

Moving on…office Xmas decorations – check. Office Xmas party – check. Silly jumpers – check. Secret Santa – check. Loud Xmas music – check.

As we draw ever nearer to *finally* going home for Xmas, our thoughts turn to what VLDB’s Xmas list to Santa might look like…so, here goes…

Dear Santa, can you please make sure clients understand at least some of the following:

  1. Data warehouse systems aren’t a side-project you can pay for with left over funding from another project. Real funding, sponsorship, requirements & commitment are required. Subject Matter Experts (SMEs) or Business Analysts (BAs) will need to provide guidance to get an analytic solution delivered. Technology & designers/developers on their own won’t get very far.
  2. High praise from the likes of Gartner doesn’t mean a particular technology is a good fit for your organisation. Figure out your needs/wants/desires afore ye go looking to buy shiny new tech. Thou shalt not believe all thou hears at conferences. It’s the job of tech companies, VCs, analysts & conference organisers to whip up excitement (see kool-aid). They’re not on the hook for delivery.
  3. Accurate estimates for design/build/test are only possible if analysis is carried out. Either you do it, pay us to do it, or accept estimates with wide tolerances.
  4. Quality Assurance (QA) is not the same as unit testing. It’s a real thing that folks do. Lots of them. No really!
  5. CSV files with headers and trailers are a perfectly acceptable way to build data interfaces. Lots of very large organisations are guilty of clinging on to this ‘unsexy’ approach. It ‘just works’.
  6. You’ll probably need a scheduler to run stuff. cron is not a scheduler. Nor is the DBA.
  7. If you could have ‘built that ourselves in SQL Server in 5 days’ you would have already done so.
  8. Don’t focus on our rate card. Focus on the project ROI. Oh, wait, you haven’t even thought about ROI. Doh!
  9. Yes, we can get deltas out of your upstream applications without crashing the system. It’s what we do. We’ll even prove it.
  10. If you want us to work on site we’ll need desks to sit at, preferably next to each other. We’re picky like that 😉

Have a great Xmas & New Year Santa,

Love from all at VLDB
xxx

Have a great Xmas & New Year, and here’s to 2017.

Posted in consulting, data warehouse

Teradata Setup on AWS

Teradata AWS Setup

As previously mentioned, Teradata MPP (multi-node) setup on AWS is a tad more involved than the previous SMP (single node) setup. Well, as we’re kind-hearted folks here at VLDB, we decided to show you how it’s done. Well, how we did it anyway.

Let’s see if, armed with an Amazon account and a credit card, you can have Teradata up and running on AWS in under an hour, as claimed by Stephen Brobst at the recent Teradata Partners conference.

Tick-tock, here goes…

Teradata AWS Marketplace

First of all, log in to the AWS Management Console and select AWS Marketplace (under ‘Additional Resources’ on the right), then search for ‘Teradata Database Enterprise Edition’ to go to the Teradata Database Enterprise Edition (1-32 nodes) page. From here you will have to subscribe to Teradata if you have not already done so.

teradata-aws-marketplace

 

Select the required ‘Region’ – in this case ‘EU(Ireland)’ – then click the ‘Continue’ button on the ‘AWS Marketplace Teradata Database Enterprise Edition (1-32 nodes)’ page to go to the ‘Launch on EC2’ page.

Launch Teradata on EC2

teradata-launch-ec2

Set the following as required:

  • Version – currently only ‘15.10.02.09-1’ available.
  • Region – ‘EU (Ireland)’ is the closest region for the UK.
  • Deployment – local or EBS storage and VPC options.

Click ‘Launch with CloudFormation Console’ to go to the ‘Create Stack/Select Template’ page.

AWS Cloud Formation Stack Template

teradata-aws-template

Ensure you are in the correct AWS region (top right corner next to ‘Support’ drop-down) then click on ‘Next’ to go to the ‘Create Stack/Specify Details’ page.

AWS Cloud Formation Stack Details

Set the following & leave other values as default:

  • Stack name – the name of the stack that will be created in CloudFormation.
  • System Name – the name of the Teradata instance.
  • DBC Password – the password for Teradata DBC user.
  • Number of Nodes – the number of EC2 nodes required (1-32).
  • Instance & Storage Type – select the EC2 instance type and storage type/size required.
  • Availability Zone – choose availability zone from list.
  • Remote Access From – specify CIDR block (IP range) from which SSH/DBMS access is allowed. Use 0.0.0.0/0 if required (any IP allowed) to test the setup process.
  • AWS Key Pair – a pre-existing key pair must be specified.

See http://www.vldbsolutions.com/blog/teradata-on-aws-configurations-pricing/ for help on choosing EC2 instance & storage types. Apart from the number of nodes, this is the biggest driver of the cost of your Teradata on AWS stack.

Click on ‘Next’ to go to the ‘Create Stack/Options’ page.

AWS Cloud Formation Stack Options

teradata-aws-options

Nothing needs to be set/changed in this page…unless you think otherwise.

Click ‘Create’ to proceed with the stack/instance creation which is monitored via CloudFormation.
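If you prefer the command line to the console, the same stack can in principle be created and monitored with the AWS CLI. A minimal sketch follows; the template URL and parameter key names are hypothetical and would need to be taken from the actual Marketplace listing for your region and version:

# hypothetical example - replace the template URL and parameter keys
aws cloudformation create-stack \
  --stack-name teradata-ee-test \
  --template-url https://s3.amazonaws.com/EXAMPLE/teradata-enterprise.template \
  --parameters ParameterKey=SystemName,ParameterValue=tdaws \
               ParameterKey=NumberOfNodes,ParameterValue=1 \
  --capabilities CAPABILITY_IAM

# poll until the status reaches CREATE_COMPLETE
aws cloudformation describe-stacks \
  --stack-name teradata-ee-test \
  --query 'Stacks[0].StackStatus'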

AWS Cloud Formation Stack Review

teradata-aws-review

It generally takes between 20-30 minutes to provision a single m4.4xlarge EC2 instance with 5TB EBS storage. The process is the same irrespective of the number of nodes.

There are lots of steps to go through as part of the provisioning. Once complete, the status will change to ‘CREATE_COMPLETE’:

teradata-aws-create-complete

Teradata on AWS Up and Running

Once the stack is up and running the next stage is to connect via SSH, Teradata Tools & Utilities (TTU) and a desktop SQL client. This is quite a big topic in itself and will be covered in a separate blog post.

So, to get back to Mr Brobst, we think it is possible to be up and running with Teradata on AWS in under an hour, but only if the node count is low, and only if you follow a guide that somebody has prepared…such as this one.

Enjoy!

Posted in AWS, cloud, Teradata

Teradata MPP on AWS

Multi-node Teradata on AWS

Teradata’s eponymous database became available on AWS in Q116 as a single node (SMP) only offering, as detailed in a previous VLDB blog post:

Teradata on AWS – Configurations & Pricing

Of far more interest to most folks was the availability of multi-node (MPP) Teradata systems on AWS. The ‘full fat’ version of Teradata was slated for Q316 and was duly made available during Teradata Partners 2016 via the AWS Marketplace.

The Teradata MPP version is now available on AWS to support Teradata clusters of between 1 and 32 SMP nodes.

Teradata Enterprise Edition on AWS Features

The previous SMP-only Base and Base+ editions have been consolidated into a single ‘Teradata Enterprise Edition’ with the following features:

  • Teradata DBMS v15.10 on SLES 11 including columnar, temporal & row level security
  • Teradata Active System Management (TASM)
  • Teradata Data Stream Utility
  • Teradata Parallel Transporter (TPT)
  • Teradata QueryGrid
  • Teradata Studio
  • Teradata Tools and Utilities (TTU)
  • Teradata REST Services
  • Teradata Server Management
  • Teradata Viewpoint (excluding Teradata Data Lab)

Additional cost items include:

  • Teradata Data Lab
  • Teradata Data Mover
  • Teradata Ecosystem Manager

Support is included in the Teradata software subscription and consists of both 24×7 Premier Cloud Support and access to Teradata At Your Service (TAYS).

A feature of running Teradata on AWS likely to be of interest to a lot of folks is the ability to backup Teradata data to S3.

Teradata MPP on AWS Pricing

At the time of launch, pricing for the m4 EC2 instances with persistent EBS storage in the Ireland region is as follows:

m4.4xlarge (16 vCPU, 64 GB RAM, 20 AMPs)

EC2 = $1.056/hr or $770.88/month
Teradata = $5.973/hr or $4,360.29/month
Total = $7.029/hr or $5,131.17/month

m4.10xlarge (40 vCPU, 160 GB RAM, 20 AMPs)

EC2 = $2.641/hr or $1,927.93/month
Teradata = $16.417/hr or $11,984.41/month
Total = $19.058/hr or $13,912.34/month

Note that these prices are per node.

Teradata software discounts of between 23% and 89% are offered for those willing to pay annually in advance.

Teradata on AWS Setup

The ‘Teradata on AWS Getting Started’ guide is available as HTML or a 98-page PDF.

<warning>
Compared to the previous SMP-only offering, the setup of Teradata MPP systems on AWS is *way* more complex.

Understanding topics such as Identity & Access Management (IAM), Virtual Private Cloud (VPC) and CloudFormation is now of prime importance.
</warning>

That’s all for today…watch this space for VLDB’s experience of setting up a multi-node Teradata MPP system on AWS, along with some benchmark figures.

Posted in AWS, cloud, Teradata

Using Joins Across Multiple Tables

Joins

When dealing with large data sets, it’s important to ensure that the data can be accessed correctly. Failure to address this issue early on in database development can lead to problems when later attempting to extract information from the data itself. This was highlighted recently during the count for London’s Mayoral election on 5 May, when staff at the Electoral Commission had to ‘manually query a bug-stricken database’, which delayed the result by ‘several hours’.

These kinds of problems aren’t helpful in terms of furthering the computerisation of such tasks, and in particular for data companies. Electronic voting machines are not currently used by UK voters; however, counting software was used for the London Mayoral and Assembly elections. In the article, electoral law expert Prof. Bob Watt of Buckingham University expressed concerns about such tasks being undertaken digitally: ‘The bigger issue is of course just how stable these machines are and that’s something that I have a great deal of worry about and have had for some time’.

As you can see, it’s important that companies dealing with data get it right. The BBC article on this story doesn’t go into any specific details on why the data was not accessible, so it’s difficult to offer any kind of assessment. However, when dealing with such large sets of data involving tens of thousands of rows, it’s important that the right people have access to the information when they need it.

 

When storing data in a relational database management system (RDBMS), data is contained in multiple tables which are related to each other through one or more common factors. Using the example above, a table may contain information relating to the voter in a table named ‘voter’, such as id, name and email_address. This then may be linked to another table containing contact information for the voter such as their address, telephone and other related information. We may then have a table relating to the candidates of the election, and information relating to them which is necessary for The Electoral Commission to efficiently track the vote. In order to access the information across two tables, we would use a join.

 

A join is used to examine information, not just across two tables, but across multiple tables (depending on system capabilities) and is an integral part of querying data. Each table in a database should contain a primary key which ensures that each row is unique and prevents duplication of redundant data. In addition, a table can contain a foreign key which references another table. For example, by connecting a table containing voter information using voter_id and candidate_id, we could find out which individuals voted for a specific candidate. In this instance, the candidate_id would be the foreign key in our voters table, as it would reference a different table containing information on the candidate. To perform a join in SQL, we need to create a statement which references which information we require, followed by the tables we want to join and how we want to join them together to provide the results we need.

 

To fully understand how joins work, we need to create two tables and some data within them. But first, we will briefly go over the different types of joins:

  • Inner join (the default) – returns only the rows where the data in each table matches.
  • Left join – returns all the data from the left table and only the matching data in the right table.
  • Right join – the opposite of a left join; returns all the data from the right table and only the matching data in the left table.
  • Full outer join – returns all the data in both tables.

Let’s have a look at each join in more detail. Firstly, we will examine the data contained in our tables.

Table examples:

voter

Join Table VLDB Solutions

candidate

Join Table VLDB
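The table contents shown in the images are not reproduced in the text, so a minimal sketch of how two such tables might be created and populated follows. The column types and sample rows are assumptions for illustration only; the column names are those used in the queries below:

CREATE TABLE candidate
( candidate_id INTEGER PRIMARY KEY
, forename     VARCHAR(50)
);

CREATE TABLE voter
( voter_id     INTEGER PRIMARY KEY
, forename     VARCHAR(50)
, candidate_id INTEGER REFERENCES candidate (candidate_id)
);

-- sample candidates, including one (Jane) who receives no votes
INSERT INTO candidate VALUES (1, 'George'), (2, 'Hannah'), (3, 'Imran'), (4, 'Jane');

-- sample voters, including one (Liam) who didn't choose a candidate
INSERT INTO voter VALUES
  (1, 'Anna', 1), (2, 'Ben', 2), (3, 'Carla', 1),
  (4, 'Dev', 3), (5, 'Erin', 2), (6, 'Liam', NULL);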

 

Inner Join

The first join we spoke of was the inner join. The inner join is used to return matching data in each table. Using the statement below on the tables we have created, we can display all of the people that voted for a candidate.

 

SELECT voter.voter_id
,voter.forename
,candidate.candidate_id
,candidate.forename
FROM voter
INNER JOIN candidate
ON voter.candidate_id=candidate.candidate_id
ORDER BY voter.voter_id
;

This table shows the results from the above statement:

Join Table VLDB

The results from the inner join show us that only five rows from the voter table have matches in the candidate table.

Venn VLDB

The above venn diagram shows us how the tables are connected when we join two tables on an inner join. In this instance we only get the data which matches in both tables.

 

Left Join

The next join statement is the left join. This will provide us with all the data from the left table and only the connecting data in the right table.

 

SELECT voter.voter_id
,voter.forename
,candidate.candidate_id
,candidate.forename
FROM voter
LEFT JOIN candidate
ON voter.candidate_id=candidate.candidate_id
ORDER BY voter.voter_id
;

 

Results table from the above statement:

Join Table VLDB

The left join displays all of the data in the left table, which in this statement is voter, and displays data of the connected table. If there is no data available for a row, then null is added to that row; this is what has happened in row 6 as Liam didn’t choose a candidate.

Venn VLDB

This venn diagram shows us how a left join is used. All of the data contained in voter is returned as is the matching data in candidate.

 

Right Join

 

As you may have guessed, the next statement we will look at is the right join. This statement will result in the right table providing us with all the data it holds and then only displaying the data that is connected to it from the left table.

 

SELECT voter.voter_id
,voter.forename
,candidate.candidate_id
,candidate.forename
FROM voter
RIGHT JOIN candidate
ON voter.candidate_id=candidate.candidate_id
ORDER BY candidate.candidate_id
;

 

Results from the above statement:

Join Table VLDB

The SQL statement has been tweaked slightly: the ORDER BY has been changed to candidate.candidate_id to make the results more readable. The right join has displayed all of the data available in the right table, which was candidate, and now only shows the connecting data in the left table (voter); again, if there isn’t any data in the left table that connects to the right table, then null is added to that row.

Venn VLDB

 

This right join venn diagram shows us the opposite to a left join, and displays all of the data in candidate and the connecting data in the voter table.

 

 

Full Outer Join

The last statement that will be explained now is a full outer join. This statement brings in all the data from both tables and connects the results together where it can.

SELECT voter.voter_id
,voter.forename
,candidate.candidate_id
,candidate.forename
FROM voter
FULL OUTER JOIN candidate
ON voter.candidate_id=candidate.candidate_id
ORDER BY voter.voter_id
;

 

Results from the above statement:

Join Table VLDB

With a full outer join, the results are from both tables, and results will be matched together. Also, if there is no matching data, the field will again contain a null value.

Venn VLDB

And this is the last venn diagram. This shows us a full outer join and displays all the data in both tables; it attempts to match the data together where it can.

 

As you can see from the examples above, we have managed to join the data from two individual tables and link them together in a variety of different ways to provide the results we required. The ability to join tables is a fundamental aspect of SQL and a common feature of the role of a database developer.

 

Sources: http://www.bbc.co.uk/news/technology-36558446

 

 

Posted in data warehouse, database, Joins, SQL

Database Benchmarking

TPC Database Benchmarks

As most readers of data warehouse related blogs will no doubt know, the Transaction Processing Performance Council (TPC) define database benchmarks to allow comparisons to be made between different technologies.

Amongst said benchmarks is the TPC-H ad hoc decision support benchmark, which has been around since 1999, with the latest version being v2.17.1. This is the most relevant TPC benchmark for those of us focussed on data warehouse platforms.

Benchmark Scale

TPC-H benchmarks are executed at a given scale point ranging from 1GB up to 100TB:

  • TPCH1 = 1GB
  • TPCH10 = 10GB
  • TPCH30 = 30GB
  • TPCH100 = 100GB
  • TPCH300 = 300GB
  • TPCH1000 = 1,000GB/1TB
  • TPCH3000 = 3,000GB/3TB
  • TPCH10000 = 10,000GB/10TB
  • TPCH30000 = 30,000GB/30TB
  • TPCH100000 = 100,000GB/100TB

Benchmark Data

Data used to populate the TPC-H schema is generated using the DBGEN utility which is provided as part of the TPC-H download.
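As a rough sketch of how that looks in practice (the make targets and flags can vary between TPC-H versions, so treat these as assumptions to be checked against the README shipped in the kit):

# build dbgen & qgen after setting DATABASE/MACHINE/WORKLOAD in makefile.suite
make -f makefile.suite

# generate the full SF100 (100GB) data set as pipe-delimited .tbl files
./dbgen -s 100

# generate only the lineitem table, as chunk 1 of 4, with verbose output
./dbgen -s 100 -T L -C 4 -S 1 -v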

Database Schema

The schema used in the TPC-H benchmark is retail based and contains the following tables:

  • customer
  • orders
  • lineitem
  • part
  • partsupp
  • supplier
  • nation
  • region

Each table contains either a fixed number of rows or a number related to the scale factor in use.

Unsurprisingly, the ‘lineitem’ table contains the most rows, which is 6,000,000 x scale factor e.g. 600,000,000 rows for scale factor 100 (TPCH100).

Column names and data types are provided for all tables.

For those using MPP databases such as Teradata, Greenplum or Netezza you’ll have to decide your own data distribution strategy.
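As an illustration only (distribution DDL is not part of the TPC-H kit itself), a Greenplum-style sketch that distributes the two largest tables on the order key, so that the common orders/lineitem join is co-located; column lists are abbreviated:

CREATE TABLE orders
( o_orderkey  BIGINT NOT NULL
, o_custkey   BIGINT NOT NULL
, o_orderdate DATE
-- remaining TPC-H columns omitted for brevity
)
DISTRIBUTED BY (o_orderkey);

CREATE TABLE lineitem
( l_orderkey  BIGINT NOT NULL
, l_partkey   BIGINT NOT NULL
, l_shipdate  DATE
-- remaining TPC-H columns omitted for brevity
)
DISTRIBUTED BY (l_orderkey);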

Benchmark Queries

The queries that are executed against the populated schema are generated using the QGEN utility which is provided as part of the TPC-H download. Minor query modifications are allowed to take into account differences between DBMS products.
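A minimal sketch of generating runnable query text with QGEN; the flags and the query template directory layout are assumptions based on the standard kit and should be checked against your version:

# point QGEN at the query templates and emit query 1 at scale factor 100
DSS_QUERY=queries ./qgen -s 100 1 > q01.sql

# emit a complete 22-query stream using the permutation for stream 0
DSS_QUERY=queries ./qgen -s 100 -p 0 > stream_0.sql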

There are a total of 22 ‘SELECT’ queries (‘query stream’) that must be executed against the populated retail schema at a given scale factor. Each query corresponds to a business question. There are also 2 refresh functions (‘refresh stream’) that add new sales and remove old sales from the database.

Database Load Time

The elapsed time required to generate & load the data to the database must be recorded. The time to execute other supporting tasks such as table creation, index creation & statistics collection must also be recorded.

Performance Tests

Once the data is loaded the real fun and games can begin. The performance test consists of both single-user power and multi-user throughput tests.

The power test consists of the first refresh function followed by the 22 query set and lastly the second refresh function.

The throughput test consists of a minimum number of concurrent runs of the 22 queries (‘streams’), as determined by the scale factor:

  • SF1 = 2
  • SF10 = 3
  • SF30 = 4
  • SF100 = 5
  • SF300 = 6
  • SF1000 = 7
  • SF3000 = 8
  • SF10000 = 9
  • SF30000 = 10
  • SF100000 = 11

The throughput test is run in parallel with a single refresh stream.

The set of 22 queries is run in a sequence specified in Appendix A of the TPC-H guide and is dependent on the number of streams.

The timing of all queries is measured in seconds.

Benchmark Metrics

The metrics captured by the power and throughput tests are as follows:

  • composite query-per-hour (QphH@Size)
  • price/performance ($/QphH@Size)

Detailed explanations as to how these metrics are computed are available in section 5 of the TPC-H guide. There are some lovely equations in there for maths geeks to enjoy!
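In outline (a sketch of the spec definitions, where QI(i,0) and RI(j,0) are the timed power test query and refresh intervals in seconds, S is the number of throughput streams and T_s is the throughput test elapsed time):

\mathrm{Power@Size} = \frac{3600 \times SF}{\sqrt[24]{\prod_{i=1}^{22} QI(i,0) \times \prod_{j=1}^{2} RI(j,0)}}

\mathrm{Throughput@Size} = \frac{S \times 22 \times 3600}{T_s} \times SF

\mathrm{QphH@Size} = \sqrt{\mathrm{Power@Size} \times \mathrm{Throughput@Size}}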

Benchmark Results

Vendors that submit results are required to provide an executive summary in addition to a full disclosure report. All of the shell scripts, SQL, log files etc are also provided as supporting files. There are no secrets here!

Benchmark performance results have been submitted at the following scales:

  • SF100 (100GB)
  • SF300 (300GB)
  • SF1000 (1TB)
  • SF3000 (3TB)
  • SF10000 (10TB)
  • SF30000 (30TB)
  • SF100000 (100TB)

Perhaps interestingly, the fastest system at all scale factors is currently the Exasol database running on Dell PowerEdge servers. We’ll let you peruse the TPC web site to see how they did it.

Benchmark Use

The TPC-H benchmark is primarily of interest to technology vendors to show us what they’ve got in their locker.

Here at VLDB we’re data consultants, not tech vendors, so why the TPC-H interest?

Well, it is possible to use a cut-down version of the TPC-H benchmark to assess the real-world capability of various database platforms.

Why believe the hype when you can test database performance with a consistent set of queries against a real-world(ish) schema and usable data?

Well, that’s exactly what we’ve been doing for several years now.

Specifically, we use a set of benchmark tests to assess cloud database performance. Also, as cloud platforms are said to suffer the ‘noisy neighbour‘ problem, we also run benchmark tests over extended periods to test the variance of database performance over time for cloud platforms.

Some of the results are interesting…very interesting 🙂

 

Posted in data warehouse, database, TPC-H

An Introduction to Primary Indexes and Distribution Keys

Primary Indexes and Distribution Keys

It’s All About Data Distribution.

As experts in Massively Parallel Processing (MPP), here at VLDB Solutions we talk regularly about ‘Primary Indexes’ (PI) or ‘Distribution Keys’ (DK). They are integral to the architecture of Teradata and Greenplum respectively, and the correct identification and employment of them is ‘key’ to maximising the performance of both MPP systems. But how do they work?

Data Distribution

Before we examine each in detail, it is first important to understand how data is stored and accessed on a MPP system, and how the distribution of data helps a system achieve true parallelism.

Within a MPP system, data is partitioned across multiple servers (referred to as AMPs in Teradata, and Segments in Greenplum). These servers ‘share nothing’ – they each process only their own share of the data required by a query, and do not process data located on other servers in the system. If data is not partitioned across all of the available servers, then those servers without data sit idle during the processing of a workload, and the full power of MPP has not been harnessed. The data is considered ‘skewed’, and therefore the query will be skewed too.

Primary Indexes and Distribution Keys are, as the name suggests, the key by which data is distributed across the servers. They are designated at a table level within the database, turning a column, or a selection of columns, into the key for each row of data.

If the Primary Index or Distribution Key for each row of data within the table is unique, then the rows are distributed across the servers in a ‘round robin’ manner. When the data is queried, each server has an equal share of the workload, and the system’s full power has been harnessed; the system has achieved true parallelism.

If the Primary Index or Distribution Key is not unique, rows with duplicate Primary Indexes or Distribution Keys are grouped on the same server. In a table where there are many matching key values, this can lead to skewed data and therefore skewed performance.

For example, if a MPP system had 10 available servers, but a table with a Primary Index / Distribution Key was created where that column only contained two different values (1/0, True / False, Yes / No, etc.) then that data would only be distributed to two servers, as the rows with matching value sets would be stored together. When that data is queried, only those servers with data can process the workload; the remaining eight servers remain idle.

Unique and Non-Unique Primary Indexes

As mentioned earlier, data distribution in a Teradata system is governed by a table’s Primary Index. Primary Indexes can be designated as either Unique or Non-Unique, depending on the column, or selection of columns, that has been chosen as the Primary Index. If a Primary Index is determined as a Unique Primary Index (UPI), then duplicate values are no longer allowed within the chosen column / selection of columns, and will not be loaded to the system. However, those rows that are loaded will be distributed evenly across the AMPs, and the system will achieve parallelism when that table data is queried. If a Primary Index is determined as Non-Unique Primary Index (NUPI), then duplicate values will be allowed within the column / selection of columns, but they will be grouped on the same AMP when the data is distributed.
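As a brief illustration (the table and column names are ours, invented for the example), the Primary Index is declared as part of the Teradata CREATE TABLE statement:

CREATE TABLE voter
( voter_id  INTEGER NOT NULL
, forename  VARCHAR(50)
, region_cd CHAR(2)
)
UNIQUE PRIMARY INDEX (voter_id);   -- UPI: even distribution, duplicate values rejected

CREATE TABLE vote_fact
( voter_id     INTEGER NOT NULL
, candidate_id INTEGER NOT NULL
)
PRIMARY INDEX (candidate_id);      -- NUPI: rows with the same candidate_id land on the same AMP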

Distribution Keys and Distributed Randomly

The Distribution Key is how data is distributed on a Greenplum system. Unlike Teradata, the key is not declared as unique or non-unique; it merely is, or is not. Again, as with Teradata, table data with a unique Distribution Key is distributed ‘round robin’ across the Segments; and duplicate Distribution Key values are grouped together on the same segment.

However, Greenplum tables can also be designated as ‘distributed randomly’. In this case, column data is not used for distribution, and each row is distributed to the Segments in the same ‘round robin’ manner as when using a unique Distribution Key.
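The Greenplum equivalents, again as a hedged sketch with invented table names:

CREATE TABLE voter
( voter_id INTEGER
, forename VARCHAR(50)
)
DISTRIBUTED BY (voter_id);      -- unique values give an even, round robin style spread

CREATE TABLE small_reference
( ref_cd   CHAR(2)
, ref_desc VARCHAR(100)
)
DISTRIBUTED RANDOMLY;           -- no suitable key: let Greenplum spread the rows evenly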

How to Choose Primary Indexes and Distribution Keys

As should now be clear, correctly identifying which column, or selection of columns, to use as a Primary Index or Distribution Key is integral to the performance of a MPP system. In a Relational Database Management System (RDBMS), a table’s Primary Key is often a natural candidate to become the Primary Index / Distribution Key – a column of values uniquely identifying each row could easily be used as a UPI for best distribution.

However, there will be times when a table does not have a single column of unique values, but where ‘round robin’ distribution is still desired – ‘reference’ or ‘dimension’ tables, for instance. On a Greenplum system, this could be achieved by distributing randomly; but on a Teradata system, it would be necessary to identify a selection of columns, where the combination of data within those columns would be unique on a row-by-row basis.

Posted in Big Data, data warehouse, database, SQL, VLDB