VLDB Blog : Data Warehousing & Big Data news, tips & tales

Joins

When dealing with large data sets, it’s important to ensure that the data can be accessed correctly. Failure to address this issue early on in database development can lead to problems when later attempting to extract information from the data itself. This was highlighted recently during the count for London’s Mayoral election on 5 May, when staff at the Electoral Commission had to ‘manually query a bug-stricken database’, which delayed the result by ‘several hours’. These kinds of problems do little to further the computerisation of such tasks, and are particularly unhelpful for data companies. Electronic voting machines are not currently used by UK voters; however, counting software was used for the London Mayoral and Assembly elections. In the article, electoral law expert Prof. Bob Watt of Buckingham University expressed concerns about such tasks being undertaken digitally: ‘The bigger issue is of course just how stable these machines are and that’s something that I have a great deal of worry about and have had for some time’. As you can see, it’s important that companies dealing with data get it right. The BBC article on this story doesn’t go into any specific detail on why the data was not accessible, so it’s difficult to offer any kind of assessment – however, when dealing with such large sets of data, involving tens of thousands of rows, it’s important that the right people have access to the information when they need it.

 

When storing data in a relational database management system (RDBMS), data is held in multiple tables which are related to each other through one or more common columns. Using the example above, information relating to the voter, such as id, name and email_address, may be held in a table named ‘voter’. This may then be linked to another table containing contact information for the voter, such as their address, telephone number and other related details. We may then have a table relating to the candidates of the election, holding the information The Electoral Commission needs to track the vote efficiently. In order to access the information across two tables, we would use a join.

 

A join is used to examine information not just across two tables, but across multiple tables (depending on system capabilities), and is an integral part of querying data. Each table in a database should contain a primary key, which ensures that each row is unique and prevents the duplication of data. In addition, a table can contain a foreign key which references another table. For example, by connecting a table containing voter information to a candidate table using voter_id and candidate_id, we could find out which individuals voted for a specific candidate. In this instance, the candidate_id would be the foreign key in our voter table, as it references a different table containing information on the candidate. To perform a join in SQL, we write a statement which specifies the information we require, followed by the tables we want to join and how we want to join them together to provide the results we need.

 

To fully understand how joins work, we need to create two tables and some data within them. But first, we will briefly go over the different types of join. The default join is an inner join; this statement returns results where the data in each table matches. The next type of join we will look at is a left join, which returns all the data from the left table and only the matching data in the right table. A right join is the opposite of a left join: it returns all of the data from the right table and only the matching data in the left table. The final join is a full outer join, which returns all the data in both tables. Let’s have a look at each join in more detail. Firstly, we will examine the data contained in our tables.

Table examples:

voter

[voter table image]

candidate

[candidate table image]
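The images above show the sample rows. As a rough sketch (the exact column definitions and values are assumed here, not taken from the images, and are chosen so that five voters match a candidate, Liam has no candidate, and one candidate receives no votes), the tables could be created and populated like this, using PostgreSQL-style syntax:

CREATE TABLE candidate (
    candidate_id INTEGER PRIMARY KEY,  -- primary key: uniquely identifies each candidate
    forename     VARCHAR(50)
);

CREATE TABLE voter (
    voter_id     INTEGER PRIMARY KEY,  -- primary key: uniquely identifies each voter
    forename     VARCHAR(50),
    candidate_id INTEGER REFERENCES candidate (candidate_id)  -- foreign key; NULL if no vote was cast
);

-- illustrative rows only; names other than Liam are made up
INSERT INTO candidate (candidate_id, forename) VALUES
    (1, 'John'), (2, 'Sarah'), (3, 'Paul');

INSERT INTO voter (voter_id, forename, candidate_id) VALUES
    (1, 'Alice', 1),
    (2, 'Bob',   2),
    (3, 'Carol', 1),
    (4, 'Dave',  2),
    (5, 'Emma',  1),
    (6, 'Liam',  NULL);  -- Liam didn't choose a candidate

With data like this, the joins below behave exactly as described: five voters match a candidate, Liam produces a null on the left join, and the candidate nobody voted for produces a null on the right and full outer joins.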

 

Inner Join

The first join we spoke of was the inner join. The inner join is used to return matching data in each table. Using the statement below on the tables we have created, we can display all of the people that voted for a candidate.

 

SELECT voter.voter_id
,voter.forename
,candidate.candidate_id
,candidate.forename
FROM voter
INNER JOIN candidate
ON voter.candidate_id = candidate.candidate_id
ORDER BY voter.voter_id
;

This table shows the results from the above statement:

[inner join results image]

The results from the inner join show us that only five rows from the voter table have matches in the candidate table.

[Venn diagram: inner join]

The above Venn diagram shows how the tables are connected when we perform an inner join: we only get the data which matches in both tables.

 

Left Join

The next join statement is the left join. This will provide us with all the data from the left table and only the connecting data in the right table.

 

SELECT voter.voter_id
,voter.forename
,candidate.candidate_id
,candidate.forename
FROM voter
LEFT JOIN candidate
ON voter.candidate_id = candidate.candidate_id
ORDER BY voter.voter_id
;

 

Results table from the above statement:

[left join results image]

The left join returns all of the data in the left table, which in this statement is voter, along with the matching data from the joined table. If there is no match for a row, null is returned; this is what has happened in row 6, as Liam didn’t choose a candidate.

[Venn diagram: left join]

This Venn diagram shows how a left join works: all of the data contained in voter is returned, as is the matching data in candidate.

 

Right Join

 

As you may have guessed, the next statement we will look at is the right join. This statement will result in the right table providing us with all the data it holds and then only displaying the data that is connected to it from the left table.

 

SELECT voter.voter_id
,voter.forename
,candidate.candidate_id
,candidate.forename
FROM voter
RIGHT JOIN candidate
ON voter.candidate_id = candidate.candidate_id
ORDER BY candidate.candidate_id
;

 

Results from the above statement:

[right join results image]

The SQL statement has been tweaked slightly: the ORDER BY has been changed to candidate.candidate_id to make the results more readable. The right join returns all of the data available in the right table (candidate) and only the matching data in the left table (voter); again, where a row in the right table has no match in the left table, null is returned for those columns.

[Venn diagram: right join]

This right join Venn diagram shows the opposite of a left join: all of the data in candidate is returned, along with the matching data in the voter table.

 

 

Full Outer Join

The last statement that will be explained now is a full outer join. This statement brings in all the data from both tables and connects the results together where it can.

SELECT voter.voter_id
,voter.forename
,candidate.candidate_id
,candidate.forename
FROM voter
FULL OUTER JOIN candidate
ON voter.candidate_id = candidate.candidate_id
ORDER BY voter.voter_id
;

 

Results from the above statement:

[full outer join results image]

With a full outer join, the results come from both tables and are matched together where possible. Where there is no matching data, the field again contains a null value.

[Venn diagram: full outer join]

And this is the last Venn diagram. It shows a full outer join: all of the data in both tables is returned, matched together where possible.

 

As you can see from the examples above, we have managed to join the data from two individual tables and link them together in a variety of ways to provide the results we required. The ability to join tables is a fundamental aspect of SQL and a common feature of the role of a database developer.

 

Sources: http://www.bbc.co.uk/news/technology-36558446

 

 


TPC Database Benchmarks

As most readers of data warehouse related blogs will no doubt know, the Transaction Processing Performance Council (TPC) define database benchmarks to allow comparisons to be made between different technologies.

Amongst said benchmarks is the TPC-H ad hoc decision support benchmark, which has been around since 1999, with the latest version being v2.17.1. This is the most relevant TPC benchmark for those of us focussed on data warehouse platforms.

Benchmark Scale

TPC-H benchmarks are executed at a given scale point ranging from 1GB up to 100TB:

  • TPCH1 = 1GB
  • TPCH10 = 10GB
  • TPCH30 = 30GB
  • TPCH100 = 100GB
  • TPCH300 = 300GB
  • TPCH1000 = 1,000GB/1TB
  • TPCH3000 = 3,000GB/3TB
  • TPCH10000 = 10,000GB/10TB
  • TPCH30000 = 30,000GB/30TB
  • TPCH100000 = 100,000GB/100TB

Benchmark Data

Data used to populate the TPC-H schema is generated using the DBGEN utility which is provided as part of the TPC-H download.

Database Schema

The schema used in the TPC-H benchmark is retail based and contains the following tables:

  • customer
  • orders
  • lineitem
  • part
  • partsupp
  • supplier
  • nation
  • region

Each table contains either a fixed number of rows or a number related to the scale factor in use.

Unsurprisingly, the ‘lineitem’ table contains the most rows, which is 6,000,000 x scale factor e.g. 600,000,000 rows for scale factor 100 (TPCH100).

Column names and data types are provided for all tables.

For those using MPP databases such as Teradata, Greenplum or Netezza, you’ll have to decide on your own data distribution strategy (a sketch is shown below).
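For example, on Greenplum the distribution strategy is declared per table with a DISTRIBUTED BY clause. A sketch for the TPC-H ‘orders’ table follows; the column definitions are paraphrased from the spec, so treat the exact types as indicative rather than authoritative:

CREATE TABLE orders (
    o_orderkey      BIGINT NOT NULL,
    o_custkey       INTEGER NOT NULL,
    o_orderstatus   CHAR(1),
    o_totalprice    DECIMAL(15,2),
    o_orderdate     DATE,
    o_orderpriority CHAR(15),
    o_clerk         CHAR(15),
    o_shippriority  INTEGER,
    o_comment       VARCHAR(79)
)
DISTRIBUTED BY (o_orderkey);  -- a unique key gives an even spread of rows across the segments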

Benchmark Queries

The queries that are executed against the populated schema are generated using the QGEN utility which is provided as part of the TPC-H download. Minor query modifications are allowed to take into account differences between DBMS products.

There are a total of 22 ‘SELECT’ queries (‘query stream’) that must be executed against the populated retail schema at a given scale factor. Each query corresponds to a business question. There are also 2 refresh functions (‘refresh stream’) that add new sales and remove old sales from the database.
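To give a flavour of the query set, this is roughly what Q1, the pricing summary report, looks like. It is paraphrased from the spec with the DELTA substitution parameter shown as 90 days; exact date arithmetic syntax varies slightly by DBMS:

SELECT l_returnflag,
       l_linestatus,
       SUM(l_quantity)                                        AS sum_qty,
       SUM(l_extendedprice)                                   AS sum_base_price,
       SUM(l_extendedprice * (1 - l_discount))                AS sum_disc_price,
       SUM(l_extendedprice * (1 - l_discount) * (1 + l_tax))  AS sum_charge,
       AVG(l_quantity)                                        AS avg_qty,
       AVG(l_extendedprice)                                   AS avg_price,
       AVG(l_discount)                                        AS avg_disc,
       COUNT(*)                                               AS count_order
FROM   lineitem
WHERE  l_shipdate <= DATE '1998-12-01' - INTERVAL '90' DAY  -- DELTA substitution parameter
GROUP  BY l_returnflag, l_linestatus
ORDER  BY l_returnflag, l_linestatus;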

Database Load Time

The elapsed time required to generate & load the data into the database must be recorded. The time taken to execute other supporting tasks such as table creation, index creation & statistics collection must also be recorded.

Performance Tests

Once the data is loaded the real fun and games can begin. The performance test consists of both single-user power and multi-user throughput tests.

The power test consists of the first refresh function followed by the 22 query set and lastly the second refresh function.

The throughput test consists of a minimum number of concurrent runs of the 22 queries (‘streams’), as determined by the scale factor:

  • SF1 = 2
  • SF10 = 3
  • SF30 = 4
  • SF100 = 5
  • SF300 = 6
  • SF1000 = 7
  • SF3000 = 8
  • SF10000 = 9
  • SF30000 = 10
  • SF100000 = 11

The throughput test is run in parallel with a single refresh stream.

The set of 22 queries is run in a sequence specified in Appendix A of the TPC-H guide and is dependent on the number of streams.

The timing of all queries is measured in seconds.

Benchmark Metrics

The metrics captured by the power and throughput tests are as follows:

  • composite query-per-hour (QphH@Size)
  • price/performance ($/QphH@Size)

Detailed explanations as to how these metrics are computed are available in section 5 of the TPC-H guide. There are some lovely equations in there for maths geeks to enjoy!

Benchmark Results

Vendors that submit results are required to provide an executive summary in addition to a full disclosure report. All of the shell scripts, SQL, log files etc are also provided as supporting files. There are no secrets here!

Benchmark performance results have been submitted at the following scales:

  • SF100 (100GB)
  • SF300 (300GB)
  • SF1000 (1TB)
  • SF3000 (3TB)
  • SF10000 (10TB)
  • SF30000 (30TB)
  • SF100000 (100TB)

Perhaps interestingly, the fastest system at all scale factors is currently the Exasol database running on Dell PowerEdge servers. We’ll let you peruse the TPC web site to see how they did it.

Benchmark Use

The TPC-H benchmark is primarily of interest to technology vendors to show us what they’ve got in their locker.

Here at VLDB we’re data consultants, not tech vendors, so why the TPC-H interest?

Well, it is possible to use a cut-down version of the TPC-H benchmark to assess the real-world capability of various database platforms.

Why believe the hype when you can test database performance with a consistent set of queries against a real-world(ish) schema and usable data?

Well, that’s exactly what we’ve been doing for several years now.

Specifically, we use a set of benchmark tests to assess cloud database performance. Also, as cloud platforms are said to suffer the ‘noisy neighbour‘ problem, we also run benchmark tests over extended periods to test the variance of database performance over time for cloud platforms.

Some of the results are interesting…very interesting 🙂

 


Primary Indexes and Distribution Keys

It’s All About Data Distribution.

As experts in Massively Parallel Processing (MPP), here at VLDB Solutions we talk regularly about ‘Primary Indexes’ (PI) and ‘Distribution Keys’ (DK). They are integral to the architecture of Teradata and Greenplum respectively, and the correct identification and use of them is ‘key’ to maximising the performance of both MPP systems. But how do they work?

Data Distribution

Before we examine each in detail, it is first important to understand how data is stored and accessed on an MPP system, and how the distribution of data helps a system achieve true parallelism.

Within an MPP system, data is partitioned across multiple servers (referred to as AMPs in Teradata, and Segments in Greenplum). These servers ‘share nothing’: they each process only their own share of the data required by a query, and do not process data located on other servers in the system. If data is not partitioned across all of the available servers, then those servers without data sit idle during the processing of a workload, and the full power of MPP has not been harnessed. The data is considered ‘skewed’, and therefore the query will be skewed too.

Primary Indexes and Distribution Keys are, as the name suggests, the key by which data is distributed across the servers. They are designated at a table level within the database, turning a column, or a selection of columns, into the key for each row of data.

If the Primary Index or Distribution Key for each row of data within the table is unique, then the rows are distributed across the servers in a ‘round robin’ manner. When the data is queried, each server has an equal share of the workload, and the system’s full power has been harnessed; the system has achieved true parallelism.

If the Primary Index or Distribution Key is not unique, rows with duplicate Primary Indexes or Distribution Keys are grouped on the same server. In a table where there are many matching key values, this can lead to skewed data and therefore skewed performance.

For example, if an MPP system had 10 available servers, but a table was created whose Primary Index / Distribution Key column contained only two distinct values (1/0, True/False, Yes/No, etc.), then that data would only be distributed to two servers, as the rows with matching key values would be stored together. When that data is queried, only those servers holding data can process the workload; the remaining eight servers sit idle.

Unique and Non-Unique Primary Indexes

As mentioned earlier, data distribution in a Teradata system is governed by a table’s Primary Index. A Primary Index can be declared as either Unique or Non-Unique, depending on the column, or selection of columns, chosen as the Primary Index. If a Primary Index is declared as a Unique Primary Index (UPI), then duplicate values are not allowed within the chosen column(s), and rows containing them will not be loaded to the system. However, the rows that are loaded will be distributed evenly across the AMPs, and the system will achieve parallelism when that table’s data is queried. If a Primary Index is declared as a Non-Unique Primary Index (NUPI), then duplicate values are allowed within the column(s), but rows sharing a value will be grouped on the same AMP when the data is distributed.
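As an illustration (not taken from the original post), the choice is made in the table DDL. A minimal Teradata sketch, reusing the voter example from the joins post above with assumed columns:

-- UPI: duplicate voter_id values are rejected; rows hash evenly across the AMPs
CREATE TABLE voter (
    voter_id     INTEGER NOT NULL,
    forename     VARCHAR(50),
    candidate_id INTEGER
)
UNIQUE PRIMARY INDEX (voter_id);

-- NUPI: duplicates are allowed, but all rows sharing a candidate_id land on the same AMP
CREATE TABLE vote_fact (
    voter_id     INTEGER NOT NULL,
    candidate_id INTEGER NOT NULL
)
PRIMARY INDEX (candidate_id);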

Distribution Keys and Distributed Randomly

The Distribution Key is how data is distributed on a Greenplum system. Unlike Teradata, the key is not declared as unique or non-unique; it merely is, or is not. Again, as with Teradata, table data with a unique Distribution Key is distributed ‘round robin’ across the Segments; and duplicate Distribution Key values are grouped together on the same segment.

However, Greenplum tables can also be designated as ‘distributed randomly’. In this case, column data is not used for distribution, and each row is distributed to the Segments in the same ‘round robin’ manner as when using a unique Distribution Key.
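A corresponding Greenplum sketch, again with assumed table and column names, showing both a hash Distribution Key and random distribution:

-- Distribution key: rows are hashed on voter_id across the segments
CREATE TABLE voter (
    voter_id     INTEGER,
    forename     VARCHAR(50),
    candidate_id INTEGER
)
DISTRIBUTED BY (voter_id);

-- Random distribution: no column is used; rows are spread round robin across the segments
CREATE TABLE reference_codes (
    code        INTEGER,
    description VARCHAR(100)
)
DISTRIBUTED RANDOMLY;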

How to Choose Primary Indexes and Distribution Keys

As should now be clear, correctly identifying which column, or selection of columns, to use as a Primary Index or Distribution Key is integral to the performance of an MPP system. In a Relational Database Management System (RDBMS), a table’s Primary Key is often a natural candidate to become the Primary Index / Distribution Key – a column of values uniquely identifying each row could easily be used as a UPI for best distribution.

However, there will be times when a table does not have a single column of unique values, but where ‘round robin’ distribution is still desired – ‘reference’ or ‘dimension’ tables, for instance. On a Greenplum system, this could be achieved by distributing randomly; but on a Teradata system, it would be necessary to identify a selection of columns, where the combination of data within those columns would be unique on a row-by-row basis.
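For instance, a minimal Teradata sketch of a composite Primary Index over hypothetical order_id and line_number columns, where neither column is unique on its own but the combination is:

CREATE TABLE order_line (
    order_id    INTEGER NOT NULL,
    line_number INTEGER NOT NULL,
    quantity    INTEGER
)
UNIQUE PRIMARY INDEX (order_id, line_number);  -- the column combination is unique per row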


Readings in Database Systems

Also known as the ‘Red Book‘, Readings in Database Systems has been published since 1988. The 5th edition was published in 2015 after a 10 year hiatus. Quite ironic really that yours truly started his graduate IT career in 1988 on, you guessed it, database systems!

The Red Book contributors are Peter Bailis (Stanford Future Data Systems group), Joe Hellerstein (Professor of Computer Science at the University of California, Berkeley) and Michael Stonebraker (Professor at MIT). Although these chaps certainly know a thing or two about databases, it’s always interesting to read the academic take on the database world.

Preface

What caught the eye in the preface was the statement that ‘Cloud computing and microarchitectural trends has made distribution and parallelism nearly ubiquitous concerns‘. This is against a backdrop of a massive decrease in compute/storage costs and the increase in data volumes under management with the rise of ‘Big Data‘.

For those of us that have been Teradata developers since the 1980’s, data distribution and parallelism have been with us for almost 30 years. Good to know the rest of the world is finally catching up!

Chapter 1 : Background (Stonebraker)

The key theme here is that the Map-Reduce market has become an HDFS market, which ‘seems poised to become a relational SQL market‘. In a similar vein, ‘New data models have been invented, only to morph into SQL on tables‘.

We’ve been calling this out for years, a lot of Map-Reduce/Big Data/Hadoop efforts are leading inevitably to ‘SQL-on-Hadoop’, which is essentially a parallel database capability…which we’ve had for 30 years.

Chapter 3 : Techniques Everyone Should Know (Bailis)

As discussed on a previous VLDB blog post, it’s all too easy to overwhelm database optimisers, such that ‘modern query optimizers still struggle with large plans’. Wise words from Mr Bailis.

Most folks don’t get bitten hard enough, or often enough, by bad query plans to be too bothered by the query optimisation challenge faced by all database optimisers. The fallout from a bad query plan becomes all too apparent as data volumes rise. A bad query plan against a few gigabytes of data is not the same as a bad plan when petabytes of data are in play. Testers take note!

As Teradata’s Stephen Brobst memorably quipped at the Teradata User Group (TUG) in London a few years ago: ‘With Hadoop you’re the optimizer. Good luck with that‘. Well said.

The folks at Pivotal are noted to have been addressing the query optimisation challenge with Orca. Now if only someone would make those Greenplum explain plans easier to read. Ho-hum.

Chapter 4 : New DBMS Architectures (Stonebraker)

A favourite YouTube video that always raises a chuckle in the VLDB office pokes fun at the NoSQL crowd. However, we tend to agree that the SQL and NoSQL crowds will merge over time. Beats piping data to /dev/null.

Chapter 5 : Large-Scale Dataflow Engines (Bailis)

The ongoing convergence between the ‘old’ (RDBMS-based data warehouses) and the ‘new’ (Hadoop) supports our belief that “there’s nothing new under the sun”.

Bailis points out that systems built from the Hadoop ecosystem have ‘come to resemble traditional data warehouses‘. Furthermore, ‘post MapReduce systems have come to implement a growing proportion of the feature set of conventional RDBMSs‘.

This is an ongoing trend that will ultimately lead to the ‘Hadoop/Big Data’ open source ecosystem delivering, amongst other things, MPP database capability. Yes, 30 years later than when it first came into being, but that’s another story.

Chapter 6 : Weak Isolation & Distribution (Bailis)

The author points to survey results that state that only 3 out of 18 SQL & NewSQL databases provided serialisability by default, and that 8 did not offer serialisability at all. This is characterised as a ‘race to the bottom‘ amongst vendors.

With the rise of Hadoop and all things Big Data, it is noted that ‘weak isolation has become even more prevalent‘. Ouch.

Chapter 7 : Query Optimisation (Hellerstein)

Ah, query optimisation, one of our favourite topics. No seriously!

Query optimisation is quite rightly portrayed as ‘one of the hardest parts of a DBMS to implement well‘, and query optimisers remain a ‘clear differentiator for mature commercial DBMSs‘.

These observations resonate well with the team at VLDB.

We took the prowess of the Teradata optimiser somewhat for granted during the 1980s/1990s/early 2000s. It wasn’t until the new wave of PostgreSQL-derived MPP systems (you know who you are!) came along in the early/mid 2000s that we realised how wrong it can all go when the query optimiser struggles to come up with sensible execution plans. Sub-optimal execution plans don’t really matter until you get hundreds of millions, billions and now trillions of rows.

One of the points we’ve been calling out for what seems like ages is that ‘the “big data” community…has been far too slow to focus on query optimization‘.

The author goes further, stating ‘that even when the community started building SQL interfaces like Hive, query optimization remained a little-discussed and poorly-implemented topic. Maybe it’s because query optimizers are harder to build well than query executors‘. Shots fired!

<chuckle>However, it is noted that ‘there are efforts underway to start building at least a 1980s-era optimizer‘. Green screens ahoy!</chuckle>

Our friends over at Pivotal are noted for their work on the Greenplum Orca optimiser.

Query optimisation *really* matters when large and/or complex datasets are in play. Unfortunately, the challenge of query optimisation is known to be ‘NP hard‘. Query optimisation is even harder in a distributed/clustered MPP architecture. Damn that linear scalability.

The low hanging fruit of NoSQL/no ACID/parallelism etc. is precisely that: low hanging. Like the challenge of workload management, query optimisation is very much further up the tree.

Ever wondered why not all databases are available as an MPP version? Well, query optimisation is right at the top of the list of the challenges faced when a general purpose DBMS tries to morph into a fully-formed MPP version. It’s not easy, not at all, no sir. Just ask Microsoft or Oracle.

As an aside, Teradata deserve serious plaudits for how good their SQL query optimiser is, and has been for so long. Teradata explain plans are also extremely user-friendly. Go Teradata!

Chapter 8 : Interactive Analytics (Hellerstein)

The author calls out the ‘buzzword bingo‘ that has seen the analysis of data range from being called ‘“Decision Support Systems” (DSS), to “Online Analytic Processing” (OLAP) to “Business Intelligence” (BI) to “Dashboards” and more generally just “Analytics”‘. What, no data science? Oh, wait, that’s just analytics carried out in Silicon Valley, remember? Silly moi.

Pre-computation and sampling are offered as two methods for making a query ‘run in less time than it takes to look at the data‘.

The invention of the term ‘OLAP’ is assigned to Ted Codd, whose (in)famous promotion of Essbase is described as ‘not one of Codd’s more scholarly endeavours‘. More shots fired.

The assertion that ‘pre-computing a full data cube does not scale well‘ is well documented. In essence, this is a sparse matrix problem.

It is noted that ‘OLAP style pre-computation has underpinned what is now a multi-billion dollar BI industry‘.

On the other hand, approximate query processing (sampling) attempts by IBM, Informix and Microsoft all failed because ‘database customers won’t tolerate wrong answers‘. What a picky bunch those database users must be.

Chapter 9 : Languages (Hellerstein)

The academics had done a great job of keeping us pesky out-in-the-field types on board until we came across the claim that ‘it is rare for people to interact directly with a database via a language like SQL unless they are coding up a database-backed application‘.

We take this to mean that only IT-folks building an application will ordinarily use SQL to interact with a database. If that is the correct interpretation, which seems likely, then it is 100% incorrect/inaccurate/wrong, and has been since, well, the 1980’s.

There are very few client sites, if any, where we haven’t witnessed business users/analysts writing and submitting SQL against Teradata, Netezza, Oracle, SQL Server, DB2 or Greenplum.

We have trained hundreds of business users to write SQL. We even offer a dedicated SQL training course for business users.

More than one retail bank in the UK has hundreds of business users that submit SQL via Teradata’s BTEQ client in batch mode using JCL submitted from an IBM mainframe. Think about that. Ugh, poor souls.

It boils down to a simple truism – big companies often have sufficient data that only an MPP class database will suffice, and only SQL runs in parallel against said MPP databases.

The upshot? Lots of folks at lots of big companies write SQL, which they run directly against their company’s MPP database. Sometimes business users even write highly efficient Really Neat SQL (RNSQL).

Chapter 11 : A Biased Take on a Moving Target: Complex Analytics (Stonebraker)

The opening gambit here is that ‘the questions asked by today’s data scientists…represent a very different use case from the traditional SQL analytics run by business intelligence specialists‘.

Furthermore, ‘data science will completely replace business intelligence…since it represents a more sophisticated approach to mining data warehouses for new insights‘.

Data science is characterised as ‘predictive modeling, machine learning, regressions, … and not SQL analytics‘.

This is kinda interesting given that activities such as predictive modeling and regressions have been widely used for decades. Machine learning is a newer kid on the block, certainly, but the whole ‘data science thang’ smacks a bit of the Emperor’s new clothes. Sorry to disappoint.

The author considers it a ‘sad truth‘ that ‘Most data science platforms…operate on file data‘ but that ‘…data scientists over time wish to use DBMS technology‘.

OK, so we want to bring the data science (R) to the database? How about being able to run R in parallel inside an MPP database? Sounds too good to be true? Oh ye of little faith…take a look at PivotalR – voila!

PivotalR turns the R code into SQL that runs in parallel inside the Greenplum MPP database. This should hopefully address the concern that ‘most analytics codes…fail to scale to large data set sizes‘.

Interestingly, Hellerstein disagrees with Stonebraker that advanced analytics (data science) will replace BI. As he points out, ‘SAS users are not database users‘. We side with Joe on this one. Sorry Mike.

It is interesting to note that Hellerstein suggests that ‘the advanced analytics community is highly biased towards open source‘.

Open source Greenplum & R anyone?

Chapter 12 : A Biased Take on a Moving Target: Data Integration (Stonebraker)

The last chapter of the Red Book covers the large topic of ‘data integration’. Last, but by no means least – data integration is where team VLDB earns most of its corn.

It is largely the case that data integration ‘began with the major retailers in the 1990s consolidating their sales data into a data warehouse’. To back this up, the early Teradata adopters in the UK include retailers such as Littlewoods and GUS (now combined into ShopDirect), Grattan, JD Williams (N Brown), Asda and Tesco.

Following on from the retailers, ‘all enterprises…organized their customer facing data into a data warehouse‘. In our MPP-centric world, this covers the banks and telecoms companies, in particular.

One of the greatest truisms we’ve witnessed first hand is that ‘an upfront global schema is incredibly difficult to construct for a broad domain‘.

We’d go further and describe this challenge as ‘impossible’.

This issue can be so contentious at large organisations that we’ve even been called in to referee the bun-fight between the enterprise modellers and the architects. You know who you are!

Data integration is described as ‘fundamentally difficult‘ both because ‘data is dirty‘ and ‘deduplication is hard‘. Wise words indeed. Thankfully for team VLDB, the truism ‘where there’s muck there’s brass‘ is wholly in play when it comes to the data integration challenge.

Step 5 of the data integration/curation journey is described as consolidation/de-duplication, in which ‘Mike Stonebraker and M.R. Stonebraker must be consolidated into a single record’.

This consolidation challenge is known by various names such as ‘name & address matching’, ‘customer de-duplication’ or ‘single customer view’ (SCV).

We have been heavily involved with SCV processing since the 1980s. In our opinion, SCV is by far the most complex logic we ever deploy as part of the myriad ETL processes that keep the data warehouse fed and watered with deltas.

SCV processing looks simple on the surface, doesn’t it? Try and build a half-decent name/address matching process in SQL and you’ll soon wish you hadn’t started. Either that or you’ll pretend what you’ve built in SQL is fit for purpose whilst knowing the results are terrible.

Data is dirty, remember?

Note to self – write an SCV blog post some time.

And that’s it folks – our take on the most excellent Red Book 5th edition. Well done for reading this far.

A big thanks to Peter Bailis, Joe Hellerstein & Michael Stonebraker for providing such excellent material, at such an excellent price 🙂

Enjoy!


Google Cloud Platform

Google Cloud Platform Review: Head In The Clouds

 

Over the last decade, cloud computing services have skyrocketed as people push them to their limits and beyond to find more advantageous uses for them. Providers such as Amazon Web Services (AWS), Microsoft Azure and VMware are constantly refining their own services and providing users with better means to their end. But in recent years, one service in particular has provided cloud computing that boasts both flexibility and performance to a high degree: the Google Cloud Platform (GCP).

 

Flexibility & Performance

 

The Google Compute Engine (GCE) is the main component offered by the Google Cloud Platform that we are interested in. This service allows the general public to create virtual machines (VMs) with an excellent range of selections and prices for the general hardware required for a VM.

 

Some of the more common hardware is of course CPU and RAM. When creating a new VM, the Google Compute Engine offers a list of fixed values to choose from. But once chosen, you then have the freedom to slide both RAM and CPU up or down as required, which is a useful benefit to everyone.

 

Potentially the most important hardware for a VM are the disks. Disks can be chosen from either HDDs or SSDs, both scaling in IOPS (Input/Output Operations Per Second) as you increase the size of the disk, maxing out at approximately 10,000 read IOPS and 15,000 write IOPS for a 350GB (or higher) SSD.

 

You can also choose whether or not the disk created is persistent. This means that all of your data on your disks will be protected, and as a result will persist through system restarts and shutdowns. Should you not require your VM online at all times, then you can shutdown your server for a period and you will not be charged the full price of the VM during this time. Persistent disks therefore not only offer you security of your data, but also additional benefits that might not be seen with some other cloud computing services.

 

One other option available for disks is the local SSD. This is an SSD that is physically attached to a VM, which in turn offers superior performance over the other available disk options. However, this increased performance does have certain trade-offs.

 

  • First, a local SSD can only be created when the VM itself is initially created. Thus it is not possible to add a local SSD to an existing VM.
  • Second, when creating a local SSD there is [currently] a fixed size of 375GB. This may be more than enough for certain situations, but for others this can be very limiting indeed.
  • Third, and potentially the biggest downside, a local SSD cannot be persistent. This means that any type of system fault could result in all data held on a local SSD being lost. It also means that the VM cannot be shut down, which removes the benefit of temporarily turning off the VM when it isn’t needed.

 

As a result of the options available, the flexibility regarding both performance and price is highly suitable for anyone. It means that you can truly specify a system to your requirements, depending on both the performance and price range that you are looking for.

 

Testing The Platform

 

With all of the options available, the Google Compute Engine sounds like a great environment to build your new VMs in. But how easy is it to set up a VM in the Google Compute Engine? How does it fare for the tasks that you require? To try and answer these basic, but very important, questions, we have installed a Greenplum multi-node cluster within the Google Compute Engine using Pivotal’s Greenplum database version 4.3.7.1 and servers provisioned with a CentOS 7.2 image.

 

The very simple reason for using Greenplum as a test is that it checks all of the boxes that would generally be required for a day to day server. Basic system processes can test the general performance of disks, CPU and RAM. By running a set of TPC-H queries on loop over a few days, it is possible to also see how daily traffic may, or may not, affect the performance of the servers.

 

Furthermore, a Greenplum database requires reliable broadcast capability on the network between the multiple VMs, without any interference. When initially looking into the networking capabilities of VMs in the Google Compute Engine, various posts made it appear that running a Massively Parallel Processing (MPP) Greenplum instance would be difficult, if possible at all. This was therefore essentially make or break for the initial testing stages.

 

Google Compute Engine Cluster Setup

 

Setting up a VM using the Google Compute Engine is relatively straightforward. Once you are logged in, simply go into the Google Compute Engine and click Create Instance in the VM instances tab. From here you can select the name of the instance (it is important to note that this is the host name of the server as well), CPU and RAM values, boot disk options (including the OS image as well as the boot disk size and type), and optional extras (such as additional disks and network parameters).

 

That’s all there is to it! Once you’re happy with the VM details, click Create and the Google Compute Engine will provision it over a respectably short period of time. Once provisioned, it is possible to click on the VM to get a more detailed overview if required. From this view you can also choose to edit the VM and alter most settings.

For our test cluster, we provisioned three VMs, each consisting of 2 CPUs, 4GB RAM and one standard persistent boot disk with 15GB space. The master VM has an additional standard persistent disk with 200GB space (which will be used to generate and store 100GB of TPC-H data), whilst the two segment servers each have an additional SSD persistent disk with 250GB space (for the Greenplum database segments).

 

Disk Performance

 

It isn’t always an initial thought that disks can often be a limiting factor in server performance, especially when it comes to databases. Shovelling on more CPUs and RAM may help in part, but there is always an upper limit and disks can often impact that limit.

For a provisioned 250GB SSD disk in the Google Compute Engine, one would expect to achieve approximately 7,500 random-read IOs per second (IOPS), which would be a very welcome sight for most database servers. But using the exceptional disk performance measuring tool FIO, it was in fact a disappointment to find that the approximate random-read IOPS performance seen on both of our SSDs was closer to 3,400, regardless of using a range of different options available with the FIO tool to try and increase this.

 

Similar testing on a separate VM provisioned with a 375GB local SSD returned similar disappointing results.

 

Network Configuration

 

The most important task is to configure the network, as this is essential for Greenplum to run correctly. By default, no user has external or internal SSH access to any server. Whilst generating SSH keys for each individual user (using PuTTYgen) and then applying them to each VM via the Google Compute Engine is relatively straightforward, this only allows SSH access to a VM from an external host.

 

Setting up SSH access between the VMs themselves, which for Greenplum is the more important aspect, is a relatively simple task. First you need to manually generate an RSA key on each server for each user using the command ssh-keygen -t rsa from the command line (the key will appear in ~/.ssh). Then, share each generated key for each user between all servers (via SCP to and from an external host) and finally paste all keys into the ~/.ssh/authorized_keys file on all servers.

 

With this task complete and successful, not only is the most tedious part of the server setup out of the way but it is also a relatively straight forward procedure from here on to get Greenplum up and running.

 

Greenplum Installation

 

With the network setup as required, all that remains is the system configuration options and Greenplum software installation, which holds little to no complications. Once these additional tasks were complete, it was a single and simple command to successfully initialise the Greenplum database across the cluster.
With the database up and running and a 100GB TPC-H data set generated, it was possible to load and query the data without any issues.

 

TPC-H Testing

 

With the data loaded, a continuous loop of the 22 TPC-H queries was run against the data over several days. One thing we specifically looked for was the standard deviation percentage in query times for individual queries. Impressively, this averaged 2% across all queries, with the maximum being 7%. From this we concluded that daily traffic did not noticeably interfere from one person’s server to the next.
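As an illustration of the check we ran, assuming the timings were loaded into a table such as query_timings(query_id, run_id, elapsed_seconds) – the table and column names here are hypothetical, not part of the TPC-H kit – the per-query variation can be computed with something like:

SELECT query_id,
       AVG(elapsed_seconds)                                       AS avg_secs,
       STDDEV_SAMP(elapsed_seconds)                               AS stddev_secs,
       100 * STDDEV_SAMP(elapsed_seconds) / AVG(elapsed_seconds)  AS stddev_pct  -- variation as a % of the mean
FROM   query_timings
GROUP  BY query_id
ORDER  BY query_id;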

 

Less impressively however, the TPC-H tests once again showed that the Google Compute Engine wasn’t quite as performant as it boasts, as it returned an average time of 618 seconds per query, whilst an almost exact replica of the server (regarding Greenplum setup and hardware) on a different cloud provider returned an average time of 374 seconds per query.

 

Overall:

 

It is easy to say that the Google Cloud Platform is a flexible and reliable cloud computing service, with options available that are more than capable of performing most tasks. Upgrading and scaling out your server(s) is essentially as quick as you can click, meaning you are never truly limited with regards to performance, and ultimately an ever growing server can easily meet ever growing needs.

However, with all of the benefits that can be utilised, the advertised server performance appears rather more optimistic than the reality we observed. This may not be a problem for the smallest of requirements, but it could most definitely prove to be a downside for larger requirements. Scaling out your server(s) is of course an option, as mentioned above, but how far are you willing to go?


Google Cloud Conference Attended By VLDB Solutions

Google Cloud Platform (GCP) Conference 2016

A few weeks ago now, VLDB attended the Google Global Cloud Conference at Pier 48, overlooking the Giants’ ballpark in San Francisco. The user conference was a platform to introduce products and services to Google’s current community and convince the rest of us that Google is the answer. Google isn’t yet seen as a market leader in ‘Cloud’.

In a recent research note, Deutsche Bank investment analysts predicted that GCP is on a $400M run rate, roughly one twentieth of AWS’s. But it’s not about your revenue, it’s about the smarts behind the solution. Google came from behind to win ‘Search’, and after this conference I think Google might be about to rock the cloud computing world.

So, after two days of talks from clients and senior Google engineers and developers – what did I learn about the new stuff coming out of Google HQ?

 

Machine Learning

One of the biggest concepts I’ve taken away from GCP Next 2016 is the focus on Machine Learning. As Google adds a Machine Learning solution to its product portfolio it aims to go head to head with other companies trying to claim the IaaS Machine Learning crown such as IBM, AWS and Azure. Google’s Machine Learning solution promises to take care of everything from data ingestion through to prediction. There is also the ability to use Google APIs such as speech, translate and vision. All three are excellent examples of the brilliance created by Alphabet this year for the masses to enjoy.

 

Cloud Machine Learning

 

Big Query
BigQuery, GCP’s analytics data warehouse, is one of the core products in the Google portfolio. A price drop of 50% was announced for data stored for more than 90 days, automatic table partitioning makes data management simpler, and the Capacitor storage engine increases SQL query performance.
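As a rough illustration of how partitioning is used (the dataset, table and column names below are hypothetical, and exact quoting depends on whether legacy or standard SQL is in play), BigQuery’s ingestion-time partitioned tables expose a _PARTITIONTIME pseudo-column that lets a query scan only the partitions it needs:

SELECT event_type,
       COUNT(*) AS event_count               -- hypothetical columns, for illustration only
FROM   mydataset.events                      -- hypothetical ingestion-time partitioned table
WHERE  _PARTITIONTIME >= TIMESTAMP('2016-01-01')
  AND  _PARTITIONTIME <  TIMESTAMP('2016-04-01')  -- only the Q1 2016 partitions are scanned
GROUP  BY event_type;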

 

Big Query

 

Open Source
Google joined the Open Compute Project this year and has open sourced or published much of its internal technology, including the research papers behind MapReduce (which inspired Hadoop) and Spanner, the container management platform Kubernetes, and data and machine learning tools such as Dataflow and TensorFlow.

 

Container Engine

Tensor Flow

Cloud Dataflow

 

Customers

Obviously at any conference you expect clients to be paraded in front of attendees, but this event hosted an impressive collection of companies. These included Disney Consumer Products and Interactive Media, Domino’s, Spotify, Coca-Cola, Heineken and Best Buy.

 

Take the tour

If you are going to look around a data centre, what better one than the world’s most advanced – with its Google-coloured heating and cooling pipes, this is an interesting look behind the scenes:

 

[YouTube video]

 

In short, this was a great conference attended by grown-ups who are eager to see the ‘NEXT’ big thing in cloud computing that will take their organisation/clients to new heights.

 

Google might not be the market leader at the moment but the potential is definitely there – watch this space.


Teradata on AWS

The availability of Teradata’s eponymous database on both Amazon’s AWS and Microsoft’s Azure was announced in late 2015, with availability promised for Q1 2016. Lo and behold, the Teradata DBMS was duly available via the Amazon Marketplace a few weeks ago, just squeaking into the promise of being available in Q1 2016. Many thanks to Teradata’s own Mike Whelan for supplying the URL which yours truly was seemingly unable to find without help. Doh!

At the time of writing there is no sign of Teradata on Azure, but hey-ho, at least Teradata on AWS is available.

EC2 Configurations

OK, so Teradata is available on AWS, but what configurations are available?

Well, at least initially, Teradata is only available on AWS as a single node SMP server. Multi-node MPP availability is slated for Q4 2016. Teradata MPP via the public cloud is no doubt a bigger challenge than Teradata SMP via the public cloud. There are no doubt technical, support, product marketing and pricing challenges to overcome.

Within the single node Teradata setup there are various AWS instance types and Teradata database options available.

The following EC2 instance types can be deployed:

  • i2.xlarge – 4 vCPU, 30.5 GB RAM, 4 AMPs, 0.8 TB SSD ephemeral storage
  • i2.2xlarge – 8 vCPU, 61 GB RAM, 8 AMPs, 1.6 TB SSD ephemeral storage
  • i2.4xlarge – 16 vCPU, 122 GB RAM, 16 AMPs, 3.2 TB SSD ephemeral storage
  • i2.8xlarge – 32 vCPU, 244 GB RAM, 32 AMPs, 6.4 TB SSD ephemeral storage
  • d2.xlarge – 4 vCPU, 30.5 GB RAM, 3 AMPs, 6 TB HDD ephemeral storage
  • d2.2xlarge – 8 vCPU, 61 GB RAM, 6 AMPs, 12 TB HDD ephemeral storage
  • d2.4xlarge – 16 vCPU, 122 GB RAM, 12 AMPs, 24 TB HDD ephemeral storage
  • d2.8xlarge – 32 vCPU, 244 GB RAM, 24 AMPs, 48 TB HDD ephemeral storage
  • m4.4xlarge – 16 vCPU, 64 GB RAM, 20 AMPs, 5 TB or 20TB EBS block storage
  • m4.10xlarge – 40 vCPU, 160 GB RAM, 20 AMPs, 5 TB or 20 TB EBS block storage

In simple terms, there are 10 different EC2 instance types available with various amounts of vCPU/RAM/storage, storage types and IO bandwidths. Teradata have pre-configured the number of Teradata database AMPs available on a given EC2 instance type, as you might imagine.

Teradata database parallelism (not performance) starts at a lowly 3 AMPs with the d2.xlarge EC2 instance and peaks at 32 AMPs with the i2.8xlarge instance.

Perhaps the key point for those not overly familiar with AWS is that ephemeral disk is faster than EBS due to the fact that it is non-persistent. No, that’s not a typo. AWS ephemeral storage is non-persistent. Delete the instance, power off the instance etc and your disks and data are gone. Simple as that. Ephemeral disks are persistent for the life of an EC2 instance, no more, no less. Interesting times ahead, no doubt, to see how Teradata types deal with the notion of non-persistent storage. Only time will tell.

All is not lost for those picky individuals that would like their Teradata database tables stored somewhere persistent. The m4 EC2 instance types – specifically m4.4xlarge and m4.10xlarge – do offer persistent EBS block storage. Phew!

Teradata DBMS Options

In addition to the above EC2 instance choices, there are several Teradata database versions available:

  • Teradata Database Developer – Teradata DBMS, Teradata Columnar, Teradata Parallel Transporter (TPT), Teradata Studio, Teradata Tools and Utilities (TTU) with a free 30 day trial
  • Teradata Base Edition – Teradata DBMS, Teradata Columnar, Teradata Parallel Transporter (TPT), Teradata Studio, Teradata Tools and Utilities (TTU)
  • Teradata Base+ Edition – Teradata DBMS, Teradata Columnar, Teradata Parallel Transporter (TPT), Teradata QueryGrid, Teradata Studio, Teradata Tools and Utilities (TTU), Teradata Active System Management (TASM)

In all cases the latest Teradata DBMS v15.1 is deployed.

Teradata DBMS support is available via Teradata’s Premier Cloud Support, except during the 30 day Teradata Database Developer free trial.

Teradata on AWS Pricing

Cut to the chase, what does it cost!!!

Well, pricing consists of two elements – the Amazon EC2 instance cost and the Teradata DBMS software cost. You can’t have one without the other. In this case the Amazon EC2 instances are taking the place of the Teradata-supplied server, networking & storage hardware plus the SUSE Linux operating system that the Teradata database usually runs on.

Prices also vary by region. The figures below are $USD and do not take into account storage or data transfer. These figures are for EC2 instances hosted in Ireland and Teradata database software only. A 2% discount is available on Teradata software (not EC2 instances) for those willing to buy annual contracts. Storage is $0.11 per GB/month of provisioned capacity.

At the time of writing, EC2 + Teradata software pricing is as follows:

EC2 instance   EC2 $/hr   EC2 $/month   Teradata edition   DBMS $/hr   DBMS $/month   EC2+DBMS $/hr   EC2+DBMS $/month
i2.xlarge      0.938      685           Base               0.66        482            1.60            1,167
i2.2xlarge     1.876      1,369         Base               1.32        964            3.20            2,333
i2.4xlarge     3.751      2,738         Base               2.63        1,920          6.38            4,658
i2.8xlarge     7.502      5,476         Base               6.58        4,803          14.08           10,280
d2.xlarge      0.735      537           Base               0.44        321            1.18            858
d2.2xlarge     1.470      1,073         Base               0.88        642            2.35            1,716
d2.4xlarge     2.940      2,146         Base               1.75        1,278          4.69            3,424
d2.8xlarge     5.880      4,292         Base               5.25        3,833          11.13           8,125
i2.xlarge      0.938      685           Base+              1.01        737            1.95            1,422
i2.2xlarge     1.876      1,369         Base+              2.01        1,467          3.89            2,837
i2.4xlarge     3.751      2,738         Base+              4.03        2,942          7.78            5,680
i2.8xlarge     7.502      5,476         Base+              10.07       7,351          17.57           12,828
d2.xlarge      0.735      537           Base+              0.67        489            1.41            1,026
d2.2xlarge     1.470      1,073         Base+              1.34        978            2.81            2,051
d2.4xlarge     2.940      2,146         Base+              2.69        1,964          5.63            4,110
d2.8xlarge     5.880      4,292         Base+              8.07        5,891          13.95           10,184
m4.4xlarge     1.056      771           Base               1.75        1,278          2.81            2,048
m4.10xlarge    2.641      1,928         Base               5.25        3,833          7.89            5,760
m4.4xlarge     1.056      771           Base+              2.69        1,964          3.75            2,735
m4.10xlarge    2.641      1,928         Base+              8.07        5,891          10.71           7,819

Don’t forget – these are base costs for EC2 instances + Teradata software in Ireland only. Storage and data transfer costs are not included. All prices are in USD, as are payments.

Teradata on AWS Summary

<sales pitch>
Teradata via the public cloud is ‘a good thing’ for those of us that make a living out of selling Teradata Professional Services. Yay!
</sales pitch>

It is no surprise that Teradata on AWS is initially only available as a single SMP node. Fingers crossed for the Teradata MPP version later in 2016.

Within the various EC2 instance types there are a wide variety of CPU/RAM/storage/AMP combinations. Most folks will sooner or later figure out which works for them. All Teradata apps should work no matter which EC2 instance type is deployed. ‘Teradata is Teradata’, remember?

From a data persistence perspective, only the EBS-backed m4 instance types seem to make sense, to us anyway.

It’s often a tough sell moving corporate data to the public cloud. Moving corporate data to non-persistent storage in the public cloud is a further leap away from what most folks are comfortable with.

The Teradata Base Edition is likely to be sufficient in most cases. The cost premium for Teradata Base+ Edition is essentially a 50% price hike for Teradata Active System Management (TASM). The uninitiated are unlikely to get enough benefit from TASM to justify the premium.

There are two distinct markets for Teradata on AWS – existing Teradata customers, and those that have never used Teradata.

Existing Teradata customers may find value in single node Teradata instances on AWS that can be used for development, QA, or for temporary ‘burst’ capacity. Maybe even for HA for a subset of important data. Who knows? It seems unlikely that the availability of single node Teradata instances on AWS will cannibalise much of Teradata’s existing market.

Of more interest to Teradata’s executives, investors & industry analysts, no doubt, will be whether Teradata on AWS, and other public clouds in the future, opens up new routes to market for Teradata. This is the ‘last roll of the dice‘ alluded to previously.

This is where the outcome seems less clear. For all Teradata’s strengths, and there are many, the main single node differentiator over other DBMS products is intra-node parallelism. Scaling out to exploit Teradata’s linear scalability via an MPP cluster is not yet available.

Will parallelism within a single node prove enough of a draw to tempt folks away from the likes of SQL Server, Oracle and MySQL? Databases are very ‘sticky’ products and the case for moving from one single node SMP database to another is rarely compelling. A stampede away from SQL Server, Oracle, MySQL etc. seems unlikely.

Is the future ability to scale out when MPP becomes available enough to tempt folks to make the switch to Teradata SMP now? Again, this seems unlikely.

The ‘full fat’ multi-node MPP version of Teradata via AWS/Azure/etc promises to be far more interesting. How Teradata MPP via the public cloud will compete with other cloud enabled MPP databases such as Amazon’s own Redshift, and the recently open sourced Greenplum, will be interesting, very interesting.

Teradata’s journey to public cloud has started, albeit with a ‘single node on AWS only’ baby step.

The rest of 2016 promises Teradata on Azure and ultimately Teradata MPP in the cloud. Hurrah!


Privacy, Big Data and IoT

Big Data And Privacy

Hmm, privacy, where to start?

Over the long Easter weekend I met up with an old friend and his partner. They’ve lived in Spain for a couple of years now, and it’s a rare treat when I get to catch up with them. After a couple of drinks, some surprisingly good pub grub and a couple of hours reminiscing over our nerdy school years, we moved onto the present and got chatting about our jobs.

From there, it wasn’t long before we stumbled into the conversation I so often find myself having with people outside of my field; data, and the looming demise of privacy that comes with it.

That’s right, I’m the guy who spends my free time talking about data. In the pub, no less. For shame.

It was quickly revealed that both my friend and his partner have strong feelings on the subject. Google, selling their search history? Facebook, shoving adverts in their face for stuff they’d searched for? What gives them the right? Who do they think they are??? These are the surface concerns that anyone who has ever given it any consideration has faced. Until, that is, you realise that nobody in their right mind would care about the average person’s browser history beyond flogging them a product or two. You’d be amazed how useless all those cat videos you’ve watched are to any shady spy agency.

Now is probably the time to add that during our conversation, my friend updated his Facebook ‘checking in’ at the pub, and his partner tweeted three times (one of which was a photo of her food, a particular pet peeve of mine). Both own expensive looking smart phones, with presumably constant data connections, and she was wearing an activity tracker bracelet/watch, the app for which she updated religiously with every nibble or sip of drink she ingested. I couldn’t bring myself to point these things out at the time, as I withstood the storm of their data collection based ire, but I decided to really upset them by explaining the basics of a subject that has recently piqued my interest; the Internet of Things.

There is a good chance that the fact that you’re here and reading this means you’ve at least heard the term before, but I’m going to explain it a little just in case, so bear with me.

The concept is simple; the Internet of Things (IoT) is the collection of electronic, network-connected devices that gather and share data between themselves. Take for example the activity tracker. Heart rate, steps taken, blood pressure, even sleep patterns are recorded by sensors in the wristband and reported to the app. The app identifies patterns and makes suggestions to the user on anything from exercise plans to ideal bed times. It requires no active input from the user, beyond putting on the watch, and it collects data on everything they do throughout the day.

Think a little bigger and you have the GPS in a car, talking to various systems in and around the vehicle, giving directions, monitoring fuel levels, warning of traffic jams or accidents ahead. Even the roads themselves can provide data to the cloud, which can be fed to receiving devices. Smart Concrete is laced with sensors that report on the integrity of the concrete itself, identifying weak spots or cracks, and can even detect ice or heavy rainfall to advise drivers of hazards on their route.

Entire cities are becoming ‘Smart’. Amsterdam, Barcelona and Stockholm are just a few of the flagship cities pioneering Smart City initiatives, which allow the management of inner-city traffic and street lighting, and even the monitoring and reporting of available parking spaces.

Do Technology and Privacy Necessarily Collide?

When machines greedily gather data on every element of modern life, from your whereabouts and shopping habits right down to how well you slept last night, how can anyone expect to keep anything private?

Unfortunately, our desire for privacy as a society is far outweighed by our reliance on the very machines that deprive us of it. My friend expressed concern at being ‘tracked’ when his phone’s GPS was active, but readily filled in the tiny questionnaire that would allow him onto the public Wi-Fi in the pub. I’ve heard quite a few people complain that Facebook somehow knows what they’ve been searching for and floods their front page with it, but how many of us will gladly enter our email address for a chance at a voucher or a freebie? In an age where purchases can be made by waving your watch over a chip-and-PIN device, privacy seems to have become less of a right and more of an optional extra. It can still be attained for the most part, though as a society we seem more than willing to cash it in for the latest time-saving technology. It is a price we pay for efficiency and accessibility, but there is another element to consider, one that has had its importance confirmed by the recent Panama Papers release.

All of this data, collected and shared by the myriad devices that form the Internet of Things, brings with it accountability. As we sat in the pub, with government agents presumably rummaging through the data generated by my drinks order, I don’t think I did much to sway their distrust of the all-seeing eye of modern technology.

Since then however, I haven’t seen a single picture of their food, which I’m taking as progress.


Teradata & Cloud
‘Happy New Year’ To You Too!

The Christmas & New Year break (not ‘holidays’, ugh!) is a time for a bit of rest & relaxation away from work, eating & drinking too much, pretending to like spending time with your relatives, swapping presents and generally winding down.

It’s also good thinking time…

In addition to ‘messing about with databases’, especially Teradata, for the best part of 30 years (yikes!), yours truly also has more than a passing interest in personal finance and investing…which got me thinking about Teradata’s share (stock) price performance, obviously.

Teradata’s Share Price

After splitting off from NCR, Teradata floated via an IPO on the NYSE and closed at just under $28 on the first day of trading on 1st October 2007, or 1071001 for the TD geeks. Over the first 5 years Teradata’s shares rose a whopping 200% (source: Google Finance):

Teradata share price 2007-2012

Since the heady days of Q3 2012, when Teradata peaked at around $80, it has been a rather different story:

Teradata 2012-2016

According to Google Finance, as of 6th Jan 2016, Teradata’s share price has declined 68% since the peak in Q3 2012. There have also been declines of 44% in the last year, 31% in the last 6 months, 17% in the last 3 months, and almost 10% in the last week alone.

Update 14th Jan 2016 – since I started writing this article last week Teradata’s price has declined by a further 2% or so to under $23.

Those that bought via the IPO in 2007 are now looking at paper losses. This is hardly a stellar performance from the world’s biggest pure-play analytics company.

Why Has Teradata’s Share Price Fallen?

There’s only one reason that share prices fall: more sellers than buyers. It’s that simple.

What motivates sellers to sell, and buyers to buy, is an entirely different question and one that we can only speculate on.

Efficient market theory would have us believe that all known information about the future potential of a company is reflected in the share price. This leads to the conclusion that ‘the market’, whoever that might be, hasn’t held out much hope for Teradata’s fortunes for over 3 years, since the 2012 peak.

What Has Happened to Teradata Since 2012?

The short answer, in a single word… Hadoop.

Teradata’s historic value proposition was quite simple: massively parallel processing (MPP). The ability to deal with data volumes that others struggled with, via a ‘scale out’ MPP architecture, set Teradata apart from the likes of Oracle, IBM and Microsoft, who relied on the traditional ‘scale up’ SMP architecture.
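
As a rough sketch of what ‘scale out’ means in practice (the table and column names below are ours, purely for illustration), Teradata hash-distributes a table’s rows across all AMPs, and therefore all nodes, on the table’s primary index, so a simple aggregate is worked on by every AMP in parallel. Add nodes and you add AMPs; an SMP database, by contrast, can only scale up to a bigger single box.

-- Illustrative only: rows are spread across every AMP by the hash of the
-- PRIMARY INDEX column, so no single node holds (or scans) all of the data.
CREATE TABLE vote_fact
( vote_id      INTEGER NOT NULL
, voter_id     INTEGER NOT NULL
, candidate_id INTEGER NOT NULL
, vote_ts      TIMESTAMP(0)
) PRIMARY INDEX (vote_id);

-- Each AMP aggregates its own slice of vote_fact and the partial results
-- are combined, so elapsed time falls as AMPs (and nodes) are added.
SELECT candidate_id,
       COUNT(*) AS votes
FROM   vote_fact
GROUP  BY candidate_id;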

For the best part of 15 years, from the late 1980s to the early 2000s, Teradata was alone in the enterprise market with its eponymous MPP database offering. Then along came the likes of Netezza and Greenplum with competing MPP database offerings. Netezza at least caused Teradata to sit up and take notice, before being bought by IBM. Greenplum was acquired by EMC and then spun off as part of Pivotal, but has hardly had any impact on Teradata’s fortunes.

Teradata’s response to Netezza was the launch of the more affordable range of Teradata appliances to complement the existing Teradata enterprise models.

However, the new MPP database players collectively turned out to be minor irritants compared to the impact of Hadoop on Teradata. Our first-hand experience is that, rightly or wrongly, many enterprises see Hadoop as an alternative to Teradata’s MPP database. The fact that so much effort is being spent to run SQL on Hadoop, and in doing so turn Hadoop into what is basically an MPP database, is deeply ironic.

Why not go with an MPP database system in the first place, given that it will cover 99% of use cases for 99% of companies? But that’s a whole other story…

Teradata’s response thus far to Hadoop has been less convincing. The strategy seems to be to promote the Teradata Unified Data Architecture (UDA). The UDA consists of 3 elements: the traditional Teradata data warehouse (production), the Teradata Aster appliance (exploration) and a Hadoop appliance (‘Big Data’). Confused?!?!?!

Teradata on Amazon AWS

The availability of Teradata via the public cloud, and specifically AWS, was announced in late 2015 for availability in Q1 2016. Teradata’s more recent thoughts on Teradata in the cloud are neatly covered in a BeyeNETWORK article.

As Teradata’s Brian Wood points out, Teradata in the cloud means “Teradata Database is opened to a much broader range of potential customer”.

While this is clearly true, key questions remain: how much interest will there be for Teradata on AWS? Will multi-node Teradata MPP systems work satisfactorily on AWS? Will Teradata on AWS be profitable for Teradata?

One thing is for sure: Teradata needs to do something to arrest the precipitous share price decline. Maybe the launch of Teradata on AWS, and other public clouds such as Azure, is that something. Only time will tell, and that time will be upon us soon.

As a Teradata professional services provider, the folks here at VLDB are crossing our fingers in the hope that Teradata on AWS is both affordable and can run at scale beyond a single node. More Teradata users is good news for us!

These are interesting times indeed for Teradata.


Microsoft Azure SQL Data Warehouse
Teradata on Amazon AWS and Microsoft Azure

For those that missed the news, Teradata recently announced that the Teradata DBMS is to be made available via the public cloud, initially on Amazon Web Services (AWS), and subsequently on Microsoft Azure.

The Register described the news as ‘moon-on-a-stick’ and as a direct competitor to Amazon’s own Redshift offering.

Teradata In The Cloud, Surely Not?

‘What to make of this Teradata-in-the-cloud news?’ we hear you cry (or not). Well, to put things in context, let’s start with a little bit of Teradata history…

Back in the late 1980s, yours truly ran his first Teradata SQL query via the Teradata ITEQ client running under IBM MVS on an IBM mainframe against a new-fangled Teradata DBC/1012. In these earliest incarnations Teradata was very much a proprietary affair. The eponymous database ran on Teradata’s own ‘Teradata Operating System’ (TOS). The early Teradata DBC/1012 systems needed re-booting…a lot.

Roll forward to the early 1990s and Teradata became part of NCR, itself a part of AT&T. As a result the Teradata database was ported to NCR’s SVR4-derived MP-RAS version of Unix. The Teradata VPROCs (AMPs and PEs) became virtual, rather than each running on its own dedicated Intel x86 CPU. As well as virtualising the AMPs, the MPP cluster now consisted of several SMP nodes, and the Ynet became the Bynet. The virtualised systems running on clustered NCR MP-RAS nodes became known as Teradata V2, with the previous non-virtual DBC/1012 systems known as Teradata V1.

The release of a single-node Teradata demo version, latterly known as Teradata Express, and a full port to Windows NT also occurred during the 1990s.

In the early 2000s, largely in response to Netezza, Teradata started to ship systems known as ‘Teradata data warehouse appliances’, not to be confused with the ‘Teradata active enterprise data warehouse’ offerings. The differences between the ‘appliance’ and ‘enterprise’ offerings were largely around the disk I/O sub-system, workload management and, you guessed it, price. While Teradata appliances were designed to mitigate the Netezza competitive threat, the enterprise platforms remained the premium Teradata offering.

During the 2000s Teradata was also ported from NCR’s 32-bit Unix MP-RAS to 64-bit SUSE Linux. Teradata systems with a measly 2GB-4GB of RAM per node were thankfully no longer the norm.

More recently, during the 2010’s, Teradata has also broadened the data platforms offered with the addition of Teradata Aster, Teradata Hadoop and Teradata Cloud (private cloud, not public cloud).

The point of the Teradata history lesson is that Teradata has a long track record of change, covering almost 30 years, much of which is not necessarily obvious from the outside. The latest news about upcoming Teradata availability on the public cloud AWS and Azure platforms is another part of the onward Teradata journey.

For those that are interested, Teradata’s version of their own history is here.

Teradata’s Cloud Journey

Teradata’s existing ‘cloud’ offerings consist of either the not-for-production 2-AMP Teradata Express on EC2 or Teradata’s private cloud, which is essentially access to a remote, shared Teradata platform, i.e. not very ‘cloudy’ at all. These offerings scored more marketing points than anything else.

To really claim to have a cloud offering, any DBMS vendor must surely make their product(s) available via the shared public cloud?

Well, it looks like Teradata finally agrees. The Teradata DBMS being offered on AWS and Azure is a *big* departure from the current choice of on-prem appliance, on-prem enterprise or remote shared appliance.

Teradata’s Public Cloud Motivation

Teradata’s value-add is based on capabilities like resilience, performance, functionality, scalability and workload management. The stuff Teradata folks take for granted is not always apparent, until it’s not there. Rather than any one particular ‘killer’ feature, it is Teradata’s holistic ability to stand up to the real-world rigours of high-end data warehousing that sets it apart from the competition. Teradata’s impressive customer list can’t all be wrong!

Teradata has always been a company guided by customer demand and feedback. The Teradata PAC bears witness to this. So, part of the driver for Teradata in the cloud could be from the existing Teradata customer base.

Another driver is likely to be the general ‘dash to the cloud’. There is simply no point in trying to swim against that tide.

There can also be little doubt that the Teradata execs are feeling the need to put a smile back on the face of Teradata’s investors. Following strong gains of over 200% from the IPO in 2007 to a peak in 2012, the Teradata share (stock) price is down by 25% in the last 5 years against a 70% gain for the S&P500 over the same period, with a peak-to-trough decline of over 60% in the last 3 years (graph from Google Finance):

Teradata Stock Price

Against the competitive backdrop of ever-increasing interest in Hadoop, Teradata has, by their own standards, been on an acquisition spree over the last few years. This has yet to arrest a downward trend in the share price. Maybe making Teradata available to the mass market via AWS and Azure will have a positive impact on Teradata’s share price? Only time will tell.

Teradata’s Public Cloud Challenges

At first blush Teradata on a public cloud platform such as Amazon’s AWS seems like a very strange fit. Teradata has historically been a low-volume, high-margin, premium-priced offering. Amazon operate at the opposite end of the scale – pile it high, razor-thin margins, sell it cheap.

It may be the case that certain Teradata features are unavailable or less capable in order to deliver a price point that works within the AWS platform. Only time will tell.

Product positioning and price issues aside, who will provide support for the potentially large upswing in Teradata users? Teradata? Maybe. Amazon? Maybe.

Teradata’s own support team currently enjoy the certainty of a tightly integrated stack within which Teradata provided all of the hardware, the SUSE Linux OS (and the occasional Windows system!) and the DBMS software. There is currently also a relatively small installed Teradata user base to support. All bets are off when it comes to being able to predict demand for Teradata support once Teradata on AWS is available.

One of the key enablers for Teradata’s MPP architecture, as I’m sure we all know, is the inter-node Bynet interconnect. Like all MPP systems, and Hadoop clusters for that matter, there will *always* be a need to ship data between the nodes via a high-speed, fault-tolerant interconnect. Teradata has lots of value-add (and patents) in this part of the MPP stack.

How Teradata deploys the interconnect, or some form of it, on a public cloud infrastructure will be a key determinant of whether Teradata is able to support the kind of capability taken for granted with non-cloud deployments. It is notable that Amazon’s own Redshift is not noted for its ability to ship data between the nodes in a cluster.
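
To make the data-shipping point a little more concrete, here is a hypothetical sketch (the table and column names are ours, and the exact EXPLAIN wording varies by Teradata release). Joining two tables on a column that is not the primary index of one of them typically forces Teradata to redistribute that table’s rows across the AMPs, and hence across the Bynet, before the join can take place.

-- Illustrative only: voter is assumed to have PRIMARY INDEX (voter_id),
-- so its rows are not already on the AMPs that hold the matching
-- candidate rows when joining on candidate_id.
EXPLAIN
SELECT v.voter_id,
       c.candidate_name
FROM   voter     v
JOIN   candidate c
ON     v.candidate_id = c.candidate_id;

-- The plan will typically include a step along the lines of
-- "redistributed by the hash code of voter.candidate_id to all AMPs".
-- That redistribution is Bynet traffic: exactly the inter-node data
-- movement that a public cloud deployment has to handle well.

On an on-prem appliance that redistribution runs over dedicated Bynet hardware; on AWS it has to run over whatever network the instances share, which is precisely why the interconnect question matters.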

Perhaps more notable is that Teradata confirmed via Twitter to ourselves that the initial AWS offering will be single node only, with MPP clusters available later in 2016. There is clearly work to do in this area before Teradata MPP via the public cloud can be fully unleashed.

In summary, this is a very interesting development; challenges lie ahead, but the more Teradata systems out there needing expert assistance from bona fide Teradata experts, the better 😉
