For the past few years I have been involved in many projects where the requirement was to migrate the existing on-premises data warehouse to modern platform. In this post I will share my experience and observation about the migration projects and why it is fun to be part of Migration projects.
Wave of Migration Projects and Why ?
First thing we have to understand is that why suddenly every enterprise wishes to migrate to new platform. I can think of below main points encouraging to migrate to new platforms:
- Price Model : Most of the cloud services follow pay-as-you-go model. So you pay for whatever services you use and for the duration of the usage only for most of the services. Also no upfront costs is required. So no harm in trying with small budget.
- Higher availability : With cloud you have option to securely connect to your infrastructure from anywhere in the world. And higher availability means better efficiency.
- Upgrades in Computation : In modern platform, upgrading compute is possible with a single click. You need more processing power, upgrade your infrastructure in single click.
- Upgrades in Storage : Multiple options for storage is available. Want to move to faster drive like SSD , pay slightly more and you have it. Switching to slower HDD can save you some money.
- Better Scalability : Infrastructure procurement was nightmare for any team earlier. Now you need more servers, you have it up and running in few minutes.
- Minimal Management : Many services are server-less and many are fully managed. So you can just worry about your tasks and let management and administration be taken care by cloud partner.
- Support for Data Science and advance Analytics : With the boom of Artificial Intelligence and Machine Learning , data analytics boundaries have grown too. So modern platform supports extended analytics when required.
Data-warehouse Migration Projects – How ?
Now we know the "why" part of data warehouse migration projects let's talk about "How" part of it too. So the very first step is to do a small POC before going on full fledged migration project. And the POC could be small but it may take years to complete. And the main reason for that is you may want to try most of the popular options available in the market and then pick the one which is best for you.
Data Migration – How to draft a POC ?
When you are thinking of data migration project, you have to decide the number 1 reason for you because of which you want to do the data warehouse migration POC.
- Is it because you are running out of compute power and need extra servers to take off that load ?
- This is a very common practice I have seen in the market. Enterprises are running out of compute power in existing data warehouse solution. With every PROD release they are adding more and more ETL jobs and most of it is daily job. If the ETL jobs are not optimised, very soon they face challenge to complete all the Daily jobs with-in 24 hours window. If not handled properly, they will end up have back-logs of ETL jobs to complete.
- If this is the case I will suggest to identify the ETL jobs which takes most of the CPU/IO and identify a good healthy ETL pipeline to be part of that POC.
- Is it because you are running out of storage space and you don't have much options to scale ?
- This is another challenge I have seen enterprises face. Many of the RDBMS solution comes with Storage & Compute coupled in the same machine. So if you want more storage, go for bigger machine even when you don't need more compute.
- If this is the case then identify the ETL pipeline which involves bulk of data in the data warehouse and mark it for POC.
- Is it because you are looking for separate environment for your business users or to manage adhoc requests more efficiently ?
- Third most common reason I have seen is enterprises want their business users to move to some other platform and free the ETL server of adhoc requests.
- If this is the case then identify the ETL pipeline which loads Reporting tables or Data Marts on which business team creates dashboard or runs adhoc queries.
- But what if the modern platform does not support Reporting or Visualisation tool then in such cases migrate the ETL pipeline to modern platform and do the processing and push the processed data back to existing environment. A separate POC may be required to identify the replacement of Visualisation tool on modern platform.
- Is it because you are looking to move out of ETL tool ?
- If you are using ETL tool and now want to move out of it then also you can try a POC for that. Main reason I see is the licence cost of these ETL tools and before renewing it , many enterprises wish to try alternatives.
- If this is the case, then you have to see the support for these ETL tools like Informatica/Datastage on modern platform. If the support is not available, then how can you automate it. If the tool is XML based, can you automate the process by changing XML to destination template.
In all the above mentioned cases, please keep performance tuning at the core. Because I have seen many enterprises ignore it at the start of POC and at a very later stage when everything looks fine , performance becomes the bottleneck and ultimately becoming the reason for POC failure. So be very clear about the performance at the very advent of POC.
Sharing my Data-warehouse Migration Journey
Now we know "how" and "why" of data warehouse migration project I would like to share my experience so far after working on many POCs and few end to end projects. I hope my experience can help you in taking some informed decisions during your data warehouse migration projects.
Data warehouse migration projects are not something new in the industry. I have been part of DB2 and Sybase to Teradata migrations before Hadoop & Cloud became more popular option. These typically were RDBMS to RDBMS migration projects. I started working on datawarehouse migration projects to modern platform ~8 years back. Back then everyone was moving on-premises Teradata/Netezza or other MPP solutions to on-premises Hadoop cluster. Cloudera/MapR/Hortonworks were major players in the market having hadoop distribution widely used by Enterprises.
Even I worked on multiple projects migrating Teradata to Hive. The biggest challenge was performance. Teradata was lightning fast to run DML queries where as Hive was not comparable. There is a trade-off between speed and price. Hive is economical so you can push the ETL pipelines which are not very critical and have relaxed SLA. Next, wave was to migrate to Spark which is still ongoing. Spark is a very popular right now for running ETL pipelines primarily because it is much faster than Hive. However it does require some modifications to existing on-premises Data Model to be more efficient. For example, custom SCD-2 is better than SCD-1 in Spark as over-writing the data is expensive operation. So I have worked on many POCs migrating RDBMS + ETL like Teradata & Informatica to Spark ETL and it do meets client expectation provided you have sufficient cluster size. That being said, do not compare it with MPP solutions like Teradata/Netezza in terms of performance for DML queries.
Another set of projects which are on-rise nowadays are migrating from on-premises RDBMS to cloud datawarehouse solutions like Amazon Redshift, Snowflake etc. These types of migration are much better in terms of performance however costing is higher than using Spark based platform. Overall effort of transformation is also significantly less in this case. I have also worked on such projects and it is really fun to be part of migration projects. The other good point I see in migration to cloud datawarehouse is the existing team can very easily learn the new technology and may start contributing quickly. Where as in Spark/Hive , some learning curve is required.
I think this is sufficient for beginners level guide for data warehouse migration projects. Let me know your experience of these migration projects. If I can or my experience can help you in anyway then feel free to reach out to me by dropping a comment below.