In this third post of a multi-part series, we explore some of the edge cases in migrating a large data warehouse from Greenplum to Amazon Redshift using AWS Schema Conversion Tool (AWS SCT) and how to handle these challenges. Challenges include how best to use virtual partitioning, edge cases for numeric and character fields, and arrays.
You can check out the first post of this series for guidance on planning, running, and validating the migration. You can also check out the second post for best practices for choosing the optimal Amazon Redshift cluster, data architecture, converting stored procedures, compatible functions and queries widely used for SQL conversions, and recommendations for optimizing the length of data types for table columns.
Unbounded character data type
Greenplum supports creating columns as text and varchar without specifying the length of the field. This works without issue in Greenplum but doesn't work well when migrating to Amazon Redshift. Amazon Redshift stores data in columnar format and gets better compression when using shorter column lengths. Therefore, the Amazon Redshift best practice is to use the smallest character length possible.
AWS SCT will convert these unbounded fields as large objects (LOBs) instead of treating the columns as character fields with a specified length. LOBs are implemented differently in each database product on the market, but in general, a LOB is not stored with the rest of the table data. Instead, there is a pointer to the location of the data. When the LOB is queried, the database reconstitutes the data automatically for you, but this typically requires more resources.
Amazon Redshift doesn't support LOBs, so AWS SCT resolves this by loading the data into Amazon Simple Storage Service (Amazon S3) and storing the S3 URL in the column. When you need to retrieve this data, you have to query the table, get the S3 URL, and then fetch the data from Amazon S3. This isn't ideal because most of the time, the actual maximum length of the field doesn't require it to be treated as a LOB, and storing the data remotely means it takes much longer to fetch the data for queries.
The current resolution is to calculate the maximum length of these columns and update the Greenplum tables before converting to Amazon Redshift with AWS SCT.
Note that in a future release of AWS SCT, the collection of statistics will include calculating the maximum length for each column, and the conversion of unbounded varchar and text will set the length in Amazon Redshift automatically.
The following code is an example of an unbounded character data type:
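A minimal sketch of such a table; the customer table, its columns, and the index are hypothetical names used for illustration:
```sql
CREATE TABLE public.customer
(
    customer_id  bigint NOT NULL,
    description1 text NOT NULL,   -- unbounded text column
    description2 text,            -- unbounded text column
    PRIMARY KEY (customer_id, description1)
)
DISTRIBUTED BY (customer_id);

CREATE INDEX customer_description2_idx ON public.customer (description2);
```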
This table uses a primary key on an unbounded text column. This needs to be converted to varchar(n), where n is the maximum length found in this column.
- Drop unique constraints on affected columns:
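For the hypothetical customer table above, the unique (primary key) constraint covering the text column might be dropped like this (the constraint name is assumed to be the default):
```sql
ALTER TABLE public.customer DROP CONSTRAINT customer_pkey;
```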
- Drop indexes on affected columns:
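Continuing the example, drop the hypothetical index created earlier:
```sql
DROP INDEX public.customer_description2_idx;
```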
- Calculate the maximum length of the affected columns:
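One possible way to do this for the example table, defaulting to 10 when a column holds only NULLs or the table is empty:
```sql
SELECT coalesce(max(length(description1)), 10) AS description1_max_length,
       coalesce(max(length(description2)), 10) AS description2_max_length
FROM public.customer;
```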
Note that in this example, the result is 10 because either that is the maximum calculated length of the description1 and description2 columns, or the columns contain only NULL values, or the table has no data (in which case a default of 10 is used).
- Alter the length of the affected columns:
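Using the lengths calculated above (10 in this hypothetical example):
```sql
ALTER TABLE public.customer ALTER COLUMN description1 TYPE varchar(10);
ALTER TABLE public.customer ALTER COLUMN description2 TYPE varchar(10);
```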
You can now proceed with using AWS SCT to convert the Greenplum schema to Amazon Redshift, avoiding the use of LOBs to store the column values.
GitHub help
If you have many tables to update and want an automated solution, you can use the add_varchar_lengths.sh script found in the GitHub repo to fix all of the unbounded varchar and text columns in a given schema in Greenplum. The script calculates the appropriate maximum length and then alters the Greenplum tables so the varchar data type is bounded by a length.
Note that the script will also drop any constraints or indexes on the affected columns.
Empty character data
Greenplum and Amazon Redshift support an empty string value in a field, which is different from NULL. The behavior is the same between the two databases. However, AWS SCT defaults to converting empty strings to NULL. This simply needs to be disabled to avoid problems.
- In AWS SCT, open your project and choose Settings, Project settings, and Data migration.
- Scroll to the bottom and find Use empty as null value.
- Deselect this option so that AWS SCT doesn't convert empty strings to NULL.
NaN and Infinity numeric data type
Greenplum supports NaN and Infinity in a numeric field to represent an undefined calculation result and infinity. NaN is very uncommon because when using aggregate functions on a column with a NaN row, the result will also be NaN. Infinity is also uncommon and not useful when aggregating data. However, you may encounter these values in a Greenplum database.
Amazon Redshift doesn't support NaN and Infinity, and AWS SCT doesn't check for this in your data. If you do encounter this when using AWS SCT, the task will fail with a numeric conversion error.
To resolve this, it's suggested to use NULL instead of NaN and Infinity. This allows you to aggregate data and get results other than NaN and, importantly, allows you to convert the Greenplum data to Amazon Redshift.
The following code is an example of a NaN numeric value:
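A minimal sketch; the measurements table and its columns are hypothetical:
```sql
CREATE TABLE public.measurements
(
    id      bigint  NOT NULL,
    reading numeric NOT NULL
)
DISTRIBUTED BY (id);

-- 'NaN' is a valid value for a numeric column in Greenplum
INSERT INTO public.measurements VALUES (1, 'NaN'), (2, 100.5);
```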
- Drop the NOT NULL constraint:
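For the hypothetical table above:
```sql
ALTER TABLE public.measurements ALTER COLUMN reading DROP NOT NULL;
```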
- Update the table:
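Replace the NaN values with NULL; NaN compares equal to itself in Greenplum, so a simple equality predicate finds these rows:
```sql
UPDATE public.measurements
SET reading = NULL
WHERE reading = 'NaN';
```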
You can now proceed with using AWS SCT to migrate the Greenplum data to Amazon Redshift.
Note that in a future release of AWS SCT, there will be an option to convert NaN and Infinity to NULL so that you won't have to update your Greenplum data in order to migrate to Amazon Redshift.
Virtual partitioning on GP_SEGMENT_ID
For large tables, it's recommended to use virtual partitioning to extract data from Greenplum. Without virtual partitioning, AWS SCT will run a single query to unload data from Greenplum.
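For example, the extract might be a single full-table query like the following, where store_sales is a hypothetical table name:
```sql
SELECT * FROM store_sales;
```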
If this table is very large, it will take a long time to extract the data because this is a single process querying the data. With virtual partitioning, multiple queries are run in parallel so that the data extraction completes faster. It also makes it easier to recover if there is an issue with the task.
Virtual partitioning is very flexible, but a simple way to do this for a migration to Amazon Redshift is to utilize the Greenplum hidden column gp_segment_id. This column identifies which segment in Greenplum holds the data, and each segment should have an equal number of rows. Therefore, creating a partition for each gp_segment_id is an easy way to implement virtual partitioning.
If you're not familiar with the term segment, it's similar to an Amazon Redshift slice.
For instance:
- First, determine the number of segments in Greenplum:
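One way to get the number of primary segments is to query the Greenplum catalog:
```sql
SELECT count(*)
FROM gp_segment_configuration
WHERE content >= 0
  AND role = 'p';
```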
Now you can configure AWS SCT.
- In AWS SCT, go to the Data Migration view (other) and choose (right-click) a large table.
- Scroll down to Add virtual partitioning.
- For the partition type, choose Auto Split and change the column name to GP_SEGMENT_ID.
- Use 0 for Start value, the number of segments found in Step 1 as End value, and an Interval of 1.
When you create a local task to load this table, the task will have a sub-task for each gp_segment_id value.
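Conceptually, each sub-task extracts a single segment's rows by filtering on the hidden column, similar to the following sketch (store_sales is the same hypothetical table as above):
```sql
SELECT * FROM store_sales WHERE gp_segment_id = 0;
SELECT * FROM store_sales WHERE gp_segment_id = 1;
-- ...one query per segment, through the end value configured above
```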
Note that in a future release of AWS SCT, there will be an option to automatically virtually partition tables based on GP_SEGMENT_ID. This option will also retrieve the number of segments automatically.
Arrays
Greenplum supports arrays such as bigint[] that are unbounded. Typically, arrays are kept relatively small in Greenplum because arrays consume more memory in Greenplum than using an alternative strategy. However, it's possible to have a very large array in Greenplum that isn't supported by Amazon Redshift.
AWS SCT converts a Greenplum array to varchar(65535), but if the converted array is longer than 65,535 characters, the load will fail.
The following code is an example of a large array:
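A sketch of such a table; the sales table and column names are hypothetical:
```sql
CREATE TABLE public.sales
(
    sales_id       bigint NOT NULL,
    customer_id    bigint,
    sales_item_ids bigint[]   -- unbounded array of item IDs for this sale
)
DISTRIBUTED BY (sales_id);
```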
In this example, the sales items are stored in an array for each sales_id. If you encounter an error while loading because the data is too long to load into Amazon Redshift with AWS SCT, then this is the solution. It's also a more efficient pattern for storing data in both Greenplum and Amazon Redshift!
- Create a new sales table that has all columns from the existing sales table, but exclude the array column:
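Continuing the hypothetical example:
```sql
CREATE TABLE public.sales_new
(
    sales_id    bigint NOT NULL,
    customer_id bigint
)
DISTRIBUTED BY (sales_id);
```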
- Populate the new sales table with the existing data except for the array column:
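For example:
```sql
INSERT INTO public.sales_new (sales_id, customer_id)
SELECT sales_id, customer_id
FROM public.sales;
```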
We create a new table that is a cross-reference of sales IDs with the sales items. Instead of having a single row for this association, there will now be a row for each relationship.
- Create a new sales item table:
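A sketch of the cross-reference table:
```sql
CREATE TABLE public.sales_items
(
    sales_id      bigint NOT NULL,
    sales_item_id bigint NOT NULL
)
DISTRIBUTED BY (sales_id);
```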
- To unnest the array, create a row for each array element:
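One way to do this is with the unnest function:
```sql
INSERT INTO public.sales_items (sales_id, sales_item_id)
SELECT sales_id, unnest(sales_item_ids)
FROM public.sales;
```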
- Rename the sales tables:
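For example:
```sql
ALTER TABLE public.sales RENAME TO sales_old;
ALTER TABLE public.sales_new RENAME TO sales;
```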
In AWS SCT, refresh the tables and migrate the revised sales table and the new sales_items table.
The following are some example queries before and after; these are sketches against the hypothetical sales and sales_items tables above.
Before:
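A hypothetical query against the original array-based sales table:
```sql
SELECT sales_item_ids
FROM sales
WHERE sales_id = 100;
```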
After:
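The equivalent query against the new sales_items cross-reference table:
```sql
SELECT sales_item_id
FROM sales_items
WHERE sales_id = 100;
```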
Before:
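A hypothetical query that checks the array for a given item:
```sql
SELECT count(*)
FROM sales
WHERE 1000 = ANY (sales_item_ids);
```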
After:
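The same count using the cross-reference table:
```sql
SELECT count(DISTINCT sales_id)
FROM sales_items
WHERE sales_item_id = 1000;
```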
VACUUM ANALYZE
Greenplum, like Amazon Redshift, supports the VACUUM command, which reclaims storage space after UPDATE and DELETE commands are run on a table. Greenplum also allows you to add the ANALYZE option to run both statements with a single command.
The following code is the Greenplum command:
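```sql
-- Greenplum: one statement does both (the table name here is just the hypothetical sales table from earlier)
VACUUM ANALYZE sales;
```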
This isn’t quite common, however you’ll see this on occasion. In case you’re simply inserting knowledge right into a desk, there isn’t any have to run VACUUM, however for ease of use, typically builders will use VACUUM ANALYZE.
The following are the Amazon Redshift commands:
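```sql
-- Amazon Redshift: two separate statements
VACUUM sales;
ANALYZE sales;
```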
Amazon Redshift doesn’t help including ANALYZE to the VACUUM command, so as a substitute, this must be two completely different statements. Additionally be aware that Amazon Redshift performs VACUUM and ANALYZE routinely for you so typically, you may take away these instructions out of your scripts totally.
DISTINCT ON query
Greenplum supports an unusual shortcut for eliminating duplicates in a table. This feature keeps the first row for each set of rows based on the order of the data being fetched. It's easiest to understand by looking at an example:
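A sketch using a hypothetical customer_orders table:
```sql
CREATE TABLE public.customer_orders
(
    customer_id int,
    order_id    int,
    order_date  date
)
DISTRIBUTED BY (customer_id);

INSERT INTO public.customer_orders VALUES
    (1, 100, '2022-01-01'),
    (1, 101, '2022-02-01'),
    (2, 102, '2022-01-15');

-- keep only the most recent order for each customer
SELECT DISTINCT ON (customer_id) customer_id, order_id, order_date
FROM public.customer_orders
ORDER BY customer_id, order_date DESC;
```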
With the sample rows above, we get the following results:
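```
 customer_id | order_id | order_date
-------------+----------+------------
           1 |      101 | 2022-02-01
           2 |      102 | 2022-01-15
```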
The solution for running this in Amazon Redshift is to use the ANSI standard row_number() analytical function, as shown in the following code:
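```sql
-- equivalent query using row_number(); runs in both Amazon Redshift and Greenplum
SELECT customer_id, order_id, order_date
FROM (
    SELECT customer_id, order_id, order_date,
           row_number() OVER (PARTITION BY customer_id ORDER BY order_date DESC) AS rn
    FROM customer_orders
) t
WHERE rn = 1;
```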
Clean up
The examples in this post create tables in Greenplum. To remove these example tables, run the following commands:
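Assuming the hypothetical table names used in the sketches above:
```sql
DROP TABLE IF EXISTS public.customer;
DROP TABLE IF EXISTS public.measurements;
DROP TABLE IF EXISTS public.sales;
DROP TABLE IF EXISTS public.sales_old;
DROP TABLE IF EXISTS public.sales_items;
DROP TABLE IF EXISTS public.customer_orders;
```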
Conclusion
In this post, we covered some of the edge cases when migrating Greenplum to Amazon Redshift and how to handle these challenges, including easy virtual partitioning, edge cases for numeric and character fields, and arrays. This isn't an exhaustive list of everything involved in migrating Greenplum to Amazon Redshift, but this series should help you navigate modernizing your data platform by moving to Amazon Redshift.
For more details, see the Amazon Redshift Getting Started Guide and the AWS SCT User Guide.
About the Authors
Jon Roberts is a Sr. Analytics Specialist based out of Nashville, specializing in Amazon Redshift. He has over 27 years of experience working with relational databases. In his spare time, he runs.
Nelly Susanto is a Senior Database Migration Specialist at AWS Database Migration Accelerator. She has over 10 years of technical experience focusing on migrating and replicating databases, including data warehouse workloads. She is passionate about helping customers in their cloud journey.
Suresh Patnam is a Principal BDM – GTM AI/ML Leader at AWS. He works with customers to build their IT strategy, making digital transformation through the cloud more accessible by leveraging data and AI/ML. In his spare time, Suresh enjoys playing tennis and spending time with his family.