Migrate a large data warehouse from Greenplum to Amazon Redshift using AWS SCT – Part 3

In this third post of a multi-part series, we explore some of the edge cases in migrating a large data warehouse from Greenplum to Amazon Redshift using AWS Schema Conversion Tool (AWS SCT) and how to handle these challenges. Challenges include how best to use virtual partitioning, edge cases for numeric and character fields, and arrays.

You can check out the first post of this series for guidance on planning, running, and validating the migration. You can also check out the second post for best practices for choosing the optimal Amazon Redshift cluster, data architecture, converting stored procedures, compatible functions and queries widely used for SQL conversions, and recommendations for optimizing the length of data types for table columns.

Unbounded character data type

Greenplum supports creating columns as text and varchar without specifying the length of the field. This works without issue in Greenplum but doesn't work well when migrating to Amazon Redshift. Amazon Redshift stores data in columnar format and gets better compression when using shorter column lengths. Therefore, the Amazon Redshift best practice is to use the smallest character length possible.

AWS SCT will convert these unbounded fields as large objects (LOBs) instead of treating the columns as character fields with a specified length. LOBs are implemented differently in each database product on the market, but in general, a LOB is not stored with the rest of the table data. Instead, there is a pointer to the location of the data. When the LOB is queried, the database reconstitutes the data automatically for you, but this typically requires more resources.

Amazon Redshift doesn't support LOBs, so AWS SCT resolves this by loading the data into Amazon Simple Storage Service (Amazon S3) and storing the S3 URL in the column. When you need to retrieve this data, you have to query the table, get the S3 URL, and then fetch the data from Amazon S3. This isn't ideal because most of the time, the actual maximum length of the field doesn't require it to be treated as a LOB, and storing the data remotely means it will take much longer to fetch the data for queries.

The current resolution is to calculate the maximum length of these columns and update the Greenplum tables before converting to Amazon Redshift with AWS SCT.

Note that in a future release of AWS SCT, the collection of statistics will include calculating the maximum length for each column, and the conversion of unbounded varchar and text will set the length in Amazon Redshift automatically.

The following code is an example of an unbounded character data type:

CREATE TABLE public.mytable 
(description1 text NOT NULL PRIMARY KEY, description2 varchar);
CREATE INDEX ON public.mytable (description2);

This table uses a primary key on an unbounded text column. This needs to be converted to varchar(n), where n is the maximum length found in this column.

  1. Drop unique constraints on affected columns:
    ALTER TABLE public.mytable DROP CONSTRAINT mytable_pkey;

  2. Drop indexes on affected columns:
    DROP INDEX public.mytable_description2_idx;

  3. Calculate the maximum length of the affected columns:
    select coalesce(max(length(description1)), 10),
    coalesce(max(length(description2)), 10) 
    from public.mytable;

Note that in this example, if the description1 and description2 columns contain only NULL values, or the table doesn't have any data in it, the calculated length of the columns defaults to 10.

  4. Alter the length of the affected columns:
    ALTER TABLE public.mytable ALTER COLUMN description1 TYPE varchar(10);
    ALTER TABLE public.mytable ALTER COLUMN description2 TYPE varchar(10);

Now you can proceed with using AWS SCT to convert the Greenplum schema to Amazon Redshift, avoiding the use of LOBs to store the column values.

GitHub help

If you have many tables to update and want an automated solution, you can use the add_varchar_lengths.sh script found in the GitHub repo to fix all of the unbounded varchar and text columns in a given schema in Greenplum. The script calculates the appropriate maximum length and then alters the Greenplum tables so the varchar data type is bounded by a length.

Note that the script will also drop any constraints or indexes on the affected columns.
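If you prefer to review the affected columns before running the script, a catalog query like the following is one way to list unbounded varchar and text columns in a schema. This is a sketch based on the standard PostgreSQL catalog tables that Greenplum inherits, not part of the script itself; adjust the schema name as needed.

-- List unbounded varchar and text columns in the public schema
-- (atttypmod = -1 means the varchar was created without a length)
SELECT c.relname AS table_name, a.attname AS column_name, t.typname AS data_type
FROM pg_attribute a
JOIN pg_class c ON c.oid = a.attrelid
JOIN pg_namespace n ON n.oid = c.relnamespace
JOIN pg_type t ON t.oid = a.atttypid
WHERE n.nspname = 'public'
AND c.relkind = 'r'
AND a.attnum > 0
AND NOT a.attisdropped
AND ((t.typname = 'varchar' AND a.atttypmod = -1) OR t.typname = 'text');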

Empty character data

Greenplum and Amazon Redshift support an empty string value in a field, which is different from NULL. The behavior is the same between the two databases. However, AWS SCT defaults to converting empty strings to NULL. This simply needs to be disabled to avoid problems.

  1. In AWS SCT, open your project, choose Settings, Project settings, and Data migration.
  2. Scroll to the bottom and find Use empty as null value.
  3. Deselect this so that AWS SCT doesn't convert empty strings to NULL.
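The following small illustration shows the behavior this setting preserves; the empty_demo table here is hypothetical and used only for this example. Both Greenplum and Amazon Redshift treat the empty string and NULL as distinct values:

CREATE TABLE public.empty_demo (id int NOT NULL, note varchar(20));
INSERT INTO public.empty_demo VALUES (1, ''), (2, NULL);

-- Returns only id 1: the empty string is a real value, not NULL
SELECT id FROM public.empty_demo WHERE note = '';

-- Returns only id 2
SELECT id FROM public.empty_demo WHERE note IS NULL;

DROP TABLE public.empty_demo;

With Use empty as null value deselected, the empty string in row 1 remains an empty string after the migration instead of being converted to NULL.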

NaN and Infinity numeric data type

Greenplum supports NaN and Infinity in a numeric field to represent an undefined calculation result and infinity. NaN is very uncommon because when using aggregate functions on a column with a NaN row, the result will also be NaN. Infinity is also uncommon and not useful when aggregating data. However, you may encounter these values in a Greenplum database.

Amazon Redshift doesn't support NaN and Infinity, and AWS SCT doesn't check for this in your data. If you do encounter this when using AWS SCT, the task will fail with a numeric conversion error.

To resolve this, it's suggested to use NULL instead of NaN and Infinity. This allows you to aggregate data and get results other than NaN and, importantly, allows you to convert the Greenplum data to Amazon Redshift.

The following code is an example NaN numeric value:

CREATE TABLE public.mytable2 (id int NOT NULL, quantity numeric NOT NULL);
INSERT INTO public.mytable2 VALUES (1, 10), (2, 'NaN'), (3, 20);

  1. Drop the NOT NULL constraint:
    ALTER TABLE public.mytable2 ALTER COLUMN quantity DROP NOT NULL;

  2. Update the table:
    UPDATE public.mytable2 SET quantity = NULL WHERE quantity = 'NaN';

Now you can proceed with using AWS SCT to migrate the Greenplum data to Amazon Redshift.
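As a quick illustration of why this matters, the following queries (run against the example table above) show the aggregate behavior before and after the update:

-- Before the update, the NaN row makes the aggregate result NaN in Greenplum
SELECT sum(quantity) FROM public.mytable2;

-- After the NaN row is set to NULL, the NULL is ignored and the result is 30
SELECT sum(quantity) FROM public.mytable2;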

Note that in a future release of AWS SCT, there will be an option to convert NaN and Infinity to NULL so that you won't have to update your Greenplum data to migrate to Amazon Redshift.

Virtual partitioning on GP_SEGMENT_ID

For large tables, it's recommended to use virtual partitioning to extract data from Greenplum. Without virtual partitioning, AWS SCT will run a single query to unload data from Greenplum. For example:

SELECT * FROM store_sales;

If this table is very large, it will take a long time to extract the data because this is a single process querying the data. With virtual partitioning, multiple queries are run in parallel so that the extraction of data is completed faster. It also makes it easier to recover if there is an issue with the task.

Virtual partitioning is very flexible, but a simple way to do this for the migration to Amazon Redshift is to utilize the Greenplum hidden column gp_segment_id. This column identifies which segment in Greenplum has the data, and each segment should have an equal number of rows. Therefore, creating partitions for each gp_segment_id is an easy way to implement virtual partitioning.

If you're not familiar with the term segment, it's similar to an Amazon Redshift slice.

For example:

SELECT * FROM store_sales WHERE gp_segment_id = 0;
SELECT * FROM store_sales WHERE gp_segment_id = 1;
SELECT * FROM store_sales WHERE gp_segment_id = 2;
...
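Because each of these queries handles one segment's rows, it can be useful to confirm that the data is spread evenly across segments before partitioning on gp_segment_id. The following check is illustrative, using the store_sales table from the example above:

-- Row count per Greenplum segment; counts should be roughly equal
SELECT gp_segment_id, count(*) AS row_count
FROM store_sales
GROUP BY gp_segment_id
ORDER BY gp_segment_id;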

  1. First, determine the number of segments in Greenplum:
    SELECT count(*) 
    FROM gp_segment_configuration 
    WHERE content >= 0 AND preferred_role = 'p';

Now you can configure AWS SCT.

  2. In AWS SCT, go to the Data Migration view (other) and choose (right-click) a large table.
  3. Scroll down to Add virtual partitioning.
  4. For the partition type, choose Auto Split and change the column name to GP_SEGMENT_ID.
  5. Use 0 for Start value, the number of segments found in Step 1 as End value, and an Interval of 1.

When you create a local task to load this table, the task will have a sub-task for each gp_segment_id value.

Note that in a future release of AWS SCT, there will be an option to automatically virtually partition tables based on GP_SEGMENT_ID. This option will also retrieve the number of segments automatically.

Arrays

Greenplum supports arrays such as bigint[] that are unbounded. Usually, arrays are kept relatively small in Greenplum because arrays consume more memory in Greenplum than an alternative approach would. However, it's possible to have a very large array in Greenplum that isn't supported by Amazon Redshift.

AWS SCT converts a Greenplum array to varchar(65535), but if the converted array is longer than 65,535 characters, the load will fail.

The following code is an example of a large array:

CREATE TABLE public.sales 
(sales_id int NOT NULL,
customer_id int NOT NULL,
sales_item_ids bigint[]) 
DISTRIBUTED BY (sales_id);

INSERT INTO public.sales VALUES (1, 100, '{1,2,3}'), (2, 100, '{1,2,3}');

In this example, the sales items are stored in an array for each sales_id. If you encounter an error while loading that the length is too long to load this data into Amazon Redshift with AWS SCT, then this is the solution. It's also a more efficient pattern for storing data in both Greenplum and Amazon Redshift!
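Before restructuring, a check like the following (an illustrative query, not something AWS SCT runs) approximates whether any row's array would exceed the 65,535-character limit once converted to a character column:

-- Longest array in the table when cast to text; values near or above 65535 will fail to load
SELECT max(length(sales_item_ids::text)) AS max_converted_length
FROM public.sales;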

  1. Create a new sales table that has all the columns from the existing sales table, but exclude the array column:
    CREATE TABLE public.sales_new 
    (sales_id int NOT NULL,
    customer_id int NOT NULL) 
    DISTRIBUTED BY (sales_id);

  2. Populate the new sales table with the existing data, except for the array column:
    INSERT INTO public.sales_new (sales_id, customer_id) 
    SELECT sales_id, customer_id FROM public.sales;

We create a new table that is a cross-reference of sales IDs with the sales items. Instead of having a single row for this association, now there will be a row for each relationship.

  3. Create a new sales items table:
    CREATE TABLE public.sales_items 
    (sales_id int NOT NULL, 
    sales_item_id bigint NOT NULL) 
    DISTRIBUTED BY (sales_id);

  4. To unnest the array, create a row for each array element:
    INSERT INTO public.sales_items 
    (sales_id, sales_item_id) 
    SELECT sales_id, unnest(sales_item_ids) 
    FROM public.sales;

  5. Rename the sales tables:
    ALTER TABLE public.sales RENAME TO sales_old;
    ALTER TABLE public.sales_new RENAME TO sales;

In AWS SCT, refresh the tables and migrate the revised sales table and the new sales_items table.

The following are some example queries before and after.

Before:

SELECT s.sales_id, unnest(s.sales_item_ids) 
FROM public.sales s 
WHERE s.sales_id = 1;

After:

SELECT s.sales_id, i.sales_item_id 
FROM public.sales s 
JOIN public.sales_items i ON s.sales_id = i.sales_id 
WHERE s.sales_id = 1;

Before:

SELECT s.sales_id 
FROM public.sales s 
WHERE s.customer_id = 100
AND 10 = ANY(s.sales_item_ids);

After:

SELECT s.sales_id
FROM public.sales s 
JOIN public.sales_items i ON s.sales_id = i.sales_id 
WHERE s.customer_id = 100
AND i.sales_item_id = 10;

VACUUM ANALYZE

Greenplum, like Amazon Redshift, supports the VACUUM command, which reclaims storage space after UPDATE and DELETE commands are run on a table. Greenplum also allows you to add the ANALYZE option to run both statements with a single command.

The following code is the Greenplum command:

VACUUM ANALYZE table;

This isn't very common, but you'll see it from time to time. If you're just inserting data into a table, there is no need to run VACUUM, but for ease of use, sometimes developers will use VACUUM ANALYZE.

The following are the Amazon Redshift commands:

VACUUM table;
ANALYZE table;

Amazon Redshift doesn't support adding ANALYZE to the VACUUM command, so instead, these need to be two separate statements. Also note that Amazon Redshift performs VACUUM and ANALYZE automatically for you, so in most cases you can remove these commands from your scripts entirely.

DISTINCT ON query

Greenplum supports an unusual shortcut for eliminating duplicates in a table. This feature keeps the first row for each set of rows based on the order of the data being fetched. It's easiest to understand by looking at an example:

CREATE TABLE customer 
(customer_name varchar(100) not null, 
customer_address varchar(1000) not null, 
lastupdate timestamp not null);

INSERT INTO customer VALUES
('ACME', '123 Main St', '2022-01-01'), 
('ACME', '456 Market St', '2022-05-01'), 
('ACME', '789 Broadway', '2022-08-01');

SELECT DISTINCT ON (customer_name) customer_name, customer_address 
FROM customer 
ORDER BY customer_name, lastupdate DESC;

We get the following results:

customer_name | customer_address 
---------------+------------------
 ACME          | 789 Broadway

The solution for running this in Amazon Redshift is to use the ANSI-standard row_number() analytical function, as shown in the following code:

SELECT sub.customer_name, sub.customer_address 
FROM (SELECT customer_name, customer_address, row_number() over (partition by customer_name ORDER BY lastupdate DESC) AS row_number FROM customer) AS sub 
WHERE sub.row_number = 1;

Clean up

The examples in this post create tables in Greenplum. To remove these example tables, run the following commands:

DROP TABLE IF EXISTS public.mytable;
DROP TABLE IF EXISTS public.mytable2;
DROP TABLE IF EXISTS public.sales;
DROP TABLE IF EXISTS public.sales_old;
DROP TABLE IF EXISTS public.sales_new;
DROP TABLE IF EXISTS public.sales_items;
DROP TABLE IF EXISTS public.customer;

Conclusion

In this post, we covered some of the edge cases when migrating Greenplum to Amazon Redshift and how to handle these challenges, including easy virtual partitioning, edge cases for numeric and character fields, and arrays. This isn't an exhaustive list for migrating Greenplum to Amazon Redshift, but this series should help you navigate modernizing your data platform by moving to Amazon Redshift.

For more details, see the Amazon Redshift Getting Started Guide and the AWS SCT User Guide.


About the Authors

Jon Roberts is a Sr. Analytics Specialist based out of Nashville, specializing in Amazon Redshift. He has over 27 years of experience working with relational databases. In his spare time, he runs.

Nelly Susanto is a Senior Database Migration Specialist at AWS Database Migration Accelerator. She has over 10 years of technical experience focusing on migrating and replicating databases along with data warehouse workloads. She is passionate about helping customers in their cloud journey.

Suresh Patnam is a Principal BDM – GTM AI/ML Leader at AWS. He works with customers to build their IT strategy and make digital transformation through the cloud more accessible by leveraging Data and AI/ML. In his spare time, Suresh enjoys playing tennis and spending time with his family.
