After creating a machine learning model, you need a place to run it and serve predictions. If your organization is in the early stages of its AI journey or has budget constraints, you may struggle to find a deployment platform for your model. Building ML infrastructure and integrating ML models with the larger business are major bottlenecks to AI adoption [1,2,3]. IBM Db2 can help solve these problems with its built-in ML infrastructure. Anyone with knowledge of SQL and access to a Db2 instance, where the in-database ML feature is enabled, can easily learn to build and use a machine learning model in the database.
In this post, I'll show how to develop, deploy, and use a decision tree model in a Db2 database.
These are the main steps in this tutorial:
- Set up Db2 tables
- Explore the ML dataset
- Preprocess the dataset
- Train a decision tree model
- Generate predictions using the model
- Evaluate the model
I performed these steps in a Db2 Warehouse on-premises database. Db2 Warehouse on Cloud also supports these ML features.
The machine learning use case
I'll use a dataset of historical flights in the US. For each flight, the dataset has information such as the flight's origin airport, departure time, flying time, and arrival time. A column in the dataset also indicates whether each flight arrived on time or late. Using examples from the dataset, we'll build a classification model with the decision tree algorithm. Once trained, the model can receive unseen flight data as input and predict whether the flight will arrive on time or late at its destination.
1. Set up Db2 tables
The dataset I use in this tutorial is available here as a CSV file.
Creating a Db2 table
I use the following SQL to create a table for storing the dataset.
db2start
connect to <database_name>
db2 "CREATE TABLE FLIGHTS.FLIGHTS_DATA_V3 (
ID INTEGER NOT NULL GENERATED BY DEFAULT AS IDENTITY,
YEAR INTEGER ,
QUARTER INTEGER ,
MONTH INTEGER ,
DAYOFMONTH INTEGER ,
DAYOFWEEK INTEGER ,
UNIQUECARRIER VARCHAR(50 OCTETS) ,
ORIGIN VARCHAR(50 OCTETS) ,
DEST VARCHAR(50 OCTETS) ,
CRSDEPTIME INTEGER ,
DEPTIME INTEGER ,
DEPDELAY REAL ,
DEPDEL15 REAL ,
TAXIOUT INTEGER ,
WHEELSOFF INTEGER ,
CRSARRTIME INTEGER ,
CRSELAPSEDTIME INTEGER ,
AIRTIME INTEGER ,
DISTANCEGROUP INTEGER ,
FLIGHTSTATUS VARCHAR(1) )
ORGANIZE BY ROW";
After creating the table, I use the following SQL to load the data from the CSV file into the table:
db2 "IMPORT FROM 'FLIGHTS_DATA_V3.csv' OF DEL COMMITCOUNT 50000 INSERT INTO FLIGHTS.FLIGHTS_DATA_V3"
I now have the ML dataset loaded into the FLIGHTS.FLIGHTS_DATA_V3 table in Db2. I'll copy a subset of the data from this table to a separate table for ML model development and evaluation, leaving the original copy of the data intact.
SELECT count(*) FROM FLIGHTS.FLIGHTS_DATA_V3
— — —
1000000
Creating a separate table with sample records
Create a table with a 10% sample of rows from the table above, using Db2's RAND function for random sampling.
CREATE TABLE FLIGHT.FLIGHTS_DATA AS (SELECT * FROM FLIGHTS.FLIGHTS_DATA_V3 WHERE RAND() < 0.1) WITH DATA
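The effect of `WHERE RAND() < 0.1` can be sketched in plain Python (an illustrative toy, not the Db2 implementation): each row is kept independently with probability 0.1, which is why the sample count lands near, but not exactly at, 100,000.

```python
import random

def sample_rows(rows, fraction=0.1, seed=42):
    # Mimic `WHERE RAND() < 0.1`: keep each row independently with
    # probability `fraction`, so the sample size is only approximate.
    rng = random.Random(seed)
    return [r for r in rows if rng.random() < fraction]

sample = sample_rows(range(1_000_000))
print(len(sample))  # close to 100,000, but not exactly 100,000
```

This matches the behavior seen in the next step, where the sampled table ends up with roughly, not exactly, a tenth of the rows.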
Count the number of rows in the sample table.
SELECT count(*) FROM FLIGHT.FLIGHTS_DATA
— — —
99879
Look at the schema definition of the table.
SELECT NAME, COLTYPE, LENGTH
FROM SYSIBM.SYSCOLUMNS
WHERE TBCREATOR = 'FLIGHT' AND TBNAME = 'FLIGHTS_DATA'
ORDER BY COLNO
FLIGHTSTATUS is the response, or target, column. The others are feature columns.
Find the DISTINCT values in the target column.
From these values, I can see that this is a binary classification task, where each flight arrived either on time or late.
Find the frequencies of the distinct values in the FLIGHTSTATUS column.
SELECT FLIGHTSTATUS, count(*) AS FREQUENCY, CAST(count(*) AS DECIMAL(12,4)) / (SELECT count(*) FROM FLIGHT.FLIGHTS_DATA) AS FRACTION
FROM FLIGHT.FLIGHTS_DATA fdf
GROUP BY FLIGHTSTATUS
From the above, I see that the classes are imbalanced. I'll not gather any further insights from the complete dataset, as this may leak information into the modeling phase.
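The frequency-and-fraction computation in the query above can be mimicked in a few lines of Python (a toy illustration with made-up labels, not the Db2 code):

```python
from collections import Counter

def class_distribution(labels):
    # Frequency and fraction of each label, like the GROUP BY query above.
    total = len(labels)
    return {c: (n, n / total) for c, n in sorted(Counter(labels).items())}

toy = ["0"] * 75 + ["1"] * 25           # imbalanced toy labels
print(class_distribution(toy))          # {'0': (75, 0.75), '1': (25, 0.25)}
```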
Creating train/test partitions of the dataset
Before gathering deeper insights into the data, I'll divide this dataset into train and test partitions using Db2's RANDOM_SAMPLE stored procedure. I apply stratified sampling to preserve the ratio between the two classes in the generated training dataset.
Create a TRAIN partition.
call IDAX.RANDOM_SAMPLE('intable=FLIGHT.FLIGHTS_DATA, fraction=0.8, outtable=FLIGHT.FLIGHTS_TRAIN, by=FLIGHTSTATUS')
Copy the remaining records to a TEST partition.
CREATE TABLE FLIGHT.FLIGHTS_TEST AS (SELECT * FROM FLIGHT.FLIGHTS_DATA FDF WHERE FDF.ID NOT IN(SELECT FT.ID FROM FLIGHT.FLIGHTS_TRAIN FT)) WITH DATA
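The idea behind stratified sampling with `by=FLIGHTSTATUS` can be sketched in Python. This is a simplified illustration of the technique, not the internals of IDAX.RANDOM_SAMPLE:

```python
import random
from collections import defaultdict

def stratified_split(rows, label_of, fraction=0.8, seed=7):
    # Sample `fraction` of each class separately so the class ratio
    # in the training set matches the full dataset.
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[label_of(row)].append(row)
    train, test = [], []
    for members in by_class.values():
        rng.shuffle(members)
        cut = round(len(members) * fraction)
        train.extend(members[:cut])
        test.extend(members[cut:])
    return train, test

# toy data: 800 on-time rows and 200 late rows (an 80/20 imbalance)
rows = [("on_time", i) for i in range(800)] + [("late", i) for i in range(200)]
train, test = stratified_split(rows, label_of=lambda row: row[0])
print(len(train), len(test))  # 800 200
```

Each class contributes exactly 80% of its rows to the training set, so the 80/20 class ratio survives the split.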
2. Explore data
In this step, I'll look at both sample records and the summary statistics of the training dataset to gain insights into the dataset.
Look at some sample records.
SELECT * FROM FLIGHT.FLIGHTS_TRAIN FETCH FIRST 10 ROWS ONLY
Some columns encode time values as numbers:
— CRSDEPTIME: Computer Reservation System (scheduled) Departure Time (hhmm)
— DepTime: Departure Time (hhmm)
— CRSArrTime: Computer Reservation System (scheduled) Arrival Time
Now, I collect summary statistics from FLIGHTS_TRAIN using the SUMMARY1000 SP to get a global view of the characteristics of the dataset.
CALL IDAX.SUMMARY1000('intable=FLIGHT.FLIGHTS_TRAIN, outtable=FLIGHT.FLIGHTS_TRAIN_SUM1000')
Here, intable holds the name of the input table from which I want the SUMMARY1000 SP to collect statistics. outtable is the name of the table where SUMMARY1000 will store the gathered statistics for the complete dataset. Besides the outtable, the SUMMARY1000 SP creates several additional output tables: one table with statistics for each column type. Our dataset has two types of columns, numeric and nominal, so SUMMARY1000 will generate two additional tables. These additional tables follow this naming convention: the name of the outtable + the column type. In our case, the column types are NUM, representing numeric, and CHAR, representing nominal. So, the names of these two additional tables will be as follows:
FLIGHTS_TRAIN_SUM1000_NUM
FLIGHTS_TRAIN_SUM1000_CHAR
Having the statistics available in separate tables for specific datatypes makes it easier to view the statistics that apply to a specific datatype and reduces the number of columns whose statistics are viewed together. This simplifies the analysis process.
Check the summary statistics of the numeric columns.
SELECT * FROM FLIGHT.FLIGHTS_TRAIN_SUM1000_NUM
For the numeric columns, SUMMARY1000 gathers the following statistics:
- Missing value count
- Non-missing value count
- Average
- Variance
- Standard deviation
- Skewness
- Excess kurtosis
- Minimum
- Maximum
Each of these statistics can help uncover insights into the dataset. For instance, I can see that the DEPDEL15 and DEPDELAY columns have 49 missing values. There are large values in these columns: AIRTIME, CRSARRTIME, CRSDEPTIME, CRSELAPSEDTIME, DEPDELAY, DEPTIME, TAXIOUT, WHEELSOFF, and YEAR. Since I'll create a decision tree model, I don't have to deal with the large values or the missing values. Db2 handles both issues natively.
Next, I inspect the summary statistics of the nominal columns.
select * from FLIGHT.FLIGHTS_TRAIN_SUM1000_CHAR
For the nominal columns, SUMMARY1000 gathers the following statistics:
- Number of missing values
- Number of non-missing values
- Number of distinct values
- Frequency of the most frequent value
3. Preprocess data
From the above data exploration, I can see that the dataset has very few missing values, which the decision tree handles natively. Four TIME columns have large values: AIRTIME, CRSARRTIME, DEPTIME, WHEELSOFF. I'll leave the nominal values in all columns as-is, as the decision tree implementation in Db2 can deal with them natively.
Extract the hour part from the TIME columns — CRSARRTIME, DEPTIME, WHEELSOFF.
From looking up the description of the dataset, I see that the values in the CRSARRTIME, DEPTIME, and WHEELSOFF columns are hhmm encodings of the time values. I extract the hour part of these values to create, hopefully, better features for the learning algorithm.
Scale the CRSARRTIME column: dividing the value by 100 gives the hour of the flight arrival time:
UPDATE FLIGHT.FLIGHTS_TRAIN SET CRSARRTIME = CRSARRTIME / 100
Scale the DEPTIME column: dividing the value by 100 gives the hour of the flight departure time:
UPDATE FLIGHT.FLIGHTS_TRAIN SET DEPTIME = DEPTIME / 100
Scale the WHEELSOFF column: dividing the value by 100 gives the hour of the wheels-off time:
UPDATE FLIGHT.FLIGHTS_TRAIN SET WHEELSOFF = WHEELSOFF / 100
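What these UPDATE statements do to each value can be sketched in a line of Python: since both operands are integers, the division truncates, leaving only the hour.

```python
def hour_of(hhmm: int) -> int:
    # 1736 -> 17: integer division by 100 drops the minutes,
    # matching the UPDATEs above on the INTEGER time columns.
    return hhmm // 100

print(hour_of(1736), hour_of(905), hour_of(45))  # 17 9 0
```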
4. Train a decision tree model
Now the training dataset is ready for the decision tree algorithm.
I train a decision tree model using the GROW_DECTREE SP.
CALL IDAX.GROW_DECTREE('model=FLIGHT.flight_dectree, intable=FLIGHT.FLIGHTS_TRAIN, id=ID, target=FLIGHTSTATUS')
I called this SP with the following parameters:
- model: the name I want to give to the decision tree model, FLIGHT_DECTREE
- intable: the name of the table where the training dataset is stored
- id: the name of the ID column
- target: the name of the target column
After completing the model training, the GROW_DECTREE SP generated several tables with metadata from the model and the training dataset. Here are some of the key tables:
- FLIGHT_DECTREE_MODEL: this table contains metadata about the model. Examples of such metadata include the depth of the tree, the strategy for handling missing values, and the number of leaf nodes in the tree.
- FLIGHT_DECTREE_NODES: this table provides information about each node in the decision tree.
- FLIGHT_DECTREE_COLUMNS: this table provides information on each input column and its role in the trained model. The information includes the importance of a column in generating a prediction from the model.
This link has the complete list of model tables and their details.
5. Generate predictions from the model
Since the FLIGHT_DECTREE model is trained and deployed in the database, I can use it to generate predictions on the test records from the FLIGHTS_TEST table.
First, I preprocess the test dataset using the same preprocessing logic that I applied to the TRAINING dataset.
Scale the CRSARRTIME column: dividing the value by 100 gives the hour of the flight arrival time:
UPDATE FLIGHT.FLIGHTS_TEST SET CRSARRTIME = CRSARRTIME / 100
Scale the DEPTIME column: dividing the value by 100 gives the hour of the flight departure time:
UPDATE FLIGHT.FLIGHTS_TEST SET DEPTIME = DEPTIME / 100
Scale the WHEELSOFF column: dividing the value by 100 gives the hour of the wheels-off time:
UPDATE FLIGHT.FLIGHTS_TEST SET WHEELSOFF = WHEELSOFF / 100
Generating predictions
I use the PREDICT_DECTREE SP to generate predictions from the FLIGHT_DECTREE model:
CALL IDAX.PREDICT_DECTREE('model=FLIGHT.flight_dectree, intable=FLIGHT.FLIGHTS_TEST, outtable=FLIGHT.FLIGHTS_TEST_PRED, prob=true, outtableprob=FLIGHT.FLIGHTS_TEST_PRED_DIST')
Here is the list of parameters I passed when calling this SP:
- model: the name of the decision tree model, FLIGHT_DECTREE
- intable: the name of the input table to generate predictions from
- outtable: the name of the table that the SP will create and store predictions in
- prob: a boolean flag indicating whether to include the probability of each prediction in the output
- outtableprob: the name of the output table where the probability of each prediction will be stored
6. Evaluate the model
Using the generated predictions for the test dataset, I compute several metrics to evaluate the quality of the model's predictions.
Creating a confusion matrix
I use the CONFUSION_MATRIX SP to create a confusion matrix based on the model's predictions on the TEST dataset.
CALL IDAX.CONFUSION_MATRIX('intable=FLIGHT.FLIGHTS_TEST, resulttable=FLIGHT.FLIGHTS_TEST_PRED, id=ID, target=FLIGHTSTATUS, matrixTable=FLIGHT.FLIGHTS_TEST_CMATRIX')
Here are some of the key parameters that I passed when calling this SP:
- intable: the name of the table that contains the dataset and the actual values of the target column
- resulttable: the name of the table that contains the column with predicted values from the model
- target: the name of the target column
- matrixTable: the output table where the SP will store the confusion matrix
After the SP completes its run, we have the following output table with statistics for the confusion matrix.
FLIGHTS_TEST_CMATRIX:
This table has three columns. The REAL column has the actual flight status, and the PREDICTION column has the predicted flight status. Since the flight status takes two values, 0 (on time) or 1 (delayed), there are four possible combinations of values in the REAL and PREDICTION columns:
- TRUE NEGATIVE: REAL: 0, PREDICTION: 0. The model accurately predicted the status of the flights that arrived on schedule. From the CNT column, we see that 11795 rows from the TEST table belong to this combination.
- FALSE POSITIVE: REAL: 0, PREDICTION: 1. These flights actually arrived on time, but the model predicted them to be delayed. 671 is the count of such flights.
- FALSE NEGATIVE: REAL: 1, PREDICTION: 0. These flights arrived late, but the model predicted them to be on time. From the CNT column, we find their count to be 2528.
- TRUE POSITIVE: REAL: 1, PREDICTION: 1. The model accurately identified the flights that were late. The count is 4981.
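As a back-of-the-envelope check, both overall accuracies can be recomputed from these four counts. This sketch assumes "weighted accuracy" means the plain average of the two per-class recalls, which reproduces the reported numbers to within rounding:

```python
def overall_metrics(tn, fp, fn, tp):
    # Overall accuracy plus the average of the two per-class recalls
    # (assumed here to be what "weighted accuracy" means).
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total
    recall_on_time = tn / (tn + fp)   # class 0, on-time flights
    recall_late = tp / (tp + fn)      # class 1, delayed flights
    return accuracy, (recall_on_time + recall_late) / 2

acc, wacc = overall_metrics(tn=11795, fp=671, fn=2528, tp=4981)
print(f"accuracy={acc:.4f} weighted={wacc:.4f}")  # accuracy=0.8398 weighted=0.8048
```

The overall accuracy matches the SP's 83.98% exactly; the weighted figure lands at 80.48% versus the SP's reported 80.46%, a small difference likely down to how the SP defines the weighting.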
I use these counts to compute several evaluation metrics for the model. To do so, I use the CMATRIX_STATS SP as follows:
CALL IDAX.CMATRIX_STATS('matrixTable=FLIGHT.FLIGHTS_TEST_CMATRIX')
The only parameter this SP needs is the name of the table that contains the statistics generated by the CONFUSION_MATRIX SP in the previous step. CMATRIX_STATS generates two sets of output. The first shows overall quality metrics of the model. The second includes the model's predictive performance for each class.
The first output, the overall model metrics, includes correct predictions, incorrect predictions, overall accuracy, and weighted accuracy. From this output, I see that the model has an overall accuracy of 83.98% and a weighted accuracy of 80.46%.
With classification tasks, it's usually useful to view the model's quality factors for each individual class. The second output from the CMATRIX_STATS SP includes these class-level quality metrics.
For each class, this output includes the True Positive Rate (TPR), False Positive Rate (FPR), Positive Predictive Value (PPV, or precision), and F-measure (F1 score).
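These per-class metrics follow directly from the confusion-matrix counts. Here is a sketch that computes them with class 1 (delayed) treated as the positive class:

```python
def class_metrics(tp, fp, fn, tn):
    # TPR, FPR, precision (PPV), and F1 for the class treated as positive.
    tpr = tp / (tp + fn)               # recall: late flights caught
    fpr = fp / (fp + tn)               # on-time flights wrongly flagged
    ppv = tp / (tp + fp)               # precision of "delayed" predictions
    f1 = 2 * ppv * tpr / (ppv + tpr)   # harmonic mean of PPV and TPR
    return tpr, fpr, ppv, f1

# counts from the confusion matrix above, class 1 (delayed) as positive
tpr, fpr, ppv, f1 = class_metrics(tp=4981, fp=671, fn=2528, tn=11795)
print(f"TPR={tpr:.3f} FPR={fpr:.3f} PPV={ppv:.3f} F1={f1:.3f}")
```

Swapping the roles of the counts (tp↔tn, fp↔fn) yields the same metrics for class 0, which is how the SP can report one row per class.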
Conclusions and key takeaways
If you want to build and deploy an ML model in a Db2 database using Db2's built-in stored procedures, I hope you'll find this tutorial useful. Here are the main takeaways of this tutorial:
- Demonstrated a complete workflow of creating and using a decision tree model in a Db2 database using the in-database ML stored procedures.
- For each step in the workflow, I provided concrete and functional SQL statements and stored procedure calls, and, where applicable, explained intuitively what each code example does along with its inputs and outputs.
- Included references to IBM Db2's documentation for the ML stored procedures used in this tutorial.
O'Reilly's 2022 AI Adoption survey [3] underscored the challenges of building technical infrastructure and the skills gap as two top bottlenecks to AI adoption in the enterprise. Db2 solves the first by supplying end-to-end ML infrastructure in the database. It also lessens the latter, the skills gap, by providing a simple SQL API for creating and using ML models in the database. In the enterprise, SQL is a more common skill than ML.
Check out the following resources to learn more about the ML features in IBM Db2 and to see additional examples of ML use cases implemented with these features.
Explore the Db2 ML product documentation
Explore the Db2 ML samples on GitHub
References
- Paleyes, A., Urma, R.G. and Lawrence, N.D., 2022. Challenges in deploying machine learning: a survey of case studies. ACM Computing Surveys, 55(6), pp.1–29.
- Amershi, S., Begel, A., Bird, C., DeLine, R., Gall, H., Kamar, E., Nagappan, N., Nushi, B. and Zimmermann, T., 2019, May. Software engineering for machine learning: A case study. In 2019 IEEE/ACM 41st International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP) (pp. 291–300). IEEE.
- Loukides, Mike, AI Adoption in the Enterprise 2022. https://www.oreilly.com/radar/ai-adoption-in-the-enterprise-2022/