GQL

Association Rule Mining using GQL

What is the most frequently bought item with Infant Formula and Infant Diapers? Association Rule Mining is a key to Market Basket Analysis. While Association Rule Mining is a complex topic, we can use SQL to figure out what items most frequently bought together.

An association rule is a implication expression of the form X → Y, where X and Y are itemset. X is the Antecedent and Y is the Consequent

Example of Association Rules

{Infant Diaper} → {Coke},
{Milk, Bread} → {Eggs, Coke},
{Coke, Bread} → {Infant Milk}
Antecedent → Consequent

Now let’s assume we take {Infant Diaper, Infant Milk} as the Antecedent, and we want to figure out the Consequent i.e. what ITEM is most often bought with Infant Diaper and Infant Milk.

Association Rule are, by definition, extracted from the data i.e. they are core properties of the Data. We are not looking for correlation in the Item sets in a Rule. This makes Graph Databases a good platform for Association Rule Mining. While there are lot of Python and R packages that are designed for Association Rule Mining, we want to start exploring the use of GQL to perform Exploratory Data Mining exercise.

We will start with the following sample Orders and the Items contained in those Orders

ORDER_NUMBERITEMS
3infant formula, infant diaper, coffee, coke
7infant formula, infant diaper, mountain dew, coke, coffee
1bread, infant formula
4bread, infant formula, infant diaper, coffee
6infant formula, infant diaper, mountain dew
5infant formula, infant diaper, coke
2bread, infant diaper, coffee, eggs

We are interested in Items that are bought together with Infant Diaper and Infant Milk.

Let’s start by creating the dataset

create graph order_items {
  node orders ({order_number string})
  , node item ({description string})
  , edge includes (orders)-[]->(items)
};

use graph order_items;
insert (n_o1:orders {_id: "order_1", order_number:'1'})
  , (n_o2:orders {_id: "order_2", order_number:'2'})
  , (n_o3:orders {_id: "order_3", order_number:'3'})
  , (n_o4:orders {_id: "order_4", order_number:'4'})
  , (n_o5:orders {_id: "order_5", order_number:'5'})
  , (n_o6:orders {_id: "order_6", order_number:'6'})
  , (n_o7:orders {_id: "order_7", order_number:'7'})
  , (n_infant_formula:item {_id: 'infant_formula', description:'infant formula'})
  , (n_infant_diaper:item {_id: 'infant_diaper', description:'infant diaper'})
  , (n_coffee:item {_id: 'coffee', description:'coffee'})
  , (n_coke:item {_id: 'coke', description:'coke'})
  , (n_mountain_dew:item {_id: 'mountain_dew', description:'mountain dew'})
  , (n_bread:item {_id: 'bread', description:'bread'})
  , (n_eggs:item {_id: 'eggs', description:'eggs'})
  , (n_o1)-[:includes]->(n_infant_formula)
  , (n_o1)-[:includes]->(n_bread)
  , (n_o2)-[:includes]->(n_bread)
  , (n_o2)-[:includes]->(n_infant_diaper)
  , (n_o2)-[:includes]->(n_coffee)
  , (n_o2)-[:includes]->(n_eggs)
  , (n_o3)-[:includes]->(n_infant_formula)
  , (n_o3)-[:includes]->(n_infant_diaper)
  , (n_o3)-[:includes]->(n_coffee)
  , (n_o3)-[:includes]->(n_coke)
  , (n_o4)-[:includes]->(n_bread)
  , (n_o4)-[:includes]->(n_infant_formula)
  , (n_o4)-[:includes]->(n_infant_diaper)
  , (n_o4)-[:includes]->(n_coffee)
  , (n_o5)-[:includes]->(n_infant_formula)
  , (n_o5)-[:includes]->(n_infant_diaper)
  , (n_o5)-[:includes]->(n_coke)
  , (n_o6)-[:includes]->(n_infant_formula)
  , (n_o6)-[:includes]->(n_infant_diaper)
  , (n_o6)-[:includes]->(n_mountain_dew)
  , (n_o7)-[:includes]->(n_infant_formula)
  , (n_o7)-[:includes]->(n_infant_diaper)
  , (n_o7)-[:includes]->(n_mountain_dew)
  , (n_o7)-[:includes]->(n_coke)
  , (n_o7)-[:includes]->(n_coffee)
;

Let’s use GQL to query the Items that are bought with Infant Diaper and Infant Formula the most.

match (o:orders)-[:includes]->(third_item:item)
CALL (third_item) {
  match (o:orders)-[:includes]->(i1:item {_id:'infant_formula'})
    , (o)-[:includes]->(i2:item {_id:'infant_diaper'})
    , (o)-[:includes]->(third_item)
  return third_item, count(o) as counter  
}
filter counter > 0
return distinct table(third_item.description, counter)
order by counter desc;

Notice the use of CALL statement. The CALL statement is used to invoke an inline procedure or subquery. An inline procedure is a user-defined procedure embedded within a query, commonly used to execute subqueries or perform data modifications. It enables complex logic such as looping. In short, CALL statement enables modular query design and the invocation of complex logic in a graph query.

In the above GQL query we first get a list of all items and then loop through each of them using the CALL statement to count the number of Orders where that item is included along with Infant Formula and Infant Diaper.

This query will output:

Based on the above query, we can extract the following Association Rule:

{Infant Milk, Infant Diaper} → {Coke}
{Infant Milk, Infant Diaper} → {Coffee}

This doesn’t mean that there is correlation between purchase of Infant Milk, Infant Diaper and Coffee. This is just a property if of the Data.

Alternate way is to the GROUP BY clause in GQL:

match ALL (o:orders)-[:includes]->(i1:item {_id:'infant_formula'})
  , (o)-[:includes]->(i2:item {_id:'infant_diaper'})
  , (o)-[:includes]->(i3:item)
return table(i1.description, i2.description, i3.description, count(distinct o))
group by i3.description;

Both the CALL function and the GROUP BY will produce the same results in this case. However CALL function is more versatile allows for complex MATCH statement as part of the sub-query.

Comments

Leave a Reply

Your email address will not be published. Required fields are marked *