Data Mining Apriori Algorithm

free research paper sample

The basic idea of CF-based algorithms is to provide item recommendations or predictions based on the opinions of other like-minded users. The opinions of users can be obtained explicitly from the users or by using some implicit measures. Collaborative filtering techniques collect and establish profiles, and determine the relationships among the data according to similarity models. The possible categories of the data in the profiles include user preferences, user behavior patterns, or item properties Everyday Examples of Collaborative Filtering… • • • Bestseller lists Top 40 music lists The “recent returns” shelf at the library Many weblogs Challenges of collaborative filtering. • The lack of the information would affect the recommendation results. For the relationship mining, new items not-yet-rated or not-yet-labeled can be abandoned in the recommendation processes. • Collaborative filtering does not cover the extreme case. If the scales of the user profiles are small or the users have unique tastes, similarity decisions are unable to be established. If any new information of users has to be included in the recommendation processes in real time, data latency will increase the waiting time for the query result. The complexity of the computation for the recommendation affects the waiting time of the user directly. • Synchronization is another issue of the profile updates in the system. When hundreds of users query the system within a very short time period. 3 Explicit vs. Implicit Data Collection In order to make any recommendations, the system has to collect data.

The ultimate goal of collection the data is to get an idea of user preferences, which can later be used to make predictions on future user preferences. There are two ways to gather the data. The first method is to ask for explicit ratings from a user, typically on a concrete rating scale (such as rating a movie from one to five stars). The second is to gather data implicitly as the user is in the domain of the system – that is, to log the actions of a user on the site. Explicit data gathering is easy to work with.

Assumedly, the ratings that a user provides can be directly interpreted as the user’s preferences, making it easier to make extrapolations from data to predict future ratings. However, the drawback with explicit data is that it puts the responsibility of data collection on the user, who may not want to take time to enter ratings. On the other hand, implicit data is easy to collect in large quantities without any extra effort on the part of the user. Unfortunately, it is much more difficult to work with since the goal is to convert user behavior into user references. Of course, these two methods of gathering data are not mutually exclusive. A combination of the two have the possibility for the best overall results – one could gain the advantages of explicit voting when the user chooses to rate items, and could still make recommendations when the user does not rate items by implicitly collecting data. Passive vs. Active Filtering The distinction between these two methods is subtle, but basically comes down to whether or not the recommendations are user-specific or not.

No time to write a research paper?
Order it from experienced writers now!
For Only $13.90/page $19.90

Order Now

In passive filtering, every user will be given the same predictions for a particular item. In active filtering, the system takes into account your specific history in order to make a recommendation. Ultimately, active filtering is what most people mean when they talk about collaborative filtering. Though passive filtering has very useful and practical applications, a personal recommendation system can only be implemented using active filtering. User-centric vs. Item-centric Filtering All recommender systems must decide whether or not it will attempt to see patterns between users or between items.

A user-centric system will find similarities between users then use the similar users’ preferences to predict ratings. An alternative to this is an item-centric system which will attempt to find the relationships between items, and make predictions then only based on a user’s preferences and these relationships. It is not necessary that a recommender system focus only on users or items, but most typically only find similarities between users or similarities between items and not both. 4 Memory-based Collaborative Filtering Algorithms. Memory-based algorithms utilize the entire user-item database to generate a prediction.

These systems employ statistical techniques to find a set of users, known as neighbors, that have a history of agreeing with the target user (i. e. , they either rate different items similarly or they tend to buy similar sets of items). Once a neighborhood of users is formed, these systems use different algorithms to combine the preferences of neighbors to produce a prediction or top-N recommendation for the active user. The techniques, also known as nearest-neighbor or user-based collaborative filtering are more popular and widely used in practice.

Model – based Collaborative Filtering Algorithms Model-based collaborative filtering algorithms provide item recommendation by first developing a model of user ratings. Algorithms in this category take a probabilistic approach and envision the collaborative filtering process as computing the expected value of a user prediction, given his/her ratings on other items. The model building process is performed by different machine learning algorithms such as Bayesian network, clustering, and rule-based approaches.

The Bayesian network model formulates a probabilistic model for collaborative filtering problem. The clustering model treats collaborative filtering as a classification problem and works by clustering similar users in same class and estimating the probability that a particular user is in a particular class C, and from there computes the conditional probability of ratings. The rule-based approach applies association rule discovery algorithms to find association between co-purchased items and then generates item recommendation based on the strength of the association between items 2. Problem Statement

Input Transaction logs containing preprocessed data, such as Customer Id and Product Codes of Item bought in a Single Transaction and Transaction number of the Customer and Transaction Number in total. We will attempt to make item-centric active filtering collaborative filters using Apriori model based algorithm, and then using k-nearest neighbors algorithm 5 3. Algorithm 1 – Apriori (Association Rule) Apriori is a classic algorithm for learning association rules. Apriori is designed to operate on databases containing transactions (for example,collections of items bought by customers, or details of a website frequentation).

The algorithm attempts to find subsets which are common to at least a minimum number C (the cutoff, or confidence threshold) of the itemsets. Apriori uses a “bottom up” approach, where frequent subsets are extended one item at a time (a step known as candidate generation, and groups of candidates are tested against the data. The algorithm terminates when no further successful extensions are found. Apriori uses breadth-first search and a hash tree to count candidate item sets efficiently Apriori Pseudo Code Ck: Candidate itemset of size k Lk : requent itemset of size k L1 = {frequent items}; for (k = 1; Lk ! =O; k++) do begin Ck+1 = candidates generated from Lk; for each transaction t in database do increment the count of all candidates in Ck+1 that are contained in t Lk+1 = candidates in Ck+1 with min_support end return k Lk; Advantages of Apriori Algorithm 1. Uses large itemset property 2. Easily parallelized 3. Easy to implement 6 4. The Apriori Algorithm: Example Disadvantages of Apriori Algorithm 1. Assumes transaction database is memory resident. 2. Requires many database scans.

TID T101 T102 T103 T104 T105 T106 T107 T108 T109 List of Items I1, I2, I5 I1, I2, I3 I2, I4 I2, I3 I1, I2, I4 I1, I3 I2, I3 I1, I3 I1, I2 ,I3, I5 • Consider a database, D , consisting of 9 transactions • Suppose min. support count required is 2 (i. e. min_sup = 2/9 = 22 % ) • Let minimum confidence required is 70% • We have to first find out the frequent itemset using Apriori algorithm. • Then, Association rules will be generated using min. support & min. confidence. 7 STEP 1 : Generating 1-itemset frequent set pattern Itemset Sup.

Compare candidate support count with minimum support count Count Itemset Sup. Count Scan D for count of each candidate {I1} {I2} {I3} {I4} {I5} C1 6 7 6 2 2 {I1} {I2} {I3} {I4} {I5} L1 6 7 6 2 2 • The set of frequent 1-itemsets, L1 , consists of the candidate 1- itemsets satisfying minimum support. • In the first iteration of the algorithm, each item is a member of the set of candidate. 8 STEP 2 : Generating 2-itemset frequent set pattern Itemset Sup. Count Generate C2 candidates from L1 Itemset Sup {I1, I2} {I1, I3} {I1, I5} {I2, I3} {I2, I4} {I2, I5} L2 Count 4 4 2 4 2 2

Itemset {I1, I2} {I1, I3} {I1, I4} {I1, I5} {I2, I3} {I2, I4} {I2, I5} {I3, I4} {I3, I5} {I4, I5} C2 C2 Scan D for count of each candidate {I1, I2} {I1, I3} {I1, I4} {I1, I5} {I2, I3} {I2, I4} {I2, I5} {I3, I4} {I3, I5} {I4, I5} 4 4 1 2 4 2 2 0 1 0 Compare candidate support count with minimum support count 9 ? To discover the set of frequent 2-itemsets, L2 , the algorithm uses L1 Join L1 to generate a candidate set of 2-itemsets, C2. ? Next, the transactions in D are scanned and the support count for each candidate itemset in C2 is accumulated (as shown in the middle table). The set of frequent 2-itemsets, L2 , is then determined, consisting of those candidate 2-itemsets in C2 having minimum support. STEP 3: Generating 3-itemset frequent set pattern Itemset Scan D for count of each candidate Itemset Sup. Count Scan D for count of each candidate Compare candidate support count with min support Itemset Sup Count {I1, I2, I3} {I1, I2, I5} 2 2 {I1, I2, I3} {I1, I2, I5} {I1, I2, I3} {I1, I2, I5} 2 2 C3 C3 count L3 • The generation of the set of candidate 3-itemsets, C3 , involves use of the Apriori Property. • In order to find C3, we compute L2 Join L2. C3 = L2 Join L2 = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}}. 10 • Now, Join step is complete and Prune step will be used to reduce the size of C3. Prune step helps to avoid heavy computation due to large Ck. • Based on the Apriori property that all subsets of a frequent itemset must also be frequent, We can determine that four latter candidates cannot possibly be frequent • For example , lets take {I1, I2, I3}. The 2-item subsets of it are {I1, I2}, {I1, l3} & {l2, l3}. Since all 2-item subsets of {I1, I2, I3} are members of L2, We will keep {I1, I2, I3} in C3. Lets take another example of {I2, I3, I5} which shows how the pruning is performed. The 2-item subsets are {I2, I3}, {I2, I5} & {I3,I5}. • BUT, {I3, I5} is not a member of L2 and hence it is not frequent violating • Therefore, C3 = {{I1, I2, I3}, {I1, I2, I5}} after checking for all members of result of Join Operation for pruning • Now the transactions in D are scanned in order to determine L3 consisting of those candidates 3itemsets in C3 having minimum support Step 4 : Generating a 4-itemset frequent pattern The algorithm uses L3 join L# to generate candidate sets of 4 itemsets.

Although the join results in {{I1, I12, I3, I5}} the itemset is pruned since its subset {{I2, I3, I5}} is not frequent. Thus algorithm terminates having found all of the frequent items These frequent itemsets are used to generate strong association rules in the following manner with minimum confidence of 70% We had L = {{I1}, {I2}, {I3}, {I4}, {I5}, {I1,I2}, {I1,I3}, {I1,I5}, {I2,I3}, {I2,I4}, {I2,I5}, {I1,I2,I3}, {I1,I2,I5}}. – Lets take l = {I1,I2,I5}. 11 – Its all nonempty subsets are {I1,I2}, {I1,I5}, {I2,I5}, {I1}, {I2}, {I5}. R1: I1 ^ I2 I5 • Confidence = sc{I1,I2,I5}/sc{I1,I2} = 2/4 = 50% • R1 is Rejected. – R2: I1 ^ I5 I2 • Confidence = sc{I1,I2,I5}/sc{I1,I5} = 2/2 = 100% • R2 is Selected. – R3: I2 ^ I5 I1 • Confidence = sc{I1,I2,I5}/sc{I2,I5} = 2/2 = 100% • R3 is Selected. – R4: I1 I2 ^ I5 • Confidence = sc{I1,I2,I5}/sc{I1} = 2/6 = 33% • R4 is Rejected. –R5: I2 I1 ^ I5 • Confidence = sc{I1,I2,I5}/{I2} = 2/7 = 29% • R5 is Rejected. – R6: I5 I1 ^ I2 12 • Confidence = sc{I1,I2,I5}/ {I5} = 2/2 = 100% • R6 is Selected. Thus we get 3 strong association rules 13 Classification: s the task of generalizing known structure to apply to new data 1. Supervised Classification • The set of possible classes is known in advance. • The input data, also called the training set, consists of multiple records each having multiple attributes or features. • Each record is tagged with a class label. • The objective of classification is to analyze the input data and to develop an accurate description or model for each class using the features present in the data. • This model is used to classify test data for which the class descriptions are not known. . Unsupervised Classification • Set of possible classes is not known. • After classification we can try to assign a name to that class. • Unsupervised classification is called clustering. Classification—A Two-Step Process 1. Model construction: describing a set of predetermined classes • Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute • The set of tuples used for model construction: training set • The model is represented as classification rules, decision trees, or mathematical formula 2.

Model usage: for classifying future or unknown objects; Estimate accuracy of the model • The known label of test sample is compared with the classified result from the model • Accuracy rate is the percentage of test set samples that are correctly classified by the model • Test set is independent of training set, otherwise over-fitting will occur 14 15 Classification Techniques • • • • • Rules Regression Distance Decision Trees Neural Networks 16 1. Claasification using rules 2. Classification Using Regression Division: Use regression function to divide area into regions.

Prediction: Use regression function to predict a class membership function. Input includes desired class. 3. Classification using distance 17 • Place items in class to which they are “closest”. • Must determine distance between an item and a class. • Classes represented by – Centroid: Central value. – Medoid: Representative point. – Individual points 4. Classification Using Decision Trees Partitioning based: Divide search space into rectangular regions. region within which it falls. Algorithms: ID3, C4. 5, CART Tuple placed into class based on the 18 -NN Algorithm ? Requires 3 things: ? The set of stored records ? Distance metric to compute distance between records ? The value of k, the number of nearest neighbors to retrieve To classify an unknown record: ? Compute distance to other training records ? Identify k nearest neighbors ? Use class labels of nearest neighbors to determine the class label of unknown record (e. g. , by taking majority vote) K-nearest neighbor classification 1. Based on closest training examples in the feature space. Sensitive to the local structure of the data 2. Type of learning. azy learning: system generalizing the training data is delayed until a query is made to the system eager learning : system generalizes the training data before receiving queries 3. Instance-based learning Constructs hypotheses directly from the training instances themselves. This means that the hypothesis complexity can grow with the data. 19 Algorithm ? ? ? ? ? Step 1: determine k Step 2: calculate the distances between the new input and all the training data Step 3: sort the distance and determine k nearest neighbors based on the k-th minimum distance.

Step 4: gather the categories of those neighbors. Step 5: determine the category based on majority vote. Eg. Let query = (3,4) X co-ordinates Y co-ordinates Emotion 20 3 -1 2 0 Step 1: k = 3. 3 -4 3 -5 Happy Sad Happy Sad Step 2: calculate the distances between the new input and all the training data(using euclidean distance) X co-ordinates 3 -1 2 0 Step 3: ranking X 3 2 -1 0 Y 3 3 -4 -5 Y co-ordinates 3 -4 3 -5 Distance 1 8. 9 1. 4 9. 4 Distance 1 1. 4 8. 9 9. 4 Rank 1 2 3 4 Neighbours Yes Yes Yes No

Step 4: categorization X 3 2 -1 Y 3 3 -4 Distance 1 1. 4 8. 9 Rank 1 2 3 Neighbours Yes Yes Yes Category Happy Happy Sad Step 5: determine the category 21 22 Vornoi Cells kNN rule leads to partition of the space into cells (Vornoi cells) enclosing the training points labeled as belonging to the same class The decision boundary in a Vornoi tessellation of the feature space resembles the surface of a crystal 23 KNN: Distance Measures Distance (or similarity) between instances is easy if the data is numeric.

Typically use Euclidian distance: Dist ( X , Y ) ? ? ( Xi ? Yi ) i ? 1 D 2 KNN: Non Numeric Distance ? For nominal attributes, we can only compare whether the value is the same or not. Equally, this can be done by dividing enumerations into many boolean/integer attributes. ? Might be able to convert to attributes between which distance can be determined by some function. Eg colour, temperature. ? Text can be treated as one attribute per word, with the frequency as the value, normalised to 0.. , and preferably with very high frequency words ignored (eg the, a, as, is… )??? How to determine the good value for k? ? ? ? ? ? ? Determined experimentally Start with k=1 and use a test set to validate the error rate of the classifier Repeat with k=k+2 Choose the value of k for which the error rate is minimum If k is too small, sensitive to noise points If k is too large, neighborhood may include points from other classes Note: k should be odd number to avoid ties Advantages: ? ? ? ?

Training is very fast Learn complex target functions Do not loose information Asymptotically, the error rate of 1-nearest-neighbor classification is less than twice the Bayes rate [error rate of classifier knowing model that generated data 24 ? Building model is cheap ? Extremely flexible classification scheme ? Error rate at most twice that of Bayes error rate Disadvantages: ? Slow at query time ? Classifying unknown records are relatively expensive o Requires distance computation of k-nearest neighbors o Computationally intensive, especially when the size of the training set grows ?

Accuracy can be severely degraded by the presence of noisy or irrelevant features 25 References ? Agrawal R, Imielinski T, Swami AN. “Mining Association Rules between Sets of Items in Large Databases. “SIGM OD. June 1993 ? Agrawal R, Srikant R. “Fast Algorithms for Mining Association Rules” 1994, Chile, ISBN 1-55860-153-8. ? Implementation of Web Usage Mining Using APRIORI and FP Growth Algorithms, B. Santhosh Kumar Department of Computer Science, C. S. I. College of Engineering, K. V. Rukmani Department of Computer Science, C. S. I.

College of Engineering. ? Mannila H, Toivonen H, Verkamo AI. “Efficient algorithms for discovering association rules. “AAAI Workshop on Knowledge Discovery in Databases (SIGKDD). July 1994, Seattle. ? Fabrizio Sebastiani. Machine Learning in Automated Text Categorization. ACM Computing Surveys, ? Tom Mitchell, Machine Learning. McGraw-Hill, 1997. ? Yiming Yang & Xin Liu, A re-examination of text categorization methods. Proceedings of SIGIR, 1999. ? Evaluating and Optimizing Autonomous Text Classification Systems (1995) David Lewis.

Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ? Han, Jiawei and Kamber, Micheline. Data Mining: Concepts and Techniques. ? Lifshits, Yury. Algorithms for Nearest Neighbor. Steklov Insitute of Mathematics at St. Petersburg. April 2007 ? Cherni, Sofiya. Nearest Neighbor Method. South Dakota School of Mines and Technology. 26 Acknowledgements I would like to thank Mrs. Shubhangi Gawali for being an excellent mentor and a patient guide throughout this whole learning process 27