Datascience Project II

Text Analysis and Entity Resolution

Posted by Nivin Anton Alexis Lawrence on January 2, 2017


Text Analysis and Entity Resolution

Overview

Entity resolution is a common, yet difficult problem in data cleaning and integration. In this assignment, we will use powerful and scalable text analysis techniques to perform entity resolution across two data sets of commercial products.

Entity Resolution

Entity resolution, also known as record deduplication, is the process of identifying rows in one or more data sets that refer to the same real-world entity. Here’s an example: you’re on eBay looking for a hip data science accessory, but you’re on a budget, so you decide to scrape the eBay listings for a few days to get a feel for the market. Unfortunately, the listings are confusing and you don’t know how to aggregate them. Entity resolution to the rescue! You find an authoritative database and map all the eBay listings to it. Now you can comparison shop, get that sweet Keuffel and Esser for way cheap, and impress all the data hipsters.

But finding matching records is a hard problem in general. A major reason is that the criteria for identifying duplicates are often vague and impossible to encode in rules. In addition, from a purely computational perspective, the problem is quadratic in the size of its inputs: naively, all pairs of records need to be compared to find all the duplicates. In this assignment, we will begin to address both these challenges.

Application

Your assignment is to perform entity resolution over two web-scraped data sets of commercial product listings, one from Amazon, and one from Google. The goal is to build a unified database of all products listed on the Internet: a one-stop-shop for all your shopping needs. (Elevator pitch: it’s like Kayak.com for e-commerce!)

The web has made unprecedented amounts of data available publicly, but scraped data frequently needs to be de-duplicated. These data sets are typical examples of what you can collect with some simple scripting. The data is not especially large (just a few thousand records), but even so, you will find that entity resolution is a major challenge (top results with this data are ~50% success rate). Don’t get discouraged; the goal is to get acquainted with techniques to tackle the problem, and apply them to a representative example.

Files

Data files for this assignment can be found at:

https://www.dropbox.com/sh/e8r0a35i25ywnhf/AABQuGJ1tZ0nese2hx4oC7E9a?dl=0

The zip file includes the following files:

  • Google.csv, the Google Products data set
  • Amazon.csv, the Amazon data set
  • Google_small.csv, 200 records sampled from the Google data
  • Amazon_small.csv, 200 records sampled from the Amazon data
  • Amazon_Google_perfectMapping.csv, the “gold standard” mapping of duplicates
  • stopwords.txt, a list of common English words

Besides the complete data files, there are “sample” data files for each data set. Use these for Part 1. In addition, there is a “gold standard” file that contains all of the true mappings between duplicate entities in the two data sets. Every row in the gold standard file has a pair of record IDs (one Google, one Amazon) that belong to two records describing the same thing in the real world. We will use the gold standard to evaluate our algorithms.

Deliverables

Complete all the exercises below and turn in a write-up in the form of an IPython notebook, that is, an .ipynb file. The write-up should include your code, answers to exercise questions, and plots of results. Submit your notebook on Canvas under A1.

You must use this notebook and fill in answers inline. In this notebook, we provide code templates for many of the exercises. They are intended to help with code re-use, since the exercises build on each other, and are highly recommended. Don’t forget to include answers to questions that ask for natural language responses, i.e., in English, not code!

We will test your code automatically, so in your submitted notebook the function names must conform to the names requested by the questions, and the notebook must run end-to-end with “Cell > Run All”.

Guidelines

Code

This assignment can be done with basic Python and matplotlib. Feel free to use pandas, too, which you may find well suited to several exercises.

Feel free to use your favorite IDE, but you have to submit this assignment as an IPython notebook (.ipynb), so make sure everything works properly in the notebook before submitting your assignment on Canvas.

Collaboration

This assignment is to be done individually. Everyone should get hands-on experience in this course. You are free to discuss course material with fellow students, and we encourage you to use Internet resources to aid your understanding, but the work you turn in, including all code and answers, must be your own work.

Part 0: Preliminaries

Exercise 0

Download the data and uncompress/unarchive it. Read each file in from the file system, and store them as lists of lines.

For each of the data files (“Google.csv”, “Amazon.csv”, and the samples), we want to parse the IDs out of each record. The IDs are the first column of the file (they are URLs for Google, and alphanumeric strings for Amazon). Omitting the headers, load these data files into dictionaries mapping ID to a single string containing the rest of the record except for the price field. Insert spaces between the other fields to form this string.

# Please preserve the format of this line so we can use it for automated testing.
DATA_PATH = "/Users/nivinantonalexislawrence/Documents/Books/Data Science/Data/" # Make this the /path/to/the/data`

import csv

# TODO Load data files here...
def loadprodfile(fname):
    # Map each record ID (first column) to "name description manufacturer"; the price field is omitted.
    d = {}
    with open(fname, 'r', encoding="latin-1") as f:   # open with the proper encoding
        reader = csv.reader(f)                        # CSV reader
        next(reader)                                  # skip the header row
        for row in reader:
            key = row[0]
            d[key] = row[1] + ' ' + row[2] + ' ' + row[3]
    return d

amazon_small = loadprodfile(DATA_PATH + "Amazon_small.csv")
print(amazon_small['b000fbk6gc'])
google_small = loadprodfile(DATA_PATH + "Google_small.csv")


# other ones ...

Part 1: ER as Text Similarity

A simple approach to entity resolution is to treat all records as strings and compute their similarity with a string distance function. In this section, we will build some components for bag-of-words text-analysis, and use them to compute record similarity.

1.1 Bags of Words

Bag-of-words is a conceptually simple yet powerful approach to text analysis. The idea is to treat strings, a.k.a. documents, as unordered collections of words, or tokens, i.e., as bags of words.

Note on terminology: “token” is more general than what we ordinarily mean by “word” and includes things like numbers, acronyms, and other exotica like word-roots and fixed-length character strings. Bag of words techniques all apply to any sort of token, so when we say “bag-of-words” we really mean “bag-of-tokens,” strictly speaking.

Tokens become the atomic unit of text comparison. If we want to compare two documents, we count how many tokens they share in common. If we want to search for documents with keyword queries (this is what Google does), then we turn the keywords into tokens and find documents that contain them.

The power of this approach is that it makes string comparisons insensitive to small differences that probably do not affect meaning much, for example, punctuation and word order.
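As a minimal sketch of this idea (the product titles below are made up, not taken from the assignment data), two documents can be compared simply by counting the tokens they share:

doc_a = "canon powershot sd1000 digital camera".split()
doc_b = "canon digital camera battery charger".split()

shared = set(doc_a) & set(doc_b)    # tokens common to both bags
print(len(shared), sorted(shared))  # 3 ['camera', 'canon', 'digital']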

Exercise 1

a. Implement the function simple_tokenize(string) that takes a string and returns a list of tokens in the string. simple_tokenize should split strings using the provided regular expression (and it’s OK to use re.split). Since we want to make token matching case-insensitive, make sure all tokens are lower-case. Give an interpretation, in natural language, of what the regular expression, split_regex, matches.

import re

quickbrownfox = "A quick brown fox jumps over the lazy dog."
split_regex = r'\W+'

# TODO Implement this
def simple_tokenize(string):
    
    str_l = re.split(split_regex,string)
    res = []
    for each in str_l:
        if len(each) >0:
            res.append(each.lower())
    return res
    

print (simple_tokenize(quickbrownfox)) # Should give ['a', 'quick', 'brown', ... ]
# Simple testcases
assert(simple_tokenize(" ") == [])
assert(simple_tokenize("!!!!123A/456_B/789C.123A") == ["123a","456_b","789c","123a"])
assert(simple_tokenize(quickbrownfox) == 
        ['a','quick','brown','fox','jumps','over','the','lazy','dog'])


b. Stopwords are common words that do not contribute much to the content or meaning of a document (e.g., “the”, “a”, “is”, “to”, etc.). Stopwords add noise to bag-of-words comparisons, so they are usually excluded. Using the included file “stopwords.txt”, implement tokenize, an improved tokenizer that does not emit stopwords.

# TODO Implement this
def tokenize(data):
    bag_of_words = simple_tokenize(data)
    with open(DATA_PATH + 'stopwords.txt', 'r') as f_stop_words:
        stop_words = set(f_stop_words.read().split())   # a set gives fast membership tests
    return [word for word in bag_of_words if word not in stop_words]
    

print (tokenize(quickbrownfox)) # Should give ['quick', 'brown', ... ]

assert(tokenize("Why a the?") == [])
assert(tokenize(quickbrownfox) == ['quick','brown','fox','jumps','lazy','dog'])

c. Now let’s tokenize the two small data sets. For each one build a dictionary of tokens, i.e., a dictionary where the record IDs are the keys, and the output of tokenize is the values. Include tokens for the name, description, and manufacturer fields, but not the price field. How many tokens, total, are there in the two data sets? Which Amazon record has the biggest number of tokens?

def makedict(items):
    new_dict = {}
    for key in items:
        value = items[key]
        new_dict[key]= tokenize(value)        
    return new_dict
    
# TODO Compute these (dict() or DataFrame OK)
#print (amazon_small['b000fbk6gc'])
#print (amazon_small['b000bnb72g'])

amazon_rec2tok = makedict(amazon_small)
#print(amazon_rec2tok)
google_rec2tok = makedict(google_small)
def count_data_token(table):
    sum_token = 0
    for key in table:
        sum_token += len(table[key])        
    return sum_token
def max_token_table(table):
    token_count = 0
    idx = 0
    for key in table:
        value = len(table[key])
        if value > token_count:
            token_count = value
            idx = key
    return idx
            
total_tokens = count_data_token(amazon_rec2tok) + count_data_token(google_rec2tok) 
print ('There are %s tokens in the combined data sets' % total_tokens)

biggest_record = max_token_table(amazon_rec2tok)
biggest_len = len(amazon_rec2tok[biggest_record])
print ('The Amazon record with ID "%s" has the most tokens (%d).' % (biggest_record, biggest_len))

1.2 Weighted Bag-of-Words: TF-IDF

Bag-of-words comparisons are not very good when all tokens are treated the same: some tokens are more important than others. Weights give us a way to specify which tokens to favor. With weights, when we compare documents, instead of counting common tokens, we sum up the weights of common tokens.

A good heuristic for assigning weights is called “Term-Frequency/Inverse-Document-Frequency,” or TF-IDF for short.

TF

TF rewards tokens that appear more in the same document. It is computed as the frequency of a token in a document, that is, if token t appears 5 times in document d then the TF of t in d is 5. The intuition for TF is that if a word occurs often in a document, then it is more important to the meaning of the document.

IDF

IDF rewards tokens that are rare across the documents in a data set. The intuition is that it is more significant if two documents share a rare word than a common one. IDF weight for a token, t, in a set of documents, U, is computed as follows:

  • Let N be the total number of documents in U
  • Find n(t), the number of documents in U that contain t
  • Then IDF(t) = log(N/(1+n(t))).

Note we are actually using the log of the inverse document frequency, which typically works better than the raw inverse frequency. But the terminology “IDF” still applies.
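As a quick worked example of the formula above: if a collection contains N = 200 documents and a token t appears in n(t) = 19 of them, then IDF(t) = log(200/(1 + 19)) = log 10 ≈ 2.30 (using the natural log, as the code below does), whereas a token appearing in 99 documents gets only log(200/100) = log 2 ≈ 0.69.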

TF-IDF

Finally, to bring it all together, the total TF-IDF weight for a token in a document is the product of its TF and IDF weights.
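To make the arithmetic concrete before implementing it on the real data, here is a minimal sketch over a made-up three-document corpus (the document names and token lists are hypothetical):

import math

toy_corpus = {
    'doc1': ['adobe', 'photoshop', 'cs3'],
    'doc2': ['adobe', 'illustrator', 'cs3', 'cs3'],
    'doc3': ['quicken', 'deluxe', '2007'],
}

N = len(toy_corpus)
doc_freq = {}                                  # n(t): number of documents containing t
for tokens in toy_corpus.values():
    for t in set(tokens):
        doc_freq[t] = doc_freq.get(t, 0) + 1

idf = {t: math.log(N / (1.0 + n)) for t, n in doc_freq.items()}   # IDF(t) = log(N / (1 + n(t)))

doc = toy_corpus['doc2']
tf_idf = {t: doc.count(t) * idf[t] for t in set(doc)}             # TF * IDF for each token in doc2
print(tf_idf)

Tokens shared by most of the toy documents (like 'adobe' and 'cs3' here) end up with weights at or near zero, while rarer tokens dominate the weight vector.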

Exercise 2

a. Implement tf(tokens) that takes a list of tokens belonging to a single document and returns a dictionary mapping tokens to TF weights.

# TODO Implement this
def tf(tokens):
    tf_map = {}
    for each in tokens:
        if each not in tf_map:
            tf_map[each] = 1
        else:
            tf_map[each] = tf_map[each] + 1

    return tf_map
    
print( tf(tokenize(quickbrownfox + quickbrownfox))) # Should give {'brown': 2, 'lazy': 2, 'jumps': 2...}

b. Implement idfs that assigns an IDF weight to every unique token in a collection of data called corpus. You may structure corpus however you want, but idfs should return a dictionary mapping tokens to weights. Use idfs to compute IDF weights for all tokens in the combined small data sets. How many unique tokens are there?

import math
# TODO Implement this
def unique_token(corpus):

    unique_token = set()
    for sku in corpus:
        doc = corpus[sku]
        for word in doc:
            unique_token.add(word)
    return unique_token

def idfs(corpus):
    set_token = unique_token(corpus)
    N = len(corpus)
    res = {}
    for each in set_token:
        count = 0
        for sku in corpus:
            if each in corpus[sku]:
                count += 1
        ans = math.log( float(N) / (float(1 + count)))
        res[each] = ans
    return res

corpus = amazon_rec2tok.copy()
corpus.update(google_rec2tok)

idfs_small = idfs(corpus)  # IDF weights for all tokens in the combined small data sets

unique_tokens = len(idfs_small)

print ("There are %s unique tokens in the small data sets." % unique_tokens)

idfs_small['product'] # should be 2.436116509460426

c. What are the 10 tokens with the smallest IDF in the combined small data set? Do you think they are useful for search? Why or why not?

def find_ten_smallest_token(mydict):
    res = sorted(mydict.items(), key=lambda x: x[1])
    ans = []
    for i in range(10):
        ans.append(res[i][0])        
    return ans

smallest_idf_tokens =  find_ten_smallest_token(idfs_small)# TODO Compute me

print (smallest_idf_tokens)

Answer: The 10 tokens with the smallest IDF are ['software', 'features', 'new', 'use', 'complete', 'easy', 'system', 'create', 'cd', 'windows']. They are not very useful for search: because they appear in a large fraction of the product records, matching on them does little to distinguish one product from another.

d. Plot a histogram of IDF values. Use 20 buckets for the histogram. What conclusions can you draw from the distribution of weights?

import pylab
%matplotlib inline
def get_values(corpus):
    res = []
    for each in corpus:
        res.append(corpus[each])
    return res

values = get_values(idfs_small)
pylab.hist(values,bins=20)
#print(values)

# TODO Make a plot. HINT: You can use pylab.hist

Answer: The histogram rises toward the right: most of the mass sits at the high end of the IDF range, meaning the bulk of the vocabulary is rare (each such token appears in only a few records), while only a small number of very common tokens receive low IDF weights.

e. Use tf to implement tfidf(tokens, idfs) that takes a list of tokens from a document and a dictionary of idf weights and returns a dictionary mapping tokens to total TF-IDF weight. Use tfidf to compute the weights of Amazon product record ‘b000hkgj8k’.

# TODO Implement this
def tfidf(tokens, idfs):
    tfidfs = {}
    tf_map = tf(tokens)
    for each in tokens:
        result = idfs[each] * tf_map[each]
        tfidfs[each] = result
    return tfidfs


rec_b000hkgj8k_weights = tfidf(amazon_rec2tok['b000hkgj8k'],idfs_small) # Fix me

print ("Amazon record 'b000hkgj8k' has tokens and weights:\n%s" % rec_b000hkgj8k_weights)

1.3 Cosine Similarity

Now we are ready to do text comparisons in a formal way. The metric of string distance we will use is called cosine similarity. We will treat each document as a vector in some high dimensional space. Then, to compare two documents we compute the cosine of the angle between their two document vectors. This is easier than it sounds.

The first question to answer is how do we represent documents as vectors? The answer is familiar: bag-of-words! We treat each unique token as a dimension, and treat token weights as magnitudes in their respective token dimensions. For example, suppose we use simple counts as weights, and we want to interpret the string “Hello, world! Goodbye, world!” as a vector. Then in the “hello” and “goodbye” dimensions the vector has value 1, in the “world” dimension it has value 2, and it is zero in all other dimensions.
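For instance, reusing simple_tokenize from Exercise 1, the example string becomes the following count vector (a quick sketch; any dictionary of token counts serves as the vector):

from collections import Counter

vec = Counter(simple_tokenize("Hello, world! Goodbye, world!"))
print(vec)   # e.g. Counter({'world': 2, 'hello': 1, 'goodbye': 1})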

Next question is: given two vectors how do we find the cosine of the angle between them? Recall the formula for the dot product of two vectors:

\[a \cdot b = \| a \| \| b \| \cos \theta\]

Here $a \cdot b = \sum_{i=1}^n a_i b_i$ is the ordinary dot product of two vectors, and $\|a\| = \sqrt{ \sum_{i=1}^n a_i^2 }$ is the norm of $a$.

We can rearrange terms and solve for the cosine to find it is simply the normalized dot product of the vectors. With our vector model, the dot product and norm computations are simple functions of the bag-of-words document representations, so we now have a formal way to compute similarity:

\[similarity = \cos \theta = \frac{a \cdot b}{\|a\| \|b\|} = \frac{\sum_{i=1}^n a_i b_i}{\sqrt{\sum_{i=1}^n a_i^2} \sqrt{\sum_{i=1}^n b_i^2}}\]

Setting aside the algebra, the geometric interpretation is more intuitive. The angle between two document vectors is small if they share many tokens in common, because they are pointing in roughly the same direction. Then, the cosine of the angle will be large. Otherwise, if the angle is large (and they have few words in common), the cosine is small. So the cosine scales proportionally with our intuitive sense of similarity.
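As a small numeric check of the formula (using toy count vectors rather than TF-IDF weights), the cosine can be computed directly from two weight dictionaries:

import math

a = {'hello': 1, 'world': 2, 'goodbye': 1}
b = {'hello': 1, 'world': 1}

dot = sum(a[t] * b[t] for t in a if t in b)          # 1*1 + 2*1 = 3
norm_a = math.sqrt(sum(v * v for v in a.values()))   # sqrt(6)
norm_b = math.sqrt(sum(v * v for v in b.values()))   # sqrt(2)
print(dot / (norm_a * norm_b))                       # ~0.866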

Exercise 3

a. Implement cosine_similarity(tokens1, tokens2, idfs) that takes two lists of tokens and computes their cosine similarity in the context of some global IDF weights. Use tfidf, and the IDF weights from exercise 2b.

import math

# Optional utility
def dotprod(a, b):
    # Dot product of two weight dictionaries: sum the products over shared tokens.
    return sum(a[t] * b[t] for t in a if t in b)

# Optional utility
def norm(a):
    # Euclidean norm of a weight dictionary.
    return math.sqrt(sum(v * v for v in a.values()))

# Optional freebie
def cossim(a, b):
    return dotprod(a, b) / norm(a) / norm(b)

test_vec1 = {'foo': 2, 'bar': 3, 'baz': 5}
test_vec2 = {'foo': 1, 'bax': 5, 'baz': 4}
print(dotprod(test_vec1, test_vec2), norm(test_vec1)) # Should be 22 6.16441400297

# TODO Implement this
def cosine_similarity(tokens1, tokens2, idfs):
    # Cosine similarity of the TF-IDF weight vectors of two token lists.
    return cossim(tfidf(tokens1, idfs), tfidf(tokens2, idfs))

print(cosine_similarity(["adobe", "photoshop"], ["adobe", "illustrator"], idfs_small)) # Should be 0.321248173318

b. Now we can finally do some entity resolution! For every product record in the small Google data set, use cosine_similarity to compute its similarity to every record in the small Amazon set. Build a dictionary mapping (Amazon Id, Google Id) tuples to similarity scores between 0 and 1. What is the similarity between Amazon record ‘b000o24l3q’ and Google record http://www.google.com/base/feeds/snippets/17242822440574356561?

# TODO Compute similarities
similarities = {}
def norm(e, idfs):
    sum1 = 0.0
    for each in e:
        sum1 += ( idfs[each] ** 2)
    return math.sqrt(sum1)

def cosine_similarity(tokens1, tokens2, idfs):
    tokens = set(tokens1) | set(tokens2)
    weight1 = tfidf(tokens1, idfs)
    weight2 = tfidf(tokens2, idfs)
    norm_a = norm(tokens1, weight1)
    norm_b = norm(tokens2, weight2)
    dot_product = 0.0
    for each_token in tokens:
            if each_token in tokens1 and each_token in tokens2:
                dot_product += (weight1[each_token] * weight2[each_token])  
                
    return dot_product / (norm_a * norm_b)
    
#print (cosine_similarity(["adobe", "photoshop"],["adobe", "illustrator"], idfs_small))


similarity_map = {}
#cosine_similarity(amazon_rec2tok[each_amazon_record],google_rec2tok[each_google_record],idfs_small)
for each_amazon_record in amazon_rec2tok:
    for each_google_record in google_rec2tok:
        similarity_map[(each_amazon_record, each_google_record )] = cosine_similarity(amazon_rec2tok[each_amazon_record],google_rec2tok[each_google_record], idfs_small)

        
print ('Requested similarity is %s.' % similarity_map[('b000o24l3q','http://www.google.com/base/feeds/snippets/17242822440574356561')])

c. Use the “gold standard” data (loaded from the supplied file) to answer the following questions. How many true duplicate pairs are there in the small data sets? What is the average similarity score for true duplicates? What about for non-duplicates? Based on this, is cosine similarity doing a good job, qualitatively speaking, of identifying duplicates? Why or why not?

gold_standard = [] # Load this if not already loaded

with open(DATA_PATH + 'Amazon_Google_perfectMapping.csv', 'r', encoding="latin-1") as f1: # Open with proper encoding
    reader = csv.reader(f1) # CSV reader
    next(reader)            # skip the header row
    gold_standard = list(reader)
#print(len(gold_standard))

# Set of (Amazon ID, Google ID) tuples so we can test similarity_map keys directly
gold_pairs = set((row[0], row[1]) for row in gold_standard)

true_dups = 0
avg_sim_dups = 0.0
avg_sim_non = 0.0
no_dup = 0

for row in gold_standard:
    pair = (row[0], row[1])
    if pair in similarity_map:
        true_dups += 1
        avg_sim_dups += similarity_map[pair]

for pair in similarity_map:
    if pair not in gold_pairs:   # compare tuples to tuples, not to raw CSV rows
        avg_sim_non += similarity_map[pair]
        no_dup += 1

avg_sim_dups /= true_dups
avg_sim_non /= no_dup

print ("There are %s true duplicates." % true_dups)
print ("The average similarity of true duplicates is %s." % avg_sim_dups)
print ("And for non duplicates, it is %s." % avg_sim_non)

Answer: There are 146 true duplicates in the small data sets. The average similarity of true duplicates is about 0.246, while for non-duplicates it is about 0.006. Qualitatively, cosine similarity is doing a good job of identifying duplicates: true duplicates receive similarity scores that are substantially higher, on average, than those of non-duplicates, which score close to zero.