Implement two solutions for the problem given below using MapReduce: one on Apache Hadoop and one on Apache Spark. Test both solutions on a collection of at least 1,000 documents.

You are given a large collection of (English) text documents (as files).

For each document, compute the top 20 keywords by relevance score. For a keyword w and a document d, the relevance score is T(w,d)/D(w,d), where:

• T(w,d) = count(w,d)^0.5, where ^ denotes exponentiation,

• count(w,d) is the number of occurrences of w in d, and

• D(w,d) is the fraction of documents in the collection in which w occurs (i.e. x/N if w occurs in x documents out of N, the total number of documents in the collection).
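As a sanity check before writing the cluster jobs, the scoring logic above can be prototyped in plain Python in a MapReduce-like shape (per-document counts as the map side, document frequencies as the reduce side). This is a minimal local sketch with hypothetical helper names, not the Hadoop or Spark implementation itself:

```python
import math
from collections import Counter

def relevance_scores(docs):
    """Prototype of the scoring stages. docs: {doc_id: text}.
    Returns {doc_id: {word: score}} with
    score(w, d) = count(w, d)**0.5 / (x / N)."""
    N = len(docs)
    # "Map" side: per-document term counts.
    counts = {d: Counter(text.lower().split()) for d, text in docs.items()}
    # "Reduce" side: document frequency x for each word
    # (each word counted once per document it occurs in).
    df = Counter()
    for c in counts.values():
        df.update(c.keys())
    # Score every (word, document) pair: T(w,d) / D(w,d).
    return {
        d: {w: math.sqrt(n) / (df[w] / N) for w, n in c.items()}
        for d, c in counts.items()
    }

def top_k(scores, k=20):
    """Top-k keywords per document, highest relevance first."""
    return {
        d: [w for w, _ in sorted(s.items(), key=lambda kv: -kv[1])[:k]]
        for d, s in scores.items()
    }
```

In the Hadoop version these two stages would naturally become two chained jobs (word counts, then document frequencies joined back to produce scores); in Spark they map onto `reduceByKey` steps over (word, doc) and word keys.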

Also compute the intersection of the top-20 keyword sets of all documents.
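The final intersection step is a straightforward fold over the per-document top-20 sets. A small sketch (the helper name is hypothetical; on Spark this would be a `reduce` over sets, on Hadoop a job keyed by keyword that keeps only words appearing in all N documents):

```python
from functools import reduce

def keyword_intersection(top_keywords):
    """Intersect the per-document top-keyword lists.
    top_keywords: {doc_id: [keywords]} -> set of keywords
    common to every document's top list."""
    return reduce(set.intersection,
                  (set(ws) for ws in top_keywords.values()))
```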

Skills: Apache Hadoop, Apache Spark

About the client:
(0 reviews) Ghaziabad, India

Project ID: #33941230