This Week on Kungle.de: Nobel Prizes, Riots in Pakistan and the “Balloon Boy”
Three issues, each covered by hundreds of similar news publications, blocked the front page of Kungle.de. Each publication is interesting and informative on its own, but together they hide other noteworthy information.
I concluded that it was about time to build a new subsystem to reduce the amount of identical information. You can still find all articles via the new “related link”.
The new subsystem “IssueMerger” now merges news with similar content. The older news entries are, the more likely they are to be consolidated into one issue.
For this, I defined a function to calculate the proximity of two entries. (The result is 1 if two news entries are identical and 0 if they are completely different.)
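As an illustration only, a proximity function with these properties could look like the sketch below. It is not the actual Kungle.de code: it assumes that an entry is compared by its text and uses the Jaccard similarity of the two word sets, and the names `Proximity`, `tokens` and `proximity` are made up for this example.

```scala
object Proximity {
  // Hypothetical tokenizer: lowercase the text and split on anything
  // that is not a letter or a digit (so umlauts stay inside words).
  def tokens(text: String): Set[String] =
    text.toLowerCase.split("[^\\p{L}\\p{N}]+").filter(_.nonEmpty).toSet

  // Jaccard similarity of the two word sets: |A ∩ B| / |A ∪ B|.
  // Returns 1.0 for identical word sets and 0.0 for disjoint ones.
  def proximity(a: String, b: String): Double = {
    val ta = tokens(a)
    val tb = tokens(b)
    if (ta.isEmpty && tb.isEmpty) 1.0 // two empty entries count as identical
    else (ta intersect tb).size.toDouble / (ta union tb).size
  }
}
```

With this definition, two entries that share two of four distinct words get a proximity of 0.5. Any other similarity measure that maps into the range 0 to 1 would fit the description just as well.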
This requires building a complete “News Topology” (a matrix with up to 1.5 million elements) that holds the proximity of every pair of entries.
Calculating it for all topics takes up to 40 hours. The algorithm itself was coded in 80 lines of Scala.
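To make the idea more concrete, here is a rough sketch of how such a matrix could be filled and then used for merging. It is not the real IssueMerger: the `Entry` class, the linear `threshold` rule (0.8 for fresh entries, falling to 0.4) and the grouping via union-find are all assumptions chosen for this example. The proximity function is passed in as a parameter.

```scala
object IssueMergerSketch {
  // Hypothetical entry type: an id, the text and the age in hours.
  final case class Entry(id: Int, text: String, ageHours: Double)

  // Fills a symmetric n x n matrix with the proximity of every pair.
  // Only the upper triangle is computed; the rest is mirrored.
  def topology(entries: IndexedSeq[Entry],
               prox: (Entry, Entry) => Double): Array[Array[Double]] = {
    val n = entries.length
    val m = Array.ofDim[Double](n, n)
    for (i <- 0 until n) {
      m(i)(i) = 1.0
      for (j <- i + 1 until n) {
        val p = prox(entries(i), entries(j))
        m(i)(j) = p
        m(j)(i) = p
      }
    }
    m
  }

  // Assumed rule: the required proximity drops with age, so that
  // older entries are merged more readily than fresh ones.
  def threshold(ageHours: Double): Double =
    math.max(0.4, 0.8 - 0.01 * ageHours)

  // Groups entries into issues with a simple union-find structure.
  // The younger of the two entries decides which threshold applies.
  def merge(entries: IndexedSeq[Entry],
            m: Array[Array[Double]]): Seq[Set[Int]] = {
    val parent = Array.range(0, entries.length)
    def find(i: Int): Int =
      if (parent(i) == i) i
      else { parent(i) = find(parent(i)); parent(i) }
    for (i <- entries.indices; j <- i + 1 until entries.length) {
      val age = math.min(entries(i).ageHours, entries(j).ageHours)
      if (m(i)(j) >= threshold(age)) parent(find(i)) = find(j)
    }
    entries.indices
      .groupBy(i => find(i))
      .values
      .map(group => group.map(i => entries(i).id).toSet)
      .toSeq
  }
}
```

The number of proximity calculations grows quadratically with the number of entries, which is in line with the long running time mentioned above.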
You can find a calculated result here:
http://www.kungle.de/Trend/entry/220033
Update 1: For comparison, this merge was made by hand:
http://www.kungle.de/Trend/entry/225189
Tags: Kungle.de, News Aggregation, Programming, Scala