This Week on Kungle.de: Nobel Prizes, Riots in Pakistan and the “Balloon Boy”

Three topics, each covered by hundreds of similar news articles, crowded the front page of Kungle.de. Each article is interesting and informative on its own, but together they hide other noteworthy information.

I concluded that it was about time to build a new subsystem to reduce the amount of duplicate information. You can still find all articles via the new “related link”.

The new subsystem “IssueMerger” now merges news entries with similar content. The older the news entries are, the more likely they are to be consolidated into one issue.

For this, I defined a function to calculate the proximity of two entries. (The result is 1 if two news entries are identical and 0 if they are completely different.)
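To illustrate the idea, here is a minimal Python sketch of such a proximity function. It is not the Scala code running on Kungle.de: the word-overlap (Jaccard) measure and the names `tokenize` and `proximity` are assumptions chosen for illustration. It only shares the stated property that identical entries score 1 and entries with nothing in common score 0.

```python
import re


def tokenize(text):
    """Lowercase the text and return the set of words it contains."""
    return set(re.findall(r"\w+", text.lower()))


def proximity(a, b):
    """Jaccard similarity of the two word sets (an assumed measure).

    Returns 1.0 when both entries use exactly the same words and
    0.0 when they share no word at all.
    """
    words_a, words_b = tokenize(a), tokenize(b)
    if not words_a and not words_b:
        return 1.0  # two empty entries are treated as identical
    return len(words_a & words_b) / len(words_a | words_b)
```

A measure like this ignores word order and word frequency, so a real implementation would probably weight rare words more heavily than common ones.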

It is necessary to build a complete “News Topology” (a matrix with up to 1.5 million elements) which defines the proximities of all entry combinations.
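Such a matrix of pairwise proximities could be built roughly as follows. This is a Python sketch under assumptions, not the actual IssueMerger code: `build_topology` is a hypothetical name, and the word-overlap `proximity` function is only a stand-in for the real measure.

```python
from itertools import combinations


def proximity(a, b):
    """Stand-in measure (an assumption): share of words the entries have in common."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a and not words_b:
        return 1.0
    return len(words_a & words_b) / len(words_a | words_b)


def build_topology(entries):
    """Return a symmetric n x n matrix holding the proximity of every entry pair."""
    n = len(entries)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        matrix[i][i] = 1.0  # every entry is identical to itself
    # Each unordered pair is computed once and mirrored across the diagonal.
    for i, j in combinations(range(n), 2):
        score = proximity(entries[i], entries[j])
        matrix[i][j] = score
        matrix[j][i] = score
    return matrix
```

For n entries this needs n(n-1)/2 proximity calls, so the work grows quadratically with the number of entries.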

The calculation for all topics requires up to 40 hours. The algorithm itself was coded in 80 lines of Scala.

You can find a calculated result here:
http://www.kungle.de/Trend/entry/220033

Update 1: For comparison, this merge was made by hand:

http://www.kungle.de/Trend/entry/225189
