Survey on Article Extraction and Comment Monitoring Techniques in Online News Media
Submitted by
Ankur Kumar Agrawal
M.Tech (CS), II Year
13535009
Under the guidance of Dr. Dhaval Patel
Outline
Introduction
Why Article Extraction and Comments Monitoring?
Challenges in Article Extraction and Comments Monitoring
Article Extraction Techniques
    Learning Based Techniques
    Heuristic Techniques
    Visual Based Approach
Comments Monitoring
    News Article Popularity Prediction
    Extracting Discussion Structure
Conclusion
Introduction
What is an article on a news web page?
Online news sources publish their news in the form of articles.
An article describes a particular event that has happened.
The main content of a news web page is the article content.
Other content on the page, such as hyperlinks, images, and side banners, is considered noise content.
What are comments?
Comments are the reactions of citizens to an article published by the news media.
[Figure: example news web page with the Article Text and Comments regions labelled]
Why Article Extraction and Comments Monitoring?
Article extraction can be used in:
Information retrieval systems.
Search engines such as Google and Yahoo (indexing article content to give the best search results).
News aggregator systems such as Google News.
Comments monitoring can be used for:
News article popularity prediction.
    Advertisement agencies
    News agencies
Debate identification.
Sentiment analysis and opinion mining.
Challenges in Article Extraction
1. Noise content on the web page
[Figure: news web page with the article text (1) and the noise content (2) marked: menus, advertisements, side banners, hyperlinks]
2. Heterogeneous templates
[Figure: two example pages, each with its Article region marked]
Challenges in Comments Monitoring
Public comments are not always available for every news source. Some websites provide their comments data.
It is difficult to apply standard NLP techniques to comments, since comments may not be syntactically correct.
Article Extraction Techniques
What is a Heuristic Technique?
[Figure: news web page -> applying heuristics on the parsed document -> article text content (output)]
DOM Tree and Some Common Heuristics
A web page is processed using its DOM tree.
The DOM tree represents each tag as a node object in a tree.
Two important factors in heuristic techniques are text count and link count.
Text Count: the number of words in the text of a node.
Link Count: the number of links in the subtree rooted at a node.
DOM Tree Representation With Text Count and Link Count of Every Node
[Figure: DOM tree in which each node is labelled Node Name (Text Count, Link Count)]
A Simple Heuristic Technique
DOM Tree After Applying Basic Score Function
[Figure: DOM tree annotated with basic scores between 0 and 1 (values shown include 0.83 and 0.8); a higher-level node is selected as the article text node instead of the real article node]
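As a rough illustration of the two counts, the sketch below computes text count and link count over a toy DOM tree. The `Node` class, the helper names, and the basic score formula used here (the share of a node's words left after subtracting its link count) are assumptions made for this example; the slides do not state the exact formula.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    """Toy DOM node (hypothetical structure, not a real parser API)."""
    tag: str
    text: str = ""  # text placed directly inside this node
    children: List["Node"] = field(default_factory=list)


def counts(node: Node) -> Tuple[int, int]:
    """Return (text count, link count) for the subtree rooted at node."""
    text_count = len(node.text.split())
    link_count = 1 if node.tag == "a" else 0
    for child in node.children:
        child_text, child_links = counts(child)
        text_count += child_text
        link_count += child_links
    return text_count, link_count


def basic_score(node: Node) -> float:
    """Assumed basic score: (text count - link count) / text count."""
    text_count, link_count = counts(node)
    return (text_count - link_count) / text_count if text_count else 0.0


# Made-up example nodes: an article block and a navigation menu.
article = Node("div", children=[
    Node("p", "the council approved the new budget on monday evening"),
    Node("a", "more"),
])
menu = Node("ul", children=[
    Node("a", "Home"), Node("a", "Sports"), Node("a", "Weather"),
])
```

With these made-up nodes, the article block keeps a high score because most of its words are not links, while the menu, which is all links, scores zero.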
Modified Weighted Score Function
One extra factor is added to the basic scoring function.
The extra factor is the fraction of the page's total text that a node contains.
Optimal weights are then assigned to both factors.
This extra factor removes the drawback of using only the basic scoring function.
After Applying Weighted Score
[Figure: DOM tree in which the real article text node has the maximum score]
Result
The experiment was performed on 1620 news articles from 27 different news sources.
Using the basic score: precision is around 0.85, recall is 0.02 (very poor).
Using the modified weighted score function: precision is around 0.9562 (improved), recall is 0.9088 (great improvement).
Source: Jyotika Prasad et al., CoreEx: Content Extraction from Online News Articles
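The weighted score can be sketched as a weighted sum of two factors: the link-free ratio of a node and the node's share of the page text. The formula for the first factor, the function names, and the weights 0.9 and 0.1 below are placeholders chosen for illustration; they are not the tuned weights from the cited work.

```python
def basic_ratio(text_count: int, link_count: int) -> float:
    """Assumed first factor: (text count - link count) / text count."""
    return (text_count - link_count) / text_count if text_count else 0.0


def weighted_score(text_count: int, link_count: int, page_text_count: int,
                   w_ratio: float = 0.9, w_text: float = 0.1) -> float:
    """Weighted sum of the link-free ratio and the node's share of page text.

    w_ratio and w_text are illustrative placeholder weights.
    """
    if text_count == 0 or page_text_count == 0:
        return 0.0
    share = text_count / page_text_count
    return w_ratio * basic_ratio(text_count, link_count) + w_text * share


# Made-up counts as (text count, link count) on a 300-word page.
PAGE_WORDS = 300
caption = (5, 0)     # short, link-free node
article = (200, 10)  # long node containing a few links
```

In this made-up example the short caption outscores the article body under the ratio alone (1.0 against 0.95), while the weighted score ranks the article body first because it holds most of the page text.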
Article Extraction Techniques
What Is a Learning Based Approach?
This approach works in two steps.
Step 1: Learning is performed on a set of news web pages, and a model is built that identifies the location of article content and noise content.
Step 2: A new web page is given as input to the model, and the article text is obtained.
[Figure: training dataset -> model; target web page -> model -> article text (output)]
Learning Based Technique
The model learns common features of web pages to distinguish between noise and the main article text content.
Learning Based Approach for Article Extraction Using Style Tree
The technique focuses on removing noise content from news web pages.
Learning is done from web pages of a single news source.
The model builds a style tree after learning the common layout of all the web pages.
The model (style tree) is applied to a target web page of the same news source to classify noise nodes and content nodes.
Step 1 Learning (Style Tree Construction)
[Figure: style tree built from example pages d1 and d2]
Step 2 Node Importance Identification
Noise nodes and content nodes are identified based on the information gain (entropy) of each node.
The assumption is that a node whose presentation style is repeated unchanged across many pages is probably a noise node.
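One possible way to turn the entropy idea into a number is sketched below. It assumes a normalized entropy over the presentation styles that a node takes across the training pages, with the logarithm base equal to the number of pages. This formula, the function name, and the string encoding of a style (child tags joined by "-") are assumptions for illustration; the slides do not give the exact definition.

```python
import math
from collections import Counter
from typing import Sequence


def style_importance(styles_per_page: Sequence[str]) -> float:
    """Normalized entropy of the styles one node takes across pages.

    styles_per_page holds one style string per page that contains the node.
    Returns a value in [0, 1]: 0 when every page uses the same style,
    1 when every page uses a different style.
    """
    m = len(styles_per_page)
    if m == 0:
        raise ValueError("at least one page is required")
    if m == 1:
        # Arbitrary choice for this sketch: a single page gives no evidence
        # of repetition, so the node is treated as fully important.
        return 1.0
    frequencies = Counter(styles_per_page)
    return sum(-(c / m) * math.log(c / m, m) for c in frequencies.values())


# Made-up observations from four pages of one news source.
nav_bar = ["a-a-a-a"] * 4                          # same layout on every page
story = ["p-p", "p-img-p", "p-p-p", "p-table-p"]   # layout differs per page
mixed = ["p-p", "p-p", "p-img", "p-p-p"]
```

Under this assumed measure, the navigation bar gets importance 0 (noise-like), the story block gets importance 1, and the mixed case falls in between.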
If the actual content of a node is more diverse, it is probably a content node.
Entropy is Used to Calculate Node Importance
Noise and Content Node
[Figure: style tree with root, body, Table, Tr, P, A, IMG, and Text nodes, annotated with values such as 100, 25, 35, and 15]
Advantages and Disadvantages
Advantage
The algorithm is fast once the learning is over.
Disadvantages
The style tree can take a large amount of memory.
It requires some web pages of a single domain to learn from.
Article Extraction Techniques
Visual Based Approach
The technique learns visual features of a web page and identifies the boundary of the article text content.
A simple visual based technique uses the following two steps:
Step 1: Identifying different text segments using break node identification from CSS.
Step 2: A global optimization method, MSS (Maximum Scoring Subsequence), is used to identify the article text body.
Step 1 Line Break Nodes Identification
and tags are always break nodes.
For other element nodes, the CSS display property is checked.
If the CSS display property is block, the element has a line break before or after it.
Text segments are then formed using the nearest line break node of every text node.
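The segment formation step can be sketched on a toy DOM tree as follows. The `Node` class, the `display` field, and the `ALWAYS_BREAK` set are hypothetical: the set holds only `br` as a placeholder, since the tag names the slide treats as unconditional break nodes are not recoverable here. Consecutive text nodes that are not separated by a break node end up in the same segment.

```python
from dataclasses import dataclass, field
from typing import List

# Placeholder set of tags that always break a line (an assumption).
ALWAYS_BREAK = {"br"}


@dataclass
class Node:
    """Toy DOM node; tag "#text" marks a text node (hypothetical structure)."""
    tag: str
    text: str = ""
    display: str = "inline"  # stand-in for the computed CSS display value
    children: List["Node"] = field(default_factory=list)


def is_break(node: Node) -> bool:
    """A node breaks the line if it is in ALWAYS_BREAK or displays as block."""
    return node.tag in ALWAYS_BREAK or node.display == "block"


def text_segments(root: Node) -> List[str]:
    """Group consecutive text nodes that no break node separates."""
    segments: List[str] = []
    current: List[str] = []

    def flush() -> None:
        if current:
            segments.append(" ".join(current))
            current.clear()

    def walk(node: Node) -> None:
        if node.tag == "#text":
            if node.text.strip():
                current.append(node.text.strip())
            return
        breaks = is_break(node)
        if breaks:
            flush()  # line break before the element
        for child in node.children:
            walk(child)
        if breaks:
            flush()  # line break after the element

    walk(root)
    flush()
    return segments


# Made-up page fragment.
page = Node("body", display="block", children=[
    Node("div", display="block", children=[
        Node("#text", "Breaking"),
        Node("b", children=[Node("#text", "news")]),
        Node("br"),
        Node("#text", "today"),
    ]),
    Node("p", display="block", children=[Node("#text", "Second paragraph")]),
])
```

For this fragment the inline bold text stays in the same segment as the text before it, while the `br` element and the block elements start new segments.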
Text Segments Generation Using Nearest Line Break Node
[Figure: DOM tree with element nodes (Body, DIV, P, B, I, A, U, Em, Br), break nodes, and text nodes t1 to t8; consecutive text segments are grouped based on the nearest line break node]
Step 2 Text Area Identification Using MSS (Maximum Scoring Subsequence)
+1 ,Psize>c1,Pcolour>c2,Plink