Content Extraction Studies using Neural Network and Attribute Generation
Objectives: More information is available on the web today than at any point in history, and this wealth of information brings greater challenges. Dealing with the resulting information overload requires correspondingly capable tools.
Method of Analysis: Internet use, especially in India, is now spreading in both rural and urban areas. Bilingual and multilingual websites are increasing in number, and websites themselves are becoming multitasking. Our main problem is to deal with multilingual web documents and ancient documents, because content extraction becomes difficult for such documents. This paper proposes a neural network approach combined with attribute generation for content extraction from multilingual web documents.
Findings: The results obtained are well defined, and a thorough analysis of them is carried out.
Novelty/Improvement: The method is versatile in its use of pixel-maps, analytically stable because it takes matrix input, and is demonstrated to be adaptable to different models.
Keywords: Attribute, Content Extraction, Mining, Multi-Lingual, Neural Network, Pattern, Pixel.
This work is licensed under a Creative Commons Attribution 3.0 License.