Issue On Parsing Html With Jsoup
I am trying to parse this HTML using jsoup. My code is:
    doc = Jsoup.connect(htmlUrl).timeout(1000 * 1000).get();
    Elements items = doc.select("item");
    Log.d
Solution 1:
There are two problems in the RSS content you fetched.
- The link text is not within the <link/> tag but outside of it.
- There is some escaped HTML content within the description tag.
Please find the modified code below.
Also, when I viewed the URL in a browser, I found cleaner HTML content, which would make it easier to extract the desired fields once parsed. You can get that content by setting a browser user agent in Jsoup. But it's up to you to decide how to fetch the content.
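    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.parser.Parser;
    import org.jsoup.select.Elements;

    // timeout(0) disables the timeout, so the request waits indefinitely
    Document doc = Jsoup.connect("http://feeds.reuters.com/reuters/audio/newsmakerus/rss/mp3/").timeout(0).get();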
    System.out.println(doc.html());
    System.out.println("================================");
    Elements items = doc.select("item");
    for (Element item : items) {
        Element titleElement = item.select("title").first();
        String mTitle = titleElement.text();
        System.out.println("title is : " + mTitle);
        /*
         * The link in the RSS is as follows:
         *  <link />http://feeds.reuters.com/~r/reuters/audio/newsmakerus/rss/mp3/~3/NX3AY96GfGk/59621707.mp3
         * The URL doesn't fall inside the <link> element; it is a text node directly under <item>.
         */
        String mLink = item.ownText();
        System.out.println("link is : " + mLink);
        Element descElement = item.select("description").first();
        /*
         * Unescape the HTML content, parse it into a document, and then fetch only the text,
         * leaving behind all the HTML tags in the content.
         * "/" is a dummy base URI, as we don't care about resolving the links within the parsed content.
         */
        String mDesc = Parser.parse(Parser.unescapeEntities(descElement.text(), false), "/").text();
        System.out.println("description is : " + mDesc);
    }
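
The first problem above most likely comes from reading the feed with an HTML parser, where link is an empty element and cannot hold text. As a rough sketch of the difference, the snippet below reads a small feed with the JDK's built-in XML parser instead of jsoup. The class name RssXmlSketch, the helper itemField, and the sample feed are hypothetical and only stand in for the real Reuters content.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class RssXmlSketch {

    // Hypothetical sample feed, shaped like the items discussed above.
    static final String SAMPLE_RSS =
        "<rss><channel><item>"
        + "<title>Sample title</title>"
        + "<link>http://example.com/audio/1.mp3</link>"
        + "<description>&lt;p&gt;Sample text&lt;/p&gt;</description>"
        + "</item></channel></rss>";

    // Parses the feed as XML and returns the text of the given tag
    // inside the first <item>, or "" if the item or tag is missing.
    static String itemField(String rss, String tag) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(rss)));
            Element item = (Element) doc.getElementsByTagName("item").item(0);
            if (item == null) {
                return "";
            }
            NodeList nodes = item.getElementsByTagName(tag);
            return nodes.getLength() == 0 ? "" : nodes.item(0).getTextContent().trim();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse feed as XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("title is : " + itemField(SAMPLE_RSS, "title"));
        // With an XML parser the URL stays inside <link>.
        System.out.println("link is : " + itemField(SAMPLE_RSS, "link"));
        // The XML parser unescapes the entities, so this still contains HTML tags.
        System.out.println("description is : " + itemField(SAMPLE_RSS, "description"));
    }
}
```

The description comes back as markup (here `<p>Sample text</p>`), so it still needs the HTML parsing step from the code above to get plain text. jsoup also has an XML mode, Parser.xmlParser(), which can be passed to Connection.parser(...); it should keep the link text inside the link element in the same way, though that is not tested here.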