Issue On Parsing Html With Jsoup

October 26, 2023

I am trying to parse this HTML using jsoup. My code is: doc = Jsoup.connect(htmlUrl).timeout(1000 * 1000).get(); Elements items = doc.select('item'); Log.d

Solution 1:

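There are two problems in the RSS content you fetched:

The link text is not inside the <link> element but after it. jsoup's default HTML parser treats <link> as an empty (void) element, so the URL becomes a text node of the enclosing <item>.
The description element contains escaped HTML, which has to be unescaped and parsed again to get plain text.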
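Both problems can be reproduced offline with a small inline sample. The sketch below assumes a hypothetical one-item feed (the SAMPLE string and the class name RssHtmlParserDemo are made up for illustration and are not the Reuters content) and parses it with jsoup's default HTML parser.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class RssHtmlParserDemo {
    // Hypothetical one-item feed used only for illustration.
    static final String SAMPLE =
        "<rss><channel><item>"
        + "<title>Sample title</title>"
        + "<link>http://example.com/audio.mp3</link>"
        + "<description>&lt;p&gt;Sample &lt;b&gt;summary&lt;/b&gt;&lt;/p&gt;</description>"
        + "</item></channel></rss>";

    // Parses the sample with the default HTML parser and returns the first <item>.
    static Element firstItem() {
        Document doc = Jsoup.parse(SAMPLE);
        return doc.select("item").first();
    }

    public static void main(String[] args) {
        Element item = firstItem();
        // Problem 1: <link> is parsed as a void element, so it holds no text.
        System.out.println("link element text: [" + item.select("link").first().text() + "]");
        // The URL ends up as a text node directly under <item>.
        System.out.println("item own text:     [" + item.ownText() + "]");
        // Problem 2: the description text still contains literal markup.
        System.out.println("description text:  [" + item.select("description").first().text() + "]");
    }
}
```

With the default HTML parser, the link element should come back empty, the URL should appear in the item's own text, and the description text should still contain literal tags such as <p>.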

I also found that the URL returns clean HTML when viewed in a browser, and parsing that would make the desired fields easier to extract. You can request it from jsoup by setting a browser user agent with userAgent(...) on the connection, but how you fetch the content is up to you.

The modified code is below.

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.parser.Parser;
    import org.jsoup.select.Elements;

    // timeout(0) disables the timeout, so the request waits indefinitely.
    Document doc = Jsoup.connect("http://feeds.reuters.com/reuters/audio/newsmakerus/rss/mp3/").timeout(0).get();
    System.out.println(doc.html());
    System.out.println("================================");
    Elements items = doc.select("item");
    for (Element item : items) {

        Element titleElement = item.select("title").first();
        String mTitle = titleElement.text();
        System.out.println("title is : " + mTitle);

        /*
         * The link in the parsed RSS looks like this:
         *  <link />http://feeds.reuters.com/~r/reuters/audio/newsmakerus/rss/mp3/~3/NX3AY96GfGk/59621707.mp3
         * The URL is not inside the <link> element; it is a text node of <item>,
         * so read it from the item's own text.
         */
        String mLink = item.ownText();
        System.out.println("link is : " + mLink);

        Element descElement = item.select("description").first();
        /*
         * Unescape the HTML content, parse it into a document, and keep only the text,
         * dropping all HTML tags. "/" is a dummy base URI, since we don't care about
         * resolving links within the parsed content.
         */
        String mDesc = Parser.parse(Parser.unescapeEntities(descElement.text(), false), "/").text();
        System.out.println("description is : " + mDesc);

    }
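As an alternative to reading the link from the item's own text, jsoup's XML parser does not treat <link> as a void element, so the URL stays inside it. This is not part of the answer's code above. It is a minimal sketch that assumes a hypothetical inline sample (SAMPLE, RssXmlParserDemo and firstItemFields are illustrative names) rather than the live feed.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;

public class RssXmlParserDemo {
    // Hypothetical one-item feed used only for illustration.
    static final String SAMPLE =
        "<rss><channel><item>"
        + "<title>Sample title</title>"
        + "<link>http://example.com/audio.mp3</link>"
        + "<description>&lt;p&gt;Sample &lt;b&gt;summary&lt;/b&gt;&lt;/p&gt;</description>"
        + "</item></channel></rss>";

    // Returns { title, link, plain-text description } for the first <item>.
    static String[] firstItemFields(String rss) {
        // Parse as XML so that <link> keeps its text content.
        Document doc = Jsoup.parse(rss, "", Parser.xmlParser());
        Element item = doc.select("item").first();
        String title = item.select("title").first().text();
        String link = item.select("link").first().text();
        // text() yields the unescaped markup; parse it again as HTML to strip the tags.
        String descHtml = item.select("description").first().text();
        String desc = Jsoup.parse(descHtml).text();
        return new String[] { title, link, desc };
    }

    public static void main(String[] args) {
        String[] fields = firstItemFields(SAMPLE);
        System.out.println("title is : " + fields[0]);
        System.out.println("link is : " + fields[1]);
        System.out.println("description is : " + fields[2]);
    }
}
```

When fetching over the network, the same parser can be selected on the connection with parser(Parser.xmlParser()) before calling get(). Whether the live feed parses cleanly this way is an assumption that has not been checked here.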
