Issue On Parsing Html With Jsoup
I am trying to parse this HTML using jsoup. My code is: doc = Jsoup.connect(htmlUrl).timeout(1000 * 1000).get(); Elements items = doc.select('item'); Log.d
Solution 1:
There are two problems in the RSS content you fetched:
- The link text is not inside the <link/> tag but directly after it.
- The description tag contains escaped HTML content.
The modified code is below.
I also found that the URL serves clean HTML when viewed in a browser, which would make the desired fields easier to extract when parsed. You can get that content by setting the userAgent in Jsoup to a browser's value. It is up to you to decide how to fetch the content.
// timeout(0) disables the timeout (Jsoup treats zero as infinite)
Document doc = Jsoup.connect("http://feeds.reuters.com/reuters/audio/newsmakerus/rss/mp3/").timeout(0).get();
System.out.println(doc.html());
System.out.println("================================");
Elements items = doc.select("item");
for (Element item : items) {
    Element titleElement = item.select("title").first();
    String mTitle = titleElement.text();
    System.out.println("title is : " + mTitle);
    /*
     * The link in the RSS is as follows:
     * <link />http://feeds.reuters.com/~r/reuters/audio/newsmakerus/rss/mp3/~3/NX3AY96GfGk/59621707.mp3
     * The URL is not inside the <link> element; it is a text node directly under <item>.
     */
    String mLink = item.ownText(); // text that belongs to <item> itself, not to its child elements
    System.out.println("link is : " + mLink);
    Element descElement = item.select("description").first();
    /*
     * Unescape the HTML content, parse it into a document, and keep only the text,
     * leaving behind all the HTML tags in the content.
     * "/" is a dummy baseUri, as we don't care about resolving links within the parsed content.
     */
    String mDesc = Parser.parse(Parser.unescapeEntities(descElement.text(), false), "/").text();
    System.out.println("description is : " + mDesc);
}
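The link text ends up outside <link/> because Jsoup's default HTML parser treats link as a void element that cannot hold content. As an alternative to the ownText() workaround above, here is a minimal sketch that parses the feed with Jsoup's XML parser instead, which keeps the URL inside the element. The class name RssXmlSketch, the two helper methods, and the inline sample item are made up for illustration; the sample is not the actual Reuters feed, and a real feed may need extra handling.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;

public class RssXmlSketch {

    // Hypothetical sample standing in for one <item> of an RSS feed.
    static final String SAMPLE_RSS =
        "<rss><channel><item>"
        + "<title>Sample title</title>"
        + "<link>http://example.com/audio.mp3</link>"
        + "<description>&lt;p&gt;Hello &lt;b&gt;world&lt;/b&gt;&lt;/p&gt;</description>"
        + "</item></channel></rss>";

    // With the XML parser, <link> is an ordinary element, so its text stays inside it.
    static String linkOf(String rss) {
        Document doc = Jsoup.parse(rss, "", Parser.xmlParser());
        Element link = doc.selectFirst("item > link");
        return link == null ? "" : link.text();
    }

    // The description text is the unescaped markup; parsing it again as HTML
    // and calling text() drops the tags and keeps only the readable text.
    static String descriptionTextOf(String rss) {
        Document doc = Jsoup.parse(rss, "", Parser.xmlParser());
        Element desc = doc.selectFirst("item > description");
        if (desc == null) {
            return "";
        }
        return Jsoup.parse(desc.text()).text();
    }

    public static void main(String[] args) {
        System.out.println("link is : " + linkOf(SAMPLE_RSS));
        System.out.println("description is : " + descriptionTextOf(SAMPLE_RSS));
    }
}
```

To try this against the live feed, the fetched body would have to be handed to the XML parser as well, for example by calling parser(Parser.xmlParser()) on the connection before get().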
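The browser user-agent option mentioned in the answer could look roughly like the sketch below. The user-agent string, the 10-second timeout, and the names BrowserFetchSketch and browserConnection are assumptions chosen for illustration; any browser-like value should behave similarly, and whether the server really returns the cleaner HTML for it is something to check by printing the response.

```java
import java.io.IOException;

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class BrowserFetchSketch {

    // Assumed browser-like value; not taken from the original answer.
    static final String BROWSER_USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)";

    // Builds the connection without executing it, so the settings can be inspected.
    static Connection browserConnection(String url) {
        return Jsoup.connect(url)
                .userAgent(BROWSER_USER_AGENT) // sent as the User-Agent request header
                .timeout(10_000);              // milliseconds
    }

    public static void main(String[] args) throws IOException {
        Document doc = browserConnection(
                "http://feeds.reuters.com/reuters/audio/newsmakerus/rss/mp3/").get();
        // Inspect what the server returned for a browser-like client.
        System.out.println(doc.html());
    }
}
```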