Not Able To Parse Complete Html Of A Url Using Jsoup

July 27, 2023

The Jsoup library is not parsing the complete HTML of a given URL. Some divs are missing compared with the original HTML of the page. Interesting thing: http://facebook.com/search.php?init=s:email

Solution 1:

As far as I know, Jsoup limits the size of the retrieved body by default (1 MB in older releases, 2 MB in more recent ones), and anything beyond that limit is cut off. Try this to get the full HTML source:

Document document = Jsoup.connect(url)
  .userAgent("Mozilla/5.0 (Windows NT 6.2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.69 Safari/537.36")
  // 0 = no size limit on the response body
  .maxBodySize(0)
  .get();

Calling maxBodySize(0) removes the size limit, so the whole response is read (bounded only by available memory). The connection accepts other useful settings as well, such as timeout(...) and cookie(...).
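To see why a size cap makes trailing divs disappear, here is a minimal sketch of the idea using only the JDK. It is not Jsoup's actual implementation: the class BodyLimitSketch and the helper readCapped are hypothetical names made up for this illustration, and the only behavior borrowed from Jsoup is the convention that a limit of 0 means "no limit".

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class BodyLimitSketch {

    // Hypothetical helper (not part of Jsoup): reads at most maxBytes from the
    // stream. A maxBytes of 0 means "no limit", mirroring maxBodySize(0).
    static String readCapped(InputStream in, int maxBytes) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8];
        try {
            int read;
            while ((read = in.read(buffer)) != -1) {
                if (maxBytes > 0 && out.size() + read > maxBytes) {
                    // Keep only the bytes that still fit, then stop reading.
                    out.write(buffer, 0, maxBytes - out.size());
                    break;
                }
                out.write(buffer, 0, read);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] page = "<div>first</div><div>second</div>".getBytes(StandardCharsets.UTF_8);
        // With a 16-byte cap the second div never reaches the parser.
        System.out.println(readCapped(new ByteArrayInputStream(page), 16));
        // With a cap of 0 the whole body is returned.
        System.out.println(readCapped(new ByteArrayInputStream(page), 0));
    }
}
```

A parser only ever sees what was read, so any element that starts after the cap is simply absent from the resulting document.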

Solution 2:

You should also set a generous timeout (the value is in milliseconds), for example:

Document document = Jsoup.connect(url)
  .header("Accept-Encoding", "gzip, deflate")
  .userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0")
  // 0 = no size limit on the response body
  .maxBodySize(0)
  // 600000 ms = 10 minutes
  .timeout(600000)
  .get();
