
Parsing An Html Document With Python

I am totally new to Python and I am trying to parse an HTML document to remove the tags. I just want to keep the title and the body from a newspaper website I have previously do

Solution 1:

Based on the code you provided, it looks like you are trying to open an HTML file that you already have.

Instead of parsing the HTML file line by line as you are doing, just feed the parser the entire file.

from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Encountered a start tag:", tag)

    def handle_endtag(self, tag):
        print("Encountered an end tag :", tag)

    def handle_data(self, data):
        print("Encountered some data  :", data)

parser = MyHTMLParser()

# Read the whole file into one string, then feed that string to the parser.
with open(r'C:\Users\...site_1.html', "r") as f:
    page = f.read()
parser.feed(page)

Python's HTMLParser requires the argument to feed() to be a string. Alternatively, you could paste the entire HTML directly into the feed() call. It might not be best practice, but it should read and parse the HTML:

parser.feed("THE ENTIRE HTML AS STRING HERE")
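Since the goal in the question is to keep only the title and the body text, here is a minimal sketch of how the same HTMLParser approach could collect those two pieces instead of printing every event. The class name TitleBodyParser, the helper extract_title_and_body, and the sample HTML are all made up for illustration; skipping script and style content is an assumption about what "body" should mean for a newspaper page.

```python
from html.parser import HTMLParser


class TitleBodyParser(HTMLParser):
    """Hypothetical helper: collects <title> text and visible <body> text."""

    # Assumption: script/style content is not wanted in the output.
    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.title_parts = []
        self.body_parts = []
        self._in_title = False
        self._in_body = False
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "body":
            self._in_body = True
        elif tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "body":
            self._in_body = False
        elif tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Ignore whitespace-only chunks and anything inside script/style.
        if not text or self._skip_depth:
            return
        if self._in_title:
            self.title_parts.append(text)
        elif self._in_body:
            self.body_parts.append(text)


def extract_title_and_body(html_text):
    """Feed the whole document as one string and return (title, body)."""
    parser = TitleBodyParser()
    parser.feed(html_text)
    parser.close()
    return " ".join(parser.title_parts), " ".join(parser.body_parts)


# Made-up sample page, standing in for the saved newspaper HTML.
sample = (
    "<html><head><title>Daily News</title>"
    "<style>p {color: red}</style></head>"
    "<body><h1>Headline</h1><p>First <b>paragraph</b>.</p>"
    "<script>var x = 1;</script></body></html>"
)
title, body = extract_title_and_body(sample)
print(title)
print(body)
```

With a real file, the string returned by f.read() would be passed to extract_title_and_body in place of sample. Joining the text chunks with a single space is a simplification, so spacing around inline tags and punctuation may need tidying afterwards.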

I hope this helps

Edit: Have you tried reading the HTML into a string, as you already do, and then calling str.strip() on it to remove leading and trailing whitespace?

FYI, you can also use sentence.replace(" ", "") to remove every space from the string, including the spaces between words.
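To make the difference between the two calls concrete, here is a small sketch on a made-up string. The third variant, " ".join(raw.split()), is not part of the suggestion above; it is added as an assumption about what is usually wanted for article text, namely collapsing runs of whitespace while keeping words separated.

```python
# Made-up text standing in for data extracted from the page.
raw = "  \n  Breaking news:   markets rally  \n"

# strip() removes leading and trailing whitespace only.
trimmed = raw.strip()

# replace(" ", "") removes every space character, including those
# between words, but leaves newlines untouched.
no_spaces = raw.replace(" ", "")

# split() with no arguments splits on any run of whitespace, so joining
# with a single space collapses the runs and trims both ends.
collapsed = " ".join(raw.split())

print(repr(trimmed))
print(repr(no_spaces))
print(repr(collapsed))
```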

Hope this helps
