
Extracting P Within H1 With Python/scrapy

I am using Scrapy to extract some data about musical concerts from websites. At least one website I'm working with uses (incorrectly, according to W3C - Is it valid to have paragra

Solution 1:

That was quite baffling, and to be frank I still do not understand why it happens. It turns out that the <p> tag that should be contained within the <h1> tag is not inside it. Fetching the page with curl shows markup of the form <h1><p> </p></h1>, whereas the response obtained from the site shows it as:

<h1 class="performance-title">\n</h1><p>Bernard Haitink conducts Brahms and\xa0Dvo\u0159\xe1k featuring\npianist Emanuel Ax
</p>

As I mentioned, I have my suspicions about the cause but nothing concrete. In any case, the XPath for getting the text inside the <p> tag is therefore:

response.xpath('//h1[@class="performance-title"]/following-sibling::p/text()').extract()

This uses <h1 class="performance-title"> as a landmark and selects the <p> tag that follows it as a sibling.
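The XPath returns a list of raw text nodes, and judging by the response shown earlier, they still contain newlines and a non-breaking space (\xa0). Below is a minimal sketch of tidying them with the standard library. The helper name clean_text is hypothetical, and the sample list is copied from that response as an assumption about what .extract() returns, not taken from a live request.

```python
def clean_text(fragments):
    """Join text nodes and collapse every run of whitespace into one space."""
    # str.split() with no arguments splits on any Unicode whitespace,
    # which covers "\n" as well as the non-breaking space "\xa0".
    return " ".join(" ".join(fragments).split())


# Sample value copied from the response dump in this answer (assumed to be
# what .extract() yields for that page; no request is made here).
raw = ["Bernard Haitink conducts Brahms and\xa0Dvo\u0159\xe1k featuring\npianist Emanuel Ax\n"]

print(clean_text(raw))
```

In a spider callback this could be called as clean_text(response.xpath('//h1[@class="performance-title"]/following-sibling::p/text()').extract()). If the page has several <p> siblings after the heading, changing the step to following-sibling::p[1] should limit the match to the nearest one.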

Solution 2:

//*[@id="content"]/section/article/section[2]/h1/p/text()
