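Beautiful Soup: parsing a list of many entries and saving it in a DataFrame
Question
I am currently collecting data on dioceses from around the world.
My approach uses bs4 and pandas. I am still working on the scraping logic.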
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "http://www.catholic-hierarchy.org/"

# Send a GET request to the website
response = requests.get(url)

# My approach to parse the HTML content of the page
soup = BeautifulSoup(response.text, 'html.parser')

# Find the relevant elements containing diocese information
diocese_elements = soup.find_all("div", class_="diocesan")

# Initialize empty lists to store data
dioceses = []
addresses = []

# Extract data from each diocese element
for diocese_element in diocese_elements:
    # Example: Extracting diocese name
    diocese_name = diocese_element.find("a").text.strip()
    dioceses.append(diocese_name)

    # Example: Extracting address
    address = diocese_element.find("div", class_="address").text.strip()
    addresses.append(address)

# To save the whole data we create a DataFrame using pandas
data = {'Diocese': dioceses, 'Address': addresses}
df = pd.DataFrame(data)

# Display the DataFrame
print(df)
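At the moment I get some strange results when I run this in PyCharm. I am trying to find a way to collect all of the data with pandas.
Accepted answer
This example should get you started: it parses all of the diocese listing pages to get each diocese's name and URL, and stores them in a pandas DataFrame.
You can then iterate over these URLs to fetch whatever further information you need.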
import pandas as pd
import requests
from bs4 import BeautifulSoup

chars = "abcdefghijklmnopqrstuvwxyz"
url = "http://www.catholic-hierarchy.org/diocese/la{char}.html"

all_data = []
for char in chars:
    u = url.format(char=char)

    # Follow the "next page" links until the listing for this letter ends
    while True:
        print(f"Parsing {u}")
        soup = BeautifulSoup(requests.get(u).content, "html.parser")

        # Diocese links are <a> tags inside <li> whose href starts with "d"
        for a in soup.select("li a[href^=d]"):
            all_data.append(
                {
                    "Name": a.text,
                    "URL": "http://www.catholic-hierarchy.org/diocese/" + a["href"],
                }
            )

        next_page = soup.select_one('a:has(img[alt="[Next Page]"])')
        if not next_page:
            break

        u = "http://www.catholic-hierarchy.org/diocese/" + next_page["href"]

df = pd.DataFrame(all_data).drop_duplicates()
print(df.head(10))
Prints:
...
Parsing http://www.catholic-hierarchy.org/diocese/lax.html
Parsing http://www.catholic-hierarchy.org/diocese/lay.html
Parsing http://www.catholic-hierarchy.org/diocese/laz.html
               Name                                                   URL
0          Holy See  http://www.catholic-hierarchy.org/diocese/droma.html
1   Diocese of Rome  http://www.catholic-hierarchy.org/diocese/droma.html
2            Aachen  http://www.catholic-hierarchy.org/diocese/da549.html
3            Aachen  http://www.catholic-hierarchy.org/diocese/daach.html
4    Aarhus (Århus)  http://www.catholic-hierarchy.org/diocese/da566.html
5               Aba  http://www.catholic-hierarchy.org/diocese/dabaa.html
6        Abaetetuba  http://www.catholic-hierarchy.org/diocese/dabae.html
8         Abakaliki  http://www.catholic-hierarchy.org/diocese/dabak.html
9           Abancay  http://www.catholic-hierarchy.org/diocese/daban.html
10        Abaradira  http://www.catholic-hierarchy.org/diocese/d2a01.html
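
As a follow-up to the suggestion of iterating over the collected URLs, here is a minimal sketch of what that second pass might look like. It is not part of the original answer. The helper names (`fetch_html`, `page_title`, `add_page_titles`), the one-second delay, and the choice to extract only the `<title>` tag are assumptions made for illustration, because the layout of the individual diocese pages is not described here. The column name defaults to `URL`, as in the printed output above, and can be changed through `url_column`.

```python
import time

import pandas as pd
from bs4 import BeautifulSoup


def fetch_html(url):
    """Download one page (hypothetical helper, not from the original answer)."""
    import requests  # imported lazily so the parsing helpers also work offline

    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.content


def page_title(html):
    """Return the text of the <title> tag, or None if the page has none."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.title.get_text(strip=True) if soup.title else None


def add_page_titles(df, url_column="URL", fetch=fetch_html, delay=1.0):
    """Return a copy of df with a 'Title' column; each distinct URL is fetched once."""
    titles = {}
    for url in df[url_column].unique():
        titles[url] = page_title(fetch(url))
        time.sleep(delay)  # assumed pause between requests to go easy on the server
    result = df.copy()
    result["Title"] = result[url_column].map(titles)
    return result
```

Calling `add_page_titles(df.head(5))` on the DataFrame built above would be a cautious first try before running it over every row. Once you know which elements of a diocese page hold the details you want, `page_title` is the place to swap in the corresponding selector.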