
Beautiful Soup: parsing a list of many entries and saving it to a DataFrame

百变鹏仔, 1 day ago, #Python
Question

I am collecting data on dioceses from around the world.

My approach uses bs4 and pandas. I am currently working on the scraping logic.

import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "http://www.catholic-hierarchy.org/"

# Send a GET request to the website
response = requests.get(url)

# my approach to parse the HTML content of the page
soup = BeautifulSoup(response.text, 'html.parser')

# Find the relevant elements containing diocese information
diocese_elements = soup.find_all("div", class_="diocesan")

# Initialize empty lists to store data
dioceses = []
addresses = []

# Extract now data from each diocese element
for diocese_element in diocese_elements:
    # Example: Extracting diocese name
    diocese_name = diocese_element.find("a").text.strip()
    dioceses.append(diocese_name)

    # Example: Extracting address
    address = diocese_element.find("div", class_="address").text.strip()
    addresses.append(address)

# to save the whole data we create a DataFrame using pandas
data = {'Diocese': dioceses, 'Address': addresses}
df = pd.DataFrame(data)

# Display the DataFrame
print(df)

At the moment I am seeing some strange results in PyCharm. I am trying to find a way to collect all of the data with pandas.


Accepted answer


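This example can get you started: it parses all of the diocese listing pages to collect each diocese's name and URL, and stores them in a pandas DataFrame.

You can then iterate over these URLs and fetch whatever additional information you need.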

import pandas as pd
import requests
from bs4 import BeautifulSoup

chars = "abcdefghijklmnopqrstuvwxyz"
url = "http://www.catholic-hierarchy.org/diocese/la{char}.html"

all_data = []
for char in chars:
    # start at the first listing page for this letter
    u = url.format(char=char)
    while True:
        print(f"Parsing {u}")
        soup = BeautifulSoup(requests.get(u).content, "html.parser")

        # list items whose link target starts with "d" are diocese pages
        for a in soup.select("li a[href^=d]"):
            all_data.append(
                {
                    "Name": a.text,
                    "URL": "http://www.catholic-hierarchy.org/diocese/" + a["href"],
                }
            )

        # follow the "next page" link until there is none left
        next_page = soup.select_one('a:has(img[alt="[next page]"])')
        if not next_page:
            break
        u = "http://www.catholic-hierarchy.org/diocese/" + next_page["href"]

df = pd.DataFrame(all_data).drop_duplicates()
print(df.head(10))

Prints:

...
Parsing http://www.catholic-hierarchy.org/diocese/lax.html
Parsing http://www.catholic-hierarchy.org/diocese/lay.html
Parsing http://www.catholic-hierarchy.org/diocese/laz.html
               Name                                                   URL
0          Holy See  http://www.catholic-hierarchy.org/diocese/droma.html
1   Diocese of Rome  http://www.catholic-hierarchy.org/diocese/droma.html
2            Aachen  http://www.catholic-hierarchy.org/diocese/da549.html
3            Aachen  http://www.catholic-hierarchy.org/diocese/daach.html
4    Aarhus (Århus)  http://www.catholic-hierarchy.org/diocese/da566.html
5               Aba  http://www.catholic-hierarchy.org/diocese/dabaa.html
6        Abaetetuba  http://www.catholic-hierarchy.org/diocese/dabae.html
8         Abakaliki  http://www.catholic-hierarchy.org/diocese/dabak.html
9           Abancay  http://www.catholic-hierarchy.org/diocese/daban.html
10        Abaradira  http://www.catholic-hierarchy.org/diocese/d2a01.html
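
The follow-up step of iterating over the collected URLs could look roughly like the sketch below. It is only an illustration under a few assumptions: the helper names `parse_detail`, `fetch_html` and `fetch_details` are hypothetical, the input is any DataFrame with a `URL` column, and because the layout of the individual diocese pages is not shown here, the sketch extracts nothing more than each page's `<title>` text. Replace the body of `parse_detail` with selectors that match the fields you actually need. Since several names can point at the same URL (as with the first two rows above), each URL is requested only once.

```python
import time

import pandas as pd
import requests
from bs4 import BeautifulSoup


def parse_detail(html):
    """Extract fields from one page; this placeholder only reads <title>."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.get_text(strip=True) if soup.title else None
    return {"Title": title}


def fetch_html(url):
    """Download one page; raises requests.HTTPError on 4xx/5xx responses."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.content


def fetch_details(df, fetch=fetch_html, delay=1.0):
    """Return df with the parsed detail columns merged in by URL."""
    rows = []
    # Several names can share one URL, so request each URL only once.
    for url in df["URL"].drop_duplicates():
        rows.append({"URL": url, **parse_detail(fetch(url))})
        time.sleep(delay)  # pause between requests to go easy on the server
    details = pd.DataFrame(rows, columns=["URL", "Title"])
    # A left merge keeps every original row, including names sharing a URL.
    return df.merge(details, on="URL", how="left")
```

Trying it on a small slice first, for example `fetch_details(df.head(5))`, keeps the number of requests low while you work out the selectors. The `fetch` parameter exists so the parsing can be exercised against saved HTML without touching the network.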