Crawler: Scraping Elsevier Abstracts

As it happened, I needed to help a friend scrape some paper abstracts. The exploration process and the resulting solution are recorded below.

This was my first time working with Requests, so the code is rather naive; please bear with me.

1. Requirements

In the Elsevier database, given a query condition (e.g., search for "papers whose title, abstract, or keywords contain Target"), retrieve the matching papers and save their titles and abstracts.
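
For reference, such a condition is written in Scopus search syntax as a TITLE-ABS-KEY query; the concrete string below is the same example that later appears as the JSON filename in the complete code:

    # title, abstract, or keywords contain the given phrase
    query_str = "TITLE-ABS-KEY(inorganic compounds)"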

2. Solution

2.1 Preliminaries

  • A Python environment

  • Create a personal API key (link); see the sketch after this list for loading the key from a config file

  • Download the Elsapy source code (GitHub)

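A small aside that is not part of the original steps: instead of hard-coding the key into every script, it can be read from a local config file. A minimal sketch (the config.json name and layout are my own convention, not something Elsapy requires):

    import json
    from elsapy.elsclient import ElsClient

    # config.json (keep it out of version control), e.g. {"apikey": "YOUR_KEY"}
    with open('config.json') as f:
        config = json.load(f)

    client = ElsClient(config['apikey'])  # ElsClient takes the API key as its first argument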

2.2 Overview of the Abstract-Related APIs

Note: Elsapy already wraps the requests to these APIs (although it needs some hacking).

  • Scopus Search API

    By sending a request to this API, we obtain the URLs of the abstracts in the Scopus database that match the query.

    Elsapy wraps this query parameter for us later on.

    The documentation of the fields returned by this API is here: Scopus Search Views

    Unfortunately, a personal subscription is not privileged enough: the JSON returned for it does not include the abstract.

    The fields actually returned are limited to basic bibliographic metadata.

    We therefore need to use the prism:url field to fetch each abstract.

  • Abstract Retrieval API

    Using the prism:url above as the parameter, a call to the Abstract Retrieval API returns the abstract information of the target article.

    The returned fields are documented under Abstract Retrieval Views (link here); dc:description is the abstract we want, and dc:title is the title. A minimal raw-Requests sketch of both API calls follows this list.
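
To make the two calls concrete, here is a minimal sketch of the same two-step flow using Requests directly, without Elsapy. The endpoint, header name, and 'search-results' layout match what the Elsapy-based code below relies on; the query string is just an example, and the top-level 'abstracts-retrieval-response' key of the second response is my recollection of the Abstract Retrieval view, so double-check it against the linked documentation.

    import requests

    API_KEY = 'YOUR_KEY'
    HEADERS = {'X-ELS-APIKey': API_KEY, 'Accept': 'application/json'}

    # Step 1: Scopus Search API -- returns entries with prism:url (no abstract on a personal key)
    search_resp = requests.get(
        'https://api.elsevier.com/content/search/scopus',
        headers=HEADERS,
        params={'query': 'TITLE-ABS-KEY(inorganic compounds)', 'count': 5},
    )
    entries = search_resp.json()['search-results']['entry']

    # Step 2: Abstract Retrieval API -- follow each entry's prism:url to get the abstract
    for entry in entries:
        abs_resp = requests.get(entry['prism:url'], headers=HEADERS)
        core = abs_resp.json()['abstracts-retrieval-response']['coredata']
        print(core['dc:title'])
        print(core['dc:description'])  # the abstract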

2.3 Working with Elsapy

Elsapy uses Requests and wraps classes and methods for accessing the various Elsevier APIs. However, it is rather incomplete in areas such as exception handling. Fortunately, the code itself is simple and clearly structured, so I made a few small customizations.

  • Basic usage

    1. Construct an ElsClient object with your API key; this object is responsible for sending requests to Elsevier.

    2. Construct an object for the API you want to access, e.g. ElsSearch.

    3. Call the API object's methods to perform the request.

    Example:

    from elsapy.elsclient import ElsClient
    from elsapy.elssearch import ElsSearch
    import time

    query_str = "your_query"
    user_key = 'your_key'

    if __name__ == '__main__':
        client = ElsClient(user_key)
        doc_src = ElsSearch(query_str, 'scopus')
        print(">>> Start crawling.\t Time: " + time.asctime())
        doc_src.custom_execute(client, get_num=20000, save_json=query_str+".json")
  • Changes

    The original ElsSearch.execute has no exception handling, so a single failed request aborted the whole crawl. In addition, I wanted to be able to specify how many articles to fetch and the filename of the dumped JSON.

    # Note: this method goes into elsapy/elssearch.py; it needs
    # `from requests.exceptions import RequestException` at the top of that file.
    def custom_execute(self, els_client = None, get_num = None, get_all = False, save_json=None):
        """Executes the search. If get_all = False (default), this retrieves
        the default number of results specified for the API. If
        get_all = True, multiple API calls will be made to iteratively get
        all results for the search, up to a maximum of 5,000."""
        api_response = els_client.exec_request(self._uri)
        self._tot_num_res = int(api_response['search-results']['opensearch:totalResults'])
        self._results = api_response['search-results']['entry']
        if get_all is True or get_num is not None:
            if get_num is not None:
                quota = get_num if get_num < self._tot_num_res else self._tot_num_res
            else:
                quota = self._tot_num_res

            failed_flag = False
            while self.num_res < quota and not failed_flag:
                print("> Executing {cur} | {total}".format(cur = self.num_res, total = quota))
                # find the URL of the next page of results
                for e in api_response['search-results']['link']:
                    if e['@ref'] == 'next':
                        next_url = e['@href']
                for i in range(5):  # if the request fails, retry up to 5 times
                    try:
                        api_response = els_client.exec_request(next_url)
                        break
                    except RequestException as e:
                        print(e)
                        if i < 4:
                            print(">>> retry: {t} times".format(t = i + 1))
                        else:
                            print(">>> TASK FAILED. Save current results.")
                            failed_flag = True
                if not failed_flag:  # only append when the page was actually fetched
                    self._results += api_response['search-results']['entry']
        # always dump whatever has been collected so far
        name = save_json if save_json is not None else 'dump.json'
        with open(name, 'w') as f:
            f.write(json.dumps(self._results))
        self.results_df = recast_df(pd.DataFrame(self._results))

    In addition, during actual crawling I hit ConnectionResetError: [WinError 10054] (the remote host forcibly closed an existing connection). Although the original ElsClient already enforces Elsevier's limit of one request per second via an interval, I added an extra 0.5 s on top of that, which resolved the problem.

    # in ElsClient.exec_request:
    time.sleep( self.__min_req_interval - interval + 0.5 )  # sleep 0.5 more second
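
    If you prefer not to patch the library, the same effect can be had by throttling on the caller's side instead. The helper below is my own sketch, not part of Elsapy; the 1.5 s interval mirrors Elsevier's 1 request/s limit plus the extra 0.5 s above.

    import time

    class Throttle:
        """Enforce a minimum interval between consecutive calls."""
        def __init__(self, min_interval=1.5):   # 1 s API limit + 0.5 s slack
            self.min_interval = min_interval
            self._last = 0.0

        def wait(self):
            elapsed = time.time() - self._last
            if elapsed < self.min_interval:
                time.sleep(self.min_interval - elapsed)
            self._last = time.time()

    throttle = Throttle()
    # throttle.wait()  # call this before every els_client.exec_request(...)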

2.4 Complete Code

  • Scopus search

    from elsapy.elsclient import ElsClient
    from elsapy.elssearch import ElsSearch
    import time

    query_str = "YOUR_QUERY"
    user_key = 'YOUR_KEY'

    if __name__ == '__main__':
        client = ElsClient(user_key)
        doc_src = ElsSearch(query_str, 'scopus')
        print(">>> Start crawling.\t Time: " + time.asctime())
        doc_src.custom_execute(client, get_num=20000, save_json=query_str+".json")
  • Abstract retrieval

    import json
    import pandas as pd
    import time
    from elsapy.elsclient import ElsClient
    from elsapy.elsdoc import AbsDoc

    abs_field = 'dc:description'
    title_field = 'dc:title'
    publication_field = 'prism:publicationName'

    user_key = 'YOUR_KEY'

    def load_json2pd(file):
        """Load the JSON dumped by the Scopus search step into a DataFrame."""
        with open(file, 'r') as f:
            data = json.load(f)
        return pd.json_normalize(data)

    def crawl_abstracts(df, client):
        """Fetch title, abstract, and publication name for every prism:url in df."""
        failed_list = []
        abstracts = []
        titles = []
        publications = []

        print(">>> Start crawling.\t Time: " + time.asctime())

        for i, url in enumerate(df['prism:url']):
            if i % 20 == 0:
                print("> Parsing {cur} / {total}".format(cur = i + 1, total = len(df['prism:url'])))
            abs_doc = AbsDoc(url)
            rst = abs_doc.read(client)
            if rst:
                try:
                    abstracts.append(abs_doc.data['coredata'][abs_field])
                    titles.append(abs_doc.data['coredata'][title_field])
                    publications.append(abs_doc.data['coredata'][publication_field])
                except Exception as e:
                    print(e)
            else:
                print("> {cur} - th failed.".format(cur = i + 1))
                failed_list.append(url)
        if len(failed_list) > 0:
            print('> Retry failed url...')
            for url in failed_list:
                abs_doc = AbsDoc(url)
                if abs_doc.read(client):
                    try:
                        abstracts.append(abs_doc.data['coredata'][abs_field])
                        titles.append(abs_doc.data['coredata'][title_field])
                        publications.append(abs_doc.data['coredata'][publication_field])
                    except Exception:
                        pass
        print("Completed. Total crawled abstracts: ", len(abstracts))

        return titles, abstracts, publications


    if __name__ == '__main__':
        client = ElsClient(user_key)
        json_file = 'TITLE-ABS-KEY(inorganic compounds).json'
        csv_path = json_file[:-5] + '_2' + '.csv'

        df = load_json2pd(json_file)

        # crawl in chunks, one slice at a time
        cut_l, cut_r = 1000, 5000

        titles, abstracts, publications = crawl_abstracts(df[cut_l:cut_r], client)
        abs_df = pd.concat([pd.DataFrame(titles),
                            pd.DataFrame(abstracts),
                            pd.DataFrame(publications)],
                           axis=1)
        abs_df.to_csv(csv_path, header=False, index=False)
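
    Since the CSV is written without a header row, it can be convenient to name the columns explicitly when loading it back later. A small hedged follow-up (the column names are my own labels, matching the concat order title, abstract, publication above):

    import pandas as pd

    abs_df = pd.read_csv('TITLE-ABS-KEY(inorganic compounds)_2.csv',
                         header=None,
                         names=['title', 'abstract', 'publication'])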

3. Related Resources

Apart from the links already included in the text above:

4. Possible Improvements

  • Multithreaded crawling

    At the moment, Elsevier's limit of one request per second makes the crawler painfully slow. One option is to apply for several API keys and crawl with multiple threads, one key per thread; a rough sketch follows.
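
    A sketch of what that could look like, assuming the crawl_abstracts function and the DataFrame from section 2.4 are available; the extra API keys and the even split across threads are hypothetical:

    from concurrent.futures import ThreadPoolExecutor

    import numpy as np
    from elsapy.elsclient import ElsClient

    api_keys = ['KEY_1', 'KEY_2', 'KEY_3']            # one personal API key per thread (hypothetical)
    clients = [ElsClient(k) for k in api_keys]        # each client throttles itself independently

    def crawl_in_parallel(df):
        chunks = np.array_split(df, len(clients))     # split the URLs evenly across clients
        with ThreadPoolExecutor(max_workers=len(clients)) as pool:
            futures = [pool.submit(crawl_abstracts, chunk, client)
                       for chunk, client in zip(chunks, clients)]
            results = [f.result() for f in futures]   # each result is (titles, abstracts, publications)
        # flatten the per-thread results back into three lists
        titles = [t for r in results for t in r[0]]
        abstracts = [a for r in results for a in r[1]]
        publications = [p for r in results for p in r[2]]
        return titles, abstracts, publications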