By a stroke of luck, I needed to help a friend crawl some paper abstracts. Below is a record of how I explored the task and the solution I settled on.
This was my first encounter with Requests, so the code is somewhat naive; please bear with me.
1. Requirements
In the Elsevier database, given a search condition (e.g., "papers whose title, abstract, or keywords contain target"), retrieve the matching papers and save each paper's title and abstract.
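In Scopus's query syntax, this condition can be written directly; a minimal example (with target standing in for the actual search term):

```python
# "title, abstract, or keywords contain target", in Scopus query syntax
query_str = 'TITLE-ABS-KEY(target)'
```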
2. Solution
2.1 Preparation
Apply for an API key at the Elsevier Developer Portal (https://dev.elsevier.com/); every request below requires it.
2.2 Abstract-related APIs at a glance
Note: Elsapy wraps the requests to these APIs (although it needs a bit of hacking).
- Scopus Search API
By sending requests to this API, we obtain the URLs of the abstracts in the Scopus database that match the search condition. (Elsapy wraps the construction of this query parameter, as shown later.)
The documentation of the API's response is here: Scopus Search Views.
Unfortunately, my subscription is not privileged enough: the JSON returned to personal accounts does not contain the abstract itself. The fields actually returned are shown in the figure below.
Hence we need to use prism:url to extract the abstract in a second step.
- Abstract Retrieval API
Using the prism:url above as the parameter, a call to the Abstract Retrieval API returns the target article's abstract information; a minimal end-to-end sketch follows this list.
The returned Abstract Retrieval Views are documented here; dc:description is the abstract we are after, and dc:title is the title.
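To make the two-step flow concrete, here is a minimal sketch (YOUR_KEY and the query string are placeholders; field names follow the views linked above) that takes the prism:url of the first search hit and reads its title and abstract:

```python
from elsapy.elsclient import ElsClient
from elsapy.elsdoc import AbsDoc
from elsapy.elssearch import ElsSearch

client = ElsClient('YOUR_KEY')                     # placeholder API key
search = ElsSearch('TITLE-ABS-KEY(target)', 'scopus')
search.execute(client)                             # step 1: Scopus Search API

first_url = search.results[0]['prism:url']         # abstract URL of the first hit
doc = AbsDoc(first_url)
if doc.read(client):                               # step 2: Abstract Retrieval API
    print(doc.data['coredata']['dc:title'])        # title
    print(doc.data['coredata']['dc:description'])  # abstract
```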
2.3 Customizing Elsapy
Elsapy builds on Requests and provides classes and methods that wrap access to each of Elsevier's APIs. It is rather incomplete in areas such as exception handling, but its code is simple and clearly structured, so I customized it lightly.
Basic usage
Construct an ElsClient object from your API key; it is responsible for issuing requests to Elsevier. Then construct an object for the API you want to access, such as ElsSearch, and call that object's methods to perform the access.
Example:
```python
from elsapy.elsclient import ElsClient
from elsapy.elssearch import ElsSearch
import time

query_str = "your_query"
user_key = 'your_key'

if __name__ == '__main__':
    client = ElsClient(user_key)
    doc_src = ElsSearch(query_str, 'scopus')
    print(">>> Start crawling.\t Time: " + time.asctime())
    doc_src.custom_execute(client, get_num=20000, save_json=query_str + ".json")
```

Changes
The original ElsSearch.execute lacks exception handling, which let a single failed request abort the whole crawl. I also wanted to specify how many articles to crawl and the filename of the saved JSON.
```python
# Replaces ElsSearch.execute in elsapy/elssearch.py. json, pandas (pd) and
# recast_df are already available in that module; RequestException needs
# "from requests.exceptions import RequestException".
def custom_execute(self, els_client=None, get_num=None, get_all=False, save_json=None):
    """Executes the search. If get_all = False (default), this retrieves
    the default number of results specified for the API. If get_all = True,
    multiple API calls will be made to iteratively get all results for the
    search, up to a maximum of 5,000. get_num caps the number of retrieved
    results; save_json sets the name of the JSON dump file."""
    api_response = els_client.exec_request(self._uri)
    self._tot_num_res = int(api_response['search-results']['opensearch:totalResults'])
    self._results = api_response['search-results']['entry']
    if get_all is True or get_num is not None:
        # never request more results than the search actually has
        if get_num is not None:
            quota = get_num if get_num < self._tot_num_res else self._tot_num_res
        else:
            quota = self._tot_num_res
        failed_flag = False
        while self.num_res < quota and not failed_flag:
            print("> Executing {cur} | {total}".format(cur=self.num_res, total=quota))
            # locate the pagination link to the next page of results
            for e in api_response['search-results']['link']:
                if e['@ref'] == 'next':
                    next_url = e['@href']
            for i in range(5):  # on failure, retry up to 5 times
                try:
                    api_response = els_client.exec_request(next_url)
                    break
                except RequestException as err:
                    print(err)
                    if i < 4:
                        print(">>> retry: {t} times".format(t=i + 1))
                    else:
                        print(">>> TASK FAILED. Save current results.")
                        failed_flag = True
            if not failed_flag:  # skip the stale page if all retries failed
                self._results += api_response['search-results']['entry']
    # dump whatever was collected (even on failure) to JSON
    name = save_json if save_json is not None else 'dump.json'
    with open(name, 'w') as f:
        f.write(json.dumps(self._results))
    self.results_df = recast_df(pd.DataFrame(self._results))
```

In addition, during actual crawling I hit ConnectionResetError: [WinError 10054] An existing connection was forcibly closed by the remote host. Although the original ElsClient already spaces requests at the one-request-per-second interval Elsevier mandates, adding another 0.5 s on top of that interval made the error go away:
```python
time.sleep(self.__min_req_interval - interval + 0.5)  # sleep an extra 0.5 s
```
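For context, this line lives in the request-throttling logic of ElsClient.exec_request. Paraphrased (only __min_req_interval and interval appear in the patched line above; the timestamp attribute name and exact structure are assumptions and may differ between elsapy versions), the surrounding logic looks roughly like this:

```python
# Throttle inside ElsClient.exec_request (paraphrased sketch, not verbatim)
interval = time.time() - self.__ts_last_req  # __ts_last_req name is assumed
if interval < self.__min_req_interval:
    # wait out the mandated 1 s spacing, plus the 0.5 s safety margin
    time.sleep(self.__min_req_interval - interval + 0.5)
self.__ts_last_req = time.time()
```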
2.4 Complete code
Scopus search
```python
from elsapy.elsclient import ElsClient
from elsapy.elssearch import ElsSearch
import time

query_str = "YOUR_QUERY"
user_key = 'YOUR_KEY'

if __name__ == '__main__':
    client = ElsClient(user_key)
    doc_src = ElsSearch(query_str, 'scopus')
    print(">>> Start crawling.\t Time: " + time.asctime())
    doc_src.custom_execute(client, get_num=20000, save_json=query_str + ".json")
```

Abstract retrieval
```python
import json
import pandas as pd
import time
from elsapy.elsclient import ElsClient
from elsapy.elsdoc import AbsDoc

abs_field = 'dc:description'
title_field = 'dc:title'
publication_field = 'prism:publicationName'
user_key = 'YOUR_KEY'

def load_json2pd(file):
    """Load the JSON dump produced by custom_execute into a DataFrame."""
    with open(file, 'r') as f:
        data = json.load(f)
    return pd.json_normalize(data)

def crawl_abstracts(df, client):
    failed_list = []
    abstracts = []
    titles = []
    publications = []
    print(">>> Start crawling.\t Time: " + time.asctime())
    for i, url in enumerate(df['prism:url']):
        if i % 20 == 0:
            print("> Parsing {cur} / {total}".format(cur=i + 1, total=len(df['prism:url'])))
        abs_doc = AbsDoc(url)
        rst = abs_doc.read(client)
        if rst:
            try:
                # read all three fields before appending, so the three lists
                # stay aligned even when one field is missing
                abstract = abs_doc.data['coredata'][abs_field]
                title = abs_doc.data['coredata'][title_field]
                publication = abs_doc.data['coredata'][publication_field]
            except KeyError as e:
                print(e)
            else:
                abstracts.append(abstract)
                titles.append(title)
                publications.append(publication)
        else:
            print("> {cur}-th failed.".format(cur=i + 1))
            failed_list.append(url)
    if len(failed_list) > 0:
        print('> Retry failed urls...')
        for url in failed_list:
            abs_doc = AbsDoc(url)
            if abs_doc.read(client):
                try:
                    abstract = abs_doc.data['coredata'][abs_field]
                    title = abs_doc.data['coredata'][title_field]
                    publication = abs_doc.data['coredata'][publication_field]
                except KeyError:
                    pass
                else:
                    abstracts.append(abstract)
                    titles.append(title)
                    publications.append(publication)
    print("Completed. Total crawled abstracts: ", len(abstracts))
    return titles, abstracts, publications

if __name__ == '__main__':
    client = ElsClient(user_key)
    json_file = 'TITLE-ABS-KEY(inorganic compounds).json'
    csv_path = json_file[:-5] + '_2' + '.csv'
    df = load_json2pd(json_file)
    # crawl in slices, so a dropped connection only costs one slice
    cut_l, cut_r = 1000, 5000
    titles, abstracts, publications = crawl_abstracts(df[cut_l:cut_r], client)
    abs_df = pd.concat([pd.DataFrame(titles),
                        pd.DataFrame(abstracts),
                        pd.DataFrame(publications)],
                       axis=1)
    abs_df.to_csv(csv_path, header=False, index=False)
```
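Because the script crawls the result list in slices (cut_l to cut_r), each run yields one CSV. A small sketch for merging the per-slice files afterwards (the filenames here are hypothetical):

```python
import pandas as pd

# hypothetical per-slice outputs from repeated runs of the script above
chunk_files = ['TITLE-ABS-KEY(inorganic compounds)_1.csv',
               'TITLE-ABS-KEY(inorganic compounds)_2.csv']
merged = pd.concat((pd.read_csv(f, header=None) for f in chunk_files),
                   ignore_index=True)
merged.to_csv('abstracts_all.csv', header=False, index=False)
```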
3. Related resources
Besides the links already included above:
- Elsevier's query-language documentation describes the supported query keywords and expressions (essentially every condition of the advanced-search UI can be expressed); a few examples follow.
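Some illustrative query strings (the topics are made up; field codes and operators as described in that documentation, so double-check against it):

```python
q1 = 'TITLE-ABS-KEY(perovskite AND solar)'  # boolean operator inside one field
q2 = 'TITLE(graphene) AND PUBYEAR > 2015'   # field search combined with a year filter
q3 = 'AUTH(smith) AND SRCTITLE(nature)'     # author plus source title
```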
4. Possible improvements
Multithreaded crawling
At present, Elsevier's limit of one request per second makes the crawler glacially slow. One could apply for several API keys and crawl with multiple threads, as sketched below.
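A minimal sketch of that idea, assuming several valid keys (the key strings, the round-robin chunking, and the minimal error handling are all hypothetical; each client still honors its own per-second throttle):

```python
from concurrent.futures import ThreadPoolExecutor
from elsapy.elsclient import ElsClient
from elsapy.elsdoc import AbsDoc

api_keys = ['KEY_1', 'KEY_2', 'KEY_3']  # hypothetical: one key per thread

def crawl_chunk(client, urls):
    """Fetch abstracts for one chunk of prism:url values with one client."""
    out = []
    for url in urls:
        doc = AbsDoc(url)
        if doc.read(client):
            out.append(doc.data['coredata'].get('dc:description'))
    return out

def crawl_parallel(urls):
    clients = [ElsClient(key) for key in api_keys]
    # deal the URLs round-robin, one chunk per key
    chunks = [urls[i::len(clients)] for i in range(len(clients))]
    with ThreadPoolExecutor(max_workers=len(clients)) as pool:
        results = pool.map(crawl_chunk, clients, chunks)
    # flatten the per-thread result lists into one
    return [abstract for chunk in results for abstract in chunk]
```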