
Series Table of Contents

Using eutils to Automatically Download Sequence Files



Table of Contents

  • Series Table of Contents
  • Preface
  • 1. Retrieving Literature Information
  • 2. Downloading Literature PDF Files
  • References


Preface

Hi everyone ✨, this is bio 🦖. This time I'm sharing a script that automatically collects literature information and batch-downloads research papers (it can only batch-download papers that are available on Sci-Hub). For everyday reading you just download papers one at a time, and batch downloading isn't needed. But when you have to survey or summarize a whole field, batch downloading is worth considering~

The advisor calls for a literature review; the student quietly wipes away tears.
Three thousand papers crowd the screen; downloading one by one will never do.
Routine in ruins, weekends lost; day blurs into night, meals and sleep forgotten.
A heart ablaze for science, doused by a bucket of cold water.
(by a doggerel poet)


1. Retrieving Literature Information

Everyone's research field is different, so the way to obtain literature information differs too. Here we use PubMed¹ as an example. PubMed is a free search engine used mainly to retrieve life-science and biomedical citations and indexes from the MEDLINE database. I used eutils before when scraping coronavirus nucleic-acid data, and this post uses eutils to fetch literature information as well; for an introduction to eutils, see "Using eutils to Automatically Download Sequence Files". I won't repeat it here~

First, construct the search URL, where term is your search keyword, year is the publication date, and the API_KEY raises your allowed request rate from 3 to 10 requests per second. Here, term is replaced with Machine learning and year with 2022. Then use the requests library to fetch the response for that URL, and parse it with the BeautifulSoup library as HTML.

https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&api_key={API_KEY}&term={term}+{year}[pdat]

The code is as follows:

import pandas as pd
import requests
from bs4 import BeautifulSoup
import math
import re
import time

API_KEY = "Your API KEY"
term = "Machine Learning"
year = "2022"
url_start = f'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&api_key={API_KEY}&term={term}+{year}[pdat]'
info_page = BeautifulSoup(requests.get(url_start, timeout=(5, 5)).text, 'html.parser')

The scraped result is shown below. It consists mainly of a list of PMIDs, along with the total number of hits <count>31236</count>, the maximum number returned per request <retmax>20</retmax>, and the starting index of the results <retstart>0</retstart>. The next step is to fetch each article's details by its id.

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE eSearchResult PUBLIC "-//NLM//DTD esearch 20060628//EN" "https://eutils.ncbi.nlm.nih.gov/eutils/dtd/20060628/esearch.dtd"><esearchresult><count>31236</count><retmax>20</retmax><retstart>0</retstart><idlist>
<id>37878682</id>
<id>37873546</id>
<id>37873494</id>
... # omitting many results
<id>37786662</id>
<id>37780106</id>
<id>37776368</id>
</idlist><translationset><translation> <from>Machine Learning</from> <to>"machine learning"[MeSH Terms] OR ("machine"[All Fields] AND "learning"[All Fields]) OR "machine learning"[All Fields]</to> </translation></translationset><querytranslation>("machine learning"[MeSH Terms] OR ("machine"[All Fields] AND "learning"[All Fields]) OR "machine learning"[All Fields]) AND 2022/01/01:2022/12/31[Date - Publication]</querytranslation></esearchresult>

The PubMed website itself also reports 31,236 records, so the scraped information faithfully matches the real thing and can be used with confidence~
[Screenshot: the same search on the PubMed website, also showing 31,236 results]
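
To make the structure concrete, here is a minimal sketch (reusing the info_page object from the snippet above; the variable names added here are purely illustrative) that pulls these fields and the PMID list out of the parsed response:

total_count = int(info_page.find('count').text)      # total hits, e.g. 31236
page_size = int(info_page.find('retmax').text)       # ids returned per request, default 20
start_index = int(info_page.find('retstart').text)   # offset of this page of results
pmid_list = [tag.get_text() for tag in info_page.find_all('id')]
print(total_count, page_size, start_index, len(pmid_list))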

Fetching an article's details follows the same steps: first construct the URL, then scrape the corresponding response. Straight to the code:

API_KEY = "Your AIP KEY"
id_str = '37878682'
url_paper = f'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&api_key={API_KEY}&id={id_str}&rettype=medline&retmode=text'
paper_info = BeautifulSoup(requests.get(url_paper, timeout=(5, 5)).text, 'html.parser')

The result is shown below. It includes the PMID, DOI, abstract, authors, affiliations, publication year, and more; from it you can extract whatever you need, such as the DOI. The next step, scraping the PDF file for each paper, is the crucial one.

PMID- 37878682
OWN - NLM
STAT- Publisher
LR  - 20231025
IS  - 2047-217X (Electronic)
IS  - 2047-217X (Linking)
VI  - 12
DP  - 2022 Dec 28
TI  - Computational prediction of human deep intronic variation.
LID - giad085 [pii]
LID - 10.1093/gigascience/giad085 [doi]
AB  - BACKGROUND: The adoption of whole-genome sequencing in genetic screens has facilitated the detection of genetic variation in the intronic regions of genes, far from annotated splice sites. However, selecting an appropriate computational tool to discriminate functionally relevant genetic variants from those with no effect is challenging, particularly for deep intronic regions where independent benchmarks are scarce. RESULTS: In this study, we have provided an overview of the computational methods available and the extent to which they can be used to analyze deep intronic variation. We leveraged diverse datasets to extensively evaluate tool performance across different intronic regions, distinguishing between variants that are expected to disrupt splicing through different molecular mechanisms. Notably, we compared the performance of SpliceAI, a widely used sequence-based deep learning model, with that of more recent methods that extend its original implementation. We observed considerable differences in tool performance depending on the region considered, with variants generating cryptic splice sites being better predicted than those that potentially affect splicing regulatory elements. Finally, we devised a novel quantitative assessment of tool interpretability and found that tools providing mechanistic explanations of their predictions are often correct with respect to the ground - information, but the use of these tools results in decreased predictive power when compared to black box methods. CONCLUSIONS: Our findings translate into practical recommendations for tool usage and provide a reference framework for applying prediction tools in deep intronic regions, enabling more informed decision-making by practitioners.
CI  - (c) The Author(s) 2023. Published by Oxford University Press GigaScience.
FAU - Barbosa, Pedro
AU  - Barbosa P
AUID- ORCID: 0000-0002-3892-7640
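
Before building the full parser in the next step, here is a minimal sketch of pulling just the DOI out of this plain-text record (paper_info is the BeautifulSoup object from the efetch snippet above; MEDLINE marks the DOI with a trailing [doi] tag):

medline_text = paper_info.text
doi = ""
if '[doi]' in medline_text:
    # take the token just before the first "[doi]" tag,
    # e.g. 10.1093/gigascience/giad085
    doi = medline_text.split('[doi]')[0].split(' - ')[-1].strip()
print(doi)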

Before attempting to download the papers, let's build two functions for batch scraping: get_literature_id and get_detailed_info, which fetch the PMIDs and the detailed records, respectively.

def get_literature_id(term, year):
    API_KEY = "Your API KEY"
    # pdat means publication date; 2020[pdat] means literature published from 2020/01/01 to 2020/12/31
    url_start = f'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&api_key={API_KEY}&term={term}+{year}[pdat]'
    time.sleep(0.5)
    info = BeautifulSoup(requests.get(url_start, timeout=(5, 5)).text, 'html.parser')
    time.sleep(0.5)
    # convert str to int
    year_published_count = int(info.find('count').text)
    id_list = [_.get_text() for _ in info.find_all('id')]
    # the first request returned ids 0-19; page through the rest, 20 at a time
    for page in range(1, math.ceil(year_published_count / 20)):
        url_page = f'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&api_key={API_KEY}&term={term}+{year}[pdat]&retmax=20&retstart={page*20}'
        time.sleep(0.5)
        info_page = BeautifulSoup(requests.get(url_page, timeout=(5, 5)).text, 'html.parser')
        id_list += [_.get_text() for _ in info_page.find_all('id')]
    return id_list, year_published_count


def get_detailed_info(id_list):
    API_KEY = "Your API KEY"
    # fields extracted: PMID DOI PMCID Title Abstract Author_1st Affiliation_1st Journal Publication_time
    extracted_info = []
    # efetch accepts comma-separated ids; query 20 records per batch
    for batch in range(0, math.ceil(len(id_list) / 20)):
        id_str = ",".join(id_list[batch*20: (batch+1)*20])
        detailed_url = f'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&api_key={API_KEY}&id={id_str}&rettype=medline&retmode=text'
        time.sleep(0.5)
        detailed_info = BeautifulSoup(requests.get(detailed_url, timeout=(5, 5)).text, 'html.parser')
        # each MEDLINE record starts with a PMID line
        literature_as_line_list = detailed_info.text.split('\nPMID')[1:]
        for literature in literature_as_line_list:
            # PMID
            pmid = literature.split('- ')[1].split('\n')[0]
            # DOI
            if '[doi]' in literature:
                doi = literature.split('[doi]')[0].split(' - ')[-1].strip()
            else:
                doi = ""
            # PMCID
            if "PMC" in literature:
                pmcid = literature.split('PMC -')[1].split('\n')[0].strip()
            else:
                pmcid = ""
            # Title (strip the continuation-line indentation)
            title = re.split(r'\n[A-Z]{2,3}\s', literature.split('TI  - ')[1])[0].replace("\n      ", "")
            # Abstract
            abstract = literature.split('AB  - ')[1].split(' - ')[0].replace("\n      ", "").split('\n')[0]
            # first author
            author = literature.split('FAU - ')[1].split('\n')[0]
            # first affiliation
            tmp_affiliation = literature.split('FAU - ')[1]
            if "AD  - " in tmp_affiliation:
                affiliation = tmp_affiliation.split('AD  - ')[1].replace("\n      ", "").strip('\n')
            else:
                affiliation = ""
            # Journal (title abbreviation)
            journal = literature.split('TA  - ')[1].split('\n')[0]
            # Publication time
            publication_time = literature.split('SO  - ')[1].split(';')[0].split('. ')[1]
            if ':' in publication_time:
                publication_time = publication_time.split(':')[0]
            extracted_info.append([pmid, doi, pmcid, title, abstract, author, affiliation, journal, publication_time])
    return extracted_info
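
With the two helpers in place, a complete run could look like the sketch below (the column names mirror the field comment inside get_detailed_info, and the CSV path is the one read back in the next section):

id_list, total = get_literature_id("Machine Learning", "2022")
print(f"collected {len(id_list)} PMIDs out of {total} records")

records = get_detailed_info(id_list)
columns = ["PMID", "DOI", "PMCID", "Title", "Abstract",
           "Author_1st", "Affiliation_1st", "Journal", "Publication_time"]
pd.DataFrame(records, columns=columns).to_csv('/mnt/c/Users/search_result.csv', index=False)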

Part of the scraped result is shown below. Some articles have no DOI; if you don't believe it, try searching PubMed for the PMID 33604555 yourself~
[Screenshot: part of the extracted table; note that some rows have an empty DOI column]

2. Downloading Literature PDF Files

The PDF files here are scraped from Sci-Hub, not from the journals' official sites. Some articles may not be indexed by Sci-Hub, or only as preprints, so there is no guarantee that every paper found above can actually be downloaded from Sci-Hub. And if you cannot reach Sci-Hub at all, you naturally cannot scrape anything from it; consider getting a VPN~

Scraping articles from Sci-Hub requires constructing a request header, otherwise the server returns 403 Forbidden. Then simply save the fetched content in PDF format. The Sci-Hub PDF scraping here follows 用Python批量下载文献 (Batch-downloading papers with Python)².

data = pd.read_csv('/mnt/c/Users/search_result.csv')
doi_data = data[~data['DOI'].isna()]
doi_list = doi_data["DOI"].tolist()
pmid_list = doi_data["PMID"].tolist()
for doi, pmid in zip(doi_list, pmid_list):
    download_url = f'https://sci.bban.top/pdf/{doi}.pdf?#view=FitH'
    headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36"}
    literature = requests.get(download_url, headers=headers)
    if literature.status_code != 200:
        print(f"this paper may not be downloadable, its doi: {doi}")
    else:
        with open(f'/mnt/c/Users/ouyangkang/Desktop/scraper_literature/{pmid}.pdf', 'wb') as f:
            f.write(literature.content)
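
One caveat: a mirror can answer with HTTP 200 and still return an HTML error page rather than a PDF, in which case the status-code check above would save a broken file. As a small hardening step (an assumption about mirror behavior, not part of the original script), the response can be checked for the PDF magic bytes before writing:

import requests

def looks_like_pdf(response: requests.Response) -> bool:
    # real PDF bodies start with the magic bytes b"%PDF";
    # an HTML error page served with status 200 fails this test
    return response.status_code == 200 and response.content[:4] == b'%PDF'

Replacing the bare status-code test in the loop above with looks_like_pdf(literature) would skip such false positives.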

The download results are shown below: 199 papers were downloaded successfully~ (The keyword used here was not machine learning; 522 DOIs were supplied, for a success rate of 38%.)
[Screenshot: the folder of downloaded PDF files, named by PMID]


References


1. PubMed official website ↩︎

2. 用Python批量下载文献 (Batch-downloading papers with Python) ↩︎

