Preface:

    The basic Monte Carlo learning loop:

     Policy Evaluation:          generate state-action trajectories and estimate the value function.

     Policy Improvement:       use the value-function estimate to improve the policy.

       On-policy: the policy \pi^{'} that generates the sampled trajectories is the same as the policy \pi being improved.

       Policy Evaluation:    generate (state-action-reward) trajectories with the \epsilon-greedy policy \pi^{'}.

       Policy Improvement:  the policy being improved is the same \epsilon-greedy policy \pi^{'}; it is updated using the value-function estimate.

      Off-policy: the policy \pi^{'} that generates the sampled trajectories differs from the policy \pi being improved.

      Policy Evaluation:   generate the sampled (state-action-reward) trajectories with the \epsilon-greedy policy \pi^{'}.

      Policy Improvement:  improve the original policy \pi.

    Two advantages:

    1: the original policy may be hard to sample from;

    2: it can reduce variance.

The most common off-policy technique is importance sampling (IS).

Importance sampling is a Monte Carlo method for evaluating properties of a particular distribution, while only having samples generated from a different distribution than the distribution of interest. Its introduction in statistics is generally attributed to a paper by Teun Kloek and Herman K. van Dijk in 1978,[1] but its precursors can be found in statistical physics as early as 1949.[2][3] Importance sampling is also related to umbrella sampling in computational physics. Depending on the application, the term may refer to the process of sampling from this alternative distribution, the process of inference, or both.


一  Importance Sampling

    1.1 Principle

     Original problem:

      u_f=\int p(x)f(x)dx

     If we draw N samples x_1,x_2,...,x_N from p(x):

       u_f \approx \frac{1}{N}\sum_{x_i \sim p(x)}f(x_i)

    Problem: p(x) is hard to sample from (the sample space is large, and often only a small part of it can be reached).

   Introduce an importance distribution q(x) (also a proper distribution, but easy to sample from).

  w(x)=\frac{p(x)}{q(x)} is called the importance weight, and

            u_f =\int q(x)\frac{p(x)}{q(x)}f(x)dx

             \approx \frac{1}{N}\sum_i w(x_i)f(x_i)   (law of large numbers, with x_i \sim q(x))
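A minimal sketch of this estimator, assuming an illustrative target p = N(0, 1), proposal q = N(0, 2) and f(x) = x^2 (so E_p[f] = 1); these choices are mine for illustration, not from the post:

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
N = 100_000

x = rng.normal(0.0, 2.0, N)                  # sample x_i ~ q(x) = N(0, 2)
w = norm.pdf(x, 0, 1) / norm.pdf(x, 0, 2)    # importance weights w(x_i) = p(x_i) / q(x_i)
f = x ** 2

print("IS estimate:", np.mean(w * f))        # close to 1.0 = E_p[f]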

 In the example below we normalize the weights w(x_i), which makes the relative contribution of each sample easier to see.

   The code below does this normalization in log space, as follows:

     w(x_i)=\log p(x_i)-\log q(x_i)

     w^1(x_i)=\frac{e^{w(x_i)}}{\sum_j e^{w(x_j)}}

     w^2(x_i)=w(x_i)-\log\sum_j e^{w(x_j)}   (i.e. w^2 = \log w^1, computed with logsumexp)

      

# -*- coding: utf-8 -*-
"""
Created on Wed Nov  8 16:38:34 2023

@author: chengxf2
"""
import numpy as np
import matplotlib.pyplot as plt
from scipy.special import logsumexp


class pdf:
    def __call__(self, x):
        pass

    def sample(self, n):
        pass


class Norm(pdf):
    # normal distribution (log-density up to an additive constant)
    def __init__(self, mu=0, sigma=1):
        self.mu = mu
        self.sigma = sigma

    def __call__(self, x):
        # log p(x), dropping the constant term
        logp = (x - self.mu) ** 2 / (2 * self.sigma ** 2)
        return -logp

    def sample(self, N):
        # draw N points from the normal distribution
        return np.random.normal(self.mu, self.sigma, N)


class Uniform(pdf):
    # uniform distribution
    def __init__(self, low, high):
        self.low = low
        self.high = high

    def __call__(self, x):
        # log q(x) = -log(high - low), the same constant for every sample
        N = len(x)
        return np.repeat(-np.log(self.high - self.low), N)

    def sample(self, N):
        # draw N points from the uniform distribution
        return np.random.uniform(self.low, self.high, N)


class ImportanceSampler:
    def __init__(self, p_dist, q_dist):
        self.p_dist = p_dist
        self.q_dist = q_dist

    def sample(self, N):
        # sample from q(x), then compute normalized log-weights
        samples = self.q_dist.sample(N)
        weights = self.calc_weights(samples)
        normal_weights = weights - logsumexp(weights)
        return samples, normal_weights

    def calc_weights(self, samples):
        # log(p/q) = log(p) - log(q)
        return self.p_dist(samples) - self.q_dist(samples)


if __name__ == "__main__":
    N = 10000
    p = Norm()
    q = Uniform(-10, 10)
    sampler = ImportanceSampler(p, q)
    # samples are drawn from q(x); weight_sample are their normalized log-weights
    samples, weight_sample = sampler.sample(N)
    # resample N points from samples with probabilities exp(weight_sample)
    samples = np.random.choice(samples, N, p=np.exp(weight_sample))
    plt.hist(samples, bins=100)
    plt.show()


二  Off-Policy Principle

     target policy \pi: the original policy (the one we want to improve)

        x:     a trajectory generated under the original policy

                  \begin{bmatrix} s_0,a_0,r_1,....s_{T-1},a_{T-1},r_T,s_T \end{bmatrix}

       p(x):   the probability of this trajectory

       f(x):    the cumulative reward of this trajectory

      Expected cumulative reward:

                    u_f=\int_{x} f(x)p(x)dx \approx \frac{1}{N}\sum f(x_i)

    behavior policy \pi^{'}: the policy that actually generates the samples

     q(x): the probability of each trajectory under the behavior policy

    The expected cumulative reward f under p can then be written equivalently as:

     u_f=\int_{x}q(x)\frac{p(x)}{q(x)}f(x)dx

     E[f] \approx \frac{1}{m}\sum_{i=1}^{m}\frac{p(x_i)}{q(x_i)}f(x_i)

   

     Let P^{\pi} and P^{\pi^{'}} denote the probability that each policy generates a given trajectory. For a trajectory

    \begin{bmatrix} s_0,a_0,r_1,....s_{T-1},a_{T-1},r_T,s_T \end{bmatrix}

    the probability that the original policy \pi produces it is

     P^{\pi}=\prod_{i=0}^{T-1} \pi(s_i,a_i)P_{s_i\rightarrow s_{i+1}}^{a_i}

    and under the behavior policy

    P^{\pi^{'}}=\prod_{i=0}^{T-1} \pi^{'}(s_i,a_i)P_{s_i\rightarrow s_{i+1}}^{a_i}

   The transition probabilities P_{s_i\rightarrow s_{i+1}}^{a_i} cancel in the ratio, so

    w(x)=\frac{P^{\pi}}{P^{\pi^{'}}}=\prod_{i=0}^{T-1}\frac{\pi(s_i,a_i)}{\pi^{'}(s_i,a_i)}
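A minimal sketch of this per-trajectory weight; the per-step action probabilities below are made-up numbers, only meant to show the product of ratios:

import numpy as np

# pi(a_i|s_i) and pi'(a_i|s_i) along one sampled trajectory (illustrative values)
pi_target   = np.array([1.0, 1.0, 1.0, 1.0])   # target policy probabilities
pi_behavior = np.array([0.9, 0.9, 0.9, 0.9])   # behavior policy probabilities

# transition probabilities cancel, so only the policy ratios remain
w = np.prod(pi_target / pi_behavior)
print(w)   # ~1.524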

  If \pi is a deterministic policy and \pi^{'} is the \epsilon-greedy version of \pi:

 original policy:   p_i=\left\{\begin{matrix} \pi(s_i,a_i)=1, & if \; a_i=\pi(s_i) \\ \pi(s_i,a_i)=0, & if \; a_i \neq \pi(s_i) \end{matrix}\right.

 behavior policy: q_i=\left\{\begin{matrix} \pi^{'}(s_i,a_i)=1-\epsilon+\frac{\epsilon }{|A|} , & if \; a_i=\pi(s_i) \\ \pi^{'}(s_i,a_i)=\frac{\epsilon }{|A|}, & if \; a_i \neq \pi(s_i) \end{matrix}\right.

  We now want to weight a trajectory produced by the behavior policy.

 In theory the weight should be the product of the ratios p_i/q_i, but p_i=0 whenever a_i \neq \pi(s_i), which zeros out the whole product.

 Since only the relative size of the two probabilities matters here, we can make the following substitution:

 w(x)=\frac{P^{\pi}}{P^{\pi^{'}}}=\prod_i\frac{e^{p_i}}{e^{q_i}}=\prod_i e^{p_i-q_i}

where w_i=\frac{e^{p_i}}{e^{q_i}}=e^{p_i-q_i} (a looser, more flexible use of importance sampling).

The key step is computing the ratio of two probabilities; the example above does this by taking logs and then normalizing. See the short comparison sketch below.
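A small sketch contrasting the exact per-step ratio p_i/q_i with the e^{p_i-q_i} surrogate used here, assuming \epsilon = 0.2 and |A| = 2 as in the code of section four; the printed numbers are only for illustration:

import numpy as np

epsilon, A = 0.2, 2
q_match    = 1 - epsilon + epsilon / A   # pi'(s,a) when a == pi(s)  -> 0.9
q_mismatch = epsilon / A                 # pi'(s,a) when a != pi(s)  -> 0.1

# exact per-step ratios: 1/0.9 when the action matches the target policy, 0 otherwise
print(1 / q_match, 0 / q_mismatch)       # 1.111..., 0.0

# surrogate e^(p_i - q_i): stays non-zero even when the actions do not match
print(np.exp(1 - q_match), np.exp(0 - q_mismatch))   # ~1.105, ~0.905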


三  Effect on Variance
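Importance weights p/q blow up when q places little probability mass where p is large, which inflates the variance of the estimator. A minimal sketch, reusing the toy setup from section one (target N(0, 1), f(x) = x^2) and comparing two illustrative proposals of my own choosing:

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
N = 100_000
x_good = rng.normal(0, 2.0, N)   # wide proposal that covers the target well
x_bad  = rng.normal(3, 0.5, N)   # narrow, shifted proposal

def is_terms(x, loc, scale):
    # per-sample terms w(x_i) * f(x_i) with f(x) = x^2
    w = norm.pdf(x, 0, 1) / norm.pdf(x, loc, scale)
    return w * x ** 2

print(np.var(is_terms(x_good, 0, 2.0)))   # moderate variance
print(np.var(is_terms(x_bad, 3, 0.5)))    # far larger variance, dominated by a few huge weights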


四  Code

In the code, R is computed differently from the formulas above:

R=\frac{1}{T-t}(\sum_{i=t}^{T-1}r_i)(\prod_{j=t}^{T-1}w_j)

w_j=e^{p_j-q_j}

# -*- coding: utf-8 -*-
"""
Created on Thu Nov  2 19:38:39 2023

@author: cxf
"""
import numpy as np
import random
from enum import Enum


class State(Enum):
    # state space
    shortWater = 1  # short of water
    health = 2      # healthy
    overflow = 3    # overwatered
    apoptosis = 4   # dead


class Action(Enum):
    # action space A
    water = 1    # water the plant
    noWater = 2  # do not water


class Env():

    def __init__(self):
        self.name = "environment"

    def reward(self, state):
        # reward for the state we transition into
        r = -100
        if state is State.shortWater:
            r = -1
        elif state is State.health:
            r = 1
        elif state is State.overflow:
            r = -1
        else:  # State.apoptosis
            r = -100
        return r

    def action(self, state, action):
        # transition model: returns (next state, reward)
        if state is State.shortWater:
            if action is Action.water:
                newState = [State.shortWater, State.health]
                p = [0.4, 0.6]
            else:
                newState = [State.shortWater, State.apoptosis]
                p = [0.4, 0.6]
        elif state is State.health:
            if action is Action.water:
                newState = [State.health, State.overflow]
                p = [0.6, 0.4]
            else:
                newState = [State.shortWater, State.health]
                p = [0.6, 0.4]
        elif state is State.overflow:
            if action is Action.water:
                newState = [State.overflow, State.apoptosis]
                p = [0.6, 0.4]
            else:
                newState = [State.health, State.overflow]
                p = [0.6, 0.4]
        else:  # apoptosis is absorbing
            newState = [State.apoptosis]
            p = [1.0]
        nextState = random.choices(newState, p)[0]
        r = self.reward(nextState)
        return nextState, r


class Agent():

    def __init__(self):
        self.S = [State.shortWater, State.health, State.overflow, State.apoptosis]
        self.A = [Action.water, Action.noWater]
        self.Q = {}       # cumulative reward estimate per (state, action)
        self.count = {}   # number of updates per (state, action)
        self.policy = {}  # target policy
        self.maxIter = 500
        self.epsilon = 0.2
        self.T = 10

    def initPolicy(self):
        # initialize value estimates and the target policy
        self.Q = {}
        self.count = {}
        for state in self.S:
            for action in self.A:
                self.Q[state, action] = 0.0
                self.count[state, action] = 0
            self.policy[state] = Action.noWater  # start with "no water" everywhere

    def randomAction(self):
        # uniform random action
        return random.choices(self.A, [0.5, 0.5])[0]

    def behaviorPolicy(self):
        # epsilon-greedy behavior policy, starting from the water-short state
        state = State.shortWater
        env = Env()
        trajectory = {}  # t -> [s_t, a_t, r_t]
        for t in range(self.T):
            rnd = np.random.rand()
            if rnd < self.epsilon:
                action = self.randomAction()
            else:
                # follow the target policy
                action = self.policy[state]
            newState, reward = env.action(state, action)
            trajectory[t] = [state, action, reward]
            state = newState
        return trajectory

    def calcW(self, trajectory):
        # per-step weights w_t = exp(p_t - q_t)
        q1 = 1.0 - self.epsilon + self.epsilon / 2.0  # a == target action
        q2 = self.epsilon / 2.0                       # a != target action
        w = {}
        for t, (state, action, reward) in trajectory.items():
            if action == self.policy[state]:
                p, q = 1, q1
            else:
                p, q = 0, q2
            w[t] = round(np.exp(p - q), 3)
        return w

    def getReward(self, t, wDict, trajectory):
        # R = (1/(T-t)) * (sum of rewards from t) * (product of weights from t)
        p = 1.0
        r = 0.0
        for i in range(t, self.T):
            r += trajectory[i][-1]
            p = p * wDict[i]
        R = p * r
        m = self.T - t
        return R / m

    def improve(self):
        # greedy policy improvement
        a = Action.noWater
        for state in self.S:
            maxR = self.Q[state, a]
            for action in self.A:
                R = self.Q[state, action]
                if R >= maxR:
                    maxR = R
                    self.policy[state] = action

    def learn(self):
        self.initPolicy()
        for s in range(1, self.maxIter):  # sample the s-th trajectory
            # generate a trajectory with the behavior (epsilon-greedy) policy
            trajectory = self.behaviorPolicy()
            w = self.calcW(trajectory)
            print("\n iteration %d" % s,
                  "\t shortWater:", self.policy[State.shortWater].name,
                  "\t health:", self.policy[State.health].name,
                  "\t overflow:", self.policy[State.overflow].name,
                  "\t apoptosis:", self.policy[State.apoptosis].name)
            # policy evaluation
            for t in range(self.T):
                R = self.getReward(t, w, trajectory)
                state = trajectory[t][0]
                action = trajectory[t][1]
                Q = self.Q[state, action]
                count = self.count[state, action]
                self.Q[state, action] = (Q * count + R) / (count + 1)
                self.count[state, action] = count + 1
            # policy improvement
            self.improve()


if __name__ == "__main__":
    agent = Agent()
    agent.learn()

