
贝叶斯逻辑和过滤器

www.educity.cn   Posted by: wxli00   Source: reposted from the web   Published: August 27, 2013

  Bayesian Logic and Filters

  Some say that if you can’t measure something, you’re not doing science. Bayesian logic offers a way to measure things that were previously unmeasurable, allowing us to test hypotheses and predictions and thereby refine our conclusions and decisions. Bayesian filtering is a hot topic in the area of spam control today.

  Basic probability is simple to calculate, because you’re dealing with a limited number of factors and possibilities. Let’s consider a horse race with 10 horses entered. If that’s the only information we have on which to base a wager, then we could pick any horse on the basis that its chance of winning is 1 in 10, or 0.10. Take that kind of math to the track, however, and you’ll quickly be separated from the contents of your wallet. The real world is far more complicated, and here’s where Bayesian logic comes into the picture.

  In fact, each of the 10 horses has already run at least a few races and therefore has a history. If Lightning has won every race he has entered, and Thunder has lost every one he has entered, then we’ve got a real evidential basis on which to bet on Lightning instead of on Thunder.

  Beyond win-loss records, there’s a lot more information available about every horse in the race. We know or can easily find out the following:

  Lineage: Is this horse the offspring of a champion? How have his brothers and sisters performed?

  Performance under different weather conditions: If it rains in the morning and the track is soft, how does that affect his speed?

  Position on the track: Is our horse next to the rail or on the outside? And how does the horse react when he’s in that position?

  Length of time since last race: If the horse ran a long, hard race yesterday, how well is he likely to run today?

  Distance of today’s race: How has the horse fared at this distance in the past?

  Other people’s betting patterns also come into play. They don’t affect how well a horse will perform, but they have a clear impact on the size of the payoff if he does win.

  All of this information can help us make a better estimate of our horse’s chance of winning than the simplistic 1 in 10. Starting from that initial estimate (the prior) and revising it as each new piece of evidence comes in is the essence of a Bayesian process.
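  The horse-race reasoning can be sketched in a few lines of Python. This is only an illustration: the likelihood values are invented for the example (they stand for how well each horse's past record fits a winner), and the eight unnamed horses are placeholders.

```python
def posterior(prior, likelihood):
    """Bayes' rule: posterior is proportional to prior * likelihood."""
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    total = sum(unnormalized.values())
    return {h: p / total for h, p in unnormalized.items()}

# Ten horses; with no other information each has a 1-in-10 chance.
horses = ["Lightning", "Thunder"] + [f"Horse{i}" for i in range(3, 11)]
prior = {h: 1 / len(horses) for h in horses}

# Hypothetical likelihoods: how probable each horse's racing history
# would be if that horse were today's winner (made-up numbers).
likelihood = {h: 0.3 for h in horses}
likelihood["Lightning"] = 0.9   # has won every race entered
likelihood["Thunder"] = 0.05    # has lost every race entered

post = posterior(prior, likelihood)
for name in ("Lightning", "Thunder", "Horse3"):
    print(f"{name}: prior {prior[name]:.3f} -> posterior {post[name]:.3f}")
```

  With these assumed numbers, Lightning's estimated chance rises well above 1 in 10 and Thunder's falls well below it, while the probabilities across all ten horses still sum to 1.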

  Bayesian Antispam

  The application of Bayesian logic to the spam problem was popularized by Paul Graham’s 2002 paper “A Plan for Spam,” and the approach was soon adopted by numerous developers. Bayesian spam filtering is based on the notion that the presence of certain words indicates spam, while other words mark a message as legitimate. It has that in common with other content-based scoring filters, but with the added advantage that Bayesian filters build their own lists of telltale words and characteristics rather than working from manually created lists.

  A Bayesian filter starts by examining one set of e-mails known to be spam and another set known to be legitimate (the prior knowledge). It compares the contents of both sets: not just the message body, but also header information and metadata, word pairs and phrases, and even HTML code for information such as the use of specific colors. From this, it builds a database of words, or tokens, with which it can identify future e-mails as spam or not.
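  A minimal sketch of this training step might look like the following. It is not any particular product's implementation: the tokenizer, the add-one smoothing, the assumption that spam and legitimate mail are equally likely beforehand, and the four sample messages are all choices made for the example, and only message bodies are tokenized (no headers or HTML).

```python
import re
from collections import Counter

def tokenize(text):
    """Split a message into a set of lowercase tokens (simplified)."""
    return set(re.findall(r"[a-z0-9$'-]+", text.lower()))

def train(spam_msgs, ham_msgs):
    """Build a token -> spam-score table from two labeled corpora."""
    spam_counts, ham_counts = Counter(), Counter()
    for msg in spam_msgs:
        spam_counts.update(tokenize(msg))
    for msg in ham_msgs:
        ham_counts.update(tokenize(msg))

    table = {}
    for token in set(spam_counts) | set(ham_counts):
        # Fraction of messages in each corpus containing the token,
        # with add-one smoothing so no score is exactly 0 or 1.
        p_spam = (spam_counts[token] + 1) / (len(spam_msgs) + 2)
        p_ham = (ham_counts[token] + 1) / (len(ham_msgs) + 2)
        # Assumes spam and legitimate mail are equally likely a priori.
        table[token] = p_spam / (p_spam + p_ham)
    return table

# Tiny made-up corpora, for illustration only.
spam = ["free money now", "win free prize"]
ham = ["meeting at noon", "free lunch at the meeting"]
table = train(spam, ham)
for token in ("money", "free", "meeting"):
    print(token, round(table[token], 3))
```

  Scores above 0.5 lean toward spam and scores below 0.5 lean toward legitimate mail. Note that "free" ends up only mildly spam-leaning here because it also appears in one legitimate message.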

  Bayesian filters take into account the whole context of a message. For example, many spam messages contain the word “free” in the subject line, but so do some legitimate messages. A Bayesian filter notes this word but also looks at the other tokens in the message, because falsely identifying a real message as spam (a false positive) causes more problems than letting some spam through as legitimate.
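  One common way to weigh all the tokens together is the naive Bayes combination sketched below. The per-token scores in the table and the 0.9 threshold are hypothetical values chosen for the example; a high threshold is one simple way to favor letting some spam through over producing false positives.

```python
import math

def spam_probability(tokens, table, unknown=0.5):
    """Combine per-token spam scores, treating tokens as independent.

    Computes prod(p) / (prod(p) + prod(1 - p)) in log space so that
    long messages do not underflow. Scores must lie strictly in (0, 1).
    """
    log_spam = 0.0
    log_ham = 0.0
    for token in tokens:
        p = table.get(token, unknown)  # unseen tokens are neutral
        log_spam += math.log(p)
        log_ham += math.log(1.0 - p)
    # Clamp the exponent to avoid OverflowError on extreme evidence.
    diff = min(log_ham - log_spam, 700.0)
    return 1.0 / (1.0 + math.exp(diff))

# Hypothetical per-token spam scores (not measured from real mail).
table = {"free": 0.6, "viagra": 0.99, "meeting": 0.05, "agenda": 0.03}
THRESHOLD = 0.9  # assumed cutoff; higher means fewer false positives

for tokens in (["free", "viagra"], ["free", "meeting", "agenda"]):
    p = spam_probability(tokens, table)
    verdict = "spam" if p > THRESHOLD else "legitimate"
    print(tokens, round(p, 4), verdict)
```

  Both sample messages contain "free", but the surrounding tokens push the first one above the threshold and the second far below it.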

  According to proponents, fewer than 1% of the messages identified as spam by Bayesian filters are false positives.

  The Bayesian spam filter’s real power, however, lies in its ability to learn: as the user tags new messages, the filter updates its database to identify new patterns of spam.
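  The learning loop can be sketched as a small class whose counts are updated every time the user tags a message. The class name `LearningFilter`, its methods, and the sample messages are hypothetical, and the scoring uses the same simplifying assumptions as a basic naive Bayes filter (add-one smoothing, equal prior for spam and legitimate mail).

```python
from collections import Counter

class LearningFilter:
    """Toy filter whose token database grows as the user tags mail."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.totals = {"spam": 0, "ham": 0}

    def tag(self, text, label):
        """Record one user-tagged message; label is 'spam' or 'ham'."""
        self.counts[label].update(set(text.lower().split()))
        self.totals[label] += 1

    def token_score(self, token):
        """Smoothed spam score for one token, between 0 and 1."""
        p_spam = (self.counts["spam"][token] + 1) / (self.totals["spam"] + 2)
        p_ham = (self.counts["ham"][token] + 1) / (self.totals["ham"] + 2)
        return p_spam / (p_spam + p_ham)

f = LearningFilter()
before = f.token_score("crypto")      # never seen: neutral score
f.tag("crypto giveaway today", "spam")
f.tag("crypto giveaway bonus", "spam")
f.tag("lunch today", "ham")
after = f.token_score("crypto")       # now leans toward spam
print(before, round(after, 3))
```

  Because the scores are recomputed from running counts, a word that spammers have only recently started using becomes a spam signal after just a few tagged messages.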

  贝叶斯逻辑和过滤器

  有人说,如果你不能测量一件东西,那你不是在做科学。贝叶斯逻辑提供了一种方法,来测量过去不能测量的东西,让我们验证假设和预言,从而完善我们的结论和决策。贝叶斯过滤是当今垃圾邮件领域中的热门话题。

  基本概率计算起来很简单,因为你只与数量有限的因素和可能性打交道。让我们来看看有十匹马参加的赛马吧。如果这是我们下赌注依据的惟一信息,那么我们只能任取一匹马,赢的可能性只有十分之一,即0.10。然而按这样的思路计算,你很快会输光兜里的钱。实际世界远远地比此复杂,这正是贝叶斯逻辑的用武之地。

  事实上,每匹马都已赛过多次,因而都有历史纪录。如果“闪电”这匹马只要一出场,每次都必胜,而“雷声”那匹马一出场必输,那么我们已经有了下赌注的有据可查的真实基础,将赌注下在“闪电”上而不是“雷声”上的。

  实际上,比赛中的每一匹马都有大量的信息可资利用。我们知道或者很容易找到下列的信息:

  血统:这是一匹冠军马的后代吗?它的兄弟姐妹的表现如何?

  在不同天气状况下的表现:如果早晨下雨、赛道松软,会对它的速度有何影响?

  赛道位置:我们下注的马是靠在最里面、还是在最外?我们的马处于这样的位置,会做何反应?

  距上次比赛过了多少时间:如果这匹马昨天跑了长距离的艰难比赛,今天有可能跑多快?

  今天比赛的距离:过去,这匹马在这样的距离上表现如何?

  其他人的下赌方式也起作用。虽然他们不会影响马的表现,但他们对赢者能赢多少钱有明显的影响。

  所有这些信息帮助我们比简单的十个中任选一个的方法能更好地预测我们下注马的赢的机会。分析这些因素就是贝叶斯过程。

  贝叶斯反垃圾邮件

  贝叶斯逻辑应用于垃圾邮件问题,起始于Paul Graham在2002年发表的“对付垃圾邮件的计划”一文,很快众多的开发者采用这个方法。贝叶斯垃圾邮件过滤基于这样一个概念,即某些词的出现象征着垃圾邮件,而其他的词识别合法信息。它除了有其他基于内容评分的过滤器一样的功能外,还具有额外的优势,贝叶斯过滤能生成自己的警告词和特性,而不只是按人工生成的列表工作。

  贝叶斯过滤器是从检查一组已知的垃圾邮件和一组已知的合法邮件(先前的知识)开始工作的。它比较这两组的内容,不仅是信息本体,而且还有报头信息和元数据、词组和短语,甚至信息的HTML代码(如使用指定的颜色)。据此,它编制词语数据库,即令牌,利用这个令牌,就能识别以后的电子邮件是不是垃圾。

  贝叶斯过滤器会考虑一条信息的所有内容。例如,很多垃圾信息在标题中包含“免费”这个词,但是有些合法信息也这么做。贝叶斯过滤器注意到这个词,但也察看信息中的其他令牌,因为把合法信息误识别成垃圾邮件引起的问题,超过将垃圾邮件当成合法邮件。

  据支持者称,贝叶斯过滤器把(合法)信息识别成垃圾邮件,不到1%。

  然而,贝叶斯垃圾邮件过滤器的真正功能在于它学习的能力:当用户给新的信息打上新的标记时,过滤器更新它的数据库,以识别出垃圾邮件的新模式。
