Providing a Jieba word-segmentation interface from PHP


This topic has 1 reply

#1 Jamers

    Administrator

  • Forum administrator
  • 228 posts
  • Callsign: BI4TSQ

Posted 2017-11-22 11:09

Python has a word-segmentation library called jieba ("结巴") that looks pretty good. Can we try using it from PHP?

 

First, the segmentation Python script:

# -*- coding: UTF-8 -*-
import sys
import jieba
import json

# Join all command-line arguments into one input string.
# (Don't shadow the built-in `str` as the original draft did.)
text = " ".join(sys.argv[1:])

if text == "":
    text = "他来到了网易杭研大厦"

out = {"TITLE": "jieba V" + jieba.__version__}

# Full mode: list every possible word
seg_list = jieba.cut(text, cut_all=True)
out['FULL'] = "||".join(seg_list)

# Accurate mode (the default)
seg_list = jieba.cut(text, cut_all=False)
out['DEF'] = "||".join(seg_list)

# Search-engine mode
seg_list = jieba.cut_for_search(text)
out['SEARCH'] = "||".join(seg_list)

print(json.dumps(out))

Next, the PHP script:

<?php
// Adjust this path to wherever the script lives
$script = '/usr/local/www/apache24/data/py/tst.py';

$str = "我家住在黄土高坡";
if (isset($_REQUEST['str'])) $str = $_REQUEST['str'];

header('Content-Type: application/json; charset=UTF-8');
// Escape the arguments so user input cannot inject shell commands
system('python ' . escapeshellarg($script) . ' ' . escapeshellarg($str));

There's just one problem left: it's slow. For the same text, the Python side alone takes only 0.2 s, but the PHP request takes 2-3 s, because every request has to start a fresh Python interpreter and reload jieba's dictionary. Looks like the next step is to serve the requests directly from Python.
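Most of that gap is process start-up cost rather than segmentation time. A minimal sketch to measure just the bare interpreter launch (standard library only; jieba's dictionary load adds roughly another second on top of this per invocation):

```python
import subprocess
import sys
import time

# Time how long it takes just to start a Python interpreter and exit.
# This fixed cost is paid on every system() call from PHP.
start = time.time()
subprocess.run([sys.executable, "-c", "pass"], check=True)
startup = time.time() - start

print("interpreter startup: %.3f s" % startup)
```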



#2 Jamers

Posted 2017-11-22 11:57

Using Python's own HTTP server, it's noticeably faster: the same request now completes in roughly 200 ms.

# -*- coding: UTF-8 -*-
import tornado.httpserver
import tornado.ioloop
import tornado.options
import tornado.web
import jieba
import json

from tornado.options import define, options
define("port", default=8000, help="run on the given port", type=int)

class IndexHandler(tornado.web.RequestHandler):
    def get(self):
        sstr = self.get_argument('str', '')
        self.set_header("Content-Type", "application/json; charset=UTF-8")
        self.write(self.jjcut(sstr))

    def post(self):
        sstr = self.get_argument('str', '')
        self.set_header("Content-Type", "application/json; charset=UTF-8")
        self.write(self.jjcut(sstr))

    def jjcut(self, sstr=""):
        if sstr == '':
            sstr = '他来到了天安门广场'
        out = {"TITLE": "jieba V" + jieba.__version__}

        # Full mode: list every possible word
        seg_list = jieba.cut(sstr, cut_all=True)
        out['FULL'] = "||".join(seg_list)

        # Accurate mode (the default)
        seg_list = jieba.cut(sstr, cut_all=False)
        out['DEF'] = "||".join(seg_list)

        # Search-engine mode
        seg_list = jieba.cut_for_search(sstr)
        out['SEARCH'] = "||".join(seg_list)

        return json.dumps(out)

if __name__ == "__main__":
    tornado.options.parse_command_line()
    app = tornado.web.Application(handlers=[(r"/", IndexHandler)])
    http_server = tornado.httpserver.HTTPServer(app)
    http_server.listen(options.port)
    tornado.ioloop.IOLoop.instance().start()
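With the server listening, the PHP side only needs a plain HTTP GET. For illustration, here is how a client would build the request URL and unpack the `||`-joined fields of the JSON response (the host, port, version string, and sample response body below are assumptions, not real output):

```python
import json
from urllib.parse import urlencode

# Build the URL the PHP side (or any client) would request.
# Host and port are assumptions matching the server sketch above.
url = "http://127.0.0.1:8000/?" + urlencode({"str": "我家住在黄土高坡"})

# A response of the shape the server returns (illustrative values).
sample = '{"TITLE": "jieba V0.39", "DEF": "我家||住在||黄土||高坡"}'
resp = json.loads(sample)

# Each field is a "||"-joined word list; split it back apart.
words = resp["DEF"].split("||")
print(words)
```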






Tags: PHP, Python, JieBa, word segmentation
