PHP 大数组循环的优化问题

先提示一下$item = 'chinaisbig';
preg_match_all('/china|is|big|which|not|me/', $item, $r);
print_r($r);Array
(
    [0] => Array
        (
            [0] => china
            [1] => is
            [2] => big
        ))
即 echo preg_match_all('/china|is|big|which|not|me/', $item); //3
当然，两轮循环还是要的

利用 trie 算法构造字典，查询起来就快多了$aItems = array(
    'chinaisbig',
    'whichisnot',
    'totalyrightforme',
);
$aTable = array(
    'china,is|small',
    'china,big|me',
    'china,is|big,which|not,me',
    'china,is|small',
);$p = new ttrie;
foreach($aTable as $k=>$r) {
  $r = explode(',', str_replace('|', ',', $r));
  foreach($r as $v) $p->set($v, $k);
}
foreach($aItems as $k=>$s) {
  $t = array_count_values($p->match($s));
  $c = array_search(max($t), $t);
  printf("%s %d \$aTable[%d]=%s\n", $s, max($t), $c, $aTable[$c]);
} class TTrie {
  protected $buffer = array();
  protected $dict = array( array() );
  protected $input = 0; //字符串当前偏移
  protected $backtracking = 0; //字符串回溯位置
  public $debug = 0;
  public $savematch = 1;

  function set($word, $action='') {
    if(is_array($word)) {
        foreach($word as $k=>$v) $this->set($k, $v);
        return;
    }
    $p = count($this->dict);
    $cur = 0; //当前节点号
    foreach(str_split($word) as $c) {
        if (isset($this->dict[$cur][$c])) { //已存在就下移
            $cur = $this->dict[$cur][$c];
            continue;
        }
        $this->dict[$p]= array(); //创建新节点
        $this->dict[$cur][$c] = $p; //在父节点记录子节点号
        $cur = $p; //把当前节点设为新插入的
        $p++;
    }
    $this->dict[$cur]['acc'][] = $action; //一个词结束，标记叶子节点
  }

  function match($s) {
    $this->buffer = array();
    $this->input = 0;
    $this->backtracking = 0;    $ret = array();
    $cur = 0; //当前节点，初始为根节点
    $i =& $this->input; //字符串当前偏移
    $p =& $this->backtracking; //字符串回溯位置
    $s .= "\0"; //附加结束符
    $len = strlen($s);
    $buf = '';
    $res = array();
    while($i < $len) {
        $c = $s{$i};
        if(isset($this->dict[$cur][$c])) { //如果存在
            $cur = $this->dict[$cur][$c]; //转到对应的位置
            if(isset($this->dict[$cur][$s[$i+1]])) {//检查下一个字符是否也能匹配，长度优先
                $i++;
                continue;
            }
            if(isset($this->dict[$cur]['acc'])) { //是叶子节点，单词匹配！
                $res = array_merge($res, $this->dict[$cur]['acc']);

                $p = $i + 1; //设置下一个回溯位置
                $cur = 0; //重置当前节点为根节点
            }
        } else { //不匹配
            $buf .= $s{$p}; //substr($s, $p, $i - $p + 1); //保存未匹配位置和未匹配的内容
            $cur = 0; //重置当前节点为根节点
            $i = $p; //把当前偏移设为回溯位置
            $p = $i + 1; //设置下一个回溯位置
        }
        $i++; //下一个字符
    }
    if(trim($buf, "\0")) $this->buffer[] = trim($buf, "\0");
    return $res;
  }
}
chinaisbig 3 $aTable[2]=china,is|big,which|not,me
whichisnot 3 $aTable[2]=china,is|big,which|not,me
totalyrightforme 1 $aTable[1]=china,big|me

最好别用正则
这个说法比较陈旧了，目前php的正则分析器的效率已经是很高了。
至少和 strstr 不相上下不过正则毕竟是个通用的工具，性能总是要比专用工具要差一些的
比如你的到匹配集 $aItems 和字典 $aTable 规模都是一万
用strstr时，总的检查次数为 1万*1万*字典字平均个数
用正则时，总的检查次数为 1万*1万
即便是正则的效率比strstr低，也不至于低到不可接受的地步
而用trie算法时的检查次数为 1万，这是因为字典被组织成了哈希表，而哈希表的时间复杂度为O(0)，查表被忽略不计了其实你把三种方案都实际测试一下，就能得出结论了

怎么是2种呢？
你自己的 strstr 方案和我的正则、trie 方案，加载一起不等于 3 吗？

方案一： strstr
这个算法是我原来写的6倍。就是好难懂。

就算法本身而言，你只要知道原理就行了，具体如何实现的并不一定要知道
就好比 strstr、preg_mach_all 函数一样，你并不需要知道他们是如何用 C 语言写成的。不也是用的好不过的吗？
从探讨算法的实现角度上看，当然应该了解实现的具体方案了。嘟囔着就需要你花时间研读一下《数据结构》了
在这个版本的 trie 树的构造上，采用的方法是将树状数据把存到一维线性表中的算法。这也是 trie 算法理论描述时采用的算法。这个映射算法的好处是不用递归。
php 提供的关联数组本身就是一个哈希表结构，同样可以方便的保存树。不过访问时要使用“数的遍历”算法trie 算法的基本原理就是：
取输入串的一个字符，看是否存在于字典中
如果不存在，就取下一个，直到出现了存在于字典中的字符
如果下一个字符也是字典中的下一个字符，则继续前进
如果不是则表示已经匹配到了一个词，当前字符是下一个次的开始
如此循环直至输入串结束该算法的好处是只需扫描一遍就可找出输入串中的字典字，而查字典的代价也是极小的我描述的可能你看不懂，你可到网上去查 trie 算法的说明

我做压力测试了, 用trie算法的效率是我自己写的最终版本的10倍之多。7000条内容+1000条字典, 就是好占用内存啊。要设置到1G才肯工作。以我电脑的性能，我写的用了100秒，trie算法10秒左右，好强大。

先回答你 #14 的疑问
给传入串附加一个肯定是单词结束的字符，这样就可以按正常流程判别到位于传入串末尾的字典字
否则还需要在结束后检查最后一个单词是否为字典字为什么可以提升效率，道理很简单
你应该注意到，无论是程序代码和我的说明中都反复提到“字典”（dict）
当你在查字典时，不会是一页一页的翻吧，而是按照单词中字母的顺序直接定位到相应的页上
而那个构造字典的set方法已经将相同的词聚集在一起了，同名词只出现一次。附加的 acc 元素记录了这个单词是由那条规则提供的。
这点你应该已经看出来了，想想就知道为什么快了
比如一个单词在10条规则中出现，一旦我识别出这个单词，就知道了有10条规则与之匹配。而不需要对这10条规则逐一比对了

请问我现在这样写哈有问题吗？
<?php
error_reporting(E_ALL & ~E_NOTICE & ~E_DEPRECATED & ~E_STRICT);
header('Cache-Control: no-cache, must-revalidate');
header("Expires: Sat, 26 Jul 1997 05:00:00 GMT");$aItems = array(
'prefixchinaisbig',
'whichisnotpostfix',
'totalyconfusionrightforme',
);
$aTable = array(
'china,is|small',
'china,big|me',
'china,is|big,which|not,me',
'totaly|right,for,me',
);$oWeight = new weight;
$oWeight->newTable($aTable);
$oWeight->newItems($aItems);$oWeight->bDebug && printf("%s\n=========================================\n", 'Debug Mode');
$tb = microtime(true);
$aRes = $oWeight->getShow();
$ta = microtime(true);
$elasped = ($ta - $tb)*1000;$oWeight->bDebug && printf("%s\n=========================================\n", 'Tire Dict');
$oWeight->bDebug && print_r($oWeight->getDict());echo count($oWeight->getItems()). ' 个条目, ';
echo "执行时间: {$elasped} ms<br/>\n";
var_dump($aRes);class weight{
public $bDebug = true;
public $aShow = array();
protected $aDict = array( array() );
protected $aItems = array(); public function newItems($mItems) {
//导入新的要查询的内容
$this->aItems = (is_array($mItems))? $mItems: array($mItems);
$this->init();
} public function newTable(array $aTable) {
//导入新的对照表，并生成tire树形字典
foreach($aTable as $iTableKey=>$sTableLine) {
$aTableLine = explode(',', str_replace('|', ',', $sTableLine));
$setter = function($v, $k, $paraMeter) {
$k1 = $paraMeter[0]; $oWeight = $paraMeter[1];
$oWeight->genDict($v, $k1);
};
array_walk($aTableLine, $setter, array($iTableKey, $this));
}
$this->init();
} private function init() {
//清空记录的匹配表和输出结果
unset($this->aShow);
} public function getShow() {
//获取最终的显示结果
if (empty($this->aShow))
return $this->genShow();
return $this->aShow;
} public function getItems() {
} public function getDict() {
return $this->aDict;
} private function genShow() {
$aMatchs = array();
$aShow = array();
$debug = $this->bDebug;
$getter = function($v, $k, $oWeight) use(&$aMatchs, &$aShow, $debug) {
$debug && print 'item: '. $v. '(length:'.strlen($v).")\n".'-----------------------'."\n";
$aMatchs[$k] = array_count_values($oWeight->matchElement($v));
$debug && print 'matchRules Generated'."\n";
$debug && print_r($aMatchs[$k]);
$aShow[$k] = array_keys($aMatchs[$k], max($aMatchs[$k]));
};
array_walk($this->aItems, $getter, $this);
$this->aShow = $aShow;
return $this->aShow;
} private function genDict($mWord, $sAction='') {
//将字典处理为Trie树形
if(is_array($mWord)) {
foreach ($mWord as $k=>$v) $this->genDict($v, $k);
return;
}
$iP = count($this->aDict);
$iCur = 0;
foreach (str_split($mWord) as $sChar) {
if (isset($this->aDict[$iCur][$sChar])) {
$iCur = $this->aDict[$iCur][$sChar];
continue;
}
$this->aDict[$iP] = array();
$this->aDict[$iCur][$sChar] = $iP;
$iCur = $iP;
$iP++;
}
$this->aDict[$iCur]['acc'][] = $sAction;
} function matchElement($sLine) {
//同Tire字典比对
$iCur = 0;
$iOffset = 0;
$iBackPos =0;
$iLen = strlen($sLine);
$iWordLength = 0;//记录单词长度
$aRefer = array();
$debug = $this->bDebug;
while ($iOffset < $iLen) {
$sChar = $sLine{$iOffset};
$debug && print 'trying char:'.$sChar ."\n";
if ($debug) {
print "isset?\$this->aDict[$iCur][$sChar],";
isset($this->aDict[$iCur][$sChar])? print 'yes'. "\n": print 'no'. "\n";
}
$debug && print "wordLength: $iWordLength\n";
if (isset($this->aDict[$iCur][$sChar])) {
$iCur = $this->aDict[$iCur][$sChar];
$iWordLength++;
$debug && print "iCur: $iCur\n";
if (isset($this->aDict[$iCur]['acc'])) {
$debug && print 'word('.substr($sLine, $iBackPos, $iWordLength) . ') got new acc'."\n";
$debug && print_r($this->aDict[$iCur]['acc']);
$aRefer = array_merge($aRefer, $this->aDict[$iCur]['acc']);
$iBackPos = $iOffset + 1;
$iCur = 0;
$iWordLength = 0;
$iOffset++;
continue;
}
} else {
$iCur = 0;
$iBackPos = $iOffset + 1;
$iWordLength = 0;
}
$iOffset++;
}
$debug && print_r($aRefer);
return $aRefer;
}
}
?>

而且这个内存占用量实在是太大了。到了不可接受的地步了啊我测试的时候10000条item
1000条table占了1G多的内存啊

Array
(
    [0] => Array
        (
            [0] => 2
        )    [1] => Array
        (
            [0] => 2
        )    [2] => Array
        (
            [0] => 3
        ))没有什么问题，只是把简单的事情弄复杂了

哈哈。我只是添加了一些调试用的代码在里面啊。流程应该是简化了一些。板大可否再问最后一个问题呢。有可能解决内存占用问题么？现在的内存占用太大了。
应该是从 items分离出来的字符串，和字符串分割而成的单个字符造成的吗？有可能解决么？
希望版主能再帮我一次

字典很大？能达到什么规模？没什么概念
我用它做关键字识别，数万关键字也就几兆而已换个字典生成算法，你比较一下？$aItems = array(
    'prefixchinaisbig',
    'whichisnotpostfix',
    'totalyconfusionrightforme',
);
$aTable = array(
    'china,is|small',
    'china,big|me',
    'china,is|big,which|not,me',
    'totaly|right,for,me',
);$p = new trie;
foreach($aTable as $k=>$r) {
  foreach(explode(',', strtr($r, '|', ',')) as $v)
    $p->set($v, $k);
}
foreach($aItems as $s) {
  print_r($p->match($s));
}class trie {
  public $dict = array();
  function set($word, $acc) {
    $p =& $this->dict;
    foreach(str_split($word) as $c) {
      if(! isset($p[$c])) $p[$c] = array();
      $p =& $p[$c];
    }
    $p['acc'][] = $acc;
  }
  function match($s) {
    $s = "$s ";
    $res = array();
    $p =& $this->dict;
    $k = -1;
    for($i=0; $i<strlen($s); $i++) {
      if(isset($p[$s{$i}])) {
        $p =& $p[$s{$i}];
        if($k < 0) $k = $i;
      }else {
        if(isset($p['acc'])) {
          $res = array_merge($res, $p['acc']);
        }else if($k>=0) $i = $k + 1;
        $p =& $this->dict;
        $k = -1;
      }
    }
    return $res;
  }
}
这个字典是 9496b，你用的那个字典是12760b

可是为什么我用我最后发出来的那段程序运行aItem 10000个条目
aTable 1000个条目运行后，ini_set('memory_limit', '1024M');
还是提示
Fatal error: Allowed memory size of 1073741824 bytes exhausted (tried to allocate 4096 bytes) in /home/latel/Workspace/test2/20131213/weightOOP_trie_test.php on line 80838083行是
$aShow[$k] = array_keys($aMatchs[$k], max($aMatchs[$k])); 也就是我最后贴的那段代码的第92行，我弄不清楚这么大的内存占用是怎么来的。

你可在适当的地方用 memory_get_usage 函数观察一下啊
毛病总是得找到的aTable 1000个条目
设平均每条目10个单词，总共也就 10000 个单词，即便是 10000 个完全不同的单词
也就是 10000 * 104 * 平均单词长度
若平均单词长度为 10 也就 10M 不到一点我倒是觉得你把数组传来传去的，而不是引用。才是个问题

调试易

PHP 大数组循环的优化问题

解决方案 »