求一道百度面试题的解

数量巨大~不知道存储格式是什么样的,如果是XML格式的,用XPATH查找比较应该比其它格式快点!

个人认为比较简单的（也容易讲清楚）：
1、先对a、b两个文件先进行外排序（计算复杂度约NLog(N)，空间可限制在4G内)。
2、对a、b两个文件用Streaming的方式找相同记录（计算复杂度为N，空间可限制在4G内)。
ExternalSort("a.txt");
ExternalSort("b.txt");//注，ReadLine可改成ReadBlock64之类。
using( StreamReader aReader = new StreamReader("a.txt"))
using (StreamReader bReader = new StreamReader("b.txt"))
{
    string aLine = aReader.ReadLine();
    string bLine = bReader.ReadLine();    while (aLine != null && bLine != null)
    {
        int diff = string.Compare(aLine, bLine);
        if (diff < 0)
        {
            aLine = aReader.ReadLine();
        }
        else if(diff > 0)
        {
            bLine = bReader.ReadLine();
        }
        else
        {
            Console.WriteLine(aLine);        // output the matched line
            aLine = aReader.ReadLine();
            bLine = bReader.ReadLine();
        }
    }
}

关键是排序   4G = 1024 * 1024 * 1024 * 4  byte  40亿 * 8 bit
比较的文件是  50亿 *  16bit 大概单个文件是 10G
不能直接排序需要分割文件  利用归并排序的方法1. 先分成10个小文件. 把每个小文件读入内存排序后写回去.
2. 对10个有序的文件再做一个多路归并的排序. C/C++ code
    //多路归并的外排序
    //思路如下:
    //1.按各输入文件中下一个读到的元素的大小构造一个输入流最小堆.
    //2.从堆顶文件里读一个元素并写入输出文件.
    //3.同时按读的那个文件的下一个元素的值调整堆.
    //4.若第3步已到达文件结尾.则从堆中删除该输入流
    //5 如果堆中还有元素. 回到第2步
#include<iostream>
#include<fstream>
#include<vector>
#include<algorithm>
#include<iterator>
#include<functional>
    using namespace std;
    //这个类主要用来管理一个输入流.
    //它知道流中还有没有元素.
    //可以查看下一个将读出来的元素的值.
    //可以读出一个元素.
    template <class T>
    class Yudu {
    private:
        istream & _istream;
        T _next;
        bool _have_next;
    public:
        Yudu( istream & x) : _istream(x) {
            if (_istream >> _next) _have_next = true;
            else _have_next = false;
        }
        inline bool have_next() { return _have_next; }
        //读出流中下一个元素
        T read_next() {
            if (! _have_next) {
                cout << "读取错误! 退出" << endl;
                exit(1);
            }
            T temp = _next;
            if (_istream >> _next)
                _have_next = true;
            else
                _have_next = false;
            return temp;
        }
        //看看下一个元素但不读出来
        inline const T& look_next() const { return _next; }
    };
    //比较两个Yudu对象下一个元素的大小
    //提供给paixu函数中的堆操作时使用
    class bijiao
    {
    public:
        template <typename T>
        bool operator() (const Yudu<T>* a, const Yudu<T>* b) {
            return a->look_next() > b->look_next() ;
        }
    };
    //文件排序函数
    template <typename T>
    void paixu(vector<Yudu<T>* >& v , ostream& out){
        //先删除数组中的"没有下一个元素"的Yudu对象.例如一个空的文件构造的Yudu对象.
        v.erase( remove_if(v.begin(),v.end(),
            not1(mem_fun(& Yudu<T>::have_next))), v.end() );        make_heap(v.begin(), v.end(), bijiao());        while (! v.empty() ){
            pop_heap(v.begin(),v.end(), bijiao());
            out << v.back()->read_next() << " ";   // out 以文本方式打开. 如果是其它方式这里要改.
            if (! v.back()->have_next() )
                v.pop_back();
            else
                push_heap(v.begin(), v.end(), bijiao());
        }
    }    int main() {
        ifstream in[3];
        in[0].open("1.txt");
        in[1].open("2.txt");
        in[2].open("3.txt");        vector<Yudu<int>* > v;
        v.push_back(new Yudu<int>(in[0])); //实际使用时要释放new出的对象.或者用boost的智能指针.
        v.push_back(new Yudu<int>(in[1]));
        v.push_back(new Yudu<int>(in[2]));        ofstream o("out.txt");        paixu(v, o);        cin.get();
        return 0;
    }    /*
    测试(三个文件中是已经有序的数)
    1.txt:  1 5 9 12 21
    2.txt:  3 4 5 7 8
    3.txt:  2 3 5 10 11 14 19 25 27     执行归并后
    out.txt:
    1 2 3 3 4 5 5 5 7 8 9 10 11 12 14 19 21 25 27
    */排序后比较很简单  从两个文件慢慢读出来比较大小就好了
呵呵，，我乱说的。。

牛人真多继续学习ing...

使用线程分开读就成比如可以按字符顺序a线程读以字符 "a"开头滴同时条件说明每条url各占用64字节，那么隐含着你可以使用字节比较方式，可以做一些掩码判断工作

嗯，如果没有时间限制，就两个文件，一行一个行地读。从A中取一条url，然后再去遍历B每一条url，将相同的就放在C文件中。两个文件都是一条一条地读，这样功能能实现，内存限制可以忽视！
以前见过类似的一题，不过没有内布限制，而是有时间限制，记得是先把文件分成若干个小文件，多线程来比，然后各个线程中又用hashtable来做比较的。当时不是url，而是话单

为什么是50亿 * 16bit呢？
50亿 * 64byte，算出来是接近300G的一个数字
不过这么大的文件的话，具体大小已经没有意义了

1.这是个除去冗余url的问题（假设50亿里面没有重复的A[50亿],B[50亿])。
2.url全部可以视为2进制编码，不管什么东西，都是512个0，1
3.将50亿条按照前N位排序（N取决于经验值，比如前5位可以将50亿分成M组，每组最大1G或者更小,并创建索引，指示每组的起始位置。
4.每次取A[50亿]，B[50亿]进行交叉查找，除去相同的。————————————————————————————————

调试易

求一道百度面试题的解

解决方案 »