CSDN again! Please help me with ideas. Looking for a solution, thanks!

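If the data volume is very large, I suggest using an ETL tool instead of hard-coding it in Java, for example Sun's JCAPS. I have used it to import and export more than ten million rows without problems; you can download it from Sun's website.
Also: you can define the export format and specify conditions, such as the one you mentioned (if the username is a duplicate but the Email is not, use the Email as the username). This is only my personal opinion.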

Yes, Chirs, that is exactly the idea, which is why I came to ask about the SQL statements. How do I filter this much data?

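Your feature works now, but it runs out of memory, possibly because there are too many classes consuming too much. The SQL statements are all fairly simple; what matters more now is optimizing the program and the algorithm.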

Which code do you want, the Java code? It is more than 500 lines. Is that what you are asking for?

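Too many classes? I have only one class with a dozen or so methods, and it uses a lot of CPU. May I send you the code? I don't know whether there is any room left for optimization.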

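-- Rule 3: same email, different name
update t1
set <the fields you want to update>
from tuser t1, tuser t2
where t1.email = t2.email
and t1.name <> t2.name
-- Rule 2: same name, different email
update t1
set t1.name = t1.email
from tuser t1, tuser t2
where t1.email <> t2.email
and t1.name = t2.name
select distinct * from tuser -- finally, this query gives you the result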

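Following up on post 16:
select distinct * into tbl_user from tuser
exec master..xp_cmdshell 'bcp "SELECT * FROM database_name.dbo.tbl_user" queryout "C:\tbl_user.txt" -c -U"db_username" -P"password"'
Export to a text file with bcp; importing that into MySQL should go better. I have not used MySQL, so I'll leave how to import the text file to the next poster.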

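    1. If neither the username nor the Email is a duplicate, import the row directly.
    2. If the username is a duplicate but the Email is not, use the Email as the username.
        2-1. The Email now used as the username must likewise be checked for duplicates against the usernames already imported.
    3. If the username is not a duplicate but the Email is, treat it as the same user; if the user record has empty fields, fill those fields with the new row's values.
    4. If both the username and the Email are duplicates, skip the row.
Write this logic into the SQL, export the result to text the way the poster above described, and then import it into MySQL.
SQL Server also has an import/export wizard for large data sets; you can use that directly.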

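That doesn't meet my conditions, does it? This way a row is filtered out whenever its Email is a duplicate.
But these rules
    1. If neither the username nor the Email is a duplicate, import the row directly.
    2. If the username is a duplicate but the Email is not, use the Email as the username.
        2-1. The Email now used as the username must likewise be checked for duplicates against the usernames already imported.
    3. If the username is not a duplicate but the Email is, treat it as the same user; if the user record has empty fields, fill those fields with the new row's values.
    4. If both the username and the Email are duplicates, skip the row.
do not seem to be handled. Experts, please reply.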
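To make the four rules concrete, here is a minimal Python sketch that applies them row by row. It is an illustration only, not the asker's Java program (which is not shown in the thread) and not something to run as-is over fifty million rows, since it keeps every imported user in memory. The function name `merge_users`, the `(username, email, info)` row shape, and the example addresses are hypothetical. Rule 2-1 does not say what to do when the Email-as-username also collides, so the sketch assumes such a row is skipped.

```python
def merge_users(rows):
    """Apply the four import rules to (username, email, info_dict) rows, in order."""
    by_email = {}      # email -> record already imported
    usernames = set()  # usernames already imported

    for username, email, info in rows:
        existing = by_email.get(email)
        if existing is not None:
            if username in usernames:
                # Rule 4: username and email are both duplicates, skip.
                continue
            # Rule 3: same email means same user; fill only the empty fields.
            for field, value in info.items():
                if value and not existing["info"].get(field):
                    existing["info"][field] = value
            continue

        if username in usernames:
            # Rule 2: the username is taken by another user, use the email.
            username = email
            if username in usernames:
                # Rule 2-1 (assumption): still a duplicate, so skip the row.
                continue

        # Rule 1 (or rule 2 after renaming): import the row.
        usernames.add(username)
        by_email[email] = {"username": username, "email": email, "info": dict(info)}

    return list(by_email.values())
```

For example, feeding it two rows named `a` with different emails, followed by a row `c` that reuses the first email, imports two users: the second is renamed to its email, and the first has its empty fields filled from row `c`.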

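How to filter the data: suppose your source table is S and your target table is T.
First build indexes on username and Email on both S and T, then write the logic into SQL so that the target table ends up holding exactly the data to export:
1. If neither the username nor the Email is a duplicate, import the row directly.
insert into T
select distinct username, Email from S
-- To lighten the load on the database, you can delete the rows already imported from the source table
delete S
from S
inner join T
on S.username = T.username
and S.Email = T.Email
2. If the username is a duplicate but the Email is not, use the Email as the username.
    2-1. The Email now used as the username must likewise be checked for duplicates against the usernames already imported.
insert into T
select distinct S.Email, S.Email -- use the Email as the username
from S
inner join T
on S.username = T.username
and S.Email <> T.Email
3. If the username is not a duplicate but the Email is, treat it as the same user; if the user record has empty fields, fill those fields with the new row's values.
update T
set T.field1 = T1.field1,
    T.field2 = T1.field2
from T
inner join
(select T.Email,
    -- keep the existing value unless it is null
    case when T.field1 is null then S.field1 else T.field1 end as field1,
    case when T.field2 is null then S.field2 else T.field2 end as field2
   from S
inner join T on S.Email = T.Email
) T1
on T.Email = T1.Email
4. If both the username and the Email are duplicates, skip the row.
Nothing needs to be written for this one; the three steps above already filter those rows out.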

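Finally, do not run this over all fifty million rows in one go; it will be slow.
For around ten million rows, with the indexes in place and only a few columns, I would estimate it finishes in ten-odd minutes.
PS: your Java program runs out of memory. Is that because you keep all the fetched data in memory?
Try releasing the memory as you go. I can't read Java.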
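The two pieces of advice above (do not process all fifty million rows at once, and do not hold the whole result set in memory) both come down to working in batches. Below is a minimal Python sketch of that idea, given only as an illustration because the original Java program is not shown. The names `batches` and `process_in_batches` and the batch size are hypothetical; in a real import, `rows` would be a database cursor that is read incrementally.

```python
from itertools import islice


def batches(rows, size):
    """Yield lists of at most `size` rows from any iterable, one batch at a time."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def process_in_batches(rows, size, handle_batch):
    """Pass each batch to `handle_batch` and return the number of rows processed."""
    total = 0
    for batch in batches(rows, size):
        handle_batch(batch)   # e.g. filter the batch and write it out
        total += len(batch)   # the batch can be garbage-collected after this
    return total
```

Only one batch is alive at any time, so memory use depends on the batch size rather than on the total number of rows.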

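I need you to confirm: if the username is not a duplicate but the Email is, how many rows should be inserted? If one, which one?
"If the user record has empty fields, fill those fields with the new row's values": what does this sentence mean?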

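Could it be done like this?
-- First import
insert t2
SELECT max(email),max(name) FROM t1 group by email,name
-- Later imports
insert t2
SELECT max(email),max(name)
FROM t1
where email not in(select email from t2)
group by email,name -- import the emails that t2 does not have yet
update t2
set name = t1.name
from t2, t1
where t1.email = t2.email
and t1.name is not null
and t2.name is null -- fill in the name where the email is a duplicate and the name is empty
This should work when the data volume is small.
I am not sure about large volumes; waiting for an expert.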

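A revision:

-- First import
insert t2
SELECT max(email),max(name) FROM t1 group by email,name
-- Later imports
insert t2
SELECT max(email),max(name)
FROM t1
where email not in(select email from t2)
group by email,name -- import the emails that t2 does not have yet
update t2
set name = (select max(t1.name) from t1
where t1.email = t2.email
       and t1.name is not null
)
where t2.name is null -- fill in the name where the email is a duplicate and the name is empty
This should work when the data volume is small.
I am not sure about large volumes; waiting for an expert.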

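0. Consolidate the data: import all of it into one MS SQL database.
1. Build a clustered index on email, name.
2. Filter all the data into one table: select email,max(name),max(...)... into newtb from tb group by email
3. Import newtb into MySQL.
Done.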

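Your third step: how does it filter by my four conditions? Please advise.
    1. If neither the username nor the Email is a duplicate, import the row directly.
    2. If the username is a duplicate but the Email is not, use the Email as the username.
        2-1. The Email now used as the username must likewise be checked for duplicates against the usernames already imported.
    3. If the username is not a duplicate but the Email is, treat it as the same user; if the user record has empty fields, fill those fields with the new row's values.
    4. If both the username and the Email are duplicates, skip the row.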

I suggest using SQL 2005; it can solve your problem fairly easily.

Thank you. Could you be more specific? Is it really that easy to use? Do you mean the SQL 2005 tools alone can do it? Thanks.

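Your third step: how does it filter by my four conditions? Please advise.
    1. If neither the username nor the Email is a duplicate, import the row directly.
    2. If the username is a duplicate but the Email is not, use the Email as the username.
        2-1. The Email now used as the username must likewise be checked for duplicates against the usernames already imported.
    3. If the username is not a duplicate but the Email is, treat it as the same user; if the user record has empty fields, fill those fields with the new row's values.
    4. If both the username and the Email are duplicates, skip the row.
select email,max(name),max(...)... into newtb from tb group by email
Because the rows are grouped by email, rows with different emails count as separate records, while rows with the same email are always the same record.
At the same time, the MAX function picks out the non-empty value of each field.
Only one thing is left undone: rows with the same name but different emails need their name changed to the email. There is no need to worry that the changed name will collide, because that has already been checked.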
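The effect of `group by email` combined with `max(...)` on every other column can be sketched in a few lines of Python. This is an illustration under assumptions, not the poster's statement: the function name `merge_by_email` and the tuple layout `(email, name, other1, other2)` are hypothetical, and Python compares strings by code point, whereas SQL Server compares them by collation, so ordering between two non-empty values may differ.

```python
def merge_by_email(rows):
    """Collapse rows that share an email into one row, like GROUP BY email with MAX()."""
    merged = {}  # email -> list of the largest value seen so far in each column
    for email, *fields in rows:
        if email not in merged:
            merged[email] = list(fields)
            continue
        current = merged[email]
        for i, value in enumerate(fields):
            # Like MAX(): None (NULL) is ignored, and the empty string sorts
            # below every non-empty string, so non-empty values win.
            if value is not None and (current[i] is None or value > current[i]):
                current[i] = value
    return [(email, *fields) for email, fields in merged.items()]
```

Each column is maximised independently, so the merged row can combine values from several source rows. That is the "piece the user together from multiple records" behaviour, and it also means that two different non-empty values for the same column are resolved by sort order rather than by recency.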

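Your requirement comes down to deciding by email.
A different email means a different user; the same email means the same user.
For different users who share a name, the email replaces the username so that usernames stay distinct.
Beyond that, you want the other information to be as complete as possible (if a user has several records, piece the user's information together from the values in those records, the only requirement being to avoid empty values where possible).
So my filter statement achieves everything except your rule 2.
With the clustered index in use, speed should not be a problem (although for 50 million rows an ordinary PC server should take about an hour). If you need rule 2 as well, add one more statement:
Add an index on username
update a set username = email from newtb a where username <> email
and (select count(1) from newtb where username = a.username or email=a.username )>1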
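The rename step for rule 2 can be sketched the same way. This Python version is an illustration only and assumes, as the grouped table `newtb` guarantees, that every email appears once; the function name `use_email_for_shared_names` and the `(username, email)` tuple layout are hypothetical. A username is replaced by the email when another row has the same username or when the username is identical to some other row's email.

```python
from collections import Counter


def use_email_for_shared_names(users):
    """users: list of (username, email) pairs with unique emails."""
    name_counts = Counter(name for name, _ in users)
    emails = {email for _, email in users}
    result = []
    for name, email in users:
        shared_name = name_counts[name] > 1
        # The username collides with a different row's email address.
        looks_like_other_email = name in emails and name != email
        new_name = email if (shared_name or looks_like_other_email) else name
        result.append((new_name, email))
    return result
```

Because emails are unique after grouping, a username replaced by its own email cannot collide with another replaced username, which is the point made in the post about not needing a second check.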

May I add you as a contact? My QQ is 396615834, MSN [email protected]

You understand exactly what I am trying to do; that is precisely it. Thank you. May I add you as a contact?
My QQ is 396615834, MSN [email protected]

Try this example.
-- Test setup: name and email, with other1 and other2 standing in for the other information
create table test(name varchar(100),email varchar(100),other1 varchar(100),other2 varchar(100))
insert into test select 'a','[email protected]','','test1'
insert into test select 'a','[email protected]','test2',''
insert into test select 'a1','[email protected]','test3',''
insert into test select 'a1','[email protected]','test3','test4'
insert into test select 'a1','[email protected]','','test4'
insert into test select 'b','[email protected]','',''
-- Create indexes
create index test_name on test(name)
create index test_mail on test(email)
-- Temporary table
select top 0 * into temp from test
-- Insert the rows that share a name but have different emails, using the email in place of the name
insert into temp
select email,email,other1,other2 from test a where exists(select 1 from test
where a.name=name and a.email<>email)
-- Insert the rest
insert into temp
select name,email,other1,other2 from test a where not exists(select 1 from test
where a.name=name and a.email<>email)
-- Create indexes
create index test_name on temp(name)
create index test_mail on temp(email)
-- Build a table holding the most complete information per email
select email,max(other1) as other1,max(other2) as other2
into temp_other from temp a
where exists(select 1 from temp where email=a.email)
group by email
-- Create index
create index test_mail on temp_other(email)
-- Fill in the empty fields of rows that share an email
update a set other1=b.other1 , other2=b.other2
from temp a inner join temp_other b
on a.email=b.email
-- After filling in, some rows will be duplicates
select distinct name,email,other1,other2 from temp
drop table test
drop table temp
drop table temp_other

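a         [email protected] test2 test1
b         [email protected] test2 test1
[email protected]         [email protected] test3
[email protected] [email protected] test3 test4
[email protected] [email protected] test4
Is this the result? The Email values are duplicated.

>> If the username is not a duplicate but the Email is, treat it as the same user; if the user record has empty fields, fill those fields with the new row's values.
You never said to drop them.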

-- Test setup: name and email, with other1 and other2 standing in for the other information
create table test(name varchar(100),email varchar(100),other1 varchar(100),other2 varchar(100))
insert into test select 'a','[email protected]','','test1'
insert into test select 'a','[email protected]','test2',''
insert into test select 'a1','[email protected]','test3',''
insert into test select 'a1','[email protected]','test3','test4'
insert into test select 'a1','[email protected]','','test4'
insert into test select 'b','[email protected]','',''
-- Create indexes
create index test_name on test(name)
create index test_mail on test(email)
-- Temporary table
select top 0 * into temp from test
-- Insert the rows that share a name but have different emails, using the email in place of the name
insert into temp
select email,email,other1,other2 from test a where exists(select 1 from test
    where a.name=name and a.email<>email)
-- Insert the rest
insert into temp
select name,email,other1,other2 from test a where not exists(select 1 from test
    where a.name=name and a.email<>email)
-- Create indexes
create index test_name on temp(name)
create index test_mail on temp(email)
-- Build a table holding the most complete information per email
select email,max(other1) as other1,max(other2) as other2
into temp_other from temp a
where exists(select 1 from temp where email=a.email)
group by email
-- Create index
create index test_mail on temp_other(email)
-- Fill in the empty fields of rows that share an email
update a set other1=b.other1 , other2=b.other2
from temp a inner join temp_other b
on a.email=b.email
-- After filling in, some rows will be duplicates, so keep one name per email
select  max(name),email,other1,other2 from temp
group by email,other1,other2
drop table test
drop table temp
drop table temp_other

if exists (select * from dbo.sysobjects where id = object_id(N'[dbo].[users]') and OBJECTPROPERTY(id, N'IsUserTable') = 1)
drop table [dbo].[users]
GO
CREATE TABLE [dbo].[users] (
[userid] [int] IDENTITY (1, 1) NOT NULL ,
[username] [varchar] (200) COLLATE Chinese_PRC_CI_AS NULL ,
[useremail] [varchar] (200) COLLATE Chinese_PRC_CI_AS NULL ,
[tel] [varchar] (200) COLLATE Chinese_PRC_CI_AS NULL ,
[cellphone] [varchar] (50) COLLATE Chinese_PRC_CI_AS NULL ,
[address] [varchar] (1000) COLLATE Chinese_PRC_CI_AS NULL ,
[zip] [char] (6) COLLATE Chinese_PRC_CI_AS NULL ,
[realname] [varchar] (200) COLLATE Chinese_PRC_CI_AS NULL ,
[flag] [int] NOT NULL
) ON [PRIMARY]
GO
This is the structure of my source table; all the data comes from here.
