I understand what you are all saying.
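I have more than 20,000 pages like this,
and I need to store a particular piece of information from each of them in a MySQL database. Put plainly, I am scraping content from other sites. Those sites use gb2312, so I also have to read the scraped content as gb2312,
otherwise it comes out garbled (perhaps I am being a little too absolute here). My own site uses utf-8, and the MySQL database uses utf-8 as well.
So I now use PHP to read the page files I scraped,
convert the information in them to utf-8, and then write it to the database.
(Without conversion, the insert into fails; with conversion, the write succeeds but the text is garbled.)
The conversion I use is the code posted above. I am also pasting "gb2312.txt" here for everyone to look at:
# gb2312.txt --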
#
# GB2312 to Unicode table (modified)
# from:
# http://tcl.apache.org/sources/tcl/tools/encoding/gb2312.txt
# ftp://ftp.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/GB/GB2312.TXT
#
# Copyright (c) 1998-1999 by Scriptics Corporation.
#
# See the file "license.terms" for information on usage and redistribution
# of this file, and for a DISCLAIMER OF ALL WARRANTIES.
#
# RCS: @(#) $Id: gb2312.txt,v 1.2 1999/04/16 00:47:55 stanton Exp $
#
# NOTE: this table has been modified to include the 7-bit ASCII
# characters that are allowed in GB2312 files.
#
#
# Name: GB2312-80 to Unicode table (complete, hex format)
# Unicode version: 1.1
# Table version: 0.0d2
# Table format: Format A
# Date: 6 December 1993
# Author: Glenn Adams <[email protected]>
# John H. Jenkins <[email protected]>
#
# Copyright (c) 1991-1994 Unicode, Inc. All Rights reserved.
#
# This file is provided as-is by Unicode, Inc. (The Unicode Consortium).
# No claims are made as to fitness for any particular purpose. No
# warranties of any kind are expressed or implied. The recipient
# agrees to determine applicability of information provided. If this
# file has been provided on magnetic media by Unicode, Inc., the sole
# remedy for any claim will be exchange of defective media within 90
# days of receipt.
#
# Recipient is granted the right to make copies in any form for
# internal distribution and to freely use the information supplied
# in the creation of products supporting Unicode. Unicode, Inc.
# specifically excludes the right to re-distribute this file directly
# to third parties or other organizations whether for profit or not.
#
# General notes:
#
# This table contains the data Metis and Taligent currently have on how
# GB2312-80 characters map into Unicode.
#
# Format: Three tab-separated columns
# Column #1 is the GB2312 code (in hex as 0xXXXX)
# Column #2 is the Unicode (in hex as 0xXXXX)
# Column #3 the Unicode name (follows a comment sign, '#')
# The official names for Unicode characters U+4E00
# to U+9FA5, inclusive, is "CJK UNIFIED IDEOGRAPH-XXXX",
# where XXXX is the code point. Including all these
# names in this file increases its size substantially
# and needlessly. The token "<CJK>" is used for the
# name of these characters. If necessary, it can be
# expanded algorithmically by a parser or editor.
#
# The entries are in GB2312 order
#
# The following algorithms can be used to change the hex form
# of GB2312 to other standard forms:
#
# To change hex to EUC form, add 0x8080
# To change hex to kuten form, first subtract 0x2020. Then
# the high and low bytes correspond to the ku and ten of
# the kuten form. For example, 0x2121 -> 0x0101 -> 0101;
# 0x777E -> 0x575E -> 8794
#
# Any comments or problems, contact <[email protected]>
#
#
It is too long to paste in full. Here is its URL:
http://www.g569.com/special/gb2312.txt
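The header above describes the table as three tab-separated columns and gives two small conversions (add 0x8080 for the EUC form, subtract 0x2020 for the kuten form). As a rough sketch of what that means in PHP, assuming rows look like the sample line used here; the function names are hypothetical and are not the code posted earlier in the thread:

```php
<?php
// Sketch only: function names are hypothetical, not from the thread.

/**
 * Parse one line of the table into [gb2312Code, unicodeCodePoint].
 * Returns null for comment lines, blank lines, and malformed rows.
 */
function parse_mapping_line(string $line): ?array
{
    $line = trim($line);
    if ($line === '' || $line[0] === '#') {
        return null;
    }
    // Columns 1 and 2 are hex values written as 0xXXXX; column 3 is a comment.
    $cols = preg_split('/\s+/', $line);
    if (count($cols) < 2) {
        return null;
    }
    // intval() with base 16 accepts the "0x" prefix.
    return [intval($cols[0], 16), intval($cols[1], 16)];
}

/** "To change hex to EUC form, add 0x8080". */
function gb_to_euc(int $code): int
{
    return $code + 0x8080;
}

/** Subtract 0x2020; the high byte is the ku and the low byte is the ten. */
function gb_to_kuten(int $code): string
{
    $v = $code - 0x2020;
    return sprintf('%02d%02d', ($v >> 8) & 0xFF, $v & 0xFF);
}

// Sample row in the format the header describes.
$row = parse_mapping_line("0x2121\t0x3000\t# IDEOGRAPHIC SPACE");
printf(
    "GB 0x%04X -> U+%04X, EUC 0x%04X, kuten %s\n",
    $row[0], $row[1], gb_to_euc($row[0]), gb_to_kuten($row[0])
);
// prints: GB 0x2121 -> U+3000, EUC 0xA1A1, kuten 0101
```

The EUC form is what actually appears as bytes in a gb2312 page, so a hand-rolled table lookup has to add 0x8080 (or subtract it from the page bytes) before comparing against column 1.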
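I left something out: after closing the output buffer, use iconv to convert the encoding of the captured buffer contents.
If you really do want to transcode, do this before inserting into the database: $text = iconv('gbk', 'utf-8', $text);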
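A minimal sketch of that iconv step, wrapped so a failed conversion can be detected before the insert. The function name gbk_to_utf8 and the sample bytes are assumptions for illustration; only the iconv('gbk', 'utf-8', ...) call comes from the reply above.

```php
<?php
/**
 * Convert text read from a gb2312/GBK page to UTF-8.
 * Returns null if iconv reports a failure, so the caller can log and skip
 * the page instead of inserting damaged text. (Hypothetical helper name.)
 */
function gbk_to_utf8(string $text): ?string
{
    // GBK is a superset of GB2312, so it also covers pages declared as gb2312.
    $converted = @iconv('GBK', 'UTF-8', $text);
    return $converted === false ? null : $converted;
}

// The two characters "中文" as GB2312/GBK bytes.
$gbk = "\xD6\xD0\xCE\xC4";
$utf8 = gbk_to_utf8($gbk);

echo bin2hex($gbk), " -> ", bin2hex($utf8), "\n";
// prints: d6d0cec4 -> e4b8ade69687
```

Checking the hex bytes like this is one way to confirm the conversion itself is correct independently of how a browser or database client displays the result.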
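I used iconv() for the conversion as you suggested, and the data has now been written to the database.
The remaining work is generating the static pages.
Come and collect your points!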