Oracle Diagnostic Case: SGA and Swap


Source: 中国IT实验室, October 7, 2007

Keywords: ORACLE, database, optimization


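  Case description:
  Users reported that some time after the server started, new database connections could no longer be established.
  After a restart, connections failed again within a few minutes.
  
  The system was unusable.
  
  1. Log in to the system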
  SunOS 5.8
  
  login: root
  Password:
  Last login: Tue Mar 23 13:56:59 from 172.16.31.41
  Sun Microsystems Inc. SunOS 5.8 Generic Patch October 2001
  You have new mail.
  
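  2. Switch to the oracle user with su
  Check the running Oracle processes.
  
  The background processes were running normally, and a number of user connections were present.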
  
  wapplatform:/>su - oracle
  Sun Microsystems Inc. SunOS 5.8 Generic Patch October 2001
  You have new mail.
  /export/home1/oracle>ls
  admin codesyndealt31 exp.sh local.cshrc local.profile oraclebak oui v6_database
  app exp.log jre local.login nsmail oradata swan
  /export/home1/oracle>cd admin
  /export/home1/oracle/admin>ps -ef|grep ora
  oracle 25269 25258 0 13:58:36 pts/3 0:00 grep ora
  oracle 25257 24906 0 13:58:31 pts/4 0:00 vi alert_HSWAPDB.log
  oracle 25267 1 1 13:58:34 ? 0:00 oracleHSWAPDB (LOCAL=NO)
  oracle 25184 1 0 13:56:57 ? 0:00 ora_p007_HSWAPDB
  oracle 25182 1 0 13:56:57 ? 0:00 ora_p006_HSWAPDB
  oracle 25193 1 0 13:57:03 ? 0:01 oracleHSWAPDB (LOCAL=NO)
  oracle 25209 1 0 13:57:09 ? 0:00 oracleHSWAPDB (LOCAL=NO)
  oracle 25176 1 0 13:56:57 ? 0:00 ora_p003_HSWAPDB
  oracle 25180 1 0 13:56:57 ? 0:00 ora_p005_HSWAPDB
  oracle 25172 1 0 13:56:56 ? 0:00 ora_p001_HSWAPDB
  oracle 25178 1 0 13:56:57 ? 0:00 ora_p004_HSWAPDB
  oracle 25170 1 0 13:56:56 ? 0:00 ora_p000_HSWAPDB
  oracle 24254 24240 0 12:08:25 pts/2 0:00 -ksh
  oracle 25174 1 0 13:56:56 ? 0:00 ora_p002_HSWAPDB
  oracle 25244 1 1 13:58:23 ? 0:00 oracleHSWAPDB (LOCAL=NO)
  oracle 25218 1 0 13:57:23 ? 0:00 oracleHSWAPDB (LOCAL=NO)
  oracle 25159 1 0 13:56:42 ? 0:02 ora_qmn0_HSWAPDB
  oracle 25230 1 0 13:57:40 ? 0:01 oracleHSWAPDB (LOCAL=NO)
  oracle 25161 1 0 13:56:42 ? 0:00 ora_s000_HSWAPDB
  oracle 25149 1 0 13:56:41 ? 0:01 ora_lgwr_HSWAPDB
  oracle 25157 1 0 13:56:42 ? 0:00 ora_cjq0_HSWAPDB
  oracle 24906 3698 0 13:47:47 pts/4 0:00 -ksh
  oracle 25153 1 0 13:56:42 ? 0:01 ora_smon_HSWAPDB
  oracle 25058 7464 0 13:55:14 pts/1 0:00 -ksh
  oracle 25163 1 0 13:56:42 ? 0:00 ora_d000_HSWAPDB
  oracle 25155 1 0 13:56:42 ? 0:00 ora_reco_HSWAPDB
  oracle 25151 1 0 13:56:41 ? 0:00 ora_ckpt_HSWAPDB
  oracle 25145 1 0 13:56:41 ? 0:00 ora_dbw0_HSWAPDB
  oracle 25199 1 15 13:57:04 ? 0:49 ora_j000_HSWAPDB
  oracle 4149 4146 0 12:05:11 pts/5 0:00 -ksh
  oracle 25232 1 0 13:57:41 ? 0:00 oracleHSWAPDB (LOCAL=NO)
  oracle 25119 1 0 13:56:29 ? 0:00 oraclehswapdb (LOCAL=NO)
  oracle 25075 1 0 13:55:34 ? 0:00 /export/home1/oracle/app/bin/tnslsnr LISTENER -inherit
  oracle 24374 4149 0 12:21:56 pts/5 0:00 sqlplus /nolog
  oracle 25143 1 0 13:56:41 ? 0:00 ora_pmon_HSWAPDB
  oracle 25258 25242 0 13:58:31 pts/3 0:00 -ksh
  /export/home1/oracle/admin>ps -ef|grep ora_
  oracle 25275 25258 0 13:58:42 pts/3 0:00 grep ora_
  oracle 25184 1 0 13:56:57 ? 0:00 ora_p007_HSWAPDB
  oracle 25182 1 0 13:56:57 ? 0:00 ora_p006_HSWAPDB
  oracle 25176 1 0 13:56:57 ? 0:00 ora_p003_HSWAPDB
  oracle 25180 1 0 13:56:57 ? 0:00 ora_p005_HSWAPDB
  oracle 25172 1 0 13:56:56 ? 0:00 ora_p001_HSWAPDB
  oracle 25178 1 0 13:56:57 ? 0:00 ora_p004_HSWAPDB
  oracle 25170 1 0 13:56:56 ? 0:00 ora_p000_HSWAPDB
  oracle 25174 1 0 13:56:56 ? 0:00 ora_p002_HSWAPDB
  oracle 25159 1 0 13:56:42 ? 0:02 ora_qmn0_HSWAPDB
  oracle 25161 1 0 13:56:42 ? 0:00 ora_s000_HSWAPDB
  oracle 25149 1 0 13:56:41 ? 0:01 ora_lgwr_HSWAPDB
  oracle 25157 1 0 13:56:42 ? 0:00 ora_cjq0_HSWAPDB
  oracle 25153 1 0 13:56:42 ? 0:01 ora_smon_HSWAPDB
  oracle 25163 1 0 13:56:42 ? 0:00 ora_d000_HSWAPDB
  oracle 25155 1 0 13:56:42 ? 0:00 ora_reco_HSWAPDB
  oracle 25151 1 0 13:56:41 ? 0:00 ora_ckpt_HSWAPDB
  oracle 25145 1 0 13:56:41 ? 0:00 ora_dbw0_HSWAPDB
  oracle 25199 1 13 13:57:04 ? 0:51 ora_j000_HSWAPDB
  oracle 25143 1 0 13:56:41 ? 0:00 ora_pmon_HSWAPDB
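
A listing like the one above is easier to read once the processes are grouped. The following is a minimal sketch, assuming the Solaris `ps -ef` column layout shown here (UID, PID, PPID, C, STIME, TTY, TIME, CMD); the `classify` and `summarize` helpers are hypothetical names introduced only for illustration, and the sample is a handful of lines copied from the output.

```python
import re
from collections import Counter

# A few lines copied from the `ps -ef` output above.
PS_OUTPUT = """\
oracle 25267 1 1 13:58:34 ? 0:00 oracleHSWAPDB (LOCAL=NO)
oracle 25184 1 0 13:56:57 ? 0:00 ora_p007_HSWAPDB
oracle 25149 1 0 13:56:41 ? 0:01 ora_lgwr_HSWAPDB
oracle 25199 1 15 13:57:04 ? 0:49 ora_j000_HSWAPDB
oracle 25119 1 0 13:56:29 ? 0:00 oraclehswapdb (LOCAL=NO)
oracle 25075 1 0 13:55:34 ? 0:00 /export/home1/oracle/app/bin/tnslsnr LISTENER -inherit
oracle 24374 4149 0 12:21:56 pts/5 0:00 sqlplus /nolog
"""


def classify(line):
    """Return a coarse category for one `ps -ef` line (hypothetical helper)."""
    fields = line.split(None, 7)  # the 8th field is the full command
    if len(fields) < 8:
        return "other"
    command = fields[7]
    if re.match(r"ora_[a-z0-9]+_", command):
        return "background"  # e.g. ora_pmon_HSWAPDB, ora_lgwr_HSWAPDB
    if re.match(r"oracle\S*\s+\(LOCAL=(NO|YES)\)", command):
        return "server"      # one server process per user connection
    if "tnslsnr" in command:
        return "listener"
    return "other"


def summarize(ps_text):
    """Count processes per category."""
    return Counter(classify(l) for l in ps_text.splitlines() if l.strip())


print(summarize(PS_OUTPUT))
```

The split assumes STIME is a single token such as `13:58:34`; processes started on an earlier day show a date with a space in it, which this sketch does not handle.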
  
  3. Check the alert log (alert.log)
  /export/home1/oracle/admin>ls
  hswapdb
  /export/home1/oracle/admin>cd *
  /export/home1/oracle/admin/hswapdb>ls
  bdump cdump create pfile udump
  /export/home1/oracle/admin/hswapdb>cd bdump
  /export/home1/oracle/admin/hswapdb/bdump>
  
  /export/home1/oracle/admin/hswapdb/bdump>ls -l *.log
  
  -rw-r--r-- 1 oracle dba 813396 Mar 23 13:57 alert_HSWAPDB.log
  /export/home1/oracle/admin/hswapdb/bdump>vi *.log
  "alert_HSWAPDB.log" 18888 lines, 813396 characters (115 null)
  Tue Jun 24 21:17:14 2003
  Starting ORACLE instance (normal)
  LICENSE_MAX_SESSION = 0
  LICENSE_SESSIONS_WARNING = 0
  SCN scheme 3
  Using log_archive_dest parameter default value
  LICENSE_MAX_USERS = 0
  SYS auditing is disabled
  Starting up ORACLE RDBMS Version: 9.2.0.3.0.
  System parameters with non-default values:
  processes = 400
  timed_statistics = TRUE
  shared_pool_size = 117440512
  large_pool_size = 83886080
  java_pool_size = 33554432
  control_files = /export/home1/oracle/oradata/hswapdb/control01.ctl,
  /export/home1/oracle/oradata/hswapdb/control02.ctl,
  /export/home1/oracle/oradata/hswapdb/control03.ctl
  db_block_size = 8192
  db_cache_size = 352321536
  compatible = 9.2.0.0.0
  db_file_multiblock_read_count= 16
  fast_start_mttr_target = 300
  undo_management = AUTO
  undo_tablespace = UNDOTBS1
  undo_retention = 10800
  remote_login_passwordfile= EXCLUSIVE
  db_domain = eygle.com
  instance_name = hswapdb
  dispatchers = (PROTOCOL=TCP) (SERVICE=hswapdbXDB)
  job_queue_processes = 10
  hash_join_enabled = TRUE
  background_dump_dest = /export/home1/oracle/admin/hswapdb/bdump
  user_dump_dest = /export/home1/oracle/admin/hswapdb/udump
  core_dump_dest = /export/home1/oracle/admin/hswapdb/cdump
  sort_area_size = 524288
  db_name = hswapdb
  open_cursors = 300
  star_transformation_enabled= FALSE
  query_rewrite_enabled = FALSE
  pga_aggregate_target = 154140672
  aq_tm_processes = 1
  
  .................
  
  Tue Mar 23 13:40:45 2004
  skgpspawn failed:category = 27142, depinfo = 12, op = fork, loc = skgpspawn3
  skgpspawn failed:category = 27142, depinfo = 12, op = fork, loc = skgpspawn3
  skgpspawn failed:category = 27142, depinfo = 12, op = fork, loc = skgpspawn3
  skgpspawn failed:category = 27142, depinfo = 12, op = fork, loc = skgpspawn3
  skgpspawn failed:category = 27142, depinfo = 12, op = fork, loc = skgpspawn3
  skgpspawn failed:category = 27142, depinfo = 12, op = fork, loc = skgpspawn3
  skgpspawn failed:category = 27142, depinfo = 11, op = fork, loc = skgpspawn5
  skgpspawn failed:category = 27142, depinfo = 12, op = fork, loc = skgpspawn3
  skgpspawn failed:category = 27142, depinfo = 12, op = fork, loc = skgpspawn3
  Tue Mar 23 13:42:02 2004
  skgpspawn failed:category = 27142, depinfo = 12, op = fork, loc = skgpspawn3
  skgpspawn failed:category = 27142, depinfo = 12, op = fork, loc = skgpspawn3
  skgpspawn failed:category = 27142, depinfo = 12, op = fork, loc = skgpspawn3
  skgpspawn failed:category = 27142, depinfo = 12, op = fork, loc = skgpspawn3
  Tue Mar 23 13:55:38 2004
  Starting ORACLE instance (normal)
  Shutting down instance: further logons disabled
  Tue Mar 23 13:56:20 2004
  Shutting down instance (abort)
  License high water mark = 26
  Instance terminated by USER, pid = 25112
  Tue Mar 23 1
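
The repeated `skgpspawn failed` lines can be tallied to see which failure dominates. This is a sketch under one assumption that the log itself does not state: that `depinfo` carries the operating system errno returned by the failed `fork`. Under that assumption, 12 is ENOMEM and 11 is EAGAIN on Solaris; the two names are listed by hand so the example does not depend on the platform it runs on. `tally_spawn_failures` is a hypothetical helper.

```python
import re
from collections import Counter

# errno names as defined on Solaris; an assumption about what depinfo means.
ERRNO_NAMES = {11: "EAGAIN", 12: "ENOMEM"}

PATTERN = re.compile(
    r"skgpspawn failed:category = (\d+), depinfo = (\d+), op = (\w+)")


def tally_spawn_failures(log_text):
    """Count skgpspawn failures per depinfo value (hypothetical helper)."""
    counts = Counter()
    for match in PATTERN.finditer(log_text):
        depinfo = int(match.group(2))
        counts[ERRNO_NAMES.get(depinfo, str(depinfo))] += 1
    return counts


# Three lines copied from the alert log above.
SAMPLE = """\
skgpspawn failed:category = 27142, depinfo = 12, op = fork, loc = skgpspawn3
skgpspawn failed:category = 27142, depinfo = 11, op = fork, loc = skgpspawn5
skgpspawn failed:category = 27142, depinfo = 12, op = fork, loc = skgpspawn3
"""

print(tally_spawn_failures(SAMPLE))
```

If the assumption holds, a log dominated by ENOMEM would point at memory or swap exhaustion rather than a process-count limit.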



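  Case description:
  This is a large production system.
  When the problem occurred, a large number of user processes had piled up on the system.
  User requests were not answered in time while new processes kept trying to connect,
  so the available connections were quickly exhausted.
  
  Database version: 9.2.0.3
  Operating system: Solaris 8
  
  1. Check the alert file
  The log recorded the following messages, which indicate a problem with disk asynchronous I/O: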
  
  WARNING: aiowait timed out 2 times
  Tue Aug 26 15:33:32 2003
  WARNING: aiowait timed out 2 times
  Tue Aug 26 15:33:34 2003
  WARNING: aiowait timed out 2 times
  Tue Aug 26 15:33:36 2003
  WARNING: aiowait timed out 2 times
  Tue Aug 26 15:33:38 2003
  WARNING: aiowait timed out 2 times
  Tue Aug 26 15:33:43 2003
  WARNING: aiowait timed out 1 times
  Tue Aug 26 15:33:46 2003
  WARNING: aiowait timed out 1 times
  Tue Aug 26 15:33:49 2003
  WARNING: aiowait timed out 1 times
  Tue Aug 26 15:33:51 2003
  WARNING: aiowait timed out 1 times
  Tue Aug 26 15:33:52 2003
  WARNING: aiowait timed out 1 times
  Tue Aug 26 15:33:53 2003
  WARNING: aiowait timed out 1 times
  .............
  
  Asynchronous I/O is known to have problems on some SUN versions, and it is enabled by default:
  Code:
  
  SQL> show parameter disk_a
  
  NAME                 TYPE    VALUE
  ------------------------------------ ----------- ------------------------------
  disk_asynch_io            boolean   TRUE
  
  To address this, we disabled asynchronous I/O writes for the database.
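
To judge how frequent the aiowait warnings are, the alert log can be summarized with a few lines of code. This is a minimal sketch that only matches the message text shown above; `summarize_aiowait` is a hypothetical helper, and the sample is copied from the log excerpt.

```python
import re

AIOWAIT = re.compile(r"WARNING: aiowait timed out (\d+) times")


def summarize_aiowait(log_text):
    """Return (number of warning lines, sum of reported timeouts)."""
    counts = [int(n) for n in AIOWAIT.findall(log_text)]
    return len(counts), sum(counts)


# Lines copied from the alert log excerpt above.
SAMPLE = """\
Tue Aug 26 15:33:38 2003
WARNING: aiowait timed out 2 times
Tue Aug 26 15:33:43 2003
WARNING: aiowait timed out 1 times
Tue Aug 26 15:33:46 2003
WARNING: aiowait timed out 1 times
"""

print(summarize_aiowait(SAMPLE))
```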
  
  2. Shared memory problem
  The alert file also recorded the following error:
  
  Tue Aug 26 21:37:40 2003
  WARNING: EINVAL creating segment of size 0x0000000190400000
  fix shm parameters in /etc/system or equivalent
  
  This message means the kernel shared memory parameters are too small or do not match the SGA.
  
  We checked the system configuration file:
  
  $ cat /etc/system
  .......................
  set shmsys:shminfo_shmmax=4096000000
  set shmsys:shminfo_shmmin=1
  set shmsys:shminfo_shmmni=200
  set shmsys:shminfo_shmseg=200
  set semsys:seminfo_semmap=1024
  set semsys:seminfo_semmni=2048
  set semsys:seminfo_semmns=2048
  set semsys:seminfo_semmnu=2048
  set semsys:seminfo_semume=200
  set semsys:seminfo_semmsl=2048
  
  The maximum shared memory segment size (shmmax) was set to only about 4G.
  
  3. Check the SGA settings
  SQL*Plus: Release 9.2.0.3.0 - Production on Tue Aug 26 21:46:35 2003
  
  Copyright (c) 1982, 2002, Oracle Corporation. All rights reserved.
  
  Connected to:
  Oracle9i Enterprise Edition Release 9.2.0.3.0 - 64bit Production
  With the Partitioning, OLAP and Oracle Data Mining options
  JServer Release 9.2.0.3.0 - Production
  
  SQL> show sga
  
  Total System Global Area 6695660272 bytes
  Fixed Size 740080 bytes
  Variable Size 2399141888 bytes
  Database Buffers 4294967296 bytes
  Redo Buffers 811008 bytes
  
  The SGA was set to nearly 7G, which explains the error message seen in step 2.
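
The mismatch can be checked with simple arithmetic: the segment size in the EINVAL warning is given in hexadecimal, while `shmmax` in /etc/system is decimal. The sketch below converts both and compares them. `parse_etc_system` is a hypothetical helper that handles only plain `set module:name=value` lines like the ones shown; it is not a full parser for the file format.

```python
def parse_etc_system(text):
    """Collect `set module:name=value` lines into a dict (hypothetical helper)."""
    params = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("set ") or "=" not in line:
            continue
        name, value = line[4:].split("=", 1)
        params[name.strip()] = int(value.strip(), 0)
    return params


# Two lines copied from the /etc/system listing above.
ETC_SYSTEM = """\
set shmsys:shminfo_shmmax=4096000000
set shmsys:shminfo_shmmin=1
"""

# Segment size from "WARNING: EINVAL creating segment of size 0x0000000190400000"
requested = int("0x0000000190400000", 16)
shmmax = parse_etc_system(ETC_SYSTEM)["shmsys:shminfo_shmmax"]

print(requested, shmmax, requested > shmmax)
```

The requested segment is 6715080704 bytes, well above the 4096000000-byte limit, so the instance cannot obtain the SGA as a single shared memory segment.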
  
  4. Swap problem
  We used the top tool to check how the system was running.
   
  Code:
  
  # /usr/local/bin/top
  
  last pid: 16899; load averages: 0.82, 0.81, 0.83                       21:49:05
  
  1230 processes:1228 sleeping, 1 running, 1 on cpu
  
  CPU states: 50.1% idle, 7.4% user, 8.6% kernel, 33.9% iowait, 0.0% swap
  
  Memory: 8192M real, 118M free, 12G swap in use, 11G swap free
  
   PID USERNAME THR PRI NICE SIZE  RES STATE  TIME  CPU COMMAND
  
   15751 oracle  11 44  0 6456M 6408M sleep  0:02 0.49% oracle
  
   15725 oracle  11 58  0 6458M 6410M sleep  0:02 0.46% oracle
  
    251 root   12 48  0 7096K 1944K sleep 126:00 0.45% picld
  
   16540 oracle  11 58  0 6458M 6411M sleep  0:01 0.45% oracle
  
   16766 root    1 43  0 3744K 2248K cpu/1  0:01 0.41% top
  
   16408 oracle  11 58  0 6457M 6410M sleep  0:01 0.34% oracle
  
   15989 oracle  11 58  0 6458M 6409M sleep  0:01 0.34% oracle
  
   15919 oracle  11 58  0 6457M 6409M sleep  0:02 0.30% oracle
  
   16404 oracle  11 58  0 6457M 6409M sleep  0:00 0.28% oracle
  
   16327 oracle  11 55  0 6457M 6410M sleep  0:00 0.27% oracle
  
   14870 oracle  11 58  0 6457M 6412M sleep  0:05 0.24% oracle
  
   16851 oracle  11 35  0 6457M 6411M sleep  0:00 0.22% oracle
  
   16467 oracle  11 58  0 6457M 6409M sleep  0:00 0.21% oracle
  
   16163 oracle  11 58  0 6457M 6408M sleep  0:03 0.21% oracle
  
   15159 oracle  11 58  0 6457M 6408M sleep  0:05 0.21% oracle
  
  Memory: 8192M real, 118M free, 12G swap in use, 11G swap free
  
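  The system has only 8G of RAM, of which just 118M was free, while 12G of swap was in use.
  
  Our preliminary diagnosis:
  
  The SGA was too large (nearly 7G), which caused heavy swapping at run time.
  
  The heavy swapping in turn caused disk problems, which is most likely why we saw
  WARNING: aiowait timed out 1 times in step 1.
  
  The swapping made database performance drop sharply, so user requests could not be answered quickly; they blocked and piled up until the database stopped responding.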
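
The memory figures behind this diagnosis come from the `Memory:` line that top prints. As a rough sketch, that line can be parsed into megabytes so the numbers can be compared directly; `to_mb` and `parse_memory_line` are hypothetical helpers written for the line format shown in this article, not for every version of top.

```python
import re

UNIT_MB = {"K": 1 / 1024, "M": 1, "G": 1024}


def to_mb(token):
    """Convert a top-style size such as '118M' or '12G' to megabytes."""
    match = re.fullmatch(r"(\d+)([KMG])", token)
    if match is None:
        raise ValueError(f"unrecognized size: {token!r}")
    return int(match.group(1)) * UNIT_MB[match.group(2)]


def parse_memory_line(line):
    """Parse top's 'Memory:' line into a dict of megabyte values."""
    pattern = (r"Memory: (\S+) real, (\S+) free, "
               r"(\S+) swap in use, (\S+) swap free")
    match = re.search(pattern, line)
    if match is None:
        raise ValueError("not a top memory line")
    keys = ("real", "free", "swap_in_use", "swap_free")
    return {k: to_mb(v) for k, v in zip(keys, match.groups())}


before = parse_memory_line(
    "Memory: 8192M real, 118M free, 12G swap in use, 11G swap free")
print(before)
# More swap in use than physical RAM is a strong hint of memory pressure.
print(before["swap_in_use"] > before["real"])
```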
  
  5. Solution
  The problem was caused mainly by an inappropriate SGA setting, so we immediately reduced the SGA:
  
  SQL> show sga
  
  Total System Global Area 3591870848 bytes
  Fixed Size 735616 bytes
  Variable Size 1442840576 bytes
  Database Buffers 2147483648 bytes
  Redo Buffers 811008 bytes
  
  After this change the database swapped less and ran stably, and user requests were answered quickly.
  
  The problem was resolved.
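
The effect of the change is easier to see as a share of physical memory. The sketch below sums the four components that `show sga` reports before and after the adjustment and divides by the 8192M of RAM reported by top; `sga_total` is a hypothetical helper, and the share is only a rough indicator since PGA and operating system memory are not included.

```python
RAM_BYTES = 8192 * 1024 * 1024  # "8192M real" from top


def sga_total(fixed, variable, buffers, redo):
    """Sum the four components printed by `show sga`."""
    return fixed + variable + buffers + redo


# Values copied from the two `show sga` listings in this article.
before = sga_total(740080, 2399141888, 4294967296, 811008)
after = sga_total(735616, 1442840576, 2147483648, 811008)

for label, total in (("before", before), ("after", after)):
    print(f"{label}: {total} bytes, {total / RAM_BYTES:.0%} of RAM")
```

Both sums match the Total System Global Area lines in the listings (6695660272 and 3591870848 bytes), so the SGA share of RAM dropped from roughly 78% to roughly 42%.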
  
  6. System status
  System status after the adjustment:
   
  Code:
  
  $ top
  
  last pid: 12745; load averages: 0.46, 0.79, 0.65      22:22:49
  
  228 processes: 227 sleeping, 1 on cpu
  
  CPU states: 92.3% idle, 5.0% user, 1.6% kernel, 1.1% iowait, 0.0% swap
  
  Memory: 8192M real, 3817M free, 4015M swap in use, 15G swap free
  
    PID USERNAME THR PRI NICE SIZE  RES STATE  TIME  CPU COMMAND
  
   12610 oracle   1 51  0 3511M  22M sleep  0:04 1.96% oracle
  
   12595 oracle   1 48  0 3511M  22M sleep  0:03 0.92% oracle
  
   12630 oracle   1 38  0 3511M  21M sleep  0:01 0.84% oracle
  
   12614 oracle   1 46  0 3511M  22M sleep  0:01 0.64% oracle
  
   12620 oracle   1 58  0 3511M  22M sleep  0:01 0.53% oracle
  
   12709 oracle   1 48  0 3511M  21M sleep  0:00 0.45% oracle
  
    265 root   11 38  0 7032K 1920K sleep  3:16 0.42% picld
  
   12729 oracle   1  0  0 3511M  20M sleep  0:00 0.26% oracle
  
   12741 oracle   1 58  0 2768K 1760K cpu/3  0:00 0.19% top
  
   12745 oracle   1 44  0 3506M  16M sleep  0:00 0.17% oracle
  
   12711 oracle   1 48  0 3506M  16M sleep  0:00 0.11% oracle
  
   12738 oracle   1 43  0 3506M  16M sleep  0:00 0.06% oracle
  
   7606 oracle   1 45  0  17M 6928K sleep  0:07 0.05% tnslsnr
  
   12721 oracle   1 34  0 3506M  16M sleep  0:00 0.05% oracle
  
   12723 oracle   1 53  0 3506M  16M sleep  0:00 0.05% oracle
  
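  Since the adjustment, the system has run stably to this day.
  
  A brief summary:
  This case is very similar to another one I described earlier: both are database problems caused by an inappropriate SGA setting.
  
  The problem itself is not complicated.
  Problems of this kind should be avoided during database planning and build-out.
  
  In fact, for me this problem was more of a psychological test: with all the bosses standing behind you, can you stay calm and find and fix the problem quickly?
  
  On SUN, aiowait timed out has many different scenarios and causes.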
