Common BI-Side Problems with the Distributed Engine

1. General Steps for Locating a Problem

1. Analyze the Spider log (fanruan.log under %FineBI%/logs) and search for the relevant keywords; try to find the earliest exception reported (a keyword-scan sketch is given after this list).
2. Use the keyword to find the matching troubleshooting entry and the likely causes.
3. Check whether each cause applies to your environment; if it does, follow the corresponding solution.
4. If steps 1-3 do not locate the problem, contact FanRuan technical support.
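
Step 1 can be partly automated. The following is a minimal Python sketch, not an official tool: the log path and the keyword list are illustrative assumptions and should be adapted to your installation.

    # Scan the Spider log for known error keywords and print matching lines
    # in file order, so the earliest occurrence is listed first.
    from pathlib import Path

    LOG = Path("/path/to/FineBI/logs/fanruan.log")   # i.e. %FineBI%/logs/fanruan.log
    KEYWORDS = ["UnknownHostException", "Command exited with code 1",
                "could not bind on port", "DeadlineExceededException",
                "OutOfMemoryError", "Failed to open stream read"]

    with LOG.open(encoding="utf-8", errors="replace") as f:
        for lineno, line in enumerate(f, 1):
            if any(k in line for k in KEYWORDS):
                print(f"{lineno}: {line.rstrip()}")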

2. java.net.UnknownHostException

Sample log error:

org.apache.spark.SparkException: Exception thrown in awaitResult: 
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:101)
at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:109)
at org.apache.spark.deploy.client.StandaloneAppClient$ClientEndpoint$$anonfun$tryRegisterAllMasters$1$$anon$1.run(StandaloneAppClient.scala:106)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Failed to connect to slave1:7077
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:245)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:187)
at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:198)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:194)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:190)
... 4 more
Caused by: java.net.UnknownHostException: slave1
at java.net.InetAddress.getAllByName0(InetAddress.java:1280)
at java.net.InetAddress.getAllByName(InetAddress.java:1192)
at java.net.InetAddress.getAllByName(InetAddress.java:1126)
at java.net.InetAddress.getByName(InetAddress.java:1076)
at io.netty.util.internal.SocketUtils$8.run(SocketUtils.java:146)
at io.netty.util.internal.SocketUtils$8.run(SocketUtils.java:143)
at java.security.AccessController.doPrivileged(Native Method)
at io.netty.util.internal.SocketUtils.addressByName(SocketUtils.java:143)
at io.netty.resolver.DefaultNameResolver.doResolve(DefaultNameResolver.java:43)
at io.netty.resolver.SimpleNameResolver.resolve(SimpleNameResolver.java:63)
at io.netty.resolver.SimpleNameResolver.resolve(SimpleNameResolver.java:55)
at io.netty.resolver.InetSocketAddressResolver.doResolve(InetSocketAddressResolver.java:57)
at io.netty.resolver.InetSocketAddressResolver.doResolve(InetSocketAddressResolver.java:32)
at io.netty.resolver.AbstractAddressResolver.resolve(AbstractAddressResolver.java:108)
at io.netty.bootstrap.Bootstrap.doResolveAndConnect0(Bootstrap.java:208)
at io.netty.bootstrap.Bootstrap.access$000(Bootstrap.java:49)
at io.netty.bootstrap.Bootstrap$1.operationComplete(Bootstrap.java:188)
at io.netty.bootstrap.Bootstrap$1.operationComplete(Bootstrap.java:174)
at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507)
at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481)
at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420)
at io.netty.util.concurrent.DefaultPromise.trySuccess(DefaultPromise.java:104)
at io.netty.channel.DefaultChannelPromise.trySuccess(DefaultChannelPromise.java:82)
at io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetSuccess(AbstractChannel.java:978)
at io.netty.channel.AbstractChannel$AbstractUnsafe.register0(AbstractChannel.java:512)
at io.netty.channel.AbstractChannel$AbstractUnsafe.access$200(AbstractChannel.java:423)
at io.netty.channel.AbstractChannel$AbstractUnsafe$1.run(AbstractChannel.java:482)
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)

Cause analysis:

The BI server has no hostname mapping configured for the distributed cluster. The error is easy to spot and appears as soon as BI starts.

Solution:

Following the Spider Distributed Engine and BI Integration document, configure the hostnames of the cluster nodes on the BI server (a quick resolution check is sketched below).
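
The following minimal Python sketch checks, on the BI server, whether the cluster hostnames resolve. The hostnames listed are placeholders taken from the logs in this article; replace them with the nodes of your own Spider cluster.

    # Verify that each cluster hostname resolves on the BI server.
    import socket

    for host in ["master", "slave1", "slave2"]:
        try:
            print(host, "->", socket.gethostbyname(host))
        except socket.gaierror as err:
            print(host, "-> cannot be resolved:", err)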

3. Command exited with code 1

Symptom:

Tasks never finish, and the BI log keeps printing lines such as Executor app-20181022142028-0001/10 removed: Command exited with code 1.

Sample log error:

18/10/22 14:20:45 INFO BlockManagerMasterEndpoint: Trying to remove executor 7 from BlockManagerMaster.
18/10/22 14:20:45 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 7
18/10/22 14:20:45 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20181022142028-0001/12 on worker-20181022112605-slave2-37570 (slave2:37570) with 1 core(s)
18/10/22 14:20:45 INFO StandaloneSchedulerBackend: Granted executor ID app-20181022142028-0001/12 on hostPort slave2:37570 with 1 core(s), 1024.0 MB RAM
18/10/22 14:20:45 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20181022142028-0001/12 is now RUNNING
18/10/22 14:20:48 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20181022142028-0001/10 is now EXITED (Command exited with code 1)
18/10/22 14:20:48 INFO StandaloneSchedulerBackend: Executor app-20181022142028-0001/10 removed: Command exited with code 1
18/10/22 14:20:48 INFO BlockManagerMaster: Removal of executor 10 requested
18/10/22 14:20:48 INFO BlockManagerMasterEndpoint: Trying to remove executor 10 from BlockManagerMaster.
18/10/22 14:20:48 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 10
18/10/22 14:20:48 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20181022142028-0001/13 on worker-20181022112638-master-43938 (master:43938) with 1 core(s)
18/10/22 14:20:48 INFO StandaloneSchedulerBackend: Granted executor ID app-20181022142028-0001/13 on hostPort master:43938 with 1 core(s), 1024.0 MB RAM
18/10/22 14:20:48 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20181022142028-0001/13 is now RUNNING
18/10/22 14:20:48 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20181022142028-0001/11 is now EXITED (Command exited with code 1)
18/10/22 14:20:48 INFO StandaloneSchedulerBackend: Executor app-20181022142028-0001/11 removed: Command exited with code 1
18/10/22 14:20:48 INFO BlockManagerMasterEndpoint: Trying to remove executor 11 from BlockManagerMaster.
18/10/22 14:20:48 INFO BlockManagerMaster: Removal of executor 11 requested
18/10/22 14:20:48 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 11
18/10/22 14:20:48 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20181022142028-0001/14 on worker-20181022112613-slave1-43014 (slave1:43014) with 1 core(s)
18/10/22 14:20:48 INFO StandaloneSchedulerBackend: Granted executor ID app-20181022142028-0001/14 on hostPort slave1:43014 with 1 core(s), 1024.0 MB RAM
18/10/22 14:20:48 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20181022142028-0001/14 is now RUNNING
18/10/22 14:20:50 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20181022142028-0001/12 is now EXITED (Command exited with code 1)
18/10/22 14:20:50 INFO StandaloneSchedulerBackend: Executor app-20181022142028-0001/12 removed: Command exited with code 1
18/10/22 14:20:50 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20181022142028-0001/15 on worker-20181022112605-slave2-37570 (slave2:37570) with 1 core(s)
18/10/22 14:20:50 INFO BlockManagerMasterEndpoint: Trying to remove executor 12 from BlockManagerMaster.
18/10/22 14:20:50 INFO BlockManagerMaster: Removal of executor 12 requested
18/10/22 14:20:50 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 12
18/10/22 14:20:50 INFO StandaloneSchedulerBackend: Granted executor ID app-20181022142028-0001/15 on hostPort slave2:37570 with 1 core(s), 1024.0 MB RAM
18/10/22 14:20:50 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20181022142028-0001/15 is now RUNNING
18/10/22 14:20:50 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
18/10/22 14:20:52 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20181022142028-0001/14 is now EXITED (Command exited with code 1)
18/10/22 14:20:52 INFO StandaloneSchedulerBackend: Executor app-20181022142028-0001/14 removed: Command exited with code 1
18/10/22 14:20:52 INFO BlockManagerMasterEndpoint: Trying to remove executor 14 from BlockManagerMaster.
18/10/22 14:20:52 INFO BlockManagerMaster: Removal of executor 14 requested
...

Cause analysis:

a. The machines in the distributed cluster cannot reach the BI machine's hostname;

b. The machines in the distributed cluster cannot reach the BI machine's ports 17777, 17778 and up.

Solution:

1. Check whether the Spider cluster can reach the BI server's hostname; if not, configure the spark_driver_host parameter on the BI side.

2. Check whether the cluster machines are blocked from the BI machine's ports, for example with telnet or the connectivity sketch below.
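
The following minimal Python sketch plays the role of telnet: run it on a Spider cluster node to check that the BI server's hostname resolves and that the driver ports are reachable. The hostname is a placeholder, the ports follow the 17777/17778+ range mentioned above, and the driver ports are only open while BI is running.

    # Check TCP reachability from a cluster node to the BI driver ports.
    import socket

    BI_HOST = "bi-server"            # replace with the BI machine's hostname
    for port in (17777, 17778):
        try:
            with socket.create_connection((BI_HOST, port), timeout=3):
                print(f"{BI_HOST}:{port} reachable")
        except OSError as err:
            print(f"{BI_HOST}:{port} NOT reachable: {err}")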

4. WARN Utils: Service 'sparkDriver' could not bind on port 17777

Symptom:

Dashboards return no data, and the BI log keeps printing WARN Utils: Service 'sparkDriver' could not bind on port 17777. Attempting port 17778.

Sample log error:

18/09/21 14:04:02 WARN Utils: Service 'sparkDriver' could not bind on port 17777. Attempting port 17778.
18/09/21 14:04:02 WARN Utils: Service 'sparkDriver' could not bind on port 17778. Attempting port 17779.
18/09/21 14:04:02 WARN Utils: Service 'sparkDriver' could not bind on port 17779. Attempting port 17780.
18/09/21 14:04:02 WARN Utils: Service 'sparkDriver' could not bind on port 17780. Attempting port 17781.
18/09/21 14:04:02 WARN Utils: Service 'sparkDriver' could not bind on port 17781. Attempting port 17782.
18/09/21 14:04:02 WARN Utils: Service 'sparkDriver' could not bind on port 17782. Attempting port 17783.
18/09/21 14:04:02 WARN Utils: Service 'sparkDriver' could not bind on port 17783. Attempting port 17784.
18/09/21 14:04:02 WARN Utils: Service 'sparkDriver' could not bind on port 17784. Attempting port 17785.
18/09/21 14:04:02 WARN Utils: Service 'sparkDriver' could not bind on port 17785. Attempting port 17786.
18/09/21 14:04:02 WARN Utils: Service 'sparkDriver' could not bind on port 17786. Attempting port 17787.
18/09/21 14:04:02 WARN Utils: Service 'sparkDriver' could not bind on port 17787. Attempting port 17788.
18/09/21 14:04:02 WARN Utils: Service 'sparkDriver' could not bind on port 17788. Attempting port 17789.
18/09/21 14:04:02 WARN Utils: Service 'sparkDriver' could not bind on port 17789. Attempting port 17790.
18/09/21 14:04:02 WARN Utils: Service 'sparkDriver' could not bind on port 17790. Attempting port 17791.
18/09/21 14:04:02 WARN Utils: Service 'sparkDriver' could not bind on port 17791. Attempting port 17792.
18/09/21 14:04:02 WARN Utils: Service 'sparkDriver' could not bind on port 17792. Attempting port 17793.

Possible causes:

The spark_driver_host parameter may be configured with the wrong IP;

In the hosts file, localhost maps to multiple IPs (standalone deployment); in a cluster deployment, the local hostname may map to multiple IPs, or the hostname-to-IP mapping may be wrong.

Solution:

Confirm which of the possible causes applies, then set the corresponding parameter as described in the Spider Distributed Engine Parameter Configuration document (a quick local check is sketched below).
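
The following minimal Python sketch, run on the BI server, compares an intended spark_driver_host value with the addresses the local hostname actually resolves to; duplicate or wrong entries in the hosts file usually show up here. The DRIVER_HOST value is an assumption to be replaced with your own setting.

    # Compare the planned spark_driver_host with the local hostname's resolved addresses.
    import socket

    DRIVER_HOST = "192.168.1.10"     # the IP you intend to use for spark_driver_host
    name, aliases, addrs = socket.gethostbyname_ex(socket.gethostname())
    print("local hostname:", name, "resolves to:", addrs)
    if DRIVER_HOST not in addrs:
        print("warning: spark_driver_host does not match any address of this machine")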

5. Initial job has not accepted any resources

Sample log error:

18/10/22 17:17:42 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
18/10/22 17:17:57 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
18/10/22 17:18:12 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

Cause analysis:

a. BI requests more resources than the Spark cluster can provide;
b. The firewall on the BI machine is not open; confirm the firewall status.

Solution:

a. Adjust the Spark resource settings on the BI side (spark_executor_memory, spark_executor_cores, spark_cores_max) as described in the Spider Distributed Engine Parameter Configuration document.


If the resource settings look correct, check whether leftover processes are still holding resources (see the sketch below).


b. Turn off the firewall on the BI machine, or allow the BI application through the firewall.
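
For the leftover-process check in item a, the following minimal Python sketch (Linux, a convenience rather than an official check) lists processes whose command line contains CoarseGrainedExecutorBackend, the class Spark standalone executors run as; any survivors from an old application may still be holding cores and memory.

    # List possible leftover Spark executor processes on this machine.
    import subprocess

    ps_output = subprocess.run(["ps", "-ef"], capture_output=True, text=True).stdout
    leftovers = [l for l in ps_output.splitlines() if "CoarseGrainedExecutorBackend" in l]
    print("\n".join(leftovers) if leftovers else "no leftover executor processes found")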

6. alluxio.exception.status.DeadlineExceededException

Symptom:

The BI log contains the exception alluxio.exception.status.DeadlineExceededException.

Sample log error:

[ERROR]alluxio.exception.status.DeadlineExceededException: Timeout closing PacketWriter to WorkerNetAddress{host=master, rpcPort=29998, dataPort=29999, webPort=30000, domainSocketPath=, tieredIdentity=TieredIdentity(node=master, rack=null)} for request type: UFS_FILE
id: -1
tier: 0
create_ufs_file_options {
ufs_path: "hdfs://master:8020/ROOT/scenedb/dummyT_2/super/P-1/S-1/col-4-dic-index"
owner: "hdfs"
group: ""
mode: 420
mount_id: 1
}
after 1800000ms.

Cause analysis:

Check whether HDFS is running out of storage space.

Solution:

Expand the HDFS storage space as described in the Ambari Component Usage Guide (a capacity check is sketched below).
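
To confirm the cause, the following minimal Python sketch wraps the standard hdfs dfsadmin -report command (it assumes the hdfs client is on the PATH of a cluster node and that you have permission to run it) and prints only the capacity and DataNode summary lines.

    # Summarize HDFS capacity and DataNode status via "hdfs dfsadmin -report".
    import subprocess

    report = subprocess.run(["hdfs", "dfsadmin", "-report"],
                            capture_output=True, text=True).stdout
    for line in report.splitlines():
        if any(key in line for key in ("Configured Capacity", "DFS Remaining",
                                       "Live datanodes", "Dead datanodes")):
            print(line.strip())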

7. could only be replicated to 0 nodes instead of minReplication (=1). There are 3 datanode(s) running and no node(s) are excluded in this operation.

Symptom:

The Alluxio log contains the exception could only be replicated to 0 nodes instead of minReplication (=1). There are 3 datanode(s) running and no node(s) are excluded in this operation.

Sample Alluxio log error:

2018-10-22 19:09:26,530 WARN DFSClient - DataStreamer Exception
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /ROOT/scenedb/dummyT_2/super/P-1/S-1/col-4-dic-index could only be replicated to 0 nodes instead of minReplication (=1). There are 3 datanode(s) running and no node(s) are excluded in this operation.
at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1709)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getNewBlockTargets(FSNamesystem.java:3337)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3261)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:850)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:504)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2351)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2347)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1866)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2345)

at org.apache.hadoop.ipc.Client.call(Client.java:1475)
at org.apache.hadoop.ipc.Client.call(Client.java:1412)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
at com.sun.proxy.$Proxy45.addBlock(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:418)
at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy46.addBlock(Unknown Source)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.locateFollowingBlock(DFSOutputStream.java:1455)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1251)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:448)
2018-10-22 19:09:26,733 WARN AbstractWriteHandler - Failed to cleanup states with error File /ROOT/scenedb/dummyT_2/super/P-1/S-1/col-4-dic-index could only be replicated to 0 nodes instead of minReplication (=1). There are 3 datanode(s) running and no node(s) are excluded in this operation.
at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1709)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getNewBlockTargets(FSNamesystem.java:3337)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3261)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:850)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:504)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2351)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2347)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1866)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2345)

Possible cause:

Check whether HDFS is running out of storage space.

Solution:

Expand the HDFS storage space as described in the Ambari Component Usage Guide (the capacity check sketched in section 6 also applies here).

8. com.finebi.spider.common.exception.DSAlluxioIOException: Failed to open stream read / Failed to open stream read

Problem description:

Data generation fails, and the BI log contains the following exceptions:

  • com.finebi.spider.common.exception.DSAlluxioIOException: Failed to open stream read
  • Failed to open stream read

Sample log error:

[ERROR]Failed to open stream read: /ROOT/bdb/##R#T_1#1/super/P-1/S-3/col-0-slices/slice-0
com.finebi.spider.common.exception.DSAlluxioIOException: Failed to open stream read: /ROOT/bdb/##R#T_1#1/super/P-1/S-3/col-0-slices/slice-0
at com.finebi.spider.io.AlluxioStreamInput.init(AlluxioStreamInput.java:48)
at com.finebi.spider.io.AlluxioStreamInput.<init>(AlluxioStreamInput.java:36)
at com.finebi.spider.io.AlluxioStreamInput.<init>(AlluxioStreamInput.java:31)
at com.finebi.spider.io.AlluxioStreamInput.<init>(AlluxioStreamInput.java:27)
at com.finebi.spider.io.AlluxioInputFactory.createStreamInput(AlluxioInputFactory.java:19)
at com.finebi.spider.io.AlluxioFileSystem.openStream(AlluxioFileSystem.java:334)
at com.finebi.spider.common.struct.columnslice.ColumnarSliceCreator.createBySlice(ColumnarSliceCreator.java:17)
at com.finebi.spider.cluster.write.AlluxioMassColFlow.openSlice(AlluxioMassColFlow.java:49)
at com.finebi.spider.common.mergewrite.dic.AbstractDicMergeWriteOperation.merge(AbstractDicMergeWriteOperation.java:42)
at com.finebi.spider.common.mergewrite.MergeWriteUtils.tryDicMergeWrite(MergeWriteUtils.java:84)
at com.finebi.spider.common.mergewrite.MergeWriteUtils.mergeAndWrite(MergeWriteUtils.java:66)
at com.finebi.spider.common.task.slice.ColumnarMergeSlice.run(ColumnarMergeSlice.java:26)
at com.finebi.spider.common.task.service.AsynSliceService.loop(AsynSliceService.java:41)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
Caused by: alluxio.exception.status.NotFoundException: Block 187,311,862,054,912 is unavailable in both Alluxio and UFS.
at alluxio.client.block.AlluxioBlockStore.getInStream(AlluxioBlockStore.java:189)
at alluxio.client.file.FileInStream.updateStream(FileInStream.java:582)
at alluxio.client.file.FileInStream.updateStreams(FileInStream.java:347)
at alluxio.client.file.FileInStream.read(FileInStream.java:439)
at alluxio.client.file.FileInStream.read(FileInStream.java:419)
at com.finebi.spider.io.AlluxioStreamInput.init(AlluxioStreamInput.java:44)

... 17 more

Possible cause:

Check whether Alluxio's HDD tier is nearly full.

Solution:

Increase Alluxio's HDD configuration; for detailed steps see the Ambari Component Usage Guide (a free-space check is sketched below).
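
The following minimal Python sketch checks the free space of the directory that backs the Alluxio workers' HDD storage tier. The path is an assumption; use the tier directory configured for your own Alluxio workers.

    # Report disk usage of the (assumed) Alluxio HDD tier directory.
    import shutil

    HDD_TIER_DIR = "/alluxio/hdd"    # assumed path; replace with your tier directory
    usage = shutil.disk_usage(HDD_TIER_DIR)
    gib = 2 ** 30
    print(f"total={usage.total / gib:.1f} GiB, "
          f"used={usage.used / gib:.1f} GiB, free={usage.free / gib:.1f} GiB")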

9. java.lang.OutOfMemoryError at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)

Symptom:

Data generation or self-service dataset updates fail, and the BI log contains the exception java.lang.OutOfMemoryError at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123).

Sample log error:

[ERROR]java.lang.RuntimeException: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 4573.0 failed 1 times, most recent failure: Lost task 0.0 in stage 4573.0 (TID 10872, localhost, executor driver): java.lang.OutOfMemoryError
    at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
    at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
    at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
    at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
    at net.jpountz.lz4.LZ4BlockOutputStream.flushBufferedData(LZ4BlockOutputStream.java:220)
    at net.jpountz.lz4.LZ4BlockOutputStream.write(LZ4BlockOutputStream.java:173)
    at java.io.DataOutputStream.write(DataOutputStream.java:107)
    at org.apache.spark.sql.catalyst.expressions.UnsafeRow.writeToStream(UnsafeRow.java:552)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:256)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:109)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Solution:

This issue has been fixed in newer versions. If it still occurs after upgrading, contact FanRuan technical support.

10. com.finebi.spider.common.exception.DSRandomUnsupportedException: The gotten file stream does not support random read

Symptom:

Data generation or dashboard access fails, and the BI log contains the exception com.finebi.spider.common.exception.DSRandomUnsupportedException: The gotten file stream does not support random read.

Sample log error:

WARN TaskSetManager: Lost task 4.0 in stage 16.0 (TID 817, pftest2, executor 1): com.finebi.spider.common.exception.DSRandomUnsupportedException: The gotten file stream does not support random read: /ROOT18/db/T_6E545B/super/P-1/S-49/col-3-meta; class alluxio.client.file.FileInStream
    at com.finebi.spider.io.AlluxioInput.init(AlluxioInput.java:41)
    at com.finebi.spider.io.AlluxioInput.<init>(AlluxioInput.java:31)
    at com.finebi.spider.io.AlluxioInput.<init>(AlluxioInput.java:26)
    at com.finebi.spider.io.AlluxioInputFactory.createRandomInput(AlluxioInputFactory.java:23)
    at com.finebi.spider.io.AlluxioFileSystem.open(AlluxioFileSystem.java:318)
    at com.finebi.spider.db.section.ColumnStreamCreator.create(ColumnStreamCreator.java:59)
    at com.finebi.spider.db.section.AbstractReadSection.readColumnStream(AbstractReadSection.java:76)
    at com.finebi.spider.db.section.AbstractReadSection.readColumnStream(AbstractReadSection.java:88)
    at com.finebi.spider.direct.adapter.ColumnReadUtils$1.visit(ColumnReadUtils.java:207)
    at com.finebi.spider.direct.adapter.ColumnReadUtils$1.visit(ColumnReadUtils.java:26)
    at com.fr.engine.criterion.projection.disaggregate.FieldProjection.accept(FieldProjection.java:45)
    at com.finebi.spider.direct.adapter.ColumnReadUtils.readStreamByProjection(ColumnReadUtils.java:26)
    at com.finebi.spider.compute.context.resource.ColumnStreamResourceImpl.getOrRead(ColumnStreamResourceImpl.java:34)
    at com.finebi.spider.direct.adapter.ActorContextCreator.lambda$adaptSelect$0(ActorContextCreator.java:122)
    at java.util.ArrayList.forEach(ArrayList.java:1249)
    at java.util.Collections$UnmodifiableCollection.forEach(Collections.java:1080)
    at com.finebi.spider.direct.adapter.ActorContextCreator.adaptSelect(ActorContextCreator.java:122)
    at com.finebi.spider.direct.adapter.ActorContextCreator.createQuery(ActorContextCreator.java:87)
    at com.finebi.spider.compute.SectionComputer.compute(SectionComputer.java:74)
    at com.finebi.spider.cluster.spark.ComputePartition.compute(ComputePartition.java:42)
    at com.finebi.spider.cluster.spark.TableRDD.compute(TableRDD.java:32)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
    at org.apache.spark.scheduler.Task.run(Task.scala:109)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)

Possible causes:

1. Alluxio's memory tier is full;

2. An Alluxio worker is down.

Solution:

Check the service status as described in the Ambari Component Usage Guide. If an Alluxio worker is down, restart the Alluxio service; if the Alluxio memory tier is full, contact FanRuan technical support.

11. java.io.FileNotFoundException: /tmp/spark-xxx/executor-xxx/blockmgr-xxx/xx/shuffle_xxx (No such file or directory)

Symptom:

Data generation or dashboard access fails, and the BI log contains the exception java.io.FileNotFoundException: /tmp/spark-903244ba-4d5a-446f-b67e-8e95309c68b6/executor-cd591197-d452-4421-a776-93c6404d18b9/blockmgr-3deab159-41ec-472f-86ee-5a7626e85622/31/shuffle_1121_0_0.data.4d7755e7-0a8f-49d0-88e2-58b8daeb46c2 (No such file or directory).

Sample log error:

14:40:23 http-nio-37799-exec-24 ERROR [standard] Job aborted due to stage failure: Task 0 in stage 2454.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2454.0 (TID 18375, i-8m7ope9g, executor 2): java.io.FileNotFoundException: /tmp/spark-903244ba-4d5a-446f-b67e-8e95309c68b6/executor-cd591197-d452-4421-a776-93c6404d18b9/blockmgr-3deab159-41ec-472f-86ee-5a7626e85622/31/shuffle_1121_0_0.data.4d7755e7-0a8f-49d0-88e2-58b8daeb46c2 (No such file or directory)
    at java.io.FileOutputStream.open0(Native Method)
    at java.io.FileOutputStream.open(FileOutputStream.java:270)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
    at org.apache.spark.storage.DiskBlockObjectWriter.initialize(DiskBlockObjectWriter.scala:103)
    at org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:116)
    at org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:237)
    at org.apache.spark.util.collection.WritablePartitionedPairCollection$$anon$1.writeNext(WritablePartitionedPairCollection.scala:56)
    at org.apache.spark.util.collection.ExternalSorter.writePartitionedFile(ExternalSorter.scala:699)
    at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:72)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
    at org.apache.spark.scheduler.Task.run(Task.scala:109)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)

Possible cause:

The Spark temporary directory has run out of space, or the machine's disk is full.

Solution:

If the disk is full, expand the machine's disk. If only the Spark temporary directory is short on space, change Spark's temporary directory configuration; see section 3.4, Insufficient Spark Temporary Directory Space, in the Distributed Engine Maintenance FAQ (a free-space check is sketched below).
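
The following minimal Python sketch checks free space on the filesystem holding Spark's temporary directory; /tmp is the default and matches the path in the error above, so adjust it if your deployment uses a different location.

    # Report free space on the filesystem that hosts the Spark temporary directory.
    import shutil

    SPARK_TMP = "/tmp"               # default; change if your Spark temp directory differs
    usage = shutil.disk_usage(SPARK_TMP)
    print(f"{SPARK_TMP}: free {usage.free / 2**30:.1f} GiB of {usage.total / 2**30:.1f} GiB")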

12. com.finebi.spider.common.exception.DSAlluxioIOException: Failed to open write file

Symptom:

Data generation or dashboard access fails, and the BI log contains exceptions such as Caused by: com.finebi.spider.common.exception.DSAlluxioIOException: Failed to open write file:

Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: syscall:getsockopt(..) failed: Connection refused (拒绝连接): hostname/ip:29999

Sample log error:

15:06:24 spider-append-pool-1-thread-4 ERROR [standard] failed task0:TaskInfo{taskName='P-1S-1', jobInfo=JobInfo{jobName='dummyT_167611#build_1538203597259_4'}}
com.finebi.spider.index.exception.BuildBitmapException: build path:/ROOT/db/dummyT_167611/super/P-1/S-1/bitmap/col-0-bm bitmap failed
    at com.finebi.spider.index.build.list.BuildIndexOperation.buildColumnIndex(BuildIndexOperation.java:113)
    at com.finebi.spider.index.build.list.BuildIndexOperation.build(BuildIndexOperation.java:83)
    at com.finebi.spider.index.build.list.SectionBuildIndex.compute(SectionBuildIndex.java:32)
    at com.finebi.spider.index.build.list.BitmapPartition.compute(BitmapPartition.java:46)
    at com.finebi.spider.index.build.list.BuildRDD.compute(BuildRDD.java:33)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:109)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)
Caused by: com.finebi.spider.common.exception.DSAlluxioIOException: Failed to open write file: /ROOT/db/dummyT_167611/super/P-1/S-1/bitmap/col-0-bm
    at com.finebi.spider.io.AlluxioOutput.init(AlluxioOutput.java:78)
    at com.finebi.spider.io.AlluxioOutput.<init>(AlluxioOutput.java:31)
    at com.finebi.spider.io.AlluxioOutput.cacheThrough(AlluxioOutput.java:44)
    at com.finebi.spider.io.AlluxioFileSystem.createWithPersist(AlluxioFileSystem.java:344)
    at com.finebi.spider.zcube.index.IndexCreator.buildBitmapWriter(IndexCreator.java:54)
    at com.finebi.spider.index.build.list.BuildIndexOperation.buildColumnIndex(BuildIndexOperation.java:105)
    ... 12 more
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: syscall:getsockopt(..) failed: 拒绝连接: FBIS4/10.252.0.104:29999
    at io.netty.channel.unix.Socket.finishConnect(..)(Unknown Source)
Caused by: io.netty.channel.unix.Errors$NativeConnectException: syscall:getsockopt(..) failed: 拒绝连接
    ... 1 more

Possible cause:

An Alluxio worker is down.

Solution:

Check the service status as described in the Ambari Component Usage Guide and restart the Alluxio service on the failed worker (a port-reachability sketch is given below).
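
The following minimal Python sketch checks whether the Alluxio worker ports respond; the hostnames are placeholders for your worker nodes, and the ports match the defaults shown in the errors above (29998 RPC, 29999 data). A refused connection on a node is consistent with the worker being down there.

    # Check TCP reachability of the Alluxio worker ports on each node.
    import socket

    WORKERS = ["master", "slave1", "slave2"]     # your Alluxio worker hosts
    for host in WORKERS:
        for port in (29998, 29999):
            try:
                with socket.create_connection((host, port), timeout=3):
                    print(f"{host}:{port} open")
            except OSError as err:
                print(f"{host}:{port} closed ({err}) - the worker may be down")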

13. Error writing blockId: xxx, sessionId: xxx, address: BI-DB4/192.168.5.9:29999, message: Failed to write block.

Symptom:

Data generation or dashboard access fails, and the BI log contains the exception Caused by: java.io.IOException: Error writing blockId: 9365176188928, sessionId: 3681370328933487755, address: BI-DB4/192.168.5.9:29999, message: Failed to write block.

Sample log error:

2018-08-20 11:38:37  [ Distributed-Log spider-slice-pool-2-thread-4:1130269735 ] - [ ERROR ] com.finebi.spider.common.task.service.AsynSliceService.loop(AsynSliceService.java:44)
Failed to cache: java.io.IOException: Error writing blockId: 9365176188928, sessionId: 3681370328933487755, address: BI-DB4/192.168.5.9:29999, message: Failed to write block.
java.io.IOException: Failed to cache: java.io.IOException: Error writing blockId: 9365176188928, sessionId: 3681370328933487755, address: BI-DB4/192.168.5.9:29999, message: Failed to write block.
    at alluxio.client.file.FileOutStream.handleCacheWriteException(FileOutStream.java:338)
    at alluxio.client.file.FileOutStream.write(FileOutStream.java:301)
    at com.finebi.spider.io.AlluxioOutput.flush(AlluxioOutput.java:151)
    at com.finebi.spider.io.AlluxioOutput.writeInt(AlluxioOutput.java:108)
    at com.finebi.spider.common.struct.columnar.DicIndexUtils.writeIndex(DicIndexUtils.java:29)
    at com.finebi.spider.common.struct.columnar.TimestampDicColumnar.serialize(TimestampDicColumnar.java:97)
    at com.finebi.spider.common.task.slice.ColumnarCompressSlice.run(ColumnarCompressSlice.java:35)
    at com.finebi.spider.common.task.service.AsynSliceService.loop(AsynSliceService.java:41)
    at com.finebi.spider.common.task.service.AsynSliceService$$Lambda$596/253578804.run(Unknown Source)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
    Suppressed: java.io.IOException: java.io.IOException: Error writing blockId: 9365176188928, sessionId: 3681370328933487755, address: BI-DB4/192.168.5.9:29999, message: Failed to write block.
        at alluxio.client.netty.NettyRemoteBlockWriter.write(NettyRemoteBlockWriter.java:142)
        at alluxio.client.block.RemoteBlockOutStream.writeToRemoteBlock(RemoteBlockOutStream.java:124)
        at alluxio.client.block.RemoteBlockOutStream.flush(RemoteBlockOutStream.java:114)
        at alluxio.client.block.RemoteBlockOutStream.close(RemoteBlockOutStream.java:95)
        at alluxio.client.file.FileOutStream.close(FileOutStream.java:216)
        at com.finebi.spider.io.AlluxioOutput.close(AlluxioOutput.java:166)
        at com.finebi.spider.common.task.slice.ColumnarCompressSlice.run(ColumnarCompressSlice.java:36)
        ... 7 more
    Caused by: java.io.IOException: Error writing blockId: 9365176188928, sessionId: 3681370328933487755, address: BI-DB4/192.168.5.9:29999, message: Failed to write block.
        at alluxio.client.netty.NettyRemoteBlockWriter.write(NettyRemoteBlockWriter.java:121)
        ... 13 more
Caused by: java.io.IOException: java.io.IOException: Error writing blockId: 9365176188928, sessionId: 3681370328933487755, address: BI-DB4/192.168.5.9:29999, message: Failed to write block.
    at alluxio.client.netty.NettyRemoteBlockWriter.write(NettyRemoteBlockWriter.java:142)
    at alluxio.client.block.RemoteBlockOutStream.writeToRemoteBlock(RemoteBlockOutStream.java:124)
    at alluxio.client.block.RemoteBlockOutStream.flush(RemoteBlockOutStream.java:114)
    at alluxio.client.block.BufferedBlockOutStream.write(BufferedBlockOutStream.java:108)
    at alluxio.client.file.FileOutStream.write(FileOutStream.java:292)
    ... 12 more
Caused by: java.io.IOException: Error writing blockId: 9365176188928, sessionId: 3681370328933487755, address: BI-DB4/192.168.5.9:29999, message: Failed to write block.
    at alluxio.client.netty.NettyRemoteBlockWriter.write(NettyRemoteBlockWriter.java:121)
    ... 16 more

Possible causes:

1. An Alluxio worker is down;

2. Duplicate temp block IDs.

Solution:

1. Check the service status as described in the Ambari Component Usage Guide and restart the Alluxio service on the failed worker;

2. If all Alluxio services are healthy, the problem may be duplicate block IDs; see section 1.4 of the Distributed Engine Maintenance FAQ.

14. Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try.

Symptom:

Data generation or dashboard access fails, and the BI log contains the exception alluxio.exception.UnexpectedAlluxioException: java.lang.RuntimeException: java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try.

Sample log error:

2018-07-26 10:33:35  [ Distributed-Log pool-50-thread-4:171829971 ] - [ ERROR ] com.finebi.spider.common.task.service.AsynSliceService.loop(AsynSliceService.java:42)
alluxio.exception.UnexpectedAlluxioException: java.lang.RuntimeException: java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[DatanodeInfoWithStorage[30.16.24.69:50010,DS-4bf47e51-a402-489e-888c-def62a6ec72c,DISK], DatanodeInfoWithStorage[30.16.24.68:50010,DS-7be9a6ec-eaf1-4927-85e9-327bb4966678,DISK]], original=[DatanodeInfoWithStorage[30.16.24.69:50010,DS-4bf47e51-a402-489e-888c-def62a6ec72c,DISK], DatanodeInfoWithStorage[30.16.24.68:50010,DS-7be9a6ec-eaf1-4927-85e9-327bb4966678,DISK]]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
com.finebi.spider.common.exception.DSAlluxioException: alluxio.exception.UnexpectedAlluxioException: java.lang.RuntimeException: java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[DatanodeInfoWithStorage[30.16.24.69:50010,DS-4bf47e51-a402-489e-888c-def62a6ec72c,DISK], DatanodeInfoWithStorage[30.16.24.68:50010,DS-7be9a6ec-eaf1-4927-85e9-327bb4966678,DISK]], original=[DatanodeInfoWithStorage[30.16.24.69:50010,DS-4bf47e51-a402-489e-888c-def62a6ec72c,DISK], DatanodeInfoWithStorage[30.16.24.68:50010,DS-7be9a6ec-eaf1-4927-85e9-327bb4966678,DISK]]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
    at com.finebi.spider.io.AlluxioOutput.init(AlluxioOutput.java:80)
    at com.finebi.spider.io.AlluxioOutput.<init>(AlluxioOutput.java:31)
    at com.finebi.spider.io.AlluxioOutput.mustCache(AlluxioOutput.java:52)
    at com.finebi.spider.io.AlluxioFileSystem.create(AlluxioFileSystem.java:391)
    at com.finebi.spider.cluster.write.AlluxioMassColFlow.createSlice(AlluxioMassColFlow.java:43)
    at com.finebi.spider.common.task.slice.ColumnarCompressSlice.run(ColumnarCompressSlice.java:33)
    at com.finebi.spider.common.task.service.AsynSliceService.loop(AsynSliceService.java:39)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: alluxio.exception.UnexpectedAlluxioException: java.lang.RuntimeException: java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[DatanodeInfoWithStorage[30.16.24.69:50010,DS-4bf47e51-a402-489e-888c-def62a6ec72c,DISK], DatanodeInfoWithStorage[30.16.24.68:50010,DS-7be9a6ec-eaf1-4927-85e9-327bb4966678,DISK]], original=[DatanodeInfoWithStorage[30.16.24.69:50010,DS-4bf47e51-a402-489e-888c-def62a6ec72c,DISK], DatanodeInfoWithStorage[30.16.24.68:50010,DS-7be9a6ec-eaf1-4927-85e9-327bb4966678,DISK]]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at alluxio.exception.AlluxioException.fromThrift(AlluxioException.java:92)
    at alluxio.AbstractClient.retryRPC(AbstractClient.java:353)
    at alluxio.client.file.FileSystemMasterClient.createFile(FileSystemMasterClient.java:147)
    at alluxio.client.file.BaseFileSystem.createFile(BaseFileSystem.java:146)
    at com.finebi.spider.io.AlluxioOutput.init(AlluxioOutput.java:74)
    ... 11 more

Possible cause:

One or more HDFS DataNodes are down, and the number of live DataNodes has dropped below the HDFS replication factor.

Solution:

Restart the failed HDFS services as described in the Ambari Component Usage Guide (the hdfs dfsadmin -report check sketched in section 6 also shows the live and dead DataNode counts).

15. java.lang.OutOfMemoryError: GC overhead limit exceeded

Symptom:

Data generation or dashboard access fails, and the BI log contains the exception java.lang.OutOfMemoryError: GC overhead limit exceeded.

Sample log error:

java.lang.OutOfMemoryError: GC overhead limit exceeded

Possible cause:

The BI server does not have enough memory.

Solution:

Increase the BI server's memory as described in the Deployment Initialization: Modifying Memory document.

16. java.lang.OutOfMemoryError: Java heap space

Symptom:

Data generation or dashboard access fails, and the BI log contains the exception Java heap space.

Sample log error:

Java heap space

Possible cause:

The BI server does not have enough memory.

Solution:

Increase the BI server's memory as described in the Deployment Initialization: Modifying Memory document. If the error keeps recurring, contact FanRuan technical support.

17. SparkContext: Error initializing SparkContext. | java.lang.NoClassDefFoundError: Could not initialize class com.finebi.spider.context.DSContext

Symptom:

BI fails to start and cannot be accessed, and the page reports an error. The BI log contains errors such as:

ERROR SparkContext: Error initializing SparkContext.

java.lang.NoClassDefFoundError: Could not initialize class com.finebi.spider.context.DSContext

Sample log error:

ERROR SparkContext: Error initializing SparkContext.
java.lang.NullPointerException
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:558)
    at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2486)
    at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:930)
    at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:921)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:921)
    at com.finebi.spider.cluster.spark.ContextManager.<init>(ContextManager.java:45)
    at com.finebi.spider.cluster.spark.ContextManager.<clinit>(ContextManager.java:33)
    at com.finebi.spider.context.DSContext.initSparkContext(DSContext.java:58)
    at com.finebi.spider.context.DSContext.<init>(DSContext.java:36)
    at com.finebi.spider.context.DSContext.<clinit>(DSContext.java:29)
    at com.fr.engine.distribute.local.source.EngineProcedureDriver.getStrategyProcedure(EngineProcedureDriver.java:18)
    at com.fr.engine.bi.config.DirectDataSourceDriverRegisterFactory.registerProcedure(DirectDataSourceDriverRegisterFactory.java:65)
    at com.fr.engine.bi.config.DirectDataSourceDriverRegisterFactory.realRegister(DirectDataSourceDriverRegisterFactory.java:54)
    at com.fr.engine.bi.config.DirectDataSourceDriverRegisterFactory.realRegister(DirectDataSourceDriverRegisterFactory.java:21)
    at com.fr.engine.bi.config.CompletableFutureRegisterFactory.lambda$register$0(CompletableFutureRegisterFactory.java:28)
    at java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1626)
    at java.lang.Thread.run(Thread.java:748)
18/11/01 13:06:45 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
18/11/01 13:06:45 INFO SparkContext: SparkContext already stopped.
18/11/01 13:06:45 ERROR [decision]: Servlet.service() for servlet [decision] in context with path [/webroot] threw exception [Handler processing failed; nested exception is java.lang.NoClassDefFoundError: Could not initialize class com.finebi.spider.context.DSContext] with root cause
java.lang.NoClassDefFoundError: Could not initialize class com.finebi.spider.context.DSContext
    at com.fr.engine.bi.license.counter.DefaultSpiderActiveNodeCounter.getActiveNodeCount(DefaultSpiderActiveNodeCounter.java:17)
    at com.fr.engine.bi.license.matcher.SpiderEngineNodeLicenseMatcher.getCurrentActiveNodeNumber(SpiderEngineNodeLicenseMatcher.java:41)
    at com.fr.engine.bi.license.matcher.SpiderEngineNodeLicenseMatcher.match(SpiderEngineNodeLicenseMatcher.java:21)
    at com.fr.decision.webservice.impl.system.available.BusinessNodeDetector.availableCheck(BusinessNodeDetector.java:20)
    at com.fr.decision.webservice.interceptor.SystemAvailableInterceptor.preHandle(SystemAvailableInterceptor.java:25)
    at com.fr.third.springframework.web.servlet.HandlerExecutionChain.applyPreHandle(HandlerExecutionChain.java:134)
    at com.fr.third.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:932)
    at com.fr.third.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:870)
    at com.fr.third.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:961)
    at com.fr.third.springframework.web.servlet.FrameworkServlet.doGet(FrameworkServlet.java:852)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:634)
    at com.fr.third.springframework.web.servlet.FrameworkServlet.service(FrameworkServlet.java:837)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:741)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:231)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
    at com.fr.third.springframework.web.filter.CharacterEncodingFilter.doFilterInternal(CharacterEncodingFilter.java:88)
    at com.fr.third.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:107)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
    at org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:53)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
    at com.fr.decision.base.DecisionServletInitializer$4.doFilter(DecisionServletInitializer.java:123)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:199)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:96)
    at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:491)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:139)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:92)
    at org.apache.catalina.valves.AbstractAccessLogValve.invoke(AbstractAccessLogValve.java:668)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:87)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:343)
    at org.apache.coyote.http11.Http11Processor.service(Http11Processor.java:408)
    at org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:66)
    at org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:764)
    at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1388)
    at org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:49)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
    at java.lang.Thread.run(Thread.java:748)

Possible causes:

This error almost always appears right when BI starts, so read the BI log from the most recent startup onward and do not skip any suspicious INFO, WARN, or ERROR output. It is usually caused by one of the following:

1. The spark_driver_host parameter is configured with the wrong IP;

2. In the hosts file, localhost maps to multiple IPs (standalone deployment); in a cluster deployment, the local hostname may map to multiple IPs, or the hostname-to-IP mapping may be wrong;

3. The Spark service itself is abnormal.

Solution:

1 & 2: Correct the corresponding parameter and the IP/hostname mapping;

3: Check the Spark web UI status as described in the Ambari Component Usage Guide (there must be at least one node in the ALIVE state); if the service is abnormal, restart Spark (a status-check sketch is given below).
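
For cause 3, the following minimal Python sketch queries the Spark standalone master's JSON status page, which mirrors the web UI. The host and port are assumptions (8080 is the default master web UI port), and the field names may differ between Spark versions, so treat this as a convenience check rather than an official API.

    # Query the Spark standalone master status page and summarize it.
    import json, urllib.request

    MASTER_UI = "http://master:8080/json/"       # replace "master" with your Spark master host
    with urllib.request.urlopen(MASTER_UI, timeout=5) as resp:
        info = json.load(resp)
    print("master status:", info.get("status"))
    workers = info.get("workers", [])
    print("alive workers:", sum(1 for w in workers if w.get("state") == "ALIVE"))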

18. Logging in to the FineBI Decision Platform Hangs on Loading

Symptom:

When logging in to the FineBI decision platform, the page keeps loading and the login never completes, even though the Tomcat welcome page opens normally.

On the Spark Workers page, opening each worker's page in turn shows two executors in the RUNNING state, with IDs 40104 and 40106. After refreshing the page, the RUNNING executor IDs have changed, which means the executors 40104 and 40106 seen a moment ago have already died.

Possible cause:

The executors cannot resolve the driver's hostname and exit on their own; the worker then launches replacement executors, which fail in the same way. Because no executor can ever reach the driver, the job never runs normally, and the decision platform page never finishes loading.

Solution:

Configure the BI driver's hostname as an IP. Following the Spider Distributed Engine Parameter Configuration document, set the DistributedOptimizationConfig.spiderConfig.spark_driver_host_hostname parameter to the IP that the local hostname resolves to.

Note: the _hostname part of the parameter name is a dynamic suffix. For example, if node A's hostname is hostA, the parameter name is DistributedOptimizationConfig.spiderConfig.spark_driver_host_hostA (a small sketch for building the name is given below).
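
The following minimal Python sketch builds the configuration key described above from the local hostname and pairs it with the IP that the hostname resolves to; it only prints the key/value to set and does not change any configuration.

    # Derive the spark_driver_host_<hostname> key and its value for this node.
    import socket

    hostname = socket.gethostname()              # e.g. "hostA"
    ip = socket.gethostbyname(hostname)
    key = f"DistributedOptimizationConfig.spiderConfig.spark_driver_host_{hostname}"
    print(key, "=", ip)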
