[JS-1303] In a distributed job chain triggered by file orders some files create conflicts between cluster members - SOS JIRA

Details

Type: Fix
Status: Dismissed
Priority: Major
Resolution: Works as designed
Affects Version/s: 1.8
Fix Version/s: 1.9
Component/s: Job Scheduler Binaries
Labels: None
Environment: gollum:4241/ / homer:4111/ / share:8of9/data

Description

Starting Situation

- Cluster setup with two JobScheduler instances, e.g. gollum.sos and homer.sos
- Common CIFS mount shared by both servers

Current Behavior

When 50 files of 2 MB each are moved to the common mount point, 1-3 of the 50 files end in error while the others are processed without error.
The error _file not found_ occurs when both JobScheduler instances try to process the same file: one instance processes and removes the file before the second instance can access it.

The following log output is created:

2015-02-04 11:56:59.391+0100 [info]   (Task distributed/40_DistributedFileProcessing/compress_file:7405293) SCHEDULER-842  Task is going to process Order distributed/40_DistributedFileProcessing/compress_archive_files:/mnt/8of9/data/in/TRX-20150121-007.DAT, state=compress, on JobScheduler http://gollum:4241
2015-02-04 11:56:59.393+0100 [info]   (Task distributed/40_DistributedFileProcessing/compress_file:7405293) 
2015-02-04 11:56:59.393+0100 [info]   (Task distributed/40_DistributedFileProcessing/compress_file:7405293) Task distributed/40_DistributedFileProcessing/compress_file:7405293 - Protocol starts in /home/jenkins/sos-berlin.com/jobscheduler/scheduler_distributed_sos/logs/task.distributed,40_DistributedFileProcessing,compress_file.7405293.log
2015-02-04 11:56:59.394+0100 [info]   (Task distributed/40_DistributedFileProcessing/compress_file:7405293) SCHEDULER-918  state=starting (at=never)
2015-02-04 11:56:59.394+0100 [info]   (Task distributed/40_DistributedFileProcessing/compress_file:7405293) SCHEDULER-987  Starting process: '/bin/sh' '-c' '"/tmp/jenkins/sos.Vc88iD"'
2015-02-04 11:57:21.606+0100 [info]   (Task distributed/40_DistributedFileProcessing/compress_file:7405293) compress_file : job starting
2015-02-04 11:57:21.606+0100 [info]   (Task distributed/40_DistributedFileProcessing/compress_file:7405293) .
2015-02-04 11:57:21.606+0100 [info]   (Task distributed/40_DistributedFileProcessing/compress_file:7405293) .
2015-02-04 11:57:21.606+0100 [info]   (Task distributed/40_DistributedFileProcessing/compress_file:7405293) .
2015-02-04 11:57:21.606+0100 [info]   (Task distributed/40_DistributedFileProcessing/compress_file:7405293) Processing File /mnt/8of9/data/in/TRX-20150121-007.DAT
2015-02-04 11:57:21.606+0100 [info]   (Task distributed/40_DistributedFileProcessing/compress_file:7405293) .
2015-02-04 11:57:21.606+0100 [info]   (Task distributed/40_DistributedFileProcessing/compress_file:7405293) .
2015-02-04 11:57:21.606+0100 [info]   (Task distributed/40_DistributedFileProcessing/compress_file:7405293) .
2015-02-04 11:57:21.606+0100 [info]   (Task distributed/40_DistributedFileProcessing/compress_file:7405293) START Compress file :TRX-20150121-007.DAT
2015-02-04 11:57:21.606+0100 [info]   (Task distributed/40_DistributedFileProcessing/compress_file:7405293) CMD> gzip -cv /mnt/8of9/data/in/TRX-20150121-007.DAT > /mnt/8of9/data/in/TRX-20150121-007.DAT.gz
2015-02-04 11:57:21.606+0100 [info]   (Task distributed/40_DistributedFileProcessing/compress_file:7405293) .
2015-02-04 11:57:21.606+0100 [info]   (Task distributed/40_DistributedFileProcessing/compress_file:7405293) .
2015-02-04 11:57:21.606+0100 [info]   (Task distributed/40_DistributedFileProcessing/compress_file:7405293) CMD> gzip -cv /mnt/8of9/data/in/TRX-20150121-007.DAT > /mnt/8of9/data/in/TRX-20150121-007.DAT.gz : unsuccessful , Exit 99
2015-02-04 11:57:21.606+0100 [info]   (Task distributed/40_DistributedFileProcessing/compress_file:7405293) gzip: /mnt/8of9/data/in/TRX-20150121-007.DAT: No such file or directory
2015-02-04 11:57:23.032+0100 [info]   (Task distributed/40_DistributedFileProcessing/compress_file:7405293) SCHEDULER-915  Process event
2015-02-04 11:57:23.032+0100 [ERROR]  (Task distributed/40_DistributedFileProcessing/compress_file:7405293) SCHEDULER-280  Process terminated with exit code 99 (0x63)
2015-02-04 11:57:23.033+0100 [info]   (Task distributed/40_DistributedFileProcessing/compress_file:7405293) SCHEDULER-843  Task has ended processing of Order distributed/40_DistributedFileProcessing/compress_archive_files:/mnt/8of9/data/in/TRX-20150121-007.DAT, state=compress, on JobScheduler http://gollum:4241
2015-02-04 11:57:23.033+0100 [info]   set_state error, Job /scheduler_file_order_sink
2015-02-04 11:57:23.127+0100 [info]   (Task scheduler_file_order_sink:7405246) SCHEDULER-842  Task is going to process Order distributed/40_DistributedFileProcessing/compress_archive_files:/mnt/8of9/data/in/TRX-20150121-007.DAT, state=error, on JobScheduler http://gollum:4241
2015-02-04 11:57:23.127+0100 [WARN]   (Task scheduler_file_order_sink:7405246) SCHEDULER-339  File does not exist and can therefore neither be moved nor removed: /mnt/8of9/data/in/TRX-20150121-007.DAT
2015-02-04 11:57:23.128+0100 [info]   (Task scheduler_file_order_sink:7405246) SCHEDULER-843  Task has ended processing of Order distributed/40_DistributedFileProcessing/compress_archive_files:/mnt/8of9/data/in/TRX-20150121-007.DAT, state=error, on JobScheduler http://gollum:4241
2015-02-04 11:57:23.128+0100 [info]   SCHEDULER-945  No further job in job chain - order has been carried out
2015-02-04 11:57:23.128+0100 [info]   SCHEDULER-940  Removing order from job chain
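
To find out which files ran into this conflict, the log can be searched for the SCHEDULER-339 warning shown above. The following is a minimal sketch, not part of JobScheduler: the function name list_missing_files is hypothetical, the log file path has to be supplied by the caller, and the message text is assumed to match the excerpt above.

```shell
#!/bin/sh
# Hypothetical helper: list the files for which the file order sink
# reported SCHEDULER-339 ("File does not exist ...").
# $1 is the path to a JobScheduler log file, supplied by the caller.
list_missing_files() {
    # Keep only the SCHEDULER-339 lines, strip everything up to the
    # file path at the end of the message, and drop duplicates.
    grep 'SCHEDULER-339' "$1" | sed 's/.*removed: //' | sort -u
}

# Run only when a log file is passed on the command line.
if [ "$#" -gt 0 ]; then
    list_missing_files "$1"
fi
```

Piping the output through `wc -l` gives the number of affected files, which can be compared with the 1-3 conflicts per 50 files reported above.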

Resolution

Tests show that this problem occurs when the latency of the common mount point exceeds the time a job chain requires to process an incoming file, e.g. the file becomes visible to one JobScheduler instance with a delay of 3s, and during that delay another instance has already processed the file completely.
We cannot modify the behavior to compensate for such delays without serious impact on this feature, which requires timely processing of incoming files.
- We tested with JobScheduler 1.9-SNAPSHOT on Windows x64 and found no errors; JobScheduler works as expected.
- In parallel tests with CIFS/Samba drives, the latency of the shared drive caused files to appear to the individual cluster members at different points in time. This caused conflicts between cluster members that resulted in the above error.
To compensate for the latency, which depends on the individual system setup, the application can add a delay (sleep) in the first node of the job chain that corresponds to the latency value.
For fast shared drives (as simulated in a Windows cluster on one computer), a forced delay of 3-5 seconds avoided any conflict.
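
The forced delay could look roughly like the following shell script for the first node of the job chain. This is a sketch under assumptions rather than the tested setup: the function name settle_then_check and the variable LATENCY_SECONDS are hypothetical, the file path is expected as the first argument (how the path reaches the script depends on the job configuration), and the existence check after the sleep is an addition for illustration that goes beyond the plain delay described above.

```shell
#!/bin/sh
# Hypothetical first-node script: wait out the latency of the shared
# drive, then check whether the incoming file is still present.

# LATENCY_SECONDS is an assumed tuning knob, not a JobScheduler setting.
# The default of 5 follows the 3-5 second range mentioned above.
LATENCY_SECONDS="${LATENCY_SECONDS:-5}"

settle_then_check() {
    file="$1"
    # Give the share time to show a consistent view to all cluster members.
    sleep "$LATENCY_SECONDS"
    if [ -f "$file" ]; then
        echo "file still present after ${LATENCY_SECONDS}s: $file"
        return 0
    fi
    # The file vanished during the delay, presumably because another
    # instance has already processed and removed it.
    echo "file gone after ${LATENCY_SECONDS}s: $file" >&2
    return 1
}

# Run only when a file path is passed on the command line.
if [ "$#" -gt 0 ]; then
    settle_then_check "$1"
fi
```

The delay value has to be matched to the latency measured for the individual setup; a value that is too small leaves the conflict in place, and a larger value slows down every order by the same amount.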

Attachments

Gollum.4241.png
215 kB
04 February 2015 13:46