Is it good to create a Spark batch job for every new use case?
I run hundreds of computers in a network, and hundreds of users access those machines. Every day, thousands or more syslog entries are generated by these machines. The syslog entries include system failures, network and firewall events, application errors, etc. A sample log entry is shown below:
may 11 11:32:40 scrooge sg_child[1829]: [id 748625 user.info] m:wr-sg-block-111- 00 c:y th:block , no allow rule matched request entryurl:http:url on mapping:bali [ rid:t6zcuh8aaaeaagxyaqyaaaaq sid:a6bbd3447766384f3bccc3ca31dbd50n ip:192.24.61.1]
From the logs I extract fields such as timestamp, loghost, msg, process, facility, etc., and store them in HDFS. The logs are stored in JSON format.
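For illustration only, here is a minimal sketch of what that extraction step could look like as a Spark (Scala) batch job. The regex, field names, and HDFS paths are assumptions based on the sample line above, not the actual pipeline.

```scala
import org.apache.spark.sql.SparkSession

object SyslogToHdfs {

  // Extracted record; field names follow the question (timestamp, loghost, process, facility, msg).
  case class LogRecord(timestamp: String, loghost: String, process: String,
                       pid: String, facility: String, msg: String)

  // Hypothetical pattern for lines like the sample:
  // "may 11 11:32:40 scrooge sg_child[1829]: [id 748625 user.info] m:wr-sg-block..."
  private val SyslogLine =
    """^(\w{3} +\d{1,2} \d{2}:\d{2}:\d{2}) (\S+) (\S+)\[(\d+)\]: \[id \d+ (\S+)\] (.*)$""".r

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("syslog-to-hdfs").getOrCreate()
    import spark.implicits._

    // Raw syslog lines collected to a landing directory on HDFS (assumed path).
    val raw = spark.read.textFile("hdfs:///landing/syslog/*.log")

    // Keep only lines that match the assumed syslog layout.
    val parsed = raw.flatMap {
      case SyslogLine(ts, host, proc, pid, fac, msg) =>
        Seq(LogRecord(ts, host, proc, pid, fac, msg))
      case _ => Seq.empty[LogRecord]
    }

    // Store the extracted fields in HDFS as JSON, as described in the question.
    parsed.write.mode("overwrite").json("hdfs:///data/syslog/json")

    spark.stop()
  }
}
```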
Now I want to build a system where I can type a query in a web application and run analysis on the logs. I want to be able to run queries like:
- Get logs whose message contains the keywords "firewall blocked".
- Get logs generated by the user jason.
- Get logs containing the message "access denied".
- Get log counts grouped by user, process, loghost, etc.

There are thousands of different types of analytics I want to do (a sketch of what a few of these could look like as SQL queries follows below). On top of that, I want combined results of historical data and real-time data, i.e. combining batch and real-time results.
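Purely to make the requirement concrete, here is a spark-shell style sketch of how a few of these queries might be expressed in Spark SQL over the JSON records described above; the view name, field names (user, msg, process, loghost), and path are assumptions.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("adhoc-log-queries").getOrCreate()

// Assumed location of the JSON records produced by the extraction job.
val logs = spark.read.json("hdfs:///data/syslog/json")
logs.createOrReplaceTempView("logs")

// Logs whose message contains the "firewall blocked" keywords.
val firewallBlocked = spark.sql(
  "SELECT * FROM logs WHERE msg LIKE '%firewall blocked%'")

// Logs generated by the user jason (assumes a `user` field was extracted).
val jasonLogs = spark.sql(
  "SELECT * FROM logs WHERE `user` = 'jason'")

// Log counts grouped by user, process, and loghost.
val counts = spark.sql(
  """SELECT `user`, process, loghost, COUNT(*) AS cnt
    |FROM logs
    |GROUP BY `user`, process, loghost""".stripMargin)

counts.show()
```

Whether running queries like these as Spark jobs is the right approach is exactly the question below.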
Now my questions are:
- To get batch results, I need to run batch Spark jobs. Should I be creating a batch job for every unique query a user makes? If so, I would end up creating thousands of batch jobs. If not, what kind of batch jobs should I run so that I can get results for any type of analytics?
- Am I thinking about this the right way? If my approach is wrong, please share what the correct procedure would be.
While it is possible (via the Thrift server, for example), Apache Spark's main objective is not to be a query engine but to build data pipelines for streaming and batch data sources.
If your transformation is only projecting fields and you want to enable ad-hoc queries, it sounds like you need a data store instead, such as Elasticsearch, for example. An additional benefit is that Kibana comes with it, which enables analytics to some extent.
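As a rough sketch of that option, the elasticsearch-hadoop (elasticsearch-spark) connector can write a DataFrame straight into an index, which Kibana can then search and aggregate. The index name, host, and paths below are illustrative assumptions, and the connector library must be on the Spark classpath.

```scala
import org.apache.spark.sql.SparkSession
import org.elasticsearch.spark.sql._   // from the elasticsearch-spark connector

val spark = SparkSession.builder
  .appName("syslog-to-es")
  .config("es.nodes", "localhost")     // assumed Elasticsearch host
  .config("es.port", "9200")
  .getOrCreate()

// Read the extracted JSON records from HDFS (assumed path).
val logs = spark.read.json("hdfs:///data/syslog/json")

// Write them into an Elasticsearch index ("syslog" is an illustrative name);
// Kibana can then be pointed at this index for keyword search and aggregations.
logs.saveToEs("syslog")
```

The ad-hoc keyword searches and grouped counts from the question then become Elasticsearch or Kibana queries rather than one Spark batch job per query.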
Another option is to use a SQL engine such as Apache Drill.