======Apache Hadoop======
====Architecture====
  • Node - a single computer
  • Rack - 30-40 nodes attached to the same switch, which carries heavy network traffic
  • Cluster - racks managed by Hadoop
====Components====
  • Distributed file systems: Hadoop Distributed File System (HDFS) or IBM Spectrum Scale
  • MapReduce Engine - a framework for running computations on data stored in the file system; it includes a scheduler and a resource manager
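To make the MapReduce idea concrete, here is a minimal single-process sketch of the map, shuffle, and reduce phases for a word count. It does not use any Hadoop API; the function names (`map_words`, `shuffle`, `reduce_counts`, `run_job`) are made up for this illustration, and a real engine would run the phases in parallel across the nodes of the cluster.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_words(line: str) -> Iterator[Tuple[str, int]]:
    """Map phase: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle phase: group all emitted values by their key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(word: str, counts: List[int]) -> Tuple[str, int]:
    """Reduce phase: collapse the values of one key into a single total."""
    return word, sum(counts)


def run_job(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce sequentially (hypothetical driver)."""
    mapped = (pair for line in lines for pair in map_words(line))
    grouped = shuffle(mapped)
    return dict(reduce_counts(word, counts) for word, counts in grouped.items())


if __name__ == "__main__":
    print(run_job(["Hadoop stores data", "Hadoop processes data"]))
```

The map and reduce functions are the only job-specific parts; grouping by key is what the framework does between them.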
====Hadoop-related projects====
  • Lucene is a text search engine library written in Java
  • HBase is the Hadoop database
  • Hive provides data warehousing tools to extract, transform, and load (ETL) data, and to query that data stored in Hadoop files
  • Pig is a high-level language that generates MapReduce code to analyze large data sets
  • Spark is a cluster computing framework
  • ZooKeeper is a centralized configuration service and naming registry for large distributed systems
  • Ambari manages and monitors Hadoop clusters through an intuitive web UI
  • Avro is a data serialization system
  • UIMA is an architecture for the development, discovery, composition, and deployment of analytics for unstructured data
  • YARN is a large-scale operating system for big data applications
  • MapReduce is a software framework for easily writing applications that process vast amounts of data
  • Nutch is a highly extensible, highly scalable web crawler. Nutch 1.x enables fine-grained configuration and relies on Apache Hadoop data structures, which are well suited to batch processing
  • Jaql is primarily a query language for JavaScript Object Notation (JSON)
  • [[https://ibm-open-platform.ibm.com/biginsights/download-pages/download-qse-vm/?&S_TACT=M1610EPW|IBM BigInsights Quick Start Edition]]
===HDFS===
A file system layered on top of the operating system's existing file systems, designed to tolerate component failures and to hold very large files. Small files mean slow access. Seeking is an expensive operation. Designed for streaming or sequentially read data. A Hadoop block is a file on disk of a fixed size (for example 128 MB).\\
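The block arithmetic can be sketched in a few lines. This is an illustration only, assuming the 128 MB block size from the example above and assuming the last block holds just the remaining bytes; `block_count` and `block_sizes` are hypothetical helpers, not part of any Hadoop API.

```python
from typing import List

MB = 1024 * 1024
BLOCK_SIZE = 128 * MB  # assumption: the 128 MB example size from the text


def block_count(file_size: int, block_size: int = BLOCK_SIZE) -> int:
    """Number of blocks needed for a file of file_size bytes."""
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    # Integer ceiling division, avoids floating point rounding.
    return -(-file_size // block_size)


def block_sizes(file_size: int, block_size: int = BLOCK_SIZE) -> List[int]:
    """Size of each block; the last one holds only the remaining bytes."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


if __name__ == "__main__":
    # One 1000 MB file versus the same data as 1000 files of 1 MB each.
    print(block_count(1000 * MB))
    print(sum(block_count(1 * MB) for _ in range(1000)))
    print([size // MB for size in block_sizes(300 * MB)])
```

Under these assumptions the same 1000 MB of data needs 8 blocks as one file but 1000 blocks as small files, which is one way to see why many small files are slow to access.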