Role and Responsibilities

I have seen many Hadoop users on the internet searching for Big Data architect roles and responsibilities, so here I have tried to help them by listing most of the key points. These are the main tasks an architect needs to perform, and the skills one should have, to become a Big Data Architect. Should be able to Design, …

Large datasets open to public

Cross-disciplinary data repositories, data collections and data search engines:
https://www.kaggle.com/datasets
http://www.assetmacro.com
http://usgovxml.com
http://aws.amazon.com/datasets
http://databib.org
http://datacite.org
http://figshare.com
http://linkeddata.org
http://reddit.com/r/datasets
http://thewebminer.com/
http://thedatahub.org alias http://ckan.net
http://quandl.com
Social Network Analysis Interactive Dataset Library (Social Network Datasets)
Datasets for Data Mining
http://enigma.io
http://www.ufindthem.com/
http://NetworkRepository.com – The First Interactive Network Data Repository
http://MLvis.com
Open Data Inception – A Comprehensive List of …

HDFS VS HBase

The table below shows the differences between HDFS and HBase.

HDFS | HBase
HDFS is a distributed file system suitable for storing large files. | HBase is a NoSQL database built on top of HDFS.
It doesn’t support fast individual record lookups. | It provides fast lookups for large tables.
It provides high-latency batch processing. | It internally …

Big Data Frameworks every programmer should know

Introduction

Big Data is a major buzzword at the current technological forefront. Big Data technologies have helped carry cutting-edge research into practical applications; Machine Learning and Analytics are one such example. Prior to the adoption of Big Data technologies, Artificial Intelligence and Machine Learning were limited to academic research. But …

Hive Optimization Techniques in Hadoop 2.x

Enable the properties below in Hive SQL for large volumes of data:

SET hive.execution.engine=tez;
SET mapreduce.framework.name=yarn-tez;
SET tez.queue.name=SIU;
SET hive.vectorized.execution.enabled=true;
SET hive.auto.convert.join=true;
SET hive.compute.query.using.stats=true;
SET hive.stats.fetch.column.stats=true;
SET hive.stats.fetch.partition.stats=true;
SET hive.cbo.enable=true;
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.parallel=true;
SET hive.exec.mode.local.auto=true;
SET hive.exec.reducers.bytes.per.reducer=1000000000; (Depends on your …
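
As a rough sketch of how session settings like these might be kept in one place and prepended to a query, the Python snippet below builds a Hive script from a subset of the properties listed above. The helper name `with_settings`, the `HIVE_SETTINGS` dictionary, and the `sales` table are assumptions made for illustration; they are not part of Hive or of the original post, and the right set of properties depends on your cluster.

```python
# Hypothetical helper: the property names and values are taken from the list
# above; the function name, dictionary name and example table are made up.
HIVE_SETTINGS = {
    "hive.execution.engine": "tez",
    "hive.vectorized.execution.enabled": "true",
    "hive.auto.convert.join": "true",
    "hive.cbo.enable": "true",
    "hive.exec.dynamic.partition": "true",
    "hive.exec.dynamic.partition.mode": "nonstrict",
    "hive.exec.parallel": "true",
}


def with_settings(query, settings=HIVE_SETTINGS):
    """Return a Hive script: one SET line per property, then the query."""
    lines = [f"SET {key}={value};" for key, value in settings.items()]
    # Normalise the query so it ends with exactly one semicolon.
    body = query.strip().rstrip(";") + ";"
    return "\n".join(lines + [body])


# 'sales' is a placeholder table name used only for this example.
script = with_settings("SELECT COUNT(*) FROM sales")
print(script)
```

The resulting text could be saved to a file and run with whichever Hive client you already use. Keeping the settings in a dictionary makes it easier to adjust them per workload instead of pasting the same SET lines into every script.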