What is the best practice for indexing HDFS data into Solr using Hive?

Answer: The right approach depends on your requirements, especially how often the data gets updated, the data volume, and the overall architecture.

Assuming the data is stored in a partitioned table in Hive, the common approaches are:
  • Run a MapReduce job that indexes the data using SolrJ.
  • Build Lucene indexes in a MapReduce job and copy them to the appropriate Solr shards.
  • Use the HBase Indexer to populate Solr.
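The first two options end with per-shard Lucene indexes, so each record must be routed to the right shard by hashing its document id. A minimal sketch of that idea (note: Solr's compositeId router actually uses MurmurHash3 over hash ranges; CRC32 below is only a stdlib stand-in to show the routing logic):

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Sketch: assign a document id to one of N shards by hashing it.
// Solr's real compositeId router uses MurmurHash3 and hash ranges;
// CRC32 here is just a standard-library stand-in for illustration.
public class ShardRouter {
    public static int shardFor(String docId, int numShards) {
        CRC32 crc = new CRC32();
        crc.update(docId.getBytes(StandardCharsets.UTF_8));
        // Map the 32-bit hash onto [0, numShards)
        return (int) (crc.getValue() % numShards);
    }

    public static void main(String[] args) {
        // Hypothetical document id, not from any real dataset
        System.out.println("row-000042 -> shard " + shardFor("row-000042", 4));
    }
}
```

The key property is determinism: the same id always maps to the same shard, so an index built offline by a MapReduce job lands on the same shard a live SolrJ update would have chosen.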

Properly Size Index:

  • Understanding what to index typically requires deep business-domain expertise on the data.
  • That understanding yields a better indexing plan and more accurate search results.
  • Not all data will be indexed; when an organization receives new data, it needs to be classified until it is understood what value it brings to the business.
  • This implies that data may need to be re-indexed later, so it is good practice to keep the raw data in low-cost storage, often HDFS or cloud object storage.
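The re-indexing point above can be sketched in code: because the raw records stay in cheap storage, a schema change only requires re-running a parsing pass over them, not re-collecting the data. A minimal sketch, assuming tab-separated raw records; the field names are hypothetical, not from any real schema:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch: rebuild index documents from raw tab-separated records.
// Keeping the raw data in HDFS or object storage means re-indexing
// is just another pass like this one over the original records.
public class Reindexer {
    // Illustrative field names only (hypothetical schema).
    static final String[] FIELDS = {"id", "title", "body"};

    public static List<Map<String, String>> rebuild(List<String> rawLines) {
        List<Map<String, String>> docs = new ArrayList<>();
        for (String line : rawLines) {
            String[] cols = line.split("\t", -1);
            if (cols.length != FIELDS.length) continue; // skip malformed rows
            Map<String, String> doc = new LinkedHashMap<>();
            for (int i = 0; i < FIELDS.length; i++) {
                doc.put(FIELDS[i], cols[i]);
            }
            docs.add(doc);
        }
        return docs;
    }

    public static void main(String[] args) {
        List<Map<String, String>> docs =
            rebuild(List.of("1\tHello\tfirst body", "malformed-row"));
        System.out.println(docs.size() + " document(s) rebuilt");
    }
}
```

In a real pipeline the rebuilt documents would be handed to SolrJ (or written as Lucene indexes) as described above; the point here is only that the raw records remain the source of truth.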