Best Practices for Indexing HDFS Data into Solr Using Hive

Assuming the data lives in a partitioned Hive table, the right approach depends on how the data is typically updated, its volume, and your architecture. Common options:
  • Run a MapReduce job that indexes the data using SolrJ.
  • Build Lucene indexes in a MapReduce job and copy them to the appropriate shards.
  • Use the HBase Indexer to populate Solr.
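In the SolrJ/MapReduce approach above, each task typically turns a batch of Hive rows into Solr documents and posts them to Solr's /update handler. A minimal Python sketch of that row-to-document step (the column names are hypothetical, and a real job would send the payload with SolrJ or an HTTP client rather than just building it):

```python
import json

# Hypothetical column list for a partitioned Hive table of events;
# these names are illustrative, not from the original post.
HIVE_COLUMNS = ["event_id", "event_time", "user_id", "message"]

def rows_to_solr_docs(rows):
    """Convert Hive result rows (tuples) into Solr document dicts."""
    return [dict(zip(HIVE_COLUMNS, row)) for row in rows]

def solr_update_payload(rows):
    """Serialize documents as the JSON body for Solr's /update handler."""
    return json.dumps(rows_to_solr_docs(rows))

rows = [
    ("e1", "2023-01-01T00:00:00Z", "u42", "login"),
    ("e2", "2023-01-01T00:01:00Z", "u43", "logout"),
]
payload = solr_update_payload(rows)
docs = json.loads(payload)
print(len(docs))           # 2
print(docs[0]["user_id"])  # u42
```

In a MapReduce job, the serialized payload would be POSTed per batch to keep memory bounded, rather than accumulating all rows first.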

Size the Index Properly:

  • Understanding what to index typically requires deep business-domain expertise about the data.
  • That understanding yields a better indexing plan and more accurate search results.
  • Not all data needs to be indexed, but as an organization receives new data, it must be classified until it is understood what value it brings to the business.
  • This implies that data may need to be re-indexed later, so it is good practice to store the raw data somewhere low-cost, often HDFS or cloud object storage.
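The points above amount to a simple pattern: keep the full raw record in low-cost storage, and index only a curated whitelist of fields that the business has decided are worth searching. A small sketch (the field names and the whitelist are illustrative assumptions):

```python
# Only the fields the business has classified as search-worthy get indexed;
# the full raw record stays in HDFS/object storage for later re-indexing.
# INDEXED_FIELDS and all field names below are hypothetical.
INDEXED_FIELDS = {"doc_id", "title", "body"}

def to_index_doc(raw_record):
    """Project a raw record down to the whitelisted, indexable fields."""
    return {k: v for k, v in raw_record.items() if k in INDEXED_FIELDS}

raw = {
    "doc_id": "d1",
    "title": "Quarterly report",
    "body": "Revenue was flat.",
    "ingest_host": "node-07",  # operational detail, never searched
    "raw_bytes": 10482,        # kept only in raw storage
}
print(sorted(to_index_doc(raw)))  # ['body', 'doc_id', 'title']
```

If the business later decides another field matters, you extend the whitelist and re-index from the raw copy, which is exactly why keeping that copy cheap and complete pays off.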
