{"id":33421,"date":"2020-03-17T13:47:59","date_gmt":"2020-03-17T08:17:59","guid":{"rendered":"https:\/\/www.wikitechy.com\/technology\/?p=33421"},"modified":"2020-03-17T13:47:59","modified_gmt":"2020-03-17T08:17:59","slug":"how-to-migrate-from-elasticsearch-1-7-to-6-8-with-zero-downtime","status":"publish","type":"post","link":"https:\/\/www.wikitechy.com\/technology\/how-to-migrate-from-elasticsearch-1-7-to-6-8-with-zero-downtime\/","title":{"rendered":"How to migrate from Elasticsearch 1.7 to 6.8 with zero downtime"},"content":{"rendered":"<p>My last task at BigPanda was to upgrade an existing service that was using Elasticsearch version 1.7 to a newer Elasticsearch version, 6.8.1.<\/p>\n<p>In this post, I will share how we migrated from Elasticsearch 1.7 to 6.8 with harsh constraints like zero downtime, no data loss, and zero bugs. I&#8217;ll also provide you with a script that does the migration for you.<\/p>\n<p>This post contains 6 chapters (and one is optional):<\/p>\n<p><strong>What\u2019s in it for me?<\/strong> &#8211;&gt; What were the new features that led us to upgrade our version?<\/p>\n<p><strong>The constraints<\/strong> &#8211;&gt; What were our business requirements?<\/p>\n<p><strong>Problem solving<\/strong> &#8211;&gt; How did we address the constraints?<\/p>\n<p><strong>Moving forward<\/strong> &#8211;&gt; The plan.<\/p>\n<p>[Optional chapter] &#8211;&gt; How did we handle the infamous mapping explosion problem?<\/p>\n<p><strong><span style=\"color: #003300;\">Finally<\/span><\/strong> &#8211;&gt; How to do data migration between clusters.<\/p>\n<h2 id=\"chapter-1-whats-in-it-for-me\">Chapter 1 \u2014 What\u2019s in it for me?<\/h2>\n<p>What benefits were we expecting to unlock by upgrading our data store?<\/p>\n<p>There were a few reasons:<\/p>\n<ul>\n<li><strong>Performance and stability issues<\/strong> \u2014 We were experiencing an enormous number of outages with long MTTR that caused us tons of headaches. 
This was reflected in frequent high latencies, high CPU usage, and other issues.<\/li>\n<li><strong>Non-existent support in old Elasticsearch versions<\/strong> \u2014 We were missing some operational knowledge of Elasticsearch, and when we looked for outside consulting we were encouraged to migrate forward in order to receive support.<\/li>\n<li><strong>Dynamic mappings in our schema<\/strong> \u2014 Our current schema in Elasticsearch 1.7 used a feature called dynamic mappings that made our cluster explode multiple times. So we wanted to deal with this issue.<\/li>\n<li><strong>Poor visibility on our existing cluster<\/strong> \u2014 We wanted a better view under the hood and saw that later versions had great metrics exporting tools.<\/li>\n<\/ul>\n<h2 id=\"chapter-2-the-constraints\">Chapter 2 \u2014 The constraints<\/h2>\n<ul>\n<li><strong>ZERO downtime migration<\/strong> \u2014 We have active users on our system, and we couldn&#8217;t afford for the system to be down while we were migrating.<\/li>\n<li><strong>Recovery plan<\/strong> \u2014 We couldn&#8217;t afford to \u201close\u201d or \u201ccorrupt\u201d data, whatever the cost. So we needed to prepare a recovery plan in case our migration failed.<\/li>\n<li><strong>Zero bugs<\/strong> \u2014 We couldn&#8217;t change existing search functionality for end-users.<\/li>\n<\/ul>\n<p>&nbsp;<\/p>\n<h2 id=\"chapter-3-problem-solving-and-thinking-of-a-plan\">Chapter 3 \u2014 Problem solving and thinking of a plan<\/h2>\n<p>Let\u2019s tackle the constraints from the easiest to the most difficult:<\/p>\n<h3 id=\"zero-bugs\">Zero bugs<\/h3>\n<p>In order to deal with this requirement, I studied all the possible requests the service receives and what its outputs were. 
Then I added unit tests where needed.<\/p>\n<p>In addition, I added multiple metrics (to the\u00a0<strong>Elasticsearch<\/strong>\u00a0<strong>Indexer<\/strong> and the <strong>new Elasticsearch Indexer<\/strong>) to track latency, throughput, and performance, which allowed me to validate that we only improved them.<\/p>\n<h3 id=\"recovery-plan\">Recovery plan<\/h3>\n<p>This means that I needed to deal with the following situation: I deployed the new code to production and things weren&#8217;t working as expected. What can I do about it then?<\/p>\n<p>Since I was working on a service that used event-sourcing, I could add another listener (diagram attached below) and start writing to a new Elasticsearch cluster without affecting production status.<\/p>\n<h3 id=\"zero-downtime-migration\">Zero downtime migration<\/h3>\n<p>The current service is in live mode and can&#8217;t be \u201cdeactivated\u201d for periods longer than 5\u201310 minutes. The trick to getting this right is this:<\/p>\n<ul>\n<li>Store a log of all the actions your service is handling (we use Kafka in production)<\/li>\n<li>Start the migration process offline (and keep track of the offset before you started the migration)<\/li>\n<li>When the migration ends, start the new service against the log with the recorded offset and catch up on the lag<\/li>\n<li>When the lag is closed, change your frontend to query against the new service and you&#8217;re done<\/li>\n<\/ul>\n<h2 id=\"chapter-4-the-plan\">Chapter 4 \u2014 The plan<\/h2>\n<p>Our current service uses the following architecture (based on message passing in Kafka):<\/p>\n<p><img fetchpriority=\"high\" decoding=\"async\" class=\"aligncenter size-full wp-image-33461\" src=\"https:\/\/www.wikitechy.com\/technology\/wp-content\/uploads\/2020\/03\/indxr2-1.jpeg\" alt=\"\" width=\"700\" height=\"140\" srcset=\"https:\/\/www.wikitechy.com\/technology\/wp-content\/uploads\/2020\/03\/indxr2-1.jpeg 700w, 
https:\/\/www.wikitechy.com\/technology\/wp-content\/uploads\/2020\/03\/indxr2-1-300x60.jpeg 300w\" sizes=\"(max-width: 700px) 100vw, 700px\" \/><\/p>\n<p>&nbsp;<\/p>\n<ul>\n<li><strong>Event topic<\/strong> contains events produced by other applications (for example, <strong>UserId 3 created<\/strong>)<\/li>\n<li><strong>Command topic<\/strong> contains the translation of those events into specific commands used by this application (for example: <strong>Add userId 3<\/strong>)<\/li>\n<li>Elasticsearch 1.7 \u2014 The datastore of the <strong>command topic<\/strong> read by the Elasticsearch Indexer.<\/li>\n<\/ul>\n<p>We planned to add another consumer <strong>(new Elasticsearch Indexer)<\/strong> to the <strong>command topic<\/strong>, which would read the exact same messages and write them in parallel to Elasticsearch 6.8.<\/p>\n<p><img decoding=\"async\" class=\"aligncenter size-full wp-image-33467\" src=\"https:\/\/www.wikitechy.com\/technology\/wp-content\/uploads\/2020\/03\/indxr.jpeg\" alt=\"\" width=\"701\" height=\"341\" srcset=\"https:\/\/www.wikitechy.com\/technology\/wp-content\/uploads\/2020\/03\/indxr.jpeg 701w, https:\/\/www.wikitechy.com\/technology\/wp-content\/uploads\/2020\/03\/indxr-300x146.jpeg 300w\" sizes=\"(max-width: 701px) 100vw, 701px\" \/><\/p>\n<p>&nbsp;<\/p>\n<h3 id=\"where-should-i-start\"><strong>Where should I start?<\/strong><\/h3>\n<p>To be honest, I considered myself a newbie Elasticsearch user. To feel confident performing this task, I had to think about the best way to approach this subject and learn it. A few things that helped were:<\/p>\n<ul>\n<li><strong>Documentation<\/strong> \u2014 It\u2019s an insanely useful resource for everything Elasticsearch. Take the time to read it and take notes (don\u2019t miss: Mapping and Query DSL).<\/li>\n<li><strong>HTTP API<\/strong> \u2014 everything under the CAT API. 
This was super useful to debug things locally and see how Elasticsearch responds (don\u2019t miss: cluster health, cat indices, search, delete index).<\/li>\n<li><strong>Metrics (\u2764\ufe0f)<\/strong> \u2014 From the first day, we configured a shiny new dashboard with lots of cool metrics (taken from elasticsearch-exporter-for-Prometheus) that helped and pushed us to understand more about Elasticsearch.<\/li>\n<\/ul>\n<h2 id=\"the-code\">The code<\/h2>\n<p>Our codebase was using a library called <span style=\"color: #ff0000;\">elastic4s<\/span> and was using the oldest release available in the library \u2014 a really good reason to migrate! So the first thing to do was just to migrate versions and see what broke.<\/p>\n<p>There are a few tactics for doing this code migration. The tactic we chose was to try to restore existing functionality first in the new Elasticsearch version without rewriting all the code from scratch. In other words, to reach existing functionality but on a newer version of Elasticsearch.<\/p>\n<p>Luckily for us, the code already had almost full test coverage, so our task was much simpler and took around 2 weeks of development time.<\/p>\n<p><img decoding=\"async\" class=\"aligncenter size-full wp-image-33469\" src=\"https:\/\/www.wikitechy.com\/technology\/wp-content\/uploads\/2020\/03\/you_need_some_tests_yo.jpg\" alt=\"\" width=\"400\" height=\"300\" srcset=\"https:\/\/www.wikitechy.com\/technology\/wp-content\/uploads\/2020\/03\/you_need_some_tests_yo.jpg 400w, https:\/\/www.wikitechy.com\/technology\/wp-content\/uploads\/2020\/03\/you_need_some_tests_yo-300x225.jpg 300w, https:\/\/www.wikitechy.com\/technology\/wp-content\/uploads\/2020\/03\/you_need_some_tests_yo-74x55.jpg 74w, https:\/\/www.wikitechy.com\/technology\/wp-content\/uploads\/2020\/03\/you_need_some_tests_yo-111x83.jpg 111w, 
https:\/\/www.wikitechy.com\/technology\/wp-content\/uploads\/2020\/03\/you_need_some_tests_yo-215x161.jpg 215w\" sizes=\"(max-width: 400px) 100vw, 400px\" \/><\/p>\n<p><em>It&#8217;s important to note that, if that wasn&#8217;t the case, we would have had to invest some time in filling that coverage up. Only then would we be able to migrate since one of our constraints was to not break existing functionality.<\/em><\/p>\n<h2 id=\"chapter-5-the-mapping-explosion-problem\">Chapter 5 \u2014 The mapping explosion problem<\/h2>\n<p>Let\u2019s describe our use-case in more detail. This is our model:<\/p>\n<div class=\"code-embed-wrapper\"> <div class=\"code-embed-infos\"> <\/div> <pre class=\"language-python code-embed-pre line-numbers\"  data-start=\"1\" data-line-offset=\"0\"><code class=\"language-python code-embed-code\">class InsertMessageCommand(tags: Map[String,String])<\/code><\/pre> <\/div>\n<p style=\"text-align: justify;\">And for example, an instance of this message would be:<\/p>\n<div class=\"code-embed-wrapper\"> <div class=\"code-embed-infos\"> <\/div> <pre class=\"language-python code-embed-pre line-numbers\"  data-start=\"1\" data-line-offset=\"0\"><code class=\"language-python code-embed-code\">new InsertMessageCommand(Map(&quot;name&quot;-&gt;&quot;dor&quot;,&quot;lastName&quot;-&gt;&quot;sever&quot;))<\/code><\/pre> <\/div>\n<p style=\"text-align: justify;\">And given this model, we needed to support the following query requirements:<\/p>\n<ol style=\"text-align: justify;\">\n<li>Query by value<\/li>\n<li>Query by tag name and value<\/li>\n<\/ol>\n<p style=\"text-align: justify;\">The way this was modeled in our Elasticsearch 1.7 schema was using a dynamic template schema (since the tag keys are dynamic and cannot be modeled in advance).<\/p>\n<p style=\"text-align: justify;\">The dynamic template caused us multiple outages due to the mapping explosion problem, and the schema looked like this:<\/p>\n<div class=\"code-embed-wrapper\"> <div 
class=\"code-embed-infos\"> <\/div> <pre class=\"language-python code-embed-pre line-numbers\"  data-start=\"1\" data-line-offset=\"0\"><code class=\"language-python code-embed-code\">curl -X PUT &quot;localhost:9200\/_template\/my_template?pretty&quot; -H &#039;Content-Type: application\/json&#039; -d &#039;<br\/>{<br\/>    &quot;index_patterns&quot;: [<br\/>        &quot;your-index-names*&quot;<br\/>    ],<br\/>    &quot;mappings&quot;: {<br\/>            &quot;_doc&quot;: {<br\/>                &quot;dynamic_templates&quot;: [<br\/>                    {<br\/>                        &quot;tags&quot;: {<br\/>                            &quot;mapping&quot;: {<br\/>                                &quot;type&quot;: &quot;text&quot;<br\/>                            },<br\/>                            &quot;path_match&quot;: &quot;actions.tags.*&quot;<br\/>                        }<br\/>                    }<br\/>                ]<br\/>            }<br\/>        },<br\/>    &quot;aliases&quot;: {}<br\/>}&#039;  <br\/><br\/>curl -X PUT &quot;localhost:9200\/your-index-names-1\/_doc\/1?pretty&quot; -H &#039;Content-Type: application\/json&#039; -d&#039;<br\/>{<br\/>  &quot;actions&quot;: {<br\/>    &quot;tags&quot; : {<br\/>        &quot;name&quot;: &quot;John&quot;,<br\/>        &quot;lname&quot; : &quot;Smith&quot;<br\/>    }<br\/>  }<br\/>}<br\/>&#039;<br\/><br\/>curl -X PUT &quot;localhost:9200\/your-index-names-1\/_doc\/2?pretty&quot; -H &#039;Content-Type: application\/json&#039; -d&#039;<br\/>{<br\/>  &quot;actions&quot;: {<br\/>    &quot;tags&quot; : {<br\/>        &quot;name&quot;: &quot;Dor&quot;,<br\/>        &quot;lname&quot; : &quot;Sever&quot;<br\/>  }<br\/>}<br\/>}<br\/>&#039;<br\/><br\/>curl -X PUT &quot;localhost:9200\/your-index-names-1\/_doc\/3?pretty&quot; -H &#039;Content-Type: application\/json&#039; -d&#039;<br\/>{<br\/>  &quot;actions&quot;: {<br\/>    &quot;tags&quot; : {<br\/>        &quot;name&quot;: &quot;AnotherName&quot;,<br\/>        
&quot;lname&quot; : &quot;AnotherLastName&quot;<br\/>  }<br\/>}<br\/>}<br\/>&#039;<\/code><\/pre> <\/div>\n<div class=\"code-embed-wrapper\"> <div class=\"code-embed-infos\"> <\/div> <pre class=\"language-python code-embed-pre line-numbers\"  data-start=\"1\" data-line-offset=\"0\"><code class=\"language-python code-embed-code\">curl -X GET &quot;localhost:9200\/_search?pretty&quot; -H &#039;Content-Type: application\/json&#039; -d&#039;<br\/>{<br\/>    &quot;query&quot;: {<br\/>        &quot;match&quot; : {<br\/>            &quot;actions.tags.name&quot; : {<br\/>                &quot;query&quot; : &quot;John&quot;<br\/>            }<br\/>        }<br\/>    }<br\/>}<br\/>&#039;<br\/># returns 1 match (doc 1)<br\/><br\/><br\/>curl -X GET &quot;localhost:9200\/_search?pretty&quot; -H &#039;Content-Type: application\/json&#039; -d&#039;<br\/>{<br\/>    &quot;query&quot;: {<br\/>        &quot;match&quot; : {<br\/>            &quot;actions.tags.lname&quot; : {<br\/>                &quot;query&quot; : &quot;John&quot;<br\/>            }<br\/>        }<br\/>    }<br\/>}<br\/>&#039;<br\/># returns zero matches<br\/><br\/># search by value<br\/>curl -X GET &quot;localhost:9200\/_search?pretty&quot; -H &#039;Content-Type: application\/json&#039; -d&#039;<br\/>{<br\/>    &quot;query&quot;: {<br\/>        &quot;query_string&quot; : {<br\/>            &quot;fields&quot;: [&quot;actions.tags.*&quot; ],<br\/>            &quot;query&quot; : &quot;Dor&quot;<br\/>        }<br\/>    }<br\/>}<br\/>&#039;<\/code><\/pre> <\/div>\n<h2 id=\"nested-documents-solution\" style=\"text-align: justify;\">Nested documents solution<\/h2>\n<p style=\"text-align: justify;\">Our first instinct in solving the mapping explosion problem was to use nested documents.<\/p>\n<p style=\"text-align: justify;\">We read the nested data type tutorial in the Elastic docs and defined the following schema and queries:<\/p>\n<div class=\"code-embed-wrapper\"> <div class=\"code-embed-infos\"> <\/div> <pre 
class=\"language-python code-embed-pre line-numbers\"  data-start=\"1\" data-line-offset=\"0\"><code class=\"language-python code-embed-code\">curl -X PUT &quot;localhost:9200\/my_index?pretty&quot; -H &#039;Content-Type: application\/json&#039; -d&#039;<br\/>{<br\/>        &quot;mappings&quot;: {<br\/>            &quot;_doc&quot;: {<br\/>            &quot;properties&quot;: {<br\/>            &quot;tags&quot;: {<br\/>                &quot;type&quot;: &quot;nested&quot; <br\/>                }                <br\/>            }<br\/>        }<br\/>        }<br\/>}<br\/>&#039;<br\/><br\/>curl -X PUT &quot;localhost:9200\/my_index\/_doc\/1?pretty&quot; -H &#039;Content-Type: application\/json&#039; -d&#039;<br\/>{<br\/>  &quot;tags&quot; : [<br\/>    {<br\/>      &quot;key&quot; : &quot;John&quot;,<br\/>      &quot;value&quot; :  &quot;Smith&quot;<br\/>    },<br\/>    {<br\/>      &quot;key&quot; : &quot;Alice&quot;,<br\/>      &quot;value&quot; :  &quot;White&quot;<br\/>    }<br\/>  ]<br\/>}<br\/>&#039;<br\/><br\/><br\/># Query by tag key and value<br\/>curl -X GET &quot;localhost:9200\/my_index\/_search?pretty&quot; -H &#039;Content-Type: application\/json&#039; -d&#039;<br\/>{<br\/>  &quot;query&quot;: {<br\/>    &quot;nested&quot;: {<br\/>      &quot;path&quot;: &quot;tags&quot;,<br\/>      &quot;query&quot;: {<br\/>        &quot;bool&quot;: {<br\/>          &quot;must&quot;: [<br\/>            { &quot;match&quot;: { &quot;tags.key&quot;: &quot;Alice&quot; }},<br\/>            { &quot;match&quot;: { &quot;tags.value&quot;:  &quot;White&quot; }} <br\/>          ]<br\/>        }<br\/>      }<br\/>    }<br\/>  }<br\/>}<br\/>&#039;<br\/><br\/># Returns 1 document<br\/><br\/><br\/>curl -X GET &quot;localhost:9200\/my_index\/_search?pretty&quot; -H &#039;Content-Type: application\/json&#039; -d&#039;<br\/>{<br\/>  &quot;query&quot;: {<br\/>    &quot;nested&quot;: {<br\/>      &quot;path&quot;: &quot;tags&quot;,<br\/>      &quot;query&quot;: {<br\/>        
&quot;bool&quot;: {<br\/>          &quot;must&quot;: [<br\/>            { &quot;match&quot;: { &quot;tags.value&quot;:  &quot;Smith&quot; }} <br\/>          ]<br\/>        }<br\/>      }<br\/>    }<br\/>  }<br\/>}<br\/>&#039;<br\/><br\/># Query by tag value<br\/># Returns 1 result<\/code><\/pre> <\/div>\n<p style=\"text-align: justify;\">And this solution worked. However, when we tried to insert real customer data we saw that the number of documents in our index increased by around 500 times.<\/p>\n<p style=\"text-align: justify;\">We considered the following problems and went on to find a better solution:<\/p>\n<p style=\"text-align: justify;\">The number of documents we had in our cluster was around 500 million. This meant that, with the new schema, we were going to reach two hundred fifty billion documents (that\u2019s 250,000,000,000 documents \ud83d\ude31).<\/p>\n<p style=\"text-align: justify;\">We read this blog post \u2014 https:\/\/blog.gojekengineering.com\/elasticsearch-the-trouble-with-nested-documents-e97b33b46194 which highlights that nested documents can cause high latency in queries and heap usage problems.<\/p>\n<p style=\"text-align: justify;\">Testing \u2014 Since we were converting 1 document in the old cluster to an unknown number of documents in the new cluster, it would be much harder to track whether the migration process worked without any data loss. If our conversion was 1:1, we could assert that the count in the old cluster equalled the count in the new cluster.<\/p>\n<h2 id=\"avoiding-nested-documents\" style=\"text-align: justify;\">Avoiding nested documents<\/h2>\n<p style=\"text-align: justify;\">The real trick here was to focus on which supported queries we were running: search by tag value, and search by tag key and value.<\/p>\n<p style=\"text-align: justify;\">The first query doesn&#8217;t require nested documents since it works on a single field. 
For the latter, we used the following trick. We created a field that contains the combination of the key and the value. Whenever a user queries on a key and value match, we translate their request to the corresponding text and query against that field.<\/p>\n<p style=\"text-align: justify;\">Example:<\/p>\n<div class=\"code-embed-wrapper\"> <div class=\"code-embed-infos\"> <\/div> <pre class=\"language-python code-embed-pre line-numbers\"  data-start=\"1\" data-line-offset=\"0\"><code class=\"language-python code-embed-code\">curl -X PUT &quot;localhost:9200\/my_index_2?pretty&quot; -H &#039;Content-Type: application\/json&#039; -d&#039;<br\/>{<br\/>    &quot;mappings&quot;: {<br\/>        &quot;_doc&quot;: {<br\/>            &quot;properties&quot;: {<br\/>                &quot;tags&quot;: {<br\/>                    &quot;type&quot;: &quot;object&quot;,<br\/>                    &quot;properties&quot;: {<br\/>                        &quot;keyToValue&quot;: {<br\/>                            &quot;type&quot;: &quot;keyword&quot;<br\/>                        },<br\/>                        &quot;value&quot;: {<br\/>                            &quot;type&quot;: &quot;keyword&quot;<br\/>                        }<br\/>                    }<br\/>                }<br\/>            }<br\/>        }<br\/>    }<br\/>}<br\/>&#039;<br\/><br\/><br\/>curl -X PUT &quot;localhost:9200\/my_index_2\/_doc\/1?pretty&quot; -H &#039;Content-Type: application\/json&#039; -d&#039;<br\/>{<br\/>  &quot;tags&quot; : [<br\/>    {<br\/>      &quot;keyToValue&quot; : &quot;John:Smith&quot;,<br\/>      &quot;value&quot; : &quot;Smith&quot;<br\/>    },<br\/>    {<br\/>      &quot;keyToValue&quot; : &quot;Alice:White&quot;,<br\/>      &quot;value&quot; : &quot;White&quot;<br\/>    }<br\/>  ]<br\/>}<br\/>&#039;<br\/><br\/># Query by key and value<br\/># The user queries for key: Alice and value: White; we then send Elasticsearch this query:<br\/><br\/>curl -X GET 
&quot;localhost:9200\/my_index_2\/_search?pretty&quot; -H &#039;Content-Type: application\/json&#039; -d&#039;<br\/>{<br\/>  &quot;query&quot;: {<br\/>        &quot;bool&quot;: {<br\/>          &quot;must&quot;: [ { &quot;match&quot;: { &quot;tags.keyToValue&quot;: &quot;Alice:White&quot; }}]<br\/>  }}}<br\/>&#039;<br\/><br\/># Query by value only<br\/>curl -X GET &quot;localhost:9200\/my_index_2\/_search?pretty&quot; -H &#039;Content-Type: application\/json&#039; -d&#039;<br\/>{<br\/>  &quot;query&quot;: {<br\/>        &quot;bool&quot;: {<br\/>          &quot;must&quot;: [ { &quot;match&quot;: { &quot;tags.value&quot;: &quot;White&quot; }}]<br\/>  }}}<br\/>&#039;<\/code><\/pre> <\/div>\n<h2 id=\"chapter-6-the-migration-process\" style=\"text-align: justify;\">Chapter 6 \u2014 The migration process<\/h2>\n<p style=\"text-align: justify;\">We planned to migrate about 500 million documents with zero downtime. To do that we needed:<\/p>\n<ul style=\"text-align: justify;\">\n<li>A strategy for transferring data from the old Elasticsearch cluster to the new one<\/li>\n<li>A strategy for closing the lag between the start of the migration and the end of it<\/li>\n<\/ul>\n<p style=\"text-align: justify;\"><strong>And our two options in closing the lag:<\/strong><\/p>\n<ul style=\"text-align: justify;\">\n<li>Our messaging system is Kafka based. We could have just taken the current offset before the migration started, and after the migration ended, started consuming from that specific offset. 
This solution requires some manual tweaking of offsets and some other stuff, but will work.<\/li>\n<li>Another approach to solving this issue was to start consuming messages from the beginning of the topic in Kafka and make our actions on Elasticsearch idempotent \u2014 meaning, if the change was \u201capplied\u201d already, nothing would change in the Elastic store.<\/li>\n<\/ul>\n<p style=\"text-align: justify;\">The requests made by our service against Elastic were already idempotent, so we chose option 2 because it required zero manual work (no need to take specific offsets and then set them afterward in a new consumer group).<\/p>\n<h3 id=\"how-can-we-migrate-the-data\" style=\"text-align: justify;\">How can we migrate the data?<\/h3>\n<p style=\"text-align: justify;\">These were the options we thought of:<\/p>\n<ul style=\"text-align: justify;\">\n<li>If our Kafka contained all messages from the beginning of time, we could just replay from the start and the end state would be equal. But since we apply retention to our topics, this wasn&#8217;t an option.<\/li>\n<li>Dump messages to disk and then ingest them to Elastic directly \u2013 This solution looked kind of weird. Why store them on disk instead of just writing them directly to Elastic?<\/li>\n<li>Transfer messages from the old Elastic to the new Elastic \u2014 This meant writing some kind of \u201cscript\u201d (did anyone say Python? \ud83d\ude03) that connects to the old Elasticsearch cluster, queries for items, transforms them to the new schema, and indexes them in the new cluster.<\/li>\n<\/ul>\n<p style=\"text-align: justify;\"><strong>We chose the last option. These were the design choices we had in mind:<\/strong><\/p>\n<ul style=\"text-align: justify;\">\n<li>Let\u2019s not try to think about error handling unless we need to. Let\u2019s try to write something super simple, and if errors occur, let\u2019s try to address them. 
In the end, we didn&#8217;t need to address this issue since no errors occurred during the migration.<\/li>\n<li>It\u2019s a one-off operation, so whatever works first \/ KISS.<\/li>\n<li>Metrics \u2014 Since the migration process can take hours to days, we wanted the ability from day 1 to monitor the error count and to track the current progress and copy rate of the script.<\/li>\n<\/ul>\n<p style=\"text-align: justify;\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-33485\" src=\"https:\/\/www.wikitechy.com\/technology\/wp-content\/uploads\/2020\/03\/python.gif\" alt=\"\" width=\"480\" height=\"480\" \/><\/p>\n<p style=\"text-align: justify;\">We thought long and hard and chose Python as our weapon of choice. The final version of the code is below:<\/p>\n<div class=\"code-embed-wrapper\"> <div class=\"code-embed-infos\"> <\/div> <pre class=\"language-python code-embed-pre line-numbers\"  data-start=\"1\" data-line-offset=\"0\"><code class=\"language-python code-embed-code\">dictor==0.1.2 - to copy and transform our Elasticsearch documents<br\/>elasticsearch==1.9.0 - to connect to &quot;old&quot; Elasticsearch<br\/>elasticsearch6==6.4.2 - to connect to the &quot;new&quot; Elasticsearch<br\/>statsd==3.3.0 - to report metrics<\/code><\/pre> <\/div>\n<div class=\"code-embed-wrapper\"> <div class=\"code-embed-infos\"> <\/div> <pre class=\"language-python code-embed-pre line-numbers\"  data-start=\"1\" data-line-offset=\"0\"><code class=\"language-python code-embed-code\">from elasticsearch import Elasticsearch<br\/>from elasticsearch6 import Elasticsearch as Elasticsearch6<br\/>import logging<br\/>import sys<br\/>from elasticsearch.helpers import scan<br\/>from elasticsearch6.helpers import parallel_bulk<br\/>import statsd<br\/><br\/># Usage: script.py SOURCE_URL TARGET_URL SOURCE_INDEX TARGET_INDEX<br\/>ES_SOURCE = Elasticsearch(sys.argv[1])<br\/>ES_TARGET = Elasticsearch6(sys.argv[2])<br\/>INDEX_SOURCE = sys.argv[3]<br\/>INDEX_TARGET = sys.argv[4]<br\/>QUERY_MATCH_ALL = {&quot;query&quot;: {&quot;match_all&quot;: {}}}<br\/>SCAN_SIZE = 1000<br\/>SCAN_REQUEST_TIMEOUT = &#039;3m&#039;<br\/>REQUEST_TIMEOUT = 180<br\/>MAX_CHUNK_BYTES = 15 * 1024 * 1024<br\/>RAISE_ON_ERROR = False<br\/># Assumption: BULK_SIZE was not defined in the original script;<br\/># 500 is the default chunk_size of parallel_bulk.<br\/>BULK_SIZE = 500<br\/><br\/><br\/>def transform_item(item, index_target):<br\/>    # implement your transformation logic here<br\/>    transformed_source_doc = item.get(&quot;_source&quot;)<br\/>    return {&quot;_index&quot;: index_target,<br\/>            &quot;_type&quot;: &quot;_doc&quot;,<br\/>            &quot;_id&quot;: item[&#039;_id&#039;],<br\/>            &quot;_source&quot;: transformed_source_doc}<br\/><br\/><br\/>def transformedStream(es_source, match_query, index_source, index_target, transform_logic_func):<br\/>    # lazily scroll over the source index and yield one bulk action per document<br\/>    for item in scan(es_source, query=match_query, index=index_source, size=SCAN_SIZE,<br\/>                     timeout=SCAN_REQUEST_TIMEOUT):<br\/>        yield transform_logic_func(item, index_target)<br\/><br\/><br\/>def index_source_to_target(es_source, es_target, match_query, index_source, index_target, bulk_size, statsd_client,<br\/>                           logger, transform_logic_func):<br\/>    ok_count = 0<br\/>    fail_count = 0<br\/>    # report the total number of source documents so progress can be tracked<br\/>    count_response = es_source.count(index=index_source, body=match_query)<br\/>    count_result = count_response[&#039;count&#039;]<br\/>    statsd_client.gauge(stat=&#039;elastic_migration_document_total_count,index={0},type=success&#039;.format(index_target),<br\/>                        value=count_result)<br\/>    with statsd_client.timer(&#039;elastic_migration_time_ms,index={0}&#039;.format(index_target)):<br\/>        actions_stream = transformedStream(es_source, match_query, index_source, index_target, transform_logic_func)<br\/>        for (ok, item) in parallel_bulk(es_target,<br\/>                                        chunk_size=bulk_size,<br\/>                                        max_chunk_bytes=MAX_CHUNK_BYTES,<br\/>                                        actions=actions_stream,<br\/>                                        request_timeout=REQUEST_TIMEOUT,<br\/>                                        raise_on_error=RAISE_ON_ERROR):<br\/>            if not ok:<br\/>                logger.error(&quot;got error on index {} which is : {}&quot;.format(index_target, item))<br\/>                fail_count += 1<br\/>                statsd_client.incr(&#039;elastic_migration_document_count,index={0},type=failure&#039;.format(index_target),<br\/>                                   1)<br\/>            else:<br\/>                ok_count += 1<br\/>                statsd_client.incr(&#039;elastic_migration_document_count,index={0},type=success&#039;.format(index_target),<br\/>                                   1)<br\/><br\/>    return ok_count, fail_count<br\/><br\/><br\/># Assumption: the original call omitted the logger argument; a standard library logger is used here.<br\/>logger = logging.getLogger(__name__)<br\/>statsd_client = statsd.StatsClient(host=&#039;localhost&#039;, port=8125)<br\/><br\/>if __name__ == &quot;__main__&quot;:<br\/>    index_source_to_target(ES_SOURCE, ES_TARGET, QUERY_MATCH_ALL, INDEX_SOURCE, INDEX_TARGET, BULK_SIZE,<br\/>                           statsd_client, logger, transform_item)<\/code><\/pre> <\/div>\n<h2 id=\"conclusion\" style=\"text-align: justify;\">Conclusion<\/h2>\n<p style=\"text-align: justify;\">Migrating data in a live production system is a complicated task that requires a lot of attention and careful planning. I recommend taking the time to work through the steps listed above and figure out what works best for your needs.<\/p>\n<p style=\"text-align: justify;\">As a rule of thumb, always try to reduce your requirements as much as possible. For example, is a zero downtime migration required? 
Can you afford data loss?<\/p>\n<p style=\"text-align: justify;\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-33488\" src=\"https:\/\/www.wikitechy.com\/technology\/wp-content\/uploads\/2020\/03\/enjoy-the-ride.gif\" alt=\"\" width=\"480\" height=\"269\" \/><\/p>\n<p style=\"text-align: justify;\">Upgrading data stores is usually a marathon and not a sprint, so take a deep breath and try to enjoy the ride.<\/p>\n<ul>\n<li style=\"text-align: justify;\">The whole process listed above took me around 4 months of work<\/li>\n<li style=\"text-align: justify;\">All of the Elasticsearch examples that appear in this post were tested against version 6.8.1<\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>My last task at BigPanda was to upgrade an existing service that was using Elasticsearch version 1.7 to a newer Elasticsearch version, 6.8.1. In this post, I will share how we migrated from Elasticsearch 1.7 to 6.8 with harsh constraints like zero downtime, no data loss, and zero bugs. 
I&#8217;ll also [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":33501,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[4148],"tags":[86894,86761],"class_list":["post-33421","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-python","tag-migrate-from-elasticsearch","tag-python"],"_links":{"self":[{"href":"https:\/\/www.wikitechy.com\/technology\/wp-json\/wp\/v2\/posts\/33421","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.wikitechy.com\/technology\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.wikitechy.com\/technology\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.wikitechy.com\/technology\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.wikitechy.com\/technology\/wp-json\/wp\/v2\/comments?post=33421"}],"version-history":[{"count":0,"href":"https:\/\/www.wikitechy.com\/technology\/wp-json\/wp\/v2\/posts\/33421\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.wikitechy.com\/technology\/wp-json\/wp\/v2\/media\/33501"}],"wp:attachment":[{"href":"https:\/\/www.wikitechy.com\/technology\/wp-json\/wp\/v2\/media?parent=33421"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.wikitechy.com\/technology\/wp-json\/wp\/v2\/categories?post=33421"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.wikitechy.com\/technology\/wp-json\/wp\/v2\/tags?post=33421"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}